Riding the inevitable chaos wave
Software development can be incredibly complex. At times the magic of delivering software rivals the magic of the underlying technology itself. At Workiva we are working to ensure that we can all deliver high-quality software that drives value for our customers at an appropriate speed. A large part of this effort is identifying and addressing the risks associated with our development efforts. As we have worked on spreadsheets, two approaches have helped us ride the numerous chaos waves that hit us as one of the first teams to make new HTML5 and Docker-based systems available to customers in production: one, a focus on delaying commitments, and two, a focus on delivery.
Delaying commitments
Making decisions is hard because they can have long-term effects on the future of the product. Should we use XYZ language? What about our database? The risks associated with those choices can be incredibly difficult to define, manage, mitigate or even evaluate. What if you just didn’t make the choice? What if you made a smaller commitment now so you could make a stronger one later?
“Where are you going to store the data?”
The Datatables team formed in early 2014 to give full attention to the Datatables product. Early effort was focused on the calculation engine and the user interface capabilities. The important choice the team made was to focus on the calculation engine and service capabilities (like endpoints) instead of on where the data would be stored. This allowed them to build out the application and the business logic and to get the full cycle of creating and manipulating data working quickly. Since we were unsure of where the data would be stored (Datastore? SQL?), this made sense. We made a small commitment to a lower-level API, knowing we could come back and change things without having to rewrite the entire application.
The key strategy is acknowledging that a decision needs to be made and making incremental ones before fully committing. Yes, a database or datastore was needed at some point. Delaying allowed that decision to be carefully vetted in the context of a working calculation system. Even once IAPI (my current team) joined and the focus shifted to building out the real persistence layer, we maintained that separation, keeping the persistence layer an interchangeable piece so that different solutions for storing data were easier to evaluate. It was more work, but it gave us the advantage of being able to evaluate our system as a whole. We abstracted the query, write, and byte storage systems through interfaces.
As we continued to build out the product we kept a strong isolated process approach. Need to write a new service layer? Easy, write the handlers and wire them up with the orchestrator (we did this at least three times). Need a new encryption method? No problem, change the persistence handlers. Switch SQL-based databases? Write a new implementation of the Query Interface and pick it in config.
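To make the idea concrete, here is a minimal Python sketch of an interchangeable query layer that is picked in config. It is an illustration under stated assumptions, not the actual Datatables code: `QueryStore`, `InMemoryQueryStore`, `SqliteQueryStore`, `store_from_config` and the `query_store` config key are all hypothetical names, and SQLite simply stands in for whichever database is being evaluated.

```python
import json
import sqlite3
from abc import ABC, abstractmethod


class QueryStore(ABC):
    """Hypothetical query interface; application code depends only on this."""

    @abstractmethod
    def put(self, key: str, row: dict) -> None: ...

    @abstractmethod
    def find(self, **criteria) -> list: ...


def _matches(row: dict, criteria: dict) -> bool:
    # A row matches when every requested field has the requested value.
    return all(row.get(field) == value for field, value in criteria.items())


class InMemoryQueryStore(QueryStore):
    """Small commitment: enough to build business logic before choosing a database."""

    def __init__(self):
        self._rows = {}

    def put(self, key, row):
        self._rows[key] = dict(row)

    def find(self, **criteria):
        return [dict(r) for r in self._rows.values() if _matches(r, criteria)]


class SqliteQueryStore(QueryStore):
    """Stand-in for a SQL-backed implementation; rows are stored as JSON text."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS rows (key TEXT PRIMARY KEY, body TEXT)"
        )

    def put(self, key, row):
        self._db.execute(
            "INSERT OR REPLACE INTO rows (key, body) VALUES (?, ?)",
            (key, json.dumps(row)),
        )
        self._db.commit()

    def find(self, **criteria):
        cursor = self._db.execute("SELECT body FROM rows ORDER BY key")
        rows = [json.loads(body) for (body,) in cursor]
        return [r for r in rows if _matches(r, criteria)]


# Registry of available implementations, keyed by the (assumed) config value.
STORES = {"memory": InMemoryQueryStore, "sqlite": SqliteQueryStore}


def store_from_config(config: dict) -> QueryStore:
    """Pick the implementation named in config; callers never import one directly."""
    name = config.get("query_store", "memory")
    if name not in STORES:
        raise ValueError(f"unknown query_store: {name!r}")
    return STORES[name]()
```

Calling `store_from_config({"query_store": "sqlite"})` returns a store with the same `put` and `find` behavior as the in-memory one, so swapping databases in this sketch touches one class and one config value rather than the business logic.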
“What is your plan for encryption?”
“What kind of encryption do we need to have and when do we need to have it? Pre-Prod? Beta?”
“Not sure”
“Who would know that? Why do we need to do that right now? What is our next step?”
“We’ll worry about encryption when we have a concrete answer”
Often along the path to production questions will come up and decisions will appear to need making, but it is important to ask “Why right now?” If you’re exploring a new product, do you need the data encrypted? You should be mindful of decisions, but you do not need to commit to a solution. You can first build solutions that focus on the code that helps validate initial assumptions, which is most likely your business logic rather than your communication or storage logic. Make stage-appropriate, incremental decisions.
This strategy allowed us to ride disruption from other teams’ decisions: where we were going to be deployed, what our communication framework was going to look like, whether we would use MySQL or RDS, S3 or Google Cloud Storage, and so on. How would the API for Linking (our data synchronization system) work? How would we authenticate users? What about permissions? It was very helpful in working with other teams and made our integration phases much simpler. If you are uncertain whether what you are using now is what will be used in production, abstract it away and move on. This also makes it easier to do contract-based testing using “Test Doubles” of the underlying system (think mocks, stubs, etc.).
Because we delayed commitment, the abstractions were already in place, so trying out a proof of concept on AWS was only a few days’ work during LINK (our internal dev conference). Our experience there also helped us work with the larger collection of teams to provide data for the decision to move to AWS. We also had links being created and showing up in the Spreadsheets Dart code before we integrated with the linking system, and by building our linking abstraction we helped answer critical API questions for the linking team.
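As a rough illustration of contract-based testing with a test double, the Python sketch below defines a contract, an in-memory fake that honors it, and one check that can be run against the fake today and against a real client later. Everything here is assumed for the example: `LinkingClient`, `FakeLinkingClient`, `check_linking_contract` and the method names are hypothetical and do not describe the real linking API.

```python
from typing import Protocol


class LinkingClient(Protocol):
    """Hypothetical contract for a linking service; names are illustrative only."""

    def create_link(self, source_cell: str, target_cell: str) -> str: ...

    def links_for(self, source_cell: str) -> list: ...


class FakeLinkingClient:
    """In-memory test double that satisfies the same contract."""

    def __init__(self):
        self._links = {}

    def create_link(self, source_cell, target_cell):
        # Ids are sequential here; a real service would choose its own scheme.
        link_id = f"link-{len(self._links) + 1}"
        self._links[link_id] = (source_cell, target_cell)
        return link_id

    def links_for(self, source_cell):
        return [
            target
            for (source, target) in self._links.values()
            if source == source_cell
        ]


def check_linking_contract(client: LinkingClient) -> None:
    """Contract test: any implementation, fake or real, should pass this."""
    link_id = client.create_link("A1", "B2")
    assert isinstance(link_id, str) and link_id
    assert client.links_for("A1") == ["B2"]
    assert client.links_for("Z9") == []


# Run the contract against the double; a real client would be passed the same way.
check_linking_contract(FakeLinkingClient())
```

The useful property is that the contract lives in one place: when the real system's API settles, its client is run through the same check, and any disagreement shows up as a failing assertion instead of as an integration surprise.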
Focusing on delivery
One of our guiding principles was asking ourselves what we could do to release our code faster. Another was asking what we needed to do in order to be in production with customers as a first class product.
“How do you deliver your code to production?”
“Who needs to be involved in a release?”
“How many releases are done each week?”
When we started on spreadsheets (IAPI) in the summer of 2014 we were focused on reaching production, but we were very aware that some major decisions still needed to be made across the company on exactly how that was going to work. At that stage we focused more on stable releases of the code base between the teams than on releasing to a production environment, as that was undefined. Even so, we still aimed at production, and because of that focus we released a version of Spreadsheets in January 2015 for beta customers only.
Our first release into Harbor (our cluster/container management system) in early fall of 2015 took roughly two weeks to get up and running. Our initial release to our sandbox environment took two days, and our initial release to the production cluster took two hours. After each initial release we looked at what had caused delays and either improved the process or built more automation to keep us from making mistakes. Along the way we added things like seed config files for each domain, wrote run book after run book after run book, and built more sophisticated health checks that let us quickly validate configuration and connection issues.
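A health check of that kind can be sketched in a few lines of Python. This is only an assumed shape, not our production checks: the function names and the `db_host` and `db_port` config keys are hypothetical, and a plain TCP connect stands in for whatever a real dependency check would do.

```python
import socket


def check_config(config: dict, required: list) -> list:
    """Return one problem message per required key that is missing or empty."""
    return [f"missing config: {key}" for key in required if not config.get(key)]


def check_connection(host: str, port: int, timeout: float = 1.0) -> list:
    """Return a problem message if a TCP connection cannot be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return []
    except OSError as err:  # covers refused connections, timeouts, DNS failures
        return [f"cannot reach {host}:{port}: {err}"]


def health(config: dict) -> dict:
    """Validate configuration first, then connectivity, and report both."""
    problems = check_config(config, ["db_host", "db_port"])
    if not problems:
        # Only attempt the connection once the settings it needs are present.
        problems += check_connection(config["db_host"], int(config["db_port"]))
    return {"ok": not problems, "problems": problems}
```

Reporting the specific problem, rather than a bare pass or fail, is what makes a check like this useful during a release: a missing seed value and an unreachable host lead to different fixes.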
To improve release times we documented our release process and rotated the responsibility through the team so everyone understood what was involved and would be able to cut a release if needed. We communicated that plan to RM and involved them in its development. Internally we started with simple documentation on how to set up our system which evolved into our current readme.
We also conducted quality reviews with our Test Engineers to ask ourselves “Why do we test this?”, “What is covered too many times and slowing down our CI?”, “How can we be confident in this abstraction and not require a Skynet test?”, and “How can we get feedback faster as developers that something is broken?” These helped us keep our automated testing systems lean. At one point we were at almost one full hour of Skynet-based testing with another 30 minutes of automated cluster testing on top of that. We want our builds to be nimble so we can address production problems quickly.
“What is your deployment strategy?”
Relentless documentation helped us answer and address questions as we slowly worked out what the actual deployment strategy was going to be for Gen2 production systems. When we first released to production for beta customers on GCE in January of 2015 we only had an inkling of what we would eventually end up needing. We moved from there to work with what would eventually become Harbor and into the Docker and container realm. Documentation also helped field questions across the board about our system, how it behaved, and what the major architectural changes were going to be.
“We have a ton of questions”
Taking time to ensure Cloud Operations, Infrastructure Engineering and Management understand what you are trying to accomplish and how your system works will tremendously improve your ability to deliver your product to production. To help us work with other teams we have a habit of making slides to help explain (and document in the process) what we are trying to accomplish and then taking time to meet with different groups to answer questions. Whether you are explaining in detail how something works, the overall architecture, or just your own crazy idea, making it easy to discuss can be critical to your success. I recommend working with Security early to define what requirements you need to follow with your data regarding encryption and data recovery. Preparing documentation up front for them detailing your endpoints and persistence layers can speed up your security review.
“What do we need to do to get into production?”
Asking people what they want is always dangerous, but it is almost always better to know up front what you need to do than to go for that final SOC1 +1 to release your code and find out there are a number of boxes unchecked. You may end up writing things like SLAs and support documentation, building support tooling, using new metric systems, or any number of other things. If you ask questions early, you have time to adjust and mitigate the risks in those requirements. Always remember that those things help people outside of your team help your team once your product enters production.
“What are the risks with bringing Datatables to production by the end of December 2015?”
Sometimes in order to align across large groups we need to have an understanding of the risks of pushing for an earlier than expected release. Sometimes you have to poll the other groups and be the motivator to bring that information together to help everyone understand each other’s risks. This is also an opportunity to detail what will not be done if we make a decision. What work can we live without? What work can we suspend? Will changing direction cause us to have to come back to something later? If we do not do something, what is the consequence?
All along the way we have all been learning and detailing how we bring a new product to production. We created a development cluster as a place to experiment. People have built new automated systems to assist with compliance and make the process even smoother. Quality Assurance has been right there along the way building new testing and load generation frameworks. Infrastructure, Reliability and Cloud Operations have also built numerous new systems in tandem with us (Harbor specifically).
Even the Finance department at Workiva has started working more closely with R&D and Infrastructure to facilitate more understanding around costs. To help with some concerns we built a rough cost model of the projected costs of all our service calls, storage, drones, etc. Legal and Sales worked through creating our new beta agreements to allow customers to use Sandbox, detailing how data could be lost there and how our SDLC (systems development life cycle) would apply.
Writing software is risky and difficult, and together, as a company, as a team, and as friends, we need to help each other ride the waves of chaos and find repeatable patterns of success.
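A rough cost model does not need to be elaborate. The Python sketch below shows one assumed shape: projected usage per line item multiplied by a unit price, then totaled. The function name, the line-item names and their units are hypothetical, and no real prices or usage figures are implied.

```python
from decimal import Decimal


def monthly_cost(usage: dict, unit_prices: dict) -> dict:
    """Multiply projected usage by unit price for each line item and add a total."""
    lines = {item: usage[item] * unit_prices[item] for item in usage}
    # The sum is computed over the line items before "total" is added to the dict.
    lines["total"] = sum(lines.values())
    return lines


# Placeholder inputs only; substitute projected usage and negotiated prices.
example = monthly_cost(
    usage={"service_calls": 10, "storage_gb": 2},
    unit_prices={"service_calls": Decimal("0.5"), "storage_gb": Decimal("3")},
)
```

Keeping the model as a per-item breakdown rather than a single number makes it easier to discuss with Finance which assumption, call volume or storage growth for example, drives the projection.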