In a previous blog article, I mentioned that I was getting back to my programming roots by reading The Principles of Product Development Flow: Second Generation Lean Product Development by Donald G. Reinertsen. I am in the process of reviewing each chapter. I have already posted my reviews of:
- Chapter 1. "The Principles of Flow"
- Chapter 2. "The Economic View"
- Chapter 3. "Managing Queues"
- Chapter 4. "Exploiting Variability"
- Chapter 5. "Reducing Batch Size"
- Chapter 6. "Applying WIP Constraints"
Here is my review of Chapter 7, "Controlling Flow Under Uncertainty."
When you read the title, "Controlling Flow Under Uncertainty," the first thing that probably comes to mind is eliminating bottlenecks (i.e., controlling flow) and putting out unexpected fires (i.e., uncertainty). Actually, there's a lot more to it than that. Avoiding bottlenecks typically requires taking action upstream, before the bottleneck actually occurs. And although unexpected work can include the need to produce a hot fix, it can also involve discovering that the work is more involved than originally anticipated.
Chapter 6 covered applying work-in-process (WIP) constraints so that projects avoid having too many tasks in progress. Though this helps a software development team focus on fewer tasks at hand, it does not address "the accumulation of variances in concatenated process steps." [page 169] In other words, we also need to correct issues in the work we are actively doing right now. Chapter 7 proposes that the strategy for doing so is to get projects onto a regular cadence to improve workflow. A cadence reduces overhead and lets project checkpoints occur at regular intervals instead of on targeted milestone completion dates. Instead of checking projects at the end of phases such as design complete or code complete, evaluate progress at the end of each month to determine where corrective actions are needed. These regular checkpoints also help with synchronization, where disparate activities have to come together. The book further proposes a network-based approach to project management in which pools of resources have the capacity to respond to large and small project tasks based on availability.
Here's a recent story shared by Nathan Murith, Software Development Manager on BIM360.
It's Monday morning, and we're already hearing buzz (read: nervousness) around our upcoming, end-of-week, quarterly release. Our releases typically involve web components and apps, desktop connected applications, web services, mobile apps — the typical Autodesk setup. This Monday almost immediately turns into a "let's schedule meetings for the rest of the week so we can discuss all remaining issues and coordinate the release" kind of day. We go through our nth bug triage session, wading through enormous Jira backlogs, trying to figure out which is the last bug that is truly important so that we can fix it and squeeze it into this release. And of course, we have been in code freeze for the past 2 weeks, which gave our small army of QA personnel time to test the code we are just a few days from releasing. Work late. Fix that last bug. Push the code. Validate the fix. Find new bugs. Log them in Jira. Triage the new bugs. Rinse and repeat on Tuesday, and then again on Wednesday.
When Thursday comes around, our entire team waits for one thing and one thing only: the infamous "Go/No Go" meeting. This is a meeting to which all the stakeholders are invited: development leads, architects, QA leads, Product Management, Operations, Customer Support, my sister — you get the idea. We start the meeting by going around the table and asking everyone if they are ready and feel comfortable with tomorrow's release. Very often, issues get raised, which postpones the deployment by another week. For the sake of brevity, let's say everyone gives their "Go!"
On Friday morning, our entire development and QA teams are on hold, waiting for our Operations team to execute the deployment to our staging environment. We can't have developers working on other features or bug fixes since we are in code freeze, and the chances of the deployment hitting issues and of engineering needing to jump in to fix them are extremely high. But the deployment does complete to staging, and this is communicated to the team by emails coming from various distribution lists, aliases, an automated Jira system, Ops monitoring tools, etc. And then it's all-hands-on-deck, with QA and engineering in a lengthy, manual process of testing and validating the build that was just pushed to staging. All this manual QA pays off, and now things are good to go to prod. After another "Go/No Go" meeting, we confirm that the deployment will happen later that afternoon. And then we go through the exact same drill: we wait, Ops does the deployment, engineering and QA remain in a state of readiness. Only now it's Friday, and our afternoon is quickly turning into early evening.
The deployment completes, and there's minimal impact on our customers. We are all anxious to get home and enjoy the weekend, but there's still a slew of very manual tests and regression checks we need to complete before claiming victory. We never feel comfortable declaring success with any deployment. We leave the office on these deployment Fridays with a number of people volunteering to be on call over the weekend because something is bound to go wrong.
Since we're a cloud company, our customers never sleep, and there's always a customer who logs on over the weekend and runs into a bug, or worse, cannot access their data. And that's exactly what happens. Customer support requests and emails start pouring in. Twitter blows up with complaints. Blog trolls do their thing. My phone starts ringing. It's 4:00 am Saturday morning.
Let the hot fix fire drill begin! The first call is usually about what the issue is, who is impacted, how we can fix it, who can fix it, and whether we can get it out to the customer before Monday. We then gather a very unhappy engineering corps to start working on the issue over the weekend, all while fielding what seem like minute-by-minute requests for an update. The second call is usually about planning when we think we can push the fix. The third is, of course, the "Go/No Go" call for the hot fix, and no hot fix can be deployed without doing a post-mortem. Very often we have to repeat this entire process multiple times because the hot fix broke something else that now requires its own hot fix.
What words come to mind? Dysfunctional, unhappy customers, unhappy employees, not scalable, not resilient, frustrating, time-consuming, <insert_yours_here>.
Did I fail to mention that this is what occurred for most of our releases 4 years ago? In 2016, we push dozens of times a day. Only a handful of QA members remain on our team. And for the past 2 years, our weekends are exactly that: ours. We sleep in, we relax, and basically enjoy the two days of rest. If we want, we can release to production at any time of day or night with virtually no impact to our customers. No one is ever on call. The term "hot fix" is no longer part of our vocabulary. Our customers are delighted. Our team is happy. And this is in large part due to a focus on and adoption of Test Driven Development (TDD). TDD is "a software development process that relies on the repetition of a very short development cycle: first, an (initially failing) automated test case is written that defines a desired improvement or new function. Then, the minimum amount of code is written to pass that test, and finally, one refactors the new code to acceptable standards." [source: Wikipedia]
Thanks, Nathan.
You can see that, although not entirely perfect yet, our web-based deployment teams have made great strides over the past 4 years through TDD, by developing a regular cadence and by carefully managing the tasks and bug lists in their Jira queues.
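For readers who want to see what that short TDD cycle looks like in practice, here is a minimal sketch using Python's built-in unittest framework. The function name slugify and its behavior are hypothetical, chosen only to illustrate the cycle; this is not code from the BIM360 services themselves.

```python
import unittest

# Step 1: write the failing tests first. They define the desired behavior
# of slugify() before any implementation exists.
class TestSlugify(unittest.TestCase):
    def test_lowercases_and_hyphenates(self):
        self.assertEqual(slugify("Hello World"), "hello-world")

    def test_strips_surrounding_whitespace(self):
        self.assertEqual(slugify("  Trim Me  "), "trim-me")

# Step 2: write the minimum code that makes the tests pass.
def slugify(text):
    return "-".join(text.strip().lower().split())

# Step 3: refactor with the tests as a safety net, then repeat the cycle
# for the next small piece of behavior.
if __name__ == "__main__":
    unittest.main()
```

Run the file directly (python tdd_example.py) and the tests execute. A growing suite built this way is what eventually lets a team push dozens of times a day with confidence instead of relying on a weekend fire drill.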
Controlling deployments even under uncertainty is alive in the lab.