Your Bus Factor is a Blast Radius Waiting to Happen
We all know how it goes. During every sprint or PI planning session, there is one module that is already “assigned”. It’s the payment service, and Sharon is the SME.
As engineers, we have a darker-than-average sense of humor. We joke about how screwed we’d be if Sharon got hit by a bus. We try to resolve it during sprint and PI planning by allocating other people to the module under the guise of “cross-pollination” or “cross-training”.
But let’s be real: we all know what happens. Suddenly the sprint or the PI is at risk, and we don’t have time to “train” someone else. We give the work to Sharon.
Cross-training never occurs.
And the land of best intentions is littered with failed ideas (or something like that).
We view this as a people issue. But what if I told you it isn’t an HR issue? It’s an architectural issue, and it’s more prevalent than you think. This “bus factor” is a quantifiable single point of failure with a blast radius that can threaten your entire system. And you can identify it: the data is in your Git log.
This isn’t a gut feeling; it isn’t a “wow, it seems like Sharon always gets stuck with the payment service”. It is a verifiable fact. Run `git log` on that service’s directory and filter it by author. Now you have quantifiable proof: Sharon has made 98% of all commits over the past three years. A key component of your business, your system, is dependent on a single person.
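Here’s a quick way to pull that data from the command line. The `services/payment/` path is illustrative; point it at your own service’s directory:

```shell
# Commits per author on one directory, most active first.
# "services/payment/" is an illustrative path -- substitute your own.
git shortlog -sn HEAD -- services/payment/

# Share of commits belonging to the single most active author:
git log --format='%an' -- services/payment/ \
  | sort | uniq -c | sort -rn \
  | awk 'NR==1 {top=$1} {total+=$1} END {printf "top author: %.0f%%\n", 100*top/total}'
```

If that last number is north of 90% on a critical service, you have found your Sharon.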
This is not a problem to deal with when Sharon leaves the company. The problem exists RIGHT NOW.
- It creates a bottleneck. Every feature, every bug fix, every change that touches that service must pass through Sharon's queue. Her available time becomes the limiting factor for entire product roadmaps.
- It creates a "haunted graveyard." Other developers become terrified to touch the code. They don't understand its nuances or its hidden side effects. The code becomes "write-only" for everyone but the SME. This stifles innovation and creates a culture of fear.
- It guarantees a future crisis. When Sharon eventually moves on, that service becomes an unexploded, unknown landmine. The next developer who is forced to make a change is operating blind. A small, seemingly harmless modification can trigger a catastrophic, system-wide failure because the invisible context is gone.
We need to let go of the dark humor (as much as we may love it). Sharon may get hit by a bus tomorrow, or she may get a new job offer, or . . . and the real issue is an unmanaged, single-human dependency with a measurable blast radius. We must recognize that this blast radius is not constrained to the payment service; it extends to its upstream and downstream dependencies. And those dependencies are at risk for when, not if, that single-human dependency fails.
We MUST treat knowledge distribution as a first-class architectural metric. We MUST go beyond “lunch and learns”, and not just whine to HR that we need more headcount. We need automated systems that measure this data and surface it on a dashboard, right next to our CPU utilization, our SLOs and SLAs. We need to treat a bus factor of one on a critical service the same way infrastructure teams treat a zero-day vulnerability.
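As a minimal sketch of what that automation could look like: the script below flags any service where one author owns more than 80% of the commits. The monorepo layout (one directory per service under `services/`) and the 80% threshold are both assumptions; adjust them for your repository before wiring this into CI.

```shell
#!/bin/sh
# Sketch: flag service directories where a single author owns >80% of commits.
# Assumes a monorepo layout with per-service dirs under services/ -- adjust.
for dir in services/*/; do
  # Percentage of commits on this dir made by its single most active author.
  share=$(git log --format='%an' -- "$dir" \
    | sort | uniq -c | sort -rn \
    | awk 'NR==1 {top=$1} {total+=$1} END {if (total) printf "%.0f", 100*top/total}')
  if [ -n "$share" ] && [ "$share" -gt 80 ]; then
    echo "BUS-FACTOR ALERT: $dir (top author owns ${share}% of commits)"
  fi
done
```

Run it on a schedule, pipe the alerts into the same channel as your SLO breaches, and the bus factor stops being a joke and starts being a metric.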
We must challenge our leadership, but not by asking, “How do we get Sharon to write more documentation?” (because we all know how THAT goes).
We must ask, “What is our data-driven plan to identify and mitigate single points of failure like Sharon?”
(You will also find better flexibility in sprint/PI planning, and you will work towards preventing burnout, but those are topics for other blog posts.)