Self-servicification of internal teams

Common pitfalls of internal services teams moving to a self-serve model, and what to do about them.

I've recently been thinking about platform teams at work, specifically teams that are moving from a services model to a self-serve model.

Infrastructure teams no longer set up services for you; instead, you follow a runbook, running commands you don't understand. UXR no longer runs research for you; it's up to you to recruit users to talk to and conduct the interviews. Data science no longer does the analysis for you; you write your own queries and do your own analysis.

To generalize: a specialist team becomes overloaded as a shared resource and pushes some of its responsibilities onto the service consumer, which frees up the specialist team to work on higher-leverage things (usually more tools or processes to help downstream consumers).

This change in operating model distributes the same work across a much larger pool of people, but it also generates a surprising amount of inefficiency across that pool. In most cases, I'm skeptical that this is truly optimal compared to staffing up the team in question.

One of the assumptions made when moving a team from a service model to a self-serve model is that some set of common tasks can be distributed, and that resources can therefore be drawn from a much larger group of people whose combined capacity exceeds that of the team. For that to be true, tasks need to be well defined (e.g. follow these steps to start a server) rather than free form (e.g. solve my novel problem please). (As an aside, if it were a truly well-defined procedure then it probably should just be automated completely!) In addition, tools and systems to support self-serve tasks need to be in place to drive better efficiency and maintain quality.

In practice, I find a common pitfall is that tasks are not as well defined as teams think and only a subset of the original service work is actually distributable. Tools that were ok for the team are usually insufficient for wider consumption. Every undocumented exception and workaround will be rediscovered by the group the work is distributed to.

This amounts to a massive amount of additional work that isn't accounted for when moving to a self-serve model. The quality of the work tends to drop, and the drop is counteracted with more process or (hopefully) automated tools, which places a larger burden on both the self-serve users and the responsible team.

Does the services team actually reclaim bandwidth to work on higher order things? Does the collective leverage of self-serve users actually improve efficiency? Maybe not.

But there is another, bigger danger: when many services move to self-serve and inherit the inefficiencies mentioned above, the whole company slows down.

Inefficiencies like this tend to go unnoticed since the work is now distributed across a larger group. There might be complaints, but it's difficult to coalesce that feedback. If you ask people why their velocity keeps declining, they can't quite pinpoint the cause, since we tend to rationalize it away: 'oh yeah, that is annoying, but I only need to do it once a month'. The distributed nature makes it tricky to diagnose because you need to somehow aggregate the impact across everyone.

Is there something we can do to prevent this? How should teams that are considering moving to a self-serve model think about making this change?

The mental model I adopt is a quasi 'net operating efficiency', where the significant variables are how well a common task can be defined, how many potential self-serve users will perform the task and how often, the cost of tooling and maintenance, the cost of administering the process, and the bandwidth saved by the owning team, all over some useful lifetime.
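To make that mental model concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the field names, the units (hours per month), the `leverage` multiplier for how much more a reclaimed specialist hour is worth, and especially the formula itself, which charges non-experts more time the less well defined the task is. It is one possible way to combine the variables, not a validated model.

```python
from dataclasses import dataclass


@dataclass
class SelfServeProposal:
    # All fields, units, and numbers are hypothetical, chosen only to illustrate.
    task_definedness: float          # 0 < x <= 1; 1.0 means fully scripted steps
    users: int                       # potential self-serve users
    tasks_per_user_per_month: float  # how often each user performs the task
    expert_hours_per_task: float     # time the owning team spends per task today
    leverage: float                  # assumed value of a reclaimed expert hour
    tooling_hours_per_month: float   # building and maintaining self-serve tools
    admin_hours_per_month: float     # administering the process
    lifetime_months: int             # useful lifetime of the arrangement


def net_operating_efficiency(p: SelfServeProposal) -> float:
    """Return hours gained (positive) or lost (negative) over the lifetime."""
    tasks_per_month = p.users * p.tasks_per_user_per_month
    # Bandwidth the owning team reclaims, weighted by the assumed leverage.
    reclaimed = tasks_per_month * p.expert_hours_per_task * p.leverage
    # Assumed penalty: the less defined the task, the longer non-experts take.
    self_serve = tasks_per_month * p.expert_hours_per_task / p.task_definedness
    monthly = (reclaimed - self_serve
               - p.tooling_hours_per_month - p.admin_hours_per_month)
    return monthly * p.lifetime_months


# A loosely defined task spread across 200 people, once a month each.
loosely_defined = SelfServeProposal(
    task_definedness=0.5, users=200, tasks_per_user_per_month=1.0,
    expert_hours_per_task=1.0, leverage=1.5,
    tooling_hours_per_month=40.0, admin_hours_per_month=20.0,
    lifetime_months=24,
)
print(net_operating_efficiency(loosely_defined))  # -3840.0
```

With these made-up numbers the move loses 3,840 hours over two years, even though the owning team reclaims 200 hours a month. Raising `task_definedness` to 0.9 flips the result to a small gain, which is the point of the model: how well the task is defined can dominate the outcome, and nobody sees the loss unless it is summed across all users.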

Teams tend to think about the bandwidth the change will save them rather than what's optimal overall. This is why this pattern, in aggregate, can cause a 'tragedy of the commons' situation.

Before making a decision, it's critical to consider these factors and alternative solutions. Organizational leaders should take much more accountability for operating model changes to counterbalance the incentives of any one team. It's easy to account for the cost of additional headcount, but hard to account for future lost efficiency and productivity across a large group of people.

It might be that hiring some more people or investing in the team's efficiency is the right thing to do, rather than 'scaling' by trying to become yet another platform team.