Effective firefighting for product managers

How PMs should deal with production issues

Victor Wu

--

Production issues suck. They are unavoidable. Even Google and Facebook have outages every now and then. Your customers experience degraded service or even no service at all. It’s terrible for startups, since your precious first-time customers may never return. In the age of mobile apps, it can be especially annoying if your backend services are down, resulting in nothing but ridiculous loading spinners on an app screen. (You can quickly throw up a fail whale page for a website. But it’s harder to architect your native app ahead of time to control a degraded customer experience in real time.)

As a product manager, your business stakeholders and technology teams look to you to take charge and control the situation. There’s no universal playbook to follow. But there are a few points to consider when you are firefighting.

Laser focus

It’s called firefighting for a reason. Just like a real fire, value is being burned away with every passing second. The leader on the ground has to make quick decisions. There’s no time to think, as it were. Every action or inaction may have huge ramifications. A chief surgeon making a split-second decision in the operating theater. A SWAT team commander leading a team into a hostage situation. Bill Pullman rousing the troops before taking down the aliens. In all these scenarios, the leader has to be laser-focused on the task at hand, whatever that may be. It could be restoring service. It could be contacting customers who have had their security compromised. It could be preventing literal fires. (Yes, I’m talking to you, Samsung.)

Typically, there’s one obvious goal that is mission-critical and must be achieved as soon as possible. All other goals can be dealt with days or even weeks later. And so it’s your job as the product manager to get everyone in the room aligned and focused on that one goal.

Business stakeholders might want to start assigning blame. An operations lead might start drafting an apology email on the whiteboard in the war room. A lead engineer might start breaking down the root cause, explaining why mismanaged code led to a broken microservice. A test engineer might start brainstorming better automated integration tests out loud. This is all noise, and it only delays problem solving and service restoration. Politely but firmly remind everyone in the room to focus first on the number one priority outcome.

Gather the information needed to make any immediate decisions. That may require waking up your engineers. But that initial meeting shouldn’t take more than 30 minutes to develop a game plan and assign appropriate tasks to each individual. Then dismiss everyone so that the separate teams can actually do the work and solve the problem.

Samsung has been firefighting in the past few weeks

Lead from behind

Spend up to 15 minutes discussing the business side of the production issue with your engineering team, outlining the business impact and urgency of the scenario. Take their questions and make quick decisions, using your best judgment. Now’s not the time to get approval from all your business stakeholders. You need to push forward, especially if every second matters.

Once you’ve established the baseline goal with your engineering team, step back and let them work. Don’t disturb them with multiple tasks. Don’t micro-manage. Don’t inundate them with customer complaints or any other escalations that are irrelevant to their immediate task. You want your engineering team to be executing as quickly as possible on a singular goal. And the best way to achieve that is to leave them alone. Trust them. They’re all you’ve got.

At the same time, don’t run away either. Cancel all your other obligations. Stay in the same room, within audiovisual range. This lets the team know that the production issue is still your number one priority, without you constantly bothering them. It also shows that you are immediately available to lend technical and moral support if required.

Throughout this process, ideally, the engineering manager should be running the show, managing the various technical tasks and problem-solving activities. They are best positioned to make those immediate technical decisions, having the pertinent technical and business understanding of the platform. If the engineering manager is absent (which is often the case, unfortunately) or non-existent, lean on the lead engineer or any engineer on the team, and give them ownership of the problem.

In my experience, these firefights end up serving as baptisms of fire (yes, I’m mixing my analogies/idioms) for engineers. The good engineers spring into action, taking ownership and initiative, often without being asked to. Others take a that’s-not-my-codebase-and-I-can’t-help attitude.

If there’s one thing a production issue is good for, it’s to figure out who you want on your team.

Communicate

Check in with your engineers to get regular updates. But of course, don’t bug them. (See above.) Every half hour is a good rule of thumb. If there isn’t any obvious progress within that time, ask for a frank assessment from your team and use your judgment. You might need to escalate the issue and set expectations with your business stakeholders that the problem won’t be solved as quickly as you hoped. Typically, however, your engineering team should be able to discover the problem and start implementing a fix within half an hour. They might discover additional complications along the way. That’s yet another reason for you to be there the whole time, to manage those concerns and make decisions.

If it’s a service outage, your techops organization probably already has a communications process for these scenarios. If every second counts, techops should be sending updates at least every 15 minutes. If you are a small startup, you might be the one sending those emails. In any case, the purpose is to inform the greater organization of the scope and impact of the problem, and of the progress in resolving it. There might be many folks who are interested but are not necessarily involved in the core problem-solving team. And you don’t want those folks on that team. So the content of these communications should be driven by business impact. Expose only the relevant points. Explain technical details only where they are relevant to the immediate problem. At the same time, of course, don’t sugarcoat any outstanding risks. Again, the theme is using your judgment and experience in these time-sensitive and mission-critical scenarios.
