Nine of us were sitting in a crowded pizza place, having ordered dinner after a long work day. Suddenly, there was a conspicuous silence on the other side of the table. I looked over to see that a few of my teammates were frantically scrolling through their phones. A few minutes later, I realized that some of our code had broken, and the person on call had been paged to help fix things.
Getting paged can be quite stressful. I’ve been paged in the middle of the night a few times because some code that my team owned, but that I wasn’t really familiar with, had broken. It’s hard to concentrate on fixing the problem when you’re groggy and trying to wake up, or sitting in a noisy restaurant with no space on the table for your laptop.
There are a few things that I find amazingly helpful during incident response.
Runbooks for on-call response: It is gratifying to see that someone has already thought about the issue you’re supposed to fix. Having clear step-by-step instructions for resolving the issue is a lifesaver.
Being able to focus on the problem is paramount. It is easier to focus on solving the problem if someone else is there to handle the administrative and communication work.
(A small) bias towards reverting instead of fixing forward. It’s easy to fall for the temptation of preparing a fix for the bug instead of reverting the buggy PR. Fixing forward can get complicated, and most of the time, it’s better to revert as quickly as possible and follow up later with a fix. However, this isn’t absolute: there are many cases where a fix forward is the best solution. I try to use my best judgment.
In this case, most of us focused on our food while three people handled the incident response. The offending code was reverted and things went back to normal. As a bystander, I didn’t feel much stress, but I can understand how hard it must have been for the responders.