I never liked being on-call (slight understatement) or asking others to shoulder some of the load. Sometimes it feels like it’s a penalty for being more involved and knowledgeable about our code and infrastructure. And it definitely is a big distraction from core development and innovation.
But there really is no way to avoid it once you have a live product or website with paying customers. Somebody needs to be available just in case something goes wrong.
How on-call is done in your organization or by your prospective employer can make all the difference in your success (and sanity). Here are some approaches I’ve seen that can improve the on-call experience and overall productivity.
Wake Up R&D!
Being woken up in the middle of the night because the NOC or support team opened a high-severity ticket, only to find out that it was a relatively non-critical issue, absolutely sucks.
To address this all-too-common scenario, one company I worked at came up with a simple solution. They replaced the “High Severity” designation with “Wake Up R&D.” By spelling out the consequence of opening a high-severity ticket, they forced the opener to think twice (maybe even thrice) about whether the issue was really worth waking someone up in the middle of the night.
Make sure that you or your prospective employer has a good method for separating the signal from the noise.
Junior and New Employees
For junior or new employees who might be unfamiliar with all the intricacies of what constitutes a critical issue, how it should be handled, etc., it’s essential to have a runbook or some other documentation that outlines what issues warrant waking R&D up in the middle of the night.
While this type of documentation goes a long way in describing various scenarios, their severity and how they should be handled, it takes a few months for someone to get a sense of the systems they’re working with, and be able to classify incidents accurately.
Make sure that you or your prospective employer invests the time and training for newbies to ease them through this learning process.
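For what it’s worth, a runbook doesn’t have to be a long document. Here is a minimal sketch in Python of the core idea: a lookup that tells a new on-call person whether an issue warrants waking R&D. Everything in it is hypothetical and only for illustration: the issue names, the RunbookEntry fields, and especially the choice to escalate unknown issues by default, which is my assumption and not a rule from any particular company.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunbookEntry:
    description: str
    wake_up_rnd: bool  # True only if this justifies waking R&D at night
    first_steps: str   # what the on-call person should try first


# Hypothetical entries; a real runbook would describe your own systems.
RUNBOOK = {
    "checkout_down": RunbookEntry(
        description="Paying customers cannot complete purchases",
        wake_up_rnd=True,
        first_steps="Confirm on the status dashboard, then wake up R&D",
    ),
    "report_delayed": RunbookEntry(
        description="Nightly report is running late",
        wake_up_rnd=False,
        first_steps="Open a ticket for the morning",
    ),
}


def should_wake_rnd(issue_key: str) -> bool:
    """Return True if the runbook says this issue warrants waking R&D."""
    entry = RUNBOOK.get(issue_key)
    if entry is None:
        # Assumption: an issue the runbook does not cover is escalated.
        return True
    return entry.wake_up_rnd


print(should_wake_rnd("report_delayed"))  # False
```

Even something this small gives a newcomer a concrete answer at 2 a.m., and every miss is a prompt to add another entry.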
The Fifth DORA Metric
Well, it might be the sixth DORA metric, as Google already added a fifth in 2021.
Either way, hat tip to Charity Majors, CTO at Honeycomb, who suggests in this excellent blog post that software engineering management should be evaluated not only by the four original DORA metrics but also by how often their “team is alerted outside of working hours.”
This makes perfect sense to me. Management must do their utmost to ensure productivity.
Why do I make this bold claim? Well, I can only speak for myself, but if I am feeling stressed about my upcoming on-call duties, I won’t be focused on my work. If I’m tired the day after on-call, I won’t be sharp and creative. If I’m feeling overworked and underappreciated for my core contributions, I will be less motivated to give my utmost effort.
Make sure that you or your prospective employer respects employees’ overall well-being and understands that on-call duties can be very draining.
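If you want to start tracking how often the team is alerted outside of working hours, the counting itself is simple. Below is a rough Python sketch. The working hours (Monday to Friday, 09:00 to 18:00), the function names, and the sample timestamps are all my own assumptions for illustration; in practice you would feed in timestamps exported from whatever alerting tool you use and adjust for time zones.

```python
from datetime import datetime, time

# Assumed working hours: Monday to Friday, 09:00 to 18:00 local time.
WORK_START = time(9, 0)
WORK_END = time(18, 0)


def is_outside_working_hours(ts: datetime) -> bool:
    """Return True if the alert fired on a weekend or outside 09:00-18:00."""
    if ts.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return not (WORK_START <= ts.time() < WORK_END)


def count_out_of_hours_alerts(alert_times: list[datetime]) -> int:
    """Count the alerts that interrupted someone outside working hours."""
    return sum(is_outside_working_hours(ts) for ts in alert_times)


# Made-up sample data, not real incidents.
alerts = [
    datetime(2024, 3, 4, 2, 15),   # Monday 02:15, outside working hours
    datetime(2024, 3, 4, 11, 30),  # Monday 11:30, inside working hours
    datetime(2024, 3, 9, 14, 0),   # Saturday 14:00, outside working hours
]
print(count_out_of_hours_alerts(alerts))  # 2
```

Watching this number per week or per rotation is one way to see whether on-call is getting better or worse over time.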
The Fear Factor
Earlier in my career, I’d go through many emotions during on-call incidents. How will I be judged if I don’t know how to handle the situation on my own? It’s 2 a.m. — what if I’m totally off, and this is not an issue at all? Do I want to risk the wrath of the senior expert I barely said two words to since I joined the company?
These and many other thoughts would race through my head. I was fortunate to have a close working relationship with my direct manager, and I would ping him, no matter the time of day or night, whenever I was really unsure of what to do.
As CTO at Kubiya.ai, I try to create a healthy balance. Waking a teammate in the middle of the night should obviously be avoided, but at the same time is totally fine provided we did everything we could to solve it on our own. And even if it turns out to be a false alarm or some easy fix, I say better safe than sorry. But this takes coaching and publicly stating to the team that this is our approach so everyone is on the same page and no one is terrified of making the call.
If you are building your on-call structure, clearly communicate that we must try and avoid waking colleagues, but at the same time, it’s perfectly acceptable if we need to (and even if we are wrong, it’s ok too). If you are evaluating a prospective employer, try and gauge what their culture is like and ask how they address this issue.
Technology teams are not known for their outstanding communication skills. And when you throw in a tense, sev 1 situation, it gets worse. I’ve seen people on-call who believe they have tracked feature ownership correctly reach out to the “owner,” and unfortunately the message comes across as super accusatory.
They approach a developer with a buggy piece of code that seems to belong to them. They will be like “hey your code is causing the app to crash, blah blah,” and all of a sudden, the developer has an important meeting and cannot help — or worse, gets super defensive and mouths off.
So it’s super important to avoid any accusatory or critical tone or wording when you think you’ve found the issue and the person who can help.
It’s really hard to know if it’s the specific code that is at fault. Maybe it was a change in firewall configuration, or perhaps a related but different component is causing the issue. Refactoring can also make it look like someone was the author of a piece of code even though they were not.
Plus, if someone wrote something over a year ago and there were many iterations, it will take them some time to dig back in and understand.
Always be humble when raising an issue. Don’t jump to conclusions or blame anyone. Just ask for help, suggestions, and ideas from the people you think might be able to help. Let them know that while you are not sure they are the right address, you thought that perhaps, because they were involved at some point with the code, they could help.
Who Should Be On-Call?
This is really a tough question, but in my experience, operations teams should always have someone on-call. That said, if operationally things are very stable while applications are not so stable, developers might need to be the regular members of on-call rotation.
In smaller companies, devs should probably have ops capabilities anyway, so they can cover all issues.
Of course, during critical releases, particularly new features, devs should be on call.
Try and make sure that your teams are well-versed in all relevant areas to the extent possible, but obviously, this may not be realistic. So identify where the system is weakest and allocate on-call accordingly. If you are evaluating a prospective employer, try and see whether your role (dev or ops) would carry the brunt of on-call and make sure the load is reasonable and well compensated.
At the end of the day, on-call is the toll we techies must pay. But it’s worth it. Hang in there!