Life of a LinkedIn on-call engineer from Bangalore
(a few months ago, before I joined LinkedIn)
Me: "I can’t quite imagine anyone enjoying being on-call."
Historically, I’d never particularly liked being on-call. I still remember the very first time I went on-call several years ago… I was on-call for an entire week and even when things weren’t on fire, it felt pretty awful. However, I also believed that developers should be responsible for more than writing code. Beyond bug fixes and new feature development, we should own the entire life-cycle of the app - its health, maintainability, observability, ease of debugging, and deployments & rollbacks. So I never turned down an opportunity to go on-call.
A few months after I joined LinkedIn, I got an opportunity to shadow Groups on-call. It was uneventful, and as far as on-call is concerned, uneventful is good. After triaging a few on-call issues, I felt a little more confident and wanted to go one step further - linkedin.com on-call. But there's a reason why linkedin.com on-call doesn't have engineers from Bangalore. Deployments happened thrice every day on US timings - 7:00 am, 10:00 am and 1:00 pm Pacific Time - and since I'm based in LinkedIn's Bangalore office, it meant I would have to stay up until ~4:00 am, and that's just not feasible. "Blessing in disguise?", I wondered for a moment. But my mind overpowered my heart and I decided to give it a try.
Henry Majoros was the primary on-call, Raul Rivero was the secondary on-call, and I was shadowing them. Henry shared a few docs and slides to introduce me to on-call responsibilities and processes. He even went one step further and had a quick call with me just before our shift started, where he made a nice suggestion - I would take care of the first deployment alone and shadow them for the rest of the day. This arrangement meant that I could wrap up around 1:00 am IST, instead of staying up all night. How considerate! It was also a win-win, because the first deployment at 7:30 PM IST was easily doable for me and the US folks could take it easy early in the morning too.
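For the curious, the time-zone arithmetic behind this arrangement can be sketched in a few lines of Python. This is only an illustration: the date below is an assumption (any day in early September, when the US West Coast is on daylight time, gives the same result), and the helper name `to_ist` is made up for this post.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")
INDIA = ZoneInfo("Asia/Kolkata")

# Assumed date, purely for illustration: a day in early September.
ASSUMED_DAY = (2019, 9, 3)


def to_ist(hour, minute=0):
    """Convert a Pacific wall-clock time on the assumed day to India time."""
    pacific_time = datetime(*ASSUMED_DAY, hour, minute, tzinfo=PACIFIC)
    return pacific_time.astimezone(INDIA)


# The three daily deployment slots mentioned above.
for hour in (7, 10, 13):
    print(f"{hour:02d}:00 Pacific -> {to_ist(hour):%H:%M} IST")
```

Under that assumption, the first slot lands at 19:30 IST, while the last one starts at 01:30 IST the next day, which is why taking only the first deployment made the week manageable.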
I had one more quick call with Raul, where he briefed me on the day-to-day on-call tasks and shared some very handy bookmarks. He gave quick demos of how to monitor trunk health and how to annotate a PCL failure. Whatever apprehensions I had before slowly started to fade away.
Due to a scheduled moratorium on Labor Day, we had a very quiet weekend, but we were expecting some firefighting the following day. Henry started the day with a screen-sharing meeting where he explained and performed the first deployment of the day. Although there is some really good documentation on on-call, nothing beats a solid demo. He made it seem super easy (and sometimes it is!). If all goes well, deployment is all about clicking the right buttons on the CRT UI. This was new to me, since the deployment processes I had seen elsewhere were all cumbersome and really hectic. Kudos to the LinkedIn Tools team for nailing it!
Then came the day when I was going to deploy linkedin.com for the very first time. Henry woke up early and made sure he had my back in case something went wrong. Knowing that, and having read most of the docs related to on-call, I felt pretty confident. Fortunately for me, we didn't face any hiccups during the deployment and it was green all the way (Green Day playing in the background was pure coincidence)! Dream debut indeed. Time and again, I couldn't believe how effortless the on-call process was here.
For the remaining days, I continued the same schedule, while also monitoring codebase health and annotating build failures as and when needed. I learnt quite a few things during the on-call week, such as smoke testing canary versions and debugging production issues. On the last day before the handoff, I presented the updates at the daily standup meeting and wrapped things up.
I never thought I would say this, but the entire on-call experience was enjoyable! The current state of on-call is far from perfect, and I doubt it will ever be “perfect”. The best we can aim for is on-call that is sustainable, leaves the systems in a good enough state, and doesn't leave the team dysfunctional. And we're almost there!
(after one week of shadowing linkedin.com on-call)
Me: "Okay, on-call doesn’t have to be something that people dread."