Before the philosophy of DevOps, developers would build products, services, and infrastructures, but the responsibility for maintaining them would shift to operators, aka system or IT admins. The DevOps philosophy removes the boundary between Operations and Development teams, making system reliability a shared responsibility of all parties.
By adopting the DevOps philosophy at OpsGenie, we’ve decided not to have a separate Ops team. Instead, development teams are responsible for operating the systems they build, which makes being on-call a critical duty that they must undertake to keep their services reliable and available.
That is why each time a new engineer joins our team, one of our challenges is preparing her to take on-call responsibilities. Otherwise, the first day on-call can be frightening — both for the engineer and the organization. To prepare for on-call duty, an engineer must be familiar with the concept of being on-call, be competent in diagnosing and correcting problems in the tech stack, and have access to relevant tools, accounts and appropriate permissions. In this guide, we take a look at seven tips to get your team’s new engineer ready for on-call duties.
To be ready for on-call duty, a team member should first understand the basics of your team’s on-call schedules and escalations, which determine who is on-call at a given time and how an alert is escalated when needed.
An on-call setup may get sophisticated based on your organization's requirements. Our OpsGenie on-call setup has different rotations for the day and the night shifts and both a primary and a secondary on-call schedule with the secondary acting as a fall-through for the alerts missed by the primary on-call engineer.
Our setup also has appropriate escalation procedures in case the engineer is stuck in traffic, has a phone whose battery dies, or simply lacks the knowledge to resolve the alert and needs to escalate it.
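To make the idea concrete, here is a minimal Python sketch of how day and night rotations, a secondary fall-through, and an escalation step could fit together. Everything in it is hypothetical: the `Rotation` class, the `on_call_at` and `escalation_chain` helpers, the engineer names, and the 5- and 10-minute delays are assumptions for illustration, not OpsGenie's actual configuration or API.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass
class Rotation:
    """A daily shift window covered by one engineer (hypothetical model)."""
    engineer: str
    start: time  # inclusive
    end: time    # exclusive; may wrap past midnight

    def covers(self, moment: datetime) -> bool:
        t = moment.time()
        if self.start <= self.end:
            return self.start <= t < self.end
        # Night shift that wraps past midnight, e.g. 20:00 -> 08:00.
        return t >= self.start or t < self.end


def on_call_at(rotations, moment):
    """Return the engineer whose rotation covers the given moment."""
    for rotation in rotations:
        if rotation.covers(moment):
            return rotation.engineer
    raise LookupError("no rotation covers this time")


def escalation_chain(primary, secondary, moment, fallback="team-lead"):
    """List who is notified, and after how long without acknowledgement."""
    return [
        (timedelta(minutes=0), on_call_at(primary, moment)),
        (timedelta(minutes=5), on_call_at(secondary, moment)),
        (timedelta(minutes=10), fallback),
    ]


# Assumed day (08:00-20:00) and night (20:00-08:00) rotations.
primary = [Rotation("alice", time(8), time(20)), Rotation("bob", time(20), time(8))]
secondary = [Rotation("carol", time(8), time(20)), Rotation("dave", time(20), time(8))]

chain = escalation_chain(primary, secondary, datetime(2024, 1, 15, 23, 30))
```

Walking a newcomer through a small model like this, next to the real schedule, can help them see who gets paged if they miss an alert.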
On-call engineers should always be accessible. Ensure that your newcomer has set up notification rules to be notified in case of an alert.
As a best practice, recommend classifying alerts so that different notification rules apply at different priority levels. For high-priority alerts, use a combination of mobile push and voice notifications to ensure a timely response. For low-priority or informational alerts, choose any single method (email, SMS, mobile push, or voice), or choose not to be notified at all to prevent alert fatigue.
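The priority-based rules described above can be sketched as a simple lookup table. The priority labels, channel names, and the `channels_for` helper below are assumptions made for illustration; a real alerting tool will have its own way of expressing these rules.

```python
# Hypothetical priority labels and channel names, for illustration only.
NOTIFICATION_RULES = {
    "high": ["mobile_push", "voice"],  # combine methods for a timely response
    "low": ["email"],                  # one quiet method is enough
    "informational": [],               # no notification, to limit alert fatigue
}


def channels_for(priority):
    """Return the notification channels for an alert priority.

    Unknown priorities fall back to the high-priority rule so that an
    unclassified alert is never silently dropped.
    """
    return list(NOTIFICATION_RULES.get(priority, NOTIFICATION_RULES["high"]))
```

Falling back to the loudest rule for unclassified alerts is a deliberate choice in this sketch: a false page is usually cheaper than a missed incident.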
You should also make sure that the new engineers have joined the related Slack channels (or the equivalent on whichever chat platform you use). Here are the channels we invite our newcomers to:
Accelerating response to the most critical incidents is the goal, so it is vital that the on-call engineer is able to classify the issue, troubleshoot on the go, and send a fix if needed — which requires managing the deployment processes. To do this effectively, make sure that the engineer is familiar with the necessary commands and has certain permissions to the environments.
Here’s a list, but for a particular job it may vary:
Knowing your organization’s infrastructure helps you quickly understand the cause of an issue. Often you can solve problems faster simply because you know how the system is laid out. To achieve this, hold sessions that explain the infrastructure and tech stack as part of your onboarding process, and make sure that the related documentation is up to date and covers all the bases.
Every team varies in the tools they use to track operational health, application performance, resource utilization, etc., so on-call engineers should get familiar with the tools used. Here’s a list we use to train our newcomers:
New Relic Insights
In most scenarios, you can identify a complex problem very quickly by querying the right events along with the right metrics. At OpsGenie, we use New Relic Insights to query data that we generate from our apps. An on-call engineer should be familiar with the metrics data sent to New Relic and have a sound knowledge of New Relic Query Language (NRQL).
CloudWatch
OpsGenie monitors almost all of the AWS services it uses via CloudWatch, which gives visibility into resource utilization, application performance, and operational health. An on-call engineer should know the basics of the CloudWatch service and which metrics, logs, and graphs can be found in the AWS Console to analyze an incident.
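Beyond the AWS Console, the same metrics can be pulled programmatically. As a hedged example, the sketch below builds the request parameters for CloudWatch's `GetMetricStatistics` operation for the standard `AWS/EC2` `CPUUtilization` metric. The `cpu_query_params` helper and the instance ID are hypothetical, and the actual call (shown only in a comment) assumes boto3 is installed and AWS credentials are configured.

```python
from datetime import datetime, timedelta, timezone


def cpu_query_params(instance_id, now, window=timedelta(hours=1)):
    """Build keyword arguments for CloudWatch's GetMetricStatistics call."""
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": now - window,
        "EndTime": now,
        "Period": 300,  # seconds per datapoint
        "Statistics": ["Average", "Maximum"],
    }


# "i-0123456789abcdef0" is a placeholder instance ID, not a real resource.
params = cpu_query_params(
    "i-0123456789abcdef0",
    datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc),
)

# With boto3 installed and credentials configured, the query would be:
#   import boto3
#   response = boto3.client("cloudwatch").get_metric_statistics(**params)
#   datapoints = response["Datapoints"]
```

Having a newcomer run a query like this against a test environment is one way to check that their AWS permissions are in place before their first shift.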
Graylog
OpsGenie stores customer logs in Graylog, so on-call engineers should know the different types of logs and their usage. The engineer must be familiar with Graylog's search function, which requires knowledge of the Elasticsearch data schema.
It is crucial to be aware of an upcoming on-call duty. To achieve this, ensure that your newcomer has configured schedule notification rules, which specify how long before her shift starts she is notified and by which notification method.
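The arithmetic behind such a rule is simple, as the following sketch shows. The `reminder_times` helper and the one-day and one-hour lead times are assumptions chosen for illustration, not defaults of any particular tool.

```python
from datetime import datetime, timedelta


def reminder_times(shift_start, lead_times):
    """Return when to send each reminder, earliest first."""
    return sorted(shift_start - lead for lead in lead_times)


# Assumed shift start and lead times: remind one day and one hour ahead.
shift_start = datetime(2024, 1, 15, 20, 0)
reminders = reminder_times(shift_start, [timedelta(hours=1), timedelta(days=1)])
```

Here the reminders fall at 20:00 on January 14 and 19:00 on January 15, ahead of a shift that starts at 20:00 on January 15.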
The responsibilities of an on-call engineer as an incident responder should be clearly defined. This prevents the burnout, confusion, and frustration that can come with being an incident responder. We suggest documenting your incident response process and the expectations for on-call behavior.
Some expectations may be listed as:
Being on-call is a critical duty to keep your services reliable and available. With the DevOps movement, it is now a best practice for those who build services to also be accountable for the success of those services. Each time a new engineer joins your team, you should prepare them to take on on-call duty. We hope that, starting with the tips in this guide, you and your team will improve your on-call onboarding process and make being on-call easier for a new engineer.