Before the DevOps philosophy took hold, developers would build products, services, and infrastructure, but the responsibility for maintaining them would shift to operators, also known as system or IT admins. The DevOps philosophy removes the boundary between Operations and Development teams, making system reliability a shared responsibility of all parties.
In adopting the DevOps philosophy at OpsGenie, we decided not to have a separate Ops team. Instead, development teams are responsible for operating the systems they build, which makes being on-call a critical duty that they must undertake to keep their services reliable and available.
That is why, each time a new engineer joins our team, one of our challenges is preparing them to take on on-call responsibilities. Otherwise, the first day on-call can be frightening, both for the engineer and for the organization. To be ready for on-call duty, an engineer must be familiar with the concept of being on-call, be competent in diagnosing and correcting problems in the tech stack, and have access to the relevant tools and accounts with appropriate permissions. In this guide, we take a look at seven tips to get your team’s new engineer ready for on-call duties.
Tip #1: Explain the basics of your team’s on-call schedules and escalations
To be ready for on-call duty, a team member should first understand the basics of your team’s on-call schedules and escalations, which determine who is on-call at a given time and who is notified next when an alert needs to be escalated.
An on-call setup can become sophisticated, depending on your organization's requirements. Our OpsGenie on-call setup has different rotations for the day and night shifts, and both a primary and a secondary on-call schedule, with the secondary acting as a fall-through for alerts missed by the primary on-call engineer.
Our setup also has appropriate escalation procedures in case the engineer is stuck in traffic, has a dead phone battery, or simply lacks the knowledge to handle the alert and needs to escalate it.
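To make the idea concrete, here is a minimal sketch of how a primary and secondary schedule with day and night rotations could resolve to an escalation order. It is not how OpsGenie implements schedules; the roster names, the shift hours, and the weekly rotation rule are all assumptions chosen for illustration.

```python
from __future__ import annotations

from datetime import datetime, time

# Hypothetical shift boundary and rosters; every value here is an assumption.
DAY_SHIFT = (time(9, 0), time(21, 0))
ROTATIONS = {
    "primary": {"day": ["alice", "bob"], "night": ["carol", "dave"]},
    "secondary": {"day": ["erin", "frank"], "night": ["grace", "heidi"]},
}


def shift_for(now: datetime) -> str:
    """Return 'day' or 'night' for the given moment."""
    start, end = DAY_SHIFT
    return "day" if start <= now.time() < end else "night"


def on_call(schedule: str, now: datetime) -> str:
    """Pick the engineer on call, rotating through the roster by ISO week."""
    roster = ROTATIONS[schedule][shift_for(now)]
    week = now.isocalendar()[1]
    return roster[week % len(roster)]


def escalation_chain(now: datetime) -> list[str]:
    """Primary first, then the secondary as the fall-through."""
    return [on_call("primary", now), on_call("secondary", now)]
```

Calling `escalation_chain(datetime.now())` returns the people to notify, in order. Walking a newcomer through a toy model like this can make it easier to read the real schedule afterwards.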
Tip #2: Set up their alert notification rules
On-call engineers should always be accessible. Ensure that your newcomer has set up notification rules to be notified in case of an alert.
As a best practice, recommend that they classify alerts and use different notification rules based on alert priority. For high-priority alerts, use a combination of mobile push and voice notifications to ensure a timely response. For low-priority or informational alerts, choose any single method (email, SMS, mobile push, or voice), or choose not to be notified at all to prevent alert fatigue.
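As a rough illustration, priority-based notification rules boil down to a mapping from priority to notification methods. The sketch below is only a model of the idea, not an OpsGenie configuration format; the priority labels and method names are assumptions.

```python
from __future__ import annotations

# Hypothetical priority labels and notification methods.
NOTIFICATION_RULES = {
    "high": ["mobile_push", "voice"],  # combine methods for a timely response
    "low": ["email"],                  # one quiet method is enough
    "informational": [],               # no notification, to limit alert fatigue
}


def methods_for(priority: str) -> list[str]:
    """Return the notification methods for an alert priority.

    Unknown priorities fall back to the high-priority rule so that
    an unclassified alert is never silently dropped.
    """
    return NOTIFICATION_RULES.get(priority, NOTIFICATION_RULES["high"])
```

Falling back to the loudest rule for unclassified alerts is a design choice in this sketch: a false page is usually cheaper than a missed one.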
You should also make sure that new engineers have joined the relevant Slack channels (or the equivalent on whichever chat platform you use). Here are the channels we invite our newcomers to:
- #deployment, #deploy-status: The channels in which the status and each step of any deployment task are posted.
- #operations: The channel in which the production alerts are posted.
Tip #3: Make sure they have access to the right tools and correct permissions
Accelerating the response to the most critical incidents is the goal, so it is vital that the on-call engineer is able to classify the issue, troubleshoot on the go, and ship a fix if needed, which requires managing the deployment processes. To do this effectively, make sure that the engineer is familiar with the necessary commands and has the required permissions for the environments.
Here’s a starting list; the specifics will vary by role:
- Access to management tools
- ChatOps commands
- Links to the runbooks
Tip #4: Make sure that they know your infrastructure and tech stack
Knowing your organization’s infrastructure helps an engineer understand the cause of an issue quickly. Problems are often solved faster when you know how the system is laid out. To achieve this, hold sessions that explain the infrastructure and tech stack as part of your onboarding process, and make sure that the related documentation is up to date and covers all the bases.
Tip #5: Train them on the relevant diagnostic tools
Teams vary in the tools they use to track operational health, application performance, resource utilization, and so on, so on-call engineers should get familiar with the tools your team uses. Here are the tools we train our newcomers on:
New Relic Insights
In most scenarios, you can identify a complex problem very quickly by querying the right event along with the right metrics. At OpsGenie, we use New Relic Insights to query the data that our apps generate. An on-call engineer should be familiar with the metrics data sent to New Relic and have a sound knowledge of New Relic Query Language (NRQL).
CloudWatch
OpsGenie monitors almost all of the AWS services it uses via CloudWatch, which gives visibility into resource utilization, application performance, and operational health. An on-call engineer should know the basics of the CloudWatch service and which metrics, logs, and graphs can be found in the AWS Console to analyze an incident.
Graylog
OpsGenie stores customer logs in Graylog, so on-call engineers should know the different types of logs and their usage. The engineer must be familiar with Graylog's search function, which requires knowledge of the Elasticsearch data schema.
Tip #6: Set up their schedule notification rules
It is crucial to be aware of an upcoming on-call shift. To achieve this, ensure that your newcomer has configured schedule notification rules, which specify how long before the shift starts they should be notified and by which notification method.
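A schedule notification rule has just two parts: a lead time and a method. The sketch below models that with a hypothetical `ScheduleNotificationRule` class; the class name, its fields, and the one-hour lead time are assumptions for illustration, not part of any OpsGenie API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ScheduleNotificationRule:
    """Hypothetical model of a 'remind me before my shift' rule."""

    lead_time: timedelta  # how long before the shift to send the reminder
    method: str           # e.g. "mobile_push" or "email" (assumed names)

    def notify_at(self, shift_start: datetime) -> datetime:
        """Return the moment the reminder should be sent."""
        return shift_start - self.lead_time


# Example rule: remind by mobile push one hour before the shift starts.
rule = ScheduleNotificationRule(lead_time=timedelta(hours=1), method="mobile_push")
```

For a shift starting at 09:00, `rule.notify_at(...)` yields 08:00 on the same day. A newcomer may want more than one rule, for example one the evening before and one shortly before the shift.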
Tip #7: Define their responsibilities as an Incident Responder
The responsibilities of an on-call engineer as an incident responder should be clearly defined. This prevents the burnout, confusion, and frustration that can come with being an incident responder. We suggest documenting your incident response process and the expectations for on-call behavior.
The documented expectations should answer questions such as:
- When should they acknowledge an alert?
- How should they prioritize and classify an incident?
- When should they escalate to more senior team members or other teams?
- What should they do when the problem is something that another team should look into?
- When should they inform appropriate stakeholders such as executive leadership and customer support?
- When should they enter a Status Page entry?
- How should they handle short periods of time where they need to be away from their computers?
- How should they document incidents for postmortem analysis?
Being on-call is a critical duty to keep your services reliable and available. With the DevOps movement, it is now a best practice for those who build services to also be accountable for the success of those services. Each time a new engineer joins your team, you should prepare them to take on on-call duty. We hope that, starting with the tips in this guide, you and your team will improve your on-call onboarding process and make being on-call easier for a new engineer.