Runbooks Requirements and Guidelines
This page provides guidelines for creating or updating a Runbook.
For high-level "meta" documentation about the Runbooks application suite
(i.e. how it works "under the covers"), see RunbooksOverview.
Below are the requirements and guidelines to follow when writing runbook entries.
Please feel free to add more.
If you have objections to one or more of the guidelines below, let's talk about it.
What is a Runbook?
concise specific steps
for Triage personnel to use as a first line of defense.
Runbooks Are NOT
detailed descriptive instructions
for Service Engineers and property owners.
Push the button: get the banana
Write as if you are explaining to someone
who is inexperienced with troubleshooting your property (e.g. your manager or a temp).
Make the instructions clear, concise, complete, and specific.
Our SLA (alert to resolution) is 20 minutes.
See the guidelines below, as well as our Canonical Sample Runbook.
*If the only step is "Escalate", please change monitoring
so that alerts go directly to the SE or the developer.*
What If There's Just One Thing to Do?
_My procedure is so simple.
"If you see this alert, restart Apache.
If that doesn't work (and it almost always works), escalate."
Do I have to create a whole Runbook?_
If you can truly keep it to two steps (or better yet, one!),
with no decisions to make, just update the triage map directly.
- If the only step is "Escalate", please change monitoring
so that alerts go directly to the SE or the developer.
Requirements for Every Runbook
Every Runbook must have:
- Problem Definition
Summarize, but be precise.
- The Error / Alert
A general example is good; a specific example is better.
- Steps to Resolve
Fewer is better. Follow the guidelines below.
- Steps to follow if resolution does not work
Guidelines for Writing Procedures
Push the button, get the banana
Clear, Concise, Complete, and Specific
- Make the instructions clear, concise, complete, and specific.
Remember your Audience
- The person reading these instructions will be under time pressure. Our SLA is 20 minutes from alert to resolution (or escalation)!
- He or she most likely does not understand the property as well as the property owner does.
- Write as if you are explaining to someone who is inexperienced with troubleshooting your property (e.g. your manager or a temp).
Background vs. Action
- Summarize any background information in the "executive summary" field of the attached form. (Always include a brief description of the problem in this section.)
- The "Steps" section should contain only actions.
Policy Over Choice
Whenever possible, err on the side of Policy over Choice.
Choices (options) require more thinking, may require advanced knowledge,
and take additional time to consider.
The chosen option may not be remembered and may not be logged
(i.e. when there are two or more choices, we may not be sure which one someone picked;
however, if there is only one prescribed step, we will always "know" what was done).
- If you must present options, make it obvious that they are options.
- If you have more than two options, reconsider.
- If one choice is favored, list that one first and label it as preferred.
- If you have more than two "If..." conditions, please discuss the procedure with the Triage team! Remember: err on the side of Policy over Choice.
- For clarity, repeat any "If..." conditions at the top of any mutually exclusive steps that require a specific condition to be met.
1. If many hosts are alerting, proceed to step 2.
If only one host is alerting, skip to step 3.
2. If many hosts are alerting...
3. If only one host is alerting...
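If a branch like the one above ends up in a script, the script can make the decision and record which path it took, so nobody has to remember which option was picked. The following is a minimal sketch, not an existing tool: the function names are hypothetical, and it assumes the alerting hosts are passed in as arguments (a real script would get them from your monitoring system).

```shell
#!/bin/sh
# Sketch only. Assumption: alerting hosts are given as arguments.

# log MESSAGE: print a timestamped line so the branch taken is on record.
log() {
    printf '%s %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$*"
}

# choose_step HOST...: print the number of the runbook step to follow.
#   no hosts       -> error message on stderr, status 1
#   exactly one    -> 3
#   more than one  -> 2
choose_step() {
    if [ "$#" -eq 0 ]; then
        echo "no alerting hosts given" >&2
        return 1
    elif [ "$#" -eq 1 ]; then
        echo 3
    else
        echo 2
    fi
}
```

For example, with made-up host names, `step=$(choose_step web1 web2) && log "many hosts alerting, following step $step"` prints one timestamped line that can be pasted into the ticket.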
If it Looks Like a Script...
- If your instructions look like a script... write a script!
- When possible, enter commands in a way that can be easily copied and pasted on a command line.
- Do not include the prompt in the command.
- If there's a "variable" in the command, write it like this: (VAR),
and describe how to replace it with real data. Example:
Fetch the members of the role:
roles fetch -members (ROLE)
replacing (ROLE) with the name of the actual role.
- Use Imperative / Active voice as much as possible. For example, use "Run", "Do", "Check"... rather than "You can", "you might", "you should"...
- Break out individual instructions onto new lines for ease of reading. If there is something new to do, place the instruction on a new line.
- Break out very different instructions into another (numbered) step. Use your judgement about whether an instruction is closely related to a previous item or deserves its own "step".
- Refer to other runbook entries using links, e.g.
[[RunBk7][submit a maintenance ticket]]
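When a step with a (VAR) placeholder is run often, one option is to wrap it in a small script so the placeholder becomes an argument and cannot be pasted in unreplaced. The sketch below is an illustration under assumptions: the function name is hypothetical, and it only prints the command from the example above instead of running it.

```shell
#!/bin/sh
# Sketch only: turns the (ROLE) placeholder into a required argument.
# It prints the command rather than executing it, so it is safe to try.

# fetch_members_cmd ROLE: print the command line for the given role.
# Returns status 2 if ROLE is missing or is still the literal placeholder.
fetch_members_cmd() {
    role=${1:-}
    if [ -z "$role" ] || [ "$role" = "(ROLE)" ]; then
        echo "usage: fetch_members_cmd ROLE" >&2
        return 2
    fi
    printf 'roles fetch -members %s\n' "$role"
}
```

Calling `fetch_members_cmd web.frontend` (a made-up role name) prints the command with the role filled in; calling it with no argument, or with the literal text (ROLE), prints a usage message and fails.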
The First Step
A recommended first step for all runbooks:
The Last Step
How do we know if the fix worked? What do we do if it didn't?
- Make liberal use of white space. Create instructions that are easy to find, read, understand, and use, quickly and accurately.
- Use Wiki markup: bold, italic, bullets, and
fixed-width font for code (things to type or sample output).
We have plenty of examples available for you to look at.
Start with RunBk111, our Classic Generic Example.
If you want more, peruse the Index of all published runbooks.