Runbooks Requirements and Guidelines

This page provides guidelines for creating or updating a Runbook. For high-level "meta" documentation about the Runbooks application suite (i.e. how it works "under the covers"), see RunbooksOverview.


Below are the requirements and guidelines to follow when writing runbook entries. Please feel free to add more. If you have objections to one or more of the guidelines below, let's talk about it.


What is a Runbook?

Runbooks ARE concise, specific steps for Triage personnel to use as a first line of defense.

Runbooks are NOT detailed, descriptive instructions for Service Engineers and property owners.

Push the button, get the banana

Write as if you are explaining to someone who is inexperienced with troubleshooting your property (e.g. your manager or a temp).

Make the instructions clear, concise, complete, and specific. Our SLA (alert to resolution) is 20 minutes.

See Examples as well as our Canonical Sample Runbook.

IMPORTANT - *If the only step is "Escalate", please change monitoring so that alerts go directly to the SE or the developer.*

What If There's Just One Thing to Do?

Q _My procedure is so simple. "If you see this alert, restart Apache. If that doesn't work (and it almost always works), escalate." Do I have to create a whole Runbook?_

A No. If you can truly keep it to two steps (or better yet, one!), with no decisions to make, just update the triage map directly. IMPORTANT - If the only step is "Escalate", please change monitoring so that alerts go directly to the SE or the developer.

Requirements for Every Runbook

Every Runbook must have:

  • Problem Definition
    Summarize, but be precise.
  • The Error / Alert
    A general example is good; a specific example is better.
  • Steps to Resolve
    Fewer is better. Follow the guidelines below.
  • Steps to follow if resolution does not work
    Now what?
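
The four requirements above can be checked mechanically before a runbook is published. The following is only a minimal sketch: it assumes a draft runbook is available as a dictionary keyed by section name, which is a hypothetical layout and not how the Runbooks application stores entries.

```python
# Hypothetical representation: a draft runbook as {section name: text}.
REQUIRED_SECTIONS = (
    "Problem Definition",
    "The Error / Alert",
    "Steps to Resolve",
    "Steps to follow if resolution does not work",
)


def missing_sections(runbook: dict) -> list:
    """Return the required sections that are absent or blank, in checklist order."""
    return [
        name
        for name in REQUIRED_SECTIONS
        if not (runbook.get(name) or "").strip()
    ]


if __name__ == "__main__":
    draft = {
        "Problem Definition": "Apache stops answering requests.",
        "Steps to Resolve": "1. Restart Apache.",
    }
    # Lists the sections the draft still needs before it meets the requirements.
    print(missing_sections(draft))
```

An empty list means all four sections are present; anything else is the list of sections still to write.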

Guidelines for Writing Procedures

Push the button, get the banana

Clear, Concise, Complete, and Specific

  • Make the instructions clear, concise, complete, and specific.

Remember your Audience

  • The person reading these instructions will be under time pressure. Our SLA is 20 minutes from alert to resolution (or escalation)!

  • He or she most likely does not understand the property as well as the property owner does.

  • Write as if you are explaining to someone who is inexperienced with troubleshooting your property (e.g. your manager or a temp).

Background vs. Action

  • Summarize any background information in the "executive summary" field of the attached form. (Always include a brief description of the problem in this section.)

  • The "Steps" section should contain only actions.

Policy Over Choice

Whenever possible, err on the side of Policy over Choice. Choices (options) require more thinking, may require advanced knowledge, and take additional time to consider. The chosen option may also go unremembered and unlogged: when there are two or more choices, we cannot be sure which one someone picked, whereas with a single prescribed step we always know what was done.

  • If you must present options, make it obvious that they are options.
    • If you have more than two options, reconsider.

  • If one choice is favored, list that one first and label it as preferred.

Options

  • If you have more than two "if" conditions, please discuss the procedure with the Triage team! Remember: err on the side of Policy over Choice.

  • For clarity, repeat any "If..." conditions at the top of any mutually exclusive steps that require a specific condition to be met.
    For example:
   1. If many hosts are alerting, proceed to step 2. 
      If only one host is alerting, skip to step 3.
   2. If many hosts are alerting...
   3. If only one host is alerting...

If it Looks Like a Script...

  • If your instructions look like a script... write a script!
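
As an illustration, the "many hosts vs. one host" procedure from the Options section could be turned into a script roughly like the one below. This is a sketch under assumptions: "many" is taken to mean more than one host, and `fix_many` and `fix_one` are hypothetical callables standing in for whatever steps 2 and 3 actually do.

```python
from typing import Callable, Sequence


def run_procedure(
    alerting_hosts: Sequence[str],
    fix_many: Callable[[Sequence[str]], None],
    fix_one: Callable[[str], None],
) -> str:
    """Pick the branch for the reader and report which one ran.

    The script makes the decision, so there is no choice to remember or log
    by hand: the return value records what was done.
    """
    if not alerting_hosts:
        return "nothing alerting"
    if len(alerting_hosts) > 1:  # assumption: "many" means more than one
        fix_many(alerting_hosts)
        return "many-host step"
    fix_one(alerting_hosts[0])
    return "single-host step"


if __name__ == "__main__":
    # Placeholder actions; a real script would call the actual fix here.
    result = run_procedure(
        ["host1", "host2"],
        fix_many=lambda hosts: print("handling", len(hosts), "hosts"),
        fix_one=lambda host: print("handling", host),
    )
    print(result)
```

The runbook step then shrinks to a single instruction: run the script and record its output.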

Entering Commands

  • When possible, enter commands in a way that can be easily copied and pasted on a command line.

  • Do not include the prompt in the command.

  • If there's a "variable" in the command, write it like this -- (VAR) -- and describe how to replace it with real data.

    Example:

    Fetch the members of the role:
      roles fetch -members (ROLE)
    replacing (ROLE) with the name of the actual role.
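
If commands are ever filled in by a tool rather than by hand, the (VAR) convention is easy to process. The helper below is a hedged sketch, not part of the Runbooks suite: `fill_command` is a hypothetical name, it assumes placeholders are upper-case names in parentheses, and the role name in the usage example is made up.

```python
import re
import shlex

# Assumption: a placeholder is an upper-case name in parentheses, e.g. (ROLE).
PLACEHOLDER = re.compile(r"\(([A-Z][A-Z0-9_]*)\)")


def fill_command(template: str, **values: str) -> str:
    """Replace each (VAR) in template with its shell-quoted value.

    Raises ValueError if a placeholder has no value, so a half-filled
    command is never pasted onto a command line.
    """

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise ValueError(f"no value supplied for ({name})")
        return shlex.quote(values[name])

    return PLACEHOLDER.sub(replace, template)


if __name__ == "__main__":
    # "web.frontend" is an invented role name used only for illustration.
    print(fill_command("roles fetch -members (ROLE)", ROLE="web.frontend"))
```

Quoting the value with `shlex.quote` keeps the result safe to copy and paste even when the real data contains spaces or shell metacharacters.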

Writing Instructions

  • Use Imperative / Active voice as much as possible. For example, use "Run", "Do", "Check"... rather than "You can", "you might", "you should"...

  • Break out individual instructions onto new lines for ease of reading. If there is something new to do, place the instruction on a new line.

  • Break out very different instructions into another (numbered) step. Use your judgement about whether an instruction is closely related to a previous item or deserves its own "step".

  • Refer to other runbook entries using links, e.g. [[RunBk7][submit a maintenance ticket]]

The First Step

A recommended first step for all runbooks:

  • Is this still alerting?

The Last Step

How do we know if the fix worked? What do we do if it didn't?
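
One way to make the last step unambiguous is to state exactly how often to re-check the alert before escalating. The sketch below shows the idea; `still_alerting` is a hypothetical callable that asks the monitoring system whether the alert is active, and the default of three checks one minute apart is an illustrative assumption, not a Triage policy.

```python
import time
from typing import Callable


def confirm_fix(
    still_alerting: Callable[[], bool],
    checks: int = 3,
    wait_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Re-check the alert up to `checks` times after the fix was applied.

    Returns "resolved" as soon as the alert has cleared, or "escalate"
    if it is still firing on the final check.
    """
    for attempt in range(checks):
        if not still_alerting():
            return "resolved"
        if attempt < checks - 1:
            sleep(wait_seconds)  # give the fix time to take effect
    return "escalate"


if __name__ == "__main__":
    # Simulated alert that clears on the second check; no real waiting.
    states = iter([True, False])
    print(confirm_fix(lambda: next(states), sleep=lambda seconds: None))
```

Whatever numbers a runbook chooses, they need to fit inside the 20-minute SLA along with the resolution steps themselves.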

Formatting

  • Make liberal use of white space. Create instructions that are easy to find, read, understand, and use, quickly and accurately.

  • Use Wiki markup - bold, italic, bullets...

  • Use fixed-width font for code (things to type or sample output).

Examples

We have plenty of examples available for you to look at. Start with RunBk111, our Classic Generic Example. If you want more, peruse the Index of all published runbooks.


Topic revision: r2 - 07 Jan 2015, VickiBrown