Plans and Playbooks

I should begin by clarifying what I mean by the word "playbook" and how it relates to our recovery plans. A playbook is the formal written record of the policies and processes that will reliably produce a working deployment of an organization's resource stack. When it comes to generating predictable results, the playbook is the plan.

I'll describe all the key elements of a good playbook in just a moment. But it's important to emphasize that a playbook on its own is more or less useless unless your team is able to read it and convert it into real-world results. To do that you'll need to make sure every relevant member of your team is completely familiar with their roles and how they'll be expected to carry them out. That'll require you to distribute copies of the plan and ensure that everyone gets the training they'll need to perform perfectly when the time arrives.

At any rate, a good plan begins with clear definitions. Where can you find up-to-date and clean copies of the source code? Where should your production environment be hosted? In a public cloud like AWS? On-premises? What is the infrastructure supposed to accomplish? What's the scope of your operation: what scale of hardware resources will it require?

A playbook should also clearly define the policies that must be followed throughout the rebuilding process. How is organizational data to be protected? What decisions must be made only by senior company officers? Are there restrictions on what software and third-party solutions can be used, or on which countries they can be acquired from? Are there stack components that must remain local, or can everything live in the cloud?
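Policies like these are easier to enforce when they can be checked mechanically. Here's a minimal sketch of how that might look, assuming a made-up `Component` record and two hypothetical policy sets (`RESTRICTED_COUNTRIES` and `MUST_STAY_LOCAL`). None of these names come from a real tool, and your own playbook would define its own rules.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Component:
    name: str
    vendor_country: str  # country code of the supplier (illustrative)
    location: str        # "cloud" or "local"


# Hypothetical policy values; a real playbook would list its own.
RESTRICTED_COUNTRIES = {"XX"}
MUST_STAY_LOCAL = {"customer-db"}


def policy_violations(components: List[Component]) -> List[str]:
    """Return one human-readable message per policy breach."""
    problems = []
    for c in components:
        if c.vendor_country in RESTRICTED_COUNTRIES:
            problems.append(f"{c.name}: vendor country {c.vendor_country} is restricted")
        if c.name in MUST_STAY_LOCAL and c.location != "local":
            problems.append(f"{c.name}: must remain local, found in {c.location}")
    return problems


# Example stack (placeholder components).
stack = [
    Component("web-frontend", "US", "cloud"),
    Component("customer-db", "US", "cloud"),
]

for problem in policy_violations(stack):
    print(problem)
```

Running a check like this before a rebuild begins would flag the misplaced database while there's still time to fix it.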

Perhaps the core of any playbook is the section addressing the software, deployment tools, and procedures that you'll use at every stage of your workflow. This section should include the complete code for the scripts that move resources from code to deployment, along with links to all the software code in use and instructions for authenticating to the services you'll be using.

IT deployments are performed by people. But which people? Who do you speak to who has access to a credit card so you can purchase needed resources? Who has access to the key codebases and online accounts you'll need? Who's responsible for testing and signing off before code is pushed to production? What if that person isn't available? Each and every role relating to the project you're documenting needs to be defined, and the person responsible must be identified - along with current contact information and, ideally, a named backup.
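The "what if that person isn't available?" question can be answered in the role record itself. The sketch below is one possible shape, assuming hypothetical `Contact` and `Role` classes and placeholder names and phone numbers. It isn't a prescribed format, just an illustration of pairing each role with a primary and a backup.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Contact:
    name: str
    phone: str
    available: bool = True


@dataclass
class Role:
    title: str
    primary: Contact
    backup: Optional[Contact] = None

    def responsible(self) -> Contact:
        """Return the first available contact, preferring the primary."""
        if self.primary.available:
            return self.primary
        if self.backup is not None and self.backup.available:
            return self.backup
        raise LookupError(f"No one is available for role: {self.title}")


# Placeholder people and numbers, for illustration only.
release_signoff = Role(
    title="Production sign-off",
    primary=Contact("A. Primary", "555-0100", available=False),
    backup=Contact("B. Backup", "555-0101"),
)

print(release_signoff.responsible().name)
```

Raising an error when nobody is available is a deliberate choice here: a role with no reachable owner is something you want to discover during a drill, not during a recovery.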

Recovery operations, obviously, can be chaotic. But it's nevertheless critically important that log records be kept for every step: before, during, and after recovery. Therefore, log generation and storage should also be part of your playbook. Even if you don't have the time to read the logs right now, they'll be invaluable later as you try to review events and figure out exactly what happened. The existence of accurate and reliable logs and other records might actually be legally mandated.
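One low-effort way to get a record for every step is to wrap each one so that its start and outcome are written automatically. The following is only a sketch: `logged_step` is a hypothetical helper, the step name is a placeholder, and the records go to an in-memory list, whereas a real playbook would send them to durable storage.

```python
import json
import time
from contextlib import contextmanager


@contextmanager
def logged_step(name, log):
    """Append a start record and a done/failed record for one step to `log`."""
    log.append({"step": name, "event": "start", "time": time.time()})
    try:
        yield
    except Exception as exc:
        # Record the failure, then let the error propagate to the caller.
        log.append({"step": name, "event": "failed", "error": str(exc), "time": time.time()})
        raise
    else:
        log.append({"step": name, "event": "done", "time": time.time()})


records = []
with logged_step("restore-database", records):
    pass  # the real recovery work would go here

for record in records:
    print(json.dumps(record))
```

Because the failure branch re-raises, a broken step still stops the recovery run, but not before leaving a timestamped trace of what went wrong.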

Any code review and application testing you would normally incorporate in your deployment lifecycles should be included in your recovery playbook. After all, bugs and failures aren't going to be any more fun after a crisis than they were before it. Here too, the actual code for all the scripts that would normally power your testing should be included.

Remember how I told you that you should include complete operations scripts and links to your code base in the playbook? Do you think our playbook could be convinced to play itself? Why not?

Think about it. Orchestration tools like Ansible or Terraform - or cloud-specific tools like Amazon's CloudFormation - allow you to very closely define every layer of your infrastructure in a format that can be invoked and launched with a single command. In theory at least, there's no reason why you couldn't build your playbook as an actual script, complete with commands to pull software repos, launch complex virtual networks and compute instances, and route DNS domains. That would be a fantastic example of the power of infrastructure as code.
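To make the idea concrete, here's a rough sketch of a playbook that plays itself, written as a small Python driver. Everything specific in it is an assumption for illustration: the repository URL and directory name are placeholders, and the steps assume `git` and a recent Terraform CLI are installed. The `execute` parameter lets you rehearse the sequence as a dry run before letting it touch anything real.

```python
import subprocess
from typing import Callable, List, Optional, Sequence

# Hypothetical steps: the repository URL and directory name are placeholders.
PLAYBOOK: List[Sequence[str]] = [
    ["git", "clone", "https://example.com/org/infrastructure.git"],
    ["terraform", "-chdir=infrastructure", "init"],
    ["terraform", "-chdir=infrastructure", "apply", "-auto-approve"],
]


def run_playbook(
    steps: List[Sequence[str]],
    execute: Optional[Callable[[List[str]], object]] = None,
) -> List[str]:
    """Run each step in order and return the commands that completed."""
    if execute is None:
        # Default: really run the command and stop on the first failure.
        def execute(cmd: List[str]) -> object:
            return subprocess.run(cmd, check=True)

    completed = []
    for cmd in steps:
        execute(list(cmd))
        completed.append(" ".join(cmd))
    return completed


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    run_playbook(PLAYBOOK, execute=lambda cmd: print("would run:", " ".join(cmd)))
```

In practice, most of the heavy lifting would live in the Terraform, Ansible, or CloudFormation definitions themselves. A driver like this just fixes the order of operations and gives you a single command to invoke.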
