Outline
From a big-picture perspective, you'll need to think about how likely you are to be hit by, say, a ransomware event that left your business servers unusable for a few weeks - and how much that would cost you in terms of revenue losses and mitigation costs. Against that you might measure the up-front and ongoing costs associated with backups and system hardening efforts. That'll tell you how much you can actually afford, and also how urgently you should work to develop an overall defence plan.
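One rough way to frame that comparison is to set a probability-weighted yearly loss against the cost of protection over the same period. The sketch below is only an illustration: the function names, the probability, and every dollar figure are hypothetical placeholders, not numbers drawn from real incident data.

```python
def expected_annual_loss(annual_probability, outage_weeks, weekly_revenue, mitigation_cost):
    """Probability-weighted yearly cost of one kind of incident."""
    impact = outage_weeks * weekly_revenue + mitigation_cost
    return annual_probability * impact


def protection_budget_gap(expected_loss, upfront_cost, ongoing_annual_cost, years=3):
    """Expected losses over the period minus what protection costs.

    A positive result suggests the protection spend is justified.
    """
    protection_cost = upfront_cost + ongoing_annual_cost * years
    return expected_loss * years - protection_cost


# Hypothetical figures: a 10% yearly chance of a three-week outage
# costing $50,000 per week in revenue plus $40,000 in cleanup.
loss = expected_annual_loss(0.10, 3, 50_000, 40_000)

# Hypothetical protection costs: $20,000 up front, $8,000 per year.
gap = protection_budget_gap(loss, 20_000, 8_000, years=3)

print(f"Expected annual loss: ${loss:,.0f}")
print(f"Three-year gap in favour of protection: ${gap:,.0f}")
```

A larger gap points to more urgency. A negative one suggests the planned spending outweighs the risk as you've estimated it, though the estimate itself deserves scrutiny before you act on it.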
You'll also want to get some sense of how interdependent your systems are: if, for example, your public-facing web applications were to go down, would that impact your back-end payroll and inventory systems as well? Do you have enough engineers available across your organization who could be pulled out of their normal work to cover for an emergency? Might you be subject to legal liabilities and expensive lawsuits as a result of an outage? At what point would the damage from a major outage be so great as to threaten the ongoing viability of your operations?
That last point is worth diving into a bit deeper. As it turns out, the IT admin world has developed some particularly useful metrics for measuring your organization's base needs, and anticipating the steps you'll need to take to ensure those needs are met. I know this list looks like a bit of an alphabet soup, but it'll be worth our time to at least briefly talk about each of these concepts.
- RPO Recovery Point Objective
- RTO Recovery Time Objective
- MTTR Mean Time to Recovery
- MTBF Mean Time Between Failures
- MTTF Mean Time to Failure
The recovery point objective measures how much data loss you could sustain before full recovery becomes impossible. We're looking backwards here from the moment you restore operations. The point is that the world has continued spinning while your servers were sleeping. Customers were trying to access your site, but were unable to complete transactions. That represents a financial loss. But there might also have been existing customers who were actively using local versions of your application without realizing that the data they were generating wasn't being properly captured by your back-end.
So you want to know approximately how much data and revenue you would normally process in an hour, and how many hours' worth of data and revenue you could afford to lose. With that number, you'll be in a position to figure out how quickly your response and recovery plan will need to work to be useful.
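As a back-of-the-envelope version of that arithmetic, here is a minimal sketch. The function name and both dollar figures are assumptions made up for illustration.

```python
def tolerable_loss_hours(hourly_revenue, max_affordable_loss):
    """Hours' worth of lost revenue the business could absorb."""
    if hourly_revenue <= 0:
        raise ValueError("hourly_revenue must be positive")
    return max_affordable_loss / hourly_revenue


# Hypothetical figures: $2,500 processed per hour, and a loss ceiling of $15,000.
hours = tolerable_loss_hours(hourly_revenue=2_500, max_affordable_loss=15_000)
print(f"Tolerable loss window: {hours:.1f} hours")
```

The same division works for data volume (records or transactions per hour) if that's a better measure of what you stand to lose.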
That brings us to the recovery time objective: how long it will take you to go from a failed state to fully restored operations. If your RTO estimates for a particular outage type are, say, 12 hours, but your recovery point objective is only six hours, then you've got a problem. What's the use of an expensive recovery plan if you already know that your business will be dead and buried by the time it completes? If that's what your numbers tell you about your current expensive recovery plan, then you'd better start working on a more expensive - or more efficient - recovery plan.
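That comparison is simple enough to run across every outage type you've planned for. The following sketch assumes you already have recovery time estimates in hours. The outage names and numbers are hypothetical.

```python
def plans_at_risk(rto_estimates_hours, rpo_hours):
    """Return the outage types whose estimated recovery time exceeds the RPO."""
    return {
        outage: rto
        for outage, rto in rto_estimates_hours.items()
        if rto > rpo_hours
    }


# Hypothetical recovery time estimates, in hours, per outage type.
estimates = {
    "ransomware": 12,
    "database corruption": 4,
    "datacentre power loss": 8,
}

at_risk = plans_at_risk(estimates, rpo_hours=6)
for outage, rto in at_risk.items():
    print(f"{outage}: estimated recovery of {rto}h exceeds the 6h objective")
```

Anything this check flags is a candidate for a faster or better-funded recovery plan.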
If the RTO is a way of quantifying your target response time for a single hypothetical disaster event, the mean time to recovery gives us a sense of how your technology will actually behave in the real world. Your MTTR might be at least partly based on the performance guarantees provided by the vendors who supply the various hardware and software components that make up your stack. But I'd guess it'll primarily depend on the performance history of your emergency teams. Again: an MTTR that's longer than your RPO is a signal that you need to give this some attention.
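If you keep records of past incidents, the MTTR is just the average of how long each recovery took. Here's a minimal sketch, assuming a simple list of recovery durations in hours. The sample history and the six-hour objective are invented for illustration.

```python
from statistics import mean


def mean_time_to_recovery(recovery_hours):
    """Average recovery duration across past incidents."""
    if not recovery_hours:
        raise ValueError("need at least one past incident")
    return mean(recovery_hours)


# Hypothetical history: hours taken to recover from four past incidents.
past_recoveries = [5.0, 9.5, 3.5, 8.0]
rpo_hours = 6  # hypothetical objective

mttr = mean_time_to_recovery(past_recoveries)
needs_attention = mttr > rpo_hours

print(f"MTTR: {mttr:.1f} hours")
print("Needs attention" if needs_attention else "Within objective")
```

A plain average hides outliers, so it's worth looking at your slowest recoveries alongside the mean.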
Besides those metrics, it can also be helpful to think about failure. Of course, I don't mean how nasty things might get if you're the one who lets the ransomware attack get past authentication. It might not be healthy to spend too much time obsessing over that one. Rather, I mean thinking about how often your systems are likely to fail.
But malware events don't happen on a schedule. And, perversely, they tend to erupt on long weekends or in the middle of the night. Here, you'll need to work with publicly-available incident data from governments and security companies. Given how many breaches go unreported, even those numbers will probably not get you too close to the real picture.
The goal will be to estimate how long your peer organizations go between major events. A "peer organization" is an organization that's largely similar to you in measures like size and industry type. In that context, the mean time between failures is generally used to measure the average operating time between repairable failures. This can help you understand your tech stack - and particularly how each layer will work under stress within the whole.
By contrast, the mean time to failure focuses on *un*repairable failures that require full component or system replacement. By and large, the MTTF will be the metric that's the most interesting in the context of attack threats.
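Both metrics come down to simple averages once you have the underlying numbers. The sketch below uses the commonly cited formulas: operating time divided by the number of failures for MTBF, and the average lifetime of replaced units for MTTF. The function names and all of the sample figures are hypothetical.

```python
def mean_time_between_failures(total_operating_hours, failure_count):
    """Average operating time between repairable failures."""
    if failure_count <= 0:
        raise ValueError("failure_count must be positive")
    return total_operating_hours / failure_count


def mean_time_to_failure(lifetimes_hours):
    """Average lifetime of units that failed and had to be replaced."""
    if not lifetimes_hours:
        raise ValueError("need at least one recorded lifetime")
    return sum(lifetimes_hours) / len(lifetimes_hours)


# Hypothetical: a system that ran 8,700 hours in a year with 3 repairable failures.
mtbf = mean_time_between_failures(8_700, 3)

# Hypothetical: hours of service from three components before each was replaced.
mttf = mean_time_to_failure([30_000, 42_000, 36_000])

print(f"MTBF: {mtbf:,.0f} hours")
print(f"MTTF: {mttf:,.0f} hours")
```

For attack-related estimates, the inputs would more likely come from published incident data for peer organizations than from your own hardware logs.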
There'll obviously be some guessing with all this, but it's critically important to develop at least some informed sense of how you're likely to perform when things go crazy and how prepared you actually are for business continuity challenges.