If something bad happens to your business, you need to recover quickly. The best way to do that is to prepare and plan thoroughly before the bad event happens.
Plan for what is difficult while it is easy. (Sun Tzu)
Incident response is the handling of situations where things aren’t operating correctly, for known or unknown reasons, and determining the what, why, and where of the problem. Disaster recovery is the handling of a bigger problem (a “disaster”) while also maintaining continuity of operations.
This is a continuation of my blog post series on the CompTIA Security+ exam.
Incident Response Plan
This plan is the set of steps an organization performs in response to any situation that is abnormal with regard to operation of computer systems. How you react depends on two things: the information criticality, and whether the incident affects other operations in the organization.
Documented Incident Types and Category Definitions
Defining a set of categories and types of incidents helps planners and responders know how to react. Generally, you’d have a set number of scripts that can be applied quickly to a given situation. This helps minimize confusion and repetition. Example categories include interruption of service, malware delivery, phishing attacks, data exfiltration, and so on.
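As a rough illustration of the "category to script" idea, here is a minimal Python sketch. The categories come from the examples above; the playbook steps and the `playbook_for` helper are hypothetical placeholders I made up for illustration, not a recommended procedure.

```python
from enum import Enum


class IncidentCategory(Enum):
    # Categories taken from the examples in this post.
    SERVICE_INTERRUPTION = "interruption of service"
    MALWARE_DELIVERY = "malware delivery"
    PHISHING = "phishing attack"
    DATA_EXFILTRATION = "data exfiltration"


# Hypothetical playbooks: the steps are placeholders for illustration only.
PLAYBOOKS = {
    IncidentCategory.SERVICE_INTERRUPTION: ["check monitoring", "notify service owner"],
    IncidentCategory.MALWARE_DELIVERY: ["isolate affected host", "collect sample"],
    IncidentCategory.PHISHING: ["warn recipients", "block sender"],
    IncidentCategory.DATA_EXFILTRATION: ["preserve logs", "notify team leader"],
}

# Fallback for anything that does not match a documented category.
DEFAULT_PLAYBOOK = ["notify IR team leader", "triage manually"]


def playbook_for(category):
    """Return the predefined steps for a category, or the fallback."""
    return PLAYBOOKS.get(category, DEFAULT_PLAYBOOK)


if __name__ == "__main__":
    for step in playbook_for(IncidentCategory.PHISHING):
        print(step)
```

The point of the fallback is that responders always have a first step, even for an incident type nobody planned for.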
Roles and Responsibilities
You also want to have the roles and responsibilities defined ahead of time. The “cyber incident” response team will consist of subject matter experts, a team leader, and a communicator. Defining who has permission to do what helps streamline the process when an incident occurs.
You also want to plan out the desired reporting requirements. This includes escalation. How do you determine whether something should be escalated? Who do you talk to? Who needs to be involved?
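One way to make the escalation question concrete is to write the rule down as code. The sketch below uses the two factors mentioned earlier (information criticality, and whether other operations are affected). The 1 to 3 criticality scale, the threshold, and the `Incident` and `should_escalate` names are all my own assumptions, so treat this as a starting point rather than a real policy.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    description: str
    # Hypothetical scale: 1 (low) to 3 (high).
    information_criticality: int
    affects_other_operations: bool


def should_escalate(incident, criticality_threshold=2):
    """Escalate when the data is critical enough or other operations are hit.

    The threshold is an assumed default; a real plan would define its own.
    """
    return (
        incident.information_criticality >= criticality_threshold
        or incident.affects_other_operations
    )


if __name__ == "__main__":
    minor = Incident("single workstation pop-up", 1, False)
    major = Incident("customer database unreachable", 3, True)
    print(should_escalate(minor), should_escalate(major))
```

Writing the rule down ahead of time means nobody has to debate it in the middle of an incident.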
Your cyber-incident response team should have a predefined list of members, and backup members. Again, the idea is to do as much planning as possible beforehand. The team leader needs to have IT experience, and also be involved in the business side of things, so they understand the impact of technical decisions.
Once you’ve got a team and a plan, test the plan out! Do regular exercises to make sure that the plan works, and improve on it.
Incident Response Process
This is the set of actions that security personnel perform in response to a wide range of incidents. There are six different phases: preparation, identification, containment, eradication, recovery and lessons learned.
- Preparation is the first phase, and it happens before the incident. It covers all the planning discussed above.
- Identification is when a team member thinks an incident has occurred, and notifies the IR team for further investigation. The IR team processes the information and determines whether to invoke a response to the incident.
- Containment is the set of actions to constrain the incident to the smallest number of machines. This might be disconnecting servers, etc.
- Eradication is the removal of a problem, usually while still contained. You also want to prevent reinfection.
- Recovery is the process of returning the affected assets to their normal business functions, and restoring normal business operations.
- After everything is taken care of, you want to do a lessons-learned session to determine what went well and what didn’t. This helps improve the processes for next time.
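The six phases above can be sketched as an ordered enumeration. This is just a memory aid, not a tool you would actually run during an incident. Looping from lessons learned back to preparation is my own reading of "improve the processes for next time", and the `next_phase` helper is a hypothetical name.

```python
from enum import IntEnum


class Phase(IntEnum):
    PREPARATION = 1
    IDENTIFICATION = 2
    CONTAINMENT = 3
    ERADICATION = 4
    RECOVERY = 5
    LESSONS_LEARNED = 6


def next_phase(current):
    """Return the phase that follows `current`.

    Assumption: lessons learned feed back into preparation, so the
    process is treated as a cycle.
    """
    if current is Phase.LESSONS_LEARNED:
        return Phase.PREPARATION
    return Phase(current + 1)


if __name__ == "__main__":
    phase = Phase.PREPARATION
    for _ in range(len(Phase)):
        print(phase.name.replace("_", " ").title())
        phase = next_phase(phase)
```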
Disaster Recovery
Disasters are worse than the incidents discussed earlier in this blog post. These often disrupt operations for some length of time. Disaster recovery is the process of recovering from events that disrupt normal operations.
Recovery Sites
Depending on the extent of the damage, you might want to get a recovery site. This is related to the location of your backups. Even though your backups are safe, you still need to do something with that data to continue business operations. That’s where a recovery site comes in (until normal operations are restored). Sites can be hot, warm or cold.
- Hot sites are fully configured environments that are ready almost immediately. They have backups that are ready or nearly ready to use.
- Warm sites are partially configured. They might take a few days to get up and running, and likely have older backups.
- Cold sites have the basics, but not much more. You likely won’t have any backups, or most of the equipment you need. It might take weeks to get back up and running again.
When you do restore operations, what do you restore first? Figure out what your dependencies are. Then figure out what is most critical to the organization.
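Here is a small sketch of that ordering using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+). The systems, their dependencies, and the criticality scores are all made up for illustration. Dependencies always come first; among the systems that are ready to restore at the same time, the more critical one goes first.

```python
from graphlib import TopologicalSorter

# Hypothetical systems: each one maps to the systems it depends on.
DEPENDS_ON = {
    "network": [],
    "database": ["network"],
    "auth": ["network"],
    "web_store": ["database", "auth"],
    "internal_wiki": ["auth"],
}

# Hypothetical criticality scores; higher means more critical.
CRITICALITY = {
    "network": 5,
    "database": 4,
    "auth": 4,
    "web_store": 5,
    "internal_wiki": 1,
}


def restore_order(depends_on, criticality):
    """Order systems so dependencies come first, most critical first within a batch."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    order = []
    while sorter.is_active():
        # Everything whose dependencies are already restored.
        ready = sorted(sorter.get_ready(), key=lambda s: (-criticality[s], s))
        for system in ready:
            order.append(system)
            sorter.done(system)
    return order


if __name__ == "__main__":
    print(restore_order(DEPENDS_ON, CRITICALITY))
```

A side benefit: `prepare()` raises `graphlib.CycleError` if two systems depend on each other, which is exactly the kind of thing you want to discover during planning and not during a disaster.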
Backups
Availability of backups is very important. You need to consider things ahead of time, like how frequently to back things up, how extensive the backups need to be, who is responsible, where to store them, and so on.
The purpose of a backup is to provide valid, uncorrupted data in the event of corruption or loss to the original data.
Backups can be:
- Differential: save only files that have changed since the last full backup.
- Incremental: save only files that have changed since the most recent backup of any kind, whether that was the last full backup or the last incremental backup.
- Snapshots: a point-in-time copy of a VM
- Full: a complete copy of a machine’s data
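The difference between full, differential and incremental backups is easiest to see with a tiny example. In this sketch, timestamps are plain day numbers and the file names are made up. `files_to_back_up` is a hypothetical helper, not part of any real backup tool. Snapshots are left out because they copy a whole VM and don't select individual files.

```python
def files_to_back_up(mtimes, kind, last_full, last_backup):
    """Pick the files a given backup type would save.

    mtimes:      dict of file name -> last-modified time (a day number here)
    last_full:   time of the last full backup
    last_backup: time of the most recent backup of any kind
    """
    if kind == "full":
        return sorted(mtimes)
    if kind == "differential":
        cutoff = last_full      # everything changed since the last full backup
    elif kind == "incremental":
        cutoff = last_backup    # everything changed since the most recent backup
    else:
        raise ValueError(f"unknown backup kind: {kind}")
    return sorted(name for name, changed in mtimes.items() if changed > cutoff)


if __name__ == "__main__":
    # Full backup on day 10, an incremental on day 12, today is day 14.
    mtimes = {"report.doc": 9, "budget.xls": 11, "notes.txt": 13}
    for kind in ("full", "differential", "incremental"):
        print(kind, files_to_back_up(mtimes, kind, last_full=10, last_backup=12))
```

The trade-off shows up at restore time: with differentials you need the last full backup plus the latest differential, while with incrementals you need the last full backup plus every incremental taken since.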
Geographic Considerations
You should keep backups in separate locations. Keep the most recent copy locally, and store other copies far enough away to be safe from local disasters, yet close enough to restore quickly.
Off-site backups are backups stored at a different physical location from the original data. Again, distance can create a logistics problem if you go too far with it. When considering a location, first consider the physical safety of the backups. Then consider how easily you can move the backups in and out of storage.
You might also be subject to legal restrictions that determine where and how you can store backups. In addition to this, some countries have the concept of data sovereignty, which says that data stored within a country’s borders is subject to that country’s laws.
Continuity of Operation Planning
The overall goal of continuity of operation planning is to determine which subset of normal operations needs to continue during periods of disruption. This is a comprehensive plan that will be enacted during a disruption, and it requires you to first identify critical assets and personnel, critical systems, and interdependencies.
When you move from a normal operation capability to a continuity-of-operations subset of business operations, this is called failover. Determine ahead of time what kind of switchover timeline you need, and plan around that. The difference in business practices during this time is known as alternate business practices.
As with other plans, it’s good to validate them through tabletop exercises and other tests. Likewise, after you’ve executed a plan, write up what went well and what didn’t in an after-action report. Additionally, you might need alternate processing sites (also discussed earlier).