Chapter 10. I’ve Got Incidents Now! How Do I Respond?

“We kill people based on metadata.”

General Michael Hayden, former Director of NSA

Up to this point, we’ve explained how to understand threats, how to build and operate a security monitoring system, and how to creatively discover security incidents through log analysis and playbook development. With your well-oiled detection machine, you will discover incidents and new threats, while the team fields incident notifications from employees and external entities alike. Your analysts are researching and creating plays, investigating incidents, and sorting out false positives from confirmed malicious behavior, based on techniques from your playbook. However, an incident response playbook is more than just detection. It must also include instructions on how to respond.

We have discussed a structured approach to prepare for, detect, and analyze malicious behavior. Yet despite the effort involved in the detection phase, it is only the beginning of the incident response lifecycle process. After detecting an incident, the next most important step is to contain the problem and minimize the damage potential to your organization. After all, a key factor in an overall security strategy is to build a monitoring system and playbook to thwart security incidents as soon as possible to reduce downtime and data loss. After an incident has been triaged and the bleeding stopped, it’s time to clean up the original problem. Remediation demands that you not only undo the work of the attacker (e.g., removing malware from a system, restoring a defaced website or files from backup), but that you also develop a plan to prevent similar incidents from happening in the future. Without a plan to prevent the same problems, you run the risk of repeat incidents and further complications to the organization, weakening your detection and prevention efforts.

For analysts to be successful in preventing computer security incidents from wreaking havoc with your network and data, it is imperative that you ensure consistent and thorough incident-handling procedures. A playbook for detection and analysis, coupled with an incident response handbook for response methods, provides consistent instructions and guidelines on how to act in the event of a security threat. Just as firefighters know not to turn their water hoses on a grease fire, your incident response team should know what to do, and what not to do, during a security incident.

In this chapter, we’ll cover the response side of the playbook, specifically:

Preparation

How to create and activate a threat response model.

Containment (mitigation)

How to stop attacks after they have been detected, as well as how to pick up the pieces.

Remediation

When to investigate the root cause, what to do once identified, and who is responsible for fixing it within the organization.

Long-term fixes

How to use lessons learned to prevent future similar occurrences.

Shore Up the Defenses

In Figure 10-1, the hexagons on the right side of the diagram show the primary response functions of the incident response team once an incident becomes known. We also see that incidents can surface from many sources, including internal tools (the playbook), employees, and external entities. External sources like MyNetwatchman, Spamhaus, SANS, and many others will notify an organization if it is hosting compromised (and actively attacking) hosts. Measuring the number of incidents detected internally versus those reported by external groups offers a view into a team’s response time and efficacy. A higher ratio of internally detected incidents means the team is doing a better and faster job at detecting attacks. If you depend on external entities to inform you of security breaches, you are already far behind in the response process and need to improve your detection capabilities.
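To make that ratio concrete, here is a minimal Python sketch of how you might compute the share of internally detected incidents; the incident records and field names are hypothetical, not from any particular tracking system:

from collections import Counter

# Hypothetical incident records; "source" marks who first detected or reported each one.
incidents = [
    {"id": "IR-1041", "source": "playbook"},   # internal detection
    {"id": "IR-1042", "source": "employee"},   # internal report
    {"id": "IR-1043", "source": "external"},   # third-party notification
    {"id": "IR-1044", "source": "playbook"},
]

counts = Counter(incident["source"] for incident in incidents)
internal = counts["playbook"] + counts["employee"]
total = sum(counts.values())
print(f"Internally detected or reported: {internal}/{total} ({internal / total:.0%})")

Tracked over time, a rising percentage suggests your detection plays are catching problems before outsiders have to tell you about them.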

Additionally, long-term tasks such as patching vulnerabilities, fixing flawed code, and developing or amending policies to prevent future attacks are all part of the process. During a major incident, the response team’s role is mitigation, coordination, consultation, and documentation:

Mitigation

Blocking the attack to stop any further problems from occurring as a result of the incident.

Coordination

Managing the entire incident response process, as well as ensuring each stakeholder understands the roles, responsibilities, and expectations.

Consultation

Providing technical analysis to relevant stakeholders (system owners, executives, PR, etc.) and suggesting long-term fixes leveraging any relevant IT experience.

Documentation

Incident reporting, tracking, and dissemination where appropriate.

Figure 10-1. Incident response lifecycle

Organizations with a small IT staff may find it easier to coordinate a response to a security incident. The closer an incident response team is to the various IT organizations, the faster information can be shared and a long-term fix implemented. In a large organization, however, particularly one with various business units and partners, the response time can be much slower. The InfoSec organization should have tight integration with all IT teams across the company, but unfortunately, this is not always the case. During the height of an incident, IT teams must trust and defer to the incident response team during the triage, short-term fix, and notification phases of the incident. However, the incident response team has to rely on the IT teams, the experts in their own systems and architecture, to solve the bigger issues that could otherwise lead to additional incidents.

Major incidents require cooperation between those who are adept at handling a crisis and those who are in a position to understand the ramifications of any short- and long-term fixes. In Chapter 1, we discussed the various relationships that a CSIRT must maintain, and highlighted how relations with IT teams are paramount. Those relationships are never more important than when IT infrastructure has been compromised.

Lockdown

A networked worm provides the perfect example of why containment is so important. If left unchecked, a worm, by design, will continue to spread to any and all vulnerable systems. If the “patient zero” system (and its subsequent victims) isn’t cordoned off, it will attack as many other systems as possible, leading to exponential growth and intractable problems. This quickly spirals out of control from a single malware infection to a potential network meltdown.

In similar fashion, malware designed to perform DoS attacks can easily clog your network pipes with volumes of attack data destined for external victims, not only creating problems for system resiliency, but also damaging your online reputation by making you complicit in attacks against your will. Additionally, your organization may itself become the target of a DoS or distributed denial-of-service (DDoS) attack.

Responding to these two types of attacks by blocking the source IP address or addresses has its place in your incident response toolkit, but it should be used with caution. Blocking individual hosts is a tedious and potentially time-consuming process that may have no end; we refer to this as the whack-a-mole approach. In a large-scale DDoS attack, you may face more attackers than you can reasonably block one by one, or even a few at a time. Blocking all communications from an address or subnet may also have the unintended consequence of blocking legitimate traffic from the same addresses. Further, even if the attacks come from a manageable, finite number of source hosts, those sources are often dynamic or pooled addresses that may no longer be malicious in the future. Depending on the type of attack, a better solution may be to request blocking or rerouting assistance at the ISP level. In either case, you are faced with a question: do you have a process to review and expire historical IP blocks, or will you continue to block until you receive a complaint about service unavailability?
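One way to keep the whack-a-mole approach from accumulating stale entries forever is to timestamp every block and sweep the list on a schedule. Here is a minimal sketch of such a review process in Python; the JSON file format and the 30-day expiry window are assumptions to illustrate the idea, not a standard:

import json
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=30)  # assumed expiry window; tune to your policy

def expire_blocks(path="ip_blocks.json"):
    """Drop block entries older than MAX_AGE and report what was removed."""
    # Expected format: [{"ip": "198.51.100.7", "added": "2015-03-01T12:00:00+00:00"}, ...]
    with open(path) as f:
        blocks = json.load(f)

    now = datetime.now(timezone.utc)
    keep, expired = [], []
    for entry in blocks:
        added = datetime.fromisoformat(entry["added"])
        (keep if now - added < MAX_AGE else expired).append(entry)

    with open(path, "w") as f:
        json.dump(keep, f, indent=2)

    for entry in expired:
        print(f"expired block for {entry['ip']} (added {entry['added']})")
    return expired

Expired entries can then be reviewed and re-added deliberately if the address is still hostile, rather than lingering indefinitely.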

As mentioned in Chapters 8 and 9, the playbook’s analysis section must include specific directives on how to interpret the result data, as well as how to properly respond to each event. Depending on the type of incident, it should include details on how to properly contain the problem. The methods for containment differ for incidents related to employee data theft or abuse, as opposed to malware outbreaks on networked systems. For insider threat incidents such as document smuggling, sabotage, and some abusive behaviors, the best remediation option may be to suspend or terminate account access and inform Human Resources or Employee Relations. Preventing disgruntled employees from logging in to their email, VPN, or other computer systems decreases their ability to cause damage or to steal additional confidential information.
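How an account is actually suspended depends entirely on your identity infrastructure; most organizations would disable the account centrally in their directory service. Purely as an illustration, a sketch for locking a local Unix account and ending its active sessions might look like the following (the username is hypothetical, and the commands assume a Linux host with standard shadow-utils):

import subprocess

def suspend_local_account(username: str) -> None:
    """Lock a local Unix account and terminate its running sessions."""
    # Lock the password so the user can no longer authenticate with it.
    subprocess.run(["usermod", "--lock", username], check=True)
    # Expire the account in the past to also block non-password logins (e.g., SSH keys).
    subprocess.run(["usermod", "--expiredate", "1970-01-02", username], check=True)
    # Kill any processes or sessions the user already has open.
    subprocess.run(["pkill", "-KILL", "-u", username], check=False)

# Example (requires root privileges):
# suspend_local_account("jdoe")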

The Fifth Estate

Responding to and containing security incidents is more than cleaning up after malware or disgruntled employees. This rings especially true if an incident deals with the loss of customer data and privacy. Not only does an organization have to deal with the court of public opinion in terms of loss of reputation, but there may be legal ramifications as well. Many countries and states have mandatory disclosure notification laws. To protect citizens, or at least inform them of their potential privacy loss at the hands of an organization, many laws demand that consumers and customers be notified in the event that their formerly private data has been exposed.

Containing an incident that has gone public can be difficult and should only be done in concert with public relations, any legal entities representing the organization, and executive leadership. In a public-facing crisis, the incident response team’s role is to provide irrefutable facts to the people representing the organization. Don’t let anyone other than public relations-vetted people speak with the media or external entities regarding the incident unless otherwise approved. Sharing too much detail, or incorrect details, can lead to worse image problems than the original incident.

After his job was terminated, a network administrator for the city of San Francisco held hostage the root passwords to the entire city’s networking infrastructure. Being the only person with the correct passwords meant that the IT infrastructure was completely frozen by this disgruntled employee, and at his mercy, until either he revealed the passwords or the network was rebuilt, an option far more expensive and complex than most could imagine. From a detection standpoint, a play for monitoring password changes on admin accounts, unusually timed logins, or authentication to critical systems might have tipped off the incident response team that something was amiss. However, in this case, the admin never changed the passwords; he was simply their sole keeper. Had the incident response team been notified in advance that the network administrator was on notice or soon to be let go, the team could have immediately suspended his accounts before too much damage was done. For incidents relating to members of your organization, a partnership with human resources allows the incident response team to (hopefully) proactively mitigate a threat, rather than reactively address a preventable incident.

Advanced and targeted attacks are, by their nature, much harder to contain, which also makes them the most important attacks to focus on. Containment options for advanced attacks range from removing network connectivity from a compromised system or remote host, to locking out users and resetting passwords for known affected accounts, to blocking protocols or even shutting down the entire organization’s Internet connection. Remember that it’s not possible to block 100% of the attacks 100% of the time. Advanced attackers may successfully penetrate your network undetected, at least initially, until you discover them and then update your plays with fresh indicators. The important thing to remember is that even though you will not discover every incident, you still need to be prepared to respond to the worst-case scenario.

These targeted attacks can also be difficult to control and contain if they have been exposed publicly. The public exposure of private details, or doxing, of an organization’s employees or leadership can be disastrous, and could be exploited or abused in a number of ways. If the news of the attack on your organization is trending on Twitter, or regularly covered by the mainstream or tech press, it will be hard to put the genie back in the bottle. If the attack is high profile, it is also possible that an organization may call in additional resources to assist in the investigation and containment. This can relieve some of the pressure on the incident response team, which can then focus on the root causes, the extent and type of damage, active containment, business continuity, and improved security architecture.

No Route for You

When it comes to networked systems, the containment problem may be a little easier to solve. There are numerous options for mitigating an incident at the network and system level. In most cases, blocking network connectivity is the best option to allow for further forensic activity, and it may often be the only option for a system not managed by IT. Moving a host, by its MAC address, to a quarantine VLAN or network segment can prevent further damage to the rest of the organization. A new 802.1x access control policy, a firewall policy, or a simple extended network ACL can also limit the connectivity of a misbehaving device. However, this approach can present challenges. Most organizations adhere to some type of change control policy, meaning that modifications to critical infrastructure like routers, switches, and firewalls are only permitted during designated and recognized change windows. These windows limit the available time during which a routine ACL modification can be made. Additionally, new access control entries or firewall deny statements can introduce instability if there are errors on entry (e.g., a wrong subnet or other typo), or if a compromised host’s traffic is so voluminous that packet-filtering devices like firewalls and routers, already under heavy CPU load, stagger under the additional load of applying the new ACL.

Null routing, or blackholing, offers a more palatable immediate mitigation option that can be introduced without concern for further network degradation. There are two types of blackholing: source and destination. Destination blackholing works well in situations like cutting off reply traffic to Internet C2 servers by dropping traffic destined for a particular IP address. With destination blackholing, a router is given a static route for the host that points to null0, a hardware void. Any connection destined for the blackholed host will be dropped when it reaches a routing device carrying the null route.

NOTE

Source blackholing, on the other hand, works well for incidents like worm outbreaks or DoS attacks, where you need to block all traffic from an IP address, either internal or external, regardless of the destination. Because routers are inherently poor at verifying connection sources, attackers can spoof TCP and UDP packets and create major problems. Unicast Reverse Path Forwarding (uRPF), a type of reverse path forwarding (RPF), addresses this source verification problem. When a routing device receives a packet, uRPF verifies that the device has a route back to the source of the connection via the receiving interface (strict mode uRPF) or, in a network with asymmetric routing, anywhere in the routing table (loose mode uRPF).

If the routing device doesn’t have a route to the source (e.g., spoofed addresses), or if the route back to the source points to null0 (a blackhole), the packet will be dropped. In this way, loose mode uRPF combined with a blackholed address provides a feasible way to drop traffic both to and from that address.
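To make this concrete, here is a minimal sketch, assuming Cisco IOS-style syntax, of what a local blackhole with loose uRPF boils down to; the address and interface name are placeholders, and your platform’s commands may differ:

def blackhole_config(address: str, edge_interface: str = "GigabitEthernet0/0") -> str:
    """Emit IOS-style configuration that drops traffic to, and (with loose uRPF
    on the edge interface) also from, the given address."""
    return "\n".join([
        # Destination blackhole: any packet routed toward the address is discarded.
        f"ip route {address} 255.255.255.255 Null0",
        # Loose-mode uRPF: packets whose source route resolves to Null0 are dropped too.
        f"interface {edge_interface}",
        " ip verify unicast source reachable-via any",
    ])

print(blackhole_config("198.51.100.7"))  # 198.51.100.7 is a documentation address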

The effect of a blackhole can be quickly propagated throughout your network using iBGP and a Remotely Triggered Blackhole (RTBH) router. By peering via iBGP with other routers in your network from a trigger router, you can announce the new null route from the RTBH router and, within a few seconds, blackhole an address across the organization (Figures 10-2 and 10-3). Because a packet will be routed through your network until it matches the null route, and because iBGP does not run on every routing device in a network, you should know where in your network topology the null route is actually applied.

Figure 10-2. Malicious communication without black hole route

Figure 10-3. With null routing, malicious traffic can be blocked quickly without the hassle or scaling limitations associated with ACL management
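Building on Figures 10-2 and 10-3, the following sketch shows the trigger-router side of a classic RTBH deployment. It assumes Cisco IOS-style syntax, a private AS number, and that every edge router already carries a static route sending 192.0.2.1 to Null0; adapt the details to your own routing design:

RTBH_NEXT_HOP = "192.0.2.1"  # assumed blackhole next hop, null-routed on every edge router

# One-time trigger-router setup: tagged static routes are redistributed into
# iBGP with their next hop rewritten to the null-routed address.
TRIGGER_BASE_CONFIG = f"""\
route-map RTBH permit 10
 match tag 66
 set ip next-hop {RTBH_NEXT_HOP}
 set community no-export
router bgp 64512
 redistribute static route-map RTBH
"""

def rtbh_trigger_route(address: str) -> str:
    """Return the per-incident command that blackholes one address network-wide."""
    return f"ip route {address} 255.255.255.255 Null0 tag 66"

print(TRIGGER_BASE_CONFIG)
print(rtbh_trigger_route("203.0.113.45"))  # documentation address standing in for the attacker

Once the tagged static route is entered on the trigger router, every iBGP peer learns a route for that address whose next hop resolves to Null0, so traffic to it (or, with loose uRPF, from it) is dropped across the network.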

Not Your Bailiwick

Another option for containing an infected host relies on DNS, one of the foundational Internet protocols. As discussed in Chapter 7, collecting DNS traffic can be a boon to your security monitoring operations, but DNS can also be useful from a mitigation standpoint. Response Policy Zones (RPZ) can help you create the equivalent of a DNS firewall. This DNS firewall gives you four different types of policy triggers for controlling what domains can be resolved by your clients. If blocking external C2 systems by IP address isn’t flexible enough, you can define a policy based on the name being queried by clients (QNAME policy trigger) to control how your recursive name servers handle the query. If policies based on the QNAME aren’t enough, DNS RPZ also gives you more powerful policy controls, like blocking all domains that resolve to a certain IP address (IP address policy trigger). If you need to target a malicious name server itself, RPZ offers policy triggers either by the name server’s name (NSDNAME policy trigger) or by the name server’s IP address (NSIP policy trigger). RPZ policies go into a special zone that allows you to be authoritative for any domain in an easy-to-manage way. RPZ is especially powerful because of its response flexibility. Standard blocking by returning NXDOMAIN is possible, but RPZ also allows forged responses that can redirect your clients to internal names if you need to capture more information about their activities toward external domains.
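As a rough illustration of those four trigger types, the sketch below generates a tiny RPZ zone file following the record conventions documented for BIND 9; the domain names and addresses are placeholders, and the SOA/NS boilerplate is deliberately simplified:

RPZ_RECORDS = [
    # QNAME trigger: answer NXDOMAIN for a known-bad domain.
    ("badsite.example",                   "CNAME", "."),
    # QNAME trigger with a forged answer: redirect clients to an internal sinkhole.
    ("c2.badsite.example",                "CNAME", "sinkhole.internal.example."),
    # IP address trigger: block any name resolving to 198.51.100.7
    # (encoded as <prefix length>.<reversed octets>.rpz-ip).
    ("32.7.100.51.198.rpz-ip",            "CNAME", "."),
    # NSDNAME trigger: block domains served by a malicious name server.
    ("ns1.badhoster.example.rpz-nsdname", "CNAME", "."),
    # NSIP trigger: block domains whose name server sits at 203.0.113.53.
    ("32.53.113.0.203.rpz-nsip",          "CNAME", "."),
]

def build_rpz_zone(origin: str = "rpz.local.") -> str:
    header = [
        f"$ORIGIN {origin}",
        "$TTL 60",
        "@ IN SOA ns.local. hostmaster.local. 1 3600 600 86400 60",
        "@ IN NS ns.local.",
    ]
    body = [f"{name} {rtype} {target}" for name, rtype, target in RPZ_RECORDS]
    return "\n".join(header + body) + "\n"

with open("rpz.local.zone", "w") as f:
    f.write(build_rpz_zone())

The resulting zone would then be loaded on the recursive server and referenced from a response-policy statement in named.conf so that BIND consults it when answering client queries.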

Lastly, RPZ is only supported in the BIND DNS server, and does not work with Microsoft DNS Services. Microsoft DNS servers can point to authoritative BIND DNS servers to use RPZ capabilities; however, there’s no current method to add hosts to a response zone on the Microsoft DNS server itself.

One Potato, Two Potato, Three Potato, Yours

Knowing when and why to hand off an incident relies on a common understanding in your organization of roles and responsibilities. A football defender may occasionally advance to the goal area for a scoring opportunity, but the team ultimately needs that player back on defense before the opposing team’s next attack. A player’s skillset should match the position they play, and stretching their talents into other roles affects their quality and capacity to work efficiently in their primary role. Even though a security investigator may have many of the same skills as a security architect, they cannot assume responsibility for the long-term remediation plan for every incident.

Yet one of the outputs of successful incident response is tangible: written evidence of architecture gaps and failures. A CSIRT should be able to understand the underlying problem and advise any security architects and engineers on appropriate and acceptable solutions based on incident details. Like the players on a football team, these security individuals must communicate about the current threat landscape, trends, and incidents. Only when all individual contributors participate and perform the roles to which they are assigned can an incident successfully move through the incident lifecycle and ultimately be remediated in both the short and the long term.

Get to the Point

Handing off an incident or even closing one demands that an investigator decide whether or not to perform an exhaustive root cause analysis (RCA) to determine precisely why an incident occurred. In many cases, an RCA is simply mandatory. If a critical system is compromised, it’s imperative to discover precisely what conditions allowed the attack to succeed so they can be avoided in the future. However, there are other situations where dedicating significant time to RCA is deleterious to other work. For example, if your organization has been scanned and exploit attempts have been launched by external organizations toward your web presence, it’s not necessary to follow up on every “attack” if there is no additional evidence indicating that it was successful. The incident response team could spend a lot of time attempting to perform threat actor attribution to these types of attacks, but everyone knows that being on the Internet means you will be portscanned and attacked. Proper attribution isn’t even reliable, given that attackers can come in from other hacked systems, proxies, or VPN addresses, masking their true origin.

For most organizations, knowing exactly who scanned your website or who sent a commonly used trojan via phishing doesn’t necessarily yield any actionable results; such events may not even be considered incidents. Don’t be flattered or concerned if you get scanned—it may not be targeted. We have to accept that malware is ubiquitous and ranges from annoying adware to sophisticated remote control and spying software. Researching every exploit attempt, probe, or malware sample doesn’t justify the time investment required. Of course, malware that isn’t initially detected or deleted by host-based controls, is clearly targeted, or has other peculiar and unique characteristics will be worth investigating to discover additional indicators.

Portable computer systems (e.g., laptops, smartphones, and tablets) will join all types of uncontrolled external networks, from home LANs to coffee shops, bars, airplanes, and any number of other public or private places. Without proper host controls like “always-on” HIPS, it can be difficult even to identify the source of a compromise that occurred outside the borders of your security monitoring. Not knowing the source of an attack or malware hampers your ability to block it within the organization in the future, and makes determining the root cause extremely difficult, if not impossible. Additionally, there may not be log data available to accurately track each step leading up to a security incident. Most security event sources only alert on anomalous activity, while not reporting normal behaviors. If a series of normal behaviors led up to an eventual incident, the conditions that created the opening for the compromise will have gone unnoticed and unlogged.

Many RCAs come to dead ends simply due to insufficient details when revisiting the full attack and compromise cycle. This is why attempting to determine the source of every malicious binary dropped on a host often requires more effort than it’s worth.

Lessons Learned

In the end, after the dust has settled on the incident and the bleeding has stopped, it will be time to develop a long-term strategy for avoiding similar issues. It’s also the best time to refine your monitoring techniques and responses. Mistakes or inadequacies uncovered in the course of the incident response process will yield opportunities to improve in the future.

Consider this plausible, if not extreme, example. It’s the middle of the day, and you’re on call again when the incident response team receives an internal notification from an application developer regarding a newly discovered and unknown web application in a production web app server’s webroot directory. Browsing to the web application, you’re presented with a simple text box and a run button. Entering random text and hitting run surprisingly displays the text -bash: asdf: command not found. Bash? On the external web server? Quickly, you try a process listing, and the app generates a full ps output. Among the system daemons in the process listing are Apache and Tomcat processes. Worse, the output shows the web server processes running as root. Fearing the worst, you run whoami through the web application, only to have those fears confirmed when the output displays root.

The scope of the issue hits you: unauthenticated, Internet-facing root access to one of your production web servers. You mitigate the issue by taking the service offline, contain it after quantifying the scope of affected hosts, and perform an RCA to determine how the hosts were compromised and who executed what commands via the shell. During the post-mortem, you ask yourselves: what did we learn during this incident that, had we known it beforehand, would have let us detect, mitigate, or contain the incident sooner?

From a detection standpoint, you might identify the availability and usefulness of logs generated by the application, the web app server, the web server, or the operating system itself. It would be reasonable for the web app server to generate an event when a new web application is deployed, or for the web server to provide attribution information identifying when the shell was accessed and by whom.

During the incident investigation, you learned about many longer-term infrastructure improvements that will likely require dedicated resources, tracking, and commitment to fix appropriately. If, as in the previous example, your entire infrastructure runs Tomcat as root, then you need to harden the systems, change the build process, and QA your applications against the now lower-privileged application server. In a rush-to-market environment, these incident findings will require much more time and many more people to implement than is feasible to manage within the incident itself. Security architecture, system administration, and application development engineering all play a part in fully addressing the identified weaknesses, none of which reside solely in the domain of security monitoring or incident response. As an incident responder, how do you identify the responsible parties in your organization, get commitment from those parties that they will address weaknesses in their domain, and then ensure follow-through on the commitments?

By further analyzing this example against the incident lifecycle, you may learn of other weaknesses that need attention. Were you able to attribute the application or system to an actual owner via an asset management system? Could you quickly determine the scope of the incident via comprehensive vulnerability scanning? Does your organization have a scalable method of containing the incident when the problem exists across hundreds of hosts in the production environment? While your incident response team may have unearthed the issues and have a stake in their resolution, managing the issues to completion is not a core function of monitoring or response and therefore should be handled by the appropriate team within your organization.

Incidents that have already been resolved provide a wealth of detail within their documentation. Keeping accurate and useful records when investigating an incident will pay off in the future with newer team members, as well as provide historical, verified information that can help effect positive change for security architecture enhancements. Beyond this, any new procedures you used, contacts you required, or old processes you found to be inefficient or unusable should make their way into the incident response handbook. Keeping the incident response handbook alive with updated information will speed up responses to future incidents that demand similar tactics. Depending on the type of incident and how it was detected, there may also be a place for updating the analysis section of your playbook to respond more effectively to future incidents.

Chapter Summary

§ After detection, the incident response team’s role is to mitigate, coordinate, consult, and document.

§ Having a reliable playbook and effective operations will lead to incident discovery.

§ You need to ensure your response processes are well tested and agile in the event of a major incident.

§ Mitigation systems are crucial for containing incidents and preventing further damage. DNS RPZ and BGP blackholing are excellent tools for cutting off basic network connectivity.

§ An incident can be over in a few minutes or can take weeks to resolve. Long-term fixes may be part of the solution, but the incident response team should be prepared to consult and assist, rather than drive systemic and architectural changes.