Enter the Playbook - Crafting the Infosec Playbook (2015)

Chapter 5. Enter the Playbook

“Computers are useless. They can only give you answers.”

Pablo Picasso

Most large entities are faced with a crazy level of network and organizational complexity. Overlapping IP address space, acquisitions, extranet partners, and other interconnections, combined with organizational and political issues, breed complex IT requirements. Network security is inherently complicated, with a large number of disparate data sources and types of security logs and events. At the same time, you’re collecting security event data like IDS alarms, antivirus logs, NetFlow records and alarms, client HTTP requests, server syslog, authentication logs, and many other valuable data sources. Beyond just those, you also have threat intelligence sources from the broader security community, as well as in-house-developed security knowledge and other indicators of hacking and compromise. With such a broad landscape of security data sources and knowledge, the natural tendency is toward complex monitoring systems.

Because complexity is the enemy of reliability and maintainability, something must be done to combat the inexorable drift. The playbook is an answer to this complexity. At its heart is a collection of “plays,” which are effectively custom reports generated from a set of data sources. What makes plays so useful is that they are not only complex queries or code to find “bad stuff,” but also self-contained, fully documented, prescriptive procedures for finding and responding to undesired activity.

By building the documentation and instructions into the play, we have directly coupled the motivation for the play, how to analyze it, the specific machine query for it, and any additional information needed to both run the play and act upon the report results. Keep in mind, however, that the playbook isn’t just a collection of reports, but a series of repeatable and predictable methods intended to elicit a specific response to an event or incident.

For our framework design, every play contains a basic set of mandatory high-level sections:

§ Report identification

§ Objective statement

§ Result analysis

§ Data query/code

§ Analyst comments/notes

The following sections detail our requirements and definitions for analysts to create additional playbook reports. It’s certainly possible to have additional sections depending on your end goal; however, for our purposes of incident response we’ve determined that the sections just outlined are the most precise and effective, without collecting superfluous information.
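As a minimal sketch of how these mandatory sections fit together, a play can be modeled as a simple record. The field names and sample values here are our own illustration, not text prescribed by the playbook itself:

```python
from dataclasses import dataclass, field

@dataclass
class Play:
    """One playbook entry; field names are illustrative, not prescribed."""
    report_id: str   # e.g., "500002-INV-HIPS-MALWARE: ..."
    objective: str   # plain-language "what" and "why" of the play
    analysis: str    # how to interpret and act on report results
    query: str       # the data query/code that produces the report
    notes: list[str] = field(default_factory=list)  # analyst comments over time

# Hypothetical example play built from the fields above
play = Play(
    report_id="500002-INV-HIPS-MALWARE: Detect Bitcoin mining",
    objective="Find hosts running unauthorized Bitcoin miners.",
    analysis="Review outbound 3333/TCP connections; escalate malicious hits.",
    query="src_process connects dst_port=3333 (query language varies)",
)
play.notes.append("Excluded uTorrent after repeated false positives.")
```

Keeping the comments as an append-only list mirrors how the notes section accumulates addendums over a play’s lifetime.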

Report Identification

Our reports are identified in short form by a unique ID, and in long form by a set of indicators that give context about what the report should accomplish. The long form is formulaic and amounts to the following:

{$UNIQUE_ID}-{HF,INV}-{$EVENTSOURCE}-{$REPORT_CATEGORY}: $DESCRIPTION

{$UNIQUE_ID}

Our report identification (ID) numbers use a Dewey Decimal-like numbering system where the leading digit indicates the data source (Table 5-1).

Unique ID range    Event source             Abbreviation
0–99,999           Reserved                 N/A
100,000–199,999    IPS                      IPS
200,000–299,999    NetFlow                  FLOW
300,000–399,999    Web proxy                HTTP
400,000–499,999    Antivirus                AV
500,000–599,999    Host IPS                 HIPS
600,000–699,999    DNS sinkhole and RPZ     RPZ
700,000–799,999    Syslog                   SYSLOG
800,000–899,999    Multiple event sources   MULTI

Table 5-1. Playbook report identification numbers

We include an event source tag and numeric range in each play for easier grouping, sorting, and human readability. The numeric ID also makes it simple to run metrics queries against a particular event source. We’ve padded the digits after the leading digit with 0s to leave room for expansion and subcategories for future data sources and feeds; the remaining portion of the report ID is a unique, mostly incrementing report number. Assigning each report a number and a class keeps the results organized: well-organized reports are easier to scan visually, simple to sort, and better to analyze, and the numbers support additional operations later in the incident response process, like reporting and metrics. If we were to add an additional host-based product to our detection arsenal, we could easily fold it under the 500,000 range, perhaps as a 501,000 series. It’s not likely we would have 1,000 reports for one event source, so the padding is adequate.
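The leading-digit convention from Table 5-1 is easy to check mechanically. A sketch of such a lookup follows; the function name and error handling are our own choices:

```python
def event_source_for(report_id: int) -> str:
    """Map a numeric report ID to its event source abbreviation per Table 5-1."""
    sources = {
        0: "RESERVED", 1: "IPS", 2: "FLOW", 3: "HTTP", 4: "AV",
        5: "HIPS", 6: "RPZ", 7: "SYSLOG", 8: "MULTI",
    }
    block = report_id // 100_000  # each source owns a 100,000-wide block
    if block not in sources:
        raise ValueError(f"ID {report_id} falls outside the assigned ranges")
    return sources[block]
```

For example, `event_source_for(500002)` returns `"HIPS"`, matching the Bitcoin-mining report discussed later in this chapter.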

{HF,INV}

The next portion of the report identification is the report type, which is currently either “investigative” or “high fidelity.”

High fidelity means that all events from a report:

§ Can be automatically processed

§ Can’t be triggered by normal or benign activity

§ Indicate an infection that requires remediation, not necessarily a policy violation

Investigative means that any event from a report might:

§ Detail a host infection

§ Detail a policy violation

§ Trigger on normal activity (which may require tuning)

§ Require additional queries across other event sources to confirm the activity

§ Lead to development of a more high-fidelity report

Our level of analysis depends on a very simple rule: a report is either high fidelity or it isn’t.

High fidelity means that all the events from a report or query are unequivocal indicators—that is to say a “smoking gun” for a security incident. It’s proof beyond a reasonable doubt versus a preponderance of evidence (including circumstantial evidence). In our system, high-fidelity incidents automatically move on to the remediation step of the incident-handling process. Hardcoded strings, known hostnames or IPs, or regular expressions that match a particular exploit are good examples of things that can be included in a high-fidelity report. However, the reports that make up the vast majority of the playbook are not high fidelity. Only about 15% of our reports are high fidelity, yet those reports make up the bulk (90%) of the typical malware infections we detect.

Reports that cannot indicate with 100% certainty that an event is malicious are deemed “investigative.” More investigation is required against the events to determine if there’s truly a security incident or a potentially tunable false positive. The investigation may result in a true positive, a false positive, an inconclusive dead end, or it may lead to the creation of additional investigative reports to further refine the investigation. Investigative reports that mature through tuning and analysis could eventually become high fidelity if we can confidently remove all nonevents.
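The routing rule that follows from this distinction can be sketched as a trivial dispatch. The handler names below are placeholders of our own; real escalation paths will differ:

```python
def route(event: dict, high_fidelity: bool) -> str:
    """Apply the simple rule: a report is either high fidelity or it isn't.

    Handler names ("remediate"/"investigate") are illustrative placeholders.
    """
    if high_fidelity:
        # Unequivocal indicator: skip analysis, queue the host for remediation.
        return f"remediate:{event['host']}"
    # Investigative: an analyst must confirm via additional queries first.
    return f"investigate:{event['host']}"
```

The point of the sketch is that the high-fidelity branch carries no analysis step at all; that is what makes the 100%-certainty bar worth meeting.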

{$EVENTSOURCE}

The event source identifies which source, or sources, the report queries. The leading digit of the report ID always corresponds to the event source, as shown in Table 5-1.

{$REPORT_CATEGORY}

We’ve developed report categories that apply to the types of reporting we’ve achieved (Table 5-2). Keep in mind that you might instead choose categories that align with, or exactly match, those prescribed by other organizations. The Verizon Vocabulary for Event Recording and Incident Sharing (VERIS), the United States Computer Emergency Readiness Team (US-CERT), and others publish standard incident categories you can use to compare metrics.

Category        Description
TREND           Indicators of malicious or suspicious activity over time and outliers to normal alerting patterns and flows
TARGET          Directed toward logically separate groups of networks and/or employees (e.g., extranet partners, VIP, business units)
MALWARE         Malicious activity or indicators of malicious activity on a system or network
SUSPECT_EVENT   Indicators of malicious or suspicious activity that require additional investigation and analysis
HOT_THREAT      Temporary report run with higher regularity and priority to detect new, widespread, or potentially damaging activity
POLICY          Detection of policy violations that require CSIRT response (IP, PII, etc.)
APT             Advanced attacks requiring special incident response
SPECIAL_EVENT   Temporary report run with higher regularity and priority for CSIRT special event monitoring (e.g., conferences, symposia)

Table 5-2. Playbook report categories

{$DESCRIPTION}

The free-text description component to the report title provides a brief summary of what the report attempts to detect. For example:

500002-INV-HIPS-MALWARE: Detect surreptitious / malicious use of machines for Bitcoin mining

This report name tells analysts that the report’s unique ID is 500002. The leading 5 in the ID indicates the report searches HIPS data. It’s an investigative report, so analyst resources will be required to confirm that the implicated host has unauthorized Bitcoin mining software installed.
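Because the long form is formulaic, it can be split back into its components with a regular expression. This is a sketch under our own assumptions about field lengths (IDs up to six digits, uppercase source and category tags):

```python
import re

# Long-form pattern: {ID}-{HF|INV}-{SOURCE}-{CATEGORY}: description
NAME_RE = re.compile(
    r"^(?P<id>\d{1,6})-(?P<type>HF|INV)-(?P<source>[A-Z]+)"
    r"-(?P<category>[A-Z_]+):\s*(?P<description>.+)$"
)

def parse_report_name(name: str) -> dict:
    """Split a long-form report name into its labeled parts."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"not a valid report name: {name!r}")
    return m.groupdict()

parts = parse_report_name(
    "500002-INV-HIPS-MALWARE: Detect surreptitious / malicious use "
    "of machines for Bitcoin mining"
)
```

Here `parts["id"]` is `"500002"`, `parts["type"]` is `"INV"`, and `parts["source"]` is `"HIPS"`, so tooling can group and sort plays without hand-parsing titles.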

Objective Statement

The objective statement describes the “what” and “why” of a play. Experience has taught us that as queries are updated, they can drift from the original intention into an unintelligible mess without a good objective statement. The target audience for the objective statement is not a security engineer. The objective statements are intended to provide background information and good reasoning for why the play exists. Ultimately, the goal of the objective statement is to describe to a layperson what a play is looking for on the network and leave them with a basic understanding of why the play is worthwhile to run. It should be obvious to the analyst why this report is necessary, and it should meet at least one of the following criteria:

§ Tells us about infected systems (bots, Trojans, worms, etc.)

§ Tells us about suspicious network activity (scanning, odd network traffic)

§ Finds unexpected/unauthorized authentication attempts to machines

§ Provides summary information, including trends, statistics, counts

§ Gives us custom views into certain environments (targeted reports, critical assets, hot-threat, special event, etc.)

The important thing to keep in mind for an objective statement boils down to one question: Is this the best way to find the information, and if so, how can it best be presented?

The following example objective shares a high-level overview of the issue that requires the report to be scheduled and analyzed:

SAMPLE OBJECTIVE

Today, malware is a business. Infecting machines is usually just a means to financial ends. Some malware sends spam, some steals credit card information, some just displays advertisements. Ultimately, the malware authors need a way of making money by compromising systems.

With the advent of Bitcoin, there is now an easy way for malware authors to directly and anonymously make use of the computing power of infected machines for profit.

Our HIPS logs contain suspicious network connections, which allow for the detection of Bitcoin P2P activity on hosts. This report looks for processes that appear to be participating in the Bitcoin network that don’t obviously announce that they are Bitcoin miners.

Result Analysis

The result analysis section is written for a junior-level security engineer and provides the bulk of the documentation and training material needed to understand how the data query works, why it’s written the way it is, and most importantly, how to interpret and act upon the results of the query. This section discusses the fidelity of the query, what expected true positive results look like, the likely sources of false positives, and how to prioritize the analysis and skip over the false positives. The analysis section can vary a lot from play to play because it’s very specific to the data source, how the query works, and what the report is looking for.

The main goal of the analysis section is to help the security engineer running the play and looking at report results to act on the data. To facilitate smooth handling of escalations when actionable results are found, the analysis section must be as prescriptive and insightful as possible. It must describe what to do, all of the related/interested parties involved in an escalation, and any other special-handling procedure.

For high-fidelity plays, every result is guaranteed to be a true positive, so the analysis section focuses more on what to do with the results than on the analysis of them. As we mentioned, the vast majority of reports are investigative, and therefore require significant effort to ensure they are analyzed properly. The following sidebar shows what a thorough analysis section might look like.

SAMPLE ANALYSIS

This report is fairly accurate. Bitcoin operators use port 3333, which few other services use. The report simply looks for running processes talking outbound on port 3333/TCP. A few IPs known to host services on 3333 have been excluded from the query, as have the names of some processes like “uTorrent” that are somewhat likely to generate false positives.

Most of the results produced by this query are obviously malicious. For example:

2013-08-09 11:30:01 -0700|mypc-WS|10.10.10.50|
The process 'C:\AMD\lsass\WmiPrvCv.exe' (as user DOMAIN\mypc) attempted
to initiate a connection as a client on TCP port 3333 to 144.76.52.43
using interface Wifi. The operation was allowed by default (rule defaults).

And:

2013-08-07 22:10:01 -0700|yourpc-WS|10.10.10.59|
The process 'C:\Users\yourpc\AppData\Local\Temp\iswizard\dwm.exe' (as user
DOMAIN\yourpc) attempted to initiate a connection as a client on TCP port
3333 to 50.31.189.46 using interface Wifi. The operation was allowed by
default (rule defaults).

There are also programs that use Bitcoin as a way to pay for the service:

2013-08-08 01:10:01 -0700|theirpc-WS|10.10.10.53|
The process 'C:\Program Files (x86)\Smart Compute\Researcher\scbc.exe'
(as user DOMAIN\theirpc) attempted to initiate a connection as a client
on TCP port 3333 to 54.225.74.16 using interface Wifi. The operation was
allowed by default (rule defaults).

For analysis:

§ If you want to confirm the IP being communicated with is actually involved in Bitcoin transactions, simply Google the IP along with the word “bitcoin.” There are many services that list all bitcoin nodes and bitcoin transactions.

§ For internal <-> internal traffic on port 3333/tcp, the alert is almost always a false positive triggered by someone internally picking port 3333 to run a service. Real Bitcoin activity should always involve internal <-> external traffic on port 3333/tcp.

§ For processes that look malicious, send the host for remediation (re-imaging).

§ For processes that are semi-legitimate, like “Smart Compute\Researcher\scbc.exe”, contact the user and inform them they must uninstall the software. See the internal Acceptable Use Policy for more information.

§ See http://www.smartcompute.com/about-us/ for details on the software.

§ For the few cases where the 3333 traffic isn’t Bitcoin related, or where it isn’t easy to tell if the mining is malicious, simply ignore the results.
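The exclusion logic described in the sample analysis could be sketched as a post-processing filter over parsed HIPS records. The field names and exclusion values below are illustrative placeholders, not the real tuned lists:

```python
# Exclusion lists are illustrative placeholders, not real tuned values.
EXCLUDED_IPS = {"203.0.113.10"}        # known services legitimately on 3333
EXCLUDED_PROCESSES = {"utorrent.exe"}  # frequent false-positive process names

def suspicious_bitcoin_hits(records):
    """Keep outbound 3333/TCP events that survive all exclusions."""
    hits = []
    for r in records:
        if r["dst_port"] != 3333:
            continue
        if not r["dst_ip_external"]:  # internal<->internal 3333 is noise
            continue
        if r["dst_ip"] in EXCLUDED_IPS:
            continue
        if r["process"].lower() in EXCLUDED_PROCESSES:
            continue
        hits.append(r)
    return hits
```

Everything the filter drops corresponds to a bullet above: internal-only traffic, known service IPs, and known benign process names.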

Data Query/Code

The query portion of the play is not designed to be stand-alone or portable. The query is what implements the objective and produces the report results, but the specifics of how it does that just don’t matter. All of the details of the query needed to understand the results are documented in the analysis section. Any remaining under-the-hood details are inconsequential to the play and the analyst processing the report results. Queries can sometimes be rather complex, in part because they are specific to whatever system the data lives in.

We’ll cover query development in depth in Chapters 8 and 9. The primary reason we include the query in the report, aside from the obvious need to use it, is that we want to ensure our play tracking and development system is in sync with our log management and query system, and to help educate each other with creative methods for developing queries. Analysts can often reuse logic and techniques from queries of already approved plays.

Analyst Comments/Notes

We manage our playbook using Bugzilla. Using a bug/ticket tracking system like Bugzilla allows us to track changes and document the motivation for those changes. Any additional useful details of a play that don’t belong in the aforementioned sections end up in the comments section. For a given objective, there are often a number of ways to tackle the idea in the form of a data query. The comments allow for discussion among the security engineers about various query options and the best way to approach the play objective. The comments also provide a place for clarifications and remarks about issues with the query or various gotchas.

Most plays need occasional maintenance and tuning to better handle edge cases and tune out noise or false positives. The comments allow the analysts processing reports to discuss tweaks and describe what is and isn’t working about a report. By keeping all of the notes about a play as addendums, it’s possible to read the evolution of the play. This enables us to keep the playbook relevant long term. It also provides for additional management options like retiring reports and reopening reports.

The Framework Is Complete—Now What?

We have talked with plenty of security teams from different industries all around the world. Many of them have figured out mature approaches that work to secure their networks. Many more just want us to give them the playbook as though it’s a drop-in solution. The framework as we’ve just defined it is the playbook. We’ve put together a straightforward framework based on our experiences with incident detection, our current tools and capabilities, our team structure and expertise, and our management directive.

The framework stands well on its own, but at some point, plans must be put to action. After you’ve fine-tuned your plan to clean up data so it can be searched, and developed a democratic way of detecting current and future threats, it’s time to put your methods into practice. Security operations depend heavily on solid process, but good security operations also depend on effective and sustained threat detection. The playbook requires regular maintenance once you add in the operational moving parts like analysis, play review, threat research, and the like.

Chapter Summary

§ Developing a playbook framework makes future analysis modular and extensible.

§ A thorough playbook will contain the following at minimum:

§ Report identification

§ Objective statement

§ Result analysis

§ Data query/code

§ Analyst comments/notes

§ An organized playbook offers significant long-term advantages for staying relevant and effective—focus on organization and clarity within your own framework.