Complex Event Processing for Automated Security Event Analysis - OPERATIONS SECURITY - Information Security Management Handbook, Sixth Edition (2012)

Operations Controls

Chapter 25. Complex Event Processing for Automated Security Event Analysis

Rob Shein

Over the past 15 years, the same mantra has wound its way through the security industry, particularly the entities and offerings concerned with incident detection and response: “Collect More Data.” The problem is that although the means to collect and store that data has grown in both variety and scale (as have the sources of the data that can be collected), the means to analyze that data has not kept pace. As a result, what began as one or two IDS devices and/or firewalls providing data has turned into, in some cases, tens of thousands of endpoints sending their data to collection points. Everything from a high-end IPS down to the system logs of a low-end desktop at a receptionist’s desk can provide useful data, but the challenge is finding that data amidst the tremendously high noise background of just plain normal activity. Additionally, the amount of data being collected is so large in some environments that the current state of the art for event rule processing is no longer sufficient to detect the attacks that involve multiple components of activity in an enterprise.

Current commonly deployed rule detection engines use a very basic form of correlation, based primarily on streams of information. Firewall logs, IPS/IDS alerts, and SNMP messages from antivirus servers are all sent to a single point, which then watches for two pieces of data within a certain temporal proximity to each other according to a rule. If two such pieces of data come through the stream within a certain interval of time, the rule fires and an action takes place as a result. That action (for example, checking whether a buffer overflow directed at a device was actually successful) is largely manual in nature, done by humans at the speed of phone calls and the rate at which an e-mailed trouble ticket can be read and acted upon. As a result, in many large environments it can take up to a week to determine whether a security incident is real or just a false-positive, and in a week an attacker can do a great deal to further strengthen his or her position in a targeted environment, as many have already begun to discover. The goal of the approach described here is to reduce that window of time by automating the early responder activities, using rule engines that can perform the kind of logical actions that humans do when performing early incident response.
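
The simple correlation described above can be illustrated with a minimal Python sketch. It is not taken from any particular product: the `Event` fields, the `PairWindowRule` class, and the event kind labels are hypothetical names chosen for illustration, and the sketch assumes events arrive in timestamp order.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds; events are assumed to arrive in time order
    kind: str         # illustrative label such as "ids_alert" or "av_alert"
    host: str


class PairWindowRule:
    """Fire when two event kinds are seen for one host within `window` seconds."""

    def __init__(self, kind_a, kind_b, window):
        self.kind_a = kind_a
        self.kind_b = kind_b
        self.window = window
        self.recent = deque()  # relevant events still inside the time window

    def feed(self, event):
        # Discard stored events that have aged out of the window.
        while self.recent and event.timestamp - self.recent[0].timestamp > self.window:
            self.recent.popleft()
        if event.kind not in (self.kind_a, self.kind_b):
            return False
        # Look for the other half of the pair on the same host.
        wanted = self.kind_b if event.kind == self.kind_a else self.kind_a
        fired = any(e.kind == wanted and e.host == event.host for e in self.recent)
        self.recent.append(event)
        return fired
```

Feeding the rule an "ids_alert" and then an "av_alert" for the same host 30 seconds apart, with a 60-second window, makes it fire. Everything that happens after the rule fires is outside the sketch, which is exactly the manual gap the paragraph describes.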

To examine the nature of the problem, it is important to first establish the difference between the two types of rules being processed. The underlying rule engines define the nature of the rules, which in turn define the limitations and challenges of event processing that uses those rules. Significant changes to the rule engines also incur changes to the architecture, which has both advantages and challenges that will be discussed later.

Current methods are the equivalent of a person watching activity in a certain area with a rule that is labeled “Two people just got married,” with a definition based on witnessing two people (a man in a tuxedo and a woman in a white wedding gown) emerging from a church and getting into a limousine. The problem with this is that a great many things can cause a false-negative. A marriage between two people of the same gender, a variation in the theme of dress for the couple, or even failure to use a limousine as the mode of transportation away from the church will each cause the observing rule engine to fail to recognize a valid event. Conversely, although the definition of what would be observed can be loosened, this in turn will result in false-positives, which are precisely what makes the process so resource-intensive to manage in the first place. What if the couple is wealthy and has emerged from their home to attend a costume party via limousine? What if it is a couple getting into their car to go home after attending a Sunday mass? This challenge is bad enough when the actors being observed are indifferent to observation; it becomes far more troublesome when the actors are aware of the risk of observation and wish to evade identification of their actions, as is the case with rules that intend to detect hostile activity on a network.

Imagine, instead, a rule system that works retroactively. Two individuals get into a vehicle of some sort and depart; this is the point of entry for the rule. From there, the observer asks, “What kind of building did they emerge from?” Or, more accurately, the observer asks a set of questions, including whether people threw rice at the couple, the nature of the music heard from within the building they just came out of prior to getting into the vehicle, and many other things that might be indicative of a wedding. Furthermore, aspects of the rule may trigger deeper, more resource-intensive investigative actions. These can be activities like walking up to a bystander who exited the same building as the couple to ask some basic questions about what transpired inside; this kind of activity would be too resource-intensive to perform every time a couple exited the building (it is safe to say that whoever kept asking random people such questions would be asked to leave, after all) but would be perfectly acceptable as a resolution effort in the face of enough evidence that a wedding had indeed taken place. This kind of conditional rule system uses what is known as “branching logic,” where the processing path of the rule varies based upon what data is fed to the rule. This allows for iterative examination within a single rule, which in turn allows other approaches. A rule system like this can be tied to an orchestration engine that can do things like going forth and fetching data from non–stream-based sources (like the bystander) and returning the result for additional processing.
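
As a rough illustration of branching logic, the wedding rule might be sketched in Python as follows. The evidence weights, the thresholds, and the names `wedding_rule` and `ask_bystander` are assumptions made for this example only; the point is that the expensive lookup is invoked only on the branch where cheap evidence is suggestive but not conclusive.

```python
def wedding_rule(observation, ask_bystander):
    """Return (verdict, steps) for one observed departure.

    observation:   dict of cheap, already-collected facts (hypothetical keys).
    ask_bystander: callable that performs the expensive investigation and
                   returns True if a wedding took place.
    """
    steps = ["couple departed in a vehicle"]  # entry point of the rule
    score = 0  # illustrative evidence weights, not calibrated values

    if observation.get("building") == "church":
        score += 1
        steps.append("building is a church")
    if observation.get("rice_thrown"):
        score += 2
        steps.append("rice was thrown")
    if observation.get("music") == "wedding march":
        score += 2
        steps.append("wedding music was heard")

    if score >= 4:   # strong evidence: no further investigation needed
        return "wedding", steps
    if score == 0:   # no evidence: not worth the expensive step
        return "not a wedding", steps

    # Ambiguous evidence: branch into the resource-intensive action.
    steps.append("asked a bystander")
    verdict = "wedding" if ask_bystander() else "not a wedding"
    return verdict, steps
```

In a fuller system the `ask_bystander` callable would stand in for a request handed to the orchestration engine, which fetches data from a source outside the event stream and returns it to the rule.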

The kind of engine that runs rules of this nature is known as a Complex Event Processing (CEP) engine, and there are several companies that make such software. Tibco, Informatica, Aleri, and Vertica compete in the space, in different ways. A CEP engine does not exist on its own as a useful system, but instead must be coupled to a larger architecture in order to be effective. Part of that architecture must include an orchestration engine (also produced by companies such as those listed above) and adapters to accept and process data from outside sources.

What is important to remember is that the use of a CEP engine (and the complex rules that it can process) does not supplant or replace existing architecture. Instead, it supplements it, allowing for automated fusion of data between existing silos of information. The architecture functions much like the command structure of a large naval vessel; the captain performs a decision-making process, while the executive officer delegates tasks and queries the sources of data. Commands pass down through the organizational structure to the different areas of shipboard operations (damage control, flight operations, engineering, and so on), and in this manner a single man can quite effectively maintain command and control of an aircraft carrier with over 5000 sailors under his command while projecting power over an area the size of a continent. In the architecture described here, the CEP engine acts as the captain and the orchestration engine serves as the executive officer. An SOA-based infrastructure carries communications to the different areas of the enterprise, much as they would be carried aboard the ship.

Now, to use this analogy in a computer security context, imagine the following scenario:

An attacker performs a cache poisoning attack against the victim’s DNS server.

The attacker then sends a series of e-mails to a subset of the victim’s population, containing a link that points to the same domain leveraged in the DNS cache poisoning attack.

The poisoned DNS entry in the cache of the victim’s DNS server points to a transparent proxy out on the larger Internet (controlled by the attacker) rather than the correct system; this proxy returns a browser exploit in return traffic to members of the targeted victim population and installs a purpose-built piece of malware on their systems.

Now, in this scenario, the initial attack is quite noisy and easy to detect. Unfortunately, it also gives little indication as to the true intent of the attacker or the exact systems being targeted. To discern that, a lot of manual action is normally taken—especially because the disparate systems involved are usually run by separate departments within a larger enterprise. Such activities are done by trouble tickets, e-mails, and phone calls—many of which require the action of finding out who exactly to contact in the first place. This is the kind of laborious process that slows down the incident response.

Imagine, instead, there existed a complex rule that initiated once the DNS cache poisoning attack was detected, which in turn would extract the Fully Qualified Domain Name (FQDN) from the alert provided. Following up on that information, the CEP engine could query outbound proxy servers for all activity within a certain period of time (ranging from the initial detection of the DNS cache poisoning attack until the current moment) involving that FQDN; the list of internal hosts within that recordset is the list of systems that have been compromised. The user accounts associated with that proxy traffic would identify the people tied to those systems, but this alone is not necessarily helpful information. To really get to the answer of who was targeted within the victim organization, the orchestration engine would trigger a query to find all inbound e-mails containing that FQDN; of these, a specific population (if not the entire set) would be e-mails that originated from the attacker. The list of people who received this e-mail would be the list of people who were targeted. From this list, some information may be gleaned as to the attacker’s motive: why them? What would these people have in common that would make them a valuable target, and to whom?
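
The follow-up steps just described can be sketched in Python. This is a simplified illustration under stated assumptions, not the interface of any CEP or orchestration product: the record types, the alert keys "fqdn" and "detected_at", and the function name are hypothetical, and the proxy logs and mail store are modelled as in-memory lists rather than remote services reached through adapters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyRecord:
    timestamp: float
    internal_host: str
    user: str
    fqdn: str


@dataclass(frozen=True)
class Email:
    recipient: str
    sender: str
    body: str


def investigate_dns_poisoning(alert, proxy_logs, inbound_mail, now):
    """Automate the early-response queries for a DNS cache poisoning alert.

    alert:        dict with hypothetical keys "fqdn" and "detected_at".
    proxy_logs:   one list of ProxyRecord per outbound proxy server, so that
                  no proxy is overlooked.
    inbound_mail: list of Email messages received by the organization.
    now:          end of the time range to search.
    """
    fqdn = alert["fqdn"]          # step 1: extract the FQDN from the alert
    start = alert["detected_at"]

    # Step 2: proxy activity involving the FQDN, from detection until now.
    hits = [
        record
        for log in proxy_logs
        for record in log
        if record.fqdn == fqdn and start <= record.timestamp <= now
    ]

    # Step 3: inbound e-mails that contain the FQDN.
    lures = [mail for mail in inbound_mail if fqdn in mail.body]

    return {
        "fqdn": fqdn,
        "suspect_hosts": sorted({r.internal_host for r in hits}),
        "users_on_hosts": sorted({r.user for r in hits}),
        "targeted_recipients": sorted({m.recipient for m in lures}),
    }
```

Comparing "targeted_recipients" with "users_on_hosts" separates the people who were sent the lure from those whose systems actually contacted the attacker's proxy, which is the starting point for the motive questions raised above.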

Without an automated process, the DNS cache poisoning attack would be just as obvious, but the different steps listed in determining the precise nature of the attack would all require manual intervention. Adding to the challenge would be that no single “pane of glass” would have access to all of the sources of data, so coordination and information requests via human-intensive means would be necessary to perform each step. And even then, the potential for human error or inconsistency (there is likely to be more than one outbound proxy server, for example, and what if one is missed?) gives the attacker the opportunity to maintain a degree of control in the victim environment even after the incident is considered closed.

The orchestration engine enables automation of certain functions as well, including the definition of predefined courses of action. These, in turn, simplify frequently repeated tasks, using a flexible approach that can evolve as new technologies and methods are put into service. Another benefit of an automated approach to such incidents is the way it lends itself to the gathering of metrics. The workflow processes incorporated within this approach can themselves generate metrics to determine where further rules development is needed. If, for a frequently occurring event, a nonstandard course of action has been taken a significant percentage of the time, then that in turn indicates a potential need to define new courses of action within the workflow component of the architecture. Additionally, metrics can indicate where proactive security measures are lacking; the recurrence of certain kinds of security events, again and again, typically indicates a systemic weakness that should be addressed.
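
The metric described above could be computed along the following lines. This is a minimal Python sketch: the input format, the function name, and the default values for `threshold` and `min_events` are assumptions for illustration, since the text does not say what counts as "frequent" or "significant."

```python
from collections import Counter


def needs_new_course_of_action(incidents, threshold=0.3, min_events=5):
    """Flag event types whose responses often deviate from the standard course.

    incidents:  iterable of (event_type, followed_standard) pairs, where
                followed_standard is True if a predefined course of action
                was used to handle the incident.
    threshold:  assumed fraction of nonstandard responses considered significant.
    min_events: assumed minimum count for an event type to be "frequent".
    Returns a dict mapping each flagged event type to its nonstandard fraction.
    """
    total = Counter()
    nonstandard = Counter()
    for event_type, followed_standard in incidents:
        total[event_type] += 1
        if not followed_standard:
            nonstandard[event_type] += 1

    return {
        event_type: nonstandard[event_type] / count
        for event_type, count in total.items()
        if count >= min_events and nonstandard[event_type] / count >= threshold
    }
```

An event type that appears in the result is a candidate for a new predefined course of action in the workflow component.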

There are challenges to achieving this level of sophistication, however. The basis for the rules can be something of a challenge; the greater potential afforded by the rule engine also opens the door for enough possibilities that determining the right approach can be difficult. For this, the best approach is to start with automating what activities currently take place. Go for efficiency first and new capability later. The larger challenge will be in getting organizational buy-in to be able to consume data sources and in the effort needed to integrate those data sources into the overall system. As with most things, this is a question of organization, available skilled people, and cooperation more than technical difficulty.

In summary, much of the current activity involved in dealing with security events revolves around repeated actions, which are (in organizations with mature security operations) fairly well-documented but difficult or impossible to automate using the standard off-the-shelf security technologies available today. The issue is one of rules logic; it is possible to document these activities on paper, but not possible to describe them with standard correlation rules. Use of a complex event processing engine in conjunction with automated orchestration changes this, making it possible for an organization not only to simplify and automate such actions, but also to accelerate the rate at which they are performed. Furthermore, the way in which they can be automated helps facilitate capturing statistics about which events still have poorly developed responses, which in turn can be used to drive evolution of the rules logic and corresponding response actions.