
Chapter 7. Tools of the Trade

“...a vision without the ability to execute it is probably an hallucination.”

Stephen Case

In the early days of the security industry boom back in the late 1990s, there were only a handful of dedicated security product vendors. Most commercial security tools were offshoots or acquisitions by larger companies, and when the topic of network security tools was discussed, firewall or antivirus were the first words to come to mind. Today, there are literally hundreds of companies with security products and services that cover just about every aspect of information and network security. From password managers and social media leak detection to content-aware firewalls and breach detection systems, there is an abundance of security technology available. Many vendors offer expensive all-in-one tools or managed security services that purport to take all your data and abstract it into actionable security monitoring. The security industry has grown so huge that it has become a commoditized niche industry. You can spend millions on security solutions under the guise of protecting your network.

However, we reject the concept of the security “black box,” or the one vendor that claims to do it all without providing sufficient detail about how detection actually happens or even if it’s working properly. Proprietary detection methods and indicators are not helpful when attempting to investigate a possible breach. We know that we can never detect or prevent 100% of security incidents 100% of the time. Data matters most, and architectural inadequacies—highlighted by your security tools—help drive precautionary changes. Any solution must provide a rich body of evidence for incident investigation and confirmation. Lacking thorough evidence puts you at risk of failing to detect attacks and unnecessarily disrupting host or user access. Crafting your own playbook is an individualized process that’s unique to your organization, regardless of what tools and event sources you already have or plan to acquire. It demands you peel back the cover on your technologies to understand specifically what to look for, what details are available, and more importantly, how to detect incidents.

The right tools for your environment depend on a myriad of factors, including budget, scale, familiarity with the products, and detection strategy. With a reliable set of fundamental tools, adherence to security best practices, and a data-centric playbook approach, you can extract and utilize all the information you need. In this chapter, we’ll discuss limitations and benefits of core security monitoring and response tools, deployment considerations affecting efficacy and accuracy, examples of successful incident detection, and how to use threat intelligence to enhance your detection capabilities.

Defense in Depth

Just as you need more than a hammer to build a house, you need a variety of tools to properly build out a decent incident detection infrastructure. Defense in depth requires that you have detection, prevention, and logging technology at multiple layers so that you don’t miss important event data and evidence for investigations. Remember that you will never discover every incident as it happens—there will always be a place for post-hoc investigation. Determined attackers will find a way to flank your defenses no matter how deep, or exploit the weakest link in your systems (usually human trust). However, when it comes time to investigate and trace every step of the attack, you will need as much specific and relevant data as possible to support the investigation. Defense in depth also helps with additional redundancy in security monitoring operations. If a sensor or group of sensors fails, or is under routine maintenance, other tools can fill in at different layers so that complete visibility isn’t lost.

Successful Incident Detection

To help shape your thinking about how to build out sufficient defense in depth, let’s consider some classic models and compare their layered approaches to network and information defense.

Table 7-1 illustrates the seven layers of the Open Systems Interconnection (OSI) reference model and their corresponding defense-in-depth layers:

OSI model layer      Defense-in-depth layer
Application layer    Log files from servers or applications
Presentation layer   System logging, web proxy logs
Session layer        System logging, web proxy logs
Transport layer      Intrusion detection
Network layer        Wireless intrusion detection and switch port filtering
Data link layer      Switch port controls and filters
Physical layer       Switch port controls and filters

Table 7-1. OSI layers mapped to detection layers

Just like Ethernet frames, datagrams, and packets, good incident detection will build upon layers. Detection logic for application-level security might come from indexed and searchable logs that show what happened before, during, and after a process or application has launched (or crashed). For the application layer, you should be logging all authentication failures and successes on critical systems. At the transport and network layers, you can monitor for unexpected or anomalous connections to Internet hosts, fake IPv6 router advertisements, or unexpected internal host traffic. At the data link layer, you can monitor switch CAM tables for ARP spoofing or (de)authentication attacks on your wireless infrastructure.
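As a minimal illustration of that application-layer logging, consider a Python sketch that counts failed SSH logins per source address. The log path and message format are assumptions (they vary by distribution), so treat this as a starting point rather than a finished play:

import re
from collections import Counter

# Matches OpenSSH "Failed password" lines in a typical Linux auth log.
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?(\S+) from (\d{1,3}(?:\.\d{1,3}){3})"
)

failures = Counter()
with open("/var/log/auth.log") as log:  # assumed path; varies by distribution
    for line in log:
        match = FAILED_LOGIN.search(line)
        if match:
            user, source_ip = match.groups()
            failures[source_ip] += 1

# A threshold of 10 failures is an arbitrary starting point; tune it
# before turning this into a scheduled report.
for source_ip, count in failures.most_common():
    if count >= 10:
        print(f"{source_ip}: {count} failed logins")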

Another model to help you think about defense in depth is the “intrusion kill chain,” a framework developed in part by U.S. company Lockheed Martin to describe the steps required in a successful attack (Figure 7-1). Advanced attacks by determined adversaries occur in organized, multistage processes, chaining together various methods and exploits to achieve their goal. Just like in an advanced attack, however, defenders can use their own chain to detect or block the attackers’ techniques.


Figure 7-1. The kill chain

Detection capabilities exist for almost every step of the kill chain, whether before or after successful exploitation (Table 7-2).

Kill chain phases

Defense-in-depth response

Recon

Recon can be detected by monitoring unusual connections and probes to external web applications, phishing campaigns, or externally facing services with your security sensors. Good results from indicators may get lost in a sea of Internet scanning, but it’s worth watching for anomalies from outside and inside the network.

Weaponize

Weaponized exploits can enter your network through numerous methods, but due to high rates of success, the most likely compromise is via email phishing, company website/credentials compromise, or drive-by/watering hole attacks on external services known to be used by the target. Direct attacks on external applications after some probing can be seen in IDS or web/application firewall logs.

Deliver

Network IDS and host-based security like intrusion prevention systems or antivirus can raise an alarm at every stage except the weaponize phase.

Exploit

Antivirus and host intrusion prevention logs can indicate exploit and malware installation attempts.

Control

The control, execute, and maintain stages offer the best chance of successful detection, given that data eventually has to leave your network to return to the attackers or an affiliate.

Execute

System log files, behavior anomaly analysis, or other traps and controls might help to stop the attackers from getting what they came for. Unusual system activity might still go undetected; however, better readiness and minding the valuable assets improve your chances of foiling the attackers’ plans.

Maintain

Once inside, attackers will persist in retaining control of any successfully compromised system, either to retain a platform from which to launch additional attacks or to continue exfiltrating sensitive and confidential data. Proper password hygiene and top-notch OpSec frustrate or invalidate additional lateral movement, and system authentication logs offer invaluable details.

Table 7-2. Kill chain phases mapped to detection layers

WARNING

What happens if your organization undergoes an unscheduled security audit? If penetration testers start probing your network from the outside, even if their probes get lost in the sea of other scanning, you can at least point to logs at the end of the audit to show that you “detected” them. Logs of this type are also useful when creating a timeline or conducting a subsequent investigation.

The big takeaway from both models is the need to have detection at every layer possible. If you only invest in breach detection, intrusion detection, or antivirus, you’ll be missing data from a whole class of attacks and techniques. You will still detect event data, but you will never have the complete picture, and reconstructing a timeline will be impossible. Understanding how an incident unfolds enhances an organization’s security protections, and helps avoid subsequent compromise by driving improved architecture and policy standards.

Hack me once, shame on you; hack me twice, shame on me.

WEB ATTACKS

A watering hole attack occurs when an attacker compromises a website known to be accessed by the victim or members of the victim’s organization. Like a crocodile waiting for prey to inevitably drink from the watering hole, the attacker waits for the victim to access or log in to the compromised website.

A drive-by attack (or drive-by download) happens when an attacker compromises a victim web application (the patsy), or hosts their own malicious site, and then injects exploits or redirection to other victim sites. In the end, more victims pile up as the attacker’s tools (commonly commercial exploit kits) silently execute from the web browsers of unsuspecting victims. Unlike a direct attack on a known site like the watering hole example, attackers leveraging the drive-by method often inject their scripts into syndicated advertising networks to improve victim exposure to their attack.

The Security Monitoring Toolkit

To investigate any security incident, you need evidence. Any worthwhile security monitoring tool will generate event data that investigators can analyze. To build a corpus of information for monitoring the security of your network, you need tools, their data, and somewhere to store and process it. Putting together the right toolkit demands you understand your network topology and scale, your business practices, and where protection is the most critical. It’s also important to understand the pros and cons of the various security monitoring tools available. You need to know how to use tools properly to get useful information from them, as well as how the tools actually work. Some abstraction in your toolkit is fine, but the more familiar you are with how the tools operate, the more effective they become. All tools require some level of configuration to ensure they are relevant to your network, and many tools require ongoing tuning to ensure you are not overloaded with useless data. System health monitoring and event validation are paramount if you intend to keep the service running. Remember that, in the end, the tools are what feed data into your playbook. Choose wisely and understand your requirements well before investing in a commitment to any technology.

Log Management: The Security Event Data Warehouse

To execute a playbook efficiently, you need to aggregate all security logs and events into a searchable nexus of data and metadata. Classically, incident response teams pointed their event data to SIEMs for this purpose. However, with modern toolkits and log collection architectures, it’s possible to shed the burden of expensive, inflexible commercial SIEMs for a flexible and highly customizable log management system. The playbook is about much more than just a tool, like a SIEM, that returns report results. The objectives in the plays help communicate what the team is doing and help prioritize areas to focus on based on “what we’re trying to protect.” The analysis sections are the documentation and prescription for the analysts; the play comments and feedback are a place for analysts to discuss issues, tweaks, and so on.

Ultimately, having only a SIEM delivering report results without the other support infrastructure around it is less valuable to your organization than the comprehensive playbook. No automation or algorithm can make accurate, situationally aware decisions based on security alerts like a human can. Pairing readable log information with proper understanding of your metadata will yield excellent results. Log management and analysis systems render your information in a variety of configurable ways, opening an enhanced search and reporting engine for logs.

Developing an organization-wide log collection system, containing network, system, and application logs, offers ample benefits to your IT department, as well as InfoSec and application developers. IT and developers need accessible log data for debugging or troubleshooting operational or software issues. For security, we want all syslogs and system logging we can get, along with our multitude of security monitoring technology logs.

Deployment considerations

A log management solution has to be large enough to store all security event logs and speedy enough to allow for event retrieval and querying, but how do you accomplish this? Will you need a 50,000-node Hadoop cluster or a single syslog server? We’ve described what makes logs useful (“who did what, when, where, for how long, and how did they do it”), meaning that anything that doesn’t provide this level of information generally can be discarded. We’ve also discussed how logs hold the metadata truth necessary for creating repeatable searches and reports for your playbook. A good log management system will return data that looks a lot like the original log data as generated by the system, versus a digested alert produced by a SIEM. Getting closer to the real event gives you much more flexibility in understanding why it happened, and helps to research additional attributes without the prejudice of an automated alert.

The trick to a successful deployment with real log data is to collect and tag what you can, and filter out what you don’t need or can’t interpret. Adding context inline to raw event data will also help tremendously. For example, if you can flag an IP address in a log event as matching a list of known bad actors, you might look at an alert with more suspicion. It can be helpful to include internal context as well, such as tagging address ranges by their function (datacenter, domain controllers, finance servers, etc.) or tagging user IDs with job titles to help recognize targeted attacks against vulnerable and/or valuable targets like executives.
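A minimal sketch of that inline enrichment, with the known-bad list and network roles as stand-in assumptions for real threat intelligence feeds and asset inventory:

import ipaddress

# Hypothetical context sources; in practice these come from threat
# intelligence feeds and your own asset inventory.
KNOWN_BAD = {"203.0.113.77", "198.51.100.12"}
NETWORK_ROLES = {
    ipaddress.ip_network("10.10.0.0/16"): "datacenter",
    ipaddress.ip_network("10.20.5.0/24"): "finance-servers",
}

def enrich(event):
    """Attach context tags to a parsed log event (a dict with src_ip/dest_ip)."""
    tags = []
    for field in ("src_ip", "dest_ip"):
        addr = event.get(field)
        if addr is None:
            continue
        if addr in KNOWN_BAD:
            tags.append(f"{field}:known-bad-actor")
        for network, role in NETWORK_ROLES.items():
            if ipaddress.ip_address(addr) in network:
                tags.append(f"{field}:{role}")
    event["tags"] = tags
    return event

print(enrich({"src_ip": "203.0.113.77", "dest_ip": "10.20.5.9"}))

An analyst reviewing the resulting event sees both the known-bad-actor flag and the finance-servers tag at a glance, which changes how quickly the alert gets attention.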

As with most tools, there are many log management systems available, from commercial to open source, with a wide variety of architectures and options. When deciding which tool to choose, consider that a proper log management system needs to:

Be flexible and modular

It would be naive to think that you’ll never add more security event sources to your detection arsenal. The log management system must be able to support future event sources, support various log formats from free-form to highly structured text, and support log transport methods from syslog over UDP/TCP to Secure FTP (SFTP) and others.

Parse and index log event data

You must be able to extract usable fields from your log sources. IP addresses, timestamps, event context, and other details are important to break out from the raw log message. This means you will have to employ some type of regular expression against the raw log data to parse and extract fields (see the parsing sketch after this list). After extraction, the fields must be loaded into an index, cluster, table, graph, or database so they can be searched later.

Provide a robust and simple query interface

Providing an expressive and functional query interface allows your analysts to develop readable and effective queries that lead to playbook plays. Having mathematical and statistical operators available also makes developing queries easier. Often it’s helpful to identify trends or outliers in event data. Determining quantity (i.e., number of events, hosts, events by host, etc.) and other statistical relationships will help with both development and report presentation. Supporting a basic syntax or language for query development also enables anyone to share ideas and work on refining their queries or adding more advanced query features as necessary. If the query interface isn’t easy to use, it’s likely some analysts will use it improperly or not at all. In the event of a security emergency, you also need the ability to retrieve search results quickly rather than spending time developing an overly complex query or graph analysis.

Retrieve log data with ad-hoc and saved searches

Having all the supported log data in your management system is one thing; knowing how to get useful information from it is another. To create a playbook, you’ll need the functionality of saved searches to recall data for later analysis. SIEMs generally deliver canned queries and reports that purport to inform you of security incidents. A log management system should provide a concise method for saving and scheduling searches and reporting event details. Unless you have analysts looking at screens all the time, you’ll need a way to queue up event data for future review. Result presentation is also a big factor in deciding which solution to adopt. If you can develop reports that are readable and understandable, you can move alerts through the team and the standard processes with a common understanding of their significance. Presentation delivery mechanisms like graphs, dashboards, and HTTP or email feeds can give your team and case handling tools plenty of alert data. Exporting event data in any format (JSON, XML, CSV, etc.) is a nice capability that can be leveraged to feed other applications, such as remediation tools, case tracking, or metric and statistical collection systems. Tying your detection systems to other systems improves response time and reduces the possibility of human error.

Ensure availability of your log management system and the data feeds contained within

System, network, and database administrators measure their availability in terms of 9s. A 99.999% uptime equates to only about five minutes of downtime per year. What level of service availability can you expect, or are you able to offer to your analysts, for the security feeds? Most security monitoring data feeds depend on external teams, whether it’s a SPAN session from a network administrator or event logs from an Active Directory administrator. Because you consume their data, system maintenance, configuration changes, and hardware or software crashes on their systems can cause outages in your own service offering that are beyond your control. You should set up service availability checks specific to your requirements, and establish escalation procedures with external groups for when you detect a service interruption.
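As referenced earlier in this list, here is a minimal sketch of the field extraction a log management system performs, assuming one simplified syslog-style format; real systems maintain a library of such patterns, one per event source:

import re

# A simplified pattern for one syslog-style format.
SYSLOG = re.compile(
    r"(?P<timestamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) "
    r"(?P<host>\S+) "
    r"(?P<process>[\w\-/]+)(?:\[(?P<pid>\d+)\])?: "
    r"(?P<message>.*)"
)

def parse(line):
    """Extract indexable fields from a raw log line, or None if unparsed."""
    match = SYSLOG.match(line)
    return match.groupdict() if match else None

record = parse("Oct 30 19:02:08 desktop-nyc sshd[4721]: Connection closed by 10.50.225.242")
print(record)
# The resulting dict is what gets loaded into an index or database for search.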

In a nutshell, the essential pros and cons of log management tools come down to this:

§ A SIEM ties you to views and alerts defined by the vendor, or to formats they support, whereas log management gives you flexibility to detect and respond in the method you define.

§ Log management systems can provide the flexibility and modularity necessary to discover and respond to threats potentially unknown to canned commercial systems.

§ Log management systems require a lot of time investment to ensure they are optimal and expedient, but the return for that time investment is unparalleled visibility.

§ Storing and indexing lots of data in a log management system can be expensive, but looking back at a prior event is critical to any incident response operation.

Intrusion Detection Isn’t Dead

Having just finished the final touches on a massive intrusion detection deployment for my organization at the time, I was a little disappointed (and incredulous) after reading the widely respected Gartner research group’s prediction that “IDS as a security technology is going to disappear.” The argument back in 2003 was that IDS would become irrelevant as secured systems and network architecture, along with additional security controls and risk management, became more ubiquitous. Intrusion detection, it turns out, was, and remains, far from dead. The additional security controls described were easier said than done, and the complexity of datacenters and networks has only increased over the years, making them more difficult to secure. Not to mention the additional risks assumed by partner network interconnects, acquisitions, and cloud-hosted services. Network detection of data exfiltration, external attacks, internal attacks, or any cleartext network traffic that can be matched against a pattern can (and in most cases, should) be handled by an intrusion detection system. Why? Because an IDS provides a customizable view of a network session, from buildup to teardown, and as such provides the most detail possible short of a full packet capture. There are countless ways to detect incidents using an IDS, from esoteric TCP sequence number manipulation to simple pattern matching using regular expressions against HTTP. The strength and utility of the IDS boils down to both where you deploy the sensors and how well you manage and tune their alarms.

The one salient point from the Gartner research article was that an IDS often produces a significant number of false positives, or alarms that represent only benign activity. This is completely true. However, also consider that if you bring home a new cat and never feed or vaccinate it, it’s not going to be an effective pet either. An IDS is not a plug-and-play technology. It requires proper deployment, tuning, and event management to be a useful tool in your defense-in-depth strategy. Running an IDS deployment means routine work on the system, so plan to review monitoring techniques and policies regularly. This is so important to our team that members meet weekly to discuss the IDS findings and any tuning issues that need resolution.

Deployment considerations

With most things in the computer world, there is always more than one way to accomplish a task. The same goes for security technologies. There are numerous ways to deploy most of the tools listed in this chapter and IDS presents a classic InfoSec dilemma right out of the gate: Do we block traffic inline, or do we log attacks offline for analysis?

Inline blocking or passive detection

In its most simplistic form, you can send a copy of network traffic to an intrusion detection sensor, or you can deploy it (now called an intrusion prevention system, or IPS) inline with network traffic. Inline deployments offer the obvious benefit of transparent traffic blocking or redirection capabilities, similar to that of a network firewall. What sets the IPS apart from a firewall are the signature matching and upper-level protocol inspection capabilities of the sensors. In general, a firewall blocks based on a preconfigured policy. The IPS will block based on a policy, but has much finer controls on when and what to block. Many vendors offer numerous ways to actually block the traffic, including commanding the IPS to generate a firewall rule or ACL to deploy on another device. Most commonly, however, an inline IPS will simply drop the traffic on its incoming network interface.

Inline deployments offer the benefit of both preventing some attacks, as well as generating log data that can be mined and searched later. A new play might look at IPS blocked traffic from internal hosts that might indicate a worm infection or perhaps a malware callback to an external host. Another report might also look for exploit attempts blocked by the IPS from internal hosts attempting to gain unauthorized access (or possibly just penetration testing). Even if a connection is blocked, it’s worth noting and investigating, because the source host generating the alarm may either be compromised or up to nefarious activity, potentially including other techniques that may go unblocked or even undetected.

Although inline deployments sound attractive, they are not without their flaws. Of primary concern is the most fundamental component in the system: the network interface, the ubiquitous, cheap piece of circuit board and copper through which you are forcing your whole organization’s network connectivity. If it experiences impaired capacity, insufficient throughput, or crashes, what impact will that have on the rest of your network? Redundant sensors all have to inspect the same traffic to ensure continuity. If adequate (and timely) failover or expensive high-availability options are not present, a network outage resulting from device unavailability can cause SLA breaches, production roadblocks, wasted resources, and possibly missed attacks.

Hardware failover issues aside, there are other considerations to factor in when planning an inline deployment. It’s important to understand that even with the most conservative timer configurations possible, the spanning tree protocol (or rapid spanning tree protocol, RSTP) can reconverge around an inline sensor during an outage. When an inline sensor goes down in an environment running RSTP, if the switch’s configured “hello” time interval expires before the sensor’s interface returns to operation, the switch will send traffic down a different path. This means when the sensor returns to service, it may no longer be receiving traffic. Similar problems can occur with routing protocols such as Open Shortest Path First (OSPF) if the timers expire before the IPS comes back online. This type of scenario can occur when a sensor is rebooting, or even when a new policy or signature update is applied and forces a software restart.

The other main issues with inline deployments are latency and signature fidelity. Losing all connectivity due to a hardware or software problem is catastrophic, but there are more sinister problems that are harder to troubleshoot. Network latency, and any application or protocol that is sensitive to the additional delay created by IPS inspection, can create difficult-to-solve problems. If a TCP or link error occurs on the remote end of a connection, it’s possible that an inline sensor may reject the retransmission because the sequence numbers are unexpected, or some other network or transport layer error occurs, resulting in dropped traffic.

More concerning with an inline deployment is the fact that legitimate traffic might be blocked, or that false positive alarms will also result in erroneous blocks. As mentioned previously, constant tuning is required to keep IDS relevant. To stay viable and accepted by users, network security must be as transparent as possible while remaining effective. Blocking or dropping traffic improperly will result in application problems and user-experience degradation. Further, in the event of a targeted attack, you may not want to block the outbound traffic as part of the investigation. For example, if an attacker is attempting lateral movement or data exfiltration within your organization, rather than simply blocking their traffic outright (assuming it’s even possible), you may want to gather as much detail as possible (e.g., packet captures, files, etc.) that show the precise attack methods involved. You don’t want to tip your hand to any advanced adversary that you are on to them until you have what you need. Once understood and analyzed, it’s then acceptable to resume blocking their traffic. Now that you’ve captured additional data, you have more details to justify even stricter controls to further secure your attacker’s targets and watch for new attack indicators.

Additionally, because your sensor is inline with production traffic, expect repeated scapegoating of the device when something goes wrong with the network. Anyone who has completed a successful inline deployment of a security tool has experienced this phenomenon. It’s the ugly side of Occam’s razor that if something goes wrong, it must be the new technology the security team deployed. This puts the security team in a troubleshooting role anytime something happens that could be blamed on the IPS.

The best approach when going inline is to start with a passive deployment that has no impact to network operations. This gives you time to tune your sensors and get comfortable with their operation while researching failover capabilities and the other foibles of inline technology.

However, if you have made the decision not to go inline, the only alternative is passive detection. This means the IDS will not block any traffic and will only send notifications when traffic might have been blocked (were it in inline mode). Remember, in both modes, you still get event log data that can be queried and turned into plays. Passive mode, however, affords you greater confidence in network operations and uptime, and is a desirable approach in environments with higher throughput and uptime sensitivities (ecommerce, trading systems, etc.). Attack traffic will still need to be mitigated, but you will now have log data to further investigate and utilize in your defense-in-depth strategy to stop attacks at other layers.

Location, location, location

Where you put the sensors in your network traffic flow is just as important as deciding whether or where to go inline. When you think about the intended goal of an intrusion detection sensor, you realize in order to provide any value, it needs to be monitoring the most relevant network traffic between systems you are trying to secure.

The most sensitive, and often critical, parts of an organization’s IT infrastructure typically occupy the DMZ or the datacenter. DMZ networks are the front face of your organization, whether they host web servers, application services, development labs, or even your Internet connections. With an appropriately restrictive security architecture and adequate network segmentation, only hosts in the DMZ networks should be able to connect to the Internet directly.

Naturally, it makes sense to deploy intrusion detection between the DMZ and the Internet, as well as between the DMZ and the internal network (Figure 7-2). Monitoring these two choke points ensures that you inspect all traffic into and out of your organization’s network (unless, of course, wise guys are using a mobile wireless connection like 3G/LTE).

Comprehensive detection coverage demands inspecting at gateways to various classes of traffic, but it also demands data segmentation, or deduplication (e.g., datacenter to Internet traffic may be inspected and logged twice).

In general, the most important business-critical systems operate in an organization’s internal datacenter. While many applications and services are increasingly hosted externally through third-party extranet or cloud-based vendors, it’s rare that an organization won’t at least have some critical services operating in its datacenters: local Windows domains and controllers, authentication services, financial systems, sensitive databases, source code, development servers, and many other important pieces of an organization’s IT infrastructure. The datacenter boundary is a great place to deploy additional intrusion detection. Anything into or out of the datacenter should be monitored. The same goes for any network segment hosting critical services or data. Choosing the most appropriate network intersections to monitor will greatly improve your deployment experience. Ideally, it would be wise to collect at all possible choke points, like desktop or lab uplinks, or intra-datacenter traffic. For larger organizations, in most cases, the traffic volume for this approach is overwhelming and not likely to yield many more actionable results, if only because of the additional volume of data inspected. In smaller environments, monitoring most network interconnects makes sense as long as you can accurately deduplicate alarm data that might have triggered twice for the same connection.
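One simple approach to that deduplication, sketched here with assumed field names: treat alarms that share a signature ID and connection 5-tuple within a short window as one event, regardless of which sensor reported them.

seen = {}  # (signature, 5-tuple) -> timestamp of the last occurrence
WINDOW_SECONDS = 60  # same-connection alarms within this window count as duplicates

def is_duplicate(alarm):
    """Return True if an equivalent alarm was seen recently at another sensor."""
    key = (alarm["sid"], alarm["src_ip"], alarm["src_port"],
           alarm["dest_ip"], alarm["dest_port"], alarm["proto"])
    last = seen.get(key)
    seen[key] = alarm["timestamp"]
    return last is not None and alarm["timestamp"] - last < WINDOW_SECONDS

# The same connection inspected at both the DMZ and the datacenter boundary:
a = {"sid": 25854, "src_ip": "10.20.124.108", "src_port": 3116,
     "dest_ip": "192.0.2.50", "dest_port": 80, "proto": "tcp", "timestamp": 1000}
print(is_duplicate(a))                        # False: first sighting
print(is_duplicate(dict(a, timestamp=1002)))  # True: duplicate two seconds later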


Figure 7-2. An example of an effective IDS or IPS architecture

Let’s look at real-world examples

We have already hinted at a few possible reports leveraging the power of IDS. In one specific report, we employed IDS to detect Structured Query Language (SQL) injections. Using the built-in generic SQL injection signatures (looking for classic signs like the SQL commands UNION followed by SELECT, substring((select, Drop Table, or validations like 1=1) combined with a basic string match for some known attacks against content management systems, we developed a report that produces great results when someone is attempting to attack web infrastructure. The following is example data yielded as a result:

GET /postnuke/index.php?module=My_eGallery&do=showpic&p=id=-1/**/AND/**/1=2
/**/UNION/**/ALL/**/SELECT/**/0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
concat(0x3C7230783E,pn_uname,0x3a,pn_pass,0x3C7230783E),
0,0,0/**/FROM/**/md_users/**/WHERE/**/pn_uid=$id/* HTTP/1.1
Connection: keep-alive
User-Agent: Mozilla/5.00 (Nikto/2.1.5) (Evasions:None) (Test:000690)
Host: us-indiana-3.local.company.com
X-IMForwards: 20
Via: 1.1 proxy12.remote.othercompany.com:80
X-Forwarded-For: 10.87.102.42

Notice the User-Agent Nikto (a common web application vulnerability scanner) looking for SQL injection success against the Postnuke content management system running on host us-indiana-3.local.company.com. The Via and X-Forwarded-For headers indicate the host’s attack was proxied through an external web proxy. Here the IDS digs deep enough into the packet to parse not only the HTTP URI, but also the HTTP headers, which give us the true source IP behind the web proxy. A slight addition to the default generic signatures yields specific data about this attack. We identified the host owners of 10.87.102.42, and through interviewing, discovered they were performing an authorized penetration test against our host, us-indiana-3.local.company.com, without notifying our team beforehand.
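The detection logic behind those generic signatures amounts to pattern matching on the decoded URI. The following toy approximation in Python shows the idea; it is not a substitute for maintained IDS signatures, which handle far more evasions:

import re
from urllib.parse import unquote

# Rough approximations of classic SQL injection indicators.
SQLI_PATTERNS = [
    re.compile(r"union(?:\s|/\*.*?\*/)+(?:all(?:\s|/\*.*?\*/)+)?select", re.I),
    re.compile(r"substring\s*\(\s*\(\s*select", re.I),
    re.compile(r"drop\s+table", re.I),
    re.compile(r"\b1\s*=\s*1\b"),
]

def looks_like_sqli(uri):
    """Return True if a decoded URI matches any classic injection pattern."""
    decoded = unquote(uri)
    return any(p.search(decoded) for p in SQLI_PATTERNS)

uri = "/postnuke/index.php?module=My_eGallery&do=showpic&p=id=-1/**/AND/**/1=2/**/UNION/**/ALL/**/SELECT/**/0,0,0"
print(looks_like_sqli(uri))  # True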

IDS can find and report malware infections as well. If host-based protections are absent or have failed, IDS may be able to capture data as it leaves the network. In this example, the IDS detected an HTTP connection that contained no cookie information, no HTTP referrer, and matched a particular URL regular expression, the combination of which is a known confirmed indicator of the Zeus banking Trojan:

Signature:

alert tcp $HOME_NET any -> $EXTERNAL_NET $HTTP_PORTS (msg:"MALWARE-CNC
Win.Trojan.Zeus variant outbound connection - MSIE7 No Referer No Cookie";
flow:to_server,established; urilen:1; content:"|2F|"; http_uri;
pcre:"/\r\nHost\x3A\s+[^\r\n]*?[bcdfghjklmnpqrstvwxyz]{5,}[^\r\n]*?\x2Einfo\r\n/Hi";
content:!"|0A|Referer|3A|"; http_header; content:!"|0A|Cookie|3A|"; http_header;
content:"|3B 20|MSIE|20|7.0|3B 20|"; http_header;
content:"|2E|info|0D 0A|"; fast_pattern; nocase; http_header;
metadata:impact_flag red, policy security-ips drop, ruleset community, service http;
reference:url,en.wikipedia.org/wiki/Zeus_(Trojan_horse);
classtype:trojan-activity; sid:25854; rev:5;)

Result:

sensor=sensor22-delhi.company.com event_id=154659
msg="MALWARE-CNC Win.Trojan.Zeus variant outbound connection - MSIE7 No Referer No Cookie"
sid=25854 gid=1 rev=5
class_desc="A Network Trojan was Detected" class=trojan-activity priority=high
src_ip=10.20.124.108 dest_ip=[bad.guy.webserver] src_port=3116 dest_port=80 ip_proto=TCP
blocked=No client_app="Internet Explorer" app_proto=HTTP
src_ip_country="india" dest_ip_country="united states"

You can see the source host (src_ip) that’s making an HTTP callback to a Zeus C2 server (dest_ip). Also note the blocked=No, which indicates we simply detected this connection and did not block it. Subsequently, after confirming this event with other data sources, we blocked all access to the dest_ip from anyone in the organization.

Limitations

IDS is a powerful post-hoc investigation tool only if it correctly captures alarms when something has happened. Unfortunately, you have to know what you want to detect before you can detect it, and unless there is already a signature written and available, you will miss out on some attacks. By its nature, a signature-based system will never be fully up to date. A typical IDS signature cannot be developed until an exploit or attack has already been identified. The time between exploit release and signature development appears to be shrinking; however, there’s also the delay introduced by quality-control checks on signature updates. Making sure a poorly written signature in an automatic signature update doesn’t break anything is a lesson learned from years of managing global IDS deployments. This additional testing time further delays both production deployment and rolling the new signature into new or existing reports in the playbook.

Many IDS vendors offer additional inspection capabilities that go beyond simple pattern matching with regular expression in a signature-based format. Anomaly-based detection and automatic threat research from cloud connected services can expand the capabilities of our favorite glorified packet capture and matching device. That said, an IDS has its limitations:

§ It must be able to inspect every port and protocol to remain relevant.

§ It’s dependent on quality and accurate signatures.

§ There are multiple, successful IDS evasion techniques available.

§ Its output is only as good as its tuning.

§ Throughput and performance can become issues on high-traffic networks.

§ In general, it has only static (passive) detection methods through a signature-based approach.

§ You don’t necessarily know if an attack was successful.

Just because an IDS alarm fires does not mean there’s a security incident. Hosts on the Internet with no filtering will get port scanned. Because of this simple fact, there’s not a lot of value in reacting to basic scanning alarms from the Internet toward your systems, although it is worth watching trends and outliers in scanning activity. IDS signatures typically alarm on exploit attempts (via either exploit kits or application vulnerability attacks), known malware signatures, or network or system anomalies observed on the network. If you see a client host connecting to an exploit kit, do you know whether the host was infected? All the IDS really tells us here is that the host may have been exposed to malware, not that it executed. Tying the IDS log data to host-based data or other data sources becomes important to accurately identify an incident worth a response.

The essential pros and cons of network intrusion detection come down to:

§ IDS gives you a platform to selectively inspect network traffic and attacks.

§ Deployment location is critical for efficacy and impact.

§ Tuning is mandatory, and although it can be a lot of initial work, it will eventually pay off in efficiency.

§ Inline or passive deployment depends on your appetite for risk versus your tolerance for outage.

§ IDS and IPS are both reactive technologies, given their dependence on known signatures. The gap between threat release and signature release is rapidly closing, although they will never be synchronized.

HIP Shot

Network IDS is clearly designed to capture malicious activity as it heads out of the network or into a sensitive area. The chokepoint/outbound detection strategy we use finds plenty of activity, and it’s nearly impossible to escape to the Internet without our team noticing. However, one of the main weaknesses of network intrusion detection is that you do not have an accurate depiction of what’s occurring in traffic (or system processes) that never crosses a chokepoint. Naturally, if you are watching all the outbound traffic, you will most likely identify call-home traffic; however, if you don’t detect it for whatever reason (e.g., infrequent or unpredictable patterns), the attack will still succeed. What’s worse, other internal systems could have been adversely affected and gone undetected. In advanced attack scenarios, lateral movement after compromising a single host (aka patient zero) is a common next technique to maintain a foothold. As mentioned earlier, you cannot have IDS at every gateway because of performance and scaling issues; therefore, you need an alternative solution for detecting what’s happening in intra-subnet and intra-segment traffic. This is where intrusion detection forks into the host-based world.

Deployment considerations

Remember with network IDS (and every other security tool), we are after event logs that we can index and search. With a host-based intrusion detection or prevention system (HIDS/HIPS), we not only block basic attacks and some malware, but we also get log data, rich with metadata specific to the host that can point to malicious activity, as well as attribution and identification data we might not otherwise see on the network.

It’s worth differentiating between commonplace antivirus and more sophisticated HIPS software. While both technologies can leverage signature-based detection, not unlike a network IDS, a HIPS goes a step further, allowing application profiling and anomaly detection/prevention with custom rules. A HIPS typically uses kernel drivers to intercept system calls and other information to compare against its policy. Although not a traditional HIPS product due to its focus on different low-level operations, Microsoft’s (free) Enhanced Mitigation Experience Toolkit (EMET) has some illustrative capabilities that perform similarly to a HIPS. Protecting memory addressing through randomization, preventing data from executing on nonexecutable portions of a stack (Data Execution Prevention), and other memory-based protections give EMET an advantage when it comes to some advanced malware not yet detected by a signature- or file-based system. Remember that a signature-based system, no matter how good, will always be reactive and cannot prevent an attack until it’s known. Behavioral or anomaly-based security systems can protect against attacks that have never been seen before, even if they are never written to a file.

However, a traditional HIPS focuses on file-based rather than memory-based attacks. The file approach makes techniques like whitelisting work well, particularly for smaller or controlled environments where you can always allow known files and their proper locations, registry keys, and so on, without warning or blocking. Another useful HIPS strategy, rather than whitelisting everything, is simply to raise alarms when certain unusual conditions exist, like Windows registry changes that establish persistence. A HIPS is also perfect for detecting the download or execution of malware droppers. Some techniques include:

§ Watching for writes to protected system directories or even common user directories (a minimal monitoring sketch follows this list), including:

§ %Temp% = C:\Users\<user>\AppData\Local\Temp

§ %Appdata% = C:\Users\<user>\AppData\Roaming

§ C:\Program Files(x86)

§ C:\Windows\System32\drivers

§ User database or registry changes

§ Modification of running processes, particularly common services, including explorer.exe, iexplore.exe, java.exe, svchost.exe, rundll32.exe, winlogon.exe

§ Blocking and/or logging network connection attempts from other hosts
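To make the first technique in the list concrete, here is a heavily simplified, polling-based Python sketch that watches a user temp directory for newly dropped executables. A real HIPS hooks filesystem calls in the kernel rather than polling, so treat this purely as an illustration of the detection logic:

import os
import time

# Watch the user's temp directory (%Temp% on Windows, /tmp elsewhere).
# A real HIPS intercepts the write itself instead of polling after the fact.
WATCH_DIR = os.path.expandvars(r"%Temp%") if os.name == "nt" else "/tmp"

known = set(os.listdir(WATCH_DIR))
while True:
    time.sleep(5)  # polling interval; kernel hooks would be event-driven
    current = set(os.listdir(WATCH_DIR))
    for name in current - known:
        if name.lower().endswith((".exe", ".dll", ".scr")):
            print(f"ALERT: new executable dropped in {WATCH_DIR}: {name}")
    known = current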

Let’s look at real-world examples

The last technique in the preceding list, blocking and logging connection attempts from other hosts, is key for developing a strong layer of detection internal to the network. This is where you could find lateral movement, something other network-based tools cannot provide. For example, if a host is compromised (regardless of its HIPS status) and begins scanning or attacking other hosts that are not behind a monitored network chokepoint, you could rely on other HIPS agents on the network to record a log message that attacks or scans have occurred.

The following example was logged from a HIPS infrastructure:

2014-10-30 19:02:08 -0400|desktop-nyc.us.partner|10.50.225.116|2014-10-30 19:07:08.000|
The process 'C:\WINDOWS\system32\svchost -k DcomLaunch' (as user NT AUTHORITY\SYSTEM)
attempted to accept a connection as a server on TCP port 3389 from 10.50.225.242
using interface PPP\modem\HUAWEI Mobile Connect - 3G Modem\airtel.
The operation was denied.

In this alarm example, we see plenty of interesting data. We know that 10.50.225.242 tried to connect to and launch a protected Windows service (svchost) on the HIPS-protected host 10.50.225.116 (desktop-nyc.us.partner) on TCP port 3389. The connection was denied, but we can also see this host was not on our corporate network or wireless infrastructure: the network interface as reported by the HIPS is a 3G wireless modem. This alarm indicates that a remote host attempted to launch the Windows remote desktop service against one of our internal hosts while they were using their 3G modem. It would be a major concern if the client was using their modem while on the corporate network, as it creates an unprotected bridge to outside and potentially hostile networks. Remember that you can only have network-based detections at the natural gateways and chokepoints. An out-of-band network connection will go unmonitored, and it presents a significant risk to the organization. The HIPS in this case tells us exactly what happened, to whom, how, and when. It’s also possible that this alarm fired when the client had their laptop at a remote location using their 3G modem and was attacked, and the alarm was subsequently reported when the client reconnected to the corporate network and the HIPS head-end log services.

Beyond the attack data in this log message, we also have a wealth of client attribution data. Let’s say we don’t have a solid system that has a record of who or what asset had a particular IP address at a particular time. Most authentication servers will provide this information when a client authenticates to the network, but what if they were on a network segment that didn’t require authentication? From this alarm alone, we know that the system desktop-nyc.us.partner had the IP address 10.50.225.116 at least for a few minutes on October 30, 2014, around 11 p.m. UTC. We can apply this information to other event data, which can be helpful if we have other logs that implicate 10.50.225.116 in some other security alert or investigation. What would be even better in this case would be a standard hostname that included a username. Something like username-win7 could help us immediately find an owner, or at least a human we can question as part of an investigation. We can cross-check bits of metadata here with other security log indexes that may not have any user information. For example, if we are looking at a network IDS log from around the same time that has less information, we can simply refer to this event to find a hostname or username.
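A minimal sketch of that cross-check, assuming HIPS events have already been parsed into simple attribution records (the field layout here is hypothetical):

# Attribution records extracted from HIPS logs: (ip, hostname, first_seen, last_seen).
hips_attribution = [
    ("10.50.225.116", "desktop-nyc.us.partner",
     "2014-10-30T23:02:08Z", "2014-10-30T23:07:08Z"),
]

def attribute(ip, when):
    """Find which host held an IP address around a given time."""
    for addr, hostname, first_seen, last_seen in hips_attribution:
        # Plain string comparison works because the timestamps are ISO 8601 UTC.
        if addr == ip and first_seen <= when <= last_seen:
            return hostname
    return None

# Annotate an IDS alert that only knows the IP address:
print(attribute("10.50.225.116", "2014-10-30T23:05:00Z"))  # desktop-nyc.us.partner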

The following is another example that provides attribution data, as well as some collateral log data that doesn’t necessarily indicate an attack or malware, but can be revealing in the context of an investigation:

2014-10-20 07:45:52 -0800|judy32-windows7|172.20.140.227|2014-10-20 07:52:22.227|
The process 'C:\Program Files\RealVNC\VNC Server\vncserver.exe' (as user NT AUTHORITY\SYSTEM)
attempted to accept a connection as a server on TCP port 5900 from 10.1.24.101
using interface Wired\Broadcom NetXtreme Gigabit Ethernet.
The operation was allowed.

In this case, we can see the client 10.1.24.101 logged in to user judy32’s Windows 7 system using VNC (a common remote desktop sharing application). The HIPS indicated that the connection was allowed. There’s nothing here that demonstrates anything nefarious, if the VNC session was expected. If it was not expected, then it would reveal unauthorized access to judy32’s PC. If the login was authorized, and then some other malicious event occurred on the PC, we cannot say for certain whether it was judy32 or someone else causing the issue.

Limitations

HIPS has plenty of caveats, like any other tool. Primarily, it’s another piece of client software that has to be installed. If your organization doesn’t have tight control over every endpoint, then a HIPS deployment will never reach full coverage, leaving you exposed on unprotected hosts. If you already have security software running, such as antivirus, there may be client or support fatigue in keeping it all running at the expense of system performance. The HIPS software, like any other, will have system requirements that may not be satisfiable, depending on your environment, standard software images, and hardware profiles. Where hosts are not fully managed by a central IT authority, the diversity of endpoint software and configurations will require a significant investment in tuning the HIPS detection profiles. This is not that different from network IDS, but it’s an often overlooked fact about a HIPS deployment. In an opt-in-style system, imagine the possible variations of client applications that may require filtering or tuning after the HIPS is loaded onto the system.

Another consideration is the host type itself. Adding any additional software, particularly agents with potential kernel shims, introduces a bit more risk to stability and availability. A system administrator or system policy may not allow installation of any uncertified software on critical services like a domain controller, directory, or email server.

All intrusion detection requires tuning, and the HIPS is not exempt. Tuning a HIPS can be a very complex and ongoing process, depending on the complexity of the software profiles on the client PCs. The more you can safely whitelist up front, the easier the tuning will be. Categorizing log data into application exploits, network events, and known blocked malware can make developing playbook reports easier because basic detection categories of HIPS alarms are already set. Proper filtering will help reduce the HIPS server database footprint, while providing only usable security log information. Logging all *.exe downloads seems like a great idea to detect malware; however, if users are allowed to download and install anything, searching through the large resultant data set will require plenty of tuning to ignore false events. However, logging when new *.exes are launched from a protected system directory (particularly if the source process is in a user temporary directory) offers a narrow view into possible malicious activity.

As with all security event sources, if the logs are never reviewed, then the tool is dramatically less effective. A HIPS can generate an enormous amount of log data, filled mostly with benign or useless information. Monitoring fatigue will be common until precise reports are developed. Tuning the HIPS alarms to ensure you receive the most useful data (i.e., from operations monitoring the most sensitive parts of a system), as well as removing information about legitimate software, requires significant effort, but will produce an extremely helpful corpus of log data that fills the intra-network detection gap. HIPS plays an integral part in the host-based defense layer and can inform on areas that network monitoring may miss.

The essential pros and cons of host intrusion detection come down to:

§ Client-side security offers unparalleled visibility to what’s happening with a host. Network-based detection is mandatory, but good host controls provide something that the network cannot.

§ HIPS can offer additional, useful information that can help with user/host attribution.

§ Not every system can run a HIPS, and like all IDS technologies, investment in tuning the alarms is mandatory.

§ HIPS logs can be chatty (even after tuning), although they can leave invaluable clues after an incident.

§ Host-based controls are difficult to maintain and implement in areas where there is little control over the end-user computing environment (mobile devices, personally owned devices, etc.).

Hustle and NetFlow

The concept has many names: NetFlow, Jflow, Netstream, Cflowd, sflow, and IPFIX. Vendor implementations and standards definitions vary slightly, but the essence is the same. Each technology creates a record of connections—a flow—between at least two hosts, including metadata like source, destination, packet/byte counts, timestamps, Type of Service (ToS), application ports, input/output interfaces, and more. Cisco NetFlow v9, as defined in RFC 3954, is considered the industry standard, and is the basis for the IETF standard IP Flow Information Export (IPFIX). As the IPFIX RFC points out, there are multiple definitions for a “flow.” Simply put (and per the IPFIX RFC definition), a flow is “... a set of IP packets passing an Observation Point [a location in the network where IP packets can be observed] in the network during a certain time interval.” Functions applied to each flow determine the aforementioned metadata.

Unlike other data sources such as packet captures, IDS, or application logs, NetFlow is content-free metadata—simply put, this means that the details of what transpired in the connection are not reflected in the flow data itself. Historically, this has meant NetFlow was primarily suited for things like network troubleshooting or accounting, where it was more important to know that a connection happened than to know what transpired during the connection. However, though NetFlow lacks payload data, it is still useful as a security monitoring and incident response tool. In our experience, NetFlow is used in almost every investigation, whether it’s to create a timeline of compromise, identify potential lateral movement of an attacker, provide context for an otherwise encrypted data stream, or simply to understand the behavior of a host on the network during a suspected time window.

Deployment considerations

Like the other network-based detection methods, the placement of your NetFlow collection infrastructure has a huge impact on your results. Just like there are numerous versions and takes on the NetFlow/IPFIX concept, there are also a number of ways and configurations to consider when deploying flow collection.

1:1 versus sampled

Often, network administrators configure NetFlow to export only a sampled set of flows. Sampling is fine for understanding the performance and state of network connections, and helps reduce the overhead for flow storage space and traffic between a flow exporter and collector. But in the context of security monitoring or incident response, sampled NetFlow introduces gaps in the timeline analysis of a host’s behavior. Without a 1:1 flow export ratio, it’s impossible for you to understand exactly what happened before, during, and after a compromise.

We’ve heard it many times before—the fear that exporting 1:1 flows will cause performance hits on routing devices. However, 1:1 NetFlow should not cause a degradation in performance. On most platforms (dependent upon hardware), NetFlow is hardware switched, meaning that the processing happens on application-specific integrated circuits (ASICs). In effect, this offloads processing resources from the processor to hardware, so the impact on performance of the router is minimal. You must work with your network administrators to understand the need for and to configure nonsampled NetFlow data at all your export locations.

NetFlow on steroids

NetFlow has historically been a content-less data source that describes which hosts connected to whom, at what times, and over what ports, protocols, and interfaces. If you only look at flow data, information about the content of those connections or the data being transferred between hosts can only be inferred via context. Host naming standards, DNS hostnames, common port assignments—they all reveal some bit of possible information about a connection. For the purposes of this book, the authors generated a NetFlow record for what looks like a typical web connection:

Start time      2014-04-28T15:03:22Z
End time        2014-04-28T15:03:22Z
Client IP       10.10.70.15
Client port     51617
Server IP       192.168.10.10
Server port     80
Client bytes    10212
Server bytes    3606
Total bytes     13818
Protocol        tcp

From this flow record, we can see that source host 10.10.70.15 made a 10212 byte TCP transfer from port 51617 to port 80 on destination host 192.168.10.10.

Though at first glance this may look like a typical web connection, the flow record in and of itself doesn’t necessarily indicate that a client connected to a web server on port 80. You need additional context to determine if this was actually an HTTP connection. Or do you?

Neither NetFlow nor IPFIX has built-in fields for identifying applications. Application identification has historically been left to the collector, based on the flow’s port. Attackers often obfuscate their network connections simply by running services on nonstandard ports, rendering port-based application identification untrustworthy. Thankfully, the IPFIX authors, in the Information Model for IP Flow Information Export (RFC5102), included the ability for vendors to implement proprietary information elements beyond the standard elements such as port, source IP address, and protocol. Vendors can now use proprietary Deep Packet Inspection (DPI) engines to determine the application observed in a flow based on flow contents, and record that information in a flow record via custom information elements.

Cisco submitted RFC6759, an extension to IPFIX that includes new information elements identifying application information in flows such as the observed application description, application name, tunneling technology, and P2P technology. Cisco’s vendor-specific NetFlow DPI application identification implementation is called Network-Based Application Recognition (NBAR). Palo Alto Networks offers a similar classification technology called App-ID, Dell’s SonicOS calls it Reassembly-Free DPI, and ntop’s implementation is called nDPI.

What, then, should you make of the previous example flow if one of the vendor-specific IPFIX application identification extensions instead identifies the traffic as FTP, TFTP, Telnet, or Secure FTP? Is the combination of TCP port 80 and one of those protocols an indicator of subversive data exfiltration, or simply a misconfigured service? At a minimum, it’s worth contacting the host owner for follow-up, or setting up a packet capture in the hopes of collecting the entire transaction during a future occurrence.

Let’s look at real-world examples

Limiting the usage of flow data to only post-incident investigations would sell short its potential as a means for detection and response. NetFlow works best for detecting threats where understanding the content of the communication is not paramount to identifying the attack. Simple NetFlow metadata provides enough detail to confirm the results from a play detecting any connections to known malicious IP addresses or networks.

NetFlow also performs well at detecting policy violations. For example, let’s presume we have PCI or HIPAA segregated data, or a datacenter policy explicitly defining allowed service ports. We can detect potentially malicious traffic to these networks by searching for connections from external networks or on disallowed ports. Although it seems trivial, these types of reports are amazingly helpful. Consider the play where we monitor NetFlow for any outbound/Internet TCP connections to blocked ports at the edge firewall. Any results from this report demand immediate investigation, as they might reveal the firewall has been misconfigured or is malfunctioning. The best thing about these plays is that with proper network administration, they should rarely generate an alert.
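Both of these plays reduce to simple filters over flow records. Here is a minimal sketch in Python, assuming flows have been exported to a CSV whose columns match the earlier example record; the blocklist, port policy, and filename are illustrative, not from a real deployment:

import csv

MALICIOUS_NETS = ("203.0.113.",)        # hypothetical known-bad prefixes
BLOCKED_EDGE_PORTS = {135, 139, 445}    # ports the edge firewall should drop

with open("flows.csv") as f:
    for flow in csv.DictReader(f):
        dst = flow["server_ip"]
        dport = int(flow["server_port"])
        if dst.startswith(MALICIOUS_NETS):
            print("ALERT: connection to known-bad address:", dst)
        elif dport in BLOCKED_EDGE_PORTS:
            # With a healthy edge firewall, this should rarely, if ever, fire
            print("ALERT: outbound connection on blocked port:", dport, dst)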

A slightly more complicated use of NetFlow for detection is to use the data to detect UDP amplification (DoS) attacks. Simply put, UDP amplification attacks abuse the ability to spoof the source address of UDP packets and to send a very small amount of data to a service which will respond with a disproportionately larger amount of data to the spoofed source. The more disproportionate the response, the larger the amplification factor, and the bigger the attack. The first step is to block all UDP services susceptible to amplification from your network entirely. The next step is to detect those services via vulnerability scans so that you can get them patched, shut down, or filtered. Yet another step is to prevent spoofing attacks via your networking gear. However, on a large and complex network, blocking and proactively scanning for unwanted services often isn’t enough. For thorough defense in depth, you must also detect abuses of the UDP services. We’ll discuss this concept further in Chapter 9.
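As a rough sketch of the detection step, assuming the same CSV flow format as before: flag UDP flows on services known to amplify where the response dwarfs the request. The port list and the 10x threshold are illustrative choices that will need tuning for your environment:

import csv

AMPLIFIABLE = {19, 53, 123, 161, 1900}   # chargen, DNS, NTP, SNMP, SSDP
FACTOR = 10                              # alert at a 10x response-to-request ratio

with open("flows.csv") as f:
    for flow in csv.DictReader(f):
        if flow["protocol"] != "udp":
            continue
        if int(flow["server_port"]) not in AMPLIFIABLE:
            continue
        request, response = int(flow["client_bytes"]), int(flow["server_bytes"])
        if request and response / request >= FACTOR:
            print(f"Possible amplification ({response // request}x):", flow)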

Besides proactive monitoring, NetFlow excels during investigations, acting as the glue between different data sources and providing thorough timelines of network activity. No matter the initial source of an event—IDS, web proxy, external intel, employee notification—NetFlow can be used to identify all connections to or from a host or set of hosts around the time of the original event. The number of use cases is nearly limitless. NetFlow can identify outbound connections initiated after malware is dropped on a host, the possible origin of lateral movement, data exfiltration via abnormally large or sudden transfers, or additionally infected hosts communicating with a known C2 server. Even though attackers can host their infrastructure anywhere in the world, NetFlow can help detect connections to unexpected locations around the globe. If your goal is to understand what transpired on the network during a given timeframe, NetFlow is the data source you’ll need.

Limitations (and workarounds)

As with all tools, none is perfect. NetFlow suffers from data expiration, directionality ambiguity, device support, and limitations from its use of UDP as a transfer protocol.

Realities of expiration

When a flow starts, the routing device stores information about the flow in a cache. When the flow is complete, the flow is removed from the cache and exported to any configured external collectors. There are five criteria a NetFlow exporter can recognize to know when a flow has completed and is ready for export to a collector:

§ Idle flow (based on a specified time)

§ Long-lived flows (30-minute Cisco IOS default)

§ A cache approaching capacity

§ TCP session termination

§ Byte and packet counter overflow

Proper TCP session termination—FIN or RST—is the most obvious. This implies the router observed the initial TCP three-way handshake, all the way through to a proper teardown. In this case, you can be reasonably certain the connection was properly terminated, though there is also the possibility a TCP Reset attack terminated the connection prematurely. Counter overflow becomes an issue when using NetFlow v5 or v7, as those NetFlow configurations use a 32-bit counter instead of the optional 64-bit counter available in NetFlow v9. Consider lowering your cache timeout of long-lived flows if using an older version of NetFlow that doesn’t support the 64-bit counters.

Long-lived flows, a cache near capacity, and idle flows all present somewhat of a problem when using NetFlow for analysis. Most NetFlow collectors can compensate for long-lived flows that have been expired from the exporting device, via an aggregation or stitching capability. This aggregation occurs during search time. For instance, the popular open source tool nfdump will aggregate at the connection level by taking the five-tuple TCP/IP values—Protocol, Src IP Addr, [source] Port, Dst IP Addr, [destination] Port—and combining that into one result with the flow duration and packet count:

2005-08-30 06:59:54.324 250.498 TCP 63.183.112.97:9050 -> 146.69.72.180:51899 12 2198 10

Why is this important, and how might it affect your usage of NetFlow data? Consider a long-running flow for which you don’t know the beginning or end. If you query NetFlow data for that flow, does your tool only query within the window you specified, or will it pad the beginning and end, looking for additional flows to aggregate into the result presented to you? If the former, your results may silently omit fragments of the same connection that expired outside the queried window. Some tools will in fact go back in time to look for flows that should be aggregated. As a best practice, consider extending your search to include a larger time range, or test your infrastructure to see how it reacts when expired flows span a time range greater than your query.
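Stitching itself is conceptually simple. The following sketch, assuming an iterable of flow dicts using the field names from the earlier example record, merges fragments that share a five-tuple into one logical connection:

from collections import defaultdict

def stitch(flows):
    conns = defaultdict(list)
    for f in flows:
        key = (f["protocol"], f["client_ip"], f["client_port"],
               f["server_ip"], f["server_port"])
        conns[key].append(f)
    for key, frags in conns.items():
        yield {
            "five_tuple": key,
            # ISO 8601 timestamps compare correctly as strings
            "start": min(f["start_time"] for f in frags),
            "end": max(f["end_time"] for f in frags),
            "total_bytes": sum(int(f["total_bytes"]) for f in frags),
            "fragments": len(frags),
        }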

Directionality

In the context of security monitoring or incident response, we must always be able to determine the source and destination of any connection. Recall that by definition, a flow is unidirectional. You can get an idea of a connection’s direction by looking for the three-way handshake, and piecing together the source as the sender of the SYN and subsequent ACK, and the destination as the sender of the SYN-ACK. But what about UDP? What if the three-way handshake for the flow you’re observing happened well outside of the query window for the results available to you?

Unfortunately, some tools determine directionality strictly on port usage. Ports less than 1024 are considered “server” ports, and those greater than 1024 are considered client ports. For the majority of connections, this common port allocation holds true. However, security monitoring means tracking hackers, who by definition like to break things. Let’s refer back to the original NetFlow example discussed in the previous section. NetFlow identified the source host as 10.10.70.15 and the destination host as 192.168.10.10. In fact, the collector incorrectly identified the source and destination: the authors specifically crafted a simple scenario where the collector improperly assigned source and destination tags to the hosts in the flow. How, then, do you know if the client/server designation is a result of port usage (client port 51617; destination port 80), of seeing a TCP three-way handshake, or something else entirely? Ultimately, you can and should test your infrastructure, as we did for this example. But you can also look at some NetFlow metadata to get a better idea of how flows traverse your network.

NetFlow v9 exports include field 61 (DIRECTION), a binary setting indicating whether the flow is ingress or egress on the interface exporting flow data. If you know your NetFlow exporters (you do know your network, right?), being able to determine whether a connection was incoming or outgoing from a particular interface will help you to establish directionality. If you export flows from one interface on a border gateway device, and the flow DIRECTION field says the flow was ingress, you can be fairly certain the flow was coming from external and is inbound to your network. You still have the problem of knowing whether or not all of the flows for that connection are aggregated. But repeating this process for all observed flows between the two hosts in a connection will help you to identify the true source and destination of the connection.
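A small sketch of that approach, assuming your collector surfaces field 61 on each record as a direction attribute (0 for ingress, 1 for egress) exported from a border interface; the field names and helper logic are illustrative:

def orientation(flow):
    # NetFlow v9 field 61: 0 = ingress (entering the network), 1 = egress
    return "inbound" if flow["direction"] == 0 else "outbound"

def likely_external_initiator(flows):
    """Across all stitched flows between two hosts, if the earliest flow
    entered at the border, its sender is probably the external source."""
    first = min(flows, key=lambda f: f["start_time"])
    return first["client_ip"] if orientation(first) == "inbound" else None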

Device support

Not every piece of network gear you have will support NetFlow (full or sampled) or DPI capabilities, nor do you necessarily want exports from all devices that do. Like all other tools, at a minimum you need to have visibility of all ingress and egress traffic to and from your environment. Most organizations won’t have enough storage and network capacity to collect flow data between all hosts on every subnet. But by all means, if you have a segment containing your crown jewels, consider exporting NetFlow data from the aggregation router of that environment. Where you lose visibility due to NAT, consider exporting NetFlow from both in front of and behind the NAT translation. Bear in mind, though, that depending on where you place your gear, you may end up exporting duplicate data. Can your collector aggregate and account for identical flows exported at different points in the network? Can you account for the increase in storage space due to duplicate flows?

UDP

NetFlow doesn’t inherently protect the confidentiality, integrity, or availability (CIA) of your flow data. Flows are exported using UDP as a transport protocol, and all the limitations of UDP data transfer apply equally to your flows. A lack of sequencing, handshakes, and flow control, plus the potential for spoofing, all contribute to an inability to uphold the holy trinity of CIA. As a result, any precautions you take for other UDP services should be extended to your flow data. Monitor for network saturation, which could cause packet loss and incomplete flow records. Ensure you have controls like Unicast Reverse Path Forwarding (uRPF) to prevent spoofing attempts on the networks where you export and collect flow data. Overall, be aware that your flow data is subject to the same limitations presented by any service utilizing UDP for transport.

The essential pros and cons of NetFlow come down to:

§ Integral for both reactive investigative support and proactive detection plays.

§ From a security perspective, full (i.e., unsampled) NetFlow is imperative to have any chance of understanding a complete timeline of activity.

§ Modern features, such as DPI application identification, provide additional capabilities beyond those from basic NetFlow metadata.

§ Vendor support for NetFlow varies, and there are few exhaustive solutions that go beyond just flow collection.

§ Underlying NetFlow dependencies, such as UDP for transport and client search capabilities, introduce possible ambiguities and should be thoroughly understood to properly interpret and understand flow data.

DNS, the One True King

Let’s just start by saying: DNS is awesome. It’s fundamental to the Internet’s success and operation, and has many uses in the context of security monitoring and incident response. Without going into excruciating detail about how the protocol works, think how difficult it would be if, rather than getting a street address from someone, you had to use latitude and longitude coordinates to find their house. Sure, you can do it, but it’s a lot easier to remember a simple street address. In much the same way, DNS provides us with an easier Internet location service. I can type www.cisco.com into my browser rather than [2001:420:1101:1::a]. If I’m using a search engine, searching for the DNS hostname of a website will get me there much faster. And to keep things interesting, it uses (mostly) UDP. Zone transfers (TCP) can be monitored through IDS logging or DNS server logs.

There are about forty DNS record types, many for obscure or not widely adopted DNS services. However, for our purposes, we’re mostly interested in:

§ A (address record)

§ AAAA (IPv6 address record)

§ CNAME (record name alias)

§ MX (mail exchange record)

§ NS (nameserver zone record for authoritative servers)

§ PTR (pointer record, for reverse DNS lookups—resolving an IP to a hostname)

§ SOA (details about a zone’s authoritative information)

§ TXT (can be stuffed with all kinds of interesting info from malware)

These records provide the most usable information for data mining when looking for security events. Many of the other types related to Domain Name System Security Extensions (DNSSEC) or other applications are mostly relevant for troubleshooting, authentication, and DNS management.

Because malware authors are human like the rest of us, they too utilize DNS for much of their network communications. DNS hostnames are one of the most fruitful outputs of malware analysis. If we pick apart malware, we are not only looking to see what the program does to the victim’s computer, but we are also looking to see what outbound connections are made. Most often, communications leverage DNS rather than a raw IP address. Hostnames reserved by attackers are easy indicators to search for. Attackers may leverage dynamic DNS services to stay more resilient to take-downs, or they may actually purchase and reserve a list of domains from a registrar. In either case, when we have basic controls over the organization’s DNS infrastructure, we can detect and block any hostname or nameserver, including authoritative nameservers for huge swaths of domain names and hosts.

Leveraging DNS for incident response boils down to:

§ Logging and analyzing DNS transactions

§ Blocking DNS requests or responses

While many tools like IDS can be used for logging attempts by victims to resolve known bad hostnames (i.e., white/black list), the best approach is to log select DNS transactions using Passive DNS, or pDNS. Passive DNS packet collection gives you visibility into the DNS activity on your network that you won’t always get from the logs on your own recursive (caching) nameservers or from external DNS services.

To block requests to external (or even internal) hostnames, the most effective approach leverages BIND’s response policy zone feature (RPZ). RPZ allows you to substitute a normal response for a response of your own or no answer at all. This lets you lie to a client requesting a known malicious domain and instead tell it the domain doesn’t exist (NXDOMAIN). Taking it a step further, you can forge a response and point the client at a sinkhole hosting pseudoservices to collect even more information about malware that attempts to reach out to domains we redirect with RPZ. A honeypot approach combined with a DNS redirected sinkhole can be used to discover attack attributes useful for additional detection. For our purposes, we focus on pDNS and RPZ as the tools built on top of the DNS protocol, and leverage a sinkhole to collect additional intelligence as it unfolds.
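To make the mechanics concrete, here is a minimal sketch that turns an indicator list into RPZ records. The QNAME trigger syntax (an owner name with CNAME . for NXDOMAIN, or a CNAME to your sinkhole for redirection) follows BIND’s RPZ zone format; the sinkhole hostname and filenames are hypothetical:

SINKHOLE = "sinkhole.example.com."   # hypothetical sinkhole hostname

def rpz_records(domains, redirect=True):
    # "CNAME ." answers NXDOMAIN; a CNAME to the sinkhole redirects instead
    target = SINKHOLE if redirect else "."
    for d in domains:
        yield f"{d} CNAME {target}"
        yield f"*.{d} CNAME {target}"   # cover subdomains as well

with open("bad-domains.txt") as f, open("rpz.zone.inc", "w") as out:
    for record in rpz_records(line.strip() for line in f if line.strip()):
        out.write(record + "\n")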

Deployment considerations

Collecting DNS traffic or log data can be a challenge, particularly if you host your own DNS services or have a large network. There are also numerous ways to leverage the data once collected, and it all depends on your analyst capabilities and appetite for DNS-flavored metadata.

Little P, big DNS

Think for a moment how much network traffic crosses your organization’s border to the Internet. Each new connection that leverages DNS (most likely all of them) will generate a request to the authoritative DNS server. If your laptop needs to reach www.infosecplaybook.com, it will ask your organization’s DNS servers what IP address matches up with the hostname requested. If your DNS server doesn’t already know the answer (that is, have the A and AAAA records cached), then the DNS server will recursively ask the authoritative upstream DNS servers for www.infosecplaybook.com. This lookup may in turn require that the DNS server look up the authoritative server for .com and so forth. This leaves you with two possible locations to log your client’s request: on the way in to your internal DNS servers, or on the way out, to DNS servers external to your organization. Multiply this by the complexity of recursive lookups and all the clients attempting DNS resolution, and now you have millions of DNS queries and DNS responses. Even at a small organization, the number of DNS transactions can grow quickly.

There are a couple of ways to tackle this mass of data. One is to log the DNS transactions on the DNS server itself. Both BIND and Microsoft Active Directory (the two most common DNS server applications) provide options for logging client requests and server responses. With more logging comes additional burden on the server, including additional processing and configuration complexity. Server logging is certainly an option, but to remove any possibility of problems for the DNS admins brought on by your incident response team, the best solution is to capture DNS network traffic, extract the information you want, index it, and then make it available for searching. Just like a passive IDS, you can set up a pDNS sensor to collect specific traffic using something like libnmsg or ncap, which according to the DNS Operations, Analysis, and Research Center’s (DNS-OARC) official site:

is a network capture utility like libpcap (on which it is based) and tcpdump. It produces binary data in ncap(3) format, either on standard output (by default) or in successive dump files. This utility is similar to tcpdump(1), but performs IP reassembly and generates framing-independent portable output. ncap is expected to be used for gathering continuous research or audit traces.

There are two sides to DNS activity that you can passively capture. For the purposes of monitoring and incident detection, the most valuable packets to capture are the DNS queries made by your clients. With all the DNS query packets, you can determine all the domains the client attempted to resolve, or the number of clients attempting to resolve a specific domain. The other side of the DNS transaction is the responses sent back to your clients. Seeing the DNS responses can be valuable for investigating a specific malicious IP address, domain, or an infected client, as well as monitoring the evolution of an attack campaign.

With the DNS responses, you’ll be able to answer questions like “Show me all of the domains resolving to this IP” and “Show me all of the IPs this domain has resolved to.” The difference between the client queries and nameserver responses is more significant than it may seem at first. Not only do they support different aspects of the security investigation and monitoring process, but they also tend to support semantically different queries.

Client queries

If you’re going to successfully capture the query packets from clients, you have to capture as closely as possible to them. Most likely, the bulk of your client DNS traffic will be between your client and a “nearby” local nameserver. If the query packets can make it from the client to a nameserver without crossing a capture point, then instead of seeing the client make the query, you’ll see the nameserver’s data only when it queries recursively to upstream authoritative nameservers on behalf of the client. If your recursive DNS server deployment is relatively small, it may be possible to deploy a collector in front of every nameserver, and another collector at your network border to capture the stray DNS packets that weren’t destined for your local nameservers (think Internet-hosted DNS services—8.8.8.8, for example). Hosts resolving addresses using external DNS servers may have something to hide. If you have a big network with a complex local recursive nameserver deployment, you’ll need to take a more blended approach and capture DNS packets at network chokepoints just like the other tools. You may end up with good coverage of most DNS query activity, but still have some blind spots where clients are able to short-circuit your collectors and reach a recursive nameserver directly.
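As a toy illustration of the client-query side, a few lines of Python with the Scapy library can log who asked for what. This is a sketch only; purpose-built collectors such as ncap are far better suited to production volumes:

from scapy.all import sniff, DNS, DNSQR, IP

def log_query(pkt):
    # qr == 0 marks a query; responses set qr == 1
    if pkt.haslayer(IP) and pkt.haslayer(DNSQR) and pkt[DNS].qr == 0:
        qname = pkt[DNSQR].qname.decode(errors="replace")
        print(f"{pkt[IP].src} asked for {qname}")

sniff(filter="udp port 53", prn=log_query, store=0)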

Server responses

Capturing the DNS query responses is a vastly simpler problem than capturing client queries. For the most part, every response you’ll ever be interested in is for an external domain name. Because none of your local recursive nameservers will be authoritative for external domains, all DNS responses will originate from external nameservers and cross your network border into your network at least once before the response gets cached. If you deploy one collector for each Internet connection you have, you can get complete coverage for your organization. Unlike client queries, though, having complete visibility of all of the responses your organization has seen may not be enough for thorough investigations.

Broadly, there are two main reasons why. First, you may learn about domains via external intelligence feeds, but you’ve never seen your clients look up any of those domains. You don’t have historical visibility into the responses received, which muddies the waters on whether the threat is still active. Second, the responses your clients are seeing may not be the same responses other organizations are seeing, and they may not be the same responses you’ll see tomorrow. No single organization has enough data to piece together a complete picture for the current hostility of a given domain. Therefore, global visibility into DNS responses is quite valuable. Global DNS response visibility sheds light on emerging threats, as well as threat actor groups, by profiling the data. Constant changes in domain names, freshly registered and recently accessed names, domain registration patterns, and many other indicators can be analyzed and used to develop block lists and additional monitoring reports. There are a few services and intelligence feeds that provide visibility, but the current leader is Farsight Security’s DNSDB service.

RPZed

With the undeniable importance and power of DNS, the ability to block or subvert the DNS resolution process can be very powerful for incident investigation, mitigation, and containment. Just as IP addresses are clumsy for human usage (can you imagine “Visit our website at 173.37.145.84 today!”), they’re also a clumsy mitigation measure for trying to block activity related to a domain name. The natural choice for blocking or redirecting a hostname is at the recursive nameserver used by the client. DNS RPZ provides a fast, flexible, and scalable mechanism for controlling the responses returned to clients, based on security policy triggers loaded dynamically into the nameserver’s configuration. Think of DNS RPZ as a DNS firewall that can filter out certain requests from ever succeeding, depending on your block or redirect criteria.

Four policy triggers to rule them all

For maximum flexibility, RPZ provides four different types of policies for triggering a security response instead of the intended DNS response (see Table 7-3). The most straightforward policy trigger is based on the name being queried by the client (QNAME). A QNAME policy for www.bad.com will tell the nameserver to not provide the normal response back to the client. The remaining three policy triggers are based on data learned by the nameserver in the process of recursing to resolve the queried domain. The IP policy trigger allows you to provide a RPZ response for any domain that resolves to a particular IP address. The other two policy triggers enable blocking of domains, based on the IP address of their authoritative nameserver or their authoritative nameserver’s name (the NS record).

Table 7-3. Four types of policy triggers

Trigger      Client request   Server IP address   Nameserver IP address   Nameserver hostname
QNAME        X
NSIP                                              X
IP Address                    X
NSDNAME                                                                   X

With these four policy trigger types, you can block or intercept the queries for huge blocks of malicious domains. For example, if your organization blocks any known malicious IP addresses or classless interdomain routing (CIDR) ranges, you can also RPZ all of the domains that would resolve to blocked IPs or that use nameservers you have blocked. Doing this means you can RPZ domains you didn’t even know your clients were looking up. Your RPZ logs provide great context for some of the domains in your pDNS Query logs. That is, if internal clients are repeatedly resolving known bad sites, there may be lingering infections or callbacks trying to succeed.

Don’t block; subvert

The real power of DNS RPZ isn’t just the ability to block queries your clients make. Because RPZ happens in the nameservers you control, you can configure RPZ to forge a fake response to the query and redirect the client to a machine you control (a sinkhole). With a sinkhole, you can emulate common services like HTTP, set up network detections in front of it, and collect logs of the requests being made or the data being sent. This is like giving a police department the ability to swap out a drug dealer for an undercover cop mid-transaction. With data like that, the police would be much better at tracking criminal drug organizations! The actual technical way DNS RPZ redirects queries to a sinkhole is by forging a CNAME record claiming the domain looked up is actually an alias to your sinkhole machine. When you combine the data recorded in your RPZ logs with the data in your sinkhole and pDNS systems, you can monitor for incidents much better than you’d be able to do without any DNS visibility or control. For example, you can set up HTTP, SMTP, IRC, or any listeners to intercept any communication attempts to these services on the intended domain, the one you redirected with RPZ to your sinkhole.
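To illustrate how little is needed to start collecting useful telemetry, the following toy Python listener answers every HTTP request blandly while logging who connected and what they asked for. The port and log format are illustrative; a production sinkhole would emulate more services and log far more robustly:

from http.server import BaseHTTPRequestHandler, HTTPServer

class SinkholeHandler(BaseHTTPRequestHandler):
    def _handle(self):
        # Record who connected and what they asked for
        print(f"{self.client_address[0]} {self.command} "
              f"Host={self.headers.get('Host')} {self.path} "
              f"UA={self.headers.get('User-Agent')}")
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok\n")

    do_GET = _handle
    do_POST = _handle

# Port 8080 avoids needing privileges; redirect port 80 to it in practice
HTTPServer(("0.0.0.0", 8080), SinkholeHandler).serve_forever()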

Let’s look at real-world examples

There are many useful plays available through mining pDNS and RPZ/Sinkhole log data. With some additional metadata, you can find new infections by analyzing sinkhole logs for potentially compromised systems querying for previously RPZ filtered domains. DNS filtering hampers the malware by preventing successful connections to C2 servers, but also leaves a log trail of compromised hosts still trying to connect to defunct attackers. Further investigation into common indicators and unusual DNS activity will yield additional conclusive results. Sinkhole log analysis could involve looking for:

Results with no HTTP referer

From the sinkhole HTTP service logs

Results to seemingly random hostnames (i.e., domain generation algorithms [DGA]); a detection sketch for this pattern follows this list

The following table breaks down the event into metadata elements (source, hit count, domains, URL, time range) and values (192.168.21.83, 64, /wpad.dat, etc.). Note the request for wpad.dat, the Web Proxy AutoDiscovery (WPAD) JavaScript file, from these seemingly random domains:

Source:      192.168.21.83
Hit count:   64
Domains:     eumeiwqo.com
             frtqgzjuoxprjon.com
             idppqjvwwtfoj.com
             jarigtvffhkgrvz.com
             ohvxvkytfr.com
             oisjuopdi.com
             qrjnenmjz.com
             qvcquqvjl.com
             rqtdkahvoeg.com
             uzmgyvgqctou.com
             vdicplctstkpmjm.com
             xmbeuctllq.com
             xqflbszk.com
             ygyfzxkkn.com
             ysiefuwipz.com
URL:         /wpad.dat
Time range:  4h

We could reasonably assume that malware on 192.168.21.83 is attempting to reach out to those domains, and that the event most likely does not represent normal activity on that system.

Clues in the URL_String that point to data beaconing or data exfiltration

For example, URLs containing in.php, id=, lots of base64 encoding, weak encryption/XOR, configuration file downloads, tracking scripts, authentication parameters, and others. In this example, you can actually see a binary file called cfg.bin posted by the likely infected client to an unusual remote web server:

2015-07-09 01:38:58.064865 src=192.168.21.183 client_bytes=5403
dest="dunacheka.meo.ut" dest_port=tcp/80 url="/admin/cfg.bin"
http_method="POST" http_user_agent="Mozilla/4.0 (compatible;
MSIE 7.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727;
.NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET4.0C; .NET4.0E;
InfoPath.2)"

This table provides additional examples of potentially malicious URL parameters:

Source:      192.168.21.89
Hit count:   32
Domains:     bro.dubaiii.net
URL:         /pagetracer/duba/__utm.gif?param=RURJSxAAAABKAAAAAQAAAHicS0wuyczPy0vMTbVNzMksS1UrqSxItTU3MrCwNFMrLc1MsTWyMDJ2cjE2MHQxc3ZxdHU0dnJ2dDY0NzQ1dnZxMjZ0AgDVEhNK
Time range:  4h

Note the “pagetracer” in this example, followed by the “param” parameter loaded with a base64 value. Although large base64 values in URL parameters are commonplace, the name of the script combined with an assessment of the domain’s legitimacy can help shake out potential attackers or fraudulent activity.

The following table shows the client host 192.168.21.52 attempting connections to oddly named domains at presumably obfuscated scripts. Most likely, these connections are neither legitimate nor requested by a human:

Source:      192.168.21.52
Hit count:   14
Domains:     4jun3vxnu2o376llv4ynuydu5xhgwtvjqqfagcm7rfclhiwe7rmpz6eify.wonderful-nature.org
             ezwobvb2qivshlekef2ti4v7ia7tz7jhjtkmguk5yjoxhvklc32y27klde.wonderful-nature.org
             hp7xx2csnhfoo2iw5izgv235tdfiag4wmq3cmdysnhcxa6zhbhgh7ktoum.wonderful-nature.org
             l2aajjixxjspq7los7r2ebweo37at5ywiopfzf7mrwomnwp7fyin2seaby.wonderful-nature.org
URLs:        /x/?AFwVKo11t4mJnU2lWxFQtOc=
             /x/?RQHbZiOsJ5/n7yP4hq+HyWM=
             /x/?Y/lOzY7y81Fqwd/u5nS0jlo=
             /x/?ddCCQjRTxrdPgtuUx5I5wjc=
             /x/?eKhqTJcoo/pIZ117fSOqDGQ=
             /x/?hZ4qsvawxTVnlb9bNxN+c54=
             /x/?loNtE6yuyk1Tuxn3XZ1WJAc=
Time range:  4h

A higher count of lookups to that domain over time

Spikes in requests for a particular domain, particularly for domains that are rarely seen in your environment, could indicate a rash of new system compromises and C2 traffic.

Details known about the domain

For example, sourced from internal data, Google, Urlquery, threat feeds, or others. Researching domains and URLs with third-party feeds, or enriching your DNS source with threat intelligence, can significantly improve your detection rate. Knowing which requests (and clients) to flag as suspicious based on confirmed reports of malicious activity reduces the approach to simple pattern matching. In any case, blocking bad domains and waiting for victims to resolve them again may also enable you to collect additional indicator information from their clients, such as the hostnames they have recently resolved or unusual flows they may have created.
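Following up on the DGA item earlier in this list: rather than judging each name in isolation, a behavioral approach flags clients that query many distinct domains in a short window, which is the trail a DGA leaves in sinkhole or pDNS logs. A minimal sketch, with an illustrative log format and threshold:

from collections import defaultdict

def dga_suspects(records, threshold=10):
    """records: iterable of (source_ip, domain) pairs from one time window."""
    seen = defaultdict(set)
    for src, domain in records:
        # Reduce www.foo.example.com to example.com before counting
        seen[src].add(".".join(domain.rstrip(".").split(".")[-2:]))
    return {src: doms for src, doms in seen.items() if len(doms) >= threshold}

window = [("192.168.21.83", d) for d in
          ("eumeiwqo.com", "frtqgzjuoxprjon.com", "idppqjvwwtfoj.com")]
print(dga_suspects(window, threshold=3))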

As with any investigative play, the query must be tuned, removing confirmed false positives each time it’s analyzed and compared with other data sources for corroboration. Additionally, correlating the source IP, source host, or username with additional data sources (HIDS, AV, IDS, web proxy) may also show more suspicious activity and help confirm or refute the activity.

RPZ and sinkhole monitoring is also valuable when a malware outbreak or massive campaign-based attack occurs. Many exploit kits bundle and drop ransomware like Cryptolocker, Cryptowall, CTB-Locker, and others, or deploy them shortly after other infection vectors have succeeded. The ransomware infects a computer, encrypts personal files, and then demands a ransom be paid in a short period of time before the attackers/extortionists delete the files. Cryptolocker-style infections all call back to a DNS hostname; however, due to their huge infrastructure, there can be thousands of possible names to check against. Cryptolocker’s domain generation algorithm was eventually discovered, and we were able to proactively RPZ all the Cryptolocker callback domains to minimize the damage.

RPZ can also help with accidental data leakage and sneaky tricks by blocking typo’d versions of your domain name or your partners’. It can also reduce some adware served up by domain parking services that leave a token advertisement at the site of an unpurchased/inactive domain. Dynamic DNS services, very popular with attackers, are easy to block entirely by simply adding the authoritative nameserver for those domains to your RPZ filter.

Finally, we also track and measure our team against how many infections or problems we detect internally before an external entity reports something. For this reason, we generally RPZ the nameservers of third-party sinkholes so the same data shows up in our local RPZ logs. When we see internal clients attempting to resolve a domain that’s attached to a known sinkhole, we know the client is infected, without passing that information externally. Microsoft and U.S. federal agencies have shut down large botnets and widespread malware campaigns, and have sinkholed thousands of domains. Blocking the nameserver for the sinkhole with RPZ keeps any infection data we have local, offers us some additional privacy, and gives us the benefit of additional and measurable protection.

Limitations

The pDNS data provides domain name metadata analogous to how NetFlow serves up IP metadata. However, just because a client resolves a known bad domain doesn’t always mean the cause of the lookup is malicious, or that the client is infected. It also doesn’t mean any traffic was ever sent to the domain by the client. Although having only one bit of metadata from DNS can be a smoking gun, it isn’t always, and more context is usually needed. To confirm an event was malicious, pDNS/RPZ logs must be used in conjunction with other defense-in-depth data sources like your NetFlow or sinkhole logs.

It’s also important to realize that you still have to maintain a DNS collection infrastructure, even if you already have a log management system in place. Much like the web security proxies, DNS data, along with RPZ mitigation capabilities, offers a precise view into a commonplace protocol. Still, it is a disparate source to maintain in conjunction with your IT groups. After all, DNS is a critical and foundational service. Moving around zone files, editing configurations, and adding new services needs to be done with a circumspect approach and in communication with all the DNS stakeholders in IT and the organization.

Another major limitation with pDNS deployments for large networks arises when there are multiple tiers of DNS services behind an organization’s primary nameservers. Take, for example, a university department (say, biology) providing name resolution services to members of their domain through departmental Active Directory servers. Rather than querying the central IT network’s authoritative DNS servers for the school, clients in the biology department request name resolution from the nameserver provided to them by Active Directory. Generally, this is the domain controller itself, and if you detect resolution of known bad or RPZ’d domains in this scenario, you will only see the source IP address of the domain controller, not the client that initially made the request. Naturally, this makes attribution difficult, especially if the domain controllers are not logging their DNS requests. There is no current solution for pDNS collection on Windows Domain Controllers. The only option is to log the DNS service and its transactions.

The essential pros and cons of DNS monitoring and RPZ detection come down to:

§ DNS provides a fundamental source of data used in most communications, and therefore provides a wealth of information for security monitoring.

§ RPZ can shut down attacker C2 services and provide insight into your internal client activity.

§ Domain names are very common indicators shared by various groups, and it’s important to have the capability to know where and what your clients are resolving.

§ Because DNS is a critical (and deceptively complex) service, you must take caution in changing the configuration parameters, BIND versions, or any components of your DNS architecture to avoid outages.

§ Not every organization runs its own DNS servers, but packet capture can still intercept requests and responses from internal clients to Internet-provided DNS services.

HTTP Is the Platform: Web Proxies

A few years ago, we took a hard look at our detection infrastructure to determine where we might improve our capabilities. Realizing that 33% of outbound packets used HTTP, we saw that an investment in this area would have the biggest impact. At the time, the only web proxies on the network served as caching services to improve WAN link performance and to reduce bandwidth costs. If a client can load common files from a local HTTP cache and avoid using an expensive WAN or Internet connection, performance improves and bandwidth is conserved. However, the caching proxies only improved performance and offered no additional security protections. In fact, the proxies actually masked the true client source IP address behind the proxy, increasing our time to respond. That is, when our IDS or other tools alerted on outbound HTTP traffic, we could only trace back to the proxy’s IP address rather than the original client host.

NOTE

Configuring and adding the Via and X-Forwarded-For headers can help upstream detection determine the original client IP behind a web proxy.

Pairing our TCP utilization numbers with the fact that increasing volumes of exploits were sourced from compromised websites made it an obvious choice to expand our detection capabilities beyond IDS to more precise and flexible web monitoring.

It’s critical to ensure a safe web browsing environment for employees to protect the business, intellectual property, and communications with each other and customers. The weakest link in our security defense-in-depth strategy was our lack of controls around Internet web browsing. Some IP-enabled embedded devices, unreachable by client security software, were not under IT control or even patchable, meaning malware and outsider control could have manifested in these hard-to-protect areas. Balancing web and browser security against openness and a culture of research, development, engineering, and, well, Internet made it a difficult but rewarding process in the end.

Deployment considerations

Web proxies allow you to solve additional security problems at a more precise and scalable layer than NetFlow or IDS, but still at the network level. Web proxies collect only web browsing information, which means client requests and server responses. Their narrow focus affords the capacity and confidence to identify more attack patterns and traffic than broader-scoped tools can. At some point, every attack must have a callback component to notify the attacker of its success. Most commonly, callbacks occur over HTTP on TCP port 80. Many organizations allow outbound TCP port 80 on their firewalls; therefore, the callback has a better chance of connecting and avoids filtering.

WARNING

Many callbacks also occur on TCP port 443 over SSL. Some proxies have the ability to inspect SSL sessions, but only after their own SSL certificate has been installed on the client, whose user must agree to allow SSL traffic to be decrypted once before it reaches its destination.

Even if an authenticated proxy is required to access outbound Internet resources, malware can take advantage of existing sessions and proxy settings on the victim’s system. Because callbacks are leaving via HTTP, you need a proxy in place that can detect, log, and if possible, block incoming malware or outbound callbacks. A web proxy brings the ability to block web objects (HTML, plaintext, images, executables, scripts, etc.) based on preconfigured rules (signatures), intelligence feeds, or custom lists and regular expressions. This last component is key: any regular expression you can develop that identifies a malicious HTTP transaction can be used to develop a playbook report. The vast majority of our playbook centers on our web proxy logging and analysis.

Depending on the proxy product, there will be several possible configuration options. Almost all professional-grade proxies support:

§ Web caching and proxy

§ Numerous redirection methods

§ High availability or failover

§ Substantial logging facility

§ SSL inspection (man-in-the-middle)

§ Malware detection and blocking

§ Threat intelligence feeds

§ Custom policies and filters

For a smoother transition and easier support, inline transparent proxies offer the best approach. Transparent means the proxies are unknown to the web browsing clients as they pass through typical Internet web traffic. There are no client settings to modify, no proxy auto-config (PAC) files to create and distribute, and little to no support issues for configuration.

To transparently proxy, however, you must employ either an HTTP load balancer or a content routing protocol like Cisco’s Web Cache Communication Protocol (WCCP) on a chokepoint router. WCCP can intercept outbound HTTP requests (or other protocols, depending on what service groups and ports are configured) and redirect them anywhere, most likely to your web proxy anxiously waiting to make a decision to forward or drop the request. The client has no idea their HTTP request has hit a proxy, and won’t know unless they look up their IP address on a remote web server and see it’s actually the proxy’s IP address.

Other clues that you are behind a proxy can be found in the outbound HTTP headers. Properly configured, a web proxy can append additional headers like Via or X-Forwarded-For to each request, indicating the original client source IP. Configuring these headers also helps to identify client traffic from behind other proxies. When the Internet-facing security web proxy receives a web request from an internal caching proxy, if the caching proxy includes one of the client identification headers, the security proxy can recognize, log, and allow or deny that traffic. In any case, you now have a true source IP to investigate versus a possible dead end with just a caching proxy server IP. The following HTTP header examples show these headers in action:

GET / HTTP/1.1
Host: www.oreilly.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:29.0) Gecko/20100101 Firefox/29.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.2,*/*;q=0.5
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive

HTTP/1.1 200 OK
Server: Apache
Accept-Ranges: bytes
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Type: text/html; charset=utf-8
Cache-Control: max-age=14400
Expires: Sat, 17 May 2015 07:08:42 GMT
Date: Sat, 17 May 2015 03:08:42 GMT
Content-Length: 18271
Last-Modified: Fri, 16 May 2014 18:43:57 GMT
Via: 1.1 newyork-1-dmz-proxy.company.com:80
Connection: keep-alive

and

Connection: keep-alive
Host: query.yahooapis.com
Cache-Control: max-age=0
Accept: */*
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36
Referer: http://detroit.curbed.com/archives/2014/09/the-silverdome-54-photos-inside-the-ruined-nfl-stadium.php
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Cookie: X-AC=ixJG0Qqmq9R; BX=923h6vl88u22kr&b=4&d=_mbiZA5pYEI5A0OR2p6p_g45v8y9reARiupeHw--&s=83&i=mSeKQNWKVRy3IESGPi5i
X-IMForwards: 20
X-Forwarded-For: 10.116.215.244
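On the analysis side, recovering the original client from these logged headers takes only a few lines. A small sketch (the field names are illustrative); remember that X-Forwarded-For is client-supplied and spoofable, so trust only the hops you control:

def true_client_ip(headers):
    # The left-most X-Forwarded-For entry is the original client;
    # each proxy in the path appends its downstream peer to the list
    xff = headers.get("X-Forwarded-For", "")
    return xff.split(",")[0].strip() if xff else None

print(true_client_ip({"X-Forwarded-For": "10.116.215.244"}))  # 10.116.215.244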

Transparency is a critical part of a successful and accepted deployment, but where you deploy the proxy also makes a big difference (Figure 7-3). Deploying at an Internet-facing connection reduces the total number of proxies necessary for internal scrubbing, and can simplify configuration. Placing web proxies at the outer layer can also avoid conflicts with internal caching proxies that are now downstream.

NOTE

An added benefit of deploying directly at the Internet uplinks for all internal networks is the Layer 2 performance boost offered by WCCP. Connecting your proxies directly to the device running WCCP can significantly improve performance in Layer 2 mode, as redirection processing is accelerated in the switching hardware and avoids the overhead of software-switched Layer 3 generic routing encapsulation (GRE).


Figure 7-3. Web proxy deployment

The primary side effect of a transparent security proxy is that clients will not receive certain pages or objects they requested in their browser. Depending on the proxy, a configurable error message might appear, or the user may simply see a blank space in the browser window. Remember that this is a great chance to put text in front of a user for basic security awareness, with links explaining why the site or object was blocked and how to get support. We spent almost as much time crafting the language on the error pages (returned by the proxy with an HTTP 403 status) as we did on the project’s design (Figure 7-4).


Figure 7-4. Web proxy blocked request notification

Logging is another major consideration when deploying secure web proxies. As mentioned earlier, 33% of all our traffic is HTTP. Each day, we log and index around one terabyte of HTTP browsing metadata. Be prepared to store, index, and recall large volumes of HTTP transaction data. It turns out that only 1%–3% of all traffic is automatically blocked by threat intelligence feeds and anti-malware blocking. However, that small percentage represents millions of blocked objects that could have otherwise negatively impacted the web browsing clients. The rest of the transaction data is still extremely valuable, as it can be searched and profiled for what reputation scoring and signatures could not detect. Applying self-discovered intelligence to the web proxy log data will yield surprising results. If you create a generic query, you can find even more versions of the same attacks from different entities. Beyond malware, browsing data is a gold mine for investigating data loss, fraud, harassment, or other abuse issues. If you can see what sites are being used by which people, you can collect evidence of unsavory activity through the log data.

Threat prevention

Security proxies should provide several methods for preventing the import of malware through the network borders. They should also provide adequate protection and detection of suspicious outbound traffic. Callbacks represent a particularly valuable piece of information when it comes to a phishing response. If sysadmins or executives are successfully phished, the link they click will go through a proxy that can either stop the damage from occurring, or at least provide a log for future investigation. Blocking access to phishing-referenced links can protect some of the most valuable assets through the high-risk vectors of email and curiosity. Other important callbacks to detect are made by infected clients responding to check-in or health scripts run by the attacker. When a host is successfully compromised with an exploit kit or otherwise, to extract data or send commands to the victim, the attack must have some connection to deliver its instructions. In some cases, systems sit idle, periodically checking in until access to them is sold to another attacker. In both cases, a check-in status almost always appears to let the attackers know the host is still on, up, and under control.

A web proxy can distinguish and detect watering hole attacks, drive-by downloads, and other HTTP attack types. Simply put, any attack involving HTTP will go through your proxies and create a log file that can be used to develop reports. The more common a pattern and the easier to distill into a regular expression, the easier detecting exploit attempts will be. Most commonly used exploit kits attempt to foist as many exploits as possible or relevant onto a victim host in the hopes of getting a successful load. The exploit detection is best handled by client software as it attempts to execute. This only comes into play, however, if the proxy failed to block the exploit download attempt. To actually deliver an exploit, the kit must have a landing and subsequent loading page, typically via PHP or HTML, often in iframes. It’s at this point where the web proxy can detect and block the connection.

Let’s look at real-world examples

Like pDNS collection, web proxies supply so much metadata about ubiquitous network traffic that they make the perfect test lab for discovering security incidents. Web proxy data is a great place to start developing effective plays.

Backdoor downloads and check-ins

The venerable password-stealing Zeus Trojan provides an excellent example of how best to leverage the security web proxy. Even from its earliest versions, the Zeus bot downloader was served from an exploit kit hosted on a compromised website. Once an exploit successfully compromised a client system, the host would soon after make an HTTP POST to a check-in script hosted by the attackers. Most commonly, Zeus authors named the script gate.php. To detect a successful compromise, all one has to do is look for this specially crafted POST to any URL ending in gate.php. Of course, there may be legitimate sites that also run scripts called gate.php, so a bit of investigation is necessary; however, we further improved the detection by alarming only when a POST goes to gate.php with no HTTP referer. That means there was no previous link leading to the gate.php script; the client connected directly to it. HTTP requests with no referer generally occur only when someone types a web address directly into the browser navigation bar, or when an application creates a web request. In this way, it’s much easier to distinguish between human-generated traffic and computer-created traffic (the latter potentially representing malware, like Zeus bot).

To further improve precision, we can add additional regular expressions on fields like URL or User-Agent, based on how the exploit kits are reconfigured or shifted around. In this example, we can see the query we developed to find common Zeus bot compromises in the web proxy logs:

"gate" AND "php" AND cs_url="*/gate.php" AND "POST" (NOT (cs_referer="*"))

Here’s the result:

1430317674.205 - 10.20.12.87 63020 255.255.255.255 80 -5.8
http://evalift0hus.nut.cc/Spindilat/Sh0px/gate.php - 17039 798 0
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; Trident/5.0;
SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729;
Media Center PC 6.0; .NET CLR 1.1.4322; InfoPath.3)" - 503 POST

In this case, host 10.20.12.87 attempted an HTTP POST to a gate.php script hosted on evalift0hus.nut.cc with no apparent referer. Looking at the novel domain name (recently registered), examining the site itself (it does not look like a legitimate website someone would type into their browser bar), and factoring in other web browsing data from the client, it’s clear this is an example of a true positive, a Zeus infection. To contrast with this example, we removed the no-referer requirement and ran the report again. The following example shows a hit that represents a false positive:

1403077567.205 10.20.12.61 15873 199.107.64.171 80 - 4.9
http://www.idsoftware.com/gate.php - 263 607 2589 "Mozilla/5.0
(Windows NT 5.1; rv:28.0) Gecko/20100101 Firefox/28.0"
text/html 302 TCP_MISS "http://www.idsoftware.com/gate.php" POST

The client 10.20.12.61 attempted an HTTP POST to a gate.php script hosted on http://www.idsoftware.com. On further investigation, we can see that there is actually a referer (http://www.idsoftware.com/gate.php) that sends an HTTP redirect (status code 302) to itself. Visiting the site (from a secured lab browser), it’s clear that gate.php is a script that requires visitors to register their age before proceeding through “the gate” to the main website. The site is completely benign and harbors no threat of malware attack.

The differentiator here is the absence or presence of a referer. Certainly, attackers can modify their code to include benign referers in their check-in requests, but at the moment, there are plenty of exploit kits that don’t include this additional layer. It’s easy to find Zeus bot check-ins using this method, and in fact, we have successfully detected other malware families with the same report. Lazy attackers often use their exploit kit default settings, and gate.php is a common, if not default, script name.

This might seem like an overly simplistic example, yet this query is extremely effective in finding this particular malware, and its logic and methods can be easily adapted to detect additional malware strains using a similar exploit vector. It’s also important to understand that the hosts making these connections are infected with malware already. The source hosts need to be completely reinstalled to ensure all traces of the backdoors and installers are removed.
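For readers who want to adapt the logic outside any particular log-search product, here is a small Python sketch, assuming proxy logs parsed into dicts with method, url, and referer fields (the field names are illustrative):

import re

GATE = re.compile(r"/gate\.php$")

def zeus_checkin_candidates(proxy_logs):
    for rec in proxy_logs:
        path = rec["url"].split("?")[0]   # strip any query string
        if rec["method"] == "POST" and GATE.search(path) and not rec.get("referer"):
            yield rec

hits = zeus_checkin_candidates([
    {"method": "POST", "referer": "",
     "url": "http://evalift0hus.nut.cc/Spindilat/Sh0px/gate.php"},
    {"method": "POST", "referer": "http://www.idsoftware.com/gate.php",
     "url": "http://www.idsoftware.com/gate.php"},
])
print(list(hits))   # only the referer-less check-in survives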

Exploit kits

A web security proxy can also be used to detect attacks at earlier stages of exploitation. We already know it can detect callback traffic, but it also can detect exploit attempts. Detecting an exploit attempt is only partially useful. Only when we can confirm an exploit attempt has succeeded can we take significant action. Simply put, we expect exploit attempts—it’s why we deploy monitoring gear and employ people to interpret its data. We log all exploit attempts and analyze them for veracity and their impact, but a single alarm of an exploit attempt doesn’t provoke any action besides further investigation. However, combined with multiple other security event sources at various layers, proxy logs are indispensable.

For the common exploit kits deployed by those in the crimeware ecosystem, the proxy will hold its ground and deliver plenty of investigative detail. In the preceding example, we detected a post-infection callback. We could also detect the actual exploit kit infection attempt. Websites can be compromised by attackers through a variety of vulnerabilities and techniques.

NOTE

For hundreds of victims to fall prey to web-based attacks, websites themselves must first be compromised. Trojans like Gumblar, ASProx, and others have attacked website administrator credentials, as well as exploited vulnerabilities in content management software.

When a victim visits a compromised site, conveniently modified by the attacker to include an exploit kit, the process begins again, only this time the clients are the targets. The exploit kit will test the visiting client’s browser for plug-ins and plug-in versions to determine what exploits to use, or simply try them all regardless to see which ones work. It’s at this point we can leverage the web proxy for detection. Because we don’t control the compromised website, we can look at client behavior to find the attacks.

The following example shows an internal client browsing the web and then getting redirected to an exploit kit.

The client (based in Singapore) begins by intentionally browsing to the WordPress-powered site for a preschool in Singapore, shaws.com.sg. As soon as they reach the site, some interesting connections are made to lifestyleatlanta.com and www.co-z-comfort.com hosted on the same IP address:

1399951489.150 - 10.20.87.12 53142 202.150.215.42 80 - 0.0
http://www.shaws.com.sg/wp-content/uploads/2013/03/charity-carnival.png
- 329 452 331 "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64;
Trident/6.0)" - 304 TCP_MISS "http://www.shaws.com.sg/"
- - - - - 0 GET

1399951490.538 - 10.20.87.12 53146 46.182.30.95 80 - ns
http://www.lifestyleatlanta.com/hidecounter.php
- 990 316 548 "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64;
Trident/6.0)" text/html 404 TCP_MISS "http://www.shaws.com.sg/"
- - - - - 0 GET

1399951492.419 - 10.20.87.12 53145 46.182.30.95 80 - ns
http://www.co-z-comfort.com/hidecounter2.php -
3055 313 10835 "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64;
Trident/6.0)" text/html 200 TCP_MISS "http://www.shaws.com.sg/"
- - - - - 0 GET

1399951493.305 - 10.20.87.12 53142 202.150.215.42 80 - 0.0
http://www.shaws.com.sg/favicon.ico - 304
215 370 "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64;
Trident/6.0)" image/vnd.microsoft.icon 200 TCP_MISS
- - - - - - 0 GET

Shortly after those connections (as coded on the hacked WordPress site), you can see the client 10.20.87.12 attempting to GET the proxy.php script from yet another domain with a few parameters like req, num, and PHPSSESID:

1399951719.307 - 10.20.87.12 53187 255.255.255.255 80 - ns
http://yoyostylemy.ml/proxy.php?req=swf&num=5982&PHPSSESID=
njrMNruDMlmbScafcaqfH7sWaBLPThnJkpDZw-
4|MGUyZmI5MDNlMzJhMTIxYTgxN2Y5MTViMTJkZmQ0Y2I 1260 576 6531
"Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)"
application/octet-stream 200 TCP_MISS
"http://yoyostylemy.ml/proxy.php?PHPSSESID=
njrMNruDMlmbScafcaqfH7sWaBLPThnJkpDZw-
4|MGUyZmI5MDNlMzJhMTIxYTgxN2Y5MTViMTJkZmQ0Y2I" GET

In the last log, the PHPSSESID parameter indicates PHP's session-tracking mechanism, and num likely references either a random number or a number assigned to the victim. The interesting parameter is req, and in this case, req=swf. Most likely, this means the exploit kit was attempting to attack the client browser's Adobe Flash plug-in with a malicious Small Web Format (SWF) file. The 200 code indicates that the client successfully connected to the remote site; however, there's no additional data in the request, nor any subsequent HTTP request, showing a successful compromise. All we know is that the client connected:

1399951499.025 - 10.20.87.12 53148 108.162.198.157 80 - ns
http://yoyostylemy.ml/proxy.php?req=swfIE&&num=3840&PHPSSESID=
njrMNruDMlmbScafcaqfH7sWaBLPThnJkpDZw-
4|MGUyZmI5MDNlMzJhMTIxYTgxN2Y5MTViMTJkZmQ0Y2I - 1274 514 6430
"Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)"
application/octet-stream 200 TCP_MISS
"http://yoyostylemy.ml/proxy.php?PHPSSESID=
njrMNruDMlmbScafcaqfH7sWaBLPThnJkpDZw-
4|MGUyZmI5MDNlMzJhMTIxYTgxN2Y5MTViMTJkZmQ0Y2I"
- - - - - 0 GET

Next, let's add information from a host-based log to the investigation:

AnalyzerHostName=ERO-PC1|
AnalyzerIPV4=10.20.87.12|
DetectedUTC=2014-05-13 03:54:05.000|
SourceProcessName=C:\Program Files\Internet Explorer\iexplore.exe|
TargetFileName=C:\Users\epaxton\AppData\Local\Temp\~DF43538044D73DACA6.™|

We can see that the file made it through, was executed, and was detected by the host IPS. Now that we have confirmed this is indeed an exploit kit, and that the attack very nearly succeeded, we can create a report to look for this same activity using a regular expression that matches the URL and kit parameters. Additionally, we can add any domains we find to a DNS RPZ, block all connections to the IP address hosting these names, or simply add the hostnames to the proxy block list.
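As an illustration, here is a hypothetical regular expression for the kit URLs seen above, matching the req, num, and (deliberately misspelled) PHPSSESID parameters. Kit authors change parameter names and ordering between versions, so treat the pattern as a starting point rather than a finished signature.

import re

# Hypothetical pattern for the landing URLs observed above.
KIT_URL = re.compile(
    r"/proxy\.php\?req=(?P<req>\w+)&+num=(?P<num>\d+)&?PHPSSESID=",
    re.IGNORECASE,
)

def check_url(url):
    # Returns the exploit type and victim number if the URL matches.
    m = KIT_URL.search(url)
    return (m.group("req"), m.group("num")) if m else None

# Matches both logged requests, including the req=swfIE&& variant:
print(check_url("http://yoyostylemy.ml/proxy.php?req=swf&num=5982&PHPSSESID=x"))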

Limitations

Like all the tools discussed here, the security web proxy has significant limitations. Still, in the spirit of defense in depth, and because of their high yield rate, we advocate deploying web proxies wherever it makes sense. Keep in mind the challenges facing any organization considering a proxy solution. A proxy has to go in front of everyone's traffic, so performance problems and outages are immediately obvious. When email goes down in an organization, everyone notices. It's almost as bad as the power going out. Today, the same goes for HTTP. If employees cannot use the web for business applications, operations will grind to a halt. If there are significant delays watching broadcast video because of the proxy, most people will notice. There's also the chance that someone is using an application that's incompatible with an HTTP proxy.

Depending on the configuration, WCCP and the proxy can miss HTTP on nonstandard ports. WCCP and other techniques can redirect HTTP on any TCP port to your proxy. Although the vast majority of HTTP occurs over port 80, many applications opt for alternative HTTP ports. Malware is no different, and some samples will make HTTP connections on arbitrary ports. Because of their prevalence in malware callback communication, ports 80, 81, 1080, 8000, and 8080 are good choices to include in the redirect service group toward the proxy. It's not scalable to add all 65,535 TCP ports to your redirect list for the proxy. In these cases, an IDS or NetFlow (with NBAR or the equivalent) may serve a better purpose, as sketched below. Many proxy applications support more than HTTP as well, and application proxies for FTP or SOCKS are great places to retrieve even more log data.
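To illustrate the gap the proxy leaves, the sketch below uses Scapy to flag HTTP request methods sent to ports outside the redirect service group. The interface name and port list are assumptions, and a tuned IDS signature would be far more efficient in production; this is only meant as a demonstration of the blind spot.

from scapy.all import sniff, IP, TCP, Raw

# Ports already redirected to the proxy (an assumption; match your WCCP
# service group). Anything else carrying HTTP bypasses proxy logging.
REDIRECTED_PORTS = {80, 81, 1080, 8000, 8080}
HTTP_METHODS = (b"GET ", b"POST ", b"HEAD ", b"PUT ", b"OPTIONS ")

def flag_odd_http(pkt):
    if pkt.haslayer(IP) and pkt.haslayer(TCP) and pkt.haslayer(Raw):
        dport = pkt[TCP].dport
        if dport not in REDIRECTED_PORTS and pkt[Raw].load.startswith(HTTP_METHODS):
            print(f"HTTP-like traffic on nonstandard port: "
                  f"{pkt[IP].src} -> {pkt[IP].dst}:{dport}")

# Requires root privileges; the interface name is an assumption.
sniff(iface="eth0", filter="tcp", prn=flag_odd_http, store=False)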

SSL inspection will also cause issues. Intercepting and logging encrypted web traffic means breaking the fundamental trust model in SSL and inserting your organization between your clients and remote web servers' encrypted communications. This brings up a whole host of potential problems. Performance concerns notwithstanding, compatibility issues (e.g., applications that validate or pin their certificates), configuration complexity (additional certificate authorities), certificate management, and potentially privacy, legal, or regulatory exposure are all part of the baggage. In general, it's great to have the capability to read encrypted traffic streams for the purposes of incident response; however, it's not without significant disruption to normal web and SSL services.

Another limitation is that the proxy may not capture the full HTTP session headers the way an IDS or pcap would. To cope with scale and to reduce log complexity and size, many proxies log and alert only on top-level components of the HTTP headers, such as URL, host IP, referer, and HTTP actions. An IDS is a better fit for lower-level header inspection when a precise signature is available.

The essential pros and cons of web proxy log monitoring come down to:

§ HTTP data is fundamental to modern networks and many applications offer control and visibility into HTTP traffic.

§ HTTP traffic’s prevalence makes it a worthwhile log source to target.

§ Many simple reports can find malicious activity using web proxy data alone.

§ Deploying a proxy between users and their web content can require configuration overhead.

§ Even with WCCP, HTTP traffic on nonstandard web ports may be missed.

§ Performance issues or outages as a result of the proxy are highly visible to end users.

§ SSL inspection raises issues beyond the purely technical.

§ A proxy may not be able to capture full session headers.

Rolling Packet Capture

In a perfect world, we would have full packet capture everywhere. In fact, in many small environments, this capability exists. What better way to provide evidence for an investigation than to simply study or even replay a network conversation that has already occurred? However, implementing ubiquitous packet capture successfully demands great resources and engineering effort.

Deployment considerations

As with the other event sources, specifically where to deploy remains a key question. Packet captures are helpful anywhere we want to re-create an attack scenario or determine what already happened. Logging all packets into and out of the internal network can work, with proper filtering, storage, indexing, and recall capabilities. A rolling (constant) packet capture can be an acceptable solution for querying near-time packet data, with the option of searching historical data in more long-term storage. Beyond the basics of capturing and storing packet data, recall matters just as much here as with other event sources. Will you be able to not only capture packets, but also index their contents to make searching and log mining possible? If so, consider the additional size imposed by building an index or loading the packets into a database.

For rolling packet capture to work, you must also filter out unreadable or unusable data. If you are looking for payload information, IPsec or SSL traffic (unless you already have, or later discover, the private keys) is practically worthless from a packet capture standpoint. If you are only looking for network metadata, you are better off leveraging NetFlow. Broadcast, multicast, and other chatty network protocols can also be filtered to reduce the total size of the packet capture.
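As a rough sketch of a filtered rolling capture, the Python below uses Scapy to write packets into hourly pcap files while excluding SSL, multicast, and broadcast traffic. The filter expression, rotation interval, and file naming are all assumptions; in practice, tcpdump's built-in rotation options or a dedicated capture appliance will handle sustained traffic far more efficiently.

import time
from scapy.all import sniff, PcapWriter

# Drop traffic we can't read (SSL) and chatty noise (multicast, broadcast).
CAPTURE_FILTER = "not port 443 and not net 224.0.0.0/4 and not ip broadcast"
ROTATE_SECONDS = 3600  # start a new pcap every hour (an assumption)

class RollingCapture:
    def __init__(self):
        self.writer = None
        self.started = 0.0

    def write(self, pkt):
        now = time.time()
        if self.writer is None or now - self.started > ROTATE_SECONDS:
            if self.writer:
                self.writer.close()
            self.started = now
            self.writer = PcapWriter(f"capture-{int(now)}.pcap", sync=True)
        self.writer.write(pkt)

cap = RollingCapture()
sniff(filter=CAPTURE_FILTER, prn=cap.write, store=False)  # requires root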

Let's look at real-world examples

Using a network packet capture appliance or switch module, you can set up packet triggers to log data only when a particular condition occurs, or you can preconfigure a capture filter and wait for it to fire. Either packet capture scenario is, by definition, a reactive approach. Rather than attempting to record all conversations, the bulk of which are meaningless, an ad hoc solution deployed in key areas (Internet edge, client networks, datacenter gateways, partner gateways, etc.) will provide you with precise data, as well as a solid foundation for developing detection methods on other event sources based on packet data. There have been many times when we have taken sampled packet data and developed reports using NetFlow or IDS based on the original packets.

Rather than spend the millions it would take for a robust commercial packet capture, storage, and retrieval system, we opted for the ad hoc solution, developing reports for other event sources from the metadata we pulled during packet data investigations. We still use packet capture on a regular basis, but depend on log data from our other systems and detection logic to study or re-create attack scenarios. If an attack works once, it will probably be attempted again, and with our ad hoc packet capture standing by, we can use its output to enhance our other widely deployed monitoring tools by replaying captured traffic and testing our detection capabilities.

Limitations

While packet capture can provide the full, historical record of an incident, there are some prohibitive overhead components and technical concerns that can make it a less attractive event source. In most cases, the cost of a full packet capture program for a large organization is astronomical when factoring in all the computing horsepower and storage requirements. Small environments or targeted areas work best to avoid an expensive storage and recall system. Also remember that packet captures, while complete, are still just raw data. Without knowledge of what to look for and where, it doesn’t offer the instantaneous response possibilities present in other, more metadata-focused event sources.

As with other event sources, encryption can partially thwart your efforts. You can still draw inferences from simple facts, such as a flow occurring between two hosts or a single host holding a long-running connection, but without the decryption keys (and in some cases, the right timing), encryption prevents analysis of packet contents.

The essential pros and cons of full packet inspection come down to:

§ Packet captures provide a full, historical record of a network event or an attack. No other data source can offer this level of detail.

§ Packet data is not automatically summarized or contextualized; making sense of it requires analysis skills and capabilities.

§ Collecting and saving packet data takes a lot of storage, depending on archival requirements, and can be expensive.

Applied Intelligence

The U.S. Department of Homeland Security maintains the National Terrorism Advisory System. This system attempts to inform the American populace of imminent or elevated threats to their personal safety as a result of potential terrorist aggression. This is a physical security threat intelligence system. However, if there’s a credible threat, what are you supposed to do? What does elevated or red mean? There are no specific instructions other than basically, “stay tuned for more details.” This is an unfortunate tautology. If this were part of the security incident response process, we’d be stuck on the preparation phase. To have effective threat intelligence, you need more than just colors and strong words. You need to know what threat intelligence feeds or digests are available to you. You’ll need to consider how to manage threat indicators after you’ve received them. You need a system to manage the repository of intelligence data, as well as a way to manage contextual information like indicator relationships. Intelligence data on its own is not terribly helpful, but when used to color network events from your organization, it can be enlightening.

NOTE

Threat intelligence is tactical information about adversaries, tools, techniques, and procedures (TTPs) that can be applied to security monitoring and incident response.

The hope, when subscribing to a threat intelligence feed, is that you will receive vetted, actionable information about specific threats to your organization. This means that you can take the threat intelligence data and actually use it for incident response. Threat intelligence alone is like the National Terrorism Advisory System—you have information on credible (possibly even confirmed) threats, but no real information or strategy on what to do. It’s up to you to decide what to do with the intelligence when you receive it. Does the data overlap with your other commercial or freely available feeds? Do you trust their results and can you corroborate the detection? If you are automatically blocking hosts based on third-party intelligence, what happens if you get bad data? If you are sending your CTO’s laptop up for remediation, you’d better be confident you made the right call. If it comes down to a decision to reimage 5,000 hosts on your network, are you prepared to put full trust into this feed and defend its findings?

Deployment considerations

To supplement the intelligence you are developing internally, you can use third-party threat intelligence feeds to let you know what problems you already have on your network and to prepare for future incidents. This can be helpful, especially for organizations with no CSIRT, or an under-resourced security or IT operations group with no time to research on their own. Intelligence feeds typically come as a list of known indicators of malicious activity broken out into metadata such as IP address, DNS hostnames, filenames, commands, and URLs. Using feeds will enhance your existing data with additional context that can be used for detection. For example, if an IP address fires a seemingly benign alarm in your IDS events, yet it is tagged as belonging to a blacklist of IP addresses from a threat intelligence feed, your analysts have reason to take a closer look to ensure the event doesn’t have more sinister intentions. Perhaps a system account has logged in from an IP address included in a list of threat actor group subnets, or you have detected HTTP connections to watering hole sites already discovered by researchers and shared in an intelligence feed.
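A minimal sketch of that kind of enrichment follows: events whose source or destination address appears in a feed get tagged for analyst follow-up. The feed format (one IP address per line, # for comments) and the event field names are assumptions.

# A sketch of feed-based enrichment; feed and event formats are assumptions.
def load_feed(path):
    with open(path) as f:
        return {line.strip() for line in f
                if line.strip() and not line.startswith("#")}

def enrich(events, bad_ips):
    for event in events:
        hits = [ip for ip in (event["src_ip"], event["dst_ip"]) if ip in bad_ips]
        if hits:
            event["intel_hits"] = hits  # flag for closer analysis
        yield event

bad_ips = load_feed("blacklist_ips.txt")
events = [{"src_ip": "10.20.87.12", "dst_ip": "46.182.30.95", "sig": "benign-looking"}]
for e in enrich(events, bad_ips):
    if "intel_hits" in e:
        print("Escalate:", e)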

There are three types of intelligence feeds: public, private, and commercial. Finding good intelligence feeds (usually private) generally requires collaboration among industry peers willing to share information on indicators they have discovered in their operations. Some threat intel exchange groups like Defense Security Information Exchange (DSIE), Cyber Information Sharing and Collaboration Program (CISCP), and the various Information Sharing and Analysis Centers (ISACs) are industry (sometimes country) specific—and in a few cases, public/private collaborations.

NOTE

The National Council of ISACs publishes a list of member ISACs, including such groups as the following (to name a few):

§ Financial Services: FS-ISAC

§ Defense Industrial Base: DIB-ISAC

§ National Health: NH-ISAC

§ Research and Education: REN-ISAC

However, there are dozens of feed providers generally available to the public, like Abuse.ch, Shadowserver, Team Cymru, Malc0de, DShield, AlienVault, Blocklist.de, Malwaredomains, and many others. Selecting a feed really boils down to the trust you put in the organization and the quality of the feed. This is why industry partnerships and ISACs work well, as long as everyone within the groups is sharing information. Free online feeds are useful for broader coverage of less sophisticated or untargeted attacks, but will never be specific to current, advanced threat actors.

It’s also important to understand that different organizations share different types of data at different levels of classification or sensitivity. This is one reason for the Traffic Light Protocol (TLP). This protocol allows organizations to score intelligence information on a shareability scale. In other words, threat intelligence can be coded red, amber, green, or white, depending on the perceived sensitivity. The US-CERT provides the following guidance on leveraging the protocol (Figure 7-5).


Figure 7-5. Traffic Light Protocol (source: US-CERT, https://www.us-cert.gov/sites/default/files/TLP.pdf)

After you have found the feeds you intend to consume, you’ll need to prepare for importing and analysis. One of the biggest problems with threat intelligence sharing is the lack of any fully accepted standard for indicator format. There are a variety of options:

§ Structured Threat Information Expression (STIX)

§ Incident Object Description and Exchange Format (IODEF)

§ Collective Intelligence Framework (CIF)

§ OpenIOC

§ CybOX

The challenge is to build a system that can handle multiple metadata formats, as well as multiple file formats. Threat feeds can come in via XML/RSS, PDF, CSV, or HTML. Some intelligence sources are not even aggregated and have to be distilled from blog posts, email lists, bulletin boards, or even Twitter feeds.
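As a starting point, the sketch below normalizes indicators from two hypothetical feed formats, a CSV export and a plain-text list, into a single (indicator, type, source) schema. The field names and file paths are assumptions, and richer formats like STIX or IODEF would need real XML parsing on top of this.

import csv

# Normalize indicators from differently formatted feeds into one schema.
def from_csv(path, source):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):  # assumes indicator,type columns
            yield row["indicator"], row["type"], source

def from_plaintext(path, source, ioc_type):
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                yield line, ioc_type, source

indicators = set()
indicators.update(from_csv("vendor_feed.csv", "vendor-x"))
indicators.update(from_plaintext("c2_domains.txt", "abuse.ch", "domain"))
print(f"{len(indicators)} unique indicators loaded")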

After the threat data is collected and standardized (not unlike how we standardized log features), the indicators need to tie in to your security monitoring infrastructure. Applying the indicators to your existing log data delivers a broader set of previously unknown data to enable better report building for your playbook. As with all tools, there are many ways of managing indicators. Databases, flat files, or even commercial and open source management systems can store and recall indicators that can be leveraged by your log monitoring and alerting systems. Intelligence indicators can be used to both detect and prevent threats—it all depends on which security tools you enhance with indicator data. For example, you could add hostname-based indicators to your DNS RPZ configuration to prevent any internal hosts from resolving known bad hosts. You could also add IP-based indicators to your firewall policies to prevent any communications. On the detection side, you could simply watch for indicators flagged in your monitoring systems and then follow up with a more intensive investigation.
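The RPZ case can be mechanized easily. In an RPZ zone, a "CNAME ." policy record returns NXDOMAIN for the listed name. The fragment below is only a sketch: the example domains and output path are assumptions, and the generated records still need an SOA/NS header and a serial bump before a resolver will load the zone.

# Render hostname indicators as RPZ policy records (a partial zone file).
def rpz_records(domains):
    for d in sorted(domains):
        yield f"{d} CNAME ."    # NXDOMAIN for the exact name
        yield f"*.{d} CNAME ."  # ...and for any subdomain

bad_domains = {"yoyostylemy.ml", "lifestyleatlanta.com"}  # hypothetical finds
with open("rpz.blocklist.zone.part", "w") as f:
    for record in rpz_records(bad_domains):
        f.write(record + "\n")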

Like the other tools and processes, you’ll want to measure the efficacy of your applied intelligence. Knowing which feeds provide the best value can help you determine the priority in handling their outputs, the source’s trustworthiness in terms of data fidelity, and how effective the intel sources are at enhancing your log data. Therefore, it’s always a good idea to tie your intelligence indicators to their source. As you analyze events through reports in your playbook, you should be able to determine not only where an indicator was sourced, but also why an event was flagged as malicious. You can also graph relationships (e.g., trends, outliers, repetitive patterns) between disparate investigations if a common intelligence source prompted the investigations. These relationships can help to discover future incidents and highlight security architecture improvement opportunities.
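A lightweight way to begin measuring feed value is to count, per source, how often its indicators actually fire in your reports, as in the sketch below. The indicator-to-source mapping is hypothetical, and a real system would also track confirmed incidents versus false positives for each feed.

from collections import Counter

# Hypothetical mapping of indicators to the feeds that supplied them.
indicator_source = {
    "yoyostylemy.ml": "private-exchange",
    "46.182.30.95": "free-feed-a",
}
hits_by_feed = Counter()

def record_hit(indicator):
    hits_by_feed[indicator_source.get(indicator, "unknown")] += 1

record_hit("yoyostylemy.ml")
record_hit("46.182.30.95")
print(hits_by_feed.most_common())  # which feeds actually find things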

Let’s look at real-world examples

A threat intelligence system feeds a playbook nicely. It will help find known threats, and provide information about your exposure and vulnerability to those threats. You can automate threat intelligence data analysis by running queries across your security log information against reported indicators. Intelligence indicators enhance the DNS, HTTP, NetFlow, host security, and IDS event sources. You could (see the sketch after this list):

§ Take a feed of known bad C2 domains and run an automatic report looking to see what internal hosts attempt to resolve them

§ Auto-block some senders/domains based on phishing and spam feeds (prevention), and then query for any internal users afflicted by these campaigns by looking at callbacks and other indicators (detection)

§ Log and report when any internal host tries to contact a malicious URI

§ Take a specific policy-based action on a domain or URL that clients have resolved, which has been flagged in a feed with a low “reputation score”

§ Automatically create incident tracking and remediation cases based on feed data about your compromised internal hosts
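Here is the promised sketch of the first idea: joining internal DNS query logs against a feed of known C2 domains. The log layout (timestamp, client, query name per line) and the file paths are assumptions; substitute your resolver's actual log format.

# Report internal hosts resolving known C2 domains; formats are assumptions.
def load_c2_domains(path):
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def report_c2_lookups(dns_log_path, c2_domains):
    with open(dns_log_path) as f:
        for line in f:
            ts, client, qname = line.split()[:3]
            if qname.rstrip(".").lower() in c2_domains:
                yield ts, client, qname

c2 = load_c2_domains("c2_domains.txt")
for ts, client, qname in report_c2_lookups("dns_queries.log", c2):
    print(f"{ts} {client} resolved known C2 domain {qname}")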

Feeds automate the dirty work of detecting common threats and provide the security team with additional context that helps improve incident response decisions. Judgments about an external host can help analysts better understand a potential incident by providing peer-reviewed context. Ultimately, the intelligence can lead to new reports in your incident response playbook. However, subscribing to a variety of feeds alone is not a comprehensive answer to your internal security.

Limitations

Locally sourced intelligence is also highly effective. It has none of the disadvantages of a giant statistical cloud, and it can be more precise for your organization. This is particularly true when responding to a targeted attack. If you have gathered evidence from a prior attack, you can leverage that information to detect additional attacks in the future. If the attacks are targeted and unique, there may be no feed available to tell you about the threat. The other advantage of internal intelligence is that it is much more context aware than a third-party feed.

You and the rest of your IT organization—hopefully—know the function and location of your systems. With proper context (that can only be developed internally), the feed data could be extremely helpful, or worthless, depending on your response process and capabilities.

The most effective attackers will also monitor external threat feeds. If their exploit kits, hosts, or any of their other assets are implicated by a threat feed, it's time to change tactics. Attackers run their own malware hashes through various online detectors to see if their campaign has been exposed. After the attackers change their tactics, the threat feed is moot until the new tactics are analyzed.

Reputation scores, malware lists, spam lists, and the like can never be fully current. They are inherently reactive, built by gathering and analyzing events that have already happened. How many attacks have you detected where the exploit kit, the dropper, and the later-stage attacks stayed at the same hosting location for weeks? Initial attacks hit and run. It is trivial for an attacker to bring a brand-new domain and website online, install an exploit kit, compromise victims, and then discard the domain once satisfied with the bot count. A dynamic DNS provider offers a simple and common technique to burn through thousands of unique, one-time hostnames for attack campaigns. Regardless of the attack methods, a reputation score or vetted evidence cannot be calculated instantly. There will always be a lag in the detection and propagation of threat information. Because your team understands internally developed intelligence so much better, you can create higher-level, broad patterns rather than just using specific lists of known bad indicators.

To be fair to reactive threat feeds, it is called incident response, meaning we respond to an event after the fact. The key is to take action as fast as possible for situations where threats cannot be prevented.

The essential pros and cons of integrating threat intelligence come down to:

§ Your historical data, plus data researched by other security experts, gives you additional detection capabilities.

§ Legitimate threat data can be used to block attack campaigns before they reach your organization, shortening the window of opportunity for attackers.

§ Deploying threat intelligence collection requires a system to manage, prepare, and potentially share threat data.

§ Proper threat research takes a lot of time, which can make it challenging for a security operations team to keep up.

§ Valuable threat intelligence is only good if it’s fresh. Attackers change tactics and hosts so often that intelligence-based indicators only work well for a short period of time.

Shutting the Toolbox

In Figure 7-6, the rings represent various common security monitoring tools. Each 120-degree slice represents a particular threat. The green shaded areas identify tools that specialize or excel in detecting the threat listed in the slices. This doesn’t necessarily mean that the unshaded tools are unable to perform in those threat areas. It simply highlights the strengths and relative weakness of each tool in the context of these three common threats: network, host, or user anomalies; command-and-control traffic or data exfiltration; and compromised (infected) systems. It also confirms that the more data you have access to, the better your potential to detect threats.

There are so many security tools and technologies that it’s difficult to figure out the best manageable architecture. Selecting a broad group of tools with niche capabilities enables you to understand what’s most effective for your network and what’s redundant or unhelpful. It’s also important to keep in mind that tools and technologies come and go. All of us remember very helpful detection tools we’ve used in the past that have ceased development or were owned by a company that went out of business; competitive pressures have sometimes forced our hand in other ways as well.


Figure 7-6. Sample overlay of threats per detection tool

Many of us are constantly testing and trying new products, or enabling and testing features of existing products to deliver the best blend of monitoring capabilities and performance. The combinations of approaches to security monitoring seem innumerable. However, while any defense-in-depth architecture could provide the proper data for monitoring, it is the strategy and operation that truly make incident response work. The playbook is the documented strategy that's simply fed by event data from your monitoring tools. The tools are really just that—implements to help you accomplish work.

Putting It All Together

Your work, as incident responders, is defined and prescribed in the playbook. Therefore, if you put garbage into your playbook, you get garbage out of your playbook (and tools). You can create a better playbook with an appropriate toolset, a fundamental understanding of your network architecture, and an awareness of your security risk profile.

Chapter Summary

§ The right tools for your environment depend on a myriad of factors including budget, scale, familiarity with the products, and detection strategy.

§ Network-based detection systems obviate many problems with unmanaged systems, but log data or host security data is the closest to the source of trouble.

§ There are as many tools for security monitoring as there are different approaches, but tools should be selected not based on their ability to do more, but rather to do at least one thing well, while providing adequate details as to why something was detected or blocked.

§ Target individuals and critical systems with additional monitoring and technology as necessary.

§ Host and network intrusion detection will always have a place in the security monitoring toolkit, but they are only valuable if they are tuned to match your organization.

§ Focus on fundamental network traffic and applications like DNS and HTTP for highly effective monitoring.

§ NetFlow can be both a powerful detection and correlation tool.

§ Threat intelligence can be developed internally, as well as sourced from third parties. The key is to integrate validated intelligence into your playbook development and operation.