Preface - Network Security Through Data Analysis: Building Situational Awareness (2014)


This book is about networks: monitoring them, studying them, and using the results of those studies to improve them. “Improve” in this context hopefully means to make more secure, but I don’t believe we have the vocabulary or knowledge to say that confidently—at least not yet. In order to implement security, we try to achieve something more quantifiable and describable: situational awareness.

Situational awareness, a term largely used in military circles, is exactly what it says on the tin: an understanding of the environment you’re operating in. For our purposes, situational awareness encompasses understanding the components that make up your network and how those components are used. This awareness is often radically different from how the network is configured and how the network was originally designed.

To understand the importance of situational awareness in information security, I want you to think about your home, and I want you to count the number of web servers in your house. Did you include your wireless router? Your cable modem? Your printer? Did you consider the web interface to CUPS? How about your television set?

To many IT managers, several of the devices listed didn’t even register as “web servers.” However, embedded web servers speak HTTP, they have known vulnerabilities, and they are increasingly common as specialized control protocols are replaced with a web interface. Attackers will often hit embedded systems without realizing what they are—the SCADA system is a Windows server with a couple of funny additional directories, and the MRI machine is a perfectly serviceable spambot.

This book is about collecting data and looking at networks in order to understand how the network is used. The focus is on analysis, which is the process of taking security data and using it to make actionable decisions. I emphasize the word actionable here because effectively, security decisions are restrictions on behavior. Security policy involves telling people what they shouldn’t do (or, more onerously, telling people what they must do). Don’t use Dropbox to hold company data, log on using a password and an RSA dongle, and don’t copy the entire project server and sell it to the competition. When we make security decisions, we interfere with how people work, and we’d better have good, solid reasons for doing so.

All security systems ultimately depend on users recognizing the importance of security and accepting it as a necessary evil. Security rests on people: it rests on the individual users of a system obeying the rules, and it rests on analysts and monitors identifying when rules are broken. Security is only marginally a technical problem: information security involves endlessly creative people figuring out new ways to abuse technology, and against this constantly changing threat profile, you need cooperation from both your defenders and your users. Bad security policy will result in users increasingly evading detection in order to get their jobs done or just to blow off steam, and that creates more work for your defenders.

The emphasis on actionability and the goal of achieving security is what differentiates this book from a more general text on data science. The section on analysis proper covers statistical and data analysis techniques borrowed from multiple other disciplines, but the overall focus is on understanding the structure of a network and the decisions that can be made to protect it. To that end, I have abridged the theory as much as possible, and have also focused on mechanisms for identifying abusive behavior. Security analysis has the unique problem that the targets of observation are not only aware they’re being watched, but are actively interested in stopping it if at all possible.


Several years ago, I talked with an analyst who focused primarily on a university hospital. He informed me that the most commonly occupied machine on his network was the MRI. In retrospect, this is easy to understand.

“Think about it,” he told me. “It’s medical hardware, which means it’s certified to use a specific version of Windows. So every week, somebody hits it with an exploit, roots it, and installs a bot on it. Spam usually starts around Wednesday.” When I asked why he didn’t just block the machine from the Internet, he shrugged and told me the doctors wanted their scans. He was the first analyst I’ve encountered with this problem, and he wasn’t the last.

We see this problem a lot in any organization with strong hierarchical figures: doctors, senior partners, generals. You can build as many protections as you want, but if the general wants to borrow the laptop over the weekend and let his granddaughter play Neopets, you’ve got an infected laptop to fix on Monday.

Just to pull a point I have hidden in there, I’ll elaborate. I am a firm believer that the most effective way to defend networks is to secure and defend only what you need to secure and defend. I believe this is the case because information security will always require people to be involved in monitoring and investigation—the attacks change too much, and when we do automate defenses, we find out that attackers can now use them to attack us.[1]

I am, as a security analyst, firmly convinced that security should be inconvenient, well-defined, and constrained. Security should be an artificial behavior extended to assets that must be protected. It should be an artificial behavior because the final line of defense in any secure system is the people in the system, and people who are fully engaged in security will be mistrustful, paranoid, and looking for suspicious behavior. This is not a happy way to live your life, so in order to make life bearable, we have to limit security to what must be protected. By trying to watch everything, you lose the edge that helps you protect what’s really important.

Because security is inconvenient, effective security analysts must be able to convince people that they need to change their normal operations, jump through hoops, and otherwise constrain their mission in order to prevent an abstract future attack from happening. To that end, the analysts must be able to identify the decision, produce information to back it up, and demonstrate the risk to their audience.

The process of data analysis, as described in this book, is focused on developing security knowledge in order to make effective security decisions. These decisions can be forensic: reconstructing events after the fact in order to determine why an attack happened, how it succeeded, or what damage was done. These decisions can also be proactive: developing rate limiters, intrusion detection systems, or policies that can limit the impact of an attacker on a network.


Information security analysis is a young discipline and there really is no well-defined body of knowledge I can point to and say “Know this.” This book is intended to provide a snapshot of analytic techniques that I or other people have thrown at the wall over the past 10 years and seen stick.

The target audience for this book is network administrators and operational security analysts, the personnel who work on NOC floors or who face an IDS console on a regular basis. My expectation is that you have some familiarity with TCP/IP tools such as netstat, and some basic statistical and mathematical skills.

In addition, I expect that you have some familiarity with scripting languages. In this book, I use Python as my go-to language for combining tools. The Python code is illustrative and might be understandable without a Python background, but it is assumed that you possess the skills to create filters or other tools in the language of your choice.

In the course of writing this book, I have incorporated techniques from a number of different disciplines. Where possible, I’ve included references back to original sources so that you can look through that material and find other approaches. Many of these techniques involve mathematical or statistical reasoning that I have intentionally kept at a functional level rather than going through the derivations of the approach. A basic understanding of statistics will, however, be helpful.

Contents of This Book

This book is divided into three sections: data, tools, and analytics. The data section discusses the process of collecting and organizing data. The tools section discusses a number of different tools to support analytical processes. The analytics section discusses different analytic scenarios and techniques.

Part I discusses the collection, storage, and organization of data. Data storage and logistics are a critical problem in security analysis; it’s easy to collect data, but hard to search through it and find actual phenomena. Data has a footprint, and it’s possible to collect so much data that you can never meaningfully search through it. This section is divided into the following chapters:

Chapter 1

This chapter discusses the general process of collecting data. It provides a framework for exploring how different sensors collect and report information and how they interact with each other.

Chapter 2

This chapter expands on the discussion in the previous chapter by focusing on sensors that collect network traffic data. These sensors, including tcpdump and NetFlow, provide a comprehensive view of network activity, but are often hard to interpret because of difficulties in reconstructing network traffic.

Chapter 3

This chapter discusses sensors that are located on a particular system, such as host-based intrusion detection systems and logs from services such as HTTP. Although these sensors cover much less traffic than network sensors, the information they provide is generally easier to understand and requires less interpretation and guesswork.

Chapter 4

This chapter discusses tools and mechanisms for storing traffic data, including traditional databases, big data systems such as Hadoop, and specialized tools such as graph databases and Redis.

Part II discusses a number of different tools to use for analysis, visualization, and reporting. The tools described in this section are referenced extensively in later sections when discussing how to conduct different analytics.

Chapter 5

System for Internet-Level Knowledge (SiLK) is a flow analysis toolkit developed by Carnegie Mellon’s CERT. This chapter discusses SiLK and how to use the tools to analyze NetFlow data.

Chapter 6

R is a statistical analysis and visualization environment that can be used to effectively explore almost any data source imaginable. This chapter provides a basic grounding in the R environment, and discusses how to use R for fundamental statistical analysis.

Chapter 7

Intrusion detection systems (IDSes) are automated analysis systems that examine traffic and raise alerts when they identify something suspicious. This chapter focuses on how IDSes work, the impact of detection errors on IDS alerts, and how to build better detection systems, whether implementing an IDS using tools such as SiLK or configuring an existing IDS such as Snort.

Chapter 8

One of the more common and frustrating tasks in analysis is figuring out where an IP address comes from, or what a signature means. This chapter focuses on tools and investigation methods that can be used to identify the ownership and provenance of addresses, names, and other tags from network traffic.

Chapter 9

This chapter is a brief walkthrough of a number of specialized tools that are useful for analysis but don’t fit in the previous chapters. These include specialized visualization tools, packet generation and manipulation tools, and a number of other toolkits that an analyst should be familiar with.

The final section of the book, Part III, focuses on the goal of all this data collection: analytics. These chapters discuss various traffic phenomena and mathematical models that can be used to examine data.

Chapter 10

Exploratory Data Analysis (EDA) is the process of examining data in order to identify structure or unusual phenomena. Because security data changes so much, EDA is a necessary skill for any analyst. This chapter provides a grounding in the basic visualization and mathematical techniques used to explore data.

Chapter 11

This chapter looks at mistakes in communications and how those mistakes can be used to identify phenomena such as scanning.

Chapter 12

This chapter discusses analyses that can be done by examining traffic volume and traffic behavior over time. This includes attacks such as DDoS and database raids, as well as the impact of the work day on traffic volumes and mechanisms to filter traffic volumes to produce more effective analyses.

Chapter 13

This chapter discusses the conversion of network traffic into graph data and the use of graphs to identify significant structures in networks. Graph attributes such as centrality can be used to identify significant hosts or aberrant behavior.

Chapter 14

This chapter discusses techniques to determine which traffic is crossing service ports in a network. This includes simple lookups such as the port number, as well as banner grabbing and looking at expected packet sizes.

Chapter 15

This chapter discusses a step-by-step process for inventorying a network and identifying significant hosts within that network. Network mapping and inventory are critical steps in information security and should be done on a regular basis.

Conventions Used in This Book

The following typographical conventions are used in this book:


Italic

Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width

Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold

Shows commands or other text that should be typed literally by the user.

Constant width italic

Shows text that should be replaced with user-supplied values or by values determined by context.


This icon signifies a tip, suggestion, or general note.


This icon indicates a warning or caution.

[1] Consider automatically locking out accounts after x number of failed password attempts, and combine it with logins based on email addresses. Consider how many accounts you can lock out that way.
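The scenario in footnote 1 can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of any real login system: the `LoginService` class, the `MAX_FAILURES` threshold (standing in for the footnote's x), and the example addresses are all hypothetical. It shows how an attacker who knows only email addresses can turn an automatic lockout policy against the legitimate users.

```python
from collections import defaultdict

MAX_FAILURES = 3  # hypothetical lockout threshold (the "x" in the footnote)


class LoginService:
    """Toy login service that locks an account after MAX_FAILURES bad passwords."""

    def __init__(self, credentials):
        self.credentials = dict(credentials)  # email address -> password
        self.failures = defaultdict(int)      # email address -> consecutive failures
        self.locked = set()                   # email addresses currently locked out

    def login(self, email, password):
        if email not in self.credentials:
            return False
        if email in self.locked:
            # Once locked, even the correct password is refused.
            return False
        if self.credentials[email] == password:
            self.failures[email] = 0
            return True
        self.failures[email] += 1
        if self.failures[email] >= MAX_FAILURES:
            self.locked.add(email)
        return False


def lock_out_everyone(service, emails):
    """Submit deliberately wrong passwords until every listed account is locked."""
    for email in emails:
        for _ in range(MAX_FAILURES):
            service.login(email, "wrong-guess")
    return sorted(service.locked)


# Hypothetical user base; the attacker needs only the addresses, not the passwords.
users = {
    "alice@example.com": "correct horse",
    "bob@example.com": "battery staple",
}
service = LoginService(users)
locked = lock_out_everyone(service, users)

# The legitimate owner is now refused despite supplying the right password.
legit = service.login("alice@example.com", "correct horse")
print(locked, legit)
```

The attacker's cost here is `MAX_FAILURES` requests per account, and email addresses are usually easy to collect, which is why the automated defense doubles as a denial-of-service tool.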