As security practitioners, we are often called upon to dig into the details of the latest threats, but are we really seeing the big picture? What could we learn by looking at larger sets of data, and what techniques should be used to do it?
Jay Jacobs is a Principal on the Verizon RISK team, and one of the co-authors of the Verizon Data Breach Investigations Report. I think of him as a security incident whisperer, because he’s been applying data science to these challenging security questions. Jay and Bob Rudis have written a book on gathering, analyzing and visualizing security data called Data Driven Security, which was officially released this week.
Our paths first crossed many years back, and I wanted to catch up with Jay and get some of his perspectives on what he’s seeing today.
Brian: Jay, what is data driven security all about?
Jay: Data driven security is about improving your ability to learn from security data. A good data-driven security program combines security expertise, programming, data management, statistics and data visualization techniques. We cover these skills in our book, and walk through several hands-on examples with real (and downloadable) data. We aren’t inventing anything new here, but rather we’re taking analytical techniques and practices used in other disciplines and applying them to information security.
Brian: Let’s talk a bit about network security. In a recent blog post, you talk about instrumenting the data collection process by running a number of Internet-facing honeypots. These honeypots log security events that might get overlooked, such as the ports that get probed and how they trend over time. Where do you start with all of this information? Do you have a set of questions that you want to answer, or is it more like detective work to follow the breadcrumbs and see where they go?
Jay: The primary reason for data driven security is to use data to answer questions that will improve your learning and consequently help you make more informed decisions. It’s always about answering questions. In some cases, like with the honeypot example, we want to “follow the breadcrumbs” and explore the data, and find the possibilities and limits for what the data can tell us.
Sometimes just looking at the data without an initial purpose will yield obvious results. As we learn more, we can develop follow-up questions that we want to answer. For example, with the port scans in the honeypot data, we can ask and answer questions around the origin of the scans, and we might make a map using geolocation of the IP addresses, but so what? What would we do with that information?
However, we could also ask something like “what services are scanned the most?” which could be used to inform our decisions around the use of default ports and firewall policies. We could even ask how often a particular service is scanned to get a sense for the exposures that we face daily.
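As a rough illustration of the “what services are scanned the most?” question, here is a minimal Python sketch. The event list and the `port_summary` helper are hypothetical and are not taken from Jay’s honeypot data or the book; a real honeypot log would need its own parsing step first.

```python
from collections import Counter, defaultdict

# Hypothetical honeypot events: (source IP, destination port).
# The addresses come from documentation-only ranges.
events = [
    ("198.51.100.7", 22),
    ("198.51.100.7", 22),
    ("203.0.113.9", 22),
    ("203.0.113.9", 3389),
    ("192.0.2.44", 27977),
    ("192.0.2.44", 22),
]


def port_summary(events):
    """Return (probe counts per port, distinct source counts per port)."""
    scans = Counter(port for _, port in events)
    sources = defaultdict(set)
    for src, port in events:
        sources[port].add(src)
    distinct = {port: len(ips) for port, ips in sources.items()}
    return scans, distinct


scans, distinct = port_summary(events)
# List ports from most to least probed, with the number of distinct sources.
for port, count in scans.most_common():
    print(f"port {port}: {count} probes from {distinct[port]} sources")
```

Counting distinct sources alongside raw probes helps separate a port that many hosts are looking for from one that a single noisy scanner keeps hitting.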
Brian: One of your findings indicates that port scans aren’t all the same. Some ports are scanned more frequently (by a greater number of sources) than others, and they change over time. Is there any correlation between the spikes and the publication of a specific vulnerability?
Jay: That’s a great question and something we could research by correlating long-term port scan data with specific vulnerabilities. Logically, whoever is doing the scanning is expecting something in return. If most of the scans were malicious in nature, we would expect most of the port scans to go after known vulnerabilities. For example, when I first started watching traffic like this, TCP port 27977 was heavily scanned. At first, I couldn’t figure out why that odd port was targeted so often. After some research, I found that the TDSS malware establishes an open SOCKS relay on that port when it infects a machine. And I found that many others were looking for that port to support click-fraud. However, these days that port barely shows up as a target port for scans.
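One way to see a port rise and fade as a target, as port 27977 did, is to count probes per day for that port. The sketch below assumes probe records have already been reduced to a date and a destination port; the sample records and the `daily_counts` helper are made up for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical records: (day of the probe, destination port).
probes = [
    (date(2014, 1, 1), 27977),
    (date(2014, 1, 1), 27977),
    (date(2014, 1, 1), 22),
    (date(2014, 1, 2), 27977),
    (date(2014, 1, 3), 22),
]


def daily_counts(probes, port):
    """Count probes per day for one destination port, in date order."""
    counts = Counter(day for day, p in probes if p == port)
    return sorted(counts.items())


trend = daily_counts(probes, 27977)
for day, count in trend:
    print(day.isoformat(), count)
```

Days with no probes for the port are simply absent from the result. Lining a series like this up against vulnerability publication dates would be a separate step that this sketch does not attempt.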
Watch for Part 2 of this interview on February 20.
We'll be covering many of these subjects at Ignite 2014, which takes place in Las Vegas March 31-April 2. Cybersecurity Industry Best Practices is among our list of marquee session tracks, which you can view here.