Applied Data Science

Choosing the right data sets and sensor technologies for creating new machine learning algorithms is challenging ¹. Researchers have assembled well-defined intrusion datasets and used them to study novel threat hunting approaches, yet these data sets are tightly coupled to specific sensor types and their biases. If we are to further evolve the application of data science to network security, we should be more discriminating in the way we look at network sensors.

This post reviews different types of network security devices in terms of bias, fidelity, coverage and locale. It provides a conceptual framework to drive disruptive applications of data science in the network security field.

Before we start, let's talk briefly about data science. Data science and machine learning have been applied in the network security domain for many years now ². Yet we are still struggling to make sense of their applications.

How can we leverage sensor bias, fidelity, coverage and locale to move from a reactive defensive posture to a more proactive one? Data science is about understanding data provenance, transforming data into useful perspectives for pattern hunting, extracting value from the data, and communicating the results through useful visualizations. Underlying this viewpoint is a process, a continuum, that includes data acquisition, transformation, analysis, and visualization. Yet our ability to extract value, that is, knowledge, from the data is limited by the data source (how we observe the world around us) coupled with our ability to extract patterns from that data.

Sensor data fusion has been used in a wide variety of ways in network security defense ³. In developing a fusion model, we should define how the data comes together based on the differing sensor characteristics. Traditionally, fusing data by IP address and/or time has proven effective in making sense of threat behaviors. But perhaps there are other perspectives that can be brought out from the data sets to improve our understanding of cyber threats and system health.
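As a rough illustration of fusing by IP address and time, the sketch below pairs each alert with the flows from the same source IP that start within a time window of the alert. The `Alert` and `Flow` record shapes, the field names, and the five minute window are all assumptions made for this example, not the schema of any particular sensor.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record from a rule-based sensor."""
    time: datetime
    src_ip: str
    signature: str


@dataclass(frozen=True)
class Flow:
    """Hypothetical flow record from a high coverage sensor."""
    start: datetime
    src_ip: str
    dst_ip: str
    n_bytes: int


def fuse_by_ip_and_time(alerts, flows, window=timedelta(minutes=5)):
    """Pair each alert with flows sharing its source IP and starting within `window` of it."""
    # Index flows by source IP so each alert only scans its own candidates.
    flows_by_ip = {}
    for flow in flows:
        flows_by_ip.setdefault(flow.src_ip, []).append(flow)

    fused = []
    for alert in alerts:
        related = [
            f for f in flows_by_ip.get(alert.src_ip, [])
            if abs(f.start - alert.time) <= window
        ]
        fused.append((alert, related))
    return fused


# Made-up sample data: one alert and three flows.
t0 = datetime(2021, 1, 1, 12, 0, 0)
alerts = [Alert(t0, "10.0.0.5", "example-signature")]
flows = [
    Flow(t0 + timedelta(minutes=2), "10.0.0.5", "203.0.113.9", 1200),  # same IP, in window
    Flow(t0 + timedelta(hours=1), "10.0.0.5", "203.0.113.9", 400),     # same IP, too late
    Flow(t0 + timedelta(minutes=1), "10.0.0.7", "203.0.113.9", 90),    # different IP
]
fused = fuse_by_ip_and_time(alerts, flows)
```

Only the first flow is attached to the alert here. Widening the window, or keying on something other than the source IP, is exactly the kind of alternative perspective the paragraph above hints at.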

By collecting and transforming data from other sensor types, say those with higher coverage, we can find novel patterns in network traffic. Contrast this with the partial coverage that comes from managing alerts generated by IDS systems. High coverage sensors like flow data can provide a completely different understanding of your network. The difference in scale between these two methods is, frankly, astonishing. Let's start to define what sensor bias, fidelity, coverage and locale mean.

Patterns involving millions of flows, from hundreds of thousands of devices, viewed interactively by leveraging flow data. (Taken from a Sonalysts presentation at MIT.)

Cyber Threat Understanding is Complex: First Define our Data Science Objectives, then Choose our Data

Cyber threat intelligence (CTI) represents large data sets used to uncover threat vectors within compromised systems, where the compromise may not have been known for a very long time. CTI open standards have incorporated descriptive relationships, in terms of knowledge graphs, showing how the indicators/observables are related to each other. In this context, threat hunting is a very complex operation that connects multiple data points ⁴.

Which comes first, the objective or the data? Sometimes our data science objectives are limited by our data sources. This limitation can hold back much of the advancement of new ideas within a domain like network security. Threat hunting is one of the focuses in network security.

Cyber threats and cyber operations are distributed and decentralized, and they operate over concurrent and asynchronous time intervals. Cyber operations governed by a nation state can be executed by multiple teams working together to gain access to information. These operations can be executed within the same week, or run over extended cycles that dig into our supply chains. Yet our defensive posture has grown from asking the question “what is happening now?” based on rules created from known and existing threats. These rules are known as countermeasures. How can we move past this reactive posture?

The threat is distributed and decentralized, and operates over multiple time scales.

Defining Sensor Bias, Fidelity, Coverage and Locale

Sensor bias is the limit imposed by how a sensor is tuned to observe. For example, IDS/IPS and firewalls are predominantly rule-based systems. You need to know what you are looking for to create a rule, also known as a countermeasure. Contrast this with network flow data or pcap data, where there is no filtering: all the data is present and can be used to find anomalies and to define what is normal. Sensor bias then leads to sensor fidelity.

Sensor fidelity can be measured in terms of how much metadata is gathered per event produced, or per flow. Pcap data has the highest fidelity because it includes the actual packet data, which can also be found in IDS/IPS data. But not every flow or pcap data set produces an IDS event: no event is created if no countermeasure, or rule, is set up for it, or if the sensor is placed on the network or host where certain activity cannot be seen. Let's talk about coverage of the attack surface next.

Sensor coverage is the notion of how well a sensor covers the attack surface of an enterprise based on its locale. PCAP and flow data have complete coverage of a network from their locale. But PCAP data is captured at very high fidelity and can overwhelm monitoring systems, so not all of the pcap data collected can be analyzed. Network flow data has less fidelity than PCAP data because it aggregates packets into complete flows, where one flow record summarizes the communication between the remote and local systems. This leads to sensor placement, or sensor locale.

Sensor locale is an important concept in that sensors can be placed within the internal network, in DMZs, and/or external to an organization’s network. Each network locale will see different types of network activity. For example, packets NAT’d outside the firewall make internal traffic difficult to interpret, but make traffic that is blocked by the firewall easier to interpret.

Having a deeper understanding of sensor bias, fidelity, coverage and locale can drive the acquisition of data, its orientation and its transformation, to better make sense of what is happening in our networks.

Why is it Important to Look at Sensors this way?

Choosing the right data set to create new machine learning algorithms is challenging. We need to better understand the strengths and limitations of our sensors before we do so. Understanding sensor bias, fidelity, coverage and locale can provide new perspectives on network security and our situational awareness. The following diagram represents my basement network security lab, which I will use to talk through each sensor: what it can see (locale), the type of data it gathers (bias), and the volume of data collected over time (fidelity).

When you look at this network architecture you will see that it is outward focused, that is, all internal traffic is NAT’d through the firewall, and those specific devices are “hidden”. Someday I will move, or add in another TAP, and look at internal system behaviors.

In a large enterprise, with thousands of internal devices, more of the network monitoring will be done on the inside of the network.

Home network with a network tap and monitoring tools, e.g. Kali, SiLK, Argus, Bro, Suricata, pfSense.

To orient ourselves in defining normal patterns in complex systems, we need a mix of data that spans many locales, and has a relatively lower fidelity to manage data velocity challenges.

Firewall Data

There are a number of different categories of firewalls: host-based firewalls, server-based/appliance-based firewalls, packet filtering servers, bastion servers, circuit-level firewalls, stateful-inspection firewalls, next-generation firewalls, cloud/virtual-firewalls (firewall as a service). Some folks look at firewalls as glorified routers.

Firewall data has lower fidelity than PCAP data, but higher fidelity than flow data. However, firewalls only create alerts based on rules, so not every network activity is represented, unlike with flow and PCAP data. Firewalls can be stood up on the periphery of a network, as well as between major systems on the inside of a network.

I chose to install and use pfSense in my home network, upgrading from OpenBSD. In this case a firewall is defined as a packet filtering device with multiple stages that can be used to analyze network traffic based on rules. The following is a quick example of some starting rules from my old OpenBSD setup. Essentially, you will be blocking or passing traffic based on a rule focused on an IP, a port, etc.

The following rule is focused on blocking incoming non-routable IP addresses.

This rule blocks incoming non-routable IPv4 addresses.
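The logic behind such a rule can be sketched in Python with the standard `ipaddress` module. This is only an illustration of the decision the firewall makes, not pf syntax and not the rule from my configuration. The list of ranges treated as non-routable (the RFC 1918 private ranges plus loopback and link-local) and the function names are assumptions for this example.

```python
import ipaddress

# Assumed set of "non-routable" IPv4 ranges: RFC 1918 private space,
# loopback, and link-local. A real rule set may list more or fewer.
NON_ROUTABLE = [
    ipaddress.ip_network(net)
    for net in (
        "10.0.0.0/8",
        "172.16.0.0/12",
        "192.168.0.0/16",
        "127.0.0.0/8",
        "169.254.0.0/16",
    )
]


def is_non_routable(addr: str) -> bool:
    """Return True if the IPv4 address falls inside one of the assumed ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in NON_ROUTABLE)


def filter_incoming(src_ip: str) -> str:
    """Mimic a block/pass decision for a packet arriving on the external interface."""
    return "block" if is_non_routable(src_ip) else "pass"


decisions = {ip: filter_incoming(ip) for ip in ("192.168.1.10", "8.8.8.8", "172.31.255.1")}
```

A packet arriving from the outside with a source address in one of these ranges is almost certainly spoofed or misrouted, which is why it makes sense as an early block rule.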

On the pfSense box you can dump the data being streamed to the pflog0 interface and review it. In this case you can see that there is a block rule match.

Dumping the pfSense log and looking at the data.

IDS/IPS Data

Intrusion detection systems are a different type of network security appliance. These appliances use rules to identify threat activity based on a number of networking attributes such as IP blacklisting and port usage, but most importantly packet inspection.

IDS data can have a variable amount of fidelity based on how the rule is set up. The coverage of these systems is on par with firewalls, but below PCAP and flow data.

The following are some rules that you can find in Snort IDS. The first looks for an SSH buffer overflow with the content “/bin/sh”, and the next looks for specific content, “|90|”, within a packet.
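To make the idea of content matching concrete, here is a minimal Python sketch of what a content check does to a packet payload. It is not Snort rule syntax and not how Snort is implemented. It only handles the content string itself, where bytes between pipe characters are written in hex, and ignores every other rule option. The function names are made up for this example.

```python
def parse_content(pattern: str) -> bytes:
    """Convert a content string into raw bytes; text between | characters is hex."""
    out = bytearray()
    # Splitting on "|" alternates between literal text and hex sections.
    for i, part in enumerate(pattern.split("|")):
        if i % 2 == 0:
            out += part.encode("ascii")   # literal text
        else:
            out += bytes.fromhex(part)    # hex bytes such as "90" or "90 90"
    return bytes(out)


def matches(payload: bytes, contents) -> bool:
    """Return True only if every content pattern appears somewhere in the payload."""
    return all(parse_content(c) in payload for c in contents)


# Made-up payloads for illustration.
shell_hit = matches(b"AAAA/bin/shBBBB", ["/bin/sh"])
nop_hit = matches(b"\x90\x90\x90\x90payload", ["|90|"])
no_hit = matches(b"ordinary traffic", ["/bin/sh", "|90|"])
```

The sketch also shows the bias discussed earlier: a payload that contains neither pattern produces no event at all, so the sensor is blind to anything its rules do not describe.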

The following is a dump of a snort log.

Packet Capture Data (PCAP)

Packet capture data is managed through tools like Wireshark that leverage libpcap to hook into an interface, capturing and analyzing PCAP data.

PCAP data has the highest fidelity of the data compared in this post, and therefore has the highest processing and storage requirements. Each packet captured from the network consists of a time, source IP, destination IP, source port, destination port, protocol, length, and the captured packet contents.

Wireshark tool displaying packet capture data.

Network Flow Data

NetFlow was initially developed by Cisco to collect metrics on the traffic traversing a network from a specific ingress point. Generally the metrics covered volume and network traffic types, e.g. ICMP, UDP, etc.

Flow data consists of a tuple containing source and destination IPs, source and destination ports (ICMP traffic doesn’t have a port), protocol information (e.g. TCP flags), number of bytes, number of packets. Like PCAP data, flow data has some of the highest coverage in terms of network infrastructure, but, the fidelity of data is a lot lower than PCAP data. Flow data is like a summary usage report between two different devices.
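The relationship between packets and flow records can be sketched as a simple aggregation. The example below groups packets by source IP, destination IP, source port, destination port and protocol, then keeps only packet and byte counts plus first and last seen times. It is a simplified, unidirectional sketch: the `Packet` type is hypothetical, and real flow exporters also handle timeouts, TCP flags and other details that are left out here.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical packet header summary (no payload)."""
    time: float
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    length: int


def aggregate_flows(packets):
    """Collapse packets into one summary record per 5-tuple."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0, "start": None, "end": None})
    for p in packets:
        key = (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.protocol)
        rec = flows[key]
        rec["packets"] += 1
        rec["bytes"] += p.length
        rec["start"] = p.time if rec["start"] is None else min(rec["start"], p.time)
        rec["end"] = p.time if rec["end"] is None else max(rec["end"], p.time)
    return dict(flows)


# Made-up sample: three packets in one direction, one in the other.
packets = [
    Packet(0.0, "10.0.0.5", "203.0.113.9", 51000, 443, "tcp", 60),
    Packet(0.1, "203.0.113.9", "10.0.0.5", 443, 51000, "tcp", 1500),
    Packet(0.2, "10.0.0.5", "203.0.113.9", 51000, 443, "tcp", 400),
    Packet(0.9, "10.0.0.5", "203.0.113.9", 51000, 443, "tcp", 52),
]
flows = aggregate_flows(packets)
```

Four packets become two records, and the payload is gone entirely. That loss is the fidelity trade-off that makes flow data manageable at scale.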

NetFlow data is captured using a flow exporter, a collector, and an analyzer. The CERT Network Situational Awareness (NetSA) team at Carnegie Mellon's Software Engineering Institute provides a complete set of tools that one can use to tap into a network and analyze its data, called the System for Internet-Level Knowledge (SiLK). The flow exporter could be a switch with a monitor port through which flow data is exported. Flow data could also be exported using a software tool like CERT's YAF (Yet Another Flowmeter). YAF is used to create IPFIX flow data from the raw network traffic collected at a specific locale. The collector is used to store the flow data, and the analyzer to analyze it.

Data Science and Driving the OODA Cycle

Data science can be seen as driving much of the OODA cycle (Observe, Orient, Decide and Act) as we manage our business processes and government functions. In 2010, a Cyber OODA decision-making model was proposed to address OODA in the context of cyber operations ⁶. Observations can be made by acquiring data from various sources such as market data, network security appliances, astronomical observations from telescopes, health data from patients, or remote sensors. Unfortunately, the data source can both drive and limit our situational awareness: it can leave gaps in our analysis and limit our ability to extract useful and meaningful patterns. To find these limits, it is important to quantify the differences in our data sources' ability to observe our problem domains.

This post begins to define a conceptual framework that can be used to compare and contrast data sources in terms of sensor bias, fidelity, coverage and locale in the network security domain. Let's take a deeper look into this conceptual framework, and ask the question: what portions of the cyber kill chain are we observing, and which are we striving to observe?
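One way to make the framework tangible is to write each sensor down as a small profile and compare them. The sketch below is only an illustration: the `SensorProfile` type, the ordinal scores and the locale labels are assumptions. The scores simply encode the orderings stated in this post (PCAP fidelity above firewall above flow, PCAP and flow coverage above firewall and IDS), and IDS fidelity, which is variable, is given a placeholder value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SensorProfile:
    """Hypothetical profile of a sensor along the four dimensions."""
    name: str
    rule_based: bool   # bias: does the sensor only report what a rule matches?
    fidelity: int      # ordinal score, higher means more metadata per event
    coverage: int      # ordinal score, higher means more of the traffic is seen
    locale: str        # where the sensor sits (labels are placeholders)


SENSORS = [
    SensorProfile("pcap", rule_based=False, fidelity=3, coverage=2, locale="perimeter"),
    SensorProfile("firewall", rule_based=True, fidelity=2, coverage=1, locale="perimeter"),
    # IDS fidelity varies with how each rule is set up; 2 is a placeholder.
    SensorProfile("ids", rule_based=True, fidelity=2, coverage=1, locale="perimeter"),
    SensorProfile("flow", rule_based=False, fidelity=1, coverage=2, locale="perimeter"),
]


def candidates_for_baselining(sensors):
    """Sensors without rule bias, highest coverage first, then lowest fidelity first."""
    unbiased = [s for s in sensors if not s.rule_based]
    return sorted(unbiased, key=lambda s: (-s.coverage, s.fidelity))


ranked = [s.name for s in candidates_for_baselining(SENSORS)]
```

Preferring lower fidelity among the unbiased sensors reflects the earlier point that defining normal patterns calls for data that is light enough to manage at high velocity. A different objective would call for a different sort key.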

Applying and Contrasting Bias, Fidelity and Locale

Traditional managed security services (MSS) and technologies provide a spreadsheet-like way of navigating event data produced by rule-based systems. Each alert is orchestrated in terms of mitigation. These systems are used to detect and respond to alerts created from their sensors.

By leveraging high coverage sensor types like flow data ⁵, you can start to look at network data very differently compared to traditional alert monitoring systems. This can allow the analyst to:

Sift through larger time windows, and volumes of data, looking for patterns
Annotate and filter out known issues from the data set, leveraging the increased sensor coverage to find patterns that have not yet been discovered
Shift threat hunting into a more probabilistic domain, providing key insight as device and network behaviors shift from the normal to the abnormal
Connect alerts and anomalies over longer periods of time, from hours and days to weeks and years
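Two of the steps above, filtering out annotated known issues and scoring how far a device has drifted from its own normal, can be sketched as follows. The record layout, the per-day byte totals and the z-score are assumptions chosen to keep the example short. They stand in for whatever baseline model is actually used and are not the method behind the visualizations in this post.

```python
from statistics import mean, stdev


def filter_known(flows, known_issue_ips):
    """Drop flows from sources an analyst has already annotated as known issues."""
    return [f for f in flows if f["src_ip"] not in known_issue_ips]


def daily_bytes(flows, device):
    """Total bytes per day for one device, ordered by day."""
    totals = {}
    for f in flows:
        if f["src_ip"] == device:
            totals[f["day"]] = totals.get(f["day"], 0) + f["bytes"]
    return [totals[day] for day in sorted(totals)]


def z_score(history, latest):
    """How many standard deviations `latest` sits from the mean of `history`."""
    sigma = stdev(history)
    return 0.0 if sigma == 0 else (latest - mean(history)) / sigma


# Made-up data: five ordinary days and one unusual day for a single device.
flows = [
    {"day": day, "src_ip": "10.0.0.5", "bytes": b}
    for day, b in enumerate([1000, 1100, 950, 1050, 1020, 5000])
]
# A noisy but already explained source, e.g. an annotated backup job.
flows.append({"day": 5, "src_ip": "10.0.0.9", "bytes": 999999})

cleaned = filter_known(flows, known_issue_ips={"10.0.0.9"})
series = daily_bytes(cleaned, "10.0.0.5")
score = z_score(series[:-1], series[-1])
```

A score far above a chosen threshold (three is a common starting point) flags the last day as abnormal for this device, with no rule describing what the abnormal traffic looks like.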

A time-based visualization of flow data taken from a presentation at MIT (Sonalysts, Inc). The visualization uses transformed flow data and represents millions of flows. Flow data has higher network coverage than other sensor types, which lets researchers find interesting patterns in data that extend over larger time frames.

Conclusions and Summary

This post defined four concepts used in comparing and contrasting network sensors. Doing so can facilitate data scientists' decision making in choosing the right sensor, or combination of sensors, for future model development. This perspective allowed me to start defining system behavior analytics back in 2006. Since system behavior analytics began evolving then, leveraging flow data, little has been done to move the technique forward. System behavior analytics has facilitated the evolution of threat detection away from a focus on vulnerabilities, flipping the way we look at network defense to start looking at threat behaviors and/or abnormal system behaviors.

Over the years, network security has evolved from a process centered solely on events tied to system vulnerabilities (CVE) and operational aspects (CPE), to something more threat-centric (MITRE ATT&CK framework). Yet we are essentially focused only on the threat, and are missing an important understanding of computer systems and cyber physical systems, which is defined by the question: What is normal?

References

1. Ankit Thakkar and Ritika Lohiya, "A Review of the Advancement in Intrusion Detection Datasets", Procedia Computer Science, vol. 167, 2020, pp. 636-645.

2. O. McCusker, S. Brunza and D. Dasgupta, "Deriving behavior primitives from aggregate network features using support vector machines," 2013 5th International Conference on Cyber Conflict (CYCON 2013), 2013, pp. 1-18.

3. Guoquan Li, Zheng Yan, Yulong Fu, Hanlu Chen, "Data Fusion for Network Intrusion Detection: A Review", Security and Communication Networks, vol. 2018, Article ID 8210614, 16 pages, 2018. https://doi.org/10.1155/2018/8210614

4. Milajerdi, S.M., Eshete, B., Gjomemo, R., et al. (2019) POIROT: Aligning Attack Behavior with Kernel Audit Records for Cyber Threat Hunting. Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, London, 11-15 November 2019, 1795-1812. https://doi.org/10.1145/3319535.3363217

5. McCusker, O., Brunza, S., Carvalho, M., Dasgupta, D., & Vora, S. (2013). A combined discriminative and generative behavior model for cyber physical system defense. Proceedings - 2013 6th International Symposium on Resilient Control Systems, ISRCS 2013, 144-149. https://doi.org/10.1109/ISRCS.2013.6623767

6. Christian Sorensen, "Cyber OODA: Towards a Conceptual Cyberspace Framework", 2010. https://apps.dtic.mil/sti/pdfs/AD1019164.pdf