Data science is the process of extracting insights, patterns and useful information from data to solve problems and make informed decisions. Its power lies in looking at large samples of data as a whole and uncovering the overall patterns. There is a growing interest in applying data science and machine learning to the cybersecurity domain in order to detect and prevent malicious activity within a network.
The aim of this article is to provide an overview of some existing challenges in cybersecurity, which we can potentially address using data science. In addition, the article highlights part of our journey at Hunt & Hackett, where we had previous experience in applying data science to various domains, but not to cybersecurity. We show what we have learned from our attempts to combine data science with cybersecurity. By sharing our experiences, we hope to help future data scientists better prepare for the challenges of cybersecurity.
Traditional cybersecurity solutions include rule-based detection, where predefined rules are set up to detect specific types of malicious activity or attack patterns. These more traditional cybersecurity methods have limitations:
There are a number of challenges that can potentially be addressed with a more data-driven approach. These include:
The cyber security landscape is dynamic. New ideas and techniques for hacking systems can emerge unexpectedly. Attackers can also take advantage of available machine learning solutions. Using data science approaches to detect malicious behaviour can provide a way to update existing solutions with new data. As an organisation's network changes over time, the model can keep pace by using new data. If the infrastructure is well prepared, deploying a new version of a model can be relatively simple compared to updating manually created rules.
This doesn't mean that manually created rules are useless, but each solution has a specific problem space for which it works best. The addition of data science fills a problem space that is difficult to solve with traditional rules. This also means that traditional rules can be used by data science as one of the inputs. Ultimately, a layered approach to intrusion detection should be adopted. This means selecting the best solution for each specific problem space. This may include a single method or more complex ensemble methods to tackle problems with different approaches at different stages.
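As a rough illustration of how rules and a model might be layered, the sketch below turns a few hand-written rules into binary features and then combines them with an anomaly score in a simple two-stage decision. The rule names, the regular expressions, the `layered_verdict` helper and the 0.8 threshold are all hypothetical, chosen only for this example; the anomaly score is assumed to come from some separately trained model.

```python
import re

# Hypothetical hand-written rules; names and patterns are illustrative only.
RULES = {
    "encoded_command": re.compile(r"\s-enc(odedcommand)?\s", re.IGNORECASE),
    "hidden_window": re.compile(r"-w(indowstyle)?\s+hidden", re.IGNORECASE),
    "download": re.compile(r"downloadstring|invoke-webrequest", re.IGNORECASE),
}


def rule_features(command):
    """Return one binary feature per rule: 1 if the rule fires, else 0."""
    return {name: int(bool(p.search(command))) for name, p in RULES.items()}


def layered_verdict(command, anomaly_score, threshold=0.8):
    """Layer 1: explicit rules. Layer 2: a model score for everything else.

    `anomaly_score` is assumed to be produced elsewhere by a trained model.
    """
    if any(rule_features(command).values()):
        return "alert"   # a known-bad pattern matched
    if anomaly_score > threshold:
        return "review"  # no rule fired, but the model finds it unusual
    return "pass"
```

The same `rule_features` dictionary could also be appended to a model's input vector, so that the model learns how much weight each rule deserves instead of treating every rule hit as a final verdict.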
While working on our data science projects for cybersecurity, we came across a number of challenges that seemed specific to the cyber domain.
Initially, we focused on collecting benign and malicious examples for our problem, with the idea of doing supervised learning. We thought that we could collect a good representation of malicious examples to train a model to distinguish between normal and abnormal behaviour. It turned out that there are far fewer malicious examples than we expected in both private and public datasets. For example, from the Nishang public dataset we were able to extract only about 200 malicious PowerShell command lines. It now seems that unsupervised or self-supervised approaches may be better suited to many cyber tasks. Because malicious events are quite rare, it may be more appropriate to train a model to learn the patterns of normal behaviour (since we have lots of "normal" data) and then flag any new sample that deviates significantly from those patterns as anomalous.
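The idea of learning only from "normal" data can be sketched with a very small token-frequency model. This is not the approach we used in production; it is a minimal illustration, and the toy command lines, the function names and the choice of threshold (the highest score seen in the normal data) are assumptions made for the example.

```python
import math
from collections import Counter


def fit_token_model(normal_commands):
    """Count how often each whitespace-separated token occurs in 'normal' data."""
    counts = Counter()
    for command in normal_commands:
        counts.update(command.lower().split())
    return counts


def anomaly_score(command, counts):
    """Average negative log-probability of the tokens (higher = more unusual)."""
    tokens = command.lower().split()
    if not tokens:
        return 0.0
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen tokens
    # Add-one smoothing gives unseen tokens a small, non-zero probability.
    return sum(
        -math.log((counts[t] + 1) / (total + vocab)) for t in tokens
    ) / len(tokens)


# Toy "normal" history, invented for illustration.
normal = [
    "powershell get-process",
    "powershell get-service",
    "powershell get-childitem c:\\users",
    "powershell get-process -name explorer",
]
model = fit_token_model(normal)

# Assumption: anything scoring above the most unusual normal command is flagged.
threshold = max(anomaly_score(c, model) for c in normal)


def is_anomalous(command):
    return anomaly_score(command, model) > threshold
```

With this toy history, a command made up mostly of unseen tokens, such as `powershell -nop -w hidden -enc aqbfafga`, scores above the threshold, while `powershell get-process` does not. A real system would need far more data and a threshold tuned against an acceptable false-positive rate.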
Our initial assumption was that we could do the preparation and cleaning of the data more or less automatically, and that we would not run into problems with extracting features from the data. However, we did run into problems with one of the projects where we wanted to classify whether a given command line was malicious or not. This involved standardizing the data, removing noise and breaking it down into proper tokens that could be used as input features for a model. In the end, we had a pre-processing pipeline that was partly manually created to clean the data.
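To give a feel for what such a pipeline involves, here is a minimal sketch of the three steps mentioned above: standardizing, removing noise and tokenizing. The specific patterns and placeholder tokens (`<ip>`, `<path>`, `<blob>`, `<num>`) are assumptions for illustration, not the rules we actually ended up with.

```python
import re

# Each rule maps a noisy, high-cardinality pattern to a stable placeholder.
# Order matters: IPs are replaced before bare numbers, paths before blobs.
NORMALIZERS = [
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), " <ip> "),
    (re.compile(r"[a-z]:\\[^\s\"']+"), " <path> "),
    (re.compile(r"\b[a-z0-9+/]{20,}={0,2}"), " <blob> "),  # e.g. base64 payloads
    (re.compile(r"\b\d+\b"), " <num> "),
]

# A token is either a placeholder or a word that may contain '-' and inner dots.
TOKEN_PATTERN = re.compile(r"<\w+>|[\w-]+(?:\.[\w-]+)*")


def preprocess(command):
    """Lowercase, replace noisy values with placeholders, then tokenize."""
    text = command.lower()
    for pattern, placeholder in NORMALIZERS:
        text = pattern.sub(placeholder, text)
    return TOKEN_PATTERN.findall(text)


print(preprocess(r"PowerShell.exe -NoP -File C:\Temp\run.ps1 -Port 4444"))
# ['powershell.exe', '-nop', '-file', '<path>', '-port', '<num>']
```

Every choice here (what counts as noise, whether `powershell.exe` stays one token, how long a string must be before it is treated as a blob) changes the features a model sees, which is why a fully automatic pipeline proved harder than we first assumed.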
For tasks that involve predicting whether a connection within a network is normal or abnormal, we encountered the problem of having only a limited amount of labelled data for evaluation. For an unsupervised model, we assume (e.g. after verification with cyber experts) that a part of the historical data represents "normal" behaviour. The challenge then becomes how to comprehensively evaluate the trained model to best understand its strengths and limitations. One possible solution is to generate a set of "normal" and "abnormal" samples, either automatically or manually with domain experts. However, this can be time-consuming and labour-intensive, so it should be planned into the project budget.
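Once a small expert-labelled set exists, even a simple comparison of model flags against labels is informative. The sketch below computes precision and recall for the "abnormal" class; the `evaluate` helper and the eight example labels are hypothetical and only show the shape of such a check.

```python
def evaluate(flags, labels):
    """Compare model flags against expert labels (True = abnormal)."""
    if len(flags) != len(labels):
        raise ValueError("flags and labels must have the same length")
    pairs = list(zip(flags, labels))
    tp = sum(1 for f, l in pairs if f and l)      # correctly flagged
    fp = sum(1 for f, l in pairs if f and not l)  # false alarms
    fn = sum(1 for f, l in pairs if l and not f)  # missed abnormal samples
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Invented example: six normal and two abnormal samples labelled by an expert.
labels = [False, False, False, False, False, False, True, True]
flags = [False, False, True, False, False, False, True, False]

print(evaluate(flags, labels))
# {'tp': 1, 'fp': 1, 'fn': 1, 'precision': 0.5, 'recall': 0.5}
```

With so few abnormal samples, each individual miss or false alarm moves these numbers a lot, so it usually pays to review the misclassified samples one by one with a domain expert as well.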
To work effectively with data, it is essential to have at least a basic understanding of the cyber security landscape. However, even after grasping the basic concepts, we often find ourselves relying on the expertise of cyber domain experts at different stages of projects. The cyber security domain can be challenging, especially for beginners. As data scientists, we may spend a lot of time immersed in it, but certain intricacies can only be fully understood with the guidance of an expert. Some corner cases may depend on specific contexts and require a broader understanding of the domain for proper interpretation.
In our journey so far we have encountered a number of challenges that have led to many lessons and ideas for the future. For the problem of the limited number of malicious examples, we think that a good research direction is to use unsupervised approaches that model the historical "normal" behaviour and detect any suspicious deviations from it. Data processing made us aware that there are many ways to standardize cyber data and that the choice of method can directly affect the results. We also found that, in order to further understand and improve our models, it may be necessary to work closely with cyber domain experts, sharing thoughts on progress so far and analysing corner cases together.
Finally, if you're a data scientist yourself, should you get into cybersecurity? If you like challenges, cybersecurity data science is a rewarding experience. Working with cyber data exposes you to real-world challenges that you wouldn't normally encounter when working with data that is simple, clean and readily available. By overcoming these challenges, you can gain valuable insights and develop a more realistic understanding of what it really means to be involved in data science. It will allow you to expand your skill set and navigate complex scenarios, ultimately enhancing your expertise in the field.