Security-related ML Datasets

This page documents some security-related, machine-learning-friendly datasets that I have found. I might use some of these for assignments in this class.

Note that sklearn provides sklearn.datasets.fetch_openml, which can be used to fetch any of the datasets below that are hosted on OpenML.
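A minimal sketch of how that fetch could look. The wrapper name load_openml, the helper class_balance, and the default version=1 are my own choices for illustration, not part of sklearn. The sklearn import happens inside the function so that the helper can be used without sklearn installed.

```python
from collections import Counter


def load_openml(name, version=1):
    """Fetch an OpenML dataset by name and return (features, target).

    The first call downloads the data; scikit-learn caches it afterwards.
    """
    # Imported lazily so class_balance() below works without scikit-learn.
    from sklearn.datasets import fetch_openml

    bunch = fetch_openml(name=name, version=version, as_frame=True)
    return bunch.data, bunch.target


def class_balance(labels):
    """Return the fraction of samples that falls in each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}
```

Assuming the spam dataset below is published on OpenML under the name "spambase", calling load_openml("spambase") should return its features and labels, and class_balance on the labels gives a quick check of how skewed the classes are.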

Here is a Jupyter notebook I wrote that demonstrates fetching various security datasets.

Spam Email

The “spam” concept is diverse: advertisements for products/websites, make money fast schemes, chain letters, pornography… Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word ‘george’ and the area code ‘650’ are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.

OpenML

Credit Card Fraud

The dataset contains transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
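To make the imbalance concrete, here is a small worked example that uses only the two counts quoted above. It shows why plain accuracy is a misleading metric for this dataset and how inverse-frequency class weights could compensate. The weighting formula, n_samples / (n_classes * class_count), is the heuristic sklearn applies for class_weight="balanced"; the function and variable names are mine.

```python
N_TRANSACTIONS = 284_807
N_FRAUD = 492
N_LEGIT = N_TRANSACTIONS - N_FRAUD

# Share of transactions that are fraudulent.
fraud_rate = N_FRAUD / N_TRANSACTIONS

# A model that always predicts "not fraud" is correct on every
# legitimate transaction and wrong on every fraud.
baseline_accuracy = N_LEGIT / N_TRANSACTIONS


def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


weights = balanced_class_weights({"legit": N_LEGIT, "fraud": N_FRAUD})

print(f"fraud rate:        {fraud_rate:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
print(f"class weights:     {weights}")
```

The do-nothing baseline already scores above 99.8% accuracy, so metrics such as precision, recall, or area under the precision-recall curve are likely a better fit here.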

Used in this AWS Sagemaker tutorial

CTU-13

Project website

The CTU-13 is a dataset of botnet traffic that was captured at CTU University, Czech Republic, in 2011. The goal of the dataset was to have a large capture of real botnet traffic mixed with normal traffic and background traffic. The CTU-13 dataset consists of thirteen captures (called scenarios) of different botnet samples. In each scenario we executed a specific malware sample, which used several protocols and performed different actions.
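For ML use, the three kinds of traffic mentioned above need to be turned into class labels. The sketch below assumes that each flow carries a free-text label containing the word "Botnet", "Normal", or "Background"; that format is my assumption about the labeled flow files, and the example label strings are made up for illustration.

```python
from collections import Counter


def traffic_class(label):
    """Collapse a free-text flow label into one of three classes."""
    lowered = label.lower()
    if "botnet" in lowered:
        return "botnet"
    if "normal" in lowered:
        return "normal"
    # Anything that is neither botnet nor known-normal counts as background.
    return "background"


# Hypothetical label strings, not copied from the dataset.
example_labels = [
    "flow=From-Botnet-Example-TCP",
    "flow=From-Normal-Example-UDP",
    "flow=Background-Example",
    "flow=Background-Example",
]

class_counts = Counter(traffic_class(label) for label in example_labels)
print(class_counts)
```

Whether background traffic should be dropped, kept as a third class, or merged with normal traffic is a modeling decision that depends on the assignment.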

“An empirical comparison of botnet detection methods” Sebastian Garcia, Martin Grill, Jan Stiborek and Alejandro Zunino. Computers and Security Journal, Elsevier. 2014. Vol 45, pp 100-123. http://dx.doi.org/10.1016/j.cose.2014.05.011

Google Scholar link, includes a PDF

Python source code for analyses they ran can be found at https://sourceforge.net/projects/botnetdetectorscomparer/

Hybrid Analysis

Hybrid Analysis provides a Threat Score from 0 to 100, which is probably just a rescaled probability score from 0 to 1.

They have an XML RSS feed and API access, which could perhaps be combined to do some interesting ML, such as trying to replicate their scoring algorithm (for academic and educational purposes only).

Finding Interesting Reports

KDD Cup 99

This is the data set used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99 The Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between ‘bad’ connections, called intrusions or attacks, and ‘good’ normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.

UCI dataset

It is built into the sklearn.datasets package and is available via sklearn.datasets.fetch_kddcup99.
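A minimal sketch of loading it and reducing the task to the ‘good’ versus ‘bad’ connection problem described above. The function names are mine. I am assuming the labels come back as byte strings such as b"normal." with a trailing period, so the helper accepts both bytes and str to be safe.

```python
def load_kddcup99(percent10=True):
    """Return (X, y) for KDD Cup 99; percent10=True fetches the 10% subset."""
    # Imported lazily so to_binary_labels() works without scikit-learn.
    from sklearn.datasets import fetch_kddcup99

    return fetch_kddcup99(percent10=percent10, return_X_y=True)


def to_binary_labels(labels, normal_label="normal."):
    """Map raw labels to 0 (normal connection) or 1 (any attack type)."""
    binary = []
    for label in labels:
        if isinstance(label, bytes):
            label = label.decode("utf-8")
        binary.append(0 if label == normal_label else 1)
    return binary
```

Calling load_kddcup99() downloads the data on first use, and passing the returned y through to_binary_labels gives a target for a binary intrusion detector.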

Microsoft Kaggle Malware Classification

Classify malware into families based on file content and characteristics

In recent years, the malware industry has become a well organized market involving large amounts of money. Well funded, multi-player syndicates invest heavily in technologies and capabilities built to evade traditional protection, requiring anti-malware vendors to develop counter mechanisms for finding and deactivating them. In the meantime, they inflict real financial and emotional pain to users of computer systems. One of the major challenges that anti-malware faces today is the vast amounts of data and files which need to be evaluated for potential malicious intent. For example, Microsoft’s real-time detection anti-malware products are present on over 160M computers worldwide and inspect over 700M computers monthly. This generates tens of millions of daily data points to be analyzed as potential malware. One of the main reasons for these high volumes of different files is the fact that, in order to evade detection, malware authors introduce polymorphism to the malicious components. This means that malicious files belonging to the same malware “family”, with the same forms of malicious behavior, are constantly modified and/or obfuscated using various tactics, such that they look like many different files.

In order to be effective in analyzing and classifying such large amounts of files, we need to be able to group them into groups and identify their respective families. In addition, such grouping criteria may be applied to new files encountered on computers in order to detect them as malicious and of a certain family.

For this challenge, Microsoft is providing the data science community with an unprecedented malware dataset and encouraging open-source progress on effective techniques for grouping variants of malware files into their respective families.

Kaggle

Microsoft Malware Prediction

Can you predict if a machine will soon be hit with malware?

The malware industry continues to be a well-organized, well-funded market dedicated to evading traditional security measures. Once a computer is infected by malware, criminals can hurt consumers and enterprises in many ways.

With more than one billion enterprise and consumer customers, Microsoft takes this problem very seriously and is deeply invested in improving security.

As one part of their overall strategy for doing so, Microsoft is challenging the data science community to develop techniques to predict if a machine will soon be hit with malware. As with their previous Malware Challenge (2015), Microsoft is providing Kagglers with an unprecedented malware dataset to encourage open-source progress on effective techniques for predicting malware occurrences.

Can you help protect more than one billion machines from damage BEFORE it happens?

Kaggle

Phishing URLs

PhishStorm

Here’s a phishing URL dataset that includes the actual URLs.

Marchal, Samuel, et al. “PhishStorm: Detecting phishing with streaming analytics.” IEEE Transactions on Network and Service Management 11.4 (2014): 458-471.

URLs dataset with features built and used for evaluation in the paper “PhishStorm: Detecting Phishing with Streaming Analytics” published in IEEE TNSM. The dataset contains 96,018 URLs: 48,009 legitimate URLs and 48,009 phishing URLs.

This is a CSV file where the “domain” column provides a unique identifier for each entry (which is actually a URL). The “label” column provides the entry's status, 0: legitimate / 1: phishing. Other columns provide computed values for the features introduced in the PhishStorm paper cited above.
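A short sketch of reading the two documented columns. Only the column names “domain” and “label” come from the description above; the two sample rows are made up so that the example runs without the real file, and the int(float(...)) conversion is a precaution in case labels are stored as "1.0" rather than "1".

```python
import csv
import io

# Made-up rows standing in for the real data file.
SAMPLE_CSV = """domain,label
example.com/index.html,0
login.example.test/secure/update,1
"""


def read_urls_and_labels(handle):
    """Return (urls, labels) from a CSV handle with domain and label columns."""
    urls, labels = [], []
    for row in csv.DictReader(handle):
        urls.append(row["domain"])
        # Accept both "1" and "1.0" style labels.
        labels.append(int(float(row["label"])))
    return urls, labels


urls, labels = read_urls_and_labels(io.StringIO(SAMPLE_CSV))
n_phishing = sum(labels)
print(f"{len(urls)} URLs: {n_phishing} phishing, {len(urls) - n_phishing} legitimate")
```

For the real file, the same function can be given a handle from open(path, newline=""); the extra feature columns are simply ignored.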

IEEE

Dataset profile page

Direct link to datafile

Sherlock

Warning: This dataset has a lot of problems. The website is down. The Kaggle dataset only includes malicious samples for a tiny time window, so it's useless for ML. The raw datafiles are only hosted on a Google Drive. The authors don't respond to access requests.

The data collection method used is cool though.

Kaggle hosts a two-week data sample from a single user here.

Project website

A long-term smartphone sensor dataset with a high temporal resolution

The primary purpose of the dataset is to help security professionals and academic researchers in developing innovative methods of implicitly detecting malicious behavior in smartphones.

Mirsky, Y., Shabtai, A., Rokach, L., Shapira, B., & Elovici, Y. (2016, October). Sherlock vs moriarty: A smartphone dataset for cybersecurity research. In Proceedings of the 2016 ACM workshop on Artificial intelligence and security (pp. 1-12). (PDF)