The DARPA 1999 IDS Evaluation Data Set is the most commonly used publicly available data set for intrusion detection. It consists of 5 sets captured from activity in a simulated environment, one for each of the 5 weeks of the capture. The first and third weeks are attack-free and are usually used for training an intrusion detection system, while the second, fourth, and fifth weeks contain malicious traffic. The data set contains raw traffic data stored in PCAP files. If you prefer a preprocessed version, the KDD Cup 1999 Dataset provides preprocessed network traffic from the DARPA 1998 Data Set as CSV files. However, these data sets might be obsolete, since they are now 18 years old. So I looked for other intrusion detection data sets that contain raw network traffic and list them here.
ISCX 2012 IDS Evaluation Data Set
This dataset contains 7 days of network traffic stored in PCAP format. The labels are stored separately in XML format. I wrote a simple script that separates the malicious and benign traffic in a PCAP file. If you need the dataset, just email the author and they will give you temporary access to download the PCAP files.
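To illustrate how XML labels can drive such a split, here is a minimal sketch. It is not the actual script. The element names (`source`, `sourcePort`, `destination`, `destinationPort`, `protocolName`, `Tag`), the `Attack`/`Normal` values, and the sample record are assumptions made for illustration, so check them against the real label files before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of a label file. The element names and the
# "Attack"/"Normal" values are assumptions, not taken from the data set.
SAMPLE_LABELS = """<dataroot>
  <flow>
    <source>192.168.1.10</source>
    <sourcePort>4431</sourcePort>
    <destination>10.0.0.5</destination>
    <destinationPort>80</destinationPort>
    <protocolName>tcp_ip</protocolName>
    <Tag>Attack</Tag>
  </flow>
  <flow>
    <source>192.168.1.11</source>
    <sourcePort>5000</sourcePort>
    <destination>10.0.0.6</destination>
    <destinationPort>443</destinationPort>
    <protocolName>tcp_ip</protocolName>
    <Tag>Normal</Tag>
  </flow>
</dataroot>"""


def load_flow_labels(xml_text):
    """Map (src, sport, dst, dport, proto) to the flow's tag."""
    root = ET.fromstring(xml_text)
    labels = {}
    for flow in root:
        key = (
            flow.findtext("source"),
            int(flow.findtext("sourcePort")),
            flow.findtext("destination"),
            int(flow.findtext("destinationPort")),
            flow.findtext("protocolName"),
        )
        # Simplification: a later record with the same key overwrites an
        # earlier one, and flow start/stop times are ignored here.
        labels[key] = flow.findtext("Tag")
    return labels


def is_malicious(packet_key, labels):
    """Check a packet's 5-tuple against the labels in either direction."""
    src, sport, dst, dport, proto = packet_key
    reverse_key = (dst, dport, src, sport, proto)
    tag = labels.get(packet_key) or labels.get(reverse_key)
    return tag == "Attack"
```

A packet reader would build `packet_key` from each packet's addresses and ports, then write the packet to the malicious or the benign output file depending on `is_malicious`.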
This dataset consists of 2 days of recorded traffic, about 100 GB in total. Apart from the PCAP files, the author also provides preprocessed CSV files and results from Bro and Argus.
HTTP Attack Traffic Data Set by Roberto Perdisci
This dataset contains HTTP traffic from various shellcode. It also has the traffic of morphed shellcodes. You can download and run their proposed IDS as well.
An 11 GB set of anonymised packet header traces, stored in PCAP format.
Other interesting data sets
These data sets do not necessarily contain raw network traffic, but could still be useful for IDS related research.
HTTP Attacks against web servers
ADFA Intrusion Detection Data Set (For Host-based IDS)
2 thoughts on “Network Traffic Data Set for Intrusion Detection System Research”
Hello, I have read your scripts for handling the ISCX 2012 dataset.
I’m confused about the time conversion in iscx_dataset_splitter.py. I know the timestamps in the PCAP file and the XML file are inconsistent, but I wonder why you subtract 4 hours and leave a 10-minute margin before and after the written time.
Second, I found that the number of attack packets extracted differs from the sum of the netflows in the XML files. Could you explain this?
Third, have you ever written a script to divide the packets into different netflows based on the XML files?
Thanks a lot!
1. The 4 hours accounts for the timezone difference. From where I live now, there is apparently a 4-hour difference. The 10 minutes gives some margin around the actual time written in the CSV file, because the times in the CSV file and the PCAP file sometimes differ slightly.
2. I haven’t checked that, but I did notice that there are duplicated records. That could be the cause.
3. No, I haven’t.
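The time handling described in answer 1 can be sketched roughly as follows. This is an illustration, not the code from iscx_dataset_splitter.py. The timestamp format and the function names are assumptions, and the 4-hour offset is specific to the timezone the script was run in.

```python
from datetime import datetime, timedelta

# Offset from the discussion above; it depends on the local timezone of
# the machine running the script, so treat it as a parameter.
TIMEZONE_OFFSET = timedelta(hours=4)
# Margin added on both sides because label and PCAP times differ slightly.
MARGIN = timedelta(minutes=10)


def capture_window(start_text, stop_text, fmt="%Y-%m-%dT%H:%M:%S"):
    """Turn labelled start/stop times into a widened PCAP time window.

    The timestamp format is an assumption; adjust fmt to the label files.
    """
    start = datetime.strptime(start_text, fmt) - TIMEZONE_OFFSET
    stop = datetime.strptime(stop_text, fmt) - TIMEZONE_OFFSET
    return start - MARGIN, stop + MARGIN


def in_window(packet_time, window):
    """Return True if a packet timestamp falls inside the widened window."""
    earliest, latest = window
    return earliest <= packet_time <= latest
```

A wider margin matches more packets per labelled flow. That makes it less likely that packets are missed, but more likely that packets from unrelated flows with the same addresses and ports are included.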