The DARPA 1999 IDS Evaluation Data Set is the most commonly used publicly available data set for intrusion detection research. It consists of five sets captured from the activity in a simulated environment, one for each of the five weeks of the capture. The first and third weeks are attack-free and are usually used to train an intrusion detection system, while the second, fourth, and fifth weeks contain malicious traffic. The data set contains raw traffic stored in PCAP files. If you prefer a preprocessed version, the KDD Cup 1999 Data Set provides preprocessed network traffic from the DARPA 1998 Data Set in CSV files. However, these data sets may be obsolete, since they are now 18 years old. So I looked for other intrusion detection data sets that contain raw network traffic and list them here.

ISCX 2012 IDS Evaluation Data Set

This data set contains 7 days of network traffic stored in PCAP format. The labels are stored separately in XML format. I made a simple script to separate the malicious and benign traffic in a PCAP file. If you need the data set, just email the authors and they will give you temporary access to download the PCAP files.
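As a rough illustration of how such a split could work, here is a minimal sketch, not the actual script. It assumes each labelled flow carries a source, a destination, start and stop times, and a tag; the element names, the sample XML, and the helper functions are all made up for the example and may not match the real ISCX files. The 4-hour offset and 10-minute margin follow the explanation in the comments below and would need adjusting for your own timezone.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta

# Hypothetical label snippet: the element names are assumptions for
# illustration and may differ from the real ISCX 2012 XML files.
SAMPLE_XML = """
<dataroot>
  <flow>
    <source>192.168.1.10</source>
    <destination>10.0.0.5</destination>
    <startDateTime>2010-06-12T23:57:00</startDateTime>
    <stopDateTime>2010-06-12T23:58:00</stopDateTime>
    <Tag>Attack</Tag>
  </flow>
  <flow>
    <source>192.168.1.20</source>
    <destination>10.0.0.5</destination>
    <startDateTime>2010-06-12T23:57:00</startDateTime>
    <stopDateTime>2010-06-12T23:59:00</stopDateTime>
    <Tag>Normal</Tag>
  </flow>
</dataroot>
"""

# Assumed values: offset between label time and capture time, plus a
# safety margin on both sides of each flow.
TZ_OFFSET = timedelta(hours=4)
MARGIN = timedelta(minutes=10)


def load_attack_windows(xml_text):
    """Return (source, destination, earliest, latest) for each attack flow."""
    windows = []
    for flow in ET.fromstring(xml_text):
        if flow.findtext("Tag") != "Attack":
            continue
        start = datetime.fromisoformat(flow.findtext("startDateTime"))
        stop = datetime.fromisoformat(flow.findtext("stopDateTime"))
        windows.append((
            flow.findtext("source"),
            flow.findtext("destination"),
            start - TZ_OFFSET - MARGIN,
            stop - TZ_OFFSET + MARGIN,
        ))
    return windows


def is_malicious(src, dst, timestamp, windows):
    """True if a packet matches the endpoints and time window of an attack flow."""
    return any(
        s == src and d == dst and earliest <= timestamp <= latest
        for s, d, earliest, latest in windows
    )
```

Each packet read from the PCAP file would then be passed to `is_malicious` with its addresses and capture time, and written to the malicious or the benign output accordingly. Matching on addresses and a time window alone is coarse; adding ports and protocol would make it stricter.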

UNSW-NB15 Data Set

This data set consists of two days of recorded traffic totalling 100 GB. Apart from the PCAP files, the authors also provide preprocessed CSV files and results from Bro and Argus.

HTTP Attack Traffic Data Set by Roberto Perdisci

This data set contains HTTP traffic generated from various shellcodes, including morphed shellcodes. You can also download and run the IDS that the authors propose.

LBNL 05 Data Set

An 11 GB set of anonymised packet header traces, stored in PCAP format.

Other interesting data sets

These data sets do not necessarily contain raw network traffic, but they could still be useful for IDS-related research.

HTTP Attacks against web servers

ADFA Intrusion Detection Data Set (For Host-based IDS)

2 thoughts on “Network Traffic Data Set for Intrusion Detection System Research”

  • Mex

    Hello, I have read your scripts for handling the ISCX 2012 data set.
    I’m confused about the time conversion in iscx_dataset_splitter.py. I know the timestamps in the PCAP file and the XML file are inconsistent, but I wonder why you subtract 4 hours and allow a 10-minute margin before and after the recorded time.
    Second, I found that the number of attack packets extracted differs from the sum of the netflows in the XML files. Could you explain this?
    Third, have you ever written a script to divide the packets into separate netflows based on the XML files?

    Thanks a lot!

    • baskoro

      1. The 4 hours account for the timezone difference. Where I live now, there is apparently a 4-hour difference. The 10 minutes give a margin around the time written in the CSV file, because the times in the CSV file and the PCAP file sometimes differ slightly.
      2. I haven’t checked that, but I did notice there are duplicated records. That could be the cause.
      3. No, I haven’t.
