Pandas, like NumPy, is one of Python's data analysis libraries. It lets us filter rows from a DataFrame based on certain criteria, and doing so is really easy. For instance, say I have a DataFrame df with four columns: source IP address, source port, destination IP address, and destination port. When I want to search for a particular pair of source IP address and port, I can use the following line:

rows = df[(df["src_ip_addr"] == "192.168.66.6") & (df["src_port"] == 12345)]

However, I found that this approach is really slow, particularly with a huge DataFrame. Surprisingly, there is a simple trick to speed it up: use the .values attribute in the filtering criteria, so that the comparisons run on the underlying arrays instead of on the Series objects. The previous command can be modified as follows:

rows = df[(df["src_ip_addr"].values == "192.168.66.6") & (df["src_port"].values == 12345)]

In my case, that change made the filtering run 4 times faster.
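
If you want to check the difference on your own machine, here is a minimal sketch of how the two variants could be timed. The column names src_ip_addr and src_port come from the lines above. The other column names (dst_ip_addr, dst_port), the synthetic data, the row count, and the helper names filter_series and filter_values are assumptions made for illustration only. The speedup you see will depend on your data, dtypes, and pandas version, so the sketch prints the timings rather than promising a number.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: sizes and values are made up for this illustration.
rng = np.random.default_rng(0)
n = 200_000
df = pd.DataFrame({
    "src_ip_addr": rng.choice(["192.168.66.6", "10.0.0.1", "10.0.0.2"], size=n),
    "src_port": rng.integers(12340, 12350, size=n),
    "dst_ip_addr": rng.choice(["172.16.0.1", "172.16.0.2"], size=n),
    "dst_port": rng.integers(1, 65536, size=n),
})


def filter_series(frame):
    # Original approach: compare the Series objects directly.
    return frame[(frame["src_ip_addr"] == "192.168.66.6") & (frame["src_port"] == 12345)]


def filter_values(frame):
    # Modified approach: compare the underlying arrays via .values.
    return frame[(frame["src_ip_addr"].values == "192.168.66.6") & (frame["src_port"].values == 12345)]


# Both variants should select exactly the same rows.
same_result = filter_series(df).equals(filter_values(df))

# Time 20 runs of each variant.
t_series = timeit.timeit(lambda: filter_series(df), number=20)
t_values = timeit.timeit(lambda: filter_values(df), number=20)

print(f"same rows selected: {same_result}")
print(f"Series comparison:  {t_series:.4f} s for 20 runs")
print(f".values comparison: {t_values:.4f} s for 20 runs")
```

The equality check matters as much as the timing: the trick is only useful if both filters return the same rows. Note also that the pandas documentation recommends Series.to_numpy() over .values in new code; I have not measured whether it behaves any differently here.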
