For When You Can't Have The Real Thing

2009-05-06 #1

Created by dave.

Changing Gears

This is the problem:

$ /usr/bin/time perl rowcount flows srcip
0.13user 0.02system 23:07.86elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k

The only thing that this script does is count the number of rows, and it takes over 23 minutes. That is the simplest possible query on the database, meaning that anything more complicated will take even longer to run.

For the record, there are almost 37 days of data in that table when the script is run. That translates into 379 million rows.

Also, a little messing around showed me that for most forensics-style queries, nfdump through nfsen is going to be faster than querying the database. There still might be value in having all the flows in the database; it might be neat to run port-number analysis and see if I can do things like detect portscans, but for the short term, there's no point.

So I decided to go back to the previous design of only storing the IP activity for local IPs for each timeslot; the table is MUCH shorter.

This query returns the local IPs ranked by total bytes (received + sent) for the month of April.

netflow=# select ip, sum(bytesIn + bytesOut) as total from LocalIP
    where timeslot > 1238562000 and timeslot < 1241154000
    group by ip
    order by total desc;

Useful to know:

$ perl -e 'use Date::Parse; print str2time("2009-05-01 00:00:00 EST5EDT");'
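For what it's worth, the two boundary timestamps used in the query above match midnight at the fixed EST offset (UTC-5); a quick Python sketch to double-check them:

```python
from datetime import datetime, timezone, timedelta

# The "EST5EDT" string in the str2time call appears to resolve to the
# fixed EST offset (UTC-5), judging by the bounds used in the query.
EST = timezone(timedelta(hours=-5))

april_start = int(datetime(2009, 4, 1, tzinfo=EST).timestamp())
may_start = int(datetime(2009, 5, 1, tzinfo=EST).timestamp())

print(april_start)  # 1238562000, the lower bound in the query
print(may_start)    # 1241154000, the upper bound
```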

So I now have a script that parses each nfdump file and records bytes in and out for each local IP in each 5-minute timeslot. A second Perl script runs the above query, pretty-prints the results, and outputs them in order, so I can now identify my top users for a given time period.
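The aggregation step can be sketched roughly like this in Python (the actual script is Perl, and the record layout below is a made-up stand-in for parsed nfdump output, not its real format):

```python
from collections import defaultdict

# Hypothetical pre-parsed flow records: (epoch, src_ip, dst_ip, bytes).
# In reality these would come from parsing nfdump output.
flows = [
    (1238562010, "10.0.0.5", "8.8.8.8", 1200),
    (1238562100, "1.2.3.4", "10.0.0.5", 3400),
    (1238562400, "10.0.0.7", "8.8.4.4", 500),
]
LOCAL_NET = "10.0.0."  # toy "is this a local IP" test
SLOT = 300             # 5-minute timeslots, in seconds

totals = defaultdict(lambda: [0, 0])  # (slot, ip) -> [bytesIn, bytesOut]
for epoch, src, dst, nbytes in flows:
    slot = epoch - epoch % SLOT        # round down to the slot boundary
    if src.startswith(LOCAL_NET):      # local IP sending: count as bytesOut
        totals[(slot, src)][1] += nbytes
    if dst.startswith(LOCAL_NET):      # local IP receiving: count as bytesIn
        totals[(slot, dst)][0] += nbytes

for (slot, ip), (bytes_in, bytes_out) in sorted(totals.items()):
    print(slot, ip, bytes_in, bytes_out)
```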

Runs in less than four seconds, with no indexes on the table. FOUR SECONDS. Probably because the 380 million flows can be summarized into 1.3 million summary records.
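Rough arithmetic on those counts, using the approximate figures above (so these are ballpark ratios only):

```python
# Sanity check on the summary-table size, using numbers from the post.
flows = 379_000_000      # raw flow rows over ~37 days
days = 37
slots = days * 24 * 12   # 5-minute timeslots in the capture window
summary_rows = 1_300_000 # rows in the per-IP, per-slot summary table

print(flows // summary_rows)  # ~291x fewer rows to scan
print(summary_rows // slots)  # ~120 active local IPs per slot, implied
```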


I think that I'm going to stop importing flows into postgres as there isn't really a practical point, at least not with the computers I have available. There's just too much information to be parsed in a reasonable amount of time.


Also potentially useful: calculating the 95th percentile for traffic

  • there are 12 five-minute timeslots in an hour, and 24 hours in a day
  • so for a 30-day month: 12*24*30 = 8640 timeslots, and 5% of 8640 is 432. ie, rank all timeslots by number of bytes passed, take the 432nd-highest slot, figure out the average rate in that slot, and that's your 95th percentile.
  • for a 31-day month it works out to slot 446.4, so we'd just round in favor of the customer here, to slot 447
  • for a 28-day month: slot 403.2, so slot 404.
  • and once every four years we get a 29-day February: slot 417.6, so slot 418.
This works in favor of the customer, especially considering that technical problems on the meter will mean the slot is registered as a zero-byte slot.
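The slot arithmetic above reduces to taking the ceiling of 5% of the month's timeslot count; a small sketch (the function name is mine):

```python
import math

def percentile_slot(days, slot_minutes=5, percent=5):
    """Rank of the timeslot to bill at: ceiling of percent% of the
    month's slot count, so fractional ranks round up in the
    customer's favor."""
    slots = (60 // slot_minutes) * 24 * days  # 12 * 24 * days for 5-min slots
    return math.ceil(slots * percent / 100)

for days in (28, 29, 30, 31):
    print(days, percentile_slot(days))  # 404, 418, 432, 447
```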