For When You Can't Have The Real Thing

2007-11-09 #1

Created by dave. Last edited by dave, 16 years and 30 days ago. Viewed 2,253 times. #3

Oh, crap

Good: rounding the timeperiod of the flow end up to the next five-minute interval is easy:
# $timeperiod is the flow end time in epoch seconds;
# round it up to the next five-minute interval.
$timeperiod = $timeperiod - ($timeperiod % 300) + 300;

Yeah, this isn't leap-second safe -- it isn't even leap-second aware, but then str2time isn't either.
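The same rounding, sketched in Python for clarity (the original is Perl); note the edge case that a timestamp landing exactly on a boundary still gets pushed to the *next* interval:

```python
def round_up_interval(t, interval=300):
    """Round an epoch timestamp up to the next interval boundary.

    Mirrors the Perl above: t - (t % 300) + 300, so a timestamp
    exactly on a boundary is pushed to the following interval.
    """
    return t - (t % interval) + interval

# a flow ending mid-interval lands on the next boundary
print(round_up_interval(1194600123))  # → 1194600300
# a flow ending exactly on a boundary moves to the following one
print(round_up_interval(1194600600))  # → 1194600900
```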

Problem: since we can't be sure that a previous flow file didn't include a record that would nominally end in the same time period as the one we're currently considering, we have to do a database lookup for each potential record before writing the updated record out. This makes things incredibly slow (i.e. our run time goes from 8 seconds per flow file to… well, I killed it after five minutes).
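The check-before-write pattern looks roughly like this (a Python/SQLite sketch; the table name and columns here are hypothetical, since the post doesn't show the real schema):

```python
import sqlite3

# Hypothetical schema -- the real table and columns aren't shown in the post.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE flows (
    timeperiod INTEGER,
    router     TEXT,
    octets     INTEGER,
    PRIMARY KEY (timeperiod, router))""")

def record_flow(conn, timeperiod, router, octets):
    """Check-before-set: look for an existing row in this time period and
    fold the new octet count in, otherwise insert a fresh row.  The extra
    SELECT round trip per record is what makes this path so slow."""
    cur = conn.execute(
        "SELECT octets FROM flows WHERE timeperiod = ? AND router = ?",
        (timeperiod, router))
    row = cur.fetchone()
    if row is None:
        conn.execute("INSERT INTO flows VALUES (?, ?, ?)",
                     (timeperiod, router, octets))
    else:
        conn.execute(
            "UPDATE flows SET octets = ? WHERE timeperiod = ? AND router = ?",
            (row[0] + octets, timeperiod, router))

record_flow(conn, 1194600300, "gw1", 1000)
record_flow(conn, 1194600300, "gw1", 500)   # same interval: folded in
print(conn.execute("SELECT octets FROM flows").fetchone()[0])  # → 1500
```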

If we go back to the flow files only being written every fifteen minutes, we can guarantee interval integrity and we no longer have to check each potential record before updating the database.

Okay, it turns out that my code was wrong the first time I ran this: I wasn't rounding to the next five-minute interval, so I had far more data points than expected. With the rounding in place, plus the checking removed, the script executes in about 25 seconds per 15-minute interval file -- that pass through time2str for each record really hurts. I'm guessing that the check-before-set will add another minute to this run (assuming ~325 records to check at 5 record checks per second, which is roughly what I observed in the earlier progress run), but I'd have to put the check code back in to find out… and I deleted that block instead of commenting it out (bad, bad Dave!).

I guess I'm suddenly wanting a version control system.

Doing the checking is "correct", although I don't immediately know how to tell the difference between a record that was entered from a different file and one that resulted from a previous run over the same file. Is this important? I think we can assume that the database will be the result of a single pass through the data, since that is how we are going to set it up.

Based on all this, I've gone back to the first invocation of flow-capture:

flow-capture -w /var/spool/flows -S 5 0/0/999
...because either
  • that way I know that my captured flows have fifteen-minute breakpoints and I can be lazy about my checking; or
  • if I'm going to do checking anyways, I might as well have fewer files to deal with.
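For the record, my reading of the flow-tools man page (worth double-checking) is that the file rotation interval is governed by -n, not -S, so the fifteen-minute breakpoints here come from the default:

```shell
# -w: directory the flow files are written to
# -S 5: log a stats message every five minutes (does NOT set rotation)
# 0/0/999: accept exports from any remote, on any local IP, port 999
# Rotation is set by -n (rotations per day); the default of 95 works
# out to one file every fifteen minutes.
flow-capture -w /var/spool/flows -S 5 0/0/999
```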