Progress today 
Changed the way the script works: now the call to commit the inserts to the database is outside the loop. This approximately halves execution time:
foreach $ip (keys %LocalTraffic)
{
    # one INSERT per local IP; the commit now happens once, after the loop
    $statement = "insert into LocalIP values ($timeStamp,'$ip',$LocalTraffic{$ip});";
    $sth = $database->prepare($statement);
    die "$DBI::errstr" if (!defined $sth);
    $sth->execute;
    $sth->finish;
}
$database->commit;
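For that explicit commit to actually batch the inserts, the database handle has to be opened with AutoCommit turned off. The script's connect call isn't shown above, so this is only a sketch of how I assume it looks (the DSN, user and password are stand-ins, not the real values):

use DBI;

# connect with AutoCommit disabled so the inserts are held until the explicit commit
# (DSN and credentials below are placeholders)
$database = DBI->connect("dbi:Pg:dbname=flowdb", "flowdb", "",
                         { AutoCommit => 0 })
    or die $DBI::errstr;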
I was getting errors about the parameter being too large for an integer field (presumably some per-IP byte totals exceed the 2,147,483,647 limit of a 32-bit int), so I changed the table structure for LocalIP to use bigint instead:
flowdb=# drop table LocalIP;
flowdb=# create table LocalIP ( timeslot int, ip varchar(16), bytes bigint );
flowdb=# grant all on table LocalIP to flowdb;
...although using bigint means the script's run time has now roughly quadrupled. Interesting.
A dry-run import of the data takes a long time: six or seven seconds per 15-minute file.
Changed the way flow-capture is invoked:
flow-capture -w /var/spool/flows -n 288 -S 5 0/0/999
The -n 288 parameter tells the capture system to break the capture into five-minute files instead of the default fifteen-minute files (288 is the number of five-minute intervals in 24 hours).
This will mean that the old data is practically useless, unless we want to re-parse the existing 15-minute-interval data down into five-minute intervals. Or maybe what we should be doing is just rounding every datapoint down to the appropriate five-minute mark, although that would mean keeping three times the number of tables in memory while parsing the files. Hmmm. On the other hand, we really only need five-minute-interval data from the last day, as everything else will be averaged out by the rrdb.
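The rounding itself is the easy part; something along these lines, assuming $timeStamp is a Unix epoch time, would do it:

# round an epoch timestamp down to the start of its five-minute slot
$slot = $timeStamp - ($timeStamp % 300);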
I think that parsing it out into five-minute intervals is the "right thing to do", even though it increases the amount of data we hold in memory, and it means we effectively have to do a read against the database for each record to make sure we are not clobbering a record from a previous data file but adding to it instead.
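That check-then-add step might look something like this. It's only a sketch: it assumes the same $database handle and LocalIP layout as above, reuses the $slot value from the rounding step, and uses DBI placeholders rather than interpolating values into the SQL:

# look for an existing row for this timeslot/IP pair
$check = $database->prepare("select bytes from LocalIP where timeslot = ? and ip = ?");
$check->execute($slot, $ip);
($existing) = $check->fetchrow_array;
$check->finish;

if (defined $existing) {
    # a previous data file already contributed to this slot: add to it
    $update = $database->prepare("update LocalIP set bytes = bytes + ? where timeslot = ? and ip = ?");
    $update->execute($LocalTraffic{$ip}, $slot, $ip);
    $update->finish;
} else {
    # first time we have seen this slot/IP pair: insert a fresh row
    $insert = $database->prepare("insert into LocalIP values (?, ?, ?)");
    $insert->execute($slot, $ip, $LocalTraffic{$ip});
    $insert->finish;
}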
That's for another day.