
2007-11-20 #1

Created by dave. Last edited by dave, 16 years and 109 days ago. Viewed 2,250 times. #3


The data load gets slow with all this data in the database: 11 days of data took over 730 minutes to load. Running the cleanup insert (i.e. covering the span between the file's run time and now) is now taking a large amount of time per file. The detection/removal/correction process is taking almost a whole second per record.

This is probably because we have to do a lookup for every record we want to insert, to make sure we are not clobbering an existing one. There are a couple of ways around this:

  • just write the new record and make postgres2rrd deal with multiple records per time period/IP combination
  • create some method where we have a pre-load table of some kind that only contains time periods likely to have clobbers (i.e. the last time period written from the previous file), so lookups are fast; then re-populate the table with the data from this file's last time period so it's available for the next file.
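The second option above can be sketched roughly like this. All table and column names are hypothetical, and sqlite stands in for Postgres; the point is that only records in the one overlap-prone time period pay for a lookup:

```python
import sqlite3

# Stand-in for the Postgres flows table; names here are my guesses, not the real schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE flows (period INTEGER, ip TEXT, bytes_in INTEGER, bytes_out INTEGER)")

def load_file(records, suspect_keys):
    """Insert flow records, checking for clobbers only against the small
    set of (period, ip) keys from the previous file's last time period."""
    for period, ip, b_in, b_out in records:
        if (period, ip) in suspect_keys:
            # Possible clobber: merge into the existing row instead of inserting.
            db.execute(
                "UPDATE flows SET bytes_in = bytes_in + ?, bytes_out = bytes_out + ? "
                "WHERE period = ? AND ip = ?",
                (b_in, b_out, period, ip),
            )
        else:
            # Fast path: no lookup at all.
            db.execute("INSERT INTO flows VALUES (?, ?, ?, ?)", (period, ip, b_in, b_out))
    # Re-populate the suspect set from this file's last time period,
    # ready for the next file.
    last = max(p for p, _, _, _ in records)
    return {(p, ip) for p, ip, _, _ in records if p == last}

suspects = load_file([(100, "10.0.0.1", 500, 200), (200, "10.0.0.1", 300, 100)], set())
suspects = load_file([(200, "10.0.0.1", 50, 25), (300, "10.0.0.2", 10, 5)], suspects)
```

The second call merges its period-200 record into the existing row rather than inserting a duplicate, which is exactly the clobber case.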
Creation of the RRD files is also pretty slow, three time periods per second or so.

I'm probably totally I/O bound here:

  • reading the flows and writing the database is all disk
  • reading the database and writing the RRD files is all disk
There won't be much to be gained by throwing more CPU at the problem.

Wrote a shell script to do the graph generation; it's pretty speedy, less than four seconds. So on-demand generation isn't going to be unreasonable.

Changed the table layout:

  • removed bytes
  • added bytesIn, bytesOut
  • added a new table which tracks which flow files have been inserted
postgres2rrd now knows to skip the last time period it's offered, record that period, and then exclude all time periods prior to it the next time it's run. This is stored in the rrdload table. It also knows not to delete the RRDs.
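That skip-and-record logic amounts to a high-water mark. A minimal sketch, assuming a one-row rrdload table and sqlite standing in for Postgres (the real rrdload schema isn't shown above, so this shape is a guess):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE flows (period INTEGER, ip TEXT, bytes_in INTEGER, bytes_out INTEGER)")
# Hypothetical shape for the rrdload bookkeeping table: one high-water mark.
db.execute("CREATE TABLE rrdload (last_period INTEGER)")
db.execute("INSERT INTO rrdload VALUES (0)")

def periods_to_export():
    """Return complete time periods not yet pushed to the RRDs.

    The newest period is held back (it may still receive records from the
    next flow file) and recorded, so the next run starts from it."""
    (mark,) = db.execute("SELECT last_period FROM rrdload").fetchone()
    periods = [p for (p,) in db.execute(
        "SELECT DISTINCT period FROM flows WHERE period >= ? ORDER BY period", (mark,))]
    if periods:
        newest = periods.pop()  # skip the last period offered; it may be incomplete
        db.execute("UPDATE rrdload SET last_period = ?", (newest,))
    return periods

for p in (100, 200, 300):
    db.execute("INSERT INTO flows VALUES (?, '10.0.0.1', 1, 1)", (p,))

first = periods_to_export()   # -> [100, 200]; period 300 is held back
db.execute("INSERT INTO flows VALUES (400, '10.0.0.1', 1, 1)")
second = periods_to_export()  # -> [300]; period 400 is now held back
```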

To do:

  • add a switch to tell postgres2rrd to generate the RRDs from scratch
  • figure out if this is robust enough for the collector to go back to 5-minute interval files (probably not)
  • generate a simple web page with the IPs and graphs on it
  • figure out how to generate the graphs dynamically
  • generate tables based on top talkers, top listeners (this is probably simple sql query voodoo)
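The top-talkers item really is simple SQL voodoo: a GROUP BY over the byte counters. A sketch against the hypothetical flows table used above (sqlite standing in for Postgres):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE flows (period INTEGER, ip TEXT, bytes_in INTEGER, bytes_out INTEGER)")
db.executemany("INSERT INTO flows VALUES (?, ?, ?, ?)", [
    (100, "10.0.0.1", 10, 900),
    (100, "10.0.0.2", 800, 20),
    (200, "10.0.0.1", 5, 600),
])

# Top talkers: rank IPs by total bytes sent.
top_talkers = db.execute(
    "SELECT ip, SUM(bytes_out) AS sent FROM flows "
    "GROUP BY ip ORDER BY sent DESC LIMIT 10").fetchall()

# Top listeners: the same query over bytes received.
top_listeners = db.execute(
    "SELECT ip, SUM(bytes_in) AS rcvd FROM flows "
    "GROUP BY ip ORDER BY rcvd DESC LIMIT 10").fetchall()
```

Restricting the query to recent time periods (a WHERE clause on period) would give "top talkers today" style tables.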
I think at this point I can load the data and then pretty much forget about having to re-load it from scratch. Again.

Oh yeah, forgot to mention: once you convert the values to bits per second, the graphs make much more sense.
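For reference, the conversion is just bytes times eight, divided by the length of the time period. The 300-second period here is an assumption based on the 5-minute interval files mentioned above:

```python
def bits_per_second(byte_count, period_seconds=300):
    """Convert a per-period byte total to an average bits-per-second rate."""
    return byte_count * 8 / period_seconds

# e.g. 1,500,000 bytes in a 5-minute period averages 40 kbit/s
rate = bits_per_second(1_500_000)
```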
