satellite data in an sql database : opinions? -
10-26-2006
, 11:43 AM
Hi.
I'm trying to figure out whether or not to convert a home-grown data
archiving and processing system in to one based on mySql.
Our data is basically grid data from satellites. There's lots of it.
Typically, we'd have an area of 200 x 200 points every hour of the day
where there's sunlight. So for a ten hour day we'd have
200 x 200 x sizeof(float) x 10 = 1.6Mb a day
We'd typically have several year's worth of data like this and in some
cases the grid size would be considerably larger. Also, it's quite
likely that we'll be storing several parameters, not just one so this
could scale up by an order of magnitude or two.
My interest in using mySql stems from a desire to get this data out of
yearly binary data files and into a large continuous database -- which
doesn't necessarily call for RDBMS functionalily -- but also because I
think the processing of the data could be simplified with selecting and
sorting fuctions being pushed off on the database side. We could do
some interesting stuff with the data afterwards if we had RDBMS
functionality..
The processing of the data takes place in two modes:
1.) the spatial domain, in which we consider just an hourly grid at a
time, usually checking pixel values against their neighbors. In this
case we'd be grabbing all pixels with a particular time stamp or
grid_id.
and
2.) the time domain, in which we process one point at a time, over the
entire data set. In this case we'd be keying off of a pixel_id of some
sort. That's all simple enough but in order to work through the entire
grid space, I'd be making, in the case given above, 1.6 million sql
calls through some DBI in order to process all the data. Actually, I'd
probably be making several times that when you consider that I'll
likely be writing data out as I go.
If anybody has some words of advice or caution, I'd be appreciative.
Performance is pretty important as one of the problems with the current
system is its sluggishness. Case 2 is obviously the big concern here.
I'm aware of the new(ish) spatial extension to the various sql systems,
including mySql, but I'm having trouble finding examples of its use
that are close to my situation. The current system works through all
the data on a time scale of days. I'm hoping to pare that down to
hours.
In summary, I think I'm in the position of making lots and lots of sql
calls on a fairly large dataset from my perl/C code and am wondering if
that's going to slow me down to a crawl. Having our data in sql would
have other advantages, such as ease of web-based distribution, but if
it's ultra-slow I'll have to come up with something else.
Thanks,
Jim Schlemmer |