dbTalk Databases Forums  

Data warehouse

comp.databases.mysql comp.databases.mysql


Discuss Data warehouse in the comp.databases.mysql forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Ignoramus3367
 
Posts: n/a

Default Data warehouse - 09-08-2011 , 01:33 PM






I am building a data warehouse for data mining purposes.

It will be a RAID 6 array with 12 2TB drives. Server had 48 GB of
memory, 12 intel CPUs, 3Ware 9750 for RAID card.

It will have several tables, with one biggest by far.

My design principles are

1) Use InnoDB with a separate file per table

2) Use the main tables ONLY for batch queries -- no accesses to get
"just one item" (other than for debugging).

3) All INSERTS to be done into temporary tables, and when they are
done, INSERT SELECT into the main table.

4) If some project using this data warehouse needs quick access to
data, then it should select the data it needs into its own table (on a
different server perhaps) and then access it all it wants.

5) I should have an INDEX for all popular queries.

Could you criticize anything in this approach.

Thanks

Reply With Quote
  #2  
Old   
Peter H. Coffin
 
Posts: n/a

Default Re: Data warehouse - 09-08-2011 , 05:32 PM






On Thu, 08 Sep 2011 13:33:51 -0500, Ignoramus3367 wrote:
Quote:
I am building a data warehouse for data mining purposes.

It will be a RAID 6 array with 12 2TB drives. Server had 48 GB of
memory, 12 intel CPUs, 3Ware 9750 for RAID card.

It will have several tables, with one biggest by far.

My design principles are

1) Use InnoDB with a separate file per table
I'm curious as to your thinking on this one.

Quote:
2) Use the main tables ONLY for batch queries -- no accesses to get
"just one item" (other than for debugging).
Versus what, exactly? Range-partitioned into separate tables data?

Quote:
3) All INSERTS to be done into temporary tables, and when they are
done, INSERT SELECT into the main table.
*Probably* not a bad idea, but depending on the data, that might not be
necessary or it might not be enough.

Quote:
4) If some project using this data warehouse needs quick access to
data, then it should select the data it needs into its own table (on a
different server perhaps) and then access it all it wants.
Great idea! If the other server also has 24TB of storage and 48GB of
RAM... Of course, they may not need you anymore in that case.

Quote:
5) I should have an INDEX for all popular queries.
This one goes without saying. However, there may be means other than a
plain index that better suits things. Maybe partitioning sales data by
year might help more.

Quote:
Could you criticize anything in this approach.
A lot of these are principles that are very good in theory, but the
needs of the data and how much of it needs to be accessed at once, and
for what reason. The most important things to consider when building a
data warehouse 1) HOW is the data to be accessed, and 2) WHAT does the
data look like? And, yes, those are in that order. You shouldn't REALLY
be settling on what RDBMS you're going to use until you've gotten those
questions straightened out. (Oh, and #3 is "How are you going to back it
up?")

--
I'm not sure if this is a good or a bad thing.
Probably a bad thing; most things are bad things.
-- Nile Evil Bastard

Reply With Quote
  #3  
Old   
Ignoramus3367
 
Posts: n/a

Default Re: Data warehouse - 09-08-2011 , 06:32 PM



On 2011-09-08, Peter H. Coffin <hellsop (AT) ninehells (DOT) com> wrote:
Quote:
On Thu, 08 Sep 2011 13:33:51 -0500, Ignoramus3367 wrote:
I am building a data warehouse for data mining purposes.

It will be a RAID 6 array with 12 2TB drives. Server had 48 GB of
memory, 12 intel CPUs, 3Ware 9750 for RAID card.

It will have several tables, with one biggest by far.

My design principles are

1) Use InnoDB with a separate file per table

I'm curious as to your thinking on this one.
Not much, I just do not want this database intermixed with any other.

Quote:
2) Use the main tables ONLY for batch queries -- no accesses to get
"just one item" (other than for debugging).

Versus what, exactly? Range-partitioned into separate tables data?
Well, say, let me give you an example. Suppose that this is a dataset
of various widgets. (it is not, just an example)

And further suppose that I want to open a website that sells only
"blue widgets" out of this dataset. www.bluewidgets.com.

A most simplistic approach would be to have a website with code that
does SELECT ... WHERE title LIKE '%blue widget%' to present products,
and then requests a page from the main table for every product that it
needs to display.

I reject such an approach. Instead, I would demand that the blue
widget website has its own table, that it would populate it from the
master table once a day or some such, as a batch process.

Quote:
3) All INSERTS to be done into temporary tables, and when they are
done, INSERT SELECT into the main table.

*Probably* not a bad idea, but depending on the data, that might not be
necessary or it might not be enough.
I think that it is my best shot.

Quote:
4) If some project using this data warehouse needs quick access to
data, then it should select the data it needs into its own table (on a
different server perhaps) and then access it all it wants.

Great idea! If the other server also has 24TB of storage and 48GB of
RAM... Of course, they may not need you anymore in that case.
Well, I hope that sub-selections would be a lot smaller, this is my
unstated assumption.

Quote:
5) I should have an INDEX for all popular queries.

This one goes without saying. However, there may be means other than a
plain index that better suits things. Maybe partitioning sales data by
year might help more.
You know, I looked into partitioning. The partition variable would
need to be a part of PRIMARY KEY, which makes no sense to me in this
application. I have IDs available in the data that are the key.

Quote:
Could you criticize anything in this approach.

A lot of these are principles that are very good in theory, but the
needs of the data and how much of it needs to be accessed at once, and
for what reason. The most important things to consider when building a
data warehouse 1) HOW is the data to be accessed, and 2) WHAT does the
data look like? And, yes, those are in that order. You shouldn't REALLY
be settling on what RDBMS you're going to use until you've gotten those
questions straightened out. (Oh, and #3 is "How are you going to back it
up?")
Thanks. I am relatively fuzzy on both first questions (stupid me).

The backup is my biggest PITA, but I have devised an approach of
making numerous TGZ files with 1,000 related records in each, that I
will be storing on open top Cavalry USB drives.

i

Reply With Quote
  #4  
Old   
Jerry Stuckle
 
Posts: n/a

Default Re: Data warehouse - 09-08-2011 , 09:31 PM



On 9/8/2011 7:32 PM, Ignoramus3367 wrote:
Quote:
On 2011-09-08, Peter H. Coffin<hellsop (AT) ninehells (DOT) com> wrote:
On Thu, 08 Sep 2011 13:33:51 -0500, Ignoramus3367 wrote:
I am building a data warehouse for data mining purposes.

It will be a RAID 6 array with 12 2TB drives. Server had 48 GB of
memory, 12 intel CPUs, 3Ware 9750 for RAID card.

It will have several tables, with one biggest by far.

My design principles are

1) Use InnoDB with a separate file per table

I'm curious as to your thinking on this one.

Not much, I just do not want this database intermixed with any other.

2) Use the main tables ONLY for batch queries -- no accesses to get
"just one item" (other than for debugging).

Versus what, exactly? Range-partitioned into separate tables data?

Well, say, let me give you an example. Suppose that this is a dataset
of various widgets. (it is not, just an example)

And further suppose that I want to open a website that sells only
"blue widgets" out of this dataset. www.bluewidgets.com.

A most simplistic approach would be to have a website with code that
does SELECT ... WHERE title LIKE '%blue widget%' to present products,
and then requests a page from the main table for every product that it
needs to display.

I reject such an approach. Instead, I would demand that the blue
widget website has its own table, that it would populate it from the
master table once a day or some such, as a batch process.

3) All INSERTS to be done into temporary tables, and when they are
done, INSERT SELECT into the main table.

*Probably* not a bad idea, but depending on the data, that might not be
necessary or it might not be enough.

I think that it is my best shot.

4) If some project using this data warehouse needs quick access to
data, then it should select the data it needs into its own table (on a
different server perhaps) and then access it all it wants.

Great idea! If the other server also has 24TB of storage and 48GB of
RAM... Of course, they may not need you anymore in that case.

Well, I hope that sub-selections would be a lot smaller, this is my
unstated assumption.

5) I should have an INDEX for all popular queries.

This one goes without saying. However, there may be means other than a
plain index that better suits things. Maybe partitioning sales data by
year might help more.

You know, I looked into partitioning. The partition variable would
need to be a part of PRIMARY KEY, which makes no sense to me in this
application. I have IDs available in the data that are the key.

Could you criticize anything in this approach.

A lot of these are principles that are very good in theory, but the
needs of the data and how much of it needs to be accessed at once, and
for what reason. The most important things to consider when building a
data warehouse 1) HOW is the data to be accessed, and 2) WHAT does the
data look like? And, yes, those are in that order. You shouldn't REALLY
be settling on what RDBMS you're going to use until you've gotten those
questions straightened out. (Oh, and #3 is "How are you going to back it
up?")

Thanks. I am relatively fuzzy on both first questions (stupid me).

The backup is my biggest PITA, but I have devised an approach of
making numerous TGZ files with 1,000 related records in each, that I
will be storing on open top Cavalry USB drives.

i
I do agree with Peter on one thing - you shouldn't be selecting the
RDBMS you're using yet. You don't even have a clear idea of WHAT you're
going to do, much less HOW you're going to do it.

And some of your ideas on operations are questionable - like inserting
into a temp table then into the main table. You very well could be
hurting performance.

Personally, I wouldn't use MySQL for such a large database; it will
work, but there are other products better suited to such things.



--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex (AT) attglobal (DOT) net
==================

Reply With Quote
  #5  
Old   
Ignoramus3367
 
Posts: n/a

Default Re: Data warehouse - 09-08-2011 , 09:43 PM



On 2011-09-09, Jerry Stuckle <jstucklex (AT) attglobal (DOT) net> wrote:
Quote:
I do agree with Peter on one thing - you shouldn't be selecting the
RDBMS you're using yet. You don't even have a clear idea of WHAT you're
going to do, much less HOW you're going to do it.

And some of your ideas on operations are questionable - like inserting
into a temp table then into the main table. You very well could be
hurting performance.

Personally, I wouldn't use MySQL for such a large database; it will
work, but there are other products better suited to such things.
I am EXTREMELY open to suggestions here. My experience with large
mysql datasets have not been very encouraging. I am all ears and I
would love to hear more.

What else? Postgres?

i

Reply With Quote
  #6  
Old   
Jerry Stuckle
 
Posts: n/a

Default Re: Data warehouse - 09-08-2011 , 09:50 PM



On 9/8/2011 10:43 PM, Ignoramus3367 wrote:
Quote:
On 2011-09-09, Jerry Stuckle<jstucklex (AT) attglobal (DOT) net> wrote:
I do agree with Peter on one thing - you shouldn't be selecting the
RDBMS you're using yet. You don't even have a clear idea of WHAT you're
going to do, much less HOW you're going to do it.

And some of your ideas on operations are questionable - like inserting
into a temp table then into the main table. You very well could be
hurting performance.

Personally, I wouldn't use MySQL for such a large database; it will
work, but there are other products better suited to such things.

I am EXTREMELY open to suggestions here. My experience with large
mysql datasets have not been very encouraging. I am all ears and I
would love to hear more.

What else? Postgres?

i
I would recommend you try a general database newsgroup, or one on data
warehousing (if there is one).

But no, I wouldn't use Postgres. I'd get an industrial strength
database. Check out DB2, Oracle and SQL Server, for instance.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex (AT) attglobal (DOT) net
==================

Reply With Quote
  #7  
Old   
Ignoramus3367
 
Posts: n/a

Default Re: Data warehouse - 09-08-2011 , 09:58 PM



On 2011-09-09, Jerry Stuckle <jstucklex (AT) attglobal (DOT) net> wrote:
Quote:
On 9/8/2011 10:43 PM, Ignoramus3367 wrote:
On 2011-09-09, Jerry Stuckle<jstucklex (AT) attglobal (DOT) net> wrote:
I do agree with Peter on one thing - you shouldn't be selecting the
RDBMS you're using yet. You don't even have a clear idea of WHAT you're
going to do, much less HOW you're going to do it.

And some of your ideas on operations are questionable - like inserting
into a temp table then into the main table. You very well could be
hurting performance.

Personally, I wouldn't use MySQL for such a large database; it will
work, but there are other products better suited to such things.

I am EXTREMELY open to suggestions here. My experience with large
mysql datasets have not been very encouraging. I am all ears and I
would love to hear more.

What else? Postgres?

i

I would recommend you try a general database newsgroup, or one on data
warehousing (if there is one).

But no, I wouldn't use Postgres. I'd get an industrial strength
database. Check out DB2, Oracle and SQL Server, for instance.

But they all involve BIG money (and SQL Server involves Microsoft),
no?

i

Reply With Quote
  #8  
Old   
fabrice régnier
 
Posts: n/a

Default Re: Data warehouse - 09-08-2011 , 11:38 PM



Hi,

Quote:
But no, I wouldn't use Postgres. I'd get an industrial strength
database. Check out DB2, Oracle and SQL Server, for instance.
What is the most biggest database did you used with postgres ? I have a
work experience with 1.5To in food processing industry and it works fine.

Have a look here: http://www.postgresql.org/about/ and here
http://www.postgresql.fr/temoignages:meteo_france. Sorry, it's in
french. In brief, the public french weather company runs lots of its
database with postgres, the biggest is about 3.5 To used for
datawarehouse. They planned to migrate all their Oracle db to Postgres
db for (essentialy) cost reason.

Sorry to be so short, it's a mysql ng here

To be back to your project, to build a good datawarehouse is very
complex. The pragmatic approach is to write down all present queries
your db will be asked (and try to think about futures queries, but most
of the time, it's already science fiction ) then denormalize all what
you need. Forget about third normal form and what your teachers told you
years ago about clean database model

Well, so much to say about building datawarehouse... we even didn't
talked about hypertable (which is not a rdbms) and i'm not sure this is
the good place for it, though.

regards,

f.

Reply With Quote
  #9  
Old   
Axel Schwenke
 
Posts: n/a

Default Re: Data warehouse - 09-09-2011 , 02:52 AM



Ignoramus3367 <ignoramus3367 (AT) NOSPAM (DOT) 3367.invalid> wrote:
Quote:
On 2011-09-08, Peter H. Coffin <hellsop (AT) ninehells (DOT) com> wrote:
On Thu, 08 Sep 2011 13:33:51 -0500, Ignoramus3367 wrote:
I am building a data warehouse for data mining purposes.

It will be a RAID 6 array with 12 2TB drives. Server had 48 GB of
memory, 12 intel CPUs, 3Ware 9750 for RAID card.

It will have several tables, with one biggest by far.

My design principles are

1) Use InnoDB with a separate file per table

I'm curious as to your thinking on this one.

Not much, I just do not want this database intermixed with any other.
That's not an answer, really.

InnoDB does not shine too much if your data set is much larger than
the buffer pool because InnoDB funnels *all* I/O through that pool.

MyISAM OTOH puts only indexes in memory and is thus better in this
situation. However you certainly don't want to use MyISAM for such
amounts of data because it would take ages to check or repair a table
after a crash. But then MariaDB was designed exactly for that: it's
crashsafe and uses memory in a comparable fashion.

There are also third party storage engines for MySQL that are targeted
at data mining:

http://www.mysql.com/why-mysql/data-warehouse.html

I have no hands-on experience with those though.


XL

Reply With Quote
  #10  
Old   
Jerry Stuckle
 
Posts: n/a

Default Re: Data warehouse - 09-09-2011 , 05:29 AM



On 9/8/2011 10:58 PM, Ignoramus3367 wrote:
Quote:
On 2011-09-09, Jerry Stuckle<jstucklex (AT) attglobal (DOT) net> wrote:
On 9/8/2011 10:43 PM, Ignoramus3367 wrote:
On 2011-09-09, Jerry Stuckle<jstucklex (AT) attglobal (DOT) net> wrote:
I do agree with Peter on one thing - you shouldn't be selecting the
RDBMS you're using yet. You don't even have a clear idea of WHAT you're
going to do, much less HOW you're going to do it.

And some of your ideas on operations are questionable - like inserting
into a temp table then into the main table. You very well could be
hurting performance.

Personally, I wouldn't use MySQL for such a large database; it will
work, but there are other products better suited to such things.

I am EXTREMELY open to suggestions here. My experience with large
mysql datasets have not been very encouraging. I am all ears and I
would love to hear more.

What else? Postgres?

i

I would recommend you try a general database newsgroup, or one on data
warehousing (if there is one).

But no, I wouldn't use Postgres. I'd get an industrial strength
database. Check out DB2, Oracle and SQL Server, for instance.


But they all involve BIG money (and SQL Server involves Microsoft),
no?

i
Not necessarily. But then you also get what you pay for.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex (AT) attglobal (DOT) net
==================

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.