![]() | |
#1
| |||
| |||
|
#2
| ||||||
| ||||||
|
|
I am building a data warehouse for data mining purposes. It will be a RAID 6 array with 12 2TB drives. Server had 48 GB of memory, 12 intel CPUs, 3Ware 9750 for RAID card. It will have several tables, with one biggest by far. My design principles are 1) Use InnoDB with a separate file per table |
|
2) Use the main tables ONLY for batch queries -- no accesses to get "just one item" (other than for debugging). |
|
3) All INSERTS to be done into temporary tables, and when they are done, INSERT SELECT into the main table. |
|
4) If some project using this data warehouse needs quick access to data, then it should select the data it needs into its own table (on a different server perhaps) and then access it all it wants. |
|
5) I should have an INDEX for all popular queries. |
|
Could you criticize anything in this approach. |
#3
| ||||||
| ||||||
|
|
On Thu, 08 Sep 2011 13:33:51 -0500, Ignoramus3367 wrote: I am building a data warehouse for data mining purposes. It will be a RAID 6 array with 12 2TB drives. Server had 48 GB of memory, 12 intel CPUs, 3Ware 9750 for RAID card. It will have several tables, with one biggest by far. My design principles are 1) Use InnoDB with a separate file per table I'm curious as to your thinking on this one. |
|
2) Use the main tables ONLY for batch queries -- no accesses to get "just one item" (other than for debugging). Versus what, exactly? Range-partitioned into separate tables data? |
|
3) All INSERTS to be done into temporary tables, and when they are done, INSERT SELECT into the main table. *Probably* not a bad idea, but depending on the data, that might not be necessary or it might not be enough. |
|
4) If some project using this data warehouse needs quick access to data, then it should select the data it needs into its own table (on a different server perhaps) and then access it all it wants. Great idea! If the other server also has 24TB of storage and 48GB of RAM... Of course, they may not need you anymore in that case. |
|
5) I should have an INDEX for all popular queries. This one goes without saying. However, there may be means other than a plain index that better suits things. Maybe partitioning sales data by year might help more. |
|
Could you criticize anything in this approach. A lot of these are principles that are very good in theory, but the needs of the data and how much of it needs to be accessed at once, and for what reason. The most important things to consider when building a data warehouse 1) HOW is the data to be accessed, and 2) WHAT does the data look like? And, yes, those are in that order. You shouldn't REALLY be settling on what RDBMS you're going to use until you've gotten those questions straightened out. (Oh, and #3 is "How are you going to back it up?") |
#4
| |||
| |||
|
|
On 2011-09-08, Peter H. Coffin<hellsop (AT) ninehells (DOT) com> wrote: On Thu, 08 Sep 2011 13:33:51 -0500, Ignoramus3367 wrote: I am building a data warehouse for data mining purposes. It will be a RAID 6 array with 12 2TB drives. Server had 48 GB of memory, 12 intel CPUs, 3Ware 9750 for RAID card. It will have several tables, with one biggest by far. My design principles are 1) Use InnoDB with a separate file per table I'm curious as to your thinking on this one. Not much, I just do not want this database intermixed with any other. 2) Use the main tables ONLY for batch queries -- no accesses to get "just one item" (other than for debugging). Versus what, exactly? Range-partitioned into separate tables data? Well, say, let me give you an example. Suppose that this is a dataset of various widgets. (it is not, just an example) And further suppose that I want to open a website that sells only "blue widgets" out of this dataset. www.bluewidgets.com. A most simplistic approach would be to have a website with code that does SELECT ... WHERE title LIKE '%blue widget%' to present products, and then requests a page from the main table for every product that it needs to display. I reject such an approach. Instead, I would demand that the blue widget website has its own table, that it would populate it from the master table once a day or some such, as a batch process. 3) All INSERTS to be done into temporary tables, and when they are done, INSERT SELECT into the main table. *Probably* not a bad idea, but depending on the data, that might not be necessary or it might not be enough. I think that it is my best shot. 4) If some project using this data warehouse needs quick access to data, then it should select the data it needs into its own table (on a different server perhaps) and then access it all it wants. Great idea! If the other server also has 24TB of storage and 48GB of RAM... Of course, they may not need you anymore in that case. Well, I hope that sub-selections would be a lot smaller, this is my unstated assumption. 5) I should have an INDEX for all popular queries. This one goes without saying. However, there may be means other than a plain index that better suits things. Maybe partitioning sales data by year might help more. You know, I looked into partitioning. The partition variable would need to be a part of PRIMARY KEY, which makes no sense to me in this application. I have IDs available in the data that are the key. Could you criticize anything in this approach. A lot of these are principles that are very good in theory, but the needs of the data and how much of it needs to be accessed at once, and for what reason. The most important things to consider when building a data warehouse 1) HOW is the data to be accessed, and 2) WHAT does the data look like? And, yes, those are in that order. You shouldn't REALLY be settling on what RDBMS you're going to use until you've gotten those questions straightened out. (Oh, and #3 is "How are you going to back it up?") Thanks. I am relatively fuzzy on both first questions (stupid me). The backup is my biggest PITA, but I have devised an approach of making numerous TGZ files with 1,000 related records in each, that I will be storing on open top Cavalry USB drives. i |
#5
| |||
| |||
|
|
I do agree with Peter on one thing - you shouldn't be selecting the RDBMS you're using yet. You don't even have a clear idea of WHAT you're going to do, much less HOW you're going to do it. And some of your ideas on operations are questionable - like inserting into a temp table then into the main table. You very well could be hurting performance. Personally, I wouldn't use MySQL for such a large database; it will work, but there are other products better suited to such things. |
#6
| |||
| |||
|
|
On 2011-09-09, Jerry Stuckle<jstucklex (AT) attglobal (DOT) net> wrote: I do agree with Peter on one thing - you shouldn't be selecting the RDBMS you're using yet. You don't even have a clear idea of WHAT you're going to do, much less HOW you're going to do it. And some of your ideas on operations are questionable - like inserting into a temp table then into the main table. You very well could be hurting performance. Personally, I wouldn't use MySQL for such a large database; it will work, but there are other products better suited to such things. I am EXTREMELY open to suggestions here. My experience with large mysql datasets have not been very encouraging. I am all ears and I would love to hear more. What else? Postgres? i |
#7
| |||
| |||
|
|
On 9/8/2011 10:43 PM, Ignoramus3367 wrote: On 2011-09-09, Jerry Stuckle<jstucklex (AT) attglobal (DOT) net> wrote: I do agree with Peter on one thing - you shouldn't be selecting the RDBMS you're using yet. You don't even have a clear idea of WHAT you're going to do, much less HOW you're going to do it. And some of your ideas on operations are questionable - like inserting into a temp table then into the main table. You very well could be hurting performance. Personally, I wouldn't use MySQL for such a large database; it will work, but there are other products better suited to such things. I am EXTREMELY open to suggestions here. My experience with large mysql datasets have not been very encouraging. I am all ears and I would love to hear more. What else? Postgres? i I would recommend you try a general database newsgroup, or one on data warehousing (if there is one). But no, I wouldn't use Postgres. I'd get an industrial strength database. Check out DB2, Oracle and SQL Server, for instance. |
#8
| |||
| |||
|
|
But no, I wouldn't use Postgres. I'd get an industrial strength database. Check out DB2, Oracle and SQL Server, for instance. What is the most biggest database did you used with postgres ? I have a |

) then denormalize all what
#9
| |||
| |||
|
|
On 2011-09-08, Peter H. Coffin <hellsop (AT) ninehells (DOT) com> wrote: On Thu, 08 Sep 2011 13:33:51 -0500, Ignoramus3367 wrote: I am building a data warehouse for data mining purposes. It will be a RAID 6 array with 12 2TB drives. Server had 48 GB of memory, 12 intel CPUs, 3Ware 9750 for RAID card. It will have several tables, with one biggest by far. My design principles are 1) Use InnoDB with a separate file per table I'm curious as to your thinking on this one. Not much, I just do not want this database intermixed with any other. |
#10
| |||
| |||
|
|
On 2011-09-09, Jerry Stuckle<jstucklex (AT) attglobal (DOT) net> wrote: On 9/8/2011 10:43 PM, Ignoramus3367 wrote: On 2011-09-09, Jerry Stuckle<jstucklex (AT) attglobal (DOT) net> wrote: I do agree with Peter on one thing - you shouldn't be selecting the RDBMS you're using yet. You don't even have a clear idea of WHAT you're going to do, much less HOW you're going to do it. And some of your ideas on operations are questionable - like inserting into a temp table then into the main table. You very well could be hurting performance. Personally, I wouldn't use MySQL for such a large database; it will work, but there are other products better suited to such things. I am EXTREMELY open to suggestions here. My experience with large mysql datasets have not been very encouraging. I am all ears and I would love to hear more. What else? Postgres? i I would recommend you try a general database newsgroup, or one on data warehousing (if there is one). But no, I wouldn't use Postgres. I'd get an industrial strength database. Check out DB2, Oracle and SQL Server, for instance. But they all involve BIG money (and SQL Server involves Microsoft), no? i |
![]() |
| Thread Tools | |
| Display Modes | |
| |