![]() | |
![]() |
| | Thread Tools | Display Modes |
#1
| |||
| |||
|
#2
| |||
| |||
|
#3
| |||
| |||
|
|
Personally I would first load it into a staging table, e.g. STG_SOURCE with 1 columns called SOURCEFIELD of NVARCHAR(1000) So, only using the datapump's Copy Column function. Or even better, do a BULK INSERT. Then you would have one table with one field with 56,171,938 records. After that, do your cleansing, e.g. with a massive SQL insert statement into a second table with the proper columns. Let that statement perform your calculations, etc. I bet this will be much faster. |
#4
| |||
| |||
|
|
That sounds like a workable alternative to me, though I would like to add a couple of minor points. First, unless the input is unicode I would use VARCHAR rather than NVARCHAR. NVARCHAR would take twice the space, rather costly on a table of this size if not required. Second, I suggest that it be a two-column table, an INT identity column and the (N)VARCHAR column. This could be a major advantage when it comes to loading the real table from the staging table. I would VERY much want to be able to do that in batches, and having an identity column would make working in batches trivial. And on the subject of working in batches, if the entire table were transformed in one pass it would require double storage space. If done in batches - say 100,000 rows at a time, to pick a number out of the air - the processed rows for a batch could be deleted from the staging table after the batch is loaded. Space released from the staging table would become available for the real table. Roy Harvey Beacon Falls, CT On Thu, 13 Jul 2006 07:43:02 -0700, Frans van Bree FransvanBree (AT) discussions (DOT) microsoft.com> wrote: Personally I would first load it into a staging table, e.g. STG_SOURCE with 1 columns called SOURCEFIELD of NVARCHAR(1000) So, only using the datapump's Copy Column function. Or even better, do a BULK INSERT. Then you would have one table with one field with 56,171,938 records. After that, do your cleansing, e.g. with a massive SQL insert statement into a second table with the proper columns. Let that statement perform your calculations, etc. I bet this will be much faster. |
#5
| |||
| |||
|
|
First, unless the input is unicode I would use VARCHAR rather than NVARCHAR. NVARCHAR would take twice the space, rather costly on a table of this size if not required. |
|
would VERY much want to be able to do that in batches, and having an identity column would make working in batches trivial. |
|
And on the subject of working in batches, if the entire table were transformed in one pass it would require double storage space. If done in batches - say 100,000 rows at a time, to pick a number out of the air - the processed rows for a batch could be deleted from the staging table after the batch is loaded. Space released from the staging table would become available for the real table. |
#6
| |||
| |||
|
|
And on the subject of working in batches, if the entire table were transformed in one pass it would require double storage space. If done in batches - say 100,000 rows at a time, to pick a number out of the air - the processed rows for a batch could be deleted from the staging table after the batch is loaded. Space released from the staging table would become available for the real table. Space is not released to the OS if these actions are logged. It is not recommended to turn on "auto-shrink" for a database, so the db would no shrink/remain the same, but still increase in your scenario. In the same order as in my scenario. Ofcourse, in my scenario you could truncate the staging after you're finished. > 1000x faster than deleting 100000 rows after each batch. I even expect the deletes would turn him back to his original scenario, that slow it is. |
#7
| |||
| |||
|
|
referring to releasing space back to the OS, just freeing up space within the database for the other table. |
#8
| |||
| |||
|
|
referring to releasing space back to the OS, just freeing up space within the database for the other table. You're right. I was thinking of that after I sent the post. I am still thinking on doing a test to see it the database file (MDF) will or will not increase in your scenario if auto-shrink is off. Because I am not so sure after all that it will increase (that much). {Depending of course on the how much bytes there are extra in the fields of the second table, because the transformations can create additional data.} |
#9
| |||
| |||
|
|
On Thu, 13 Jul 2006 10:13:01 -0700, Frans van Bree FransvanBree (AT) discussions (DOT) microsoft.com> wrote: And on the subject of working in batches, if the entire table were transformed in one pass it would require double storage space. If done in batches - say 100,000 rows at a time, to pick a number out of the air - the processed rows for a batch could be deleted from the staging table after the batch is loaded. Space released from the staging table would become available for the real table. Space is not released to the OS if these actions are logged. It is not recommended to turn on "auto-shrink" for a database, so the db would no shrink/remain the same, but still increase in your scenario. In the same order as in my scenario. Ofcourse, in my scenario you could truncate the staging after you're finished. > 1000x faster than deleting 100000 rows after each batch. I even expect the deletes would turn him back to his original scenario, that slow it is. Good point on the performance of the DELETEs. However, I was not referring to releasing space back to the OS, just freeing up space within the database for the other table. Roy |
#10
| |||
| |||
|
|
A couple of update comments about my big file processing. One must use a format file if the Identity column is defined since the source data does not have an identity field. |
|
Question: The log file starts to get really big. I looked for a way to compact the data/log file but did not find anything. Does a way exist to force an empty of the log file and then a compression? |
![]() |
| Thread Tools | |
| Display Modes | |
| |