![]() | |
![]() |
| | Thread Tools | Display Modes |
#1
| |||||
| |||||
|
|
We have an application that stores all its information in a database (Oracle 11, as it happens). Between the application and the database sits a third-party object-relational mapper. And not just any sort of object-relational mapper, but one that implements a versioned object model - think of it as source control for objects. A fully operational instance of the application has about 300 000 objects in the database (according to one way of counting them, at least), corresponding to a somewhat larger number of rows (less than a million, i'd guess), spread over a few hundred tables. We like building fresh instances from scratch. We do it as part of our build process, along with compiling the code and so on, to make sure that we can always build a working system from the raw materials in CVS. This process involves clearing out everything in the schema (dropping every object and then purging the recyclebin), running all our DDL, and then loading the 300 000 objects through the versioned mapper. This process is not as fast as we'd like it to be - the DDL is fast, but the data loading takes something like 45 minutes. Let's say, for the sake of argument if nothing else, that we cannot abandon building from scratch, or avoid the versioned mapper, or do anything about the speed or behaviour of the mapper itself. |
|
What could we try to make the load faster? |
|
I don't have detailed statistics to hand, but one thing we've noticed is that there is a very high ratio of selects to upserts during the load; there are numerous (>10) queries made concerning each object before it is inserted - i think the mapper is checking to see if there is an existing version of the object, then inserting the loaded version, performing some sort of merge or check-in operation, etc. We're in the process of tuning our indexes to make them more useful during the load, but a major concern is that the query planner is not making use of them - we quite often see that the most time-consuming queries are ones which should be able to make good use of an index, but are being planned as table scans. Our theory is that this is because the statistics for the tables have not been updated since they were created, and so it looks like they're empty, even as they grow to some vast size. Does that seem plausible? |
|
If so, what can we do about it? We could do a bit of bodging to run dbms_stats.gather_table_stats periodically during the load; would that be sensible? |
|
Is there some way we can arrange for statistics updates to be triggered automatically as the tables grow? A colleague swears that there's some way to tell the planner not to trust the statistics, so it will make plans using indexes even when it doesn't know that's a good idea; can anyone shed any light on that? |
#2
| |||||||
| |||||||
|
|
On 16.02.2011 20:24, Tom Anderson wrote: We have an application that stores all its information in a database (Oracle 11, as it happens). Between the application and the database sits a third-party object-relational mapper. And not just any sort of object-relational mapper, but one that implements a versioned object model - think of it as source control for objects. A fully operational instance of the application has about 300 000 objects in the database (according to one way of counting them, at least), corresponding to a somewhat larger number of rows (less than a million, i'd guess), spread over a few hundred tables. We like building fresh instances from scratch. We do it as part of our build process, along with compiling the code and so on, to make sure that we can always build a working system from the raw materials in CVS. This process involves clearing out everything in the schema (dropping every object and then purging the recyclebin), running all our DDL, and then loading the 300 000 objects through the versioned mapper. This process is not as fast as we'd like it to be - the DDL is fast, but the data loading takes something like 45 minutes. Let's say, for the sake of argument if nothing else, that we cannot abandon building from scratch, or avoid the versioned mapper, or do anything about the speed or behaviour of the mapper itself. Well, OR-Mappers are typically not very fast since they tend to handle every single instance individually unless there is an explicit batch mode. |
|
What could we try to make the load faster? If your OR-Mapper has a means of verifying a database's data it may be quicker to data pump the data into the database and then check but I guess this is not what you want. |
|
Another variant may be to split up your dataset into independent parts and do multiple concurrent loads. |
|
I don't have detailed statistics to hand, but one thing we've noticed is that there is a very high ratio of selects to upserts during the load; there are numerous (>10) queries made concerning each object before it is inserted - i think the mapper is checking to see if there is an existing version of the object, then inserting the loaded version, performing some sort of merge or check-in operation, etc. We're in the process of tuning our indexes to make them more useful during the load, but a major concern is that the query planner is not making use of them - we quite often see that the most time-consuming queries are ones which should be able to make good use of an index, but are being planned as table scans. Our theory is that this is because the statistics for the tables have not been updated since they were created, and so it looks like they're empty, even as they grow to some vast size. Does that seem plausible? Yes. If so, what can we do about it? We could do a bit of bodging to run dbms_stats.gather_table_stats periodically during the load; would that be sensible? I'd rather take stats after the load, save them somewhere and load them back after you have recreated your schema. |
|
Another option might be to use 11g's new features for plan stability but I can't help you there. There are also some pre 11g features like "stored outlines" which may help. |
|
But maybe it's first reasonable to do a trace in order to find out why the load is slow. If you determine that there are some slow queries which use bad plans you need to tune them. If you find out it's the massive amount of SQL statements fired against the DB the you need to tune that (e.g. faster network, more CPU power on the server...). |
|
Is there some way we can arrange for statistics updates to be triggered automatically as the tables grow? A colleague swears that there's some way to tell the planner not to trust the statistics, so it will make plans using indexes even when it doesn't know that's a good idea; can anyone shed any light on that? I believe he means you should switch to the RBO (rule based optimizer) - which is ancient. IIRC you can enable it per session. |
#3
| |||
| |||
|
#4
| |||
| |||
|
#5
| |||
| |||
|
#6
| |||
| |||
|
|
Gerard: # I've probably missed something, otherwise there would be a very simple solution: drop all constraints (eg. foreign keys) and indexes before the load, and, if necessary, create them afterwards. *That's always been the fastest way to load data. Works well for non partitioned tables and you have the option of dropping or just disabling the foreign keys depending on exactly how they were setup initially. Of course if you need concurrent access to the table(s) while loading it gets more involved. *Typically I would consider getting all the stuff into a staging table ... The volume of data and the timing of handling changes to it is critical to coming up with a complete understanding of what might potentially work. |
#7
| |||
| |||
|
|
On Feb 20, 5:00*pm, John Hurley <hurleyjo... (AT) yahoo (DOT) com> wrote: Gerard: # I've probably missed something, otherwise there would be a very simple solution: drop all constraints (eg. foreign keys) and indexes before the load, and, if necessary, create them afterwards. *That's always been the fastest way to load data. Works well for non partitioned tables and you have the option of dropping or just disabling the foreign keys depending on exactly how they were setup initially. Of course if you need concurrent access to the table(s) while loading it gets more involved. *Typically I would consider getting all the stuff into a staging table ... The volume of data and the timing of handling changes to it is critical to coming up with a complete understanding of what might potentially work. What is the volume of data? If you are on 11gR2, look at exchange table and external tables and "insert into realtable select * from externaltable". This actually uses SQLLOADER under the sheets and can be blazingly fast. Bottlenecks are typically the "mappers" you described. |
|
Just curious, but when doing a data "merge" how do you determine which column get updated and which columns stay the same? You would almost need a timestamp column for each data column... and that would be a real pain. |
|
As for the previous reply to drop constraints etc, that can be very dangerous if you depend on that referential integrity. Indexes (except PK) maybe, but the thought of having to potentially clean up data so you can re-add those constraints is not a very pleasant thought. You would need to **guarantee** that the referential integrity was completed somewhere upstream - which may negate the need for it at the db level. |
![]() |
| Thread Tools | |
| Display Modes | |
| |