dbTalk Databases Forums  

Re: Statistics during a complex load

comp.databases.oracle.server comp.databases.oracle.server


Discuss Re: Statistics during a complex load in the comp.databases.oracle.server forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Robert Klemme
 
Posts: n/a

Default Re: Statistics during a complex load - 02-18-2011 , 02:07 PM






On 16.02.2011 20:24, Tom Anderson wrote:

Quote:
We have an application that stores all its information in a database
(Oracle 11, as it happens). Between the application and the database
sits a third-party object-relational mapper. And not just any sort of
object-relational mapper, but one that implements a versioned object
model - think of it as source control for objects. A fully operational
instance of the application has about 300 000 objects in the database
(according to one way of counting them, at least), corresponding to a
somewhat larger number of rows (less than a million, i'd guess), spread
over a few hundred tables.

We like building fresh instances from scratch. We do it as part of our
build process, along with compiling the code and so on, to make sure
that we can always build a working system from the raw materials in CVS.
This process involves clearing out everything in the schema (dropping
every object and then purging the recyclebin), running all our DDL, and
then loading the 300 000 objects through the versioned mapper.

This process is not as fast as we'd like it to be - the DDL is fast, but
the data loading takes something like 45 minutes.

Let's say, for the sake of argument if nothing else, that we cannot
abandon building from scratch, or avoid the versioned mapper, or do
anything about the speed or behaviour of the mapper itself.
Well, OR-Mappers are typically not very fast since they tend to handle
every single instance individually unless there is an explicit batch mode.

Quote:
What could we try to make the load faster?
If your OR-Mapper has a means of verifying a database's data it may be
quicker to data pump the data into the database and then check but I
guess this is not what you want.

Another variant may be to split up your dataset into independent parts
and do multiple concurrent loads.

Quote:
I don't have detailed statistics to hand, but one thing we've noticed is
that there is a very high ratio of selects to upserts during the load;
there are numerous (>10) queries made concerning each object before it
is inserted - i think the mapper is checking to see if there is an
existing version of the object, then inserting the loaded version,
performing some sort of merge or check-in operation, etc.

We're in the process of tuning our indexes to make them more useful
during the load, but a major concern is that the query planner is not
making use of them - we quite often see that the most time-consuming
queries are ones which should be able to make good use of an index, but
are being planned as table scans. Our theory is that this is because the
statistics for the tables have not been updated since they were created,
and so it looks like they're empty, even as they grow to some vast size.
Does that seem plausible?
Yes.

Quote:
If so, what can we do about it? We could do a
bit of bodging to run dbms_stats.gather_table_stats periodically during
the load; would that be sensible?
I'd rather take stats after the load, save them somewhere and load them
back after you have recreated your schema. Another option might be to
use 11g's new features for plan stability but I can't help you there.
There are also some pre 11g features like "stored outlines" which may help.

But maybe it's first reasonable to do a trace in order to find out why
the load is slow. If you determine that there are some slow queries
which use bad plans you need to tune them. If you find out it's the
massive amount of SQL statements fired against the DB the you need to
tune that (e.g. faster network, more CPU power on the server...).

Quote:
Is there some way we can arrange for
statistics updates to be triggered automatically as the tables grow? A
colleague swears that there's some way to tell the planner not to trust
the statistics, so it will make plans using indexes even when it doesn't
know that's a good idea; can anyone shed any light on that?
I believe he means you should switch to the RBO (rule based optimizer) -
which is ancient. IIRC you can enable it per session.

Docs are on

http://tahiti.oracle.com/

Kind regards

robert

F'up to comp.databases.oracle.server

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Reply With Quote
  #2  
Old   
Tom Anderson
 
Posts: n/a

Default Re: Statistics during a complex load - 02-20-2011 , 07:42 AM






On Fri, 18 Feb 2011, Robert Klemme wrote:

Quote:
On 16.02.2011 20:24, Tom Anderson wrote:

We have an application that stores all its information in a database
(Oracle 11, as it happens). Between the application and the database
sits a third-party object-relational mapper. And not just any sort of
object-relational mapper, but one that implements a versioned object
model - think of it as source control for objects. A fully operational
instance of the application has about 300 000 objects in the database
(according to one way of counting them, at least), corresponding to a
somewhat larger number of rows (less than a million, i'd guess), spread
over a few hundred tables.

We like building fresh instances from scratch. We do it as part of our
build process, along with compiling the code and so on, to make sure
that we can always build a working system from the raw materials in CVS.
This process involves clearing out everything in the schema (dropping
every object and then purging the recyclebin), running all our DDL, and
then loading the 300 000 objects through the versioned mapper.

This process is not as fast as we'd like it to be - the DDL is fast, but
the data loading takes something like 45 minutes.

Let's say, for the sake of argument if nothing else, that we cannot
abandon building from scratch, or avoid the versioned mapper, or do
anything about the speed or behaviour of the mapper itself.

Well, OR-Mappers are typically not very fast since they tend to handle every
single instance individually unless there is an explicit batch mode.
True. I don't believe this mapper has a batch mode, BICBW.

Quote:
What could we try to make the load faster?

If your OR-Mapper has a means of verifying a database's data it may be
quicker to data pump the data into the database and then check but I
guess this is not what you want.
I'm afraid not. The source of the data is XML files, produced by hand or
by some data-migration tools we've written to pull data in from other
sources, and checked into source control. The mapper turns these files
into objects, and then puts them in the database. The table layout changes
fairly quickly (we add columns and tables to hold new information), but
because the mapper sits between the files and the tables, we don't have to
touch the files themselves when we make schema changes (mostly). Using a
table-level dump would, i believe, lose that rather nice feature.

Nonetheless, this is something we should think about more. It may turn out
that the work of regenerating dumps on schema change may be less than the
work of rebuilding the entire database from scratch through the mapper.
Hmm.

Quote:
Another variant may be to split up your dataset into independent parts
and do multiple concurrent loads.
We're already doing that!

Quote:
I don't have detailed statistics to hand, but one thing we've noticed is
that there is a very high ratio of selects to upserts during the load;
there are numerous (>10) queries made concerning each object before it
is inserted - i think the mapper is checking to see if there is an
existing version of the object, then inserting the loaded version,
performing some sort of merge or check-in operation, etc.

We're in the process of tuning our indexes to make them more useful
during the load, but a major concern is that the query planner is not
making use of them - we quite often see that the most time-consuming
queries are ones which should be able to make good use of an index, but
are being planned as table scans. Our theory is that this is because the
statistics for the tables have not been updated since they were created,
and so it looks like they're empty, even as they grow to some vast size.
Does that seem plausible?

Yes.

If so, what can we do about it? We could do a bit of bodging to run
dbms_stats.gather_table_stats periodically during the load; would that
be sensible?

I'd rather take stats after the load, save them somewhere and load them back
after you have recreated your schema.
In the time since i asked this question, some other guys on my team have
also come up with this idea. I'm a little bit dubious about it, because
the statistics will be for an earlier version of the built database, which
may not have exactly the same structure or content as the one being built.
But still, they should be similar enough that the hand-me-down statistics
are useful.

Quote:
Another option might be to use 11g's new features for plan stability but
I can't help you there. There are also some pre 11g features like
"stored outlines" which may help.
Thanks, i'll look those up.

Quote:
But maybe it's first reasonable to do a trace in order to find out why
the load is slow. If you determine that there are some slow queries
which use bad plans you need to tune them. If you find out it's the
massive amount of SQL statements fired against the DB the you need to
tune that (e.g. faster network, more CPU power on the server...).
There do seem to be bad plans. Queries which should be planned as index
scans are being planned as table scans. We think this is because the
statistics indicate the table is small enough that a table scan is
preferable.

Quote:
Is there some way we can arrange for statistics updates to be triggered
automatically as the tables grow? A colleague swears that there's some
way to tell the planner not to trust the statistics, so it will make
plans using indexes even when it doesn't know that's a good idea; can
anyone shed any light on that?

I believe he means you should switch to the RBO (rule based optimizer) -
which is ancient. IIRC you can enable it per session.
Okay. Worth looking into, at least.

Thanks for your help.

tom

--
Tech - No Babble

Reply With Quote
  #3  
Old   
John Hurley
 
Posts: n/a

Default Re: Statistics during a complex load - 02-20-2011 , 12:27 PM



Tom:

# Is there some way we can arrange for statistics updates to be
triggered automatically as the tables grow? A colleague swears that
there's some way to tell the planner not to trust the statistics, so
it will make plans using indexes even when it doesn't know that's a
good idea; can anyone shed any light on that?

That's a deep and complicated subject. One option is to delete the
table statistics with cascade set to true ( the default ) so that
index stats also are ditched ... then use dynamic sampling.

Lots of variations there between letting Oracle make decisions on when
to gather statistics or periodically having an algorithm run that
makes a decision for you about new statistics.

Reply With Quote
  #4  
Old   
Gerard H. Pille
 
Posts: n/a

Default Re: Statistics during a complex load - 02-20-2011 , 01:35 PM



I've probably missed something, otherwise there would be a very simple solution: drop all
constraints (eg. foreign keys) and indexes before the load, and, if necessary, create them
afterwards. That's always been the fastest way to load data.

Reply With Quote
  #5  
Old   
John Hurley
 
Posts: n/a

Default Re: Statistics during a complex load - 02-20-2011 , 04:00 PM



Gerard:

# I've probably missed something, otherwise there would be a very
simple solution: drop all constraints (eg. foreign keys) and indexes
before the load, and, if necessary, create them afterwards. *That's
always been the fastest way to load data.

Works well for non partitioned tables and you have the option of
dropping or just disabling the foreign keys depending on exactly how
they were setup initially.

Of course if you need concurrent access to the table(s) while loading
it gets more involved. Typically I would consider getting all the
stuff into a staging table ...

The volume of data and the timing of handling changes to it is
critical to coming up with a complete understanding of what might
potentially work.

Reply With Quote
  #6  
Old   
onedbguru
 
Posts: n/a

Default Re: Statistics during a complex load - 02-20-2011 , 07:59 PM



On Feb 20, 5:00*pm, John Hurley <hurleyjo... (AT) yahoo (DOT) com> wrote:
Quote:
Gerard:

# I've probably missed something, otherwise there would be a very
simple solution: drop all constraints (eg. foreign keys) and indexes
before the load, and, if necessary, create them afterwards. *That's
always been the fastest way to load data.

Works well for non partitioned tables and you have the option of
dropping or just disabling the foreign keys depending on exactly how
they were setup initially.

Of course if you need concurrent access to the table(s) while loading
it gets more involved. *Typically I would consider getting all the
stuff into a staging table ...

The volume of data and the timing of handling changes to it is
critical to coming up with a complete understanding of what might
potentially work.
What is the volume of data? If you are on 11gR2, look at exchange
table and external tables and "insert into realtable select * from
externaltable". This actually uses SQLLOADER under the sheets and can
be blazingly fast. Bottlenecks are typically the "mappers" you
described.

Just curious, but when doing a data "merge" how do you determine which
column get updated and which columns stay the same? You would almost
need a timestamp column for each data column... and that would be a
real pain.

As for partitioning - look at interval partitions (also new in 11gR2)
where there is no need to pre-create the partitions - this is really
cool. In our testing we also discovered that even if we added a zero
statistic value to the new partitions, performance on queries was more
than acceptable.

As for the speed, I would look at something like use the external
table definition, then run an insert/update procedure (that would do
all of the checking using a join rather than point select
statements.) Of course, when you have to "merge" data, that can be a
REAL pain.

As for the previous reply to drop constraints etc, that can be very
dangerous if you depend on that referential integrity. Indexes (except
PK) maybe, but the thought of having to potentially clean up data so
you can re-add those constraints is not a very pleasant thought. You
would need to **guarantee** that the referential integrity was
completed somewhere upstream - which may negate the need for it at the
db level.

Reply With Quote
  #7  
Old   
Tom Anderson
 
Posts: n/a

Default Re: Statistics during a complex load - 02-25-2011 , 10:57 AM



On Sun, 20 Feb 2011, onedbguru wrote:

Quote:
On Feb 20, 5:00*pm, John Hurley <hurleyjo... (AT) yahoo (DOT) com> wrote:
Gerard:

# I've probably missed something, otherwise there would be a very
simple solution: drop all constraints (eg. foreign keys) and indexes
before the load, and, if necessary, create them afterwards. *That's
always been the fastest way to load data.

Works well for non partitioned tables and you have the option of
dropping or just disabling the foreign keys depending on exactly how
they were setup initially.

Of course if you need concurrent access to the table(s) while loading
it gets more involved. *Typically I would consider getting all the
stuff into a staging table ...

The volume of data and the timing of handling changes to it is
critical to coming up with a complete understanding of what might
potentially work.

What is the volume of data? If you are on 11gR2, look at exchange
table and external tables and "insert into realtable select * from
externaltable". This actually uses SQLLOADER under the sheets and can
be blazingly fast. Bottlenecks are typically the "mappers" you
described.
The use of the mapper is definitely the bottleneck here. But the use of
the mapper is largely non-negotiable, i'm afraid. I know exactly what
reaction i'd get if i walked up to my project manager and suggested
ditching it and doing our dataload in pure SQL.

Quote:
Just curious, but when doing a data "merge" how do you determine which
column get updated and which columns stay the same? You would almost
need a timestamp column for each data column... and that would be a real
pain.
We don't ever do a merge like that. But the mapper has to check to make
sure that we're not. If we did, it would punt to its merge-handling
process, which in the interactive system is a fancy web application, and
in the load is an error!

Quote:
As for the previous reply to drop constraints etc, that can be very
dangerous if you depend on that referential integrity. Indexes (except
PK) maybe, but the thought of having to potentially clean up data so you
can re-add those constraints is not a very pleasant thought. You would
need to **guarantee** that the referential integrity was completed
somewhere upstream - which may negate the need for it at the db level.
The problem is that the mapper does a lot of queries during the load.
Going without the indexes is a complete non-starter.

tom

--
Who would you help in a fight, Peter van der Linden or Bill Gates?

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.