dbTalk Databases Forums  

sp_primarykey vs. unique clustered index

comp.databases.sybase comp.databases.sybase


Discuss sp_primarykey vs. unique clustered index in the comp.databases.sybase forum.



Reply
 
Thread Tools Display Modes
  #11  
Old   
1dg618
 
Posts: n/a

Default Re: sp_primarykey vs. unique clustered index - 10-21-2011 , 10:19 AM






How does APL, DPL, or DRL factor into the above? I'd make my table
APL?

CI (CustomerId, OrderSaleNo, PartId) --> the records would be in a
sorted order to keep it simple
NCI (PartId, CustomerId, OrderSaleNo) --> the records would be in
some order most likely using the clustered index?

so the foolish person would make the NCI (PartId)? --> there would be
no order therefore the optimizer would avoid using the index? Can you
explain why this would be bad. I think, most people would do it? I
think, in the past, I have done it too thinking it would help
performance, but nobody could explain why it's bad.

If I had NCI (PartId, CustomerId), it would be better than NCI
(PartId)... the optimizer may not use it, but it's a step towards
uniqueness.

One shouldn't worry about the overhead or do the two play well
together?

I'm starting to see the light.

I'm coming from the relational algebra/calculus side. Database design
seems to be drifting away from relationships into some fuzzy
relationship model, which isn't a relationship at all. I know it's
hard for database vendors to adhere to relational algebra because they
have to implement it meaning it's easier on paper than developing a
database server.

I always thought if your data is truly relational, you'd have amazing
performance?

Reply With Quote
  #12  
Old   
Derek Asirvadem
 
Posts: n/a

Default Re: sp_primarykey vs. unique clustered index - 10-21-2011 , 09:17 PM






On Oct 22, 2:19*am, 1dg618 <1dg... (AT) gmail (DOT) com> wrote:
Quote:
How does APL, DPL, or DRL factor into the above? I'd make my table
APL?
If you study the links I provided, you will get the background &
detail, the reasoning. Otherwise I have to re-type everything here.
After all the arguments are had, and the smoke clears, it boils down
to one thing:

APL tables are much more suited for Relational databases, range
queries, composite keys.

DPL/DRL tables are much more suited for record filing systems, no
range queries, surrogate keys.

Quote:
CI * *(CustomerId, OrderSaleNo, PartId) --> the records would be in a
sorted order to keep it simple
Yes

Quote:
NCI *(PartId, CustomerId, OrderSaleNo) --> the records would be in
some order most likely using the clustered index?
Not sure what you mean. If the table has a clustered index (eg.
immediately above, then that is the explicit order of the rows. The
NCI does not change that. The NCI is a B-Tree. All *index entries*
are in the order of the index; here it is (PartId, CustomerId,
OrderSaleNo). The leaf level entries in the NCI hold the CI key; that
is what they use to find the row in the CI. This allows the CI to be
keep trim: when rows are deleted, they are moved on the page, and the
page is trim (no holes); the pages in the extents are kept trim,
etc. But these movements do not affect the NCIs, because they hold
the CI key, not a RowId.

So the rows (living in the CI) are maintained in CI order; the NCI is
in NCI, and containing the CI key.


Quote:
so the foolish person would make the NCI (PartId)? --> there would be
no order therefore the optimizer would avoid using the index?
Read again, I did not say that (PartId) has no order, it has, the
PartId. I said that that is not unique. Tens of thousands of unique
PartIds distributed across millions of OrderSaleItems. The uniqueness
is expressed as a density ratio; it is low, 0.0007. An NCI with
(PartId) alone; say 20,000 across 29,000,000 is not very useful. That
is what nebie (as opposed to foolish) DBAs would do "I am queries on
OrderSaleitem thatfrequently use PartId, so I will add an
NCI(PartId). Then they will spend years trying to figure out why the
NCI does not get used for queries that use (PartId).

But the seasoned DBA who has been through the hoops a few times, will
know that an NCI gets used, the Optimiser finds it useful, when the
NCI is unique, with a density of close to 1.0. So he will add columns
to the NCI to make it unique, knowing that he and the optimiser are
actually interested in just the first column (PartId,...). He will
make the NCI (PartId, CustomerId, OrderSaleNo). Those three cols, are
in the CI which is unique, he has changed the order for a set of
columns that are known to be unique.

Quote:
Can you
explain why this would be bad. I think, most people would do it? I
think, in the past, I have done it too thinking it would help
performance, but nobody could explain why it's bad.
See above. It is a common problem, more of a lack of knowledge than an
outright mistake. it is one thing that I routinely correct on every
P&T assignment, it guaranteed to get happy campers.

Quote:
If I had NCI *(PartId, CustomerId), it would be better than NCI
(PartId)... the optimizer may not use it, but it's a step towards
uniqueness.
Yes, exactly.

And NCI (PartId, CustomerId, OrderSaleNo), is definitely unique, and
guaranteed to be used. In this case, let's say the CI was used for
80% of the queries, then 20% would be table scans when there is:
- No NCI
- NCI (PartId)
- NCI (PartId, CustomerId)

That 20% table scans will change to say 18% index scans when there is:
- NCI (PartId, CustomerId, OrderSaleNo)

Quote:
One shouldn't worry about the overhead or do the two play well
together?
Not sure which OH you are talking about. The NCI is demanded, so the
consideration is between a single-column NCI and a 3-column NCI. The
overhead of an index is considerable. Once you have the index, the
difference between 1 and 3 columns is tiny (except when the columns
are fat/wide/nullable).

The two (CI and useful NCI) go hand in hand, left and right.

Just for your to consider the equivalent in a record filing system,
surrogate keys:
- the surrogate is always an additional column, an additional index,
say (OrderSaleItemId), and they always make that the Placement Index.
Meaningless, because it is the chronological order, which is what the
Heap has anyway.
- they get no data distribution
- REORG REBUILD is
- They need NCI_2(CustomerId) for most queries
- they still need NCI_3(OrderSaleNo),
- and NCI_4(PartId)
- a total of four NCIs,
- no clustering, no range queries
- the slightly faster inserts/deletes have to be balanced against the
demanded REORG REBUILD maintenance (basically offline), and additional
indices
- Plans will show that NCI(OrderSaleNo) is the least used, and they
can drop it


Quote:
I'm starting to see the light.

I'm coming from the relational algebra/calculus side. Database design
seems to be drifting away from relationships into some fuzzy
relationship model, which isn't a relationship at all.
Yes, IT has deteriorated badly, and we now have florists and
gyprockers doing IT work, with no qualifications. The better ones
think they have invented something new, and publish papers about it;
the rest think, wow, and use it. None of them realise that there is
an art to it, and that art is founded in an exact science.

There is no name for thi s"model", it is just un-normalised, often
fraudulently called de-normalised "for performance". Hilarious. De-
normalised implies that (a) Normalisation has been completed, (b)
correctly, and then (c) backed off. These record filing system never
had the normalisation, to be subsequently de-normalised.

Quote:
I know it's
hard for database vendors to adhere to relational algebra because they
have to implement it meaning it's easier on paper than developing a
database server.
Not really. SQL as a language is cumbersome and restricting, and the
progress is very slow. But the big vendors all have Extensions, and
they progress quite fast. DB2 *& Sybase have done a lot to improve
their optimisers, so that even bad queries get optimised well. They
have implemented what I would categorise as Extensions to Codd &
Date's Relational Model.

What we need is a language that accepts relational algebra as source
code, perhaps to sit on top of SQL, since that is the established
base, and it works through all the levels of the system, eliminating
the need to write drivers, etc.

So even without a successor to the RM, and without a modern Rdb
language, the Rdb space has progressed quite a lot.

The biggest problem is not either of those. It is that, in spite of
all that databases can be, the majority of databases implemented are
these tiny isolated record filing systems, that use about 5% of the
Relational capability and about 7% of SQL and the vendors extensions.
Eg. you find very few true corporate db models, they are container-
ised and split into 42 parts. Then the silly buggers create
additional DWs, the import/export utilities, replicate everything.
The technology has progressed but the implementations are still
pre-1984.

Quote:
I always thought if your data is truly relational, you'd have amazing
performance?
Abso-freaking-lutely. I guarantee it in my contracts. I regularly
get 10-12 times (1000 to 1200%) overall improvement in speed (without
changing the box, the o/s, the Sybase version). I get 100 to 1,000s
of times improvement on specific queries, due to correcting the
indexing.

Truly Relational implies Normalised, however, for the reasons stated
above, let me make that explicit. I use 5NF minimum, also identified
in the contract. That means no "update anomalies", zero data
duplication. I do not allow nulls in the Rdb. I use 6NF as needed, to
provide even more speed, functions and capability. of course
normalisation means rewriting the daatabase and thsu rewriting the app
and reports. Now that more people are producing garbage, and more
people are realising the value of normalisation, this type of
assignment is actually increasing.

Regards
Derek

Reply With Quote
  #13  
Old   
1dg618
 
Posts: n/a

Default Re: sp_primarykey vs. unique clustered index - 10-22-2011 , 06:44 PM



What is the difference between the CONSTRAINT / PRIMARY KEY CLUSTERED
in the CREATE TABLE...

CREATE TABLE my_table
(
one_id INT NOT NULL,
two_id INT NOT NULL,
date DATETIME NOT NULL,
source VARCHAR(8) NOT NULL,
CONSTRAINT my_table_pkc_1 PRIMARY KEY CLUSTERED ( date, one_id,
two_id, source ) on 'default'
)
LOCK allpages ON 'default' partition BY roundrobin 1
GO

and creating the UNIQUE CLUSTERED INDEX seperately?

CREATE UNIQUE CLUSTERED INDEX my_table_2 ON my_table
(
date,
one_id,
two_id,
source
)
GO

Are there performance hits?

Would adding a non-clustered index build off of the clustered index?

CREATE NONCLUSTERED INDEX my_table_3 ON my_table
(
one_id,
date,
two_id,
source
)
GO

Does the CLUSTERED index build the foundation for non-clustered
indexes?

I'm using APL.

I created the table and I'm loading data to test it. It seems to be
taking a long time.

Reply With Quote
  #14  
Old   
1dg618
 
Posts: n/a

Default Re: sp_primarykey vs. unique clustered index - 10-26-2011 , 07:55 AM



(1) - Clustered Index: date, first_id, second_id

(2) - Clustered Index: first_id, date, second_id

(3) - Clustered Index: second_id, first_id, date

date: low cardinality

first_id: med. cardinality

second_id: high cardinality

~700 million records in the table

Oracle doesn't care about the order in the unique clustered index.

Sybase seems to care about the order in the unique clustered index.
I'll report more about Sybase after today. My table of about 700
million rows is allpages lock, (constraint) primary key clustered
index (date, first_id, second_id).

Pretty darn quick no date range

Pretty darn quick when I have a date range in my where clause (e.g.
date >= ? and date <= ?)

My other table containing the same data is datapages lock and a unique
clustered index on first_id, second_id, date.

pretty darn quick no date range

Pretty darn slow compared to the other table when I have a date range
in my where clause (e.g. date >= ? and date <= ?)

I loaded the table where the date range query is pretty slow in sorted
order.

I'll try to figure it out today, unless someone knows?

Reply With Quote
  #15  
Old   
Antony
 
Posts: n/a

Default Re: sp_primarykey vs. unique clustered index - 10-26-2011 , 04:15 PM



I think you need to give us the exact SQL you are using, as it doesn't seem obvious (at least to me) what you're really comparing.
Also, can you tell us when the DOL table was last reorg'd? In a DOL table, a clustered index isn't truly clustered, and the efficiency of a range query will decrease over time as rows get deleted/forwarded etc.

Reply With Quote
  #16  
Old   
Derek Asirvadem
 
Posts: n/a

Default Re: sp_primarykey vs. unique clustered index - 10-29-2011 , 12:54 AM



On Oct 23, 10:44*am, 1dg618 <1dg... (AT) gmail (DOT) com> wrote:

Quote:
What is the difference between the CONSTRAINT / PRIMARY KEY CLUSTERED
in the CREATE TABLE...

CREATE TABLE my_table
(
* * * * one_id * * * * INT * * * * * * * NOT NULL,
* * * * two_id * * * * INT * * * * * * * NOT NULL,
* * * * date * * * * * DATETIME * * * * *NOT NULL,
* * * * source * * * * VARCHAR(8) * * * *NOT NULL,
* * * * CONSTRAINT UC_PK PRIMARY KEY CLUSTERED ( date, one_id,
two_id, source ) on 'default'
)
LOCK allpages ON 'default' partition BY roundrobin 1
That is the ANSI/ISO/IEC syntax. It results in an index being created,
which is the same as:
* NOT NULL on all columns
CREATE CLUSTERED INDEX on my_table
ON (date, one_id, two_id, source ) on 'default'

Quote:
and creating the UNIQUE CLUSTERED INDEX seperately?

CREATE UNIQUE CLUSTERED INDEX UC_PK ON my_table
(
* date,
* one_id,
* two_id,
* source
)
Nothing. (I have corrected your index names for clarity)

Quote:
Are there performance hits?
No, because there is no difference.

Quote:
Would adding a non-clustered index build off of the clustered index?
I am not sure what you mean by "build off". The CI Key (your four
columns) forms the leaf level of the CI, the B-Tree is tiny. Rather
than keep the index physically separate from the data, the Sybase AP
CI CLUSTERS the index leaf-level with the data ROW; the heap is
eliminated. So you get a good, sparse, distributed index and data all-
in-one. It is kept very trim, as is, the rows within pages; pages
with extents, etc. Obviously the rows are in physical order by key.

Any NCI for such a CI, contains the CI Key as the leaf-level to the CI-
data. This allows the rows to be moved without having to update the
NCIs.

Whereas in a DPL/DRL table, the heap always remains, and the NCIs are
always physically separate structures. The leaf-level in those NCIs
are a ROW_ID, not a CI key (they cannot have CIs)

Quote:
CREATE NONCLUSTERED INDEX NC_1 ON my_table
(
* one_id,
* date,
* two_id,
* source
)

Does the CLUSTERED index build the foundation for non-clustered
indexes?
I think my expalanation above should answer that. If what you call
"foundation", is a single, CI structure, containing Index+data, very
fast, and sparse, then Yes.

Quote:
I created the table and I'm loading data to test it. It seems to be
taking a long time.
1. Well, use one of the faster techniques. SELECT-INTO, vectors, BCP,
etc.

2. "A long time" is a relative term, what are you comparing it with,
and identical DPL/DRL table ? Traffic at 5:00 pm ? at 2:00 am ?

On Oct 26, 11:55*pm, 1dg618 <1dg... (AT) gmail (DOT) com> wrote:

Quote:
(1) - Clustered Index: date, first_id, second_id

(2) - Clustered Index: first_id, date, second_id

(3) - Clustered Index: second_id, first_id, date

date: low cardinality

first_id: med. cardinality

second_id: high cardinality

~700 million records in the table

Oracle doesn't care about the order in the unique clustered index.
That's because the Oracle clustered index is completely different to
the Sybase CI. The Oracle IOT is more like the Sybase CI.

Quote:
Sybase seems to care about the order in the unique clustered index.
Yes, I believe I already explained that.

Sybase actually does not care. But the order will make a difference
based on the query that you give Sybase to process.

Quote:
I'll report more about Sybase after today. My table of about 700
million rows is allpages lock, (constraint) primary key clustered
index (date, first_id, second_id).
SO I am not sure what your question is, it appears you have (1), but
you would like to know abut (2) and (3). Or have you been trying (2)
and (3) on your table.

If you have many interspersed inserts, do not forget to drop/create
the CI.

Quote:
Pretty darn quick no date range
Guaranteed, as per my explanation.

Quote:
Pretty darn quick when I have a date range in my where clause (e.g.
date >= ? and date <= ?)
That is a Range Query, yes.

Quote:
My other table containing the same data is datapages lock and a unique
clustered index on (first_id, second_id, date).
Ain't no such thing, as explained in detail. The syntax required may
confuse you into thinking you have a "clustered index". You don't.
It is a plain old NCI.

Quote:
pretty darn quick no date range
Don't know what the sargs are, but, Ok.

Quote:
Pretty darn slow compared to the other table when I have a date range
in my where clause (e.g. date >= ? and date <= ?)
Guaranteed, as well. As I have detailed in my explanation, DPL/DRL
tables cannot support range queries the way APL/CI can.

SET STATISTICS IO ON, and get an idea of hwat it is doing.

Also, read this thread again, please.

Also, read Part A of the second linked doc above.

Quote:
I loaded the table where the date range query is pretty slow in sorted
order.
It is getting a bit confusing. Perhaps you can give the tables stable
names, so that we know which one you mean. I presume you mean the DPL
table, with ~700 million rows. You may need to clear that up, build
the Placement index without SORTED_ORDER.

Quote:
I'll try to figure it out today, unless someone knows?
Figure what out ? The question is not stated.

Regards
Derek

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.