On Oct 22, 2:19*am, 1dg618 <1dg... (AT) gmail (DOT) com> wrote:
Quote:
How does APL, DPL, or DRL factor into the above? I'd make my table
APL? |
If you study the links I provided, you will get the background &
detail, the reasoning. Otherwise I have to re-type everything here.
After all the arguments are had, and the smoke clears, it boils down
to one thing:
APL tables are much more suited for Relational databases, range
queries, composite keys.
DPL/DRL tables are much more suited for record filing systems, no
range queries, surrogate keys.
Quote:
CI * *(CustomerId, OrderSaleNo, PartId) --> the records would be in a
sorted order to keep it simple |
Yes
Quote:
NCI *(PartId, CustomerId, OrderSaleNo) --> the records would be in
some order most likely using the clustered index? |
Not sure what you mean. If the table has a clustered index (eg.
immediately above, then that is the explicit order of the rows. The
NCI does not change that. The NCI is a B-Tree. All *index entries*
are in the order of the index; here it is (PartId, CustomerId,
OrderSaleNo). The leaf level entries in the NCI hold the CI key; that
is what they use to find the row in the CI. This allows the CI to be
keep trim: when rows are deleted, they are moved on the page, and the
page is trim (no holes); the pages in the extents are kept trim,
etc. But these movements do not affect the NCIs, because they hold
the CI key, not a RowId.
So the rows (living in the CI) are maintained in CI order; the NCI is
in NCI, and containing the CI key.
Quote:
so the foolish person would make the NCI (PartId)? --> there would be
no order therefore the optimizer would avoid using the index? |
Read again, I did not say that (PartId) has no order, it has, the
PartId. I said that that is not unique. Tens of thousands of unique
PartIds distributed across millions of OrderSaleItems. The uniqueness
is expressed as a density ratio; it is low, 0.0007. An NCI with
(PartId) alone; say 20,000 across 29,000,000 is not very useful. That
is what nebie (as opposed to foolish) DBAs would do "I am queries on
OrderSaleitem thatfrequently use PartId, so I will add an
NCI(PartId). Then they will spend years trying to figure out why the
NCI does not get used for queries that use (PartId).
But the seasoned DBA who has been through the hoops a few times, will
know that an NCI gets used, the Optimiser finds it useful, when the
NCI is unique, with a density of close to 1.0. So he will add columns
to the NCI to make it unique, knowing that he and the optimiser are
actually interested in just the first column (PartId,...). He will
make the NCI (PartId, CustomerId, OrderSaleNo). Those three cols, are
in the CI which is unique, he has changed the order for a set of
columns that are known to be unique.
Quote:
Can you
explain why this would be bad. I think, most people would do it? I
think, in the past, I have done it too thinking it would help
performance, but nobody could explain why it's bad. |
See above. It is a common problem, more of a lack of knowledge than an
outright mistake. it is one thing that I routinely correct on every
P&T assignment, it guaranteed to get happy campers.
Quote:
If I had NCI *(PartId, CustomerId), it would be better than NCI
(PartId)... the optimizer may not use it, but it's a step towards
uniqueness. |
Yes, exactly.
And NCI (PartId, CustomerId, OrderSaleNo), is definitely unique, and
guaranteed to be used. In this case, let's say the CI was used for
80% of the queries, then 20% would be table scans when there is:
- No NCI
- NCI (PartId)
- NCI (PartId, CustomerId)
That 20% table scans will change to say 18% index scans when there is:
- NCI (PartId, CustomerId, OrderSaleNo)
Quote:
One shouldn't worry about the overhead or do the two play well
together? |
Not sure which OH you are talking about. The NCI is demanded, so the
consideration is between a single-column NCI and a 3-column NCI. The
overhead of an index is considerable. Once you have the index, the
difference between 1 and 3 columns is tiny (except when the columns
are fat/wide/nullable).
The two (CI and useful NCI) go hand in hand, left and right.
Just for your to consider the equivalent in a record filing system,
surrogate keys:
- the surrogate is always an additional column, an additional index,
say (OrderSaleItemId), and they always make that the Placement Index.
Meaningless, because it is the chronological order, which is what the
Heap has anyway.
- they get no data distribution
- REORG REBUILD is
- They need NCI_2(CustomerId) for most queries
- they still need NCI_3(OrderSaleNo),
- and NCI_4(PartId)
- a total of four NCIs,
- no clustering, no range queries
- the slightly faster inserts/deletes have to be balanced against the
demanded REORG REBUILD maintenance (basically offline), and additional
indices
- Plans will show that NCI(OrderSaleNo) is the least used, and they
can drop it
Quote:
I'm starting to see the light.
I'm coming from the relational algebra/calculus side. Database design
seems to be drifting away from relationships into some fuzzy
relationship model, which isn't a relationship at all. |
Yes, IT has deteriorated badly, and we now have florists and
gyprockers doing IT work, with no qualifications. The better ones
think they have invented something new, and publish papers about it;
the rest think, wow, and use it. None of them realise that there is
an art to it, and that art is founded in an exact science.
There is no name for thi s"model", it is just un-normalised, often
fraudulently called de-normalised "for performance". Hilarious. De-
normalised implies that (a) Normalisation has been completed, (b)
correctly, and then (c) backed off. These record filing system never
had the normalisation, to be subsequently de-normalised.
Quote:
I know it's
hard for database vendors to adhere to relational algebra because they
have to implement it meaning it's easier on paper than developing a
database server. |
Not really. SQL as a language is cumbersome and restricting, and the
progress is very slow. But the big vendors all have Extensions, and
they progress quite fast. DB2 *& Sybase have done a lot to improve
their optimisers, so that even bad queries get optimised well. They
have implemented what I would categorise as Extensions to Codd &
Date's Relational Model.
What we need is a language that accepts relational algebra as source
code, perhaps to sit on top of SQL, since that is the established
base, and it works through all the levels of the system, eliminating
the need to write drivers, etc.
So even without a successor to the RM, and without a modern Rdb
language, the Rdb space has progressed quite a lot.
The biggest problem is not either of those. It is that, in spite of
all that databases can be, the majority of databases implemented are
these tiny isolated record filing systems, that use about 5% of the
Relational capability and about 7% of SQL and the vendors extensions.
Eg. you find very few true corporate db models, they are container-
ised and split into 42 parts. Then the silly buggers create
additional DWs, the import/export utilities, replicate everything.
The technology has progressed but the implementations are still
pre-1984.
Quote:
I always thought if your data is truly relational, you'd have amazing
performance? |
Abso-freaking-lutely. I guarantee it in my contracts. I regularly
get 10-12 times (1000 to 1200%) overall improvement in speed (without
changing the box, the o/s, the Sybase version). I get 100 to 1,000s
of times improvement on specific queries, due to correcting the
indexing.
Truly Relational implies Normalised, however, for the reasons stated
above, let me make that explicit. I use 5NF minimum, also identified
in the contract. That means no "update anomalies", zero data
duplication. I do not allow nulls in the Rdb. I use 6NF as needed, to
provide even more speed, functions and capability. of course
normalisation means rewriting the daatabase and thsu rewriting the app
and reports. Now that more people are producing garbage, and more
people are realising the value of normalisation, this type of
assignment is actually increasing.
Regards
Derek