RE: Useful Oracle books - C.J. Date theory vs. practicality

  • From: Dan Tow <dantow@xxxxxxxxxxxxxx>
  • To: oracle-l@xxxxxxxxxxxxx
  • Date: Thu, 27 May 2004 19:35:15 -0500

I have a few thoughts regarding Date's theory versus practicality:

All in all, Date's great, and he taught me a lot, but I completely
sympathize with anyone who thinks he at least occasionally neglected
practicality. My main case in point is his position on nulls in _An
Introduction to Database Systems_ (6th Ed.); I quote (p. 123):

"...incorporation of nulls or any similar feature into the model is premature at
this time."

Date made some good points about how confusing nulls can be, and all the
troubles that relate to them. It is important to understand that null might
mean "We don't know this property of this entity." OR it might mean "This
property does not *apply* to the entity." (Spouse_ID of a Persons table, for a
single person, being a good example.) I'd personally like to see extensions to
SQL to handle these subtleties. The subtleties are there, however, whether we
forbid nulls or not, and forbidding nulls will only add to the confusion! Can
any of you actually imagine building a practical RDBMS design without nulls? (I
admit I haven't read everything Date wrote, so it is possible he
moderated the position elsewhere, or at least addressed the practicalities he
ignored in the aforementioned text.)

Imagining a null-less database: Two options:

1) Everywhere you have a column that would currently be nullable, declare, for
that column, some *special* value to handle "we don't know" or "does not apply"
(or perhaps two values...), and then we'll all diligently remember to keep
track of what that value is for every such column in the DB, and to build
special-case logic throughout our application so that it doesn't do horrendous
things like sum those special-case values along with the *real* values, ...
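
To make that pain concrete, here's a sketch (table and column names purely
hypothetical) of what every aggregate looks like once "-1" is designated the
"does not apply" value for a Discount_Amount column:

-- Every aggregate must dodge the magic value by hand:
SELECT SUM(CASE WHEN Discount_Amount = -1 THEN 0
                ELSE Discount_Amount END) AS Total_Discount
  FROM Orders;

-- AVG is worse still: substituting 0 skews the average, so you end up
-- writing CASE ... THEN NULL and sneaking nulls in the back door anyway.

-- With an ordinary nullable column, the RDBMS already does the right thing:
SELECT SUM(Discount_Amount) AS Total_Discount
  FROM Orders;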

but wait, sometimes there *is* no value that could never be real for a given
column, so we'll have to add whole extra Boolean columns for each such
no-longer-nullable column to handle specifying whether we really ought to be
ignoring the value in that other column over there, and we build special logic
into the application to handle referring to these special Boolean
pseudo-null-indicating columns almost everywhere we refer to the
no-longer-allowed-null columns,...
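
A sketch of that scheme (all names hypothetical):

CREATE TABLE Persons
( Person_ID         NUMBER       NOT NULL PRIMARY KEY
, Person_Name       VARCHAR2(80) NOT NULL
, Spouse_ID         NUMBER       NOT NULL  -- must hold *some* dummy value,
                                           -- even for single persons
, Spouse_ID_Applies CHAR(1)      NOT NULL  -- 'Y'/'N': the pseudo-null indicator
);

-- Every reference to Spouse_ID must drag the indicator along:
SELECT p.Person_Name, s.Person_Name AS Spouse_Name
  FROM Persons p, Persons s
 WHERE p.Spouse_ID_Applies = 'Y'
   AND s.Person_ID = p.Spouse_ID;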

but wait, what about the column that starts out seemingly safe for designating,
say, "-1" as the special "does not apply" value, but ends up in some special
case needing "-1" for a *real* value - WHOOPS, I guess we better revert to
another value, or to the special Boolean pseudo-null-indicating columns, and
we'll have to fix all that special logic that formerly did special handling on
"-1",...

But wait, what about the column that starts out seeming not to *need* "does not
apply" or "we don't know", but later is discovered to ocasionally need this
feature - WHOOPS, I guess we'll have to go back and change all the application
logic that uses this column to take account of either some special value or
some new Boolean pseudo-null-indicating column. (Admittedly, you might *also*
need to change *some* of your application logic if you made a not-null column
nullable, but at least *some* of it would likely work without recoding in that
scenario.)...

Uugghh.

OK, quite likely Date would not have advocated inventing a whole lot of *new*
values that *mean* null, to replace the old value NULL, so let's look at option
2:

Everywhere we would currently have a nullable column, let's invent a new
sub-entity (zero-to-one)-to-one with the original entity to avoid nulls. So, if
we have a Persons table, which originally had a nullable Spouse_ID column,
we'll migrate the Spouse_ID column to a new
Persons_Who_We_Know_Are_Married_And_Whose_Spouse_IDs_We_Know table, which has
just the columns Person_ID and Spouse_ID, and then we never need a null
Spouse_ID in the table. (We will, however, need a great many outer joins, which
will assuredly create NULL values in query results, unless we break current
simple, single queries into vast numbers of separate queries that handle the
different cases separately!) Imagine a perfectly realistic situation with 30
different columns that can each *independently* hold a null value, today:
You'll need *30* new tables, one to hold each column's not-null subset, and
where you would have selected all 30 columns, you'll need a 30-way
outer join (on top of however many joins you'd need for all the *other* columns
of all the *other* tables!) I can say with some authority that the result would
be slower and harder to tune, not to say harder to write and maintain! ...
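
In concrete terms (names hypothetical), the null-free design and the outer
join it forces look something like this:

CREATE TABLE Persons
( Person_ID   NUMBER       NOT NULL PRIMARY KEY
, Person_Name VARCHAR2(80) NOT NULL
);

CREATE TABLE Person_Spouses  -- the (zero-to-one)-to-one sub-entity
( Person_ID NUMBER NOT NULL PRIMARY KEY REFERENCES Persons
, Spouse_ID NUMBER NOT NULL REFERENCES Persons
);

-- The outer join puts NULL right back into the result set for every
-- unmarried person:
SELECT p.Person_Name, ps.Spouse_ID
  FROM Persons p, Person_Spouses ps
 WHERE p.Person_ID = ps.Person_ID (+);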

But wait, a column that we think will never need a null value might later be
found to need nulls, so we'll just have to create these new subset tables every
time such a case arises, and we'll have to modify our code like crazy to handle
every such case, as in the first option...

Uugghh, again!

Sure, nulls are confusing and associated with all sorts of logic problems, but
those problems are largely *inherent* in the underlying logical problem;
pretending they are not helps no one, and taking away the tool of nulls would
hugely confuse the still-real-and-unavoidable underlying issues!

Did I beat that dead horse sufficiently? Michael mentioned Date's no-duplicates
rule. For this one, I have much more sympathy. I believe there is a special
case, though, where duplicates are reasonable and practical, which I'll
illustrate with a concrete example, first, then generalize:

Imagine that you want data from bar-code readers at a supermarket to go straight
into a database. The natural tuple (to use Date's terminology) would look
something like (Transaction_ID, Bar_Code, Item_Count) (assuming that we don't
care about details like what order the items were read, or the precise
date-time of each code-reading event, beyond the date-time of the transaction
as a whole, which we store separately in a Transactions table), but wait, that
tuple won't handle (without duplicates or special logic) a case where the
cashier fails to group all identical items together and read the code on one
item while entering the correct Item_Count separately. Again, there seem to be
two solutions within Date's rules:

#1) Make the application logic (or the RDBMS itself) check, for every row coming
in, whether the combination (Transaction_ID, Bar_Code) is new, and do an insert
if it is, else an update (incrementing the Item_Count) if it is not. (Yes, this
could be handled more gracefully with a new "UPSERT" or extended MERGE
operation that hid the complexity from the coder - nice feature - but the
underlying extra work remains.)
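
In Oracle, a sketch of #1 using MERGE might look something like this (table
and bind-variable names are hypothetical):

MERGE INTO Transaction_Items t
USING (SELECT :transaction_id AS Transaction_ID
            , :bar_code       AS Bar_Code
            , :item_count     AS Item_Count
         FROM dual) r
   ON (    t.Transaction_ID = r.Transaction_ID
       AND t.Bar_Code       = r.Bar_Code)
 WHEN MATCHED THEN
      UPDATE SET t.Item_Count = t.Item_Count + r.Item_Count
 WHEN NOT MATCHED THEN
      INSERT (Transaction_ID, Bar_Code, Item_Count)
      VALUES (r.Transaction_ID, r.Bar_Code, r.Item_Count);

-- Note that every code-read now pays for a probe on (Transaction_ID,
-- Bar_Code) just to decide between the update and the insert.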

OR

#2) Create a new, purely artificial key Read_Event_ID that is different for
every code-reading event.
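
Sketched (again with hypothetical names, including the sequence):

CREATE TABLE Read_Events
( Read_Event_ID  NUMBER       NOT NULL PRIMARY KEY  -- artificial, sequence-fed
, Transaction_ID NUMBER       NOT NULL
, Bar_Code       VARCHAR2(30) NOT NULL
, Item_Count     NUMBER       NOT NULL
);

INSERT INTO Read_Events
  (Read_Event_ID, Transaction_ID, Bar_Code, Item_Count)
VALUES
  (read_event_seq.NEXTVAL, :transaction_id, :bar_code, :item_count);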

The latter solution is the more common, and is usually a good solution for this
general kind of problem, but consider this: here, we truly do not care
about the individual "entities" (Read_Events, I suppose) that this table
represents - we care *only* about the data in *aggregate*, the sum over the
transaction, the sum over a given Bar_Code, for a given day at a given
store,... Given this, we have no need whatsoever to ever refer to a single row
of the table, uniquely. Another way to put this (to generalize the problem) is:

-We have a table of rows that only matter in aggregate.
-The individual rows of the table never require updates.
-There is no detail table (nor is there likely to be) that stores
still-more-detailed data that needs to reference this table as a master table
with a foreign key.
-It is simpler (or at least faster) to aggregate a few more rows, which were
inserted with simple INSERT operations, without checking for uniqueness
violations, than to perform UPDATE/INSERT ("UPSERT") operations that
pre-aggregate the rows at the lowest useful level of aggregation. The example
is clearly like this - the reduction in rowcount if we aggregated on
(Transaction_ID, Bar_Code) would be trivial compared to the cost of maintaining
uniqueness on (Transaction_ID, Bar_Code) using UPSERT operations, or adding a
wholly useless Read_Event_ID key that resulted in no rowcount consolidation at
all.
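
The duplicates-allowed design is then just the same table minus the
artificial key, fed by plain INSERTs and consumed only in aggregate; a
sketch:

CREATE TABLE Read_Events
( Transaction_ID NUMBER       NOT NULL
, Bar_Code       VARCHAR2(30) NOT NULL
, Item_Count     NUMBER       NOT NULL
);  -- no primary key: duplicate rows are welcome

INSERT INTO Read_Events VALUES (:transaction_id, :bar_code, :item_count);

-- The rows only ever matter in aggregate, for example:
SELECT Bar_Code, SUM(Item_Count) AS Items_Sold
  FROM Read_Events
 GROUP BY Bar_Code;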

This general case doesn't come up all that often, I think, but it does happen.
The main thing is that the entity being tracked must be so trivial that there
is virtually no chance that you'll *ever*, even later, want to point to the rows
individually, or especially to update the rows or to create a details table
under the current table.

Dan Tow
Author, SQL Tuning, O'Reilly
650-858-1557
www.singingsql.com


Quoting Michael Milligan <Michael.Milligan@xxxxxxxxxxx>:

> Anthony,
>
> Excellent remarks. However, you seem to be pitting theory against practicality
> in a way that does not allow them to coexist. As Date would say, "Theory is
> practical!" That is actually one of his chapter names. Your example of
> preventing duplicates is a great case in point. What is the meaning of a
> result set with duplicates? What can you do with it? If I see three
> identical rows, meaning there is nothing returned to differentiate them, how
> can I, or my client, use that information? I may as well just do a count.
>
> What you find anal I find brilliant. Every "i" is dotted and every "t" is
> crossed in Date's logic. It has symmetry and completeness I have rarely
> found elsewhere.
>
> Look at it this way: you may argue that there are times when we should break
> away from the theoretical to get the practical accomplished. But every time
> you "break away" from the theory, you making an exception you will have to
> keep track of and account for. Yes, you can duplicate data to prevent joins.
> But make sure you have a trigger to handle the extra update, etc., etc.
> Sometimes it's worth it, sometimes it gets out of hand.
>
> I enjoyed your remarks.
>
> Thanks,
>
> Mike
>
> -----Original Message-----
> From: Anthony Molinaro [mailto:amolinaro@xxxxxxxx]
> Sent: Thursday, May 27, 2004 3:54 PM
> To: oracle-l@xxxxxxxxxxxxx
> Subject: RE: Useful Oracle books
>
>
> Mike,
> Date's textbooks are very good, but FUNDAMENTALS OF DATABASE SYSTEMS by
> Navathe and Elmasri is much better (imho). I agree that his relational
> database writings are excellent, but slightly anal. Some of his
> ideas just would not be practical in the real world. He loves taking
> shots at sql, which is fine, but some of his suggestions are
> unreasonable.
>
> For example, relations, by definition, should not have duplicates.
> Relational algebra operations (on paper!) can perform selection and
> projection operations on these relations and do not return duplicate
> tuples.
> He would like sql to do the same...
> Think about the overhead the sql language would have to incur to
> always select distinct under the covers, on every table, on every join.
> I agree with his views on nulls, but I'm not sure his "special value" or
> "default value" scheme is much better.
>
> As for Celko, I feel he's more concerned with the practical nature of
> databases/sql, not theory; whereas Date wants to put theory into
> practice. In any case, without the use (or knowledge) of analytics, Celko
> has written some nifty sql.
>
> Ryan - break out an old math text and brush up on set theory; after that,
> you'll find relational algebra a breeze.
>
>  - ant
>

