Relational Databases Considered Harmful

My first real job, at the actual company called AIRS and the original owner of the domain airs.com, was working on what was then called a fourth generation language system. Nowadays we would simply call it a database. It was pretty powerful, and we had customers, but we only ran on a system called the Alpha Micro which ran a proprietary operating system called AMOS. So that’s why you’ve never heard of AIRS, or if not the only reason at least a sufficient one.

Anyhow, what I wanted to say here is, we had a database, but it was not a relational database. It was what then called a network database. Nowadays, we would simply say “huh?” But in fact I thought then and continue to think now that relational databases are basically a mistake.

The idea behind relational databases, as invented back in the ’70s by E.F. Codd at IBM, is that you can access them via a structured query language. We now know that language at SQL. A relational database is conceptually just a set of tables. SQL lets you express a database search by conceptually joining tables together in a few different ways (the most common way being matching some index field) and then describing some condition to be satisfied on the joined table. SQL is nice because you can compose complex queries in a fairly logical manner.

The problem is that data in the real world doesn’t typically come in table format. In order to put it into table format you have to insert all various key fields. That’s not so bad, actually, but when you have one-to-many or many-to-many relationships, SQL requires that you describe them as table joins. When the tables get large, actually joining a table becomes expensive–prohibitively expensive. So instead the database has to figure out an efficient way to do the query which doesn’t involve doing an actual join, but instead involves traversing the tables in some manner collecting the desired results. When the query is at all complicated, that becomes a hard problem. So for practical use relational database implementations require you to define various indexes which they can use, as well as more sophisticated solutions which I won’t go into.

The end result is that when you anticipate a user’s query and create the appropriate indexes, the query is fast. When the user does something you didn’t anticipate, the query is slow–for a large database, prohibitively slow. The user who doesn’t know which indexes you created has no idea which queries will be fast and which will be slow. So SQL gives you this logically powerful query language which is supposed to be simple to use, but in actual practice has traps for the unwary user.

In a network database, you have to express your one-to-many and many-to-many relationships explicitly. And when you do a query, you have to follow the links defined by the database–there are no joins. The result is that an efficient query is easy to write, and an inefficient query is hard to write. This makes it harder to do data mining, but it’s actually not that bad. You can still support a structured query language, even SQL itself, you just use a system which rewrites it into something which can be implemented before it gets to the database rather than after. (At AIRS we had a programming language for queries which we called Logic, but most people used a visual query creator, with saved standard queries, etc.).

I can’t argue that SQL is really nice to use. But I can and do argue that relational databases are not the right way to organize data for efficient access.


Posted

in

by

Tags:

Comments

3 responses to “Relational Databases Considered Harmful”

  1. rskrishnan Avatar
    rskrishnan

    Could’nt agree more – i.e. relational databases suck – with no doubt whatsoever … at least in my mind.

    It starts out very nicely – with a “logical” organization to the data schema, but pretty soon you land up with a rats nest of indices, and “flattened tables” which all need to be kept in sync. So the problem goes from “hey here’s a black box to put my data into” … to … something like … “if you’re loading a LOT of data, then turn off index consistency checks, since they are slow – then load data, then update your indices”, or … “if your report is large go to flattened_table_xyz – not to the clean structure, and by the way we update the flattened table only once a week” !!

    More interestingly – I don’t think the rdbms ever delivered on it’s promise of “your data all the time, anytime”.

    Every reasonably large company that I know of spends many thousands of $$ on buying servers, disks, networks, and software — all to run a measly GL/AR/AP accounting system … and then guard access to the system like it was their first born!

  2. Ian Lance Taylor Avatar

    Exactly. Thanks for the note.

  3. eternaleye Avatar
    eternaleye

    Came across this while reading through your archives after finding your site via your excellent linker internals posts. I think that in this view of “how databases should be” you would rather like the SPARQL/RDF system that is becoming popular for the “semantic {web,desktop,etc.}”. Check out Nepomuk in KDE 4 and the Virtuoso RDF database engine/server that it’s built on these days.

Leave a Reply