Intelligent Enterprise Magazine
View Content | Advertise | Subscribe | Resources | Vendor Resources | Interact | Contact Us | About Us
Intelligent Performance
IntelligentCRM
Intelligent Applications
Intelligent Integration
Intelligent Portals
Analytic Applications
Business Intelligence
Database
Data Integration
Data Warehousing
Enterprise Development
Storage
Supply Chain
IE Imperatives
Archives
Realware Awards
Career Center
Research Library
Event Calendar
Editorial Calendar
Advanced Search
Translate
May 11, 1999, Volume 2 - Number 7

An analysis of Codd's contribution to the Great Debate

Thirty Years of Relational: Relational Really Is Different

By C.J.Date

The Great Debate was a debate between proponents of the relational and network approaches. It was held at the ACM SIGMOD Workshop on Data Description, Access, and Control in 1974; the principal speakers were Edgar F. Codd for the relational approach (surprise!) and Charles W. Bachman for the network, or CODASYL, approach. In preparing for the debate, Codd wrote a paper entitled "Interactive Support for Nonprogrammers: The Relational and Network Approaches"1, and it's that paper that I want to discuss here.

Note: The paper is shown in the debate proceedings as being a joint production by Codd and myself; in fact, it was wholly written by Codd. (The companion paper,2 on application programming considerations, which is also attributed to both of us, was written by myself.)

OVERVIEW OF THE PAPER

Of course, the battle between relations and networks is ancient history now. (The good guys won.) This fact notwithstanding, Codd's paper -- even though it was written over 25 years ago -- is still worth reading today as a beautiful example of clear thinking. Indeed, it's quite remarkable to see how, on a topic where muddled thinking was the norm at the time, Codd was able to do such a good job of cutting to the chase and focusing on the real underlying issues. Let me elaborate:

  • First of all, Codd realized that to compare the very concrete CODASYL specifications and the much more abstract relational model would be an apples-and-oranges comparison and would involve numerous distracting irrelevancies.

  • Hence, it would be necessary first to define an abstract "network model." The comparison could then be done on a level playing field, as it were, in a fair and sensible manner.

  • Codd therefore proceeded to define an abstraction of the CODASYL specifications that might reasonably be regarded as such a model. (And then, of course, he went on to compare that abstraction with the relational model.)

Thus, Codd has some claim to being the first person to attempt to give an abstract definition, not just of the relational model (of course), but also of a network model! Certainly none of the CODASYL documents ever attempted any such thing. (In fact, the paper might still be timely after all, even though the original battle is over. Certainly we are -- regrettably -- seeing some of the same tired old issues surfacing again in connection with object DBMSs. As several writers have observed, object DBMSs do tend to look like "CODASYL warmed over" in certain respects. A case of those who don't know history being doomed to repeat it?)

Anyway, the most significant contribution in "Interactive Support for Nonprogrammers: The Relational and Network Approaches" is probably his introduction of the concept of essentiality, a concept that's critical to a proper understanding of data models in general and relations vs. networks in particular. It's the concept of essentiality that allows us to pinpoint the crucial difference between relational databases and others, as we'll see in the next section. (Indeed, it's really the main reason I want to talk about the paper in this series at all.)

As well as introducing the essentiality notion, Codd’s paper also raises a number of pertinent questions regarding the suitability of network structures as a component of what it calls "the principal schema" -- questions that, so far as I know, have never been answered in the open literature. To quote: "In the past, many designers of software systems... have confused two quite distinct notions: enrichment of features on the one hand, and generality of application on the other." How true! "A crucial issue in database management systems is that of the richness (that is, variety) of data structures... that should be supported in the principal schema. In the event that enrichment of these data structures ... beyond the minimum is proposed, we ask the following questions..." (I don't want to discuss the questions themselves in detail here, however, since I've already considered them at some length in an earlier paper3).

Incidentally, the foregoing quote reminds me of another one, which appeared in Codd's tutorial paper on normalization4: "Several existing systems permit a variety of physical representations for a given logical structure... The complexity of the physical representations which these systems support is, perhaps, understandable, because these representations are selected in order to obtain high performance... What is less understandable is the trend toward more and more complexity in the data structures with which... users directly interact. Surely, in the choice of logical data structures that a system is to support, there is one consideration that is of absolutely paramount importance -- and that is the convenience of the majority of users." Again, how true!

Back to "Interactive Support." The paper also includes an appendix comparing CODASYL/COBOL and relational/Alpha versions of a simple machine shop scheduling application. That comparison provides strong evidence in support of relational claims of simplicity (simplicity for the user, that is). For full details of the comparison, including the actual code used in the two solutions, I refer you to the original paper; here I content myself with simply repeating a few remarks from an earlier paper of mine5 that discusses the same example. Here are a few comparative statistics:

  CODASYL relational
GO TO 15 0
PERFORM UNTIL 1 0
currency indicators 10 0
IF 12 0
FIND 9 0
GET 4 1
STORE / PUT 2 1
MODIFY 1 0
MOVE CURRENCY 4 0
other MOVEs 9 1
SUPPRESS CURRENCY 4 0
total statements > 60   3

The relative simplicity of the relational solution is very striking. Note: In fact, the relational solution could have been reduced to just a single statement, a PUT; the GET and MOVE aren't strictly necessary. What's more (although Codd doesn't mention the fact), the CODASYL "solution" -- which was taken from another source, by the way, not created by Codd himself -- included at least two bugs!

The example highlights another point, too. To quote Codd: "The reader is cautioned to avoid comparing [different] approaches solely on the basis of differences in [data structure]. An adequate appreciation of the differences ... must entail consideration of the... operators also." And later he adds: "Discussions on the relational approach often become riveted on the [data structure] component to the neglect of [other components]. To do justice to this approach, all... components must be considered as a package."

ESSENTIALITY

Although this concept is so important, it's my experience that few database professionals are really familiar with it. This installment therefore has more of a tutorial flavor than others. Note: The material that follows is based in part on another earlier paper of mine.6

First of all, "everyone knows" that the only data structure available in the relational model is the relation itself. To understand the significance of this point, however, it's necessary to know something about at least one other data structure: for example, the link structure found in hierarchic and network systems. So let's look at an example. Fig. 1 shows (a) the relational design for a simple departments-and-employees database, together with (b) a network equivalent of that design. (Actually, the example is so simple that the network design degenerates to a mere hierarchy, but the point's not important for present purposes. Hierarchies and networks are much more like each other than either one is like relations.)



Fig. 1: Departments and employees: relational and network designs


The network design involves a "CODASYL set" (not to be confused with a mathematical set!). Each occurrence of that CODASYL set consists of a DEPT row, a set of corresponding EMP rows, and an occurrence of a link ("DEPTEMP") that connects those DEPT and EMP rows. (I use the term "row" here rather than the more usual "record" in order to avoid distractions caused by mere terminological differences between the two approaches.) Within a given CODASYL set occurrence, the corresponding link occurrence can be thought of as a chain of pointers -- a pointer from the DEPT row to the first EMP row for that DEPT, a pointer from that EMP row to the next for the same DEPT, and so on, and finally a pointer from the last EMP row back to the original DEPT. Note: Those "pointers" needn't be physically represented in storage by actual pointers, but the user can always think of them as actual pointers. (That's the network model!)

Observe now that the EMP rows in the network design don't include a DEPT# column. Thus, to find what department a given employee is in, we have to traverse the DEPTEMP link occurrence from the applicable EMP to the corresponding DEPT; likewise, to find the employees in a given department, we have to traverse the DEPTEMP link occurrence from the applicable DEPT to the corresponding EMPs. In other words, the information that was represented by a foreign key in the relational design is represented by a link in the network design; links are the network analog of foreign keys (speaking very loosely).

Now let's consider a couple of queries against this database. For each query, I'll show a relational (SQL) formulation and a network equivalent, using a hypothetically extended version of SQL that caters for links.

Q1: Get employee numbers and employee names for employees with salary greater than 20K.

Relational: Network:
 
SELECT EMP#, ENAME SELECT EMP#, ENAME
FROM EMP FROM EMP
WHERE SALARY > 20K ; WHERE SALARY > 20K ;


Q2: Get employee numbers and employee names for employees with salary greater than 20K in department D3.

 
Relational: Network:
SELECT EMP#, ENAME SELECT EMP#, ENAME
FROM EMP FROM EMP
WHERE SALARY > 20K WHERE SALARY > 20K
AND DEPT# = 'D3' ; AND ( SELECT DEPT#
  FROM DEPT      
  OVER EMP ) = 'D3' ;

For query Q1 the two formulations are obviously identical; for query Q2, however, they're not. The relational formulation for Q2 still has the same basic form as for Q1 (SELECT - FROM - WHERE, with a simple restriction condition in the WHERE clause); the network formulation, by contrast, has to use a new language construct, the OVER clause (which is my hypothetical SQL representation of a link-traversing operation). The WHERE condition in that formulation is certainly not a simple restriction condition.

Q2 thus illustrates the important point that networks fundamentally require certain additional data access operators. Note too that those operators are additional; the relational operators are still needed as well, as query Q1 shows. Note moreover that this point applies not only to all of the data manipulation operators (update operators included), but also to the definitional operators, the security operators, the integrity operators, and so on. The links of the network data structure thus certainly add complexity. However, they don't add any power -- there's nothing that can be represented by a network that can't be represented by relations, and there's no query that can be answered from a network and not from relations.

Now, it's sometimes suggested that the complexity can be reduced by reinstating the DEPT# component (the foreign key) in EMP, as shown in Figure 2. This redesign allows query Q2 (network version) to be formulated without using the OVER construct; in fact, the formulation becomes identical to its relational counterpart. The reason is, of course, that DEPT and EMP in that revised design are identical to their relational analogs; the database is now identical to its relational equivalent, except for the DEPTEMP link. However, that link is wholly redundant -- it doesn't represent any information that isn't also represented by the foreign key, and we can therefore ignore it without losing any logical functionality.

Fig. 2: Departments and employees: network design with foreign key EMP.DEPT# reinstated.

So now (at last!) I can explain the notion of essentiality. A data construct is essential if its loss would cause a loss of information -- by which I mean, very precisely, that some relation would no longer be derivable. For example, in the relational version of departments-and-employees, all data constructs (all rows and all columns) are essential in this sense. Likewise, in the original network version (Figure 1), all data constructs (all rows, all columns, and all links) are again essential. But in the revised network of Figure 2, the rows and columns are essential but the link is inessential. There's no information that can be derived from that network that can't be derived from the rows and columns alone; there's no logical need to use the link at all.

Note: Some people might argue that the opposite is the case -- the link is essential and the foreign key is inessential. But that argument misses the point, which is that, since some rows and columns must be essential, and nothing else need be, then why have anything else?

Now (at last) I can pin down the crucial difference between a relational database and any other kind, say a network database. In a relational database, the only essential data construct is the relation itself. In other databases, there must be at least one additional essential data construct (such as an essential link). If there isn't, then the database is really a relational database that happens to have certain access paths exposed. (And there's no requirement that the user use those access paths, and the question arises as to why they're exposed anyway when others aren't.) And it's those additional essential data constructs that lead to much (not all) of the complexity of nonrelational databases.

Note: In the case of CODASYL specifically, there are at least four additional constructs that can also be used to carry information essentially. To get into details of those other constructs here would take us much too far afield, however.

A NOTE ON ORDERING

The concept of essentiality allows me to explain why it's significant that relations have no ordering to their rows. In an ordered file, the ordering itself might be essential in the sense explained above. For example, a file of temperature readings might be kept in the order in which those readings were taken; the ordering itself might thus carry information, which would be lost if the records of the file were rearranged -- just as information can be lost if someone drops a box of cards, if those cards don't include a sequence field. (If you're too young to know what boxes of cards and sequence fields are all about, you can ignore this example!) And essential ordering, like an essential link, requires additional operators to deal with it -- "find the nth record," "insert a record between records n and n+1," and so on. For this reason it's not permitted in the relational model.

In contrast to the foregoing, it's sometimes suggested that inessential ordering might be acceptable. A file is inessentially ordered if it's ordered on the basis of the value(s) of some field(s) -- for example, the employee file might be ordered by employee number, but no information would be lost if the records were shuffled around. Some "relational" systems do in fact support ordering in this sense. Note, however, that relations per se are unordered by definition; it would really be better to regard an "ordered relation" as a totally different kind of thing -- perhaps as, precisely, a sequential file. In this regard, the SQL ORDER BY operation might best be thought of as converting a relation into such a file, rather than as "ordering a relation."

In any case, even inessential data constructs can cause problems, because they do still carry information, even though they're inessential. For example, they might represent a security exposure. Suppose a file of employee records is ordered (inessentially) by increasing salary. Then the fact that your manager's record appears after your own certainly tells you something, even if you're not authorized to see actual salary values.

CONCLUDING REMARKS

In conclusion, let me get back to Codd's "great debate" paper. Codd observes, wisely, that it's "not enough to take note of where [the] approaches stand today -- we must also be clear about where they're going." He contends that "casual" users of database systems will become increasingly important (in fact, they'll soon be the largest user constituency of all), and so we must pay attention to what the different approaches are or might be doing to support such users. Some quotes:

  • "Such users clearly need a simple logical notion of the data organization in order to frame their queries or modifications in a simple way."

  • "The absence in the network approach of operators at the algebraic or calculus level is noteworthy, since such operators play a vital role in supporting [casual user] interaction... [Indeed, the] absence from the [CODASYL proposals] of any specific objectives for the support of nonprogrammer interaction is especially noteworthy."

  • "General-purpose support for such users entails provision of an augmented, relationally complete, retrieval capability without branching, explicit iteration, or cursors. It's clear how this capability can be realized with the relational approach -- whether with a formal or informal language interface. It's not at all clear how the network approach can reach this goal, so long as the principal schema includes [CODASYL sets] bearing information essentially." Note: The "augmentation" Codd had in mind here had to do with the availability of library functions for counting, summing, and so on.

These observations need only minor changes in wording for them to be still highly pertinent today.

REFERENCES

1. Codd, E. F. Codd and C. J. Date. "Interactive Support for Nonprogrammers: The Relational and Network Approaches." IBM Research Report RJ1400 (June 6th, 1974). Republished in Randall J. Rustin (ed.), Proc. ACM SIGMOD Workshop on Data Description, Access, and Control, Vol. II, Ann Arbor, Michigan (May 1974). Also in C. J. Date, Relational Database: Selected Writings. Reading, Mass.: Addison-Wesley (1986).

2. Date, C. J. and E. F. Codd. "The Relational and Network Approaches: Comparison of the Application Programming Interfaces." IBM Research Report RJ1401 (June 6th, 1974). Republished in Randall J. Rustin (ed.), Proc. ACM SIGMOD Workshop on Data Description, Access, and Control, Vol. II, Ann Arbor, Michigan (May 1974). Also in C. J. Date, Relational Database: Selected Writings. Reading, Mass.: Addison-Wesley (1986).

3. Date, C. J. "Support for the Conceptual Schema: The Relational and Network Approaches." In C. J. Date, Relational Database Writings 1985-1989. Reading, Mass.: Addison-Wesley (1990).

4. Codd, E. F. "Normalized Data Base Structure: A Brief Tutorial." Proc. 1971 ACM SIGFIDET Workshop on Data Description, Access, and Control, San Diego, Calif. (November 11th-12th, 1971).

5. Date, C. J. "Why Relational?" In C. J. Date, Relational Database Writings 1985-1989. Reading, Mass.: Addison-Wesley (1990).

6. Date, C. J. "Essentiality." In C. J. Date, Relational Database Writings 1991-1994. Reading, Mass.: Addison-Wesley (1995).



C. J. Date is an independent author, lecturer, researcher, and consultant, specializing in relational database systems. His most recent books are Foundation for Object/Relational Databases: The Third Manifesto, coauthored with Hugh Darwen, and Relational Database Writings 1994-1997, both published by Addison-Wesley in 1998. Correspondence may be sent to him in care of Intelligent Enterprise, 411 Borel Ave. #100, San Mateo, CA 94402.



<a href="http://web.archive.org/web/20050307024052/http://as.cmpnet.com/event.ng/Type=click&amp;FlightID=28155&amp;AdID=44294&amp;TargetID=1424&amp;Segments=1411,1867,3448&amp;Targets=1424,2878&amp;Values=34,46,51,63,77,80,92,101,140,205,442,646,974,1045,1047,1054,1184,1311,1388,1431,1785,1901,1925,1945,1970,2217,2299,2310,2352,2678,2767&amp;RawValues=&amp;Redirect=http://www.sunopsis.com/us/intent/diicsq0904.asp" target="_top"><img src="http://web.archive.org/web/20050307024052im_/http://i.cmpnet.com/ads/graphics/as5/aj/IntEnt_square_125_125.gif" width="125" height="125" border="0"></a>


Editors' Choice Awards Copyright © 2004 CMP Media LLC
Privacy Statement · Terms Of Service
a United Business Media Company
ALL RIGHTS RESERVED
No reproduction without permission

Readers' Choice Awards 2004
Intelligent Performance Conference & Expo 2004
Home | Privacy Policy | Feedback | Subscribe | CMP Media LLC