RDBMS Data models

Hi,

What follows illustrates a common situation I face when building data models. I'm looking for comments on hidden shortcomings of my solution (so I don't have to redesign and rebuild all the tables later).

I'll express everything in pseudo-C to avoid any SQL details (and blemishes! :> ). It should also make the concepts easier for folks w/o SQL backgrounds to follow. (Keep in mind, though, the obvious and desirable pkey/fkey constraints that aren't explicitly shown here -- plus the additional criteria typically imposed on the relations.)

I like using address books for examples as they are usually clean, simple ways to present data models while eliminating the need to be familiar with any particular application domain.

Naive approach:

typedef struct { ... char street[MAX_STREET_LEN+1]; char city[MAX_CITY_LEN+1]; char state[MAX_STATE_LEN+1]; ... } Address;

Address addresses[] = { ... };

This has obvious problems for nontrivial databases (every record carries its own copy of the city and state text, so storage is wasted and inconsistent spellings go undetected).

[I'll omit the ellipsis henceforth to cut down on size]

A simple optimization:

typedef struct { char city[MAX_CITY_LEN+1]; } City;

typedef struct { char state[MAX_STATE_LEN+1]; } State;

typedef struct { char street[MAX_STREET_LEN+1]; City *city; State *state; } Address;

Enforcing referential integrity is *technically* possible -- but doesn't buy you the *real* protection you want, here (e.g., "Boston, OK"?). So, a better implementation:

typedef struct { char city[MAX_CITY_LEN+1]; } City;

typedef struct { char state[MAX_STATE_LEN+1]; } State;

typedef struct { City *city; State *state; } Locality;

typedef struct { char street[MAX_STREET_LEN+1]; Locality *locality; } Address;

[N.B. This is how I designed *my* address book (it's efficient for trivial City's -- as you would likely encounter in a *real* "address book"); inertia has kept me from actually fixing it! :< ]

But, I think the "right" solution is more along the lines of:

typedef struct { char city[MAX_CITY_LEN+1]; State *state; } City;

typedef struct { char state[MAX_STATE_LEN+1]; } State;

typedef struct { char street[MAX_STREET_LEN+1]; City *city; } Address;
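As a quick, made-up illustration of why this form disambiguates things (the particular variables below are just sample data, not part of the model):

/* Two City records may carry identical name text; the embedded
   State pointer is what tells them apart. */
State wisconsin   = { "WI" };
State connecticut = { "CT" };

City berlin_wi = { "Berlin", &wisconsin };
City berlin_ct = { "Berlin", &connecticut };

/* Each Address references exactly one City, so "Berlin" is never
   ambiguous at the Address level. */
Address home = { "123 Main St", &berlin_wi };
Address work = { "456 Elm St",  &berlin_ct };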

I think this is conceptually (i.e., in the sense of the relationships of the data) correct. But it can still be improved on for efficiency's sake, though that tends to push it back towards the previous ("locality") solution :-/ The models start to look like "spaghetti bowls" pretty quickly, though...

N.B. I'm not too worried about denormalization. Rather, I'm trying to get the models "intuitive" without being too cumbersome/inefficient.

[ perhaps I need to construct an even more illustrative example? Or, is the class of problem apparent here?]
Reply to
D Yuniskis

[Grrrr... I should have tagged each typedef with a unique name -- so we knew *which* "City" we were talking about :< ]

No, I mean "city" not "cityname". A "city" (City) would have attributes like: name, population, lat/lon, primary area code, etc. but only in a datamodel that folds that information *into* the "City" declaration; e.g., in my *flawed* address book implementation, "city" is little more than a "city name" -- "locality" pairs a city with a "state" (which, in my case, is always unique; i.e., you could use city,state as a primary key) and, as such, can carry the attributes I mentioned. At the same time, "city" could not -- as "Berlin" would be a valid "city" yet would not, by itself, uniquely identify a *particular* "Berlin" ("Springfield" being a more ambiguous example).

"State"s are always unique (maybe not elsewhere in the world but in my address book, I can assure you there are only a few score "states" and all of them are unique :> ). "City"s are unique when taken in the context of "State"s -- whether you do that by including a reference to a State in a City's definition or by pairing the City (name) with a State in a "Locality" definition.

Streets are fluff -- not really part of the example except to suggest that an address has a "street address", "city" and "state" (note that I haven't mentioned ZIP codes, countries, apartment numbers, etc. -- all sorts of things that muddy what is intended as a simple example).

I don't really care about addresses. It was just something that, I thought, would be easier for folks to relate to and extrapolate the ideas/examples I was illustrating.

Typos might not prevent mail from being delivered; but having a Chicago and a Shikago in the City relation surely indicates that one (or both) of these records is in error -- as well as casting suspicion on any that refer to either of them. E.g., are Mayor Daley, Chicago and Mayor Daley, Shikago two different people? If not, how do I get mail to *the* Mayor Daley -- which address (if either) is correct?

Anticipating future needs is where models always break: "Oh, I didn't expect you to want that sort of information" or "I didn't think you would be using it in that way".

Experiences with the evolution of my address book uncovered flaws in that initial data model (e.g., "locality" was a bad idea that just "fixed" a problem that the original relations had manifested). I'm now trying to bridge the differences between an OO (programming language) implementation and cataloging those objects *in* an RDBMS and using the RDBMS as the ORB, in effect. So, a "purist" approach to building the relations tends to lead to bad/clumsy performance (heavier queries than might otherwise be needed).

Thanks, I'll take a look!

Reply to
D Yuniskis

I think it is based on the class of application. The "database crowd" is rarely as concerned about resources as the folks here. Whether that's space or execution speed, "they" usually work in a much richer environment (and might, for example, not place as much emphasis on factoring common "strings" out of different relations, the cost of a JOIN, etc.).

No, I have a char array for a *member* of a City (I should have avoided using case as a distinguishing feature of names :< ). Note that I removed the ellipsis from all but the first example (indicating it should be assumed thereafter -- however, what the ellipsis *represented* in each case varies as the data structures change to fit the different variations of that example).

Nor is it a list of population, geographical area, etc. But, all those things pertain to a (particular) City.

The City is identified by a pointer to the City struct. A *human* might refer to a city by its name -- but the *code* knows which City it is concerned with.

In C++, "name" is a member of the City class. As it pertains to a City, name is a textual handle that human beings use when refering to a city. In most cases, a city's *name*, by itself, is insufficient to disambiguate among the set of all City's (cases like Chicago, New York City, etc. -- in the US -- are exceptions to this).

What I am doing is mapping a more traditional (e.g., C++) implementation onto an RDBMS. I.e., the members of a particular class become the attributes in a particular relation. This has implementation advantages in some applications above and beyond "bits residing in a struct".

However, in the RDBMS implementation, you might opt to factor things differently than you would in a (e.g.) C++ implementation.

Note that if I declared "name" (in C++) as const char, a compiler could conceivably use a single representation of the string "Springfield" (e.g.) shared among all City's that coincidentally happen to share that "name".

Doing so in an RDBMS suggests putting the name someplace and having other relations *reference* it -- alternatively, I can put copies of the name (text string) in each of those other relations. I.e., more of the optimization falls on my explicit choices in the data model.
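To make that concrete in the pseudo-C of the earlier examples (the City2 variant and its field names are invented here, purely to illustrate the factoring): one stored copy of the name text, referenced from several records -- the RDBMS analogue being a separate relation of names that other relations point at.

/* Hypothetical variant of City that references a single, shared copy
   of the name text instead of carrying its own char array. */
static const char springfield_name[] = "Springfield";

State illinois      = { "IL" };
State massachusetts = { "MA" };

typedef struct { const char *name; State *state; } City2;

City2 springfield_il = { springfield_name, &illinois };
City2 springfield_ma = { springfield_name, &massachusetts };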

The "locality" example in my flawed address book took it's initial form from this observation. E.g., here's a list of UNIQUE city names; here's a list of UNIQUE state names. When prompting the user for a city name, his response must agree with one of those listed in the City table; similarly for State. If one of these has been specified, the choices for the other are further constrained to coincide with an appropriate tuple in the "locality" table. (this makes sense if you think it through)

This was expedient, at the time. As my definitions of "what constitutes a city" and "what constitutes a state" grew, I was forced to treat "localities" as I should have treated cities -- clearly I couldn't add a "population" attribute to the city table as the CITY NAME that it had trivially presented was ambiguous in that regard.

(I could have solved my problem by renaming the city table "citynames" and the locality table could have become "cities"...)

Correct. Each of the examples I presented allows for "Berlin" to be qualified appropriately to represent Berlin, WI; Berlin, CT; etc. Where I put the *population* of "Berlin" varies depending on the example cited (e.g., in the City struct in some cases, in the Locality struct in others).

Ah, but that's the problem! E.g., I can use any of the models I presented to represent the data unambiguously "for an address book". But each has different access costs depending on the data you want and how you want to use it.

I can dereference a pointer in a few clock cycles in a C++ implementation. The equivalent JOIN is much more expensive :-/

OTOH, denormalizing everything is incredibly wasteful of resources.

Note that the interface I present to the user can remain consistent, regardless. E.g., I could write INSERT/UPDATE triggers to automatically factor and/or copy the information into the right tables.
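In the pseudo-C of the earlier examples, the same "factor on insert" idea might look roughly like this (the find_or_add_* helpers are assumed, not defined anywhere above):

/* Assumed helpers: look up an existing row, or create one, and
   return a pointer to it. */
State   *find_or_add_state(const char *statename);
City    *find_or_add_city(const char *cityname, State *state);
Address *add_address(const char *street, City *city);

/* The caller supplies plain text; the helper factors it into the
   right tables -- the job an INSERT trigger would do in the RDBMS. */
Address *address_insert(const char *street,
                        const char *cityname,
                        const char *statename)
{
    State *st = find_or_add_state(statename);
    City  *ci = find_or_add_city(cityname, st);
    return add_address(street, ci);
}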

But that depends on the integrity of the data! E.g., if the (bogus) "Shikago" entry was referenced, it might have completely incorrect data associated with "Shikago" which would make delivery impossible (unless it was being mailed from someplace where the Postal Workers recognized the *intended* recipient -- "S. Claus, N. Pole").

Yes. That's a big reason for the "unnecessary" complexity in the example. Much simpler just to go with a flat relation and *hope* all the blanks get filled in properly.

Exactly. OTOH, certain practices have unavoidable costs associated with them. E.g., forcing another level of indirection in a reference (whether you are talking about machine language, a HLL or SQL) adds certain costs while saving on other costs.

And, the form of the model can be friendly or intimidating to the user. E.g., a "flat" address book is easier for a user interested in auditing its contents to parse. Granted, you could present this illusion via a suitable set of JOINs and/or VIEWs... but if the underlying implementation relies on many relations, you then have an added problem (and opportunity!) for that user: e.g., if he discovers Mayor Daley's "Shikago" reference and opts to delete it, what do "we" do with the *other* "Shikago" references? (The simplest case for the developer is to leave all of this up to the user -- that makes for a pretty crappy device, though! This is yet another reason why the "database crowd" is a bad fit for this discussion, as "end users" rarely have to maintain "those" databases.)

I'll put in a request at my local library. Thanks!

Reply to
D Yuniskis

You're confusing Street with StreetName. A StreetName can and does occur more than once in a given city. Likewise City and CityName. Not only can a CityName occur more than once in a state (Queensland has two towns called Texas!), a single city can span state boundaries. There are a lot more details in addresses if you want to model the real world accurately, but the question you should ask yourself is how much of the real-world complexity matters to you. You're only making a model, right? So figure out what you need, which in this case I suspect is "to allow physical mail to be delivered" and store the *names* of streets, cities and states, and never mind about the actual streets, cities and states. Even typos in names will normally not prevent the mail from getting delivered.

If you're interested in learning more about how to model information, visit my site, dedicated to fact-oriented modeling (free tools available!), at

formatting link

Clifford Heath, Data Constellation,

formatting link
Agile Information Management and Design.

Reply to
Clifford Heath

You have a char array for a city. You can't write a city. It is not an array of characters. You can only write the name of a city. This reflects the widespread problem of confusion between the thing, and the facts we use to identify the thing.

When you start paying attention to the facts that identify a thing, you realise where your simplifying assumptions will start to break down.

The idea of an "attribute" is a *construct*, that we infer when we think there's a many->one or one->one relationship between two things (a city, and a name, for example). When you start talking about attributes, you're exposing your prejudices about the cardinality of that relationship.

In fact-oriented modeling, we model the relationship, and then model the cardinality as a separate constraint. There are no attributes in natural semantics.

Only once all such relationships and constraints are modeled can we map to an effective O-O or relational model. Addition of a single fact type or constraint can cause the whole mapping to be re-evaluated.

"Berlin" is a string of characters that represents a name. That name may apply to one or more cities. If you mean it to apply to only one city, you must make that *simplification* explicit. In this way you always know what you mean by your data, and you have made explicit decisions about what real- world situations you are choosing to ignore.

It's not that the model broke down; it's that you over-simplified.

Now you're making your assumptions explicit. That's the right path.

You have to make simplifying decisions that will work for you. That is all. I suspect that mail to either Mayor Daley would get delivered, if that's what you care about. If you care about something different, you have to decide how you'll draw the line between correct and incorrect data (based on some behaviour you want to prevent or enforce), and you have to check that.

It's only a *model*. You have to decide how realistic it must be to serve the purpose you have for it.

No - making the wrong simplifying assumptions is where the model... fails to match new needs. It doesn't break, it just doesn't match any more.

This is why software is hard; it doesn't just interface with the real world, but with people's changing and often unstated assumptions about the real world. It's not a hard science like electronic engineering is, and can never be.

Fact-oriented modeling, or as I call it, "semantic modeling" makes all these assumptions explicit, and I believe it offers some hope that software will improve.

A classic book on the difficulty of modeling the real world is "Data and Reality" by William Kent. Inexpensive, and an easy read.

Clifford Heath, Data Constellation,

formatting link
Agile Information Management and Design.

Reply to
Clifford Heath

D Yuniskis,

I revisited your original post, and have to apologise because I realise I responded without taking time to properly understand what you were asking, and picked nits instead. I'll use a more concise shorthand here, and XYZname when I mean the name, rather than the entity. First I'll restate your descriptions in my terms:

Your naive approach had:

Address(streetname, cityname, statename)

You then optimise to:

City(cityname) State(statename) Address(streetname, City, State)

and then:

City(cityname) State(statename) Locality(City, State) Address(streetname, Locality)

and finally, your "right" solution:

State(statename) City(cityname, State) Locality(City, State) Address(streetname, City)

I don't. But I think it may be adequate. You need to decide what is adequate for your purpose, as I was pointing out before. The real world is complicated!

But the point I really wanted to make here, is that what you decide as being *conceptually adequate* is not likely to be the same as the *most efficient*.

Conceptual models are made up of many small pieces, being individual fact types. This is what you describe as spaghetti - a good description of our actual thought structures. It must not be seen as a bad thing - it just isn't what you want to *store*. That is, it's really important to get the conceptual model right, then decide how to store it efficiently.

The process of mapping to a physical storage model requires making decisions about uniqueness. If you are happy that one City can only be in one State (even though an actual city might span a state border) then you don't need a many-to-many for the city-state relationship. Since each many-to-many will be a table, that's an important simplification.

The Rmap procedure described in Terry Halpin's seminal fact-oriented modeling book "Information Modeling and Relational Databases" shows how to go from a pure conceptual model to an efficient physical model. Basically the uniqueness constraints determine where "absorption" is possible, reducing the number of tables to the minimum capable of representing your concepts.

As an example from an application I built for the car insurance industry (the model is published on my site), a "Claim" follows from one Incident, and each Claim may only cover one Incident (a 1:1 relationship). The Incident may or may not have occurred while the car was being driven (another 1:1, optional this time, named "Incident involved Driving"). If this was the case, there was one Driver ("Driving was by Driver", another 1:1). The Driver may have been subject to a Breath Test (another optional 1:1), and if so, we may or may not know the result of that test (another optional 1:1).

So, conceptually we have this chain: Claim->Incident->Driving->Driver->BreathTest->BreathTestResult.

Physically, we have a Claim table with a "BreathTestResult" field. It would be silly to create joined tables for all those relationships, yet they *are* relevant and do have an impact on the application's user interface and behaviour.
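(In the pseudo-C of the earlier posts, the absorbed record might look roughly like this -- the field names are invented for illustration, and the optional 1:1 links simply become fields that may be empty:)

/* Conceptual chain Claim -> Incident -> Driving -> Driver ->
   BreathTest -> BreathTestResult, absorbed into one stored record.
   Empty/zero values stand in for optional links that didn't occur. */
typedef struct {
    long claim_id;
    long incident_id;
    char driver_name[64];         /* empty if the car wasn't being driven */
    int  breath_test_given;       /* 0 = no test (or not applicable)      */
    char breath_test_result[16];  /* empty if no test, or result unknown  */
} ClaimRow;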

The final comment is that for storage on disk, we use the maximum degree of absorption possible, consistent with the rules of normalisation - even breaking those rules on occasion. In RAM, where locality is not such an issue, we can use the more detailed structure without much cost, and that can make it easier to build the application's UI. The same model doesn't work for both disk and RAM, and that gives rise to the "object/relational mapping" problem. Google for "the Vietnam of Computer Science" for a very interesting article on that subject. (BTW, fact-oriented modeling has a definitive answer to the quandary he describes, using an approach he had not considered).

I hope that explains my position better, and you find it useful.

Clifford Heath, Data Constellation,

formatting link
Agile Information Management and Design.

Reply to
Clifford Heath

"Locality" is abandoned, here as City serves the same role.

What do you suggest as an alternative?

Correct. Especially as "efficient" varies based on the criteria being used as well as the implementation.

Exactly. Though the latter imposes constraints on the former.

But, even cities (using lowercase) that span borders are treated as two different cities. E.g., a resident of Kansas City, MO wouldn't consider himself to be a resident of Kansas City, KS. Likewise, "Oak Street" (fictitious name) in KC, MO need have no bearing on "Oak Street" in KC, KS.

I guess I don't see how a city straddling a state (or even *country*) boundary -- Nogales, AZ vs. Nogales, Sonora -- is any different from two cities located in the centers of their respective "states". A police officer in KC, MO has no jurisdiction in KC, KS. Sales taxes paid on purchases made in KC, MO are never seen in KC, KS. Mail addressed to KC, KS has a different ZIP code than mail addressed to KC, MO. Etc.

It was this aspect of addresses that made them appealing as an illustrative example -- folks can relate to the concept of an "address", as such. (??)

Understood.

Yet, had you needed to add "Breath Test Result" at a later date, you could have opted to do so as a linked table instead of adding BreathTestResult to the original table.

E.g., I don't have a "SSN" field for each "Person" in my (contact) database as:

- I only need to know the SSN's of a few people (so why burden the "main table" with a "NULL" entry for most SSNs?)

- the concept is sufficiently different from the basic role of the "contacts" that it can merit a "SSNs" table

- it gives me better control over access to that information
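A rough sketch of that factoring (the struct and field names here are invented; most Persons simply have no row in the SSN table, so the main table never carries a NULL SSN column):

#include <stddef.h>

typedef struct { long person_id; char name[64]; } Person;
typedef struct { long person_id; char ssn[12];  } PersonSSN;  /* keyed by Person id */

/* Return a person's SSN if one is recorded, else NULL. */
const char *ssn_for(const PersonSSN *ssns, size_t n, long person_id)
{
    for (size_t i = 0; i < n; i++) {
        if (ssns[i].person_id == person_id)
            return ssns[i].ssn;
    }
    return NULL;
}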

Exactly. And, since the DB is "backed" in "offline storage" (i.e., doesn't permanently reside "in RAM" -- even if that RAM is ROM/FLASH), it tends to be accessed through more expensive mechanisms.

The problem is bigger (? different??) than that. Those issues are largely the province of the DBA (et al.). He fits *some* solution to the dataset "in some way".

The problem I'm trying to address is how the "end user" (DB-illiterate) interacts with those models. The goal of an application developer is to make that application as friendly and usable as possible by its intended audience (while still ensuring accuracy, timeliness, etc.).

E.g., to the user, there is no difference between a

- Person
- PersonWithSSN
- PersonWithUnknownSSN
- PersonWhoWillNeverHaveAnSSN

in my (weak) SSN example. Indeed, most users will never realize there *are* these different types of "Persons".

A more cryptic example: I have an application that tracks "parcels". They have attributes like "identifier", "weight", "type", "location", etc. E.g.,

Parcel(10698553, 27.3, "Ostomy bags", "Medical Storage", ...)

Yet, some Parcels are ephemeral -- existing only for a handful of seconds (i.e., just long enough to be weighed). Some identifiers are "reusable". etc.

To the user, the different classes of "Parcels" are never consciously considered -- they are intuited. Exposing the differences to the user (in the UI) would only confuse him. He *knows* what he expects a Parcel to be IN A GIVEN CONTEXT, even if he is not consciously aware of the differences *between* those contexts.

Thanks, I'll have to stew on this some more. I'll dig through the references you've provided and see if there are any additional insights to be gained.

Reply to
D Yuniskis

Just finished. The typesetting was reminiscent of my days with "runoff" (a limited troff workalike) but bearable (jeez, did we really use to think that was "The Cat's Meow"?).

As you said, an easy read (no idea as to cost as I pulled it from my public library).

He makes interesting points regarding the various ambiguities in language/terminology, etc. And, how much the "models" are geared towards the processing of data rather than reflecting reality, etc.

Unfortunately, he (admittedly) has no real solutions to offer; just highlights the problem areas. Sounds like he is trying to write "The Seminal Book on Modelling" and this was just a rough draft/work-in-progress.

It *did*, however, drive me to the conclusion to ignore the representational/efficiency aspects and focus on the data that I am storing (without complicating the design choices with those other issues).

And, to explicitly limit the scope of the applications that I want my models to address -- lest they get too "warped" trying to accommodate things unforeseen (i.e., by being too general). [After all, every computing resource gets less expensive with time.]

Thanks!

--don

Reply to
D Yuniskis

Well, that's the point really, and the reason I recommended it. They're *models*, only a shadow of reality. The distinction is lost on many folk, but he makes the point amply.

I don't think so. Others have written the other books about *how* to model. Kent limits his goal to pointing out the limits of what we can expect from a model, any model.

If you liked that, and want to learn *how* to model, you should read "Semantic Modeling" (the book I haven't written yet :) ) or make do with Terry Halpin's amazing book "Information Modeling and Relational Databases". It actually doesn't have that much to do with RDBMSs; it's really about how to use fact-oriented modeling and why it's superior to more common approaches.

A truly worthwhile result then :).

Right. The Agile movement coined the acronym "YAGNI" - You Ain't Gonna Need It - to communicate exactly this message.

Clifford Heath.

Reply to
Clifford Heath

The impression I got from the first third (~60 pp) of the book was:

"This would be *so* much more poignant in a 'classroom'/interactive setting."

I suspect those 60 pp could have been condensed to a 5 minute "ramble". Then, those 5 minutes expanded to a two hour discussion of "particulars". And, been much more thoroughly "absorbed"/intuited.

Sorry, I wasn't clear enough in my choice of title -- or the rationale behind it. I was trying to pay homage to the fact that he was trying to identify all of the issues and subtleties involved, in much the same way that you would have to do if you were trying to "lay a foundation" -- i.e., decide what terms *will* be, what they *will* mean, etc.

I wasn't claiming that he was trying to find *the* way to model but, rather, that he was defining a framework/lexicon to eliminate the current tautology...

Ah, must be on the shelf alongside all the books *I* haven't written, either! :> (and, set theory tells us that all of those non-books are, in fact, the very SAME book! :> )

What's the cliche? Those who can, do; those who can't, teach? :>

I'm not really interested (at this stage of my life) in opening a new can (of worms). My interest (in these applications) lies in trying to apply more structured, "heavier weight" frameworks to ad hoc "table driven" algorithms (of which I am a big fan). So, instead of obscure, rigidly defined tables of cruft buried within the code, move those tables into a more formal environment (e.g., an RDBMS) -- though more expensive. Then, once there, see what benefits follow from that implementation.

(e.g., one obvious benefit is that the code becomes more easily extensible).

Unfortunately, this requires a considerable mind shift in how you approach the data and its representation -- what you store and how you store it.

For example, last night I had the (tentative) revelation to: "never use anything 'user-visible' as a key for a fundamental join" (i.e., fabricate something less visible -- a sequence field -- and do the join on that so the user-visible field can be freely altered).
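(In struct terms, a hypothetical sketch of that rule -- the *_id sequence fields are the fabricated, non-user-visible keys, and all joins are done on them:)

/* Joins are done on fabricated sequence numbers, never on the
   user-visible text, so names can be corrected or renamed freely
   without breaking any references. */
typedef struct { long state_id; char name[MAX_STATE_LEN+1]; } StateRec;
typedef struct { long city_id;  long state_id; char name[MAX_CITY_LEN+1]; } CityRec;
typedef struct { long addr_id;  long city_id;  char street[MAX_STREET_LEN+1]; } AddressRec;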

Well, that will depend on how much WORK that saves/costs me in the long run! :> OTOH, the peace of mind that follows from having a "simple rule" to follow borders on "priceless"!

I'll admit that this leaves me with a feeling of unease -- mainly because I am, on the one hand, thinking that this approach will be a win *for* "extensibility" yet, this decision speaks directly *against* that "ability".

It will (hopefully) be a learning experience.

Reply to
D Yuniskis
