FCO-IM Lexicalizing question.

Last post Wed, Apr 14 2010 13:55 by michell. 8 replies.

Page 1 of 1 (9 items)
	Sort Posts: Previous Next

Tue, Apr 6 2010 10:55

michell
Joined on Tue, Apr 6 2010
the Netherlands
Posts 4

FCO-IM Lexicalizing question.

Reply Contact

While reading the example model of converting an IGD to a relational database model (book "Fully Communication Oriented Information Modeling (FCO-IM)" found in the Library, section 4.2.1 "Lexicalizing in the Student-Project Case Study") a question arises.

In the lexicalizing step the normalization "student" is eliminated by splitting a role of a factype and injecting the student identifier "firstname" and "surname" for that role. In basis I understand this step, but when looking at the relational scheme (figure 4.12) two tables are created: student and preferences which include both the "firstname" and "surname" of students.

When I look at the database model, I would suggest to add a student ID to the table students. Which can be used in the table preferences, so that a reference can be made to the table students using one column "studentID". Why isn't that done/proposed?

Thanks for any suggestions,

Michel.

P.s. Thanks for the information on this forum and in the Library. It was already very helpful!

Wed, Apr 7 2010 3:29 In reply to

Ken Evans
Joined on Sun, Nov 18 2007
Stickford, UK
Posts 805

Re: FCO-IM Lexicalizing question.

Reply Contact

Hi Michel,

I'm pleased to hear that you found the website to be helpful.

Regarding your question on FCO-IM, I have asked the FCO-IM expert Jan Pieter Zwart to reply to your post.

Thanks for your contribution.

Ken

Wed, Apr 7 2010 5:12 In reply to

michell
Joined on Tue, Apr 6 2010
the Netherlands
Posts 4

Re: FCO-IM Lexicalizing question.

Reply Contact

Hi Ken,

Thanks for your reply. I just found some information in chapter 2.9 (p56) of the book. It states: "Sometimes, there is indeed a proper identifier, but the analyst and/or the domain experts consider it to be very impractical, for instance because it consists of many components. They decide to introduce a new identifier for reasons of efficiency."

That is exactly what I want to do, but it's confusing for me that I need to change the IGD for that. I expected such kind of optimalizations when converting the IGD to a relational database design (Chapter 4).

So, I found a way to avoid storing the same information twice in the database by adapting the IGD and introducing an identifier (just as in the student example from the book) and I'm pleased with the result. I'm still interested in the response of Jan Piter Zwart.

Michel.

Wed, Apr 7 2010 6:10 In reply to

Ken Evans
Joined on Sun, Nov 18 2007
Stickford, UK
Posts 805

Re: FCO-IM Lexicalizing question.

Reply Contact

Hi Michel,

I'm not an expert in FCO-IM but I found the quote to be quite confusing:

michell:
"Sometimes, there is indeed a proper identifier, but the analyst and/or the domain experts consider it to be very impractical, for instance because it consists of many components. They decide to introduce a new identifier for reasons of efficiency."

It seems to me that the assertion "Sometimes there is a need for a proper identifier.." does not fit the rules of a relational model.
My specific points are:

1. In the relational model, there is always an "identifier" and it is called the primary key.
2. The primary key may be made up of several columns. (so I don't understand what is "impractical" about this)
3. Efficiency is an implementation issue. It is not a logical modeling issue.

But maybe I have misunderstood FCO-IM and the FCO-IM experts may have a good answer to these points.

Ken

Thu, Apr 8 2010 10:33 In reply to

Jan Pieter Zwart
Joined on Tue, Dec 4 2007
Arnhem, the Netherlands
Posts 1

Re: FCO-IM Lexicalizing question.

Reply Contact

Hi Mitchell,

Ken Evans has forwarded your discussion about the FCO-IM Lexicalizing question to me. I’m happy to reply.
Please see my points below (in more or less random order, responding to your texts).
For the sake of simplicity let’s assume there is only one identifier (one candidate key, which is thus the primary key).

1. I think the issue you raise is best addressed while distinguishing between the conceptual level (that of an FCO-IM, or ORM, information model), the logical level (that of a relational database schema), and the physical level (that of an implementation in a concrete DBMS.

2. On the conceptual level, in FCO-IM and also in ORM, every object type must have an identifier (unique name). This identifier can consist of several components. On the logical level, this translates into a primary key with more than one column. There is absolutely no problem with that on either of these levels, and I agree completely with Ken’s comments (04-07-2010).
3. The physical level is where performance issues and other technical matters ‘under the hood’ are addressed. These performance issues are presently only of concern for very large database systems (in terms of records per table; datawarehousing with large ‘fact tables’ for example). In such situations, it may be advisable to replace a compound key with a single dummy key. That has drawbacks too, however: a) the alternate key is still there, and must be checked and maintained by the system and b) it changes the facts for the users (suddenly they have to use a number they never needed before). Messing with the way users communicate about their UoD should not be done lightly.
4. Perhaps you do not want to change the user communication, but just intend to implement the dummy key and hide it from all users by replacing it with the ‘real’ one in all interfaces and reports. That would require a lot of programming work, but it can be done. Then you would not need to change the conceptual level or the logical level, but only the physical level (and have a separate physical database schema, augmented with the full set of routines for maintaining the alternate keys and hiding the dummy key). Note that I don’t want to change the logical level for this purpose (but others may disagree).
5. In practice, the benefits of introducing dummy keys are gone for all but the largest systems: even in the big administrative system of HAN University, performance may be worse after introducing them (yes, we measured that). I discourage students strongly from succumbing to this ‘number disease’ rashly.
6. Fortunately, most RDBMSs can handle compound keys, and FK references to compound PKs, automatically nowadays.

7. Is your feeling of redundancy “storing the same information twice” caused by the fact, that on the logical level a child table has foreign key columns which are copies of the primary key columns of the parent table? If so, I agree, and I think it is a shortcoming of the relational model. (An ER-model can be seen as a relational database schema without this redundant copying of attributes. ) However, replacing the compound key with a single-column key doesn’t help here: both tables will contain the same column still.
8. But this is a form of redundancy on the metadata level, not on the data level. No complete fact is stored more than once: both FCO-IM and ORM deal with complete elementary facts, whereas the relational model deals with attributes, which are just parts of complete facts. No complete fact is stored more than once.
9. A direct answer to your question of 04-06-2010 “Why isn’t that [introducing StudentID] done/proposed?” is then: It isn’t necessary and it would be harmful to the communication of the users.

10. Now for the passage in chapter 2.9 (p.56). The book was written in 1995, when the performance issue was more important than it is now. So we decided to include a section to show the conceptual changes involved, if we would introduce a dummy key and actually force all the users to change their communication from ‘student Peter Johnson’ to ‘student S1’. Since ‘student S1’ has now become the standard identifier, the IGD must change accordingly. (Of course, if you aim to do what I described in point 4 above, no change in IGD is necessary.)

11. Finally for Ken: I’m sorry if our English is unclear, so I’ll rephrase the start of the first sentence quoted from our book: “It might be that, although there actually is a proper identifier, the analyst and/or the domain experts consider this key to be very impractical …”. I agree completely with your remarks.

Kind regards, Jan Pieter Zwart

Fri, Apr 9 2010 5:08 In reply to

michell
Joined on Tue, Apr 6 2010
the Netherlands
Posts 4

Re: FCO-IM Lexicalizing question.

Reply Contact

Hi Jan Pieter Zwart,

First of all, thank you for the clear comments you gave on my question. Your point is that it is not a good idea to introduce identifiers on the conceptual level and I agree. At this point I did change the IGD and introduced an identifier, but I will reconsider this decision.

In reaction to point 3: The database I'm designing is very small so there will never be any performance issues.

Point 4: Yes that is exactly what I want. The "extra" identifier will be hidden from the users.

Point 7: I will try to explain why I used the words "storing the same information twice". Lets assume the student example again. Suppose the database is filled with a student name, but a spelling error is introduced in the name. The only solution to this problem is by editing two tables. In the example of the extra identifier only one table has to be edited. Thats why I don't like the fact that a student name is stored "twice". And it becomes even worse: student names are almost never unique, so the primary key will likely be a name, postal code, number, and maybe more. Isn't it very confusing to repeat all this information in multiple tables?

Another note: when I tried to find a anwer to my question in your book I carefully read chapter 4 because I expected a note on this issue here. That chapter describes the logical and physical level (generating DDL). Therefore I was surprised to find information about introducing unique identifiers in chapter 2.9, which _is_ about changing the IGD, and that is what I did.

Methods like ORM and FCO-IM let me design a database, in which I can explain all decisions easily (based on text, to structured sentences, and clear diagrams which I use to derive the database design). I was searching for the last part of the puzzle where I tried to justify the identifier in the database design phase. But maybe there is no good justification for that identifier in small databases.

Thanks,

Michel.

Fri, Apr 9 2010 6:13 In reply to

Ken Evans
Joined on Sun, Nov 18 2007
Stickford, UK
Posts 805

Re: FCO-IM Lexicalizing question.

Reply Contact

Hi Michel,
I'm not an expert on FCO-IM so I'll leave it to Jan Pieter to comment on the FCO-IM aspects of your post.

What I'd like to mention is a matter of modeling principle.
I'm wondering if you are talking about two different kinds of "identifier" here. As I see it, there is definitely a need for a "conceptual identifier" which is quite separate from a physical identifier such as the "OID" in the OO paradigm.

The purpose of a conceptual identifier is to provide a way to uniquely identify a "thing" in a universe of discourse.
So in your example, it is common for a University to allocate a "Student_ID" to each student precisely because of the probablity that the combination of "Given Name" + "Family Name" will not be unique. Furthermore, such a Student_ID would certainly be known to each student and in many cases would be printed on a document such as the "Student Identity Card". So it seems to me that artefacts such as "Student_ID" are a perfectly valid way to identify things at the conceptual level. And this would most likely produce a table with Student_ID as the primary key and data such as Given Name, Family Name and Date of Birth as columns in the table.

On the other hand, the purpose of a physical identifier such as an OO OID is to provide a unique identifier for an "object" when the "object" is stored as a block of code in computer memory.

And, as Jan Pieter has said, the "old" use of identifiers (aka surrogate keys) in database tables was to provide a way to improve database performance. Thankfully, Moore's Law and improvements in RDBMS algorithms have helped us to escape from this need in all but the largest of databases.

Hope this helps.

Ken

Sun, Apr 11 2010 21:54 In reply to

Re: FCO-IM Lexicalizing question.

Reply Contact

I believe Mitchell is right that fact-oriented relational mapping tools ought to provide optional hidden surrogate keys. I've made that point previously, and plan to implement it in ActiveFacts (the project which is delivering the Constellation Query Language).

Jan Pieter Zwart:
These performance issues are presently only of concern for very large database systems (in terms of records per table; datawarehousing with large ‘fact tables’ for example). In such situations, it may be advisable to replace a compound key with a single dummy key

I've not found this to be true in my practise. Even medium sized databases see substantial benefits from the use of surrogate keys. I applied the principle of "no surrogates, only natural identifiers" in my early designs, and when those databases faltered and even failed outright on performance requirements, refactored them with both natural and surrogate identifiers in almost every table (join tables excluded). The most relevant example is the database used by the ManageSoft product, whose database (depending on subproduct configuration) is around 500 tables and 3000 columns, and in typical installations runs up to 100 GB - or about $10 worth of disk storage. I don't believe $10 worth of storage can be considered a large database, but it's certainly medium-sized. Many of the natural identifiers are variable-sized strings of up to several hundred characters; with Unicode storage they may be a kilobyte. When you form a join table of more than one such key, and index it in a B+Tree with a typical page size of 8KB, the index fanout completely kills application performance. The B+Tree pages average 75% full, which gives you a fanout of 6. A 2-way join table of 10E8 records forms an index that's 11 levels deep. The same index over 4-byte surrogate keys has a fanout around 350 and a depth of 3 levels. A full scan requires transfer of just over a gigabyte from disk, compared to sixteen gigabytes.

The argument about Moores law doesn't cut it I'm afraid. These databases are running on expensive fast hardware, 24x7, with little headroom, due to the high update rate. If you reduce the performance, you must reduce the incoming update rate, which means that the latency passes the acceptable limits. Moores law describes CPU performance, but application performance is driven by bandwidth. For database applications, that means disk bandwidth, which hasn't followed Moores law at all.

However, the use of surrogate keys has distinct dangers. Some programmers think they can avoid looking for a natural identifier, which is quite simply the road to hell, because it leads to gross modelling errors. Other problems are as Jan Pieter has pointed out.

Further to your point 7: The relational model is built on the idea of joins, and that requires two columns to contain the same values. It's not a fault of the relational model; in fact it's the primary value of the relational model. There are other models, but historically they've been found to be inferior.

Mitchell: Referential integrity constraints prevent the mis-spelling problem you describe. Please make sure you use them!

Wed, Apr 14 2010 13:55 In reply to

michell
Joined on Tue, Apr 6 2010
the Netherlands
Posts 4

Re: FCO-IM Lexicalizing question.

Reply Contact

Hi all,

Thanks for all the replies. I made the decision to remove the "extra" identifier and just use a larger primary key. This is clear in all communication... and I do not worry about performance issues since my database is very small.

With kind regards,

Michel.

Page 1 of 1 (9 items)