
The ORM Foundation


Validations that probably ought to happen for reference modes

  • Tue, Aug 17 2010 11:32

    • shada22
    • Top 50 Contributor
    • Joined on Fri, Dec 19 2008
    • Posts 18

    Validations that probably ought to happen for reference modes

    I was doing some fiddling with the ORM diagram discussed at http://www.ormfoundation.org/forums/t/589.aspx when I noticed that the generated SQL Server output contained TWO colorID IDENTITY columns. This was clearly due to my using .ID as the reference mode instead of .code as in the linked discussion. However, shouldn't a validation error occur here, because an AutoCounter reference mode IS NOT suitable for absorption? Additionally, after changing .ID to .code, there was also no validation error for not specifying the length of the FixedLength data type. I haven't quite gotten around to debugging the problem yet, but thought I would post a note here about it. Fyi, Bill
  • Tue, Aug 17 2010 13:12 In reply to

    Re: Validations that probably ought to happen for reference modes

    Bill,

    We are not currently checking for multiple identity columns in relational tables. In fact, we currently don't do any datatype checking in the relational model in the tool (the datatype properties on the columns are smoke-and-mirrors that directly edit the underlying value types). Basically, the datatypes we are currently using are considered placeholders and will be replaced in the future with full metamodel support for both intrinsic and custom data types, including the ability to formally define facet requirements (such as length on a fixed-length string data type). Until that time, the scale/length/precision fields are for all intents and purposes unvalidated data.

    The issue of how much SQL validation we can do up front in the tool needs to be carefully considered, especially given that the specific DBMS target is pretty good at compiling DDL and verifying its implementation-specific requirements. The cost/benefit of making NORMA a full-blown SQL compilation checker makes it unlikely we'll ever be 100% complete for all targets. The in-memory form is blissfully unaware of the final DBMS target. For example, key length has different limits on different systems, and there are other implementation issues that map differently on different systems (keys with optionals in SQL Server require views, but not in other systems). In this case, the notion of IDENTITY is a shorthand for the SQL Standard notion of SEQUENCE (which effectively allows multiple IDENTITY columns in one table), but many DBMS targets do not implement this. Basically, the approach is to write an intermediate format that can then be analyzed for the target system and tweaked accordingly based on the system's capabilities.
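    To make the gap concrete, here is a small sketch (table, column, and sequence names are hypothetical, not the actual generated output): SQL Server rejects a second IDENTITY column outright, while a target that implements the SQL Standard SEQUENCE notion can express several auto-generated columns in one table by giving each its own sequence.

        -- Rejected by SQL Server: only one IDENTITY column is allowed per table.
        CREATE TABLE Dog (
            dogName NVARCHAR(50) NOT NULL PRIMARY KEY,
            collarColorID INT IDENTITY(1,1) NOT NULL,
            pennantColorID INT IDENTITY(1,1) NOT NULL
        );

        -- Accepted where standard sequences are available (exact syntax varies
        -- by DBMS): each auto-generated column draws from its own sequence.
        CREATE SEQUENCE CollarColorSequence START WITH 1 INCREMENT BY 1;
        CREATE SEQUENCE PennantColorSequence START WITH 1 INCREMENT BY 1;
        CREATE TABLE Dog (
            dogName NVARCHAR(50) NOT NULL PRIMARY KEY,
            collarColorID INT NOT NULL DEFAULT (NEXT VALUE FOR CollarColorSequence),
            pennantColorID INT NOT NULL DEFAULT (NEXT VALUE FOR PennantColorSequence)
        );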

    The most common case where multiple ID columns end up in one table is with absorbed subtypes that use alternate identification schemes. However, there are other cases (1-1 absorption, full absorption through functional roles) that can result in this pattern. I really don't want to consider these until I've rehashed the datatypes because I'm just digging a deeper hole that I'll need to fill back in later. Certainly, having one AutoCounter datatype is a poor metamodel because the notion of 'autogenerated' is actually separate from the notion of the underlying datatype (you can auto generate different sized integers, floats, UUIDs, etc).

    There is also the question of whether auto-generated data is conceptually significant, or if we should allow a 'target default auto-identifier' that is determined by the implementation target. For many absorption cases, this would completely eliminate the duplicate identifier because we would recognize that it had no additional conceptual information. For other generation targets (class models, etc), this would mean that we would not need to keep DBMS-style identifiers where an object reference would be sufficient to represent the concept.

    Honestly, the current datatypes work well enough in most cases, so there are several big-ticket items in front of rehashing the datatypes on my list. As an olive branch, I'm attaching a preliminary transform that will catch this scenario at the DCIL level and split the duplicate ids into separate tables. To use the attached generator modifier:

    1. Expand the .zip
    2. Run the appropriate install .bat file (pick VS2005 or VS2008). You will need to run it as an Administrator on Vista or Win7.
    3. In the ORM Generator Selection dialog for the appropriate file, expand the ‘Generated File Modifiers’ section and select ‘Analyze Identity Columns’

    This should eliminate the multiple IDENTITY columns. Note that there is no attempt made to eliminate identity columns that are not conceptually significant, and there are a couple of undone items regarding use of the name generation settings. This is something I would consider integrating into the default transforms, but I'd like some feedback on it first.
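    As a rough sketch of the shape the modifier produces (names are hypothetical; the actual transform works on the DCIL and the generated names will differ), each duplicate identity ends up in its own table, and the original table keeps plain foreign key columns:

        -- Each split-out identity becomes its own table with a single IDENTITY column...
        CREATE TABLE CollarColor (
            collarColorID INT IDENTITY(1,1) NOT NULL PRIMARY KEY
        );
        CREATE TABLE PennantColor (
            pennantColorID INT IDENTITY(1,1) NOT NULL PRIMARY KEY
        );
        -- ...and the absorbing table references them instead of generating values itself.
        CREATE TABLE Dog (
            dogName NVARCHAR(50) NOT NULL PRIMARY KEY,
            collarColorID INT NOT NULL REFERENCES CollarColor (collarColorID),
            pennantColorID INT NOT NULL REFERENCES PennantColor (pennantColorID)
        );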

    -Matt

  • Tue, Aug 17 2010 14:28 In reply to

    • shada22
    • Top 50 Contributor
    • Joined on Fri, Dec 19 2008
    • Posts 18

    Re: Validations that probably ought to happen for reference modes

    Matthew Curland:

    Bill,

    We are not currently checking for multiple identity columns in relational tables. In fact, we currently don't do any datatype checking in the relational model in the tool (the datatype properties on the columns are smoke-and-mirrors that directly edit the underlying value types). Basically, the datatypes we are currently using are considered placeholders and will be replaced in the future with full metamodel support for both intrinsic and custom data types, including the ability to formally define facet requirements (such as length on a fixed-length string data type). Until that time, the scale/length/precision fields are for all intents and purposes unvalidated data.

    The tool does verify non-zero lengths for Value data text types, doesn't it? I was surprised NORMA didn't validate the ReferenceMode data type as well, since the .orm file translates the reference mode into a Role.

    Matthew Curland:

    The most common case where multiple ID columns end up in one table is with absorbed subtypes that use alternate identification schemes. However, there are other cases (1-1 absorption, full absorption through functional roles) that can result in this pattern. 

    To expand on what I saw:
    The Color entity was completely absorbed into these two roles:
    * Collar is of Color
    * The Pennant is of Color
    This is essentially the 'full absorption through functional roles' case you mentioned.

    Matthew Curland:

    I really don't want to consider these until I've rehashed the datatypes because I'm just digging a deeper hole that I'll need to fill back in later. Certainly, having one AutoCounter datatype is a poor metamodel because the notion of 'autogenerated' is actually separate from the notion of the underlying datatype (you can auto generate different sized integers, floats, UUIDs, etc).

    Ah yes, differing kinds of auto-generation would indeed lead to different absorption decisions. If the auto-generation needs to occur in a central place (like SEQUENCEs), then you would want to prevent absorption.

    However, with Auto UUIDs this might not be necessary; since the UUIDs would "never" collide, they could be generated wherever the value was absorbed.
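    For example (a sketch with hypothetical table names), with SQL Server's uniqueidentifier each absorbing table could simply default the column to NEWID(); there is no shared generator object that needs to coordinate the values:

        -- The absorbed color identifier can be generated locally in each table,
        -- because UUID collisions are not a practical concern.
        CREATE TABLE Collar (
            collarNr INT NOT NULL PRIMARY KEY,
            colorID UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID()
        );
        CREATE TABLE Pennant (
            pennantNr INT NOT NULL PRIMARY KEY,
            colorID UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID()
        );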

    I think the eventual data type system ought to at least distinguish between the two cases somehow since SEQUENCEs are so widely used in relational schemas.

    That is to say, an AutoCounter SEQUENCE reference mode ought to be viewed as requiring the entity to be independent if nothing else requires the entity to exist on its own.

    Alternatively, generation targets could be put in charge of registering their data types/editors using a plugin metaphor, with several built-in datatypes provided for the standard SQL patterns. The plugins could then participate in some of the validation logic.

    Matthew Curland:

    There is also the question of whether auto-generated data is conceptually significant, or if we should allow a 'target default auto-identifier' that is determined by the implementation target. For many absorption cases, this would completely eliminate the duplicate identifier because we would recognize that it had no additional conceptual information. For other generation targets (class models, etc), this would mean that we would not need to keep DBMS-style identifiers where an object reference would be sufficient to represent the concept.

    Auto-generated data is hardly ever conceptually significant. It is, however, frequently of practical significance when making facts efficiently searchable (due to the data store underlying the code).

    If the underlying data store was an OODB store then you probably wouldn't be using a SEQUENCE reference mode. It might be a hidden detail of the OODB system in use.

    Matthew Curland:

    Honestly, the current datatypes work well enough in most cases, so there are several big-ticket items in front of rehashing the datatypes on my list. As an olive branch, I'm attaching a preliminary transform that will catch this scenario at the DCIL level and split the duplicate ids into separate tables. To use the attached generator modifier:

    Wow, you didn't need to do that for me. The solution to the problem was straightforward enough when I noticed it.

    i.e. make the Entity Independent,
    or change the reference mode to something sensible for Color (i.e. .code) so that the absorption makes sense.

    I'd rather have a data type driven validation warning here rather than making the transforms back out an absorption decision. Didn't VisioModeler have a warning like this?

    Thanks a bunch for your detailed responses!

    Bill

  • Tue, Aug 17 2010 16:46 In reply to

    Re: Validations that probably ought to happen for reference modes

    shada22:
    The tool does verify non-zero lengths for Value data text types, doesn't it?

    I don't think I've ever added any validation for this. You just get a max-length string. This will come with formal facet definitions. However, right now, we essentially use two slots for all of the facet data. The display name for this data and whether or not it is serialized is up to the data type, but we do not do any additional checking on it.

    shada22:
    UUIDs would "never" collide

    Yes, this does make them special, because auto-counters collide all the time. This is important metadata because making autocounter ids unique across the system is irksome, whereas this happens automatically for UUID.

    The other issue I didn't mention with a second IDENTITY column is that SQL Server (at least; probably others as well) ignores the nullable bit, which means that an optional absorbed identifier ends up mandatory, which causes a silent validation failure if the primary table does not have its own IDENTITY column to clash with.

    The idea of generation targets being able to impose limitations and additional options on the selected data types is interesting. For example, the choice of string storage (UNICODE, etc.) is not a conceptual choice, but is a choice as you map to a DBMS. The limitation question, where a generation target imposes a data limitation based on its capabilities, is also interesting because this limitation may need to bubble back up so that other generated artifacts keep the data consistent. I don't think this will be as common as the target-specific properties. I'd still like to keep these under the umbrella of the conceptual data type, though, instead of adding a second data type.

    shada22:
    Wow, you didn't need to do that for me

    I'm not quite that nice (check the file dates, I've had this for a while). I would like to know if it works for you, though, as I haven't had other feedback on it.

    shada22:
    I'd rather have a data type driven validation warning

    I'm not saying we'll never incorporate data inputs into the absorption choices, just that I need more metadata than I currently have to introduce a real fix for the problem. There are actually two stages in the analysis: the first chooses possible absorption paths, and the second decides whether or not to use those choices. So, the final implementation absorption may not be the same in all targets. For example, an automatic form layout would not move absorbable elements to a subform simply because of an auto counter identifier. I won't necessarily hit par with VM in this area; I just don't have the manpower to throw at it. I think VM did warn, but they didn't have any way to fix the problem without a change at the conceptual level.

    Thanks for your comments,

    -Matt

  • Wed, Aug 18 2010 20:39 In reply to

    Re: Validations that probably ought to happen for reference modes

    shada22:

    That is to say, an AutoCounter SEQUENCE reference mode ought to be viewed as requiring the entity to be independent if nothing else requires the entity to exist on its own.

    That's what ActiveFacts does.
    Matthew Curland:
    There are actually two stages in the analysis: the first chooses possible absorption paths, and the second decides whether or not to use those choices.

    Again, this is exactly what ActiveFacts does. The "references" directed graph is generated, and where a one-to-one exists, a default direction is chosen (rather than creating two references). During the absorption process, some of these references may get flipped.

    My absorption algorithm marks every object type with one of four marks: definitely/tentatively table/non-table. The process iterates a set of simple rules until no new definite decisions have been made. The core of the thing is only 100 lines of code, yet works (slightly) better than NORMA currently does.

    A further feature I mean to add is a target-specific plugin at the end of absorption, but before any generation. When generating for Rails, which requires a single primary key field preferably called "id" (and auto-allocated, protected from assignment, etc.), the plugin can inspect all the tables and inject a surrogate identifier where a suitable key isn't already present. On generation, this is then propagated into all the FKs, of course.
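    For instance, the injected surrogate might show up in the generated schema roughly like this (a sketch only; table and column names are illustrative, expressed in the same SQL Server DDL style as the rest of the thread rather than actual ActiveFacts output):

        -- Rails wants a single auto-allocated "id" primary key, so one is injected;
        -- the natural composite key survives as a uniqueness constraint, and other
        -- tables' foreign keys then reference "id" instead.
        CREATE TABLE dog_collars (
            id BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
            dog_id BIGINT NOT NULL,
            collar_nr INT NOT NULL,
            UNIQUE (dog_id, collar_nr)
        );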

  • Wed, Aug 18 2010 22:42 In reply to

    • shada22
    • Top 50 Contributor
    • Joined on Fri, Dec 19 2008
    • Posts 18

    Re: Validations that probably ought to happen for reference modes

    After thinking about this some more, I'm becoming more convinced that a SQL Server IDENTITY and an Oracle SEQUENCE really are two different conceptual data types and ought to be treated differently in the conceptual model.

    The huge difference between the two is simple: An IDENTITY MUST exist inside of an Independent Entity, while a SEQUENCE is ALWAYS an Entity that is NOT Independent.
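    In DDL terms the difference is plain to see (a sketch with illustrative names): an IDENTITY only exists inside some table's definition, whereas a SEQUENCE is a standalone schema object that exists regardless of which tables, if any, draw values from it.

        -- IDENTITY: bound to a single table; it has no existence of its own.
        CREATE TABLE Color (
            colorID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
            colorCode NCHAR(8) NOT NULL UNIQUE
        );

        -- SEQUENCE: a schema-level object in its own right.
        CREATE SEQUENCE ColorIdSequence START WITH 1 INCREMENT BY 1;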

    A SEQUENCE MUST only have the predicate of "Sequence has reference mode Sequence_id". (i.e. no other predicates can be allowed).

    Therefore, an error on the conceptual datatype of IDENTITY does make sense when an IDENTITY reference mode isn't Independent, and it isn't JUST a relational-level concept. Additionally, an error should also be generated when a SEQUENCE IS NOT Independent.

    Differentiating between an IDENTITY and a SEQUENCE in the conceptual model would also substantially simplify the generation code derived from the model, which certainly makes the modeling tool easier to leverage. As a former boss of mine would say (more or less): let's create pits of success for our users instead of pits of failure/complexity/heartache. I.e., there would be no need for that silly 'pluggable datatypes affecting absorption' idea of mine.

    Interesting realization I just had: AutoCounter reference modes (IDENTITY and SEQUENCE) are in fact derived fact predicates. They are facts derived from the generation algorithm used in the implementation of the conceptual datatype (i.e. the difference between an Oracle NOORDER SEQUENCE and an Oracle ORDER SEQUENCE). Other possible derivation examples include the SQL Server uniqueidentifier type that always generates a UUID, or a derived reference mode predicate leveraging a SEQUENCE or IDENTITY to construct one of those annoying ids that some UoDs require, like 'EMP00000123'.
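    As a sketch of that last kind of derivation (SQL Server syntax, hypothetical names), the stored fact is just the auto-generated counter, and the 'EMP00000123'-style reference is a predicate derived from it:

        CREATE TABLE Employee (
            empID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
            -- derived reference value, computed from the auto-generated counter
            empNr AS ('EMP' + RIGHT(REPLICATE('0', 8) + CAST(empID AS VARCHAR(8)), 8))
        );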

    Hoping I'm not annoying everybody too badly with all of my posts,

    Bill

  • Thu, Aug 19 2010 10:52 In reply to

    Re: Validations that probably ought to happen for reference modes

    Hi Bill,

    Interesting discussion. Honestly, the SEQUENCE implementation is spotty enough across DBMS targets that I haven't looked at integrating the notion.

    We do need to be careful about use of IsIndependent to mean too many things. IsIndependent is often viewed as meaning 'is key total'. While key total is implied by independent, there are times IsKeyTotal would be available when IsIndependent is not. For example, any objectification of a fact type with a non-spanning uniqueness cannot be independent because the implied fact type for the role not under the uniqueness constraint is mandatory on the objectifying entity type. These objectifications can (and often do) have surrogate identifiers.

    I think IsKeyTotal (all identifiers for an entity are available in a single location) is a conceptual notion that should be added to the core system. Opinions?

    I'll have to think more on the composite derived identifiers. The issue is basically trusting a function to produce a unique value. I guess we're doing this with IDENTITY anyway, but I tend to trust the core DBMS functionality more than a user function.

    -Matt

  • Thu, Aug 19 2010 19:19 In reply to

    Re: Validations that probably ought to happen for reference modes

    Matthew Curland:
    I'll have to think more on the composite derived identifiers
    While you're thinking on it, consider whether there might be a good way to auto-assign one part of a multi-part key. For example, in ActiveFacts, each Role is identified by its FactType and a Position within that FactType. It'd be really nice to have the Positions auto-assigned for the Roles of each FactType. The DBMSs don't provide this as a builtin, but a trigger could.
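    Something along these lines, say (a T-SQL sketch with hypothetical names; a real version would also need to think about concurrent inserts):

        CREATE TABLE Role (
            fact_type_id INT NOT NULL,
            position     INT NOT NULL,
            PRIMARY KEY (fact_type_id, position)
        );
        GO
        -- Callers insert a placeholder position (e.g. 0); the trigger replaces it
        -- with the next free Position within that FactType.
        CREATE TRIGGER Role_AssignPosition ON Role INSTEAD OF INSERT AS
        BEGIN
            INSERT INTO Role (fact_type_id, position)
            SELECT i.fact_type_id,
                   COALESCE(existing.max_position, 0)
                       + ROW_NUMBER() OVER (PARTITION BY i.fact_type_id
                                            ORDER BY (SELECT NULL))
            FROM inserted AS i
            LEFT JOIN (SELECT fact_type_id, MAX(position) AS max_position
                       FROM Role
                       GROUP BY fact_type_id) AS existing
                   ON existing.fact_type_id = i.fact_type_id;
        END;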
    Matthew Curland:
    I think IsKeyTotal (all identifiers for an entity are available in a single location) is a conceptual notion that should be added to the core system. Opinions?
    I'd like to see examples showing the use cases.