
There are a number of techniques I apply when I review a data model. Using these
techniques over the years has helped me quickly summarize models and look for ways
to improve them. Many times the 80/20 rule seems to prevail when reviewing a
model. That is, in 20% of the time we can validate 80% of the model. (That extra
20% can be a bit tough, however!)
Many of these techniques I plan on one day publishing and making a topic in
my modeling course. An example of such a technique is:
I look at each data element and quickly identify which data elements do not end
in a valid classword. A classword represents a general domain for the data
element and examples include AMOUNT, DATE, NAME, IDENTIFIER, INDICATOR, and CODE
(there are others too). Finding data elements that don’t end in a valid
classword represents a naming standard violation at a minimum, and can also
signal a larger problem where multiple pieces of business information might be
stored in the same data element. For example, is Gross Sales an amount, a
weight, a value, or worse yet, a combination of these? That is, should the right
data element name be Gross Sales Amount, Gross Sales Weight, etc.?
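The classword check lends itself to automation. Here is a minimal Python sketch; the approved classword list and the element names below are illustrative assumptions, not an official standard:

```python
# A minimal sketch of the classword check. The approved classword list and
# the element names below are illustrative assumptions, not a real standard.
VALID_CLASSWORDS = {"AMOUNT", "DATE", "NAME", "IDENTIFIER", "INDICATOR", "CODE"}

def classword_violations(element_names):
    """Return element names whose last word is not an approved classword."""
    return [
        name for name in element_names
        if name.strip().split()[-1].upper() not in VALID_CLASSWORDS
    ]

elements = ["Gross Sales", "Order Date", "Customer Name", "Ship Indicator"]
print(classword_violations(elements))  # ['Gross Sales']
```

Anything the script flags, like Gross Sales here, becomes a question for the business: is it an amount, a weight, or something else?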
Now I know there are a lot more tips out there that you might use when you
review a model. Share these with me. I will compile some of these techniques
(including more of my own) and send them back out in this design challenge
response. Note that as with all design challenges if I use your technique (and
it’s not already on my list) I will give you credit in any documentation.
Looking forward to hearing your techniques!
Response
So what are some of the handy techniques we use when reviewing a data model? The
more models we review, the more techniques we tend to apply either formally or
instinctively. This design challenge contains several of the very useful
techniques that were sent to me for reviewing a design, thank you for the great
responses! A special thanks goes out to data modeling legend Graeme Simsion for
his responses.
I placed each of these techniques within categories and there is no priority to
the order I listed the categories in or the techniques within each category.
They are all important techniques, and this design challenge focused on
identifying the techniques rather than rating them. As you read each of the
techniques, do you see any new ones you can start using?
(The name in parentheses following each technique is the person who submitted
the technique.)
[NOTE: Whenever you see notes like this within the square brackets, they are
additional thoughts I have on the technique.]
Validating the model with the business and functional
experts
- Express relationships and subtype hierarchies as business assertions eg “Each
Person may hold one or more accounts and each account must be held by one
person.” Richard Barker’s technique was the first I saw; Graham Witt’s,
which incorporates entity and attribute definitions to provide a complete
description of the model, is the best. The user can work alone on this
(initially at least!) ticking off – or querying – assertions serially. (Graeme
Simsion) [NOTE: This is a very powerful technique, especially when the
business or functional experts prefer to stay clear of the boxes and lines on a
data model. We are simply reading them the story on the design. When you do
this, I strongly recommend reading them what the relationship includes, as well
as what the relationship excludes. For example, in Graeme’s example, if a Person
can hold one or more accounts, that would mean a Person cannot hold zero
accounts. I would ask the business experts whether this is what was intended,
and if so, we have it modeled correctly.]
- Start to gently query the author about some of the non-obvious business rules
during a model-walkthrough session. (Ashraf Kandil)
- Perform transactional analysis. Ensure that each of the target transactions or
desired reports can be produced based on the suggested model. (Ashraf Kandil)
[NOTE: This is a very useful technique when designing a data mart where validation can
be done by reviewing a report and making sure everything on the report can be
represented somewhere in the model. A word of advice here: there is a very good
chance your design might require more flexibility than what appears in a given
report. For example, if there are four levels of product reporting within a
certain report, build your design with enough flexibility to handle the maximum
number of levels your experts feel might exist in the near future.]
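As a rough illustration of transactional analysis, the sketch below traces every field on a target report back to an attribute somewhere in the model. The entities, attributes, and report fields are invented for this example:

```python
# Illustrative transactional analysis: trace every field on a target report
# back to an attribute somewhere in the model. Entities, attributes, and
# report fields are invented for this example.
model = {
    "Customer": {"Customer Identifier", "Customer Name"},
    "Order": {"Order Identifier", "Order Date", "Gross Sales Amount"},
}

def unmapped_report_fields(report_fields, model):
    """Return report fields that no entity in the model can supply."""
    all_attributes = set().union(*model.values())
    return [field for field in report_fields if field not in all_attributes]

report = ["Customer Name", "Order Date", "Discount Amount"]
print(unmapped_report_fields(report, model))  # ['Discount Amount']
```

Any unmapped field is either a gap in the model or a derivable value whose inputs need to be confirmed.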
Data element and entity names
- Check entity names for non-singular and group names – eg Customer List, Product
Catalogue. For consistency (and in building assertions) entity names should
describe one instance. (Graeme Simsion)
- I look at each data element and quickly identify which data elements do not end
in a valid classword. A classword represents a general domain for the data
element and examples include AMOUNT, DATE, NAME, IDENTIFIER, INDICATOR, and CODE
(there are others too). Finding data elements that don’t end in a valid
classword represents a naming standard violation at a minimum, and can also
signal a larger problem where multiple pieces of business information might be
stored in the same data element. For example, is Gross Sales an amount, a
weight, a value, or worse yet, a combination of these? That is, should the right
data element name be Gross Sales Amount, Gross Sales Weight, etc.? (Me)
- Ensure that if an entity/table/attribute/column naming standard already exists
in the organization, it has been adhered to. Otherwise improvise one
and apply it consistently across the model. (Ashraf Kandil) [NOTE: As modeling
tools continue to add features, naming standards validation should eventually be
an automatic and not a tedious manual activity.]
Model appearance
- If you’re presenting a diagram to a non-specialist, start with a “high level”
diagram stripped of the “less important” entities. Use your judgement
here: the purpose is to get the person oriented; the actual verification will
take place against the full model. Another option is to draw the important
entities larger – if your tool will let you. (Graeme Simsion) [NOTE: I build
“high level” models often for the reason Graeme mentions. It’s similar to the
concept of reading the table of contents of a book to get a high level
understanding of that book as opposed to jumping right into page 1 of the text.]
- A favorite technique of mine when reviewing with the business is to divide the
logical model into business subject areas. This is easily done with
ERwin’s subject areas. Most often, the subject areas correspond
to discrete business processes, e.g., add a customer, send an invoice, etc.
I walk the business through each subject area – one at a time –
describing entities, data, and relationships in the context of the business
process and business rules. (Mary Niva)
- Check model clarity, suggesting the use of subject areas if overly complex. Try
for a one page fit for each subject area. Investigate the potential of using
colors to highlight important entities eg, yellow for transaction entities and
plain white for valid-value set (lookup list) entities. (Ashraf Kandil)
Rules of normalization
- Verify each entity is in third normal form. Ask intelligent questions about the
ones that aren’t and evaluate the answers. (Ashraf Kandil)
- Look for 3 (or more)-way intersection entities. Check the rules to ensure
4th and 5th Normal form have not been violated (not so hard when you think
“rules” rather than “multi-valued dependencies!”) (Graeme Simsion)
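The 3NF check can be supported with sample data: test whether one non-key column appears to functionally determine another, which can flag a possible transitive dependency worth querying. This is only a heuristic on whatever rows you happen to have, and the column names and rows below are invented:

```python
# Heuristic support for the 3NF check: using sample rows, test whether one
# non-key column appears to functionally determine another. If it does, a
# transitive dependency may be hiding in the entity. Column names and rows
# are invented; real data could of course contradict a small sample.
def apparent_dependency(rows, determinant, dependent):
    """True if every distinct determinant value maps to one dependent value."""
    seen = {}
    for row in rows:
        d_val, dep_val = row[determinant], row[dependent]
        if d_val in seen and seen[d_val] != dep_val:
            return False
        seen[d_val] = dep_val
    return True

orders = [
    {"order_id": 1, "zip_code": "10001", "city": "New York"},
    {"order_id": 2, "zip_code": "10001", "city": "New York"},
    {"order_id": 3, "zip_code": "60601", "city": "Chicago"},
]
print(apparent_dependency(orders, "zip_code", "city"))  # True
```

A True result here is the cue to ask intelligent questions: if zip code determines city, city may belong in its own entity.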
Definitions
- Look at a selection of definitions – poor definitions are one of the most common
faults in a model, and often overlooked as the focus is on the graphical
representation. A quick glance will usually tell you whether the modeler
has taken this task seriously. (Graeme Simsion)
- Ensure that entities and attributes have good solid meaningful definitions that
are more than just a repetition of the name. Encourage the provision of examples
to enhance understanding. (Ashraf Kandil) [NOTE: How many of us have seen
definitions that simply rephrase the data element name? These “quick and dirty”
definitions can escape notice when searching for missing definitions.]
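A crude but handy way to triage these “quick and dirty” definitions is to measure how much of the definition’s wording already appears in the element name. The threshold below is an arbitrary assumption, and a flag should only ever prompt a human review:

```python
# A crude heuristic for definitions that merely rephrase the element name:
# if most of the definition's words already appear in the name (or are the
# filler word "the"), flag the definition for human review.
# The 0.8 threshold is an arbitrary assumption.
def rephrases_name(name, definition, threshold=0.8):
    """True if the definition looks like a restatement of the name."""
    name_words = set(name.lower().split())
    def_words = [w.strip(".,") for w in definition.lower().split()]
    if not def_words:
        return True
    overlap = sum(1 for w in def_words if w in name_words or w == "the")
    return overlap / len(def_words) >= threshold

print(rephrases_name("Customer Name", "The name of the customer."))  # True
```

Definitions that survive the check still deserve a read, but the flagged ones are the place to start.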
Model flexibility
- Look for “head scratchers” (self-referencing one-to-many and many-to-many
relationships). Is the flexibility of an infinite hierarchy / network
really required or has the modeler just been too lazy to model the real
structures? (Graeme Simsion)
- Look at how “parties” have been handled. Many, many modelers will
generalise “all people and organizations of interest to the enterprise” into
some sort of Party, Party Role, etc. structure without checking whether the
business needs that level of generalization eg the business may be quite happy
to treat suppliers and shareholders as totally disjoint (even if in the real
world they overlap). (Graeme Simsion)
- Listen to the business case/problem and verify that the model is addressing it
and not an academic exercise. DON’T MODEL FOR THE SAKE OF MODELING. (Ashraf
Kandil) [NOTE: There have to be good business reasons why we chose a particular
design structure.]
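When profiling data is available, one way to challenge a “head scratcher” is to measure how deep the hierarchy actually goes. The sketch below walks a child-to-parent mapping (the data is invented); if real data never exceeds a few levels, an unbounded recursive structure may be more flexibility than the business needs:

```python
# Illustrative check on a self-referencing relationship: given sample rows
# as a {child: parent} mapping (data invented here), measure how deep the
# hierarchy really goes. Shallow real-world depth is a hint that an
# unbounded recursive structure may not be required.
def max_depth(parent_of):
    """Length of the longest chain of parent links, cycle-safe."""
    def depth(node, seen):
        if node not in parent_of or node in seen:  # root reached, or a cycle
            return 1
        return 1 + depth(parent_of[node], seen | {node})
    return max((depth(n, set()) for n in parent_of), default=0)

org = {"Team A": "Division 1", "Team B": "Division 1", "Division 1": "Company"}
print(max_depth(org))  # 3
```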
Keys
- Look for meaningful keys (ie non-surrogate keys). My view is that these are
acceptable sometimes, but we should always check that they are applicable for
all cases, unique, and stable (the last of these three is the most common
problem). (Graeme Simsion) [NOTE: If you do decide to create a surrogate key,
make sure you define an alternate key on something real such as what the
business would view as the primary key.]
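The uniqueness part of Graeme’s key check can be tried directly against sample rows. In this hypothetical sketch (invented data), the surrogate key passes while the proposed natural alternate key turns out not to be unique:

```python
# Hypothetical sketch: test whether a proposed meaningful key is actually
# unique across sample rows, and whether a natural alternate key exists
# alongside a surrogate. Table contents are invented for the example.
def is_unique_key(rows, key_columns):
    """True if the combination of key_columns never repeats across rows."""
    seen = set()
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key in seen:
            return False
        seen.add(key)
    return True

customers = [
    {"customer_id": 1, "tax_number": "A-100", "name": "Acme"},
    {"customer_id": 2, "tax_number": "A-100", "name": "Acme East"},
]
print(is_unique_key(customers, ["customer_id"]))  # True: surrogate is unique
print(is_unique_key(customers, ["tax_number"]))   # False: fails as an alternate key
```

Stability over time cannot be tested this way, which is exactly why it is the most commonly missed of the three criteria.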
Specific to the physical data model
- Check for Recursion. Look for any recursive relationships. Verify that they have
been correctly resolved. Ensure that the resolution comes with a Type entity.
(Ashraf Kandil)
- Suggest a Super/Sub type resolution approach. (Ashraf Kandil)
- Identify the nature of the target system and verify whether it is OLTP or DW. Check for
any controlled redundancies for the latter. (Ashraf Kandil)
- Verify if any logical model entities should be combined into physical tables to
achieve performance expectations. Verify correct dimensional Hierarchies have
been established taking into account business nature and physical modeling
science. (Ashraf Kandil) [NOTE: Remember to apply the Denormalization Survival
Guide here from Chapter 8 in my book – a shameless plug for my book!]
- Perform throughput analysis. Verify that, where possible, high volume
transactions have been well catered for to achieve performance expectations.
(Ashraf Kandil)
Meta data quality
Does the meta data match the data? If at all possible, try to get at the actual
data that will be stored in the data elements you are modeling. You might be
surprised at what you find, but better to be surprised early on in a project
than during the actual development. For example, I recently worked on an
application where an email address was being stored in a company name field. It
can sometimes be challenging to actually get to the data, but once you get
access to the data simple SQL queries can help validate that the data element
meta data matches the actual data. (Me)
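When you do get access to the data, even a small script can stand in for those simple SQL queries. This sketch (with made-up values and a simplistic email pattern) scans a column for values that contradict its definition, such as email addresses in a company name field:

```python
# Rough sketch of checking that meta data matches the data: scan a column's
# values for a pattern that contradicts its definition, e.g. email addresses
# sitting in a company name field. Values and pattern are illustrative.
import re

EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def suspicious_values(values, pattern=EMAIL_PATTERN):
    """Return the values matching a pattern foreign to the column's domain."""
    return [v for v in values if pattern.search(v)]

company_names = ["Acme Corp", "jane.doe@example.com", "Globex"]
print(suspicious_values(company_names))  # ['jane.doe@example.com']
```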
1 Comment
-
Often a data model diagram by itself will be given to me with the request “Can you see if this model is ok?”.
My polite and firm response always is “Please give me the business context. Only then will I review.”
Also I am often asked to review a data model after reverse engineering – with little else to go on.
And the tables in such models can have single-element surrogate IDs, without any clue about the “natural key”.
It is essential to avoid deep discussions about primary key, alternate key and/or surrogate key with business folks.
Instead I found the following question very effective in extracting the required details:
“How can you tell apart one row from another?”
I persevere and persist until the following sentence can be completed:
“Each row of this table represents <>”.
And I am simply amazed just how poorly formed these surrogate keys can be.
At times, a single table is indeed a mash of multiple tables, with different natural keys.
All hiding in the same table, courtesy of an ill-conceived surrogate ID.
All I can say is that the above technique works all the time. 🙂