You are a data modeler designing a data mart, and have come across at least two different entities on your logical data model that happen to have the same primary key. On your physical data model, would you keep them in separate tables or merge them into a single table because these entities share the same key? Don’t answer just yet! I’m not looking for a response such as “separate tables” or “single table”. I’m looking for a set of guidelines on when to go which route. Read on for the specifics. Thanks to Ben Ettlinger for providing the idea behind this scenario. In your role as data mart data modeler, you have identified the following two main questions this data mart must answer:
- What are our company’s actual expenses by department and month?
- What are our forecasted expenses by department and month?
So your Actual Expense and Forecasted Expense logical entities look something like this (primary keys are above the line):
What should the physical entity (or entities) look like if the primary key to both of these entities is the Month and Department? What I’m looking for is not just an answer, but guidelines on when to choose which design. Imagine you are writing these guidelines for someone with minimal data modeling experience. Guidelines can also be in the form of a flowchart or spreadsheet, whatever medium you think is most conducive to explaining the options. Let your imagination run on this one. There are no restrictions and no wrong answers.
I call this section “A Response” instead of “The Response” because really there is more than one solution to this problem. Consider the following discussion to be one set of guidelines for determining one table verse two. What’s a bit tricky about this design challenge is that we are asked not to respond with just “I think Actual Expense and Forecasted Expense should exist in one table.” or “Actual Expense and Forecasted Expense must be in separate tables.” Instead we are asked for guidelines for when to go which route. I believe guidelines are extremely important to have in our field, because very little that we do as modelers are ever in the form of “In all circumstances always do this…”. So guidelines become a way of life for us modelers, whether we formalize them or go with a “gut feel”.
Do most queries require data from both tables?
The first question we need to ask ourselves (and the most important question when designing the physical data model for a data mart) is how will the data be used? If data elements from separate tables will be used together a majority of the time, then the answer is “yes” to this question, and we move on to the next question with the possibility of having a single table at the end of this decision tree. This is because the main reason we denormalize is to decrease table joins, and therefore reduce retrieval times and improve query and report generation performance. We would only want to have one denormalized table if the amounts will be reported together often. If the Actual Expense Amount and Forecasted Expense Amount will not be used together, then the answer to this question is “no” and there is no benefit to having them in the same table. If this is the case they should remain in separate tables and we’ve made our decision and are done.
Is one of the tables a placeholder?
This question asks whether new data elements and/or relationships will be added to just one of these tables in the near future. Adding new data requirements to a table can make us re-evaluate our question on usage. If these new requirements are queried only by themselves within the specific table, we might decide to answer “no” to the first question. There also could be space or performance ramifications with adding a large amount of new information. For example, If Actual Expense is planning on being expanded by 15 data elements and 2 relationships within 6 months, will this new information be used in conjunction with information from the other table a majority of the time? This placeholder question is usually raised by the designer or functional analyst, requiring someone with the skills and incentive to look beyond this specific data mart and see what is coming on the horizon.
Does usage outweigh complexities?
If one of these tables are going to be expanded in the near future, how big a deal is it? Is it worth giving up some of the query performance gains and keep them in separate logical entities to make it easier to enhance the structures in the future?
Is a view an option?
If we’ve decided that the answer is “no”, and it is worth it to keep the tables separate to make expansion easier in the near term, can a view give us the best of both worlds? This is usually an option if the complexity in creating the view is minimal and the performance of reporting on this view is not going to be an issue. With the latest database technology, features such as materialized views are a very handy way to get the best of both worlds. That is, the flexibility and user-friendliness of a single view with the performance of a table.
Any loading or database complexities in 1 table?
An interesting point to consider is how updates are done to both tables. Sometimes depending on how much history you are keeping, it can be very time consuming and complex to load plan data when it is such a small percentage of the table’s total size. Also, do you need to keep history on plan data? If the amount and level of detail of history varies greatly between the two tables, it could be challenging to load and extract data from these tables (not to mention the very high amount of redundancy that could occur). Imagine if we are keeping five years in the Actual table and completely refreshing data in the Planned table. If we have one table for both Actual and Planned the logic required to perform this operation is fairly complex and time consuming. These considerations are also heavily influenced by the specific database platform and environment. Database and loading complexities are usually raised by developers and database administrators.
Does usage outweigh complexities?
If there are load or database complexities, how big a deal is it? Is it worth giving up some of the query performance gains by keeping them in separate logical entities? Determining which option to take here can sometimes (unfortunately) be based on the influencing abilities of the designer(s) verse the developer(s). I’ve seen and participated in lively discussions (and arm wrestles and headlocks, etc.) involving entire teams where the pros and cons are weighed.
Some other considerations and comments
- I received several comments as to why there are two tables in the first place. That is, why isn’t there an entity called Expense that contains both Actual and Forecasted Expense? The answer is that there are many more data elements in each of entities that were not shown because they were not critical to working through this challenge. For example, Forecasted Expense contains certain condition data elements that were in place when the forecast was made, such as Demand to Supply Ratio. These condition data elements are fully dependent on Forecasted Expense and therefore in a normalized world exist in this separate Forecast entity.
- There are also several hybrid solutions to this problem. By a hybrid solution I mean one that contains some amount of denormalization (maybe only those portions that are shared in queries) and some amount of normalization. In this particular example with only one data element in each table obviously this situation can not occur.