Why is it advisable to use surrogate keys in the data warehouse?

uga_boy

Jan 20th, 2012

Surrogate keys are advisable for a few different reasons:

1. In a large enterprise environment with many data sources to pull from you will inevitably come across a data feed that either does not contain a unique key of its own or duplicates existing keys in other disconnected systems you have to extract from and finally...

you never can trust a source systems data. It is much better to have warehouse keys to uniquely identify your dimension members rather than trying to use disconnected surrogate keys from an external datasource.

2. You get to control the partitioning of large data sets by creating your clustered indexes on the warehouse keys as well as being able to create non-clustered indexes on all referenced tables greatly speeding up query response.

3. If you attempt to use the naturally occurring business keys.. you are going to run into issues with enormous indexes and very complex join operations that arise from business entities that use more than a single column to define the business key. (Example in a bank account dimension: bank number + account number = unique account number, you cant place a composite primary key on those two columns in a type 2 dimension because you must track history, better to ensure a unique key for every record).

4. It keeps your fact tables as small as possible. And gives you the ability to store historical facts pointing to previous versions of the member records, not just the current one.

There are others but those are the ones that immediately came to mind. Reconciliation purposes and batch load tracking are others.