Handling Slow Changing Dimensions (SSIS)

This post will explain the following:

  • Slow changing dimensions (SCD0,SCD1,SCD2 and SCD3)
  • Inferred Members
  • How to configure a SCD component in a data flow using SSIS2012

Slow-Changing Dimensions are a special data warehouse modeling technique, that allow you to manage how your ETL process will resolve updates and inserts into a dimension, so that you can tell your packages what to do if an attribute value changes.

http://rpmachado.wordpress.com/wp-includes/js/tinymce/plugins/wordpress/img/trans.gif


There are currently four three accepted SCD types:

  • SCD1 in which you will overwrite an attribute value if the record has changed (I don’t need historical data );
  • SCD2 in which you will create a new record whenever an attribute value changes (I want to keep all historical changes);
  • SCD3 in which you create N columns in a dimension to keep N changes in it. (I just need the last N changes recorded).

Some authors refer another type, when no change (fixed attribute) is applied to an attribute, the SCD type is 0. I will assume a forth type as being SCD0.

For some time, the Integration Software available in the market didn’t support any king of wizard or automation while developing SCD ETL packages which means that if you wanted to handle a Slow-Changing dimension inside your packages, you had to create the whole data flow logics to handle it, which its not anything bad at all for me, because after you make it a few times, you will see that its always the same thing.

However SSIS gives you the SCD Component inside data flows so that you can use a wizard to help you creating the flow for handling SCD, however there are two important notes for you:

  • SCD Component doesn't support SCD3, so, if you want to use it you need to develop the flow yourself.
  • If you want to use SCD2, you have to create two new columns in your dimension which are the begin date of the record life (SCD_DT_BEGIN, for example) and the end date of the record life, which represents the time in which a new record with different attribute values was identified in your ETL process, (SCD_DT_END, for example)

Another important thing for you to know, if you aren’t used to use SCD dimensions is that in the same Dimension you can have SCD1 attributes, SCD2 attributes and even SCD3 attributes, because when a change occurs in a specific attribute, you might not need to record it (SCD2) but you might want to update its value. These requirements will depend from businesses to business.

This way, lets start making an example to show you how easy it is to use the wizard inside SSIS2012. I will show you all available options inside it, using all three SCD types in a single dimension. My scenario is based on a football referee operational system which will lead to and SCD dimension called DimReferee. Both are very simple dimensions with few attributes.

The source table is as follows:

http://rpmachado.files.wordpress.com/2014/01/wpid-table_source1.png

Our DimReferee is structured as follows: (As you can see I have already added the start and end date attributes which will be used to flag if a SCD2 record is active or old. As I have already told you SSIS SCD component allows you to use a Status column as well)

http://rpmachado.files.wordpress.com/2014/01/wpid-destination1.png

Now that both columns are created,  lets go to Integration Services. The first task is adding the SCD component to the data flow of your Dimension. The following figure shows you how it should look like. Our example package will be very simple, just a Source component to read the referee data from the source, the flow generated with the SCD component and a destination component to send our data do the Dimension.

http://rpmachado.files.wordpress.com/2014/01/wpid-scd11.png

To start configuring the SCD component double click it to open a Wizard that will help you configuring you Slow Changing Dimension. Once you do it, SSIS will show you an initial screen as the following. Hit the next button on it.

http://rpmachado.files.wordpress.com/2014/01/wpid-scd21.png

Once you do it the configuration wizard will show you one the most important screens. This is where you define the mappings between source and destination attributes, and the Business Key which will be used to look up the record in the source table. As you can also see you need to define the connection manager for the Dimension and the table you want to transform in a SCD dimension.

For our example lets map all attributes as follows and choose the id_referee (Primary key of the Referee table) as our business key. The start and end date will stay blank as they will be configured in a later screen.

http://rpmachado.files.wordpress.com/2014/01/wpid-scd1_set_business_key1.png

After hitting the Next button we enter the most important screen in the SCD point of view. It is in this screen that you define the type of Slow Changing Dimension for each attribute mapped in the first screen. You have three available options:

  • Fixed Attribute (SCD0)
  • Changing Attribute (SCD1)
  • Historical Attribute (SCD2)

Has I have told you before SCD3 is not supported in the SSIS component. In this example I will use all of the types, so when a referee name changes I don´t want to do anything because for my business I will assume that his name won´t change (SCD0). However when a referee marriage status changes I want to overwrite its value to the new one (SCD1) and when his address changes I want to save the old value so its need to be an Historical attribute (SCD2).

http://rpmachado.files.wordpress.com/2014/01/wpid-captura-de-ecrc3a3-2014-01-29-c3a0s-20-51-031.png

Once you click in the Next button SSIS will prompt you a new screen, so that you can choose what should it do, when an attribute value defined as Fixed changes and if you want to update even the SCD2 historical records if a type SCD1 value changes. In the first case you can tell SSIS to raise an error if a defined SCD0 attribute changes its value and the change is detected. In the second case if an SCD1 attributes changes its value, you need to define if you just want to overwrite its value in the most recent record or in all of them (Historical ones).

The last case only makes sense if you have an SCD2 attribute in the dimension. In our example lets raise an error if the referee name changes, has if it happens we might have misunderstood the business and it might need adjustments. In the second check box lets leave it blank as we just want to update the active record only.

http://rpmachado.files.wordpress.com/2014/01/wpid-scd1_set_what_when_changesinfixed1.png

The Next screen is where you define if you will use a Status column to indicate the current (most recent/active/alive) record or if you prefer to use a time interval. As I have already told you, I prefer to use the dates, as they allow me to now in which period did that record live. This way lets use our start_date and end_date attributes as a flag to the current record. We also need to indicate which type of variable are we going to use to populate the dates. To do so you have three options:

  • ContainerStartTime - The start time of the container in which the package is in.
  • CreationDate - Date of creation of the package
  • StartDate - The start date of execution of the package*

I prefer the last option as it is the most near of the SCD flow in the package, reflecting the most recent date. However for this example I left the CreationDate. After that click in the Next button.

http://rpmachado.files.wordpress.com/2014/01/wpid-scd_set_start_end_date1.png

The next screen is delicate. An interesting concept related to data warehouses using SSIS and SQL Server is an inferred member which is a record from your dimension, referenced by fact tables, however the record is not yet loaded to the dimension. This means that a particular record can the referenced without existing. Might occur if for some reason you need to load facts without having the dimension record data yet, so you create a Blank/Null record in a dimension, that won’t have any data, just its primary key. The next screen allow you to tell SSIS what to do when the data for that inferred member arrives, either update the record or creating a new one. If you activate the inferred member support you have two options:

  • All columns with a change type are null - This means that for every column in a inferred member record that is configured with and SCD type, SSIS will set its values to null.
  • Use a boolean column to indicate whether the current record is an inferred member - In this case SSIS will use a column created by you, to flag if the current record is a inferred member.

http://rpmachado.files.wordpress.com/2014/01/wpid-scd_inferred1.png

Once you click next you are done. SSIS will now automatically create the flow for you, which as you can see in the following figure is a lot of time saved with transformations and components configuration. However if there are further transformations such as data types conversions, you still need to do them. For instance in our example we still need to convert some data types before running our package.

http://rpmachado.files.wordpress.com/2014/01/wpid-capture1.png

In other to have a complete usable data flow , you have to add some conversion components to respect data types, but i am pretty sure that for that  you won’t need my help. However I leave you a final advice, to avoid this situation make sure your data is converted before using the SCD component. something like the following:

http://rpmachado.files.wordpress.com/2014/01/wpid-prop.png

Adding that conversion component to ensure data types consistency, your data flow should now look like the following:

http://rpmachado.files.wordpress.com/2014/01/wpid-finalllllll.png

Now, run the package and test yourself.

Thank you,

Rui Machado