It is always crucial for an ETL process to have a change tracking mechanism to identify when a change has happened to any set of data and only process data that has changed.
There are several ways to capture data changes. The best scenario would be with a CDC (Change Data Capture) incorporated in your source system. The most common approach utilizes a timestamp column to identify which data has changed since the last pull when your source system does not have the CDC feature. In situations where the source data does not have a reliable timestamp column, an alternative solution is to perform a field-by-field comparison. This can be done using our Diff Detector component offered in our SSIS Productivity Pack. However, this field-by-field comparison approach in a large dataset can be time-consuming. To resolve this challenge, we have recently launched our Premium Hash component that can efficiently track data changes.
First, let's have a quick look at the Hash function.
What is Hash?
Hashing is a technique to transform a string into a value or key with a fixed length that represents the original string. The hash string is completely different, even when there is a single character difference in the original string. For a given piece of data, you’ll always get the same hash but you can’t turn a hash back into its original string.
Given the nature of hash, hashing is typically used in cryptographic applications like digital signatures. When it comes to the ETL world, hashing can be used for many purposes, like identifying duplicates from the source system, generating seemingly random strings, etc. In this blog post, we will use hashing to track data changes.
Premium Hash Component Overview
Premium Hash component is an SSIS data flow transformation component that can be used to hash any input data.
It comes with 2 scopes: Row Level scope and Column Level scope. Row Level scope cascades the column values and generate the hash string for each input row, and Column Level hash generates the hash string for each column value which column has been specified in the Premium Hash component. There are 12 hash algorithms available in Premium Hash component, which covers the commonly used hashing algorithms.
Change Tracking with Premium Hash component
It's a very simple data flow design like this when using Premium Hash component to track data changes in one Source:
The first Premium ADO.NET Source component reads data from the source table, which has a “_checksum” field that is used to store the hash value.
Then we can configure the Premium Hash to generate the hash value. Simply specify the Hash Algorithm and add all the columns that you want to check if the data has changed. The Premium Hash component will cascade the input values and generate the hash string.
After we get the hash string, we can quickly determine the changed records by using a Conditional Split component.
- Newly created records: ISNULL( [_checksum] )
- Updated records: !ISNULL( [_checksum] )&& (_checksum != RowLevelHash)
- Changed records (include newly created and updated): ISNULL(_checksum) || (_checksum != RowLevelHash)
In this example, we use the third condition to get all the changed records, including the newly created and updated records. We sync the data to the target table using our Premium ADO.NET Destination component with Upsert action. Upsert is a combination of Update and Insert, which makes data synchronization simple and easy with a few clicks.
In the last step, we update the [_checksum] value in the source table with the latest hash value.
The sample data flow above shows how the Premium Hash component works to track changes in one source. Moreover, we can also use the Premium Hash component along with Diff Detector to detect changes from two sources. In a previous blog post, we covered a simple data flow for Incremental Load in SSIS with Diff Detector. In that example, we were performing field-by-field comparison, which doesn’t perform well when you have a lot of fields to compare. For optimal design, we can improve the performance significantly by using the Premium Hash component.
The data flow would look like this:
With the Premium Hash component used in the data flow, we only need to compare the hash string with Diff Detector. Thus the performance will be much improved since we are only comparing a string value.
Closing Note
In this blog, we demonstrated how you can use our Premium Hash component to track data changes in one source and in two sources. To summarize, our Premium Hash component offers extreme flexibility to work with different data types and optimized performance for data change tracking. All the components used in this blog post are available in our SSIS Productivity Pack. We hope this can provide you with some insights for your SSIS development.