Using Premium Hash for Efficient and Easy Change Tracking in SSIS

28 August 2020
Chen Huang

It is always crucial for an ETL process to have a change tracking mechanism to identify when a change has happened to any set of data and only process data that has changed.

There are several ways to capture data changes. The best scenario would be with a CDC (Change Data Capture) incorporated in your source system. The most common approach utilizes a timestamp column to identify which data has changed since the last pull when your source system does not have the CDC feature. In situations where the source data does not have a reliable timestamp column, an alternative solution is to perform a field-by-field comparison. This can be done using our Diff Detector component offered in our SSIS Productivity Pack. However, this field-by-field comparison approach in a large dataset can be time-consuming. To resolve this challenge, we have recently launched our Premium Hash component that can efficiently track data changes.

First, let's have a quick look at the Hash function. 

What is Hash?

Hashing is a technique to transform a string into a value or key with a fixed length that represents the original string. The hash string is completely different, even when there is a single character difference in the original string. For a given piece of data, you’ll always get the same hash but you can’t turn a hash back into its original string.

Given the nature of hash, hashing is typically used in cryptographic applications like digital signatures. When it comes to the ETL world, hashing can be used for many purposes, like identifying duplicates from the source system, generating seemingly random strings, etc. In this blog post, we will use hashing to track data changes. 

Premium Hash Component Overview

Premium Hash component is an SSIS data flow transformation component that can be used to hash any input data.

 Premium Hash component

It comes with 2 scopes: Row Level scope and Column Level scope. Row Level scope cascades the column values and generate the hash string for each input row, and Column Level hash generates the hash string for each column value which column has been specified in the Premium Hash component. There are 12 hash algorithms available in Premium Hash component, which covers the commonly used hashing algorithms.

Change Tracking with Premium Hash component

It's a very simple data flow design like this when using Premium Hash component to track data changes in one Source:

SSIS Data Flow Design for Change Tracking with Premium Hash component

The first Premium ADO.NET Source component reads data from the source table, which has a “_checksum” field that is used to store the hash value.

SSIS Premium ADO.NET Source component with Checksum Field

Then we can configure the Premium Hash to generate the hash value. Simply specify the Hash Algorithm and add all the columns that you want to check if the data has changed. The Premium Hash component will cascade the input values and generate the hash string.

SSIS Premium Hash Editor

After we get the hash string, we can quickly determine the changed records by using a Conditional Split component.

  • Newly created records: ISNULL( [_checksum] )
  • Updated records: !ISNULL( [_checksum] )&& (_checksum != RowLevelHash)
  • Changed records (include newly created and updated): ISNULL(_checksum) || (_checksum != RowLevelHash)

In this example, we use the third condition to get all the changed records, including the newly created and updated records. We sync the data to the target table using our Premium ADO.NET Destination component with Upsert action. Upsert is a combination of Update and Insert, which makes data synchronization simple and easy with a few clicks.

SSIS Image 04 - Premium ADO.NET Destination component with Upsert Action

In the last step, we update the [_checksum] value in the source table with the latest hash value.

The sample data flow above shows how the Premium Hash component works to track changes in one source. Moreover, we can also use the Premium Hash component along with Diff Detector to detect changes from two sources. In a previous blog post, we covered a simple data flow for Incremental Load in SSIS with Diff Detector. In that example, we were performing field-by-field comparison, which doesn’t perform well when you have a lot of fields to compare. For optimal design, we can improve the performance significantly by using the Premium Hash component.

The data flow would look like this:

SSIS Image 05 - Data Flow Design for Detecting Changes from Two Sources

With the Premium Hash component used in the data flow, we only need to compare the hash string with Diff Detector. Thus the performance will be much improved since we are only comparing a string value.

SSIS Diff Detector Component Editor

Closing Note

In this blog, we demonstrated how you can use our Premium Hash component to track data changes in one source and in two sources. To summarize, our Premium Hash component offers extreme flexibility to work with different data types and optimized performance for data change tracking. All the components used in this blog post are available in our SSIS Productivity Pack. We hope this can provide you with some insights for your SSIS development.

Archive

November 2024 3 October 2024 1 September 2024 1 August 2024 2 July 2024 1 June 2024 1 May 2024 1 April 2024 2 March 2024 2 February 2024 2 January 2024 2 December 2023 1 November 2023 1 October 2023 2 August 2023 1 July 2023 2 June 2023 1 May 2023 2 April 2023 1 March 2023 1 February 2023 1 January 2023 2 December 2022 1 November 2022 2 October 2022 2 September 2022 2 August 2022 2 July 2022 3 June 2022 2 May 2022 2 April 2022 3 March 2022 2 February 2022 1 January 2022 2 December 2021 1 October 2021 1 September 2021 2 August 2021 2 July 2021 2 June 2021 1 May 2021 1 April 2021 2 March 2021 2 February 2021 2 January 2021 2 December 2020 2 November 2020 4 October 2020 1 September 2020 3 August 2020 2 July 2020 1 June 2020 2 May 2020 1 April 2020 1 March 2020 1 February 2020 1 January 2020 1 December 2019 1 November 2019 1 October 2019 1 May 2019 1 February 2019 1 December 2018 2 November 2018 1 October 2018 4 September 2018 1 August 2018 1 July 2018 1 June 2018 3 April 2018 3 March 2018 3 February 2018 3 January 2018 2 December 2017 1 April 2017 1 March 2017 7 December 2016 1 November 2016 2 October 2016 1 September 2016 4 August 2016 1 June 2016 1 May 2016 3 April 2016 1 August 2015 1 April 2015 10 August 2014 1 July 2014 1 June 2014 2 May 2014 2 February 2014 1 January 2014 2 October 2013 1 September 2013 2 August 2013 2 June 2013 5 May 2013 2 March 2013 1 February 2013 1 January 2013 1 December 2012 2 November 2012 2 September 2012 2 July 2012 1 May 2012 3 April 2012 2 March 2012 2 January 2012 1

Tags