In our ETL development, we don't always receive the data in the exact format that we expect it to be. For instance, there might be cases that the incoming data is a string value while we expect an integer, or in some other cases, there are special characters in it we didn’t realize, or otherwise there might be unexpected whitespaces. This last case is the focus of this blog post. For example, you might think a value is ‘StringValue’, whereas it is actually ‘ StringValue’, or ‘StringValue ‘, with a few leading or trailing spaces. This can occur when the data source forces a certain number of characters in a column, and pads the value in there with spaces to fill up the missing number of characters, and it is not very obvious when this happens during the ETL development, and it can cause frustrations. If we are counting on being able to match values between a source and these padded target values for lookup purpose, we will run into difficulties. Cleaning up the padded whitespaces would be the ideal solution if they are not part of the data, however, that may not always be the practical solution in an already defined ETL process, particularly when the padded values are saved in the target system. In this blog post, we will discuss the use case, and we will show you how to use our Premium Lookup component to perform the lookup without having to make actual changes to the source or target system data.
Use case
Let’s look at what might happen if we try to do a lookup in this case. For this purpose, we will use our Premium Lookup. As our source, we have a single record with a string field, firstname_unpadded = ‘Testctc’. In our lookup target, the value is ‘ Testctc’, which has several leading spaces. We feed our source and our lookup target into the Premium Lookup component and configure the Matched Rows and Unmatched Rows output.
To configure the Premium Lookup component, we simply map the input field to the lookup field. Here we have mapped the unpadded input field to the padded lookup, and keep the default “Match Type” as “Exact Match”:
In the Columns page, we can select these two columns to pass on:
When we execute this Data Flow now, we see that the values don’t match, as expected. The record is in the “Unmatched Rows” output.
In this case, there are a couple of options that we will look at: Removing padding from target, or Exact Lookup with Whitespace trimmed.
- Remove padding from Target values - This solution assumes that the target values should not be padded, and would be equivalent to cleaning your data, whether it be in a database, a file, some ERP system, etc. You can use your method of choice for removing the padded whitespace. This is the recommended solution, assuming that the configuration allows for the padding to be trimmed and that the padding does not belong.
- Exact Lookup with Whitespace trimmed - If for some reason you cannot trim the actual target values, then you can use the Premium Lookup component to compare the input to trimmed versions of the target values. For this we again use “Exact Match”, but this time click on the “Comparison Setting” ellipsis to open up the Settings window. Here select “Trim Leading and Trailing Whitespace” and click OK.
When using the above feature shown in the second option, the Premium Lookup component will perform an in-memory trimming of the source data and target data before comparison. When we execute the package, the record will be matched regardless whether the source data or target data is padded or not. You can see that the unpadded firstname from the input Source was successfully matched to the padded target firstname after we have selected the “Trim Leading and Trailing Whitespace” option. This has greatly simplified the ETL process without having to actually modify the source or the target system data.
Conclusion
The Premium Lookup component that comes along with our SSIS Productivity Pack can be used in many such scenarios where the data needs to be matched, but has anomalies in it. There are different match types that could be used for this purpose. You could merge and join the inputs and have a control on how exactly the match needs to be done, and what outputs are to be produced.