Using the Data Profiler Component
The Data Profiler Component is an SSIS data flow pipeline component that can be used to analyze data and compare rows from upstream data sources. Rows from any inputs will be passed through the component to corresponding outputs and once all the rows have been processed the component will output a single row to the "DataProfiler Output" with the results of data analysis. Some profile types will create an additional output that will provide rows of relevant data.
The component includes one page to configure how you want to profile data.
Data Profiler Configuration
The data profiler configuration page allows you to add and configure profiles to the data profiler.
- Add button
-
The Add button will add a new profile to the profile list.
- Remove button
-
Remove the selected profile from the profile list.
- Profile List - Profile Category column
-
The category of the profile. This property filters the list of profile types available in the Type column.
- Profile List - Profile Type column
-
The type of profile. This property determines what type the profile will be, and changes the properties in the profile properties grid to match the properties available for that profile type.
- Profile List - Output / Output Column Name column
-
The name of the column in the DataProfilerOutput or the output where the data from the profile will be sent. Some profiles output to a column in the DataProfilerOutput and some profiles have their own output.
- Add Preconfigured Profile button
-
The Add Preconfigured Profile button allows common sets of profiles to be added more easily.
After selecting the profile type one or more input columns will need to be selected. Using this information several profiles will be generated, configured and added to the component.
- Expression fx Button
-
Clicking the fx button to launch SSIS Expression Editor to enable dynamic updates of the property at run time.
- Generate Documentation Button
-
Clicking the Generate Documentation button to generate a Word document that describes the component's metadata including relevant mapping, and so on.
Profile Type Summary
Statistics Profiles
- Standard Deviation: The standard deviation of a numeric column.
- Min Value: The minimum value of a numeric or date column.
- Max Value: The maximum value of a numeric or date column.
- Mean Value: The mean value of a numeric column.
- Count: The total number of rows.
Column Length Profiles
- Column Length Min: The minimum length of a string column.
- Column Length Max: The maximum length of a string column.
- Column Length Distinct: The number of distinct string column lengths in the data.
- Column Length Summary: A new output that lists all the distinct lengths that were found, and how many of each were found.
Null Value Profiles
- Null Value Count: The number of null values in the column.
- Not Null Count: The number of not null values in the column.
- Null Value Ratio: The ratio of null values to not null values. A value of 1 would mean every row is null, a value of 0 would mean there were no null values.
Column Pattern Profiles
- Column Pattern Match Count: The number of rows that match the specified regular expression.
- Column Pattern Match Ratio: The ratio of rows that match the specified regular expression. A value of 1 would mean every row matched, a value of 0 would mean that no rows matched.
- Column Pattern Match Violations: A new output that lists all of the values that violated the pattern.
Column Value Distribution
- Column Value Distinct: The number of distinct values in the column.
- Column Value Distinct Ratio: The ratio of distinct rows to total rows. A value of 1 would mean every row was distinct, a value of 0 would mean that every row was the same.
- Column Value Summary: A new output that lists all of the distinct column values and their counts.
Candidate Key Profiles
- Is Candidate Key: Determines if the selected columns could form a key. This will be true if the composite key formed by the data in each row is unique.
- Distinct Key Count: The number of distinct composite keys formed by the selected columns.
- Distinct Key Ratio: The ratio of distinct keys to the total number of keys.
- Key Count: The total number of rows in the selected columns.
- Candidate Key Violations: A new output that lists all of the duplicate keys that were found, and how many times each duplicate was found.
Functional Dependency Profiles
- Is Functional Dependency: Determines if the selected dependency columns are functionally dependent on the determinant columns. For example, if the dependency column was Zip Code and the determinate column was State then every unique zip code should have the same state in its row, any rows where this was not the case would be a violation.
- Dependency Violation Count: The number of dependency violations in the data.
- Dependency Violation Ratio: The ratio of dependency violations to the total number of rows A value of 1 would mean that every row was a violation, a value of 0 would mean there were no violations.
- Row Count: The total number of rows in the selected columns.
- Dependency Summary: A new output that lists all the unique determinent columns found for each group of dependency columns. The most frequently encountered determinent columns are considered to be the correct dependent columns, any less frequently occurring determinant columns are considered violations.
Value Inclusion Profiles
- Value Inclusion Count: The total number of rows where the input row value was found in the lookup column.
- Value Inclusion Count: The ratio of the number of value inclusions to the total rows. A value of 1 would mean that every input row was found in the lookup column, a value of 0 would mean that none of the input rows were found in the lookup column.
- Value Inclusion Violations: A new output that lists the items in the input column that were not found in the lookup table.
Profile Input Type Summary
Different profiles are compatible with different input types. The 4 relevant types are: String, Number, Date, and Object (anything else).
Profile Name | Number | String | Date | Object |
---|---|---|---|---|
StandardDeviation | yes | no | no | no |
MinValue | yes | no | yes | no |
MaxValue | yes | no | yes | no |
MeanValue | yes | no | no | no |
Count | yes | yes | yes | yes |
ColumnLengthMin | no | yes | no | no |
ColumnLengthMax | no | yes | no | no |
ColumnLengthDistinct | no | yes | no | no |
ColumnLengthSummary | no | yes | no | no |
NullValueRatio | yes | yes | yes | yes |
NullValueCount | yes | yes | yes | yes |
NotNullCount | yes | yes | yes | yes |
ColumnPatternMatchCount | no | yes | no | no |
ColumnPatternMatchRatio | no | yes | no | no |
ColumnPatternViolations | no | yes | no | no |
ColumnValueDistinct | yes | yes | yes | yes |
ColumnValueDistinctRatio | yes | yes | yes | yes |
ColumnValueSummary | yes | yes | yes | yes |
IsCandidateKey | yes | yes | yes | yes |
DistinctKeyCount | yes | yes | yes | yes |
DistinctKeyRatio | yes | yes | yes | yes |
KeyCount | yes | yes | yes | yes |
CandidateKeyViolations | yes | yes | yes | yes |
ValueInclusionRatio | yes | yes | yes | yes |
ValueInclusionCount | yes | yes | yes | yes |
ValueInclusionViolations | yes | yes | yes | yes |
IsFunctionalDependency | yes | yes | yes | yes |
DependencyViolationCount | yes | yes | yes | yes |
DependencyViolationRatio | yes | yes | yes | yes |
RowCount | yes | yes | yes | yes |
DependencySummary | yes | yes | yes | yes |