Using the Premium Data File Source Component
The Premium Data File Source Component is an SSIS data flow pipeline component that can be used to read/retrieve data from an Avro, ORC, or Parquet file.
The component includes the following three pages to configure how you want to read the data:
- Data Source
- Document Designer
- Columns
Data Source
The Data Source page of the Premium Data File Source Component allows you to specify the general settings of the component.
- Connection Properties
-
- Connection Manager
-
The Premium Data File Source Component requires a connection in order to connect to a File. The Connection Manager drop-down will show a list of all connection managers that are available to your current SSIS package.
This component will support the following connection managers:
- Local File
- FTPS Connection Manager
- SFTP Connection Manager
- Amazon S3 Connection Manager
- Azure Blob Connection Manager
- Azure Data Lake Storage Connection Manager
- Azure Files Connection Manager(since v20.1)
- Box Connection Manager
- Dropbox Connection Manager
- Google Cloud Storage Connection Manager
- Google Drive Connection Manager (since v21.2)
- Hadoop Connection Manager
- OneDrive Connection Manager
- SharePoint Connection Manager (offered with the SSIS Integration Toolkit for Microsoft SharePoint)
- WebDAV Connection Manager (since v20.1)
- File Format
-
This option lets you choose the file format that needs to be read. The following are the currently supported file formats:
- Avro
- ORC
- Parquet
- Source File Path
-
The Source File Path specifies the location of the file that you are trying to read from.
- Expression fx Icon
-
Click the blue fx icon to launch SSIS Expression Editor to enable dynamic updates of the property at run time.
- Generate Documentation Icon
-
Click the Generate Documentation icon to generate a Word document that describes the component's metadata including relevant mapping, and so on.
Document Designer
The Document Designer page allows you to build the design of the document you are trying to read, or import the design from an existing document.
The Document Designer includes the following two tabs:
- Details View
- Additional Settings
In the Details View tab, the top part of the page is used to manually configure the nodes in the design:
- Add Node: This button will add a new node to your Document design.
- Remove Nodes: This button will remove a node from your Document design.
- Direction buttons: These buttons can be used to rearrange the position of the nodes.
- Rename Nodes: This option allows you to specify how the node name should be represented.
-
- Use Qualified Names: When this option is selected, the output/column name will be set to the full qualified node name based on the node location in the document.
- Use Short Names: When this option is selected, the output/column name will be set to the given Node Name directly.
-
Filter Columns: This option allows you to show or hide certain Columns in the grid.
- Show Basic Columns: When this option is selected, only basic columns will be shown in the grid.
- Show All Columns: When this option is selected, all available columns will be shown in the grid.
- Show Basic Columns: When this option is selected, only basic columns will be shown in the grid.
- Filter Nodes: This option allows you to filter the list of nodes shown in the grid by typing a keyword in the textbox.
The Details View grid consists of:
-
Node Type: This option allows you to specify the type of the Node in your document design. There are four options available:
- Array.
- Object
- Value.
- Raw: This type can be used when trying to retrieve data under a node exactly as it is in the document.
- Array.
- Node Name: The Name of the Node in the document.
- Output/Column Name: The name which will be set for the output or the column of a node.
- Is Repeated: This option allows you to specify if a node is repeated within a document. (Available when Show All Columns is selected)
- Output type: The type of output for a node such as a Column or a Secondary Output depending on the Node Type.
- Output Settings: This option allows you to specify the settings of each output such as the datatype of Value Node Types.
In the Additional Settings tab, you would find the following options:
-
'Is Repeated' Text Qualifier: This option allows you to specify the
Text Qualifier used in a document when the
Is Repeated property is set to
True for one or more nodes. There are four options available:
- Double-quote(“).
- Single-quote (‘).
- Tick (`).
- None.
-
'Is Repeated' Text Delimiter: This option allows you to specify the
Text
Delimiter used in a document when the
Is Repeated property is set to
True for one or more nodes. There are seven options available:
- Newline (\n).
- Carriage Return (\r).
- Semicolon (;).
- Colon (:).
- Comma (,).
- Tab (\t).
- Vertical Bar (|).
- Import
-
This option allows you to import the design of your document from one of the following three sources:
- Designer Settings: Import the design from an existing .designer.settings file.
- Connection Manager: Import the design based on the retrieved document from the connection manager.
- Local File: Import the design based on a file on your local file system.
- Designer Settings: Import the design from an existing .designer.settings file.
- Export
-
Designer Settings: This option allows you to export the current document design to a .designer.settings file which can be used later to import the same design in a different component.
Columns Page
The Columns page of the Premium Data File Source Component shows you the available columns based on the settings on the Document Designer page.
-
On the top left of the grid, you can see a checkbox, which can be used to toggle the selection of all available fields. This is a productive way to check or uncheck all available fields. The Columns Page grid consists of:
- Include Field Checkbox: A checkbox that determines if the field will be available as an output column.
- Column Name: Column that will be retrieved from the document.
- Data Type: The data type of this field.
- Hide Unselected Fields
-
When the Hide Unselected Fields checkbox is checked unselected output columns will be hidden.
- Hide Selected Fields
-
When the Hide Selected Fields checkbox is checked used selected columns will be hidden.
- Filter
-
The output columns that are visible can be filtered by entering text in the Filter text box.
Note: As a general best practice, you should only select the fields that are needed for the downstream pipeline components. Do this on the columns page using the checkboxes or on the General page by removing the column from the command entirely.
Avro
Premium Data File Source component imports Avro data types to the following output types by default.
Primitive Types
Avro Primitive Type | Output Type |
---|---|
null |
DT_WSTR(255) - nvarchar |
boolean |
DT_BOOL - boolean |
int |
DT_I4 - int |
long |
DT_I8 - bigint |
float |
DT_R4 - real |
double |
DT_R8 - float |
bytes |
DT_IMAGE - image |
string |
DT_WSTR(255) - nvarchar |
Complex Types
Avro Complex Type | Output Type | Notes |
---|---|---|
enum |
DT_WSTR(255) - nvarchar |
|
fixed |
DT_IMAGE - image |
|
array of primitive type items |
DT_NTEXT - ntext |
Avro array will be converted into a repeated node in Premium Data File Source, the output value is delimited text. Note: Array of Array is not currently supported. |
map |
Child Output |
Premium Data File Source detects an Avro map field to a child output that contains an array of key-value pairs. |
record |
Object |
|
array of record items | Child Output | |
union["null", "{avro_primitive_type}"] |
For Avro union type field only has one non-null type, it will be detected as the corresponding SSIS data type in the Premitve Type Conversion table above. |
|
union["{avro_primitive_type1}", "{avro_primitive_type2}"...] |
DT_WSTR(255) - nvarchar |
NOTE: Avro logical types are not currently supported, logicalType attribute in the Avro file schema would be ignored.
ORC
Premium Data File Source component imports ORC data types to the following output types by default.
ORC Data Type | Output Type |
---|---|
boolean |
DT_BOOL - boolean |
binary |
DT_IMAGE - image |
byte |
DT_UI1 - byte |
short |
DT_I2 - smallint |
int |
DT_I4 - int |
long |
DT_I8 - bigint |
decimal |
DT_DECIMAL - decimal |
float |
DT_R4 - real |
double |
DT_R8 - float |
string |
DT_WSTR(255) - nvarchar |
varchar |
DT_WSTR(255) - nvarchar |
char |
DT_WSTR(255) - nvarchar |
date |
DT_DBDATE - date |
timestamp |
DT_DBTIMESTAMP2 - datetime |
NOTE: Premium Data File Source component currently does not support ORC complex structure.
Parquet
Premium Data File Source component imports Parquet data types to the following output types by default.
Primitive Types
Parquet Data Type | Parquet Annotation | Output Type |
---|---|---|
BOOLEAN |
DT_BOOL - boolean |
|
INT32 |
DT_I4 - int |
|
INT32 |
INT_16 |
DT_I2 - smallint |
INT32 |
UNIT_8 |
DT_I1 - tinyint |
INT32 |
UNIT_16 |
DT_UI2 - smallint (unsigned) |
INT64 |
|
DT_I8 - bigint |
INT96 |
|
DT_DBTIMESTAMP - datetime |
FLOAT |
|
DT_R4 - real |
DOUBLE |
|
DT_R8 - float |
BYTE_ARRAY |
|
DT_IMAGE - image |
BYTE_ARRAY |
DECIMAL |
DT_DECIMAL - decimal |
BYTE_ARRAY |
UTF8 |
DT_WSTR(255) - nvarchar |
Complex Types
Parquet Complex Type | Output Type | Notes |
---|---|---|
ARRAY of primitive type items |
DT_NTEXT - ntext |
Parquet array will be converted into a repeated node in Premium Data File Source, the output value is delimited text. Note: ARRAY of ARRAY is not currently supported. |
MAP | Child Output | Premium Data File Source detects a Parquet MAP field to a child output that contains an array of key-value pairs. |
STRUCT | Object | |
ARRAY of STRUCT items | Child Output |