Using the Databricks Source Component
The Databricks Source Component is an SSIS data flow pipeline component that can be used to read/retrieve data from a Databricks instance.
The component includes the following two pages to configure how you want to read data:
- General
- Columns
General Page
The General Page of the Databricks Source Component allows you to specify the general settings of the component.
- Connection Manager
-
The Databricks Source Component requires a Databricks connection manager.
- Object Type
-
The Object Type allows you to choose whether you want to read data from a Databricks table or use a snippet of a query. Using the Table option, allows you to have the ease of use of visually picking the Catalog, Schema, Fields, that you want to read from the Databricks instance. When Query is used, you can take advantage of the power and flexibility of Databricks query, such as JOIN, filtering criteria, etc. Your business requirements should influence which option to choose.
- Warehouse
-
The Warehouse option is a drop-down list that allows you to pick one of the Databricks warehouses available in your Databricks instance.
- Catalog
-
The Catalog drop-down menu displays a list of available catalogs in the Databricks instance defined in the Connection Manager. Selecting a catalog here will automatically populate the Schema drop-down list.
- Schema
-
The Schema drop-down menu displays a list of available Schemas in the previously selected Snowflake database. Selecting a schema here will automatically populate the Table drop-down list.
- Table
-
The Table drop-down menu displays a list of available tables from the schema specified in the previously selected database schema. Selecting a table here will automatically populate the Command property. This is a quick and easy way to generate a basic select statement for reading from the database.
- Parameterize Command(Available for QUERY object Type)
-
Preparing commands protects against SQL injection when using the custom command. Only disable this option when it is required, and you have absolute control over the values being merged into the command.
- Command(Available for QUERY object Type)
-
The Command textbox is the command text that will be executed over the Connection Manager to read data from Databricks. A basic select statement can easily be generated by selecting a table from the Data Source property. You can then further customize the command to your liking to perform powerful queries.
The Command textbox supports the use of User and System Variables. Simply select a variable under the Insert Variables drop-down menu, and a placeholder value will be inserted into the filter text.
- Import
-
Loads SQL from a file into the Command property.
- Export
-
Save the SQL in the Command property to a file.
- Preview
-
Opens a preview dialog that shows the result (up to the first 100 rows) of executing the text in the Command property over the specified Connection Manager. Note if the command makes any changes to the database the changes will appear in the preview but are rolled back immediately. Changes to the database will only commit at runtime.
- Use External Link
-
Using EXTERNAL LINKS allows you to fetch large result sets efficiently.
- Refresh Component Button
-
Clicking the Refresh Component button causes the component to retrieve the latest metadata and update each field to its most recent metadata.
- Expression fx Button
-
Clicking the fx button to launch SSIS Expression Editor to enable dynamic updates of the property at run time.
- Generate Documentation Button
-
Clicking the Generate Documentation button generates a Word document that describes the component's metadata including relevant mapping, and so on.
Columns Page
The Columns page of the Databricks Source Component shows you the available columns based on the settings on the General page.
On the top left of the grid, you can see a checkbox, which can be used to toggle the selection of all the available fields. This is a productive way to check or uncheck all available fields.
- Databricks Field
-
This column lists the names of the fields available from the data source (or command).
- Data Type
-
This column shows the datatype of the respective field.
Note: As a general best practice, you should only select the fields that are needed for the downstream pipeline components. Do this on the columns page using the checkboxes or on the General page by removing the column from the command entirely.