Using the HTML Table Source Component
The HTML Table Source Component is an SSIS source component that retrieves an HTML document from an HTTP request or a local file and extracts a table element into column data. Output columns can be generated automatically or manually. For each HTML table column, up to four SSIS output columns can be configured, depending on the data that needs to be extracted:
- Text
- HTML
- Links
- Images
General Page
The General page determines where the HTML document comes from. The Connection Manager property can be set to an HTTP Connection Manager, a local file, or (since v21.1) file content held in a variable. This page also specifies which table to use for extraction and how to extract data from it.
- General Settings
- Connection Manager
The HTML Table Source Component requires a connection in order to parse data from an HTML table. The Connection Manager drop-down lists all connection managers that are available to the current SSIS package.
Note: The following connection options are supported:
- Local File
- HTTP Connection Manager
- <<File Content in Variable>> (since v21.1)
- Local File Path
The path on the file system to the HTML document that will be extracted.
- Input Variable
Select, from a drop-down list, an SSIS variable or parameter that your package has access to.
- Get Table By
Specifies how to uniquely identify the table element to use for extraction. There are four ways to do this:
- Class: specify the class name and the position of the HTML table to use for extraction.
- Id: specify the id of the HTML table to use for extraction.
- Position: specify the position of the HTML table to use for extraction.
- XPath: specify an XPath expression pointing to the HTML table to use for extraction.
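To make the four lookup modes concrete, here is a minimal Python sketch that applies the same ideas to a small sample page. It is only an illustration, not the component's implementation: the sample markup, the function names, and the assumption that positions are 1-based are all hypothetical, and the standard-library `xml.etree.ElementTree` module used here needs well-formed markup and supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (ElementTree cannot parse sloppy HTML).
PAGE = """
<html><body>
  <table class="data"><tr><td>first</td></tr></table>
  <table id="prices" class="data"><tr><td>second</td></tr></table>
  <table><tr><td>third</td></tr></table>
</body></html>
""".strip()

root = ET.fromstring(PAGE)


def first_cell(table):
    """Return the text of the first cell, to show which table was picked."""
    return table.find(".//td").text


def by_class(class_name, position):
    # Assumption: position is 1-based and counts only tables with this class.
    # Note: this matches the whole class attribute, not individual class tokens.
    return root.findall(f".//table[@class='{class_name}']")[position - 1]


def by_id(table_id):
    return root.find(f".//table[@id='{table_id}']")


def by_position(position):
    # Assumption: position is 1-based across all tables in document order.
    return root.findall(".//table")[position - 1]


def by_xpath(expression):
    # ElementTree understands only a limited XPath subset.
    return root.find(expression)


print(first_cell(by_class("data", 2)))
print(first_cell(by_id("prices")))
print(first_cell(by_position(3)))
print(first_cell(by_xpath("./body/table[3]")))
```

Id and XPath each identify a table on their own, while Class needs a position as well because several tables can share one class name.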
- Advanced Settings
- Header Row Position
The position of the row in the HTML table to use as the header row. If the table has no header row, specify 0.
- Top Rows To Skip
The number of rows at the top of the HTML table (after the header row position) to ignore.
- Bottom Rows To Skip
The number of rows at the bottom of the HTML table to ignore.
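As a rough illustration of how Header Row Position, Top Rows To Skip, and Bottom Rows To Skip interact, the following Python sketch splits a list of table rows into a header and data rows. The function name `select_rows` and its parameters are hypothetical. The sketch assumes that the header position is 1-based (with 0 meaning no header row), that rows above the header are discarded, and that top rows are counted from the row after the header. The component's exact behavior may differ.

```python
def select_rows(rows, header_row_position=1, top_rows_to_skip=0, bottom_rows_to_skip=0):
    """Split table rows into (header, data) following the three settings.

    Assumptions (not confirmed by the component documentation):
    - header_row_position is 1-based, and 0 means "no header row";
    - rows above the header row are discarded;
    - top rows are skipped starting right after the header row.
    """
    if header_row_position > 0:
        header = rows[header_row_position - 1]
        body = rows[header_row_position:]
    else:
        header = None
        body = list(rows)

    # Drop the leading rows, then the trailing rows.
    body = body[top_rows_to_skip:]
    if bottom_rows_to_skip > 0:
        body = body[:-bottom_rows_to_skip]
    return header, body


# Hypothetical table: a header, a units row, two data rows, and a totals row.
rows = [
    ["Name", "Qty"],
    ["(text)", "(units)"],
    ["apple", "3"],
    ["pear", "5"],
    ["Total", "8"],
]

header, data = select_rows(rows, header_row_position=1,
                           top_rows_to_skip=1, bottom_rows_to_skip=1)
print(header)
print(data)
```

With these settings the units row and the totals row are dropped, leaving only the two data rows under the header.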
- Trim Whitespace
Tells the component to trim whitespace from '_Text' columns.
Note: '_Text' columns are HTML-decoded, and whitespace trimming takes place after decoding. For example, if the HTML table has a cell with the content ' Hello World ' and Trim Whitespace is enabled, the output is 'Hello World'.
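The order of the two steps matters when the whitespace itself is entity-encoded. The Python sketch below shows the effect using the standard-library `html.unescape` function as a stand-in for the component's HTML decoding. The function names and the sample cell content are hypothetical.

```python
import html


def decode_then_trim(raw_cell_text):
    """Decode HTML entities first, then trim (the order described above)."""
    return html.unescape(raw_cell_text).strip()


def trim_then_decode(raw_cell_text):
    """Reverse order, shown only for comparison."""
    return html.unescape(raw_cell_text.strip())


# Hypothetical cell content whose leading and trailing spaces are
# written as the numeric entity &#32;.
raw = "&#32;Hello World&#32;"

print(repr(decode_then_trim(raw)))   # 'Hello World'
print(repr(trim_then_decode(raw)))   # ' Hello World '
```

Trimming after decoding removes whitespace that only becomes visible once the entities are decoded.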
- Refresh Columns
Populates the output columns using the settings on the General page.
- Expression fx icon
Click the blue fx icon to launch the SSIS Expression Editor, which enables dynamic updates of the property at run time.
- Generate Documentation Button
Click the Generate Documentation button to generate a Word document that describes the component's metadata, including relevant mappings.
Columns Page
The Columns page lets you add, remove, and reorder columns. Each row in the data grid view represents one column in the HTML table. Each row has four checkbox cells, which represent the following output columns:
- Text: This adds an output column with the name of the HTML table column suffixed with '_Text'. This column contains the inner text of the HTML table column (no tags). The name and data type properties can be adjusted in the property grid.
- HTML: This adds an output column with the name of the HTML table column suffixed with '_HTML'. This column contains all of the HTML inside the HTML table column.
- Links: This adds an output column with the name of the HTML table column suffixed with '_Links'. This column contains the href values of the anchor (<a>) tags inside the HTML table column, delimited by the pipe ( | ) character. To limit the result to a single value, specify the Link Position in the property grid.
- Images: This adds an output column with the name of the HTML table column suffixed with '_Images'. This column contains the src values of the image (<img>) tags inside the HTML table column, delimited by the pipe ( | ) character. To limit the result to a single value, specify the Image Position in the property grid.
- The Columns page grid consists of:
- Column Name: Column that will be retrieved.
- Properties window for the selected field:
- Name: Specify the column name.
- Data type: The data type can be changed accordingly.
- Length: Specify the length of the field. If the data type is a string, the length specified here is the maximum size. If the data type is not a string, the length is ignored.
- Precision: Specify the number of digits in a number.
- Scale: Specify the number of digits to the right of the decimal point in a number.
- CodePage: Specify the Code Page of the field.
- Link Position: Specify the position of the link value inside the list of href values.
- Image Position: Specify the position of the image value inside the list of src values.
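The relationship between the four output columns and the Link Position and Image Position properties can be sketched in Python with the standard-library `html.parser` module. This is an illustration under stated assumptions, not the component's implementation. `CellParser`, `pick`, and `cell_outputs` are hypothetical names, positions are assumed to be 1-based with 0 meaning "return the whole list", and the pipe delimiter is written without surrounding spaces.

```python
from html.parser import HTMLParser


class CellParser(HTMLParser):
    """Collects text, href values, and src values from one table cell."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href") is not None:
            self.links.append(attributes["href"])
        elif tag == "img" and attributes.get("src") is not None:
            self.images.append(attributes["src"])

    def handle_data(self, data):
        self.text_parts.append(data)


def pick(values, position=0):
    """Join values with '|', or return only the value at one position.

    Assumption: position is 1-based and 0 means "return the whole list".
    An out-of-range position yields an empty string in this sketch.
    """
    if position == 0:
        return "|".join(values)
    return values[position - 1] if position <= len(values) else ""


def cell_outputs(column_name, cell_html, link_position=0, image_position=0):
    """Build the four suffixed output values for one HTML table cell."""
    parser = CellParser()
    parser.feed(cell_html)
    parser.close()
    return {
        f"{column_name}_Text": "".join(parser.text_parts),
        f"{column_name}_HTML": cell_html,
        f"{column_name}_Links": pick(parser.links, link_position),
        f"{column_name}_Images": pick(parser.images, image_position),
    }


# Hypothetical cell content with two links and one image.
cell = '<a href="/a">Alpha</a> <a href="/b">Beta</a><img src="a.png">'

print(cell_outputs("Name", cell))
print(cell_outputs("Name", cell, link_position=2)["Name_Links"])
```

With no position set, the '_Links' value lists every href separated by a pipe; with a position of 2, only the second href is returned.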