Using the Premium PDF Source
The Premium PDF Source component is an SSIS source component that extracts data from tables in PDF files. The component has two pages that you configure to read from PDF files:
- General
- Columns
General Page
The General page of the Premium PDF Source allows you to specify the general settings of the component.
- Source File Settings
- Connection Manager
The Premium PDF Source component requires a connection manager to access the PDF file. The Connection Manager drop-down lists all connection managers that are available in your current SSIS package. The supported connection managers are listed below:
- Local File
- FTPS Connection Manager
- SFTP Connection Manager
- Amazon S3 Connection Manager
- Azure Blob Connection Manager
- Azure Data Lake Storage Connection Manager
- Azure Files Connection Manager
- Box Connection Manager
- Dropbox Connection Manager
- Hadoop Connection Manager
- Google Cloud Storage Connection Manager
- Google Drive Connection Manager (since v21.2)
- OneDrive Connection Manager
- WebDAV Connection Manager
- SharePoint Connection Manager (offered with the SSIS Integration Toolkit for Microsoft SharePoint)
- Source File Path
The Source File Path specifies the location of the PDF file that you want to read from. Click the ellipsis button ('...') to open a browse dialog and select a file.
- Password to Open
This option is used to specify the password to open the PDF file. If the PDF file is not encrypted, you can leave this field blank.
- Locale
Specify the Locale of the file.
- Configure Table Detection
- Combine Tables Strategy
This option specifies how tables are combined across multiple pages within the PDF file. The available options are:
- None
- CombineAcrossPages
- CombineAll (since v21.1)
- Skip Tables at Start: This option specifies how many tables to exclude at the beginning of the file.
- Skip Tables at End: This option specifies how many tables to exclude at the end of the file.
- Skip Empty Rows
Enabling this option will skip empty rows in the table.
- Right Shift Misaligned Cells
This option determines how misaligned cells are handled. The available choices are Shift, Shift to End, and Do Not Shift.
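The effect of Skip Tables at Start, Skip Tables at End, and Skip Empty Rows can be pictured with a small Python sketch. This is only an illustration of the behavior described above, not the component's implementation: the function names `select_tables` and `drop_empty_rows` are hypothetical, and the tables are assumed to be already extracted as lists of rows of strings.

```python
from typing import List, Optional

Row = List[Optional[str]]
Table = List[Row]


def select_tables(tables: List[Table], skip_start: int = 0, skip_end: int = 0) -> List[Table]:
    """Drop `skip_start` tables from the front and `skip_end` tables from the back.

    Hypothetical helper mirroring "Skip Tables at Start" / "Skip Tables at End".
    """
    end = len(tables) - skip_end
    # max() keeps the slice empty (rather than wrapping around) when more
    # tables are skipped than exist.
    return tables[skip_start:max(end, skip_start)]


def drop_empty_rows(table: Table) -> Table:
    """Remove rows whose cells are all None, empty, or whitespace-only.

    Hypothetical helper mirroring "Skip Empty Rows".
    """
    return [row for row in table if any(cell and cell.strip() for cell in row)]


# Three detected tables: a cover note, the table of interest, and a footer.
tables = [
    [["Cover page note"]],
    [["Id", "Name"], ["1", "Ann"], ["", " "], ["2", "Bob"]],
    [["Footer"]],
]

kept = select_tables(tables, skip_start=1, skip_end=1)
cleaned = [drop_empty_rows(t) for t in kept]
print(cleaned)  # [[['Id', 'Name'], ['1', 'Ann'], ['2', 'Bob']]]
```

In this sketch, skipping one table at each end leaves only the middle table, and the blank row inside it is dropped.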
- Configure Source
- Locate Table Strategy
This option specifies how the component locates the table in the PDF file that you want to extract data from.
- Column Header Row Index
This field is enabled when “Table Contains Column Names” is checked. Use it to specify the index of the row that contains the column headers.
- Table Contains Column Names
Enable this option if the table has a header row containing column names. When it is checked, the Column Header Row Index field becomes available.
- Data Start Row Index
Specify the index of the row at which the data starts in the table.
- Read to End
Enable this option to read until the end of the file.
- Max Number of Rows
This field is enabled when “Read to End” is unchecked. It specifies the maximum number of rows to read.
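How the Configure Source options interact can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the component's actual logic: the function `read_table` and its parameter names are hypothetical, row indexes are assumed to be zero-based (the component's own numbering may differ), and the generated names such as `Column1` for tables without headers are invented for the example.

```python
from typing import List, Tuple

Table = List[List[str]]


def read_table(
    table: Table,
    contains_column_names: bool = True,
    header_row_index: int = 0,      # assumed zero-based
    data_start_row_index: int = 1,  # assumed zero-based
    read_to_end: bool = True,
    max_rows: int = 0,
) -> Tuple[List[str], Table]:
    """Return (column names, data rows) for one detected table."""
    if contains_column_names:
        # "Column Header Row Index" only matters when the table has headers.
        names = list(table[header_row_index])
    else:
        # Hypothetical fallback names; the real component may name columns differently.
        width = len(table[0]) if table else 0
        names = [f"Column{i + 1}" for i in range(width)]

    rows = table[data_start_row_index:]
    if not read_to_end:
        # "Max Number of Rows" applies only when "Read to End" is unchecked.
        rows = rows[:max_rows]
    return names, rows


table = [
    ["Report 2023", ""],
    ["Id", "Name"],
    ["1", "Ann"],
    ["2", "Bob"],
    ["3", "Cy"],
]

names, rows = read_table(
    table,
    header_row_index=1,
    data_start_row_index=2,
    read_to_end=False,
    max_rows=2,
)
print(names)  # ['Id', 'Name']
print(rows)   # [['1', 'Ann'], ['2', 'Bob']]
```

Here the title row is skipped because the header sits at index 1 and data starts at index 2, and only two data rows are returned because Read to End is off.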
- Refresh Component
Clicking the Refresh Component button makes the component retrieve the latest metadata from the PDF file you have specified. Afterwards, a status message indicates how many fields have been updated, added, or deleted.
- Expression fx Button
Click the fx button to launch the SSIS Expression Editor, which lets you update the property dynamically at run time.
Columns Page
The Columns page of the Premium PDF Source Component shows you all available attributes from the source that you specified on the General page.
The checkbox at the top left of the grid toggles the selection of all available fields. This is a quick way to check or uncheck all of them at once.
The Columns Page grid consists of:
- PDF Field: Column that will be retrieved from the PDF File.
- Data Type: The data type of this field.
- Properties window for the field listed
- Name: Specify the column name.
- Data Type: Change the data type of the field as needed.
- Length: Specify the length of the field.
- Column Header Name Or Index: Specify the column header name or index.
- Has Multiple Lines: Boolean field that can be chosen to specify if there are multiple lines in the field.
- + sign: Add field to PDF File.
- - sign: Remove field from PDF File.
- Arrows: Move the fields to a desired location in the file.
Preview
The Preview button at the bottom of the Premium PDF Source component can be used to preview the table that was detected in your PDF file based on the configured settings.
- Combine Tables Strategy
This option specifies how tables are combined across multiple pages within the PDF file. The available options are:
- None
- CombineAcrossPages
- CombineAll (since v21.1)
- Skip Tables at Start: This option specifies how many tables to exclude at the beginning of the file.
- Skip Tables at End: This option specifies how many tables to exclude at the end of the file.
- Skip Empty Rows
Enable this option to skip empty rows in the table.
- Right Shift Misaligned Cells
This option determines how misaligned cells are handled. The available choices are Shift, Shift to End, and Do Not Shift.
- Multi-line Date Columns
Select the columns that are multi-line fields.
- Apply Settings to Component
This option applies the settings changed on the Preview page to the component.
- Close
Closes the Preview page.