Amazon S3 OCF Connector: Install and Configure¶
Applies to Alation Cloud Service and customer-managed instances of Alation.
Metadata extraction from an Amazon S3 file system source uses the inventory reports generated for the Amazon S3 buckets. Alation reads the inventory reports from the destination bucket and streams the metadata into the catalog.

Required Information¶
The following information is required for configuring the Amazon S3 file system source in Alation:
Authentication information (see Authentication).
Bucket to store the inventory for Alation—This destination bucket is created to store the inventory for performing the metadata extraction (see Configure Inventory).
Lambda function (Optional)—Setting up the Lambda (inventory) function is required only if you plan to use incremental extraction (see Incremental MDE Setup).
Authentication¶
The OCF connector for Amazon S3 supports several authentication methods:
Basic Authentication¶
Basic authentication requires an AWS IAM user and the access key ID and secret access key for this user.
To use basic authentication to connect from Alation, create an AWS IAM user account for Alation and save the values of the access key and secret access key to a secure location.
Grant the IAM user the required permissions (see Permissions for IAM User Account below).
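If you prefer the AWS CLI to the console for this step, the following is a minimal sketch; the user name alation-s3-connector is only an example.
aws iam create-user --user-name alation-s3-connector
aws iam create-access-key --user-name alation-s3-connector
The create-access-key output contains the access key ID and secret access key that you need to record.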
Permissions for IAM User Account¶
Give the user read-only access to the source buckets (buckets that store the data) and the destination bucket (created for the inventory).
Note
If you do not require incremental MDE, you can provide read-only access to the destination bucket only.
For column extraction, read-only access is required to the source buckets and the destination buckets even if incremental MDE is turned off.
You can restrict the Resource element to only the destination bucket and the source buckets instead of the wildcard shown in the example below. Refer to the Amazon S3 documentation for more information.
Example:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:Get*",
        "s3:List*"
      ],
      "Resource": "*"
    }
  ]
}
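If you scope the policy down instead of using the wildcard resource, a restricted policy might look like the following sketch; the bucket names are placeholders for your own destination and source buckets.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:Get*",
        "s3:List*"
      ],
      "Resource": [
        "arn:aws:s3:::alation-destination-bucket",
        "arn:aws:s3:::alation-destination-bucket/*",
        "arn:aws:s3:::my-source-bucket",
        "arn:aws:s3:::my-source-bucket/*"
      ]
    }
  ]
}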
STS Authentication with an IAM User¶
STS User¶
Note
File sampling is not supported for STS authentication.
Perform the following steps to create an STS user:
Create a role in AWS IAM and assign a policy that grants the following access:
Read-only access to the source buckets and the destination bucket to set up a connection in Alation and perform the metadata extraction (MDE).
Note
If you do not require incremental MDE, you can provide read-only access to the destination bucket only.
For column extraction, read-only access is required to the source buckets and the destination buckets even if incremental MDE is turned off.
Create a user in AWS IAM with the following policy:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", }, "Action": "sts:AssumeRole", "Resource": "{ARN_OF_THE_ROLE}" ] }Go to the Trust Relationship section in the role properties and add trusted entities as shown below:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "AWS": "{ARN_OF_THE_USER_CREATED_IN_STEP_2_OF_THIS_SECTION" }, "Action": "sts:AssumeRole" } ] }
STS Authentication with an AWS IAM Role¶
STS authentication with an AWS IAM role does not require an IAM user. This authentication method uses an instance profile that assumes a role allowing access to Amazon resources. This authentication method works for authenticating across AWS accounts.
Note
This authentication method is available with connector version 3.8.5.6552 or newer.
To configure STS authentication with an AWS IAM role, use the steps in Configure Authentication via AWS STS and an IAM Role. To provide access to the data source via an IAM role, use the permissions information in Permissions for IAM User Account.
Configure Inventory¶
The destination bucket can be created either manually, using the steps in this section, or with the Terraform script. Refer to Terraform Configuration.
Step 1: Create a Destination Bucket to Store Inventory¶
In S3, create a bucket to be used as the destination bucket by Alation. Create this bucket in the same AWS region as your source buckets. This bucket should not be used for any other purpose or store any other content. Alation expects that the destination bucket only stores the inventory reports to be ingested into the catalog.
Example:
alation-destination-bucket
Create a folder called inventory inside the destination bucket.
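As a sketch of the same steps with the AWS CLI (the bucket name and region are examples; use the region of your source buckets):
aws s3 mb s3://alation-destination-bucket --region us-west-2
aws s3api put-object --bucket alation-destination-bucket --key inventory/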
Step 2: Configure the S3 Inventory¶
For every source bucket that you want to be ingested into the catalog, configure the S3 inventory. Use the steps in Configuring inventory using the S3 console in AWS documentation to perform this configuration. Make sure to follow these recommendations:
Select the destination bucket with inventory folder path: s3://<Destination_Bucket_Name>/inventory
Set Frequency to Daily or Weekly. Make sure that the frequency matches the metadata extraction schedule.
Set Output format to CSV.
Set Status to Enable.
Set Server-side encryption to Enable if you want to use server-side encryption; otherwise, set it to Disable.
Additional fields (Optional)
Size
Last modified
Note
Additional fields are optional. MDE will not fail even if the additional fields are not configured.
Important
It may take up to 48 hours to deliver the first report into the destination bucket. After the inventory reports for all the buckets that you want ingested in Alation have been delivered to your destination bucket, you can proceed with the configuration on the Alation side.
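If you prefer to script the inventory configuration, the same settings can be applied with the AWS CLI; the bucket names and inventory ID below are placeholders, and inventory-config.json mirrors the console settings above.
aws s3api put-bucket-inventory-configuration --bucket my-source-bucket --id alation-inventory --inventory-configuration file://inventory-config.json
Example inventory-config.json:
{
  "Id": "alation-inventory",
  "IsEnabled": true,
  "IncludedObjectVersions": "Current",
  "Destination": {
    "S3BucketDestination": {
      "Bucket": "arn:aws:s3:::alation-destination-bucket",
      "Prefix": "inventory",
      "Format": "CSV"
    }
  },
  "Schedule": { "Frequency": "Daily" },
  "OptionalFields": ["Size", "LastModifiedDate"]
}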
Server-Side Encryption Configurations¶
Alation supports the SSE-S3 and SSE-KMS encryption types. If you are using either of these encryption methods, you need to perform additional configuration in Amazon S3. See the Server-Side Encryptions documentation for more information.
Scale Amazon S3 Connector¶
You can use various approaches to scale the Amazon S3 OCF connector while setting it up. For more information, refer to Scale Amazon S3 OCF Connector.
Install the Connector¶
Alation On-Prem¶
Important
Installation of OCF connectors requires Alation Connector Manager to be installed as a prerequisite.
If this has not been done on your instance, install the Connector Manager: Install Alation Connector Manager.
Make sure that the connector Zip file which you received from Alation is available on your local machine.
Install the connector on the Connectors Dashboard page. Refer to Manage Connector Dashboard for details.
Alation Cloud Service¶
Note
OCF connectors require Alation Connector Manager. Alation Connector Manager is available by default on all Alation Cloud Service instances and there is no need to separately install it.
Make sure that the OCF connector Zip file that you received from Alation is available on your local machine.
Install the connector on the Connectors Dashboard page: refer to Manage Connector Dashboard.
Create and Configure a New S3 File System Source¶
Log in to the Alation instance and add a new S3 source by clicking Apps > Sources > Add > File System.
From the File System Type dropdown, select S3 OCF Connector.
Provide a Title for the file system and click Add File System. You will be taken to the Settings page of your new S3 OCF file system.

Access¶
On the Access tab, set the file system visibility as follows:
Public File System - The file system will be visible to all users of the catalog.
Private File System - The file system will be visible only to users who are given access by the file system Admins.
Add new File System Admin users in the File System Admins section.
General Settings¶
Note
This section describes configuring settings for credentials and connection information stored in the Alation database. If your organization has configured Azure KeyVault or AWS Secrets Manager to hold such information, the user interface for the General Settings page changes to include vault-related icons to the right of most options.
By default, the database icon is selected. In the vault case, instead of the actual credential information, you enter the ID of the secret. See Configure Secrets for OCF Connector Settings for details.
Perform the configuration on the General Settings tab.
Connector Settings¶
Note
The proxy fields mentioned in the following table are applicable from connector version 3.7.0. This connector supports Basic Proxy and Auth Proxy modes.
Parameter | Description
---|---
File System Connection |
Region | Specify the AWS region. Example: us-east-1. To use FIPS endpoints for GovCloud, prefix the region with fips-. Example: fips-us-east-1
Basic Authentication | Default. Leave the Basic Authentication radio button selected if you are configuring basic authentication.
STS Authentication | Select the STS Authentication radio button if you are configuring STS authentication with an IAM user. If you are going to Configure STS Authentication with an AWS IAM Role, disregard the Basic Authentication and STS Authentication radio buttons: they do not apply.
Proxy Host | Specify the proxy host to access S3 via a proxy server. Use this optional field only if S3 is connected through a proxy. This field is required for Basic Proxy and Auth Proxy modes.
Proxy Port | Specify the proxy port number. Use this optional field only if S3 is connected through a proxy. This field is required for Basic Proxy and Auth Proxy modes.
Proxy Username | Specify the proxy username. Use this optional field only if S3 is connected through a proxy. This field is required only for Auth Proxy mode.
Proxy Password | Specify the proxy password. Use this optional field only if S3 is connected through a proxy. This field is required only for Auth Proxy mode.
Configure Basic Authentication¶
If you selected the Basic Authentication radio button, specify the information in the Basic Authentication section. Save the values by clicking Save.
Refer to Basic Authentication for more information about this authentication method.
Parameter | Description
---|---
AWS Access Key ID | Provide the AWS access key ID of the IAM user with basic authentication access. Make sure that the IAM user has access to the destination bucket.
AWS Access Key Secret | Provide the AWS secret access key.
Configure STS Authentication¶
If you selected the STS Authentication radio button, specify the information in the STS Authentication section. Save the values by clicking Save.
Refer to STS Authentication for more information about this authentication method.
Parameter | Description
---|---
STS: AWS Access Key ID | Provide the AWS access key ID of the IAM user with STS authentication access. Make sure that the IAM user has access to the destination bucket.
STS: AWS Access Key Secret | Provide the AWS secret access key.
Role ARN | Provide the IAM role to assume with the required permissions.
STS Duration | Provide the duration of the role session.
Region-Specific Endpoint | Select the Region-Specific Endpoint checkbox to use regional endpoints for the STS request. If this checkbox is cleared, the global endpoint is used for the STS request.
Configure STS Authentication with an AWS IAM Role¶
To use STS authentication with an AWS IAM role, specify the information in the IAM Role Authentication section of General Settings. Save the values by clicking Save.
Refer to STS Authentication with an AWS IAM Role for more information about this type of authentication.
Parameter | Description
---|---
Auth Type | Select AWS IAM.
Authentication Profile | Select the authentication profile you created in Admin Settings.
Role ARN | Provide the ARN of the role that gives access to the Amazon resource.
External ID | Provide the External ID you added to the role that gives access to the Amazon resource.
STS Duration | Provide the STS token duration in seconds. This value must be less than or equal to the Maximum session duration of the IAM role that provides access to the Amazon resource(s).
Logging Configuration¶
Parameter | Description
---|---
Log Level | Select the Log Level to generate logs. The available log levels are based on the log4j framework.
Test Connection¶
Click Test to validate network connectivity.
Deleting the Data Source¶
You can delete your data source from the General Settings tab. Under Delete Data Source, click Delete to delete the data source connection.
Metadata Extraction¶
You can perform a full extraction or, with additional configuration on the Amazon S3 side, an incremental extraction.

Connector Settings¶
Metadata/Schema Extraction Configuration¶
Specify the Metadata/Schema extraction configuration settings and click Save:
Parameter | Description
---|---
Destination Bucket Name | Provide the name of the destination bucket that hosts the inventory reports. Note: After the inventory is configured, it takes 24 to 48 hours for the first inventory report to be generated. If you run MDE before the inventory report is generated, Alation will not extract any data.
Incremental Sync | Select this checkbox to extract only new object additions or deletions. The first extraction is always a full metadata extraction. Important: Make sure that the necessary configuration is in place in S3 before you enable the Incremental Sync checkbox. The required S3 configuration can be done in two ways: manually (see Incremental MDE Setup) or with the Terraform script (see Terraform Configuration).
Schema Extraction Configuration¶
Specify the schema extraction configuration settings and click Save:
Parameter | Description
---|---
CSV File Delimiter | Select the CSV file delimiter used within the CSV files in the file system source from the dropdown. The default delimiter is COMMA.
Use Schema Path Pattern | Enable the Use Schema Path Pattern checkbox to extract schema (columns/headers) for folders matching the pattern. When Use Schema Path Pattern is enabled, the schema extraction job will not match any individual CSV/PSV/TSV/Parquet files; it will only match the files that are valid for the given schema path pattern.
Schema Path Pattern | Provide the Schema Path Pattern for schema extraction. For more information, refer to Schema Path Pattern.
Selective Extraction¶
On the Metadata Extraction tab, you can select the buckets to include or exclude from extraction. Enable the Selective Extraction toggle if you want only a subset of buckets to be extracted.
To extract only select buckets:
Click Get List of Buckets to first fetch the list of buckets. The status of this action will be logged in the Extraction Job Status table at the bottom of the Metadata Extraction tab.
When bucket synchronization is complete, a drop-down list with the available buckets will become enabled.
Select one or more buckets as required.
Choose the appropriate filter option. The available filter options are described below:
Filter Option | Description
---|---
Extract all Buckets except | Extract metadata from all buckets except the selected buckets.
Extract only these Buckets | Extract metadata only from the selected buckets.
Click Run Extraction Now to extract metadata. The status of the extraction action is also logged in the Job History table at the bottom of the page.
Extraction Scheduler¶
Automated Extraction¶
If you wish to automatically update the metadata extracted into the catalog, under Automated and Manual Extraction, turn on the Enable Automated Extraction switch and select the day and time when metadata must be extracted. Metadata extraction will then run automatically on the selected schedule.
Schema Extraction¶
Click Run Schema Extraction to extract schema (columns/headers) for Parquet, PSV, TSV, and CSV files. For CSV files, select the applicable delimiter in the input field in Metadata/Schema Extraction Configuration.
After you have extracted the bucket content, you can additionally extract the schema information for the CSV and Parquet files. Make sure that you have selected the correct delimiter in the CSV File Delimiter field under Connector Settings.
Sampling¶
The connector supports on-demand, end-user-driven sampling of files. File sampling is supported for Parquet, PSV, TSV, and CSV file formats, and with Basic Authentication (access key and secret key) and SSO Authentication. The Amazon S3 user must have read access to the file to be sampled.
Successful execution of schema extraction is a prerequisite for seeing file samples.
When schema extraction is run with a Schema Path Pattern, folders are identified as logical schemas. In this case, end users can initiate sampling for folders identified as logical schemas, with the relevant columns cataloged for the folder.
Note
STS authentication is not supported for file sampling.
Basic Authentication for Sampling¶
To perform sampling:
Go to the Samples tab on the catalog page of the extracted CSV or Parquet file and click Credential Settings.
Click Select > Add New.
Select Basic Authentication from the Authentication Type dropdown. Provide the Credential Name, AWS Access Key ID, and AWS Access Key Secret. Click Save.
Click Authenticate and you will be redirected to the catalog page.
Click the Run Sample button on the catalog page to see sampled data.
SSO Authentication for Sampling¶
Prerequisites¶
Create and configure the application in the IdP; refer to Create an Authentication Application for Alation in the IdP.
Configure the IdP in AWS; refer to Create an Identity Provider in AWS.
Configure the auth plug-in in Alation; refer to AWS IAM.
Make sure that you provide the correct Config Name that was used while setting up the IdP application.
Make sure that you specify the SAML 2.0 redirect URL from the IdP application.
If you are using the proxy connection, perform the following:
Enter the Alation shell:
sudo /etc/init.d/alation shell
Run the following command to list the existing extra_flags for the auth server. This command returns the Existing_Config; make a note of the Existing_Config value:
alation_conf authserver.extra_flags
If you are using the Basic Proxy mode, run the following command to set the proxyHost and proxyPort. Replace <Existing_Config> with the value that you noted down in the previous step:
alation_conf authserver.extra_flags -s " <Existing_Config> -Dhttps.proxyHost=<Proxy_Host> -Dhttp.proxyHost=<Proxy_Host> -Dhttps.proxyPort=<Proxy_Port_Number> -Dhttp.proxyPort=<Proxy_Port_Number>"
If there is no Existing_Config, run the following command:
alation_conf authserver.extra_flags -s " -Dhttps.proxyHost=<Proxy_Host> -Dhttp.proxyHost=<Proxy_Host> -Dhttps.proxyPort=<Proxy_Port_Number> -Dhttp.proxyPort=<Proxy_Port_Number>"
If you are using the Auth Proxy mode, run the following command to set the proxyHost, proxyPort, proxyUser, and proxyPassword. Replace <Existing_Config> with the value that you noted down earlier:
alation_conf authserver.extra_flags -s " <Existing_Config> -Dhttps.proxyHost=<Proxy_Host> -Dhttp.proxyHost=<Proxy_Host> -Dhttps.proxyPort=<Proxy_Port_Number> -Dhttp.proxyPort=<Proxy_Port_Number> -Dhttps.proxyUser=<Proxy_Username> -Dhttp.proxyUser=<Proxy_Username> -Dhttps.proxyPassword=<Proxy_Password> -Dhttp.proxyPassword=<Proxy_Password>"
If there is no Existing_Config, run the following command:
alation_conf authserver.extra_flags -s " -Dhttps.proxyHost=<Proxy_Host> -Dhttp.proxyHost=<Proxy_Host> -Dhttps.proxyPort=<Proxy_Port_Number> -Dhttp.proxyPort=<Proxy_Port_Number> -Dhttps.proxyUser=<Proxy_Username> -Dhttp.proxyUser=<Proxy_Username> -Dhttps.proxyPassword=<Proxy_Password> -Dhttp.proxyPassword=<Proxy_Password>"
Restart the auth server.
Note
Before restarting, make sure that there are no ongoing jobs, since restarting will affect the ongoing jobs.
alation_supervisor restart java:authserver
Perform Sampling¶
To perform sampling:
Go to the Samples tab on the catalog page of the extracted CSV or Parquet file and click Credential Settings.
Click Select > Add New.
Select SSO Authentication from the Authentication Type dropdown. Provide the Credential Name and choose the relevant Plugin Config Name. Click Save.
Click Authenticate and you will be redirected to the IdP login page.
Log in to the IdP with your IdP credentials.
Click the Run Sample button on the catalog page to see sampled data.
Incremental MDE Setup¶
The configuration described in this section is required only if you want to perform incremental extraction after the first full MDE.
We recommend using incremental sync based on the following considerations:
To perform incremental MDE, set up a cron or scheduled job that runs the extraction on a daily basis. If the job runs weekly or bi-weekly, incremental extraction has to process the events for the last 7 or 15 days, which slows it down. Do not use incremental sync if the cron or scheduled job runs weekly or bi-weekly.
Incremental extraction is recommended when a bucket contains a very large number of objects and only minimal changes are made to it on a daily basis.
The time taken to perform incremental extraction depends on the number of events and objects. We recommend enabling or disabling incremental sync based on the following benchmarks (a rough worked example follows this list):
It may take about 85 minutes to process 100K incremental events
In the non-incremental (full extraction) case, it may take 5-6 hours to process 50M objects
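As a rough, order-of-magnitude illustration of these benchmarks (a linear extrapolation, not a guarantee): a bucket with tens of millions of objects but only about 10K object-level changes per day would take roughly 8-10 minutes per incremental run, whereas a full extraction of 50M objects could take 5-6 hours. The larger the daily change volume, the smaller the advantage of incremental sync.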
Setup Required in Amazon S3 for Incremental MDE¶
Step 1: Create an IAM Role for Lambda Function¶
Create an IAM role for the Lambda use case with the AWS managed policy AWSLambdaBasicExecutionRole, and use the following example policy to provide write access to the destination bucket created in Step 1: Create a Destination Bucket to Store Inventory. Refer to Lambda execution role in the AWS documentation.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::alation-destination-bucket/*"
    }
  ]
}
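A minimal CLI sketch of this step follows; the role name and file names are hypothetical, trust-policy.json is the standard Lambda trust policy (allowing lambda.amazonaws.com to assume the role), and destination-write.json contains the policy shown above.
aws iam create-role --role-name alation-incremental-lambda-role --assume-role-policy-document file://trust-policy.json
aws iam attach-role-policy --role-name alation-incremental-lambda-role --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
aws iam put-role-policy --role-name alation-incremental-lambda-role --policy-name alation-destination-write --policy-document file://destination-write.json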
Step 2: Create the Lambda Function to Write Event Notifications to the Destination Bucket¶
Open the Lambda service in the AWS console and follow the steps below to create the Lambda function in the same region as the destination bucket:
Select Create Function.
Select the Use a blueprint option and the s3-get-object-python template. Click Configure.
Enter the Function name.
Example: capture-events-for-alation.
In Execution role, choose Use an existing role and select the IAM role created in Step 1. Click Create function.
Replace the default function code with the code provided below. Make sure to use the correct destination bucket name.
Replacement code:
""" Copyright (C) Alation - All Rights Reserved """ import json import boto3 import hashlib from datetime import datetime s3 = boto3.client('s3') def lambda_handler(event, context): print("Received event: " + json.dumps(event, indent=2)) bucket = event['Records'][0]['s3']['bucket']['name'] # Using md5 hash of a filepath to store as a key filePath = bucket + "/" + event['Records'][0]['s3']['object']['key'] print("File Path: " + filePath) key = hashlib.md5(filePath.encode()).hexdigest() print("Key: " + key) date = datetime.utcnow().strftime("%Y-%m-%d") try: response = s3.put_object( Body=json.dumps(event), # Update the destination bucket name Bucket='alation-destination-bucket', Key='incremental_sync/' + bucket +'/' + date + '/' + key + '.json', ) print(response) except Exception as e: print(e) raise e
Click Deploy after replacing the code.
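For reference, the function above stores each event under incremental_sync/<source_bucket>/<date>/<md5_of_file_path>.json in the destination bucket. The following small sketch shows how the key is derived; the bucket, object path, and date are examples only.
import hashlib

# "<bucket>/<object key>", as built by the Lambda function above
file_path = "my-source-bucket/data/file.csv"
# incremental_sync/<source bucket>/<UTC date>/<md5 of file_path>.json
key = "incremental_sync/my-source-bucket/2024-05-01/" + hashlib.md5(file_path.encode()).hexdigest() + ".json"
print(key)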
Step 3: Create Event Notification Configuration¶
Open the source buckets where you want to create an event configuration. For each bucket, perform the following:
Go to Properties > Event Notifications and click Create event notification.
Provide the Event name.
Select the type of events that need to be captured from this source bucket.
In Destination, select Lambda function, and under Specify Lambda function, select Choose from your Lambda functions. From the dropdown list, select the Lambda function created in Step 2: Create the Lambda Function to Write Event Notifications to the Destination Bucket.
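The same event notification can be attached with the AWS CLI; the account ID, region, bucket name, and function name below are placeholders. S3 must also be granted permission to invoke the function.
aws lambda add-permission --function-name capture-events-for-alation --statement-id s3-invoke --action lambda:InvokeFunction --principal s3.amazonaws.com --source-arn arn:aws:s3:::my-source-bucket
aws s3api put-bucket-notification-configuration --bucket my-source-bucket --notification-configuration '{"LambdaFunctionConfigurations": [{"LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:capture-events-for-alation", "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"]}]}'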
Limitations¶
If you have any existing event notifications set up on the source buckets, you will not be able to use incremental extraction.
The last modification time is not displayed for folders.
For incremental extraction, the last modified timestamp displayed for files captured through incremental events is the time of the event.
After migration, the owner details stay as is and are not removed.
Folders at the last level of a directory that have no name and are empty cannot be extracted.
Native to OCF Migration¶
Troubleshooting¶
Refer to Troubleshooting.
Terraform Configuration¶
AWS error while executing the script:
Make sure that you have provided valid permissions to the user that is used in the script
Bucket missing error:
Make sure the buckets listed in the source-bucket-list.json file exist in S3
Access denied error:
Make sure you have the required access while executing the script
Make sure you have specified the correct region
Cannot create a bucket error:
Make sure you have the required permissions and have given a unique name to target_bucket_for_inventory
Cannot create a Lambda function error:
Make sure you have the required permissions and have given a unique name to lambda_function_name
Alation Configuration¶
Test connection issues:
Make sure the AWS access key and secret key are correct
Make sure you have provided the correct region
Make sure the access key ID and secret access key have the access described in Permissions for IAM User Account
No inventory reports found:
Make sure that the destination bucket is correct
If the destination bucket is correct, wait 48 hours after setting up the inventory for the inventory reports to be generated and try again
MDE or filter extraction failure due to access issues:
Make sure the user has access to the destination bucket
Duplicate columns when a column name contains special characters:
After you upgrade the Amazon S3 OCF connector to version 3.8.6 or later, you may notice duplicate columns for columns with names containing special characters. To remove the duplicates, enable Remove Buckets from the catalog that are not captured by the lists below before you run the next schema extraction.
Note
This will also remove all the buckets that are not part of the selective extraction list.
MDE or bucket list error:
Error:
File is initialized without providing a rootPath
File path can't be null or empty
Troubleshooting:
Do not reuse buckets for the destination inventory bucket for this connector
Ensure that all buckets in the destination bucket are named properly
Verify the inventory configuration for each source bucket you are ingesting
If there is already a bucket path that has a blank name or a forward slash (/) in the destination inventory bucket, remove it, and review the inventory configuration for each source bucket
Lambda Function Logs¶
In the AWS console, go to the Lambda Functions section.
Select the Lambda function that you created and go to the Monitor section.
Click View logs in CloudWatch. The logs open in a new tab.
In CloudWatch, search for a particular log based on a date or prefix.
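If you prefer the CLI, the function logs can also be queried directly; the function name is the example from Step 2, the log group follows the standard /aws/lambda/<function_name> convention, and --start-time is an epoch timestamp in milliseconds.
aws logs filter-log-events --log-group-name /aws/lambda/capture-events-for-alation --filter-pattern "ERROR" --start-time 1714521600000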