AWS S3 OCF Connector: Install and Configure¶
The metadata extraction uses the inventory reports for the S3 buckets. Alation extracts the inventory reports from the destination bucket and streams the metadata to the catalog.

Required Information¶
The following information is required for configuring the S3 file system source in Alation:
AWS access key ID
AWS access key secret
Bucket to store the inventory for Alation - This destination bucket stores the inventory reports used for metadata extraction.
Lambda function (Optional) - Setting up a Lambda function is required only if you plan to use incremental extraction.
Prerequisites¶
To connect to S3 from Alation, you will need to perform a number of configurations in AWS S3, described in the following sections.
Configuration in S3¶
The destination bucket can be created either manually, using the steps provided in this section, or by using the Terraform script. Refer to Terraform Configuration.
Step 1: Create a Destination Bucket to Store Inventory¶
In AWS S3, create a bucket to be used as the destination bucket by Alation. Create this bucket in the same AWS region as your source buckets. This bucket should not be used for any other purpose or store any other content. Alation expects that the destination bucket only stores the inventory reports to be ingested into the catalog.
Example:
alation-destination-bucket
Create a folder called inventory inside the destination bucket.
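If you prefer to script this step, the following is a minimal boto3 sketch. The bucket name matches the example above; the region is a placeholder that you should replace with the region of your source buckets.

```python
import boto3

region = "us-east-1"  # assumption: use the same region as your source buckets
s3 = boto3.client("s3", region_name=region)

# Create the destination bucket. us-east-1 must not pass a LocationConstraint.
if region == "us-east-1":
    s3.create_bucket(Bucket="alation-destination-bucket")
else:
    s3.create_bucket(
        Bucket="alation-destination-bucket",
        CreateBucketConfiguration={"LocationConstraint": region},
    )

# Create the inventory "folder" (an S3 folder is a zero-byte object with a trailing slash).
s3.put_object(Bucket="alation-destination-bucket", Key="inventory/")
```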
Step 2: Configure the S3 Inventory¶
For every source bucket that you want to be ingested into the catalog, configure the S3 inventory. Use the steps in Configuring inventory using the S3 console in the AWS documentation to perform this configuration. Make sure to follow these recommendations (a scripted sketch of this configuration follows the notes below):
Select the destination bucket with the inventory folder path: s3://<Destination_Bucket_Name>/inventory
Set Frequency to Daily or Weekly. Make sure that the frequency matches the metadata extraction schedule.
Set Output Format to CSV.
Set Status to Enable.
Set Server-side encryption to Disable.
Additional fields (Optional)
Size
Last modified
Note
Additional fields are optional. MDE will not fail even if the additional fields are not configured.
Important
It may take up to 48 hours to deliver the first report into the destination bucket. After the inventory reports for all the buckets that you want ingested in Alation have been delivered to your destination bucket, you can proceed with the configuration on the Alation side.
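If you prefer to configure the inventory programmatically, the following boto3 sketch mirrors the recommendations above (a daily CSV inventory delivered to the inventory prefix of the destination bucket, with the optional Size and Last modified fields). The source bucket name and the inventory configuration Id are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Inventory configuration for one source bucket (repeat per source bucket).
s3.put_bucket_inventory_configuration(
    Bucket="my-source-bucket",              # placeholder source bucket
    Id="alation-inventory",                 # placeholder configuration Id
    InventoryConfiguration={
        "Id": "alation-inventory",
        "IsEnabled": True,                  # Status: Enable
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"}, # match your metadata extraction schedule
        "OptionalFields": ["Size", "LastModifiedDate"],  # optional additional fields
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::alation-destination-bucket",
                "Prefix": "inventory",
                "Format": "CSV",
                # No Encryption block, so server-side encryption stays disabled.
            }
        },
    },
)
```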
Create an AWS S3 User¶
Basic Authentication User¶
Create a user in AWS S3 with read-only access to the source buckets (buckets that store the data) and the destination bucket (created for the inventory) to set up a connection in Alation and perform the metadata extraction (MDE).
Note
If you do not require incremental MDE, you can provide read-only access to the destination bucket only.
For column extraction, read-only access is required to the source buckets and the destination bucket even if incremental MDE is turned off.
You can set the resources according to your need and keep only the destination bucket and the source buckets instead of the wildcard used in the example below; a restricted-resource sketch follows the example. Refer to the AWS S3 documentation for more information.
Example:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:List*"
            ],
            "Resource": "*"
        }
    ]
}
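If you prefer to scope the policy down rather than use the wildcard resource above, the following boto3 sketch attaches an inline read-only policy restricted to the destination and source buckets. The user name, policy name, and bucket names are placeholders; the separate s3:ListAllMyBuckets statement is an assumption added so that bucket listing can still enumerate buckets in the account.

```python
import json
import boto3

iam = boto3.client("iam")

# Read-only policy scoped to the destination and source buckets (placeholder names).
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:Get*", "s3:List*"],
            "Resource": [
                "arn:aws:s3:::alation-destination-bucket",
                "arn:aws:s3:::alation-destination-bucket/*",
                "arn:aws:s3:::my-source-bucket",
                "arn:aws:s3:::my-source-bucket/*",
            ],
        },
        {
            # Assumption: allows listing the buckets in the account.
            "Effect": "Allow",
            "Action": "s3:ListAllMyBuckets",
            "Resource": "*",
        },
    ],
}

# Attach the policy as an inline policy on the IAM user created for Alation.
iam.put_user_policy(
    UserName="alation-s3-user",        # placeholder user name
    PolicyName="alation-s3-readonly",  # placeholder policy name
    PolicyDocument=json.dumps(policy),
)
```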
STS User¶
Note
File sampling is not supported for STS authentication.
Perform the following steps to create an STS user:
Create a user in AWS without any permissions assigned.
Create a role in AWS and assign a policy that grants the following access:
Read-only access to the source buckets and the destination bucket to set up a connection in Alation and perform the metadata extraction (MDE).
Note
If you do not require incremental MDE, you can provide read-only access to the destination bucket only.
For column extraction, read-only access is required to the source buckets and the destination bucket even if incremental MDE is turned off.
Go to the Trust Relationship section of the role and add the trusted entities as shown below:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "{PUT_ARN_OF_THE_USER_CREATED_IN_STEP_1_OF_THIS_SECTION}"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
Configuration in Alation¶
Step 1: Install the Connector¶
Alation On-Prem¶
Important
Installation of OCF connectors requires Alation Connector Manager to be installed as a prerequisite.
If this has not been done on your instance, install the Connector Manager: Install Alation Connector Manager.
Make sure that the connector Zip file which you received from Alation is available on your local machine.
Install the connector on the Connectors Dashboard page. Refer to Manage Connector Dashboard for details.
Alation Cloud Service¶
Note
OCF connectors require Alation Connector Manager. Alation Connector Manager is available by default on all Alation Cloud Service instances and there is no need to separately install it.
Make sure that the OCF connector Zip file that you received from Alation is available on your local machine.
Install the connector on the Connectors Dashboard page: refer to Manage Connector Dashboard.
Step 2: Create and Configure a New S3 File System Source¶
Log in to the Alation instance and add a new S3 source by clicking on Apps > Sources > Add > File System.
From the File System Type dropdown, select S3 OCF Connector.
Provide a Title for the file system and click on Add File System. You will be navigated to the Settings page of your new S3 OCF file system.

Access¶
On the Access tab, set the file system visibility as follows:
Public File System - The file system will be visible to all users of the catalog.
Private File System - The file system will be visible only to users who are given access to the file system by File System Admins.
Add new File System Admin users in the File System Admins section.
General Settings¶
Perform the configuration on the General Settings tab:
Specify Connector Settings:
| Parameter | Description |
|---|---|
| **File System Connection** | |
| Region | Specify the AWS S3 region. Example: us-east-1. To use FIPS endpoints for GovCloud, prefix the region with fips-. Example: fips-us-east-1 |
| Basic Authentication | Select the Basic Authentication radio button if basic authentication access was provided to your AWS user. |
| STS Authentication | Select the STS Authentication radio button if Security Token Service (STS) authentication access was provided to the role of your AWS user. |
| **Basic Authentication** (applicable only if the Basic Authentication radio button is selected) | |
| AWS Access Key ID | Provide the AWS Access Key ID of the IAM user with basic authentication access. Make sure that the IAM user has access to the destination bucket. |
| AWS Access Key Secret | Provide the AWS Access Key Secret. |
| **STS Authentication** (applicable only if the STS Authentication radio button is selected) | |
| STS: AWS Access Key ID | Provide the AWS Access Key ID of the IAM user with STS authentication access. Make sure that the IAM user has access to the destination bucket. |
| STS: AWS Access Key Secret | Provide the AWS Access Key Secret. |
| Role ARN | Provide the ARN of the IAM role to assume with the required permissions. |
| STS Duration | Provide the duration of the role session. |
| Region-Specific Endpoint | Select the Region-Specific Endpoint checkbox to use regional endpoints for STS requests. If this checkbox is unselected, the global endpoint is used for STS requests. |
| **Logging Information** | |
| Log Level | Select the Log Level to generate logs. The available log levels are based on the log4j framework. |
Click Save.
Obfuscate Literals - Enable this toggle to hide the details of the queries in the catalog page that are ingested via QLI or executed in Compose. This toggle is disabled by default.
Under Test Connection, click Test to validate network connectivity.
Deleting the Data Source¶
You can delete your data source from the General Settings tab. Under Delete Data Source, click Delete to delete the data source connection.
Metadata Extraction¶
You can perform a full extraction or incremental extraction with the help of additional configuration on the AWS S3 side.

Connector Settings¶
Metadata/Schema Extraction Configuration¶
Specify the Metadata/Schema extraction configuration settings and click Save:
| Parameter | Description |
|---|---|
| Destination Bucket Name | Provide the name of the destination bucket that hosts the inventory reports. **Note**: The wait time is 24 to 48 hours for the first inventory report to be generated after the inventory configuration is set up. If you run MDE before the inventory report is generated, Alation will not extract any data. |
| Incremental Sync | Select this checkbox to extract only new object additions and deletions. The first extraction is always a full metadata extraction. **Important**: Complete the necessary configuration in S3 before you enable the Incremental Sync checkbox. The required S3 configuration can be done in two ways: manually, as described in Setup Required in S3 for Incremental MDE, or using the Terraform script (see Terraform Configuration). |
Schema Extraction Configuration¶
Specify the schema extraction configuration settings and click Save:
| Parameter | Description |
|---|---|
| CSV File Delimiter | Select the delimiter used in the CSV files in the file system source from the dropdown. The default delimiter is COMMA. |
| Use Schema Path Pattern | Enable this checkbox to extract schema (columns/headers) only for folders matching the pattern. When Use Schema Path Pattern is enabled, the Schema Extraction job will not match individual CSV/PSV/TSV/Parquet files; it will only match files that are valid for the given schema path pattern. |
| Schema Path Pattern | Provide the schema path pattern for schema extraction. |
Selective Extraction¶
On the Metadata Extraction tab, you can select the buckets to include or exclude from extraction. Enable the Selective Extraction toggle if you want only a subset of buckets to be extracted.
To extract only select buckets:
Click Get List of Buckets to first fetch the list of buckets. The status of this action will be logged in the Extraction Job Status table at the bottom of the Metadata Extraction tab.
When bucket synchronization is complete, a drop-down list with the available buckets will become enabled.
Select one or more buckets as required.
Check if you are using the correct filter option. Available filter options are described below:
| Filter Option | Description |
|---|---|
| Extract all Buckets except | Extract metadata from all buckets except the selected buckets. |
| Extract only these Buckets | Extract metadata only from the selected buckets. |
Click Run Extraction Now to extract metadata. The status of the extraction action is also logged in the Job History table at the bottom of the page.
Extraction Scheduler¶
Automated Extraction¶
If you wish to automatically update the metadata extracted into the Catalog, under Automated and Manual Extraction, turn on the Enable Automated Extraction switch and select the day and time when metadata must be extracted. The metadata extraction will be automatically scheduled to run on the selected schedule.
Schema Extraction¶
Click Run Schema Extraction to extract schema (columns/headers) for parquet, PSV, TSV, and CSV files. For CSV files, select the applicable delimiter in the input field in Metadata/Schema Extraction Configuration.
After you have extracted the bucket content, you can additionally extract the schema information for the CSV and Parquet files. Make sure that you have selected the correct delimiter in the CSV File Delimiter field under Connector Settings.
Sampling¶
The connector supports on-demand, end-user-driven sampling for a file. File sampling is supported for parquet, PSV, TSV, and CSV file formats, and with Basic Authentication (access key and secret key) and SSO Authentication. The AWS S3 user must have read access to the file to be sampled.
Successful execution of Schema Extraction is a prerequisite for seeing the file samples.
When Schema extraction is run with Schema Path Pattern, folders are identified as logical schemas. In this case, end users will be able to initiate Sampling for folders identified as logical schemas with relevant columns cataloged for the folder.
Note
STS authentication is not supported for file sampling.
Basic Authentication for Sampling¶
To perform sampling:
Go to the Samples tab on the catalog page of the extracted CSV or Parquet file and click Credential Settings.
Click Select > Add New.
Select Basic Authentication from the Authentication Type dropdown. Provide Credential Name, AWS Access Key ID and AWS Access Key Secret. Click Save.
Click Authenticate and you will be redirected to the catalog page.
Click the Run Sample button on the catalog page to see sampled data.
SSO Authentication for Sampling¶
Prerequisites¶
Create and configure the application in IdP, refer to Create an Authentication Application for Alation in the IdP.
Configure the IdP in AWS, refer to Create an Identity Provider in AWS.
Configure the auth plug-in in Alation, refer to AWS IAM.
Make sure that you provide the correct Config Name that was used while setting up the IdP application.
Make sure that you specify the SAML 2.0 redirect URL from the IdP application.
Perform Sampling¶
To perform sampling:
Go to the Samples tab on the catalog page of the extracted CSV or Parquet file and click Credential Settings.
Click Select > Add New.
Select SSO Authentication from the Authentication Type dropdown. Provide Credential Name and choose the relevant Plugin Config Name. Click Save.
Click Authenticate and you will be redirected to the IdP login page.
Log in to the IdP with your IdP credentials.
Click the Run Sample button on the catalog page to see sampled data.
Incremental MDE Setup¶
The configuration described in this section is required if you want to perform incremental extraction after the first full MDE.
We recommend using incremental sync based on the following scenarios:
To perform incremental MDE, make sure to set up a cron job/scheduled job to run the extraction on a daily basis. If you run the job on a weekly or bi-weekly basis, the incremental extraction has to process the events for the last 7 or 15 days, which will slow down the incremental extraction. Do not use incremental sync if the cron job/scheduled job is set to run weekly or bi-weekly.
Incremental extraction is recommended where the bucket is huge in terms of the number of objects and only minimal changes are made to the bucket on a daily basis.
The time taken to perform the incremental extraction depends on the events and the number of objects. We recommend that you enable or disable incremental sync based on the following data:
It may take about 85 minutes to process 100K incremental events.
For the non-incremental case, it may take 5-6 hours to process 50M objects.
Setup Required in S3 for Incremental MDE¶
Step 1: Create an IAM Role for Lambda Function¶
Create an IAM role for the Lambda use case with the AWS managed policy AWSLambdaBasicExecutionRole and use the following example policy to provide write access to the destination bucket created in Step 1: Create a Destination Bucket to Store Inventory. Refer to Lambda execution role in the AWS documentation.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::alation-destination-bucket/*"
        }
    ]
}
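If you prefer to create this role programmatically, the following boto3 sketch creates the role with a Lambda trust policy, attaches the AWSLambdaBasicExecutionRole managed policy, and adds the inline write policy shown above. The role name and inline policy name are placeholders.

```python
import json
import boto3

iam = boto3.client("iam")

ROLE_NAME = "alation-incremental-sync-lambda-role"  # placeholder role name

# Trust policy allowing the Lambda service to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.create_role(RoleName=ROLE_NAME, AssumeRolePolicyDocument=json.dumps(trust_policy))

# Attach the AWS managed basic execution policy (CloudWatch Logs access).
iam.attach_role_policy(
    RoleName=ROLE_NAME,
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
)

# Inline policy granting write access to the destination bucket, as in the example above.
iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="alation-destination-bucket-write",  # placeholder policy name
    PolicyDocument=json.dumps(
        {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": "s3:PutObject",
                    "Resource": "arn:aws:s3:::alation-destination-bucket/*",
                }
            ],
        }
    ),
)
```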
Step 2: Create the Lambda Function to Write Event Notifications to the Destination Bucket¶
Open the Lambda service in the AWS console and follow the steps given below to create the Lambda function in the same region as the destination bucket:
Select Create Function.
Select the Use a blueprint option and the s3-get-object-python template. Click Configure.
Enter the Function name.
Example: capture-events-for-alation.
In Execution role, choose Use an existing role and select the IAM role created in Step 1. Click Create function.
Replace the code in the code editor with the code provided below. Make sure to use the correct destination bucket name.
Replacement code:
""" Copyright (C) Alation - All Rights Reserved """ import json import boto3 import hashlib from datetime import datetime s3 = boto3.client('s3') def lambda_handler(event, context): print("Received event: " + json.dumps(event, indent=2)) bucket = event['Records'][0]['s3']['bucket']['name'] # Using md5 hash of a filepath to store as a key filePath = bucket + "/" + event['Records'][0]['s3']['object']['key'] print("File Path: " + filePath) key = hashlib.md5(filePath.encode()).hexdigest() print("Key: " + key) date = datetime.utcnow().strftime("%Y-%m-%d") try: response = s3.put_object( Body=json.dumps(event), # Update the destination bucket name Bucket='alation-destination-bucket', Key='incremental_sync/' + bucket +'/' + date + '/' + key + '.json', ) print(response) except Exception as e: print(e) raise e
Click Deploy after replacing the code.
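Before wiring up the event notifications, you can optionally sanity-check the function by invoking it with a minimal S3-style test event that contains only the fields the handler reads. The function name, region, bucket, and object key below are placeholders; after a successful call, a JSON file should appear under the incremental_sync/ prefix in the destination bucket.

```python
import json
import boto3

# Use the region where the Lambda function and destination bucket were created (placeholder).
lambda_client = boto3.client("lambda", region_name="us-east-1")

# Minimal S3 put-event shape with only the fields the handler reads.
test_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-source-bucket"}, "object": {"key": "data/sample.csv"}}}
    ]
}

resp = lambda_client.invoke(
    FunctionName="capture-events-for-alation",  # placeholder function name
    Payload=json.dumps(test_event).encode("utf-8"),
)
print(resp["StatusCode"], resp["Payload"].read())
```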
Step 3: Create Event Notification Configuration¶
Open the source buckets where you want to create an event configuration. For each bucket, perform the following:
Go to Properties > Event Notifications and click Create event notification.
Provide the Event name.
Select the type of events that need to be captured from this source bucket.
In Destination, select Lambda function and select Choose from your Lambda functions in Specify Lambda function. Select the Lambda function that is created in Step 2: Create the Lambda Function to Write Event Notifications to the Destination Bucket from the dropdown list.
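The same configuration can be applied with boto3, as in the sketch below. The function ARN, statement Id, and bucket name are placeholders. Note that put_bucket_notification_configuration replaces any existing notification configuration on the bucket (see Limitations).

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")             # placeholder region
lambda_client = boto3.client("lambda", region_name="us-east-1")

FUNCTION_NAME = "capture-events-for-alation"                  # placeholder function name
FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:capture-events-for-alation"
SOURCE_BUCKET = "my-source-bucket"                            # placeholder source bucket

# S3 must be allowed to invoke the function (the console adds this permission automatically).
lambda_client.add_permission(
    FunctionName=FUNCTION_NAME,
    StatementId="alation-s3-invoke-" + SOURCE_BUCKET,
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn="arn:aws:s3:::" + SOURCE_BUCKET,
)

# Route object-created and object-removed events from the source bucket to the function.
s3.put_bucket_notification_configuration(
    Bucket=SOURCE_BUCKET,
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "Id": "alation-incremental-sync",
                "LambdaFunctionArn": FUNCTION_ARN,
                "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
            }
        ]
    },
)
```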
Limitations¶
If you have existing event notifications set up on the source buckets, you will not be able to use incremental extraction.
The last modification time will not be displayed for folders.
For incremental extraction, the last modified timestamp displayed for files is the time of the incremental event.
After migration, the owner details will stay as is and will not be removed.
Native to OCF Migration¶
Refer to Native to OCF Migration.
Troubleshooting¶
Refer to Troubleshooting.
Terraform Configuration¶
AWS error while executing the script:
Make sure that you have provided valid permissions to the user that is used in the script.
Bucket missing error:
Make sure the buckets listed in the source-bucket-list.json file exist in S3.
Access denied error:
Make sure you have the required access while executing the script.
Make sure you have added the correct region.
Cannot create a bucket error:
Make sure you have the required permissions and have given a unique name to target_bucket_for_inventory.
Cannot create a Lambda function error:
Make sure you have the required permissions and have given a unique name to lambda_function_name.
Alation Configuration¶
Test connection issue:
Make sure the AWS access key and secret key are correct.
Make sure you have provided the correct region.
Make sure the access key and secret key have the required access as described in the Create an AWS S3 User section of this document.
No inventory reports found:
Make sure that the destination bucket is correct.
If it is correct, wait up to 48 hours after setting up the inventory for the inventory reports to be generated, and then try again.
MDE or filter extraction failure due to access issues:
Make sure the user has access to the destination bucket.
MDE or bucket list error:
Error:
File is initialized without providing a rootPath
File path can’t be null or empty
Troubleshooting:
Do not reuse buckets for the destination inventory bucket for this connector.
Ensure that all bucket folders in the destination inventory bucket are named properly.
Verify the Inventory Configuration for each source bucket you are ingesting.
If there is already a bucket path that has a blank name or / in the destination inventory bucket, remove it. Review the inventory configuration for each source bucket.
Lambda Function Logs¶
Go to the AWS console > Lambda > Functions section.
Select the Lambda function that you created and go to the Monitor section.
Click View logs in CloudWatch; this opens a new tab.
In CloudWatch, search for a particular log based on date or prefix.
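If you prefer to query the logs programmatically, a minimal sketch using the CloudWatch Logs API is shown below; the function name, region, and filter pattern are placeholders.

```python
import time
import boto3

logs = boto3.client("logs", region_name="us-east-1")  # placeholder region

# Lambda log groups follow the /aws/lambda/<function-name> convention.
response = logs.filter_log_events(
    logGroupName="/aws/lambda/capture-events-for-alation",      # placeholder function name
    startTime=int((time.time() - 24 * 3600) * 1000),            # last 24 hours, in milliseconds
    filterPattern="ERROR",                                      # or a file-path prefix you expect
)
for event in response["events"]:
    print(event["timestamp"], event["message"].rstrip())
```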