AWS S3 OCF Connector: Install and Configure

Metadata extraction uses the S3 inventory reports generated for your buckets. Alation reads the inventory reports from the destination bucket and streams the metadata to the catalog.

../../../_images/S3OCF_01.png

Required Information

The following information is required for configuring the S3 file system source in Alation:

  • AWS access key ID

  • AWS access key secret

  • Bucket to store the inventory for Alation - This destination bucket is created to store the inventory reports used for metadata extraction.

  • Lambda function (optional) - Setting up a Lambda function is required only if you plan to use incremental extraction.

Prerequisites

To connect to S3 from Alation, you will need to perform a number of configurations in AWS S3, described in the sections below.

Configuration in S3

The destination bucket can be created either manually, using the steps provided in this section, or with the Terraform script. Refer to Terraform Configuration.

Step 1: Create a Destination Bucket to Store Inventory

  1. In AWS S3, create a bucket to be used as the destination bucket by Alation. Create this bucket in the same AWS region as your source buckets. This bucket should not be used for any other purpose or store any other content. Alation expects that the destination bucket only stores the inventory reports to be ingested into the catalog.

    Example:

    alation-destination-bucket

  2. Create a folder called inventory inside the destination bucket.
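
If you prefer to create the destination bucket and the inventory folder programmatically, a minimal boto3 sketch along these lines works (the bucket name and region are example values):

import boto3

# Example values - substitute your own bucket name and region.
region = "us-west-2"
bucket = "alation-destination-bucket"

s3 = boto3.client("s3", region_name=region)

# Create the destination bucket in the same region as your source buckets.
# For us-east-1, omit the CreateBucketConfiguration argument.
s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": region},
)

# S3 has no real folders; an empty object whose key ends in "/" acts as the
# inventory folder that Alation expects.
s3.put_object(Bucket=bucket, Key="inventory/")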

Step 2: Configure the S3 Inventory

For every source bucket that you want to be ingested into the catalog, configure the S3 inventory. Use the steps in Configuring inventory using the S3 console in AWS documentation to perform this configuration. Make sure to follow these recommendations:

  • Select the destination bucket with inventory folder path: s3://<Destination_Bucket_Name>/inventory

  • Set Frequency to Daily or Weekly. Make sure that the frequency matches the metadata extraction schedule.

  • Set Output format to CSV.

  • Set Status to Enable.

  • Set Server-side encryption to Disable.

  • Additional fields (Optional)

    • Size

    • Last modified

Note

Additional fields are optional. MDE will not fail even if the additional fields are not configured.

Important

It may take up to 48 hours to deliver the first report into the destination bucket. After the inventory reports for all the buckets that you want ingested in Alation have been delivered to your destination bucket, you can proceed with the configuration on the Alation side.
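
If you would rather configure the inventory programmatically than through the console, a boto3 sketch along these lines applies the recommendations above to one source bucket (the bucket names and the inventory configuration ID are placeholders):

import boto3

s3 = boto3.client("s3")

# Placeholders: replace the source bucket, destination bucket ARN, and ID.
s3.put_bucket_inventory_configuration(
    Bucket="my-source-bucket",
    Id="alation-inventory",
    InventoryConfiguration={
        "Id": "alation-inventory",
        "IsEnabled": True,                               # Status: Enable
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},              # match your MDE schedule
        "OptionalFields": ["Size", "LastModifiedDate"],  # optional additional fields
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::alation-destination-bucket",
                "Prefix": "inventory",                   # s3://<Destination_Bucket_Name>/inventory
                "Format": "CSV",                         # Output format: CSV
            }
        },
    },
)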

Create an AWS S3 User

Basic Authentication User

Create a user in AWS with read-only access to the source buckets (the buckets that store the data) and the destination bucket (created for the inventory). This user is used to set up the connection in Alation and perform metadata extraction (MDE).

Note

If you do not require incremental MDE, you can provide read-only access to the destination bucket only.

For column extraction, read-only access is required to the source buckets and the destination bucket even if incremental MDE is turned off.

You can scope the Resource element to only the destination bucket and the source buckets that you need; the example below grants read-only access to all buckets. Refer to the AWS S3 documentation for more information.

Example:

{
    "Version": "2012-10-17",
    "Statement":
    [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:List*"
            ],
            "Resource": "*"
        }
    ]
}
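
If you want to restrict access to specific buckets instead of using "Resource": "*", a scoped variant of the same policy looks like this (the bucket names are placeholders):

{
    "Version": "2012-10-17",
    "Statement":
    [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:List*"
            ],
            "Resource": [
                "arn:aws:s3:::alation-destination-bucket",
                "arn:aws:s3:::alation-destination-bucket/*",
                "arn:aws:s3:::my-source-bucket",
                "arn:aws:s3:::my-source-bucket/*"
            ]
        }
    ]
}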

STS User

Note

File sampling is not supported for STS authentication.

Perform the following steps to create an STS user:

  1. Create a user in AWS without any permissions assigned.

  2. Create a role in AWS and assign a policy that grants the following access:

    • Read-only access to the source buckets and the destination bucket to set up a connection in Alation and perform the metadata extraction (MDE).

    Note

    If you do not require incremental MDE, you can provide read-only access to the destination bucket only.

    • For column extraction, read-only access is required to the source buckets and the destination bucket even if incremental MDE is turned off.

  3. Go to the Trust relationships section inside the role and add the trusted entity as shown below:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": "{PUT_ARN_OF_THE_USER_CREATED_IN_STEP_1_OF_THIS_SECTION"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }
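
To confirm the setup end to end, a quick boto3 check along these lines assumes the role with the user's access keys and lists the inventory prefix in the destination bucket (the role ARN and bucket name are placeholders):

import boto3

# Uses the access key and secret of the user created in step 1 (for example,
# via your default AWS profile or environment variables).
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/alation-s3-read-role",  # placeholder
    RoleSessionName="alation-sts-check",
    DurationSeconds=3600,  # corresponds to the STS Duration connector setting
)["Credentials"]

s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

# If the role is set up correctly, this lists objects under the inventory folder.
response = s3.list_objects_v2(Bucket="alation-destination-bucket", Prefix="inventory/")
print(response.get("KeyCount", 0), "objects found")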
    

Configuration in Alation

Step 1: Install the Connector

Alation On-Prem

Important

Installation of OCF connectors requires Alation Connector Manager to be installed as a prerequisite.

  1. If this has not been done on your instance, install the Connector Manager: Install Alation Connector Manager.

  2. Make sure that the connector Zip file which you received from Alation is available on your local machine.

  3. Install the connector on the Connectors Dashboard page. Refer to Manage Connector Dashboard for details.

Alation Cloud Service

Note

OCF connectors require Alation Connector Manager. Alation Connector Manager is available by default on all Alation Cloud Service instances and there is no need to separately install it.

  1. Make sure that the OCF connector Zip file that you received from Alation is available on your local machine.

  2. Install the connector on the Connectors Dashboard page: refer to Manage Connector Dashboard.

Step 2: Create and Configure a New S3 File System Source

  1. Log in to the Alation instance and add a new S3 source by clicking on Apps > Sources > Add > File System.

  2. From the File System Type dropdown, select S3 OCF Connector.

  3. Provide a Title for the file system and click on Add File System. You will be navigated to the Settings page of your new S3 OCF file system.

../../../_images/S3OCF_02.png

Access

On the Access tab, set the file system visibility as follows:

  • Public File System - The file system will be visible to all users of the catalog.

  • Private File System - The file system will be visible only to users who are given access by File System Admins.

Add new File System Admin users in the File System Admins section.

General Settings

Perform the configuration on the General Settings tab:

  1. Specify Connector Settings:

    File System Connection

      • Region - Specify the AWS S3 region, for example us-east-1. To use FIPS endpoints for GovCloud, prefix the region with fips-, for example fips-us-east-1.

      • Basic Authentication - Select the Basic Authentication radio button if basic authentication access was provided to your AWS user.

      • STS Authentication - Select the STS Authentication radio button if Security Token Service (STS) authentication access was provided to the role of your AWS user.

    Basic Authentication (applicable only if the Basic Authentication radio button is selected)

      • AWS Access Key ID - Provide the AWS Access Key ID of the IAM user with basic authentication access. Make sure that the IAM user has access to the destination bucket.

      • AWS Access Key Secret - Provide the AWS Access Key Secret.

    STS Authentication (applicable only if the STS Authentication radio button is selected)

      • STS: AWS Access Key ID - Provide the AWS Access Key ID of the IAM user with STS authentication access. Make sure that the IAM user has access to the destination bucket.

      • STS: AWS Access Key Secret - Provide the AWS Access Key Secret.

      • Role ARN - Provide the ARN of the IAM role to assume with the required permissions.

      • STS Duration - Provide the duration of the role session.

      • Region-Specific Endpoint - Select this checkbox to use regional endpoints for STS requests. If the checkbox is unselected, the global endpoint is used for STS requests.

    Logging Information

      • Log Level - Select the Log Level to generate logs. The available log levels are based on the log4j framework.

  2. Click Save.

  3. Obfuscate Literals - Enable this toggle to hide the details of the queries in the catalog page that are ingested via QLI or executed in Compose. This toggle is disabled by default.

  4. Under Test Connection, click Test to validate network connectivity.

Deleting the Data Source

You can delete your data source from the General Settings tab. Under Delete Data Source, click Delete to delete the data source connection.

../../../_images/S3OCF_03.png

Metadata Extraction

You can perform a full extraction or incremental extraction with the help of additional configuration on the AWS S3 side.

../../../_images/S3OCF_04.png

Connector Settings

Metadata/Schema Extraction Configuration

Specify the Metadata/Schema extraction configuration settings and click Save:

  • Destination Bucket Name - Provide the name of the destination bucket that hosts the inventory reports.

    Note:

    The wait time is 24 to 48 hours for the first inventory report to be generated once the inventory configuration is set. If you run MDE before the inventory report is generated, Alation will not extract any data.

  • Incremental Sync - Select this checkbox to extract only new object additions and deletions. The first extraction is always a full metadata extraction.

    Important:

    Make sure that you complete the required configuration in S3 before you enable the Incremental Sync checkbox. The required S3 configuration can be done either manually (see Incremental MDE Setup) or with the Terraform script (see Terraform Configuration).

Schema Extraction Configuration

Specify the schema extraction configuration settings and click Save:

  • CSV File Delimiter - Select the delimiter used within the CSV files in the file system source from the dropdown. The default delimiter value is COMMA.

  • Use Schema Path Pattern - Enable this checkbox to extract schema (columns/headers) for folders matching the pattern. When this checkbox is enabled, the schema extraction job will not match any individual CSV, PSV, TSV, or Parquet files; it will only match files that are valid for the given schema path pattern.

  • Schema Path Pattern - Provide the schema path pattern for schema extraction.

Selective Extraction

On the Metadata Extraction tab, you can select the buckets to include or exclude from extraction. Enable the Selective Extraction toggle if you want only a subset of buckets to be extracted.

To extract only select buckets:

  1. Click Get List of Buckets to first fetch the list of buckets. The status of this action will be logged in the Extraction Job Status table at the bottom of the Metadata Extraction tab.

  2. When bucket synchronization is complete, a drop-down list with the available buckets will become enabled.

  3. Select one or more buckets as required.

  4. Check if you are using the correct filter option. Available filter options are described below:

    • Extract all Buckets except - Extract metadata from all buckets except the selected buckets.

    • Extract only these Buckets - Extract metadata only from the selected buckets.

  5. Click Run Extraction Now to extract metadata. The status of the extraction action is also logged in the Job History table at the bottom of the page.

Extraction Scheduler

Automated Extraction

If you wish to automatically update the metadata extracted into the catalog, under Automated and Manual Extraction, turn on the Enable Automated Extraction switch and select the day and time when metadata must be extracted. Extraction will then run automatically on the selected schedule.

Schema Extraction

After you have extracted the bucket content, you can additionally extract the schema information (columns/headers) for Parquet, PSV, TSV, and CSV files. Click Run Schema Extraction to run this extraction. For CSV files, make sure that you have selected the correct delimiter in the CSV File Delimiter field under Connector Settings (see Metadata/Schema Extraction Configuration).

Sampling

The connector supports on-demand, end-user-driven sampling of files. File sampling is supported for Parquet, PSV, TSV, and CSV file formats and with Basic Authentication (access key and secret key) or SSO authentication. The AWS S3 user must have read access to the file to be sampled.

Successful execution of Schema Extraction is a prerequisite for seeing the file samples.

When schema extraction is run with a Schema Path Pattern, folders matching the pattern are identified as logical schemas. In this case, end users can initiate sampling on those folders, with the relevant columns cataloged for each folder.

Note

STS authentication is not supported for file sampling.

Basic Authentication for Sampling

To perform sampling:

  1. On the catalog page of the extracted CSV or Parquet file, go to the Samples tab and click Credential Settings.

    ../../../_images/S3OCF_05.png
  2. Click Select > Add New.

    ../../../_images/S3OCF_06.png
  3. Select Basic Authentication from the Authentication Type dropdown. Provide Credential Name, AWS Access Key ID and AWS Access Key Secret. Click Save.

    ../../../_images/S3OCF_07.png
  4. Click Authenticate and you will be redirected to the catalog page.

    ../../../_images/S3OCF_08.png
  5. Click the Run Sample button on the catalog page to see sampled data.

    ../../../_images/S3OCF_09.png

SSO Authentication for Sampling

Prerequisites

  1. Create and configure the application in the IdP. Refer to Create an Authentication Application for Alation in the IdP.

  2. Configure the IdP in AWS. Refer to Create an Identity Provider in AWS.

  3. Configure the auth plug-in in Alation. Refer to AWS IAM.

    1. Make sure that you provide the correct Config Name that was used while setting up the IdP application.

    2. Make sure that you specify the SAML 2.0 redirect URL from the IdP application.

Perform Sampling

To perform sampling:

  1. On the catalog page of the extracted CSV or Parquet file, go to the Samples tab and click Credential Settings.

    ../../../_images/S3OCF_05.png
  2. Click Select > Add New.

    ../../../_images/S3OCF_06.png
  3. Select SSO Authentication from the Authentication Type dropdown. Provide Credential Name and choose the relevant Plugin Config Name. Click Save.

    ../../../_images/S3OCF_19.png
  4. Click Authenticate and you will be redirected to the IdP login page.

    ../../../_images/S3OCF_20.png
  5. Log in to the IdP with your IdP credentials.

    ../../../_images/S3OCF_21.png
  6. Click the Run Sample button on the catalog page to see sampled data.

    ../../../_images/S3OCF_09.png

Incremental MDE Setup

The configuration described in this section is required only if you want to perform incremental extraction after the first full MDE.

We recommend deciding whether to use incremental sync based on the following guidelines (see the rough sizing sketch after this list):

  • To perform incremental MDE, set up a cron job or scheduled job to run the extraction on a daily basis. If the job runs weekly or bi-weekly, incremental extraction has to process the events of the last 7 or 15 days, which slows it down considerably. Do not use incremental sync if your extraction job is scheduled to run weekly or bi-weekly.

  • Incremental extraction is recommended when a bucket contains a very large number of objects and only minimal changes are made to it on a daily basis.

  • The time taken by incremental extraction depends on the number of events and objects. Use the following reference numbers to decide whether to enable incremental sync:

    • It may take about 85 minutes to process 100K incremental events.

    • For the non-incremental case, it may take 5-6 hours to process 50M objects.
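
A quick back-of-the-envelope comparison based on these reference numbers can help decide whether incremental sync pays off for a particular bucket. The figures below are illustrative estimates only:

# Rough sizing sketch based on the reference numbers above (estimates, not guarantees).
events_per_day = 200_000            # example: daily object additions/deletions
objects_in_bucket = 50_000_000      # example: total objects in the bucket

incremental_minutes = 85 * events_per_day / 100_000                  # ~85 min per 100K events
full_extraction_minutes = 6 * 60 * objects_in_bucket / 50_000_000    # ~5-6 hours per 50M objects

print(f"incremental: ~{incremental_minutes:.0f} min, full: ~{full_extraction_minutes:.0f} min")
# With these example numbers, incremental (~170 min) is faster than a full
# extraction (~360 min); as daily events approach roughly 400K, the advantage disappears.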

Setup Required in S3 for Incremental MDE

Step 1: Create an IAM Role for Lambda Function

Create an IAM role for the Lambda use case with the AWS managed policy AWSLambdaBasicExecutionRole, and attach a policy like the following example to provide write access to the destination bucket created in Step 1: Create a Destination Bucket to Store Inventory. Refer to Lambda execution role in the AWS documentation.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::alation-destination-bucket/*"
        }
    ]
}
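
If you create the role outside the Lambda console's guided flow, the role also needs the standard Lambda trust relationship so that the Lambda service can assume it; a typical trust policy looks like this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "lambda.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}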

Step 2: Create the Lambda Function to Write Event Notifications to the Destination Bucket

Open the Lambda service in the AWS console and follow the steps below to create the Lambda function in the same region as the destination bucket:

  1. Select Create Function.

    ../../../_images/S3OCF_10.png
  2. Select the Use a blueprint option and the s3-get-object-python template. Click Configure.

    ../../../_images/S3OCF_11.png
  3. Enter the Function name.

    Example: capture-events-for-alation.

  4. In Execution role, choose Use an existing role and select the IAM role created in Step 1. Click Create function.

    ../../../_images/S3OCF_12.png
  5. Replace the code in the window shown below with the code provided here. Make sure to use the correct destination bucket name.

    ../../../_images/S3OCF_13.png

    Code to replace:

    """
    Copyright (C) Alation - All Rights Reserved
    """
    
    import json
    import boto3
    import hashlib
    from datetime import datetime
    
    s3 = boto3.client('s3')
    
    def lambda_handler(event, context):
        print("Received event: " + json.dumps(event, indent=2))
    
        bucket = event['Records'][0]['s3']['bucket']['name']
        # Using md5 hash of a filepath to store as a key
        filePath = bucket + "/" + event['Records'][0]['s3']['object']['key']
        print("File Path: " + filePath)
        key = hashlib.md5(filePath.encode()).hexdigest()
        print("Key: " + key)
        date = datetime.utcnow().strftime("%Y-%m-%d")
        try:
            response = s3.put_object(
                        Body=json.dumps(event),
                        # Update the destination bucket name
                        Bucket='alation-destination-bucket',
                        Key='incremental_sync/' + bucket +'/' + date + '/' + key + '.json',
                    )
            print(response)
        except Exception as e:
            print(e)
            raise e
    
  6. Click Deploy after replacing the code.

Step 3: Create Event Notification Configuration

Open the source buckets where you want to create an event configuration. For each bucket, perform the following:

  1. Go to Properties > Event Notifications and click Create event notification.

    ../../../_images/S3OCF_14.png
  2. Provide the Event name.

    ../../../_images/S3OCF_15.png
  3. Select the type of events that need to be captured from this source bucket.

    ../../../_images/S3OCF_16.png
  4. In Destination, select Lambda function, and under Specify Lambda function, select Choose from your Lambda functions. From the dropdown list, select the Lambda function created in Step 2: Create the Lambda Function to Write Event Notifications to the Destination Bucket.

../../../_images/S3OCF_17.png
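
The same event notification can also be created programmatically. A boto3 sketch along these lines attaches the Lambda function to a source bucket (the bucket name, Lambda ARN, and event types are placeholders; use the event types you selected in the console):

import boto3

s3 = boto3.client("s3")

# Caution: this call replaces the bucket's entire existing notification
# configuration, so include any notifications you want to keep.
s3.put_bucket_notification_configuration(
    Bucket="my-source-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "Id": "capture-events-for-alation",
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:capture-events-for-alation",
                "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
            }
        ]
    },
)

Note that, unlike the console flow, this API call does not grant S3 permission to invoke the Lambda function; that permission has to be added separately to the function's resource-based policy.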

Limitations

  1. If you already have event notifications set up on the source buckets, you will not be able to use incremental extraction.

  2. The last modification time will not be displayed for the folders.

  3. For incremental extraction, the last modified timestamp for files captured through incremental events is displayed as the event time.

  4. After migration, the owner details stay as is and are not removed.

Native to OCF Migration

Refer to Native to OCF Migration.

Troubleshooting

Refer to Troubleshooting.

Terraform Configuration

  1. AWS Error while executing a script:

    • Make sure that you have provided valid permissions to the user used in the script.

  2. Bucket missing error:

    • Make sure the buckets present in the source-bucket-list.json file exist in S3

  3. Access denied error:

    • Make sure you have the required access while executing the script.

    • Make sure you have specified the correct region.

  4. Cannot create a bucket error:

    • Make sure you have the required permissions and have given a unique name to target_bucket_for_inventory.

  5. Cannot create a Lambda function error:

    • Make sure you have the required permissions and have given a unique name to lambda_function_name.

Alation Configuration

  1. Test connection issue:

    • Make sure the AWS access key and secret key are correct.

    • Make sure you have provided the correct region.

    • Make sure the access key and secret key have the required access as described in the Create an AWS S3 User section of this document.

  2. No inventory reports found:

    • Make sure that the destination bucket is correct.

    • If it is correct, wait up to 48 hours after setting up the inventory for the reports to be generated, then try again.

  3. MDE or filter extraction failure due to access issue:

    • Make sure the user has access to the destination bucket.

  4. MDE or bucket list error:

    Error:

    • File is initialized without providing a rootPath

    • File path can’t be null or empty

    Troubleshooting:

    • Do not reuse an existing bucket as the destination inventory bucket for this connector.

    • Ensure that all bucket paths in the destination bucket are named properly.

    • Verify the inventory configuration for each source bucket you are ingesting.

    • If the destination inventory bucket already contains a bucket path with a blank name or /, remove it, and review the inventory configuration for each source bucket.

Lambda Function Logs

  1. Go to the AWS console > Lambda > Functions.

  2. Select the Lambda function that you created and go to the Monitor section.

  3. Click View logs in CloudWatch; this opens a new tab.

  4. In CloudWatch, search for a particular log based on date or prefix.