OCF Connector for Amazon Athena: Query Log Ingestion

Applies to: Alation Cloud Service and customer-managed instances of Alation

Query log ingestion (QLI) from Amazon Athena uses AWS CloudTrail events and requires AWS CloudTrail to be enabled on the AWS account.

QLI Configuration in AWS

  1. Verify that the CloudTrail trail for your AWS account logs Amazon Athena queries. The event type StartQueryExecution should be logged.

  2. In Amazon Athena, create an external table using the SQL example below. The external table can be created in any available database, including the default database. When using the template, substitute the placeholders for the table name and for the CloudTrail S3 bucket name, AWS account ID, and Region with actual values.

    CREATE EXTERNAL TABLE AwsDataCatalog.default.alation_qli (
      eventversion STRING,
      userIdentity STRUCT<
        type:STRING,
        principalid:STRING,
        arn:STRING,
        accountid:STRING,
        invokedby:STRING,
        accesskeyid:STRING,
        userName:STRING,
        sessioncontext:STRUCT<
          attributes:STRUCT<
            mfaauthenticated:STRING,
            creationdate:STRING>,
          sessionIssuer:STRUCT<
            type:STRING,
            principalId:STRING,
            arn:STRING,
            accountId:STRING,
            userName:STRING>>>,
      eventTime STRING,
      eventSource STRING,
      eventName STRING,
      awsRegion STRING,
      sourceIpAddress STRING,
      userAgent STRING,
      errorCode STRING,
      errorMessage STRING,
      requestParameters STRING,
      responseElements STRING,
      additionalEventData STRING,
      requestId STRING,
      eventId STRING,
      resources ARRAY<STRUCT<
        ARN:STRING,
        accountId:STRING,
        type:STRING>>,
      eventType STRING,
      apiVersion STRING,
      readOnly STRING,
      recipientAccountId STRING,
      serviceEventDetails STRING,
      sharedEventID STRING,
      vpcEndpointId STRING
    )
    PARTITIONED BY (
      daterange string)
    ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
    STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION 's3://<bucket>/AWSLogs/<account>/CloudTrail/<region>'
    TBLPROPERTIES (
      'projection.enabled' = 'true',
      'projection.daterange.type' = 'date',
      'projection.daterange.format' = 'yyyy/MM/dd',
      'projection.daterange.interval' = '1',
      'projection.daterange.interval.unit' = 'DAYS',
      'projection.daterange.range' = '2022/03/01,NOW',
      'storage.location.template' = 's3://<bucket>/AWSLogs/<account>/CloudTrail/<region>/${daterange}'
      );
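
    Optionally, you can run a quick test query like the sketch below against the new table to confirm both that the table resolves your CloudTrail logs and that StartQueryExecution events are being recorded (step 1). The date literal is only an example and assumes the partition format defined above; use a recent date from your own trail.

    SELECT eventtime,
           eventname,
           json_extract_scalar(requestParameters, '$.queryString') AS querystring
    FROM awsdatacatalog.default.alation_qli
    WHERE daterange = '2022/03/01'
      AND eventname = 'StartQueryExecution'
    LIMIT 10;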
    
  3. We recommend creating a QLI view from the external table and using table-based QLI on the Alation side. Alternatively, you can use custom query-based QLI, which does not require creating a view. When using the view SQL below, make sure to substitute placeholder values with real values.

    If objects in queries are not fully qualified, Alation will not parse the queries even if you pass in defaultDatabases = awsdatacatalog. If objects in your queries are only partially qualified, wrap json_extract_scalar(requestParameters, '$.queryString') in nested REPLACE calls so that every schema name in the query SQL is replaced with its two-part identifier, as in the example below (a sketch of the view with this expression applied is shown after the QLI View SQL):

    REPLACE(
      REPLACE(
        REPLACE(
          REPLACE(
            json_extract_scalar(requestParameters, '$.queryString'),
            'schema1', 'awsdatacatalog.schema1'),
          'schema2', 'awsdatacatalog.schema2'),
        'schema3', 'awsdatacatalog.schema3'),
      'schema4', 'awsdatacatalog.schema4') AS queryString
    
  4. Save the fully qualified name of the QLI view. You will need to specify it in Alation when configuring QLI.

QLI View SQL

CREATE OR REPLACE VIEW awsdatacatalog.default.test_QLI_view AS
SELECT (
  CASE CAST(useridentity.type AS VARCHAR)
    WHEN 'AssumedRole'
    THEN SUBSTR(useridentity.principalid, position(':' IN useridentity.principalid)+1)
    WHEN 'IAMUser'
    THEN useridentity.username
    ELSE useridentity.username
  END
  ) AS userName
  ,'' AS defaultDatabases
  ,json_extract_scalar(responseElements, '$.queryExecutionId') AS sessionId
  ,replace(eventtime, 'T', ' ') AS sessionStartTime
  ,replace(eventtime, 'T', ' ') AS startTime
  ,json_extract_scalar(requestParameters, '$.queryString') AS queryString
  ,0 AS seconds
  ,false as cancelled
FROM default.alation_qli
WHERE eventname = 'StartQueryExecution'
  AND CAST(ErrorCode AS VARCHAR) IS NULL;
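
If the queries in your history reference objects with partially qualified names (step 3 above), the queryString column of the view can be wrapped in the nested REPLACE expression. The sketch below shows one way to do this; schema1 and schema2 are hypothetical schema names, so substitute the schemas that actually appear in your queries and keep the remaining columns and aliases unchanged.

CREATE OR REPLACE VIEW awsdatacatalog.default.test_QLI_view AS
SELECT (
  CASE CAST(useridentity.type AS VARCHAR)
    WHEN 'AssumedRole'
    THEN SUBSTR(useridentity.principalid, position(':' IN useridentity.principalid)+1)
    WHEN 'IAMUser'
    THEN useridentity.username
    ELSE useridentity.username
  END
  ) AS userName
  ,'' AS defaultDatabases
  ,json_extract_scalar(responseElements, '$.queryExecutionId') AS sessionId
  ,replace(eventtime, 'T', ' ') AS sessionStartTime
  ,replace(eventtime, 'T', ' ') AS startTime
  -- Prefix each partially qualified schema name with the catalog name.
  ,REPLACE(
     REPLACE(
       json_extract_scalar(requestParameters, '$.queryString'),
       'schema1', 'awsdatacatalog.schema1'),
     'schema2', 'awsdatacatalog.schema2') AS queryString
  ,0 AS seconds
  ,false AS cancelled
FROM default.alation_qli
WHERE eventname = 'StartQueryExecution'
  AND CAST(ErrorCode AS VARCHAR) IS NULL;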

QLI Configuration in Alation

On the Query Log Ingestion tab, you can select the QLI options for your data source and schedule the QLI job if necessary.

Table-Based QLI

Table-based QLI uses the view created from the external QLI table.

In the Table Name field under Connector Settings > Query Extraction, specify the name of the view in which the query logs are available. Make sure that the service account has permissions to select from this view. The table name must be provided in the format database.schema.view_name, for example awsdatacatalog.default.test_QLI_view for the view created above.

Custom Query-Based QLI

If you cannot create the view for QLI, for example due to access restrictions, you can use a custom QLI query to extract query history into Alation. If you opt for custom query-based QLI, Alation will run your custom query against the table that stores query history (for Athena, the external CloudTrail table created above) every time you run QLI manually or when the QLI job runs on schedule.

For custom query-based QLI to succeed, ensure that the service account has enough permissions to select from the QLI table.

The template for the QLI query is given below. You can customize it by adding, removing, or changing the filter (an example that adds a Region filter is shown after the template), but the columns and their aliases must remain as is because the connector expects this query structure.

Note

When using the QLI query template, do not substitute the STARTTIME and ENDTIME parameters in the WHERE filter. These parameters are not actual column names and should stay as is. They are expected by the connector and will be substituted with the start and end date of the QLI range selected in the user interface when QLI is run manually or on schedule.

To configure query-based QLI:

  1. Go to the Query Log Ingestion tab of the Settings page of your OCF data source.

  2. Under Connector Settings > Query Extraction, in the Custom QLI Query field, provide the QLI query.

  3. Click Save.

QLI Query Template

SELECT (
  CASE CAST(useridentity.type AS VARCHAR)
    WHEN 'AssumedRole'
    THEN SUBSTR(useridentity.principalid, position(':' IN useridentity.principalid)+1)
    WHEN 'IAMUser'
    THEN useridentity.username
    ELSE useridentity.username
  END
  ) AS userName
  ,'' AS defaultDatabases
  ,json_extract_scalar(responseElements, '$.queryExecutionId') AS sessionId
  ,replace(eventtime, 'T', ' ') AS sessionStartTime
  ,replace(eventtime, 'T', ' ') AS startTime
  ,json_extract_scalar(requestParameters, '$.queryString') AS queryString
  ,0 AS seconds
  ,false as cancelled
FROM default.alation_qli
WHERE eventname = 'StartQueryExecution'
  AND CAST(ErrorCode AS VARCHAR) IS NULL
  AND daterange>=date_format(cast(STARTTIME as timestamp),'%Y/%m/%d')
  AND daterange<date_format(cast(ENDTIME as timestamp),'%Y/%m/%d');
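
One possible customization of the filter, sketched below, restricts ingestion to queries that ran in a single AWS Region by adding a condition on the awsRegion column of the external table. The value us-east-1 is a placeholder; the SELECT list and the STARTTIME and ENDTIME parameters are kept exactly as in the template.

SELECT (
  CASE CAST(useridentity.type AS VARCHAR)
    WHEN 'AssumedRole'
    THEN SUBSTR(useridentity.principalid, position(':' IN useridentity.principalid)+1)
    WHEN 'IAMUser'
    THEN useridentity.username
    ELSE useridentity.username
  END
  ) AS userName
  ,'' AS defaultDatabases
  ,json_extract_scalar(responseElements, '$.queryExecutionId') AS sessionId
  ,replace(eventtime, 'T', ' ') AS sessionStartTime
  ,replace(eventtime, 'T', ' ') AS startTime
  ,json_extract_scalar(requestParameters, '$.queryString') AS queryString
  ,0 AS seconds
  ,false AS cancelled
FROM default.alation_qli
WHERE eventname = 'StartQueryExecution'
  AND CAST(ErrorCode AS VARCHAR) IS NULL
  -- Example of an additional filter: ingest only queries run in one Region.
  AND awsregion = 'us-east-1'
  AND daterange>=date_format(cast(STARTTIME as timestamp),'%Y/%m/%d')
  AND daterange<date_format(cast(ENDTIME as timestamp),'%Y/%m/%d');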

Perform QLI

To perform QLI on demand:

  1. Under the Automated and Manual Query Log Ingestion section on the Query Log Ingestion tab, make sure that the Enable Automated Query Log Ingestion toggle is off.

  2. Click Preview to get a sample of the query history data to be ingested.

  3. Click the Import button to begin QLI.

Schedule QLI

You can schedule QLI to run automatically at the time you set:

  1. Under the Automated and Manual Query Log Ingestion section on the Query Log Ingestion tab, enable the Enable Automated Query Log Ingestion toggle.

  2. Under Automated Query Log Ingestion Time, set a schedule by specifying values in the week, day and time fields. The next QLI job will run on the schedule you have specified.

    Note

    The hourly schedule for automated QLI is not supported.