OCF Connector for AWS Athena: Query Log Ingestion

Query log ingestion (QLI) from Athena uses AWS CloudTrail events and requires AWS CloudTrail to be enabled on the AWS account.

QLI Configuration in AWS

  1. Verify that the CloudTrail for your AWS account logs Athena queries. The event type StartQueryExecution should be logged.

  2. In Athena, create an external table using the SQL query example below. The external table can be created in any available database, including the Default database. When using the template, replace placeholder values with actual values for the table name and the CloudTrail S3 bucket name.

    CREATE EXTERNAL TABLE AwsDataCatalog.default.alation_qli (
      eventversion STRING,
      userIdentity STRUCT< type:STRING,
      principalid:STRING,
      arn:STRING,
      accountid:STRING,
      invokedby:STRING,
      accesskeyid:STRING,
      userName:STRING,
      sessioncontext:STRUCT< attributes:STRUCT< mfaauthenticated:STRING,
      creationdate:STRING>,
      sessionIssuer:STRUCT< type:STRING,
      principalId:STRING,
      arn:STRING,
      accountId:STRING,
      userName:STRING>>>,
      eventTime STRING,
      eventSource STRING,
      eventName STRING,
      awsRegion STRING,
      sourceIpAddress STRING,
      userAgent STRING,
      errorCode STRING,
      errorMessage STRING,
      requestParameters STRING,
      responseElements STRING,
      additionalEventData STRING,
      requestId STRING,
      eventId STRING, resources ARRAY<STRUCT< ARN:STRING,
      accountId: STRING, type:STRING>>,
      eventType STRING,
      apiVersion STRING,
      readOnly STRING,
      recipientAccountId STRING,
      serviceEventDetails STRING,
      sharedEventID STRING,
      vpcEndpointId STRING
    )
    PARTITIONED BY (
      daterange string)
    ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
    STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION 's3://<bucket>/AWSLogs/<account>/CloudTrail/<region>'
    TBLPROPERTIES (
      'projection.enabled' = 'true',
      'projection.daterange.type' = 'date',
      'projection.daterange.format' = 'yyyy/MM/dd',
      'projection.daterange.interval' = '1',
      'projection.daterange.interval.unit' = 'DAYS',
      'projection.daterange.range' = '2022/03/01,NOW',
      'storage.location.template' = 's3://<bucket>/AWSLogs/<account>/CloudTrail/<region>/${daterange}'
      );
    
  3. We recommend creating a QLI view from the external table and using view-based QLI on the Alation side. Alternatively, you can use custom query-based QLI, which does not require creating a view. When using the view SQL below, make sure to substitute placeholder values with real values.

    If objects in queries are not fully qualified, the queries will not be parsed in Alation even if you pass in defaultDatabases = awsdatacatalog. If objects in your queries are partially qualified, include the following REPLACE script to replace every schema name in the query SQL with the two-part identifier. Example:

    REPLACE
      (REPLACE
        (REPLACE
          (REPLACE
            (json_extract_scalar
            (requestParameters, '$.queryString'),
            'schema1','awsdatacatalog.schema1'),
            'schema2','awsdatacatalog.schema2'),
            'schema3','awsdatacatalog.schema3'),
            'schema4', 'awsdatacatalog.schema4')
          AS queryString
    
  4. Save the fully qualified name of the QLI view. You will need to specify it in Alation when configuring QLI.

QLI View SQL

CREATE OR REPLACE VIEW awsdatacatalog.default.test_QLI_view AS
SELECT (
  CASE CAST(useridentity.type AS VARCHAR)
    WHEN 'AssumedRole'
    THEN SUBSTR(useridentity.principalid, position(':' IN useridentity.principalid)+1)
    WHEN 'IAMUser'
    THEN useridentity.username
    ELSE useridentity.username
  END
  ) AS userName
  ,'' AS defaultDatabases
  ,json_extract_scalar(responseElements, '$.queryExecutionId') AS sessionId
  ,replace(eventtime, 'T', ' ') AS sessionStartTime
  ,replace(eventtime, 'T', ' ') AS startTime
  -- REPLACE(REPLACE(REPLACE(REPLACE(json_extract_scalar(requestParameters, '$.queryString'),
  -- 'schema1','awsdatacatalog.schema1'), 'schema2','awsdatacatalog.schema2'),
  -- 'schema3', 'awsdatacatalog.schema3'), 'schema4', 'awsdatacatalog.schema4') AS queryString
  ,json_extract_scalar(requestParameters, '$.queryString') AS queryString
  ,NULL AS seconds
  ,false as cancelled
FROM default.alation_qli
WHERE eventname = 'StartQueryExecution'
  AND CAST(ErrorCode AS VARCHAR) IS NULL
  AND json_extract_scalar(requestParameters, '$.queryString') <> 'SELECT 1'
  AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'SHOW%'
  AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'MSCK%'
  AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'DROP%DATABASE%'
  AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'DROP%SCHEMA%'
  AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'DESCRIBE%'
  AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'CREATE%DATABASE%'
  AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'CREATE%SCHEMA%'
  AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'ALTER%DATABASE%'
  AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'ALTER%SCHEMA%'
  AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'EXPLAIN%'
  AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'UNLOAD%';

QLI Configuration in Alation

On the Query Log Ingestion tab, you can select the QLI options for your data source and schedule the QLI job if necessary.

Table-Based QLI

Table-based QLI uses the view created from the external QLI table.

In the Table Name field under Connector Settings > Query Extraction, specify the name of the view in which the query logs are available. Make sure that the service account has permissions to select from this view. The table name must be provided in the following format: database.schema.view_name.

Custom Query-Based QLI

You can use a custom query to perform QLI instead of creating a view. In the Custom QLI Query field, provide the expected query structure as shown below and click Save. This query retrieves the information directly from the QLI table.

SELECT (
  CASE CAST(useridentity.type AS VARCHAR)
    WHEN 'AssumedRole'
    THEN SUBSTR(useridentity.principalid, position(':' IN useridentity.principalid)+1)
    WHEN 'IAMUser'
    THEN useridentity.username
    ELSE useridentity.username
  END
  ) AS userName
  ,'' AS defaultDatabases
  ,json_extract_scalar(responseElements, '$.queryExecutionId') AS sessionId
  ,replace(eventtime, 'T', ' ') AS sessionStartTime
  ,replace(eventtime, 'T', ' ') AS startTime
  -- REPLACE(REPLACE(REPLACE(REPLACE(json_extract_scalar(requestParameters, '$.queryString'),
  -- 'schema1','awsdatacatalog.schema1'), 'schema2','awsdatacatalog.schema2'),
  -- 'schema3', 'awsdatacatalog.schema3'), 'schema4', 'awsdatacatalog.schema4') AS queryString
  ,json_extract_scalar(requestParameters, '$.queryString') AS queryString
  ,NULL AS seconds
  ,false as cancelled
FROM default.alation_qli
WHERE eventname = 'StartQueryExecution'
  AND CAST(ErrorCode AS VARCHAR) IS NULL
  AND json_extract_scalar(requestParameters, '$.queryString') <> 'SELECT 1'
  AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'SHOW%'
  AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'MSCK%'
  AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'DROP%DATABASE%'
  AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'DROP%SCHEMA%'
  AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'DESCRIBE%'
  AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'CREATE%DATABASE%'
  AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'CREATE%SCHEMA%'
  AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'ALTER%DATABASE%'
  AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'ALTER%SCHEMA%'
  AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'EXPLAIN%'
  AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'UNLOAD%'
  AND daterange>=STARTTIME
  AND daterange<ENDTIME;

Perform QLI

To perform QLI on demand:

  1. Under the Automated and Manual Query Log Ingestion section on the Query Log Ingestion tab, make sure that the Enable Automated Query Log Ingestion toggle is off.

  2. Click Preview to get a sample of the query history data to be ingested.

  3. Click the Import button to begin QLI.

Schedule QLI

You can schedule QLI to be performed automatically on the schedule you set:

  1. Under the Automated and Manual Query Log Ingestion section on the Query Log Ingestion tab, enable the Enable Automated Query Log Ingestion toggle.

  2. Under Automated Query Log Ingestion Time, set a schedule by specifying values in the week, day and time fields. The next QLI job will run on the schedule you have specified.

    Note

    The hourly schedule for automated QLI is not supported.