OCF Connector for AWS Athena: Query Log Ingestion¶
Query log ingestion (QLI) from Athena uses AWS CloudTrail events and requires AWS CloudTrail to be enabled on the AWS account.
QLI Configuration in AWS¶
Verify that the CloudTrail for your AWS account logs Athena queries. The event type StartQueryExecution should be logged.
In Athena, create an external table using the SQL query example below. The external table can be created in any available database, including the Default database. When using the template, replace placeholder values with actual values for the table name and the CloudTrail S3 bucket name.
CREATE EXTERNAL TABLE AwsDataCatalog.default.alation_qli ( eventversion STRING, userIdentity STRUCT< type:STRING, principalid:STRING, arn:STRING, accountid:STRING, invokedby:STRING, accesskeyid:STRING, userName:STRING, sessioncontext:STRUCT< attributes:STRUCT< mfaauthenticated:STRING, creationdate:STRING>, sessionIssuer:STRUCT< type:STRING, principalId:STRING, arn:STRING, accountId:STRING, userName:STRING>>>, eventTime STRING, eventSource STRING, eventName STRING, awsRegion STRING, sourceIpAddress STRING, userAgent STRING, errorCode STRING, errorMessage STRING, requestParameters STRING, responseElements STRING, additionalEventData STRING, requestId STRING, eventId STRING, resources ARRAY<STRUCT< ARN:STRING, accountId: STRING, type:STRING>>, eventType STRING, apiVersion STRING, readOnly STRING, recipientAccountId STRING, serviceEventDetails STRING, sharedEventID STRING, vpcEndpointId STRING ) PARTITIONED BY ( daterange string) ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde' STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://<bucket>/AWSLogs/<account>/CloudTrail/<region>' TBLPROPERTIES ( 'projection.enabled' = 'true', 'projection.daterange.type' = 'date', 'projection.daterange.format' = 'yyyy/MM/dd', 'projection.daterange.interval' = '1', 'projection.daterange.interval.unit' = 'DAYS', 'projection.daterange.range' = '2022/03/01,NOW', 'storage.location.template' = 's3://<bucket>/AWSLogs/<account>/CloudTrail/<region>/${daterange}' );
We recommend creating a QLI view from the external table and using view-based QLI on the Alation side. Alternatively, you can use custom query-based QLI, which does not require creating a view. When using the view SQL below, make sure to substitute placeholder values with real values.
If objects in queries are not fully qualified, the queries will not be parsed in Alation even if you pass in
defaultDatabases = awsdatacatalog
. If objects in your queries are partially qualified, include the followingREPLACE
script to replace every schema name in the query SQL with the two-part identifier. Example:REPLACE (REPLACE (REPLACE (REPLACE (json_extract_scalar (requestParameters, '$.queryString'), 'schema1','awsdatacatalog.schema1'), 'schema2','awsdatacatalog.schema2'), 'schema3','awsdatacatalog.schema3'), 'schema4', 'awsdatacatalog.schema4') AS queryString
Save the fully qualified name of the QLI view. You will need to specify it in Alation when configuring QLI.
QLI View SQL¶
CREATE OR REPLACE VIEW awsdatacatalog.default.test_QLI_view AS
SELECT (
CASE CAST(useridentity.type AS VARCHAR)
WHEN 'AssumedRole'
THEN SUBSTR(useridentity.principalid, position(':' IN useridentity.principalid)+1)
WHEN 'IAMUser'
THEN useridentity.username
ELSE useridentity.username
END
) AS userName
,'' AS defaultDatabases
,json_extract_scalar(responseElements, '$.queryExecutionId') AS sessionId
,replace(eventtime, 'T', ' ') AS sessionStartTime
,replace(eventtime, 'T', ' ') AS startTime
-- REPLACE(REPLACE(REPLACE(REPLACE(json_extract_scalar(requestParameters, '$.queryString'),
-- 'schema1','awsdatacatalog.schema1'), 'schema2','awsdatacatalog.schema2'),
-- 'schema3', 'awsdatacatalog.schema3'), 'schema4', 'awsdatacatalog.schema4') AS queryString
,json_extract_scalar(requestParameters, '$.queryString') AS queryString
,NULL AS seconds
,false as cancelled
FROM default.alation_qli
WHERE eventname = 'StartQueryExecution'
AND CAST(ErrorCode AS VARCHAR) IS NULL
AND json_extract_scalar(requestParameters, '$.queryString') <> 'SELECT 1'
AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'SHOW%'
AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'MSCK%'
AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'DROP%DATABASE%'
AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'DROP%SCHEMA%'
AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'DESCRIBE%'
AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'CREATE%DATABASE%'
AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'CREATE%SCHEMA%'
AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'ALTER%DATABASE%'
AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'ALTER%SCHEMA%'
AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'EXPLAIN%'
AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'UNLOAD%';
QLI Configuration in Alation¶
On the Query Log Ingestion tab, you can select the QLI options for your data source and schedule the QLI job if necessary.
Table-Based QLI¶
Table-based QLI uses the view created from the external QLI table.
In the Table Name field under Connector Settings > Query Extraction, specify the name of the view in which the query logs are available. Make sure that the service account has permissions to select from this view. The table name must be provided in the following format: database.schema.view_name
.
Custom Query-Based QLI¶
You can use a custom query to perform QLI instead of creating a view. In the Custom QLI Query field, provide the expected query structure as shown below and click Save. This query retrieves the information directly from the QLI table.
SELECT (
CASE CAST(useridentity.type AS VARCHAR)
WHEN 'AssumedRole'
THEN SUBSTR(useridentity.principalid, position(':' IN useridentity.principalid)+1)
WHEN 'IAMUser'
THEN useridentity.username
ELSE useridentity.username
END
) AS userName
,'' AS defaultDatabases
,json_extract_scalar(responseElements, '$.queryExecutionId') AS sessionId
,replace(eventtime, 'T', ' ') AS sessionStartTime
,replace(eventtime, 'T', ' ') AS startTime
-- REPLACE(REPLACE(REPLACE(REPLACE(json_extract_scalar(requestParameters, '$.queryString'),
-- 'schema1','awsdatacatalog.schema1'), 'schema2','awsdatacatalog.schema2'),
-- 'schema3', 'awsdatacatalog.schema3'), 'schema4', 'awsdatacatalog.schema4') AS queryString
,json_extract_scalar(requestParameters, '$.queryString') AS queryString
,NULL AS seconds
,false as cancelled
FROM default.alation_qli
WHERE eventname = 'StartQueryExecution'
AND CAST(ErrorCode AS VARCHAR) IS NULL
AND json_extract_scalar(requestParameters, '$.queryString') <> 'SELECT 1'
AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'SHOW%'
AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'MSCK%'
AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'DROP%DATABASE%'
AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'DROP%SCHEMA%'
AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'DESCRIBE%'
AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'CREATE%DATABASE%'
AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'CREATE%SCHEMA%'
AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'ALTER%DATABASE%'
AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'ALTER%SCHEMA%'
AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'EXPLAIN%'
AND json_extract_scalar(requestParameters, '$.queryString') NOT LIKE 'UNLOAD%'
AND daterange>=STARTTIME
AND daterange<ENDTIME;
Perform QLI¶
To perform QLI on demand:
Under the Automated and Manual Query Log Ingestion section on the Query Log Ingestion tab, make sure that the Enable Automated Query Log Ingestion toggle is off.
Click Preview to get a sample of the query history data to be ingested.
Click the Import button to begin QLI.
Schedule QLI¶
You can schedule QLI to be performed automatically on the schedule you set:
Under the Automated and Manual Query Log Ingestion section on the Query Log Ingestion tab, enable the Enable Automated Query Log Ingestion toggle.
Under Automated Query Log Ingestion Time, set a schedule by specifying values in the week, day and time fields. The next QLI job will run on the schedule you have specified.
Note
The hourly schedule for automated QLI is not supported.