Hive 2 on EMR with QLI over Amazon S3

Hive on EMR differs from other Hive configurations in that the query logs are typically stored on Amazon S3.

Configuring EMR Hive in Alation

Follow the instructions for setting up a Hive connection in Alation for Metadata Extraction and Profiling/Sampling.

Query Log Ingestion Setup for EMR Hive

Choose AWS S3 as the connection type. This displays the fields relevant to Amazon S3 connection setup. Access ID and Secret Key are mandatory. If the Region is not specified, it defaults to us-east-1. The Region name should match the value listed in the AWS API Gateway Table.
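The field rules above can be illustrated with a minimal Python sketch. It is not Alation's implementation: the names S3LogConnection and build_connection are hypothetical and only model the behavior described here (two mandatory credentials and a Region that falls back to us-east-1).

```python
from dataclasses import dataclass

DEFAULT_REGION = "us-east-1"  # fallback Region described in the text


@dataclass(frozen=True)
class S3LogConnection:
    """Hypothetical container for the S3 connection fields (not an Alation API)."""
    access_id: str
    secret_key: str
    region: str = DEFAULT_REGION


def build_connection(access_id, secret_key, region=None):
    """Validate the mandatory fields and apply the default Region."""
    # Access ID and Secret Key are mandatory, so reject empty or missing values.
    if not access_id or not secret_key:
        raise ValueError("Access ID and Secret Key are mandatory")
    # An unspecified Region falls back to us-east-1.
    return S3LogConnection(access_id, secret_key, region or DEFAULT_REGION)
```

For example, build_connection("my-access-id", "my-secret") yields a connection whose region is us-east-1, while passing region="eu-west-1" keeps the explicit value.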

../../_images/hive2-1.png

For Amazon S3, the log path format is /bucketname/path/to/logdirectory/ or /bucketname/path/to/logfile.gzip.
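As a rough illustration of this format, the following Python sketch splits a log path into its bucket name and the remainder of the path. The function name split_log_path is hypothetical, and treating a trailing slash as the sign of a directory is an assumption made for the example, not documented Alation behavior.

```python
def split_log_path(log_path):
    """Split '/bucketname/path/to/logdirectory/' into (bucket, key)."""
    if not log_path.startswith("/"):
        raise ValueError("log path must start with '/'")
    # The first path component is the bucket; everything after it is the key.
    bucket, _, key = log_path[1:].partition("/")
    if not bucket:
        raise ValueError("log path must include a bucket name")
    return bucket, key


def looks_like_directory(log_path):
    """Assumption for this sketch: a trailing '/' marks a log directory."""
    return log_path.endswith("/")
```

With the paths from the text, split_log_path("/bucketname/path/to/logdirectory/") returns the bucket "bucketname" and the key "path/to/logdirectory/".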

EMR archives the query logs and stores them on Amazon S3. Alation assumes that the files are archived. For ephemeral (transient) clusters, specify the master log path. Alation traverses the directory tree rooted at the master log path and finds the logs beneath it.

Example:

If the actual log paths are:

/path/to/log/file1

/path/to/log/file2

You can specify the master log path as: /path/
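The effect of choosing a master log path can be sketched in Python as a prefix match over known log paths. This is only an illustration of the traversal idea, not how Alation implements it; find_logs is a hypothetical helper, and /other/file3 is an extra path added here to show what is excluded.

```python
def find_logs(all_paths, master_log_path):
    """Return the paths that sit under the master log path, in sorted order."""
    # Normalize the root so that '/path' does not also match '/pathology/...'.
    if master_log_path.endswith("/"):
        root = master_log_path
    else:
        root = master_log_path + "/"
    return sorted(p for p in all_paths if p.startswith(root))


known_paths = [
    "/path/to/log/file1",
    "/path/to/log/file2",
    "/other/file3",  # hypothetical path outside the master log path
]
logs = find_logs(known_paths, "/path/")
```

Here logs contains /path/to/log/file1 and /path/to/log/file2, while /other/file3 is skipped because it does not sit under /path/.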