Azure Databricks OCF Connector: Install and Configure¶
Prerequisites¶
Network Connectivity¶
Open outbound TCP port 443 to the Azure Databricks server.
Service Account¶
Create a service account for Azure Databricks. For details, refer to Manage users - Azure Databricks.
Permissions for Metadata Extraction and Profiling¶
The service account must have the Can Attach To permission to connect to the cluster. Refer to Permission Requirements in the Microsoft documentation for more details.
Optionally, grant the Can Restart permission so that a terminated cluster is started automatically when the connector attempts to connect.
The service account must have the following permissions to perform MDE and profiling:
Workspace access — Refer to Manage users.
Cluster level access — Refer to Cluster access control.
By default, all cluster users have access to all data stored in a cluster’s managed tables unless table access control is enabled for that cluster. The table access control option is only available for high-concurrency clusters. Refer to Table Access Control for more information.
If the Table Access Control option is enabled on the cluster:
Grant the SELECT privilege on all schemas and all their tables and views in a catalog:

GRANT USAGE ON CATALOG <catalog-name> TO `<user>@<domain-name>`;
GRANT SELECT ON CATALOG <catalog-name> TO `<user>@<domain-name>`;

Grant the SELECT privilege on a specific schema and all its tables and views:

GRANT USAGE ON SCHEMA <schema-name> TO `<user>@<domain-name>`;
GRANT SELECT ON SCHEMA <schema-name> TO `<user>@<domain-name>`;

Grant the SELECT privilege on specific tables and views in a schema:

GRANT USAGE ON SCHEMA <schema-name> TO `<user>@<domain-name>`;
GRANT SELECT ON TABLE <schema-name>.<table-name> TO `<user>@<domain-name>`;
GRANT SELECT ON VIEW <schema-name>.<view-name> TO `<user>@<domain-name>`;
Refer to Data object privileges for more information.
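If you prefer to run these grants from a notebook, the following minimal sketch shows the schema-level case using spark.sql. The schema and service account values are placeholders; adjust them to your environment.

# Minimal sketch: run the schema-level grants from a notebook on a cluster with
# table access control enabled. The schema and account values below are hypothetical.
service_account = "svc-alation@example.com"   # hypothetical service account
schema = "databricks_demo"                    # hypothetical schema

spark.sql(f"GRANT USAGE ON SCHEMA {schema} TO `{service_account}`")
spark.sql(f"GRANT SELECT ON SCHEMA {schema} TO `{service_account}`")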
JDBC URI¶
Format¶
Use the following format to build the JDBC URI:
spark://<hostname>:443/default;transportMode=http;ssl=1;httpPath=<HTTP_Path>;AuthMech=3;
Example¶
spark://adb-900784758547414.14.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/900734538547414/1012-1275722-nju5lmv8;AuthMech=3;
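As an illustration, the sketch below assembles the URI from the cluster’s Server Hostname and HTTP Path (typically shown on the cluster’s JDBC/ODBC tab); the values are the example values above.

# Sketch: build the JDBC URI from the cluster's Server Hostname and HTTP Path.
hostname = "adb-900784758547414.14.azuredatabricks.net"
http_path = "sql/protocolv1/o/900734538547414/1012-1275722-nju5lmv8"

jdbc_uri = (
    f"spark://{hostname}:443/default;"
    f"transportMode=http;ssl=1;httpPath={http_path};AuthMech=3;"
)
print(jdbc_uri)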
QLI Configuration in Azure Databricks¶
File-Based QLI¶
This is the recommended approach. For general information about file-based QLI, see File-Based Query Log Ingestion.
Note
QLI is supported for standard and high concurrency Databricks clusters.
QLI configuration requires that you enable logging on your Databricks cluster. There are two ways to enable logs for QLI: add a Python init script to your cluster (recommended) or run a Python script to enable logs every time the cluster is started or restarted.
Alation recommends the init script method because it generates smaller log files and does not require an admin to rerun the script after each cluster restart.
Enable Logs with an Init Script¶
This is the recommended option. Perform the steps in this section to enable QLI using the Python init script:
In the Databricks settings portal, ensure that the Cluster Log Path and Destination are set under the Logging tab. Do not leave the Destination path as None.
For more information on how to set the logging path in DBFS, see the Microsoft documentation. You need to mount external storage onto DBFS for log storage. For more information on the mount process, see the Azure Databricks documentation on mounting cloud object storage.
Create a Python notebook and run the following script on your Databricks cluster. This script creates the scripts directory where the init script will be stored:
dbutils.fs.mkdirs("dbfs:/databricks/scripts/")
In the same notebook, run the following script to create the init script file in the scripts directory.
dbutils.fs.put("/databricks/scripts/init.sh",""" #!/bin/bash echo "Executing on Driver: $DB_IS_DRIVER" if [[ $DB_IS_DRIVER = "TRUE" ]]; then LOG4J_PATH="/home/ubuntu/databricks/spark/dbconf/log4j/driver/log4j.properties" else LOG4J_PATH="/home/ubuntu/databricks/spark/dbconf/log4j/executor/log4j.properties" fi echo "Adjusting log4j.properties here: ${LOG4J_PATH}" echo "log4j.logger.org.apache.spark.sql.execution.SparkSqlParser=DEBUG" >> ${LOG4J_PATH} echo "log4j.appender.publicFile.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss.SS} [%t] %p %c{1}: %m%n" >> ${LOG4J_PATH}""", True)Note
Alternatively, you can create the init script given below locally and copy it to the Databricks cluster using the following command:
dbfs cp init.sh dbfs:/databricks/scripts/init.sh
Use the following command to make sure that the script was created successfully:
display(dbutils.fs.ls("dbfs:/databricks/scripts/init.sh"))

Use the cluster configuration page to configure the cluster to run the init script: Add init script.
Restart the cluster.
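To confirm that DEBUG logging is active after the restart, a quick check like the following sketch can help. It assumes the example cluster log destination used later on this page; substitute your own cluster ID and log path.

# Sketch: run a query so SparkSqlParser emits a DEBUG line, then confirm that driver
# log files are being delivered to the configured cluster log path (example path).
spark.sql("SELECT 1").collect()

display(dbutils.fs.ls("dbfs:/cluster-logs/0130-102557-aft119/driver/"))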
Enable Logs With a Python Script¶
In the Databricks settings portal, ensure that the Cluster Log Path and Destination are set under the Logging tab. Do not leave the Destination path as None.
For more information on how to set the logging path in DBFS, see the corresponding Databricks documentation. You can choose to mount external storage onto DBFS for log storage; for details on the mount process, see the Azure Databricks documentation on mounting cloud object storage.
Open a Databricks Python notebook. For more information on creating a notebook, see Create a Notebook.
Copy the contents of the Python script below to enable DEBUG query logs that contain SQL queries and to set the pattern in which log lines are written. Run this Python script whenever the cluster is started or restarted.
Python script to enable DEBUG query logs:
log4j = spark.sparkContext._jvm.org.apache.log4j
log4j.LogManager.getRootLogger().setLevel(log4j.Level.DEBUG)
ca = log4j.LogManager.getRootLogger().getAppender("publicFile")
ca.setLayout(log4j.PatternLayout("%d{yyyy-MM-dd HH:mm:ss.SS} [%t] %p %c{1}: %m%n"));
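Optionally, you can verify the change with a check like the one below (a sketch); it reads back the root logger level set by the script above.

# Sketch: confirm the root logger level after running the script above.
log4j = spark.sparkContext._jvm.org.apache.log4j
print(log4j.LogManager.getRootLogger().getLevel().toString())   # expect DEBUG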
Table-Based QLI¶
QLI can also be supported using an external table. For more information on external tables, see the corresponding Databricks documentation.
Steps to be done on the Databricks side:¶
After you have enabled the DEBUG-level logs, create the logs directory in the cluster path:
dbutils.fs.mkdirs("dbfs:/cluster-logs/<cluster-id>/logs/")
Example:
dbutils.fs.mkdirs("dbfs:/cluster-logs/0130-102557-aft119/logs/")
Note
Create the required folder before running the next Python script. If the folder does not exist, a “File Not Found” error will occur.
Open a new Databricks Python notebook.
Copy the contents of the Python script below into a Python notebook cell. Replace the input_dir and output_dir values:

If you are accessing directories mounted onto the Databricks File System (DBFS), use the prefix /dbfs/mnt/<mounted_filepath>. Example: /dbfs/mnt/cluster-logs/0130-102557-aft119/driver/, where /mnt/cluster-logs is the path to the mounted Azure Blob file system.

If you are accessing log files that are directly in DBFS, use the prefix /dbfs/<filepath>. Example: /dbfs/0206-072111-lox669/driver/.
Python Script to enable QLI:
import gzip
import re
from datetime import datetime, timedelta
input_dir = '/dbfs/mnt/cluster-logs/0130-102557-aft119/driver/'
output_dir = '/dbfs/mnt/cluster-logs/0130-102557-aft119/logs/'
required_logger_line_regex = r'^\d+-\d+-\d+ \d+:\d+:\d+\.\d* \[[^\]]*?\] \S+ (?:SparkSqlParser|audit): .*'
logger_line_regex = r'^\d+-\d+-\d+ \d+:\d+:\d+\.\d* \[[^\]]*?\] \S+ \S+:.*'
prev_line = ''
line_count = 0
file_date_hour = str((datetime.now() - timedelta(days=0, hours=1)).strftime('%Y-%m-%d-%H'))
input_file_name = input_dir + 'log4j-' + file_date_hour + '.log.gz'
output_file_name = output_dir + 'log4j-' + file_date_hour + '.log.gz'
print('Transforming log4j-{}.log.gz'.format(file_date_hour))
with gzip.open(input_file_name, 'rt') as fin:
    with gzip.open(output_file_name, 'wt') as out:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            if re.match(required_logger_line_regex, line):
                # A new SparkSqlParser/audit line: flush the previously buffered line.
                if prev_line:
                    out.write(prev_line + '\n')
                    line_count += 1
                prev_line = line
            else:
                if re.match(logger_line_regex, line):
                    # A line from another logger ends any buffered entry.
                    if prev_line:
                        out.write(prev_line + '\n')
                        prev_line = ''
                        line_count += 1
                elif re.match(required_logger_line_regex, prev_line):
                    # Continuation of a multi-line query: append it to the buffered line.
                    prev_line = prev_line + ' ' + line
        if prev_line:
            out.write(prev_line + '\n')
            line_count += 1

print('Lines written: ', str(line_count))
Note
Specify the correct file paths for input_dir and output_dir. If either path is not specified correctly, the script will not be able to locate the files.
Run the script after the previous hour's log file is created and verify that you get output similar to the following:
Expected output when the previous hour's file is log4j-2019-05-02-10.log.gz and the file contains no query or user information:

Transforming log4j-2019-05-02-10.log.gz
Lines written: 0

Expected output when the previous hour's file is log4j-2019-05-02-11.log.gz and the file contains query or user information:

Transforming log4j-2019-05-02-11.log.gz
Lines written: 275
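A quick way to confirm the transform worked is to list the output directory; the sketch below uses the example mounted path from the script above.

# Sketch: the transformed file for the previous hour should now exist in output_dir.
display(dbutils.fs.ls("dbfs:/mnt/cluster-logs/0130-102557-aft119/logs/"))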
Schedule the Python script to run at or after ten minutes past the hour. If the script is scheduled exactly on the hour, the log file will not yet be available: there is a delay of about five minutes while the log contents are written and the file is compressed into gzip format. Scheduling the script at least ten minutes past the hour ensures that the log file is available.
Create an External Table with the location property set to the directory where the new files are stored.
To set the Location property to a directory mounted in DBFS, use the prefix /mnt/<mounted_filepath>. Example: /mnt/cluster-logs/0130-102557-aft119/logs/, where /mnt/cluster-logs is the mounted path to a Blob file system.

To set the Location property to a directory present directly in DBFS, use the prefix /<filepath>. Example: /0206-072111-lox669/logs/.
Sample External Table Creation Query:
DROP TABLE IF EXISTS databricks_demo.external_log_table;
CREATE EXTERNAL TABLE
databricks_demo.external_log_table(
date_time_string STRING,
thread_name STRING,
level STRING,
logger STRING,
message STRING
)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES
("input.regex" = "^(\\S+ \\S+) \\[(.*?)\\] (\\S+) (\\S+): (.*?)")
LOCATION
"/mnt/cluster-logs/0130-102557-aft119/logs/";
Ensure that the external table is populated with the data from the files stored in the output_dir by running a select query. Example:

SELECT * FROM <schema>.<external_table_name>;
Create the alation_qli view, which takes this external table as input and returns the data in the format required by Alation.
View Query for QLI:
DROP VIEW IF EXISTS databricks_demo.alation_qli;
CREATE VIEW
databricks_demo.alation_qli
AS
SELECT
distinct *, CONCAT(userName, '_', startTime) sessionId
FROM (
SELECT
a.date_time_string startTime,
regexp_extract(a.message, "Parsing command: (.*)", 1) queryString,
CASE
WHEN b.message is null THEN 'unknown'
WHEN SUBSTR(b.message, 6, 11) = 'Basic token' THEN TRIM(SUBSTR(b.message, 18, instr(b.message, 'ip=') - 19))
ELSE TRIM(SUBSTR(b.message, 5, instr(b.message, 'ip=') - 6))
END AS userName
FROM
databricks_demo.external_log_table a
LEFT OUTER JOIN
databricks_demo.external_log_table b ON (a.thread_name = b.thread_name)
WHERE
a.logger = 'SparkSqlParser'
and b.logger = 'audit'
and b.date_time_string > a.date_time_string
);
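Once the view exists, a quick sanity check such as the sketch below confirms that it returns the startTime, userName, sessionId, and queryString values that Alation ingests.

# Sketch: preview the most recent parsed queries exposed by the alation_qli view.
display(spark.sql("""
    SELECT startTime, userName, sessionId, queryString
    FROM databricks_demo.alation_qli
    ORDER BY startTime DESC
    LIMIT 10
"""))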
Installation¶
Step 1: Install the Connector¶
Alation On-Prem¶
Important
Installation of OCF connectors requires Alation Connector Manager to be installed as a prerequisite.
To install an OCF connector:
If this has not been done on your instance, install the Alation Connector Manager: Install Alation Connector Manager.
Ensure that the OCF connector Zip file that you received from Alation is available on your local machine.
Install the connector on the Connectors Dashboard page using the steps in Manage Connectors.
Alation Cloud Service¶
Connecting to Cloud Databases¶
Note
On Alation Cloud Service instances, Alation Connector Manager is available by default.
To install an OCF connector:
Ensure that the OCF connector Zip file that you received from Alation is available on your local machine.
Install the connector on the Connectors Dashboard page using the steps in Manage Connectors.
Connecting to On-Prem Databases¶
Important
To connect to an on-prem database from an Alation Cloud Service instance, you may need to use Alation Agent.
Ensure that the Alation Agent functionality is enabled on your Alation instance. Create a Support ticket with Alation for an Alation representative to enable the Alation Agent feature on your instance and to receive the Alation Agent installer.
Install Alation Agent using the information in Alation Agent.
Install the OCF connector on Alation Agent.
Step 2: Create and Configure a New Azure Databricks Data Source¶
In Alation, add a new data source:
Log in to Alation as a Server Admin.
Expand the Apps menu on the right of the main toolbar and select Sources.
On the Sources page, click +Add on the top right of the page and in the list that opens, click Data Source. This will open the Add a Data Source wizard.
On the first screen of the wizard, specify a name for your data source, assign more Data Source Admins, if necessary, and click the Continue Setup button on the bottom of the screen. The Add a Data Source screen will open.
The only field that you need to populate on the Add a Data Source screen is Database Type. From the Database Type dropdown, select the connector name. You will be navigated to the Settings page of your new data source.
The name of this connector is Azure Databricks OCF Connector.

Access¶
On the Access tab, set the data source visibility using these options:
Public Data Source—The data source will be visible to all users of the catalog.
Private Data Source—The data source will be visible to the users allowed access to the data source by Data Source Admins.
You can add new Data Source Admin users in the Data Source Admins section.
General Settings¶
Perform the configuration on the General Settings tab:
Application Settings¶
Not applicable.
Connector Settings¶
Data Source Connection¶
Populate the data source connection information and save the values by clicking Save in this section.
| Parameter | Description |
|---|---|
| JDBC URI | Specify the JDBC URI in the required format. |
| Username | Specify the service account username. |
| Password | Specify the service account password. |
Logging Configuration¶
Select the logging level for the connector logs and save the values you provided by clicking Save in this section. The available log levels are based on the Log4j framework.
| Parameter | Description |
|---|---|
| Log Level | Select the log level to generate logs. The available options are INFO, DEBUG, WARN, TRACE, ERROR, FATAL, ALL. |
Obfuscate Literals¶
Obfuscate Literals — Enable this toggle to hide the details of the queries in the catalog page that are ingested via QLI or executed in Compose. This toggle is disabled by default.
Test Connection¶
Under Test Connection, click Test to validate network connectivity.
Metadata Extraction¶
You can configure metadata extraction (MDE) for an OCF data source on the Metadata Extraction tab of the Settings page. Refer to Configure Metadata Extraction for OCF Data Sources for information about the available configuration options.

Compose¶
For details about configuring the Compose tab of the Settings, refer to Configure Compose for OCF Data Sources.
Note
OAuth connection is not supported for this data source.
Sampling and Profiling¶
Sampling and profiling is supported. For details, see Configure Sampling and Profiling for OCF Data Sources.
Per-Object Parameters¶
Refer to Sampling and Profiling.
Query Log Ingestion¶
On the Query Log Ingestion tab, you can select the QLI options for your data source and schedule the QLI job if necessary.

Connector Settings¶
Query Log Ingestion Type¶
Select the Query Log Ingestion Type as Table based query log ingestion or File based query log ingestion.
Table Based QLI¶
In Connector Settings > Query Log Ingestion Type > Query to Execute, enter the following query to enable table-based QLI. Do not substitute the values STARTTIME1 and STARTTIME2; use them as is. Click Import.
SELECT
SUBSTR(startTime, 1, 19) startTime,
queryString,
userName,
sessionID
FROM
databricks_demo.alation_qli
WHERE
startTime between 'STARTTIME1' and 'STARTTIME2'
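If you want to test the query manually in a notebook before configuring it in Alation, substitute literal timestamps for the placeholders, as in the sketch below (the date range is hypothetical). In Alation itself, keep STARTTIME1 and STARTTIME2 unchanged.

# Sketch: manual test of the table-based QLI query with literal timestamps.
display(spark.sql("""
    SELECT SUBSTR(startTime, 1, 19) startTime,
           queryString,
           userName,
           sessionId
    FROM databricks_demo.alation_qli
    WHERE startTime BETWEEN '2019-05-02 00:00:00' AND '2019-05-03 00:00:00'
"""))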
File-Based Query Log Ingestion¶
Select the storage type as one of the following:
Azure Blob Storage
ADLS
Paste the JSON given below into the Log Extraction Configuration Json field. In the JSON, change the folderPath value to the path of the storage mount where the generated logs (the Logging Path you have configured under Logging on the Databricks side) are stored and click Save.

{
  "folderPath":"/path/to/log/file/",
  "nThread":"10",
  "threadTimeOut":"2000",
  "parserType":"LOG4J",
  "log4jConversionPattern":"TIMESTAMP [THREAD] LEVEL LOGGER MESSAGE",
  "log4jTimeFormat":"yyyy-MM-dd HH:mm:ss.SSS",
  "requiredExtraction":[
    {
      "fieldName":"extractSqlQuery",
      "keyValuePair":{
        "loggerName":"SparkSqlParser",
        "regex":"Parsing command:(?<queryString>[\\w\\W]*)"
      }
    },
    {
      "fieldName":"extractUserInfo",
      "keyValuePair":{
        "loggerName":"audit",
        "regex":"ugi=(?:\\(Basic token\\))?(?<userName>[\\S]+)"
      }
    },
    {
      "fieldName":"extractTimeTaken",
      "keyValuePair":{
        "loggerName":"Retrieve",
        "regex":"Execution Time = (?<milliSeconds>[\\d]+)"
      }
    }
  ]
}
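If QLI ingests no rows, it can help to sanity-check the two extraction regexes against a sample log line, as in the sketch below. Note that the connector configuration uses Java named groups ((?<name>...)), while Python's equivalent syntax is (?P<name>...); the sample lines are hypothetical.

# Sketch: verify the query and user extraction regexes against hypothetical log lines.
import re

query_line = 'Parsing command: select * from databricks_demo.sales'
audit_line = 'ugi=svc-alation@example.com ip=/10.0.0.1 cmd=get_tables'

print(re.search(r'Parsing command:(?P<queryString>[\w\W]*)', query_line)
      .group('queryString'))
print(re.search(r'ugi=(?:\(Basic token\))?(?P<userName>\S+)', audit_line)
      .group('userName'))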
Configure Azure Blob Storage Connection¶
Provide the Azure Blob Storage details in the following fields if Azure Blob Storage is selected as storage type and click Save:
| Parameter | Description |
|---|---|
| Use Shared Access Signature | Select the Use Shared Access Signature checkbox to use the shared access signature for authentication. |
| Storage Account Name | Provide the storage account name. |
| Access Key/Shared Access Signature | Provide the shared access signature if the Use Shared Access Signature checkbox is selected. If not, provide the access key. |
| Blob Container | Provide the blob container name. |
| Storage Endpoint Suffix | Specify the storage endpoint suffix according to the data source. |
Configure ADLS Connection¶
Provide the ADLS details in the following fields if ADLS is selected as storage type and click Save:
| Parameter | Description |
|---|---|
| ADLS URI | Provide the ADLS URI. |
| Tenant ID | Provide the tenant ID. |
| Client ID | Provide the client ID. |
| Client Secret | Provide the client secret. |
| Storage Endpoint Suffix | Specify the storage endpoint suffix according to the data source. |
Automated and Manual Query Log Ingestion¶
You can either perform QLI manually on demand or enable automated QLI:
To perform manual QLI, under the Automated and Manual Query Log Ingestion section of the Query Log Ingestion tab, ensure that the Enable Automated Query Log Ingestion toggle is disabled.
Click Preview to get a sample of the query history data to be ingested.
Click the Import button to perform QLI on demand.
Note
Metadata extraction must be completed before running QLI.
To schedule QLI, enable the Enable Automated Query Log Ingestion toggle.
Set a schedule under Automated Query Log Ingestion Time by specifying values in the week, day and time fields. The next QLI job will run on the schedule you have specified.
Note
The hourly schedule for automated QLI is not supported.
Custom Settings¶
This configuration option is available if Profiling V2 is enabled.
To profile a column, Alation runs a default query if no custom query is specified for this column. Default queries for column profiling differ depending on the data type: numeric or non-numeric.
The default query supplied by Alation can be modified based on your requirements and data source type. Use the following default query templates to revert from a custom query to the default query:
Numeric columns:
SELECT
  MIN({column_name}) AS MIN,
  MAX({column_name}) AS MAX,
  AVG({column_name}) AS MEAN,
  (COUNT(*) - COUNT({column_name})) AS "#NULL",
  (CASE WHEN COUNT(*) > 0 THEN ((COUNT(*) - COUNT({column_name})) * 100.0 / COUNT(*)) ELSE 0.0 END) AS "%NULL"
FROM {schema_name}.{table_name};
Non numeric columns:
SELECT
  ((SUM(CASE WHEN {column_name} IS NULL THEN 1 ELSE 0 END))) AS "#NULL",
  (CASE WHEN COUNT(*) > 0 THEN ((((SUM(CASE WHEN {column_name} IS NULL THEN 1 ELSE 0 END))) * 100.0 / count(*)) ) ELSE 0.0 END ) AS "%NULL"
FROM {schema_name}.{table_name};
Important
The profiling queries MUST contain {column_name}, {schema_name}, and {table_name}. Users must check the compatibility of the default query based on the data source type and modify it if required.
The default profiling query calculates the Profiling stats that are displayed on the Overview tab of the Column catalog page. When you customize the query, you can also customize the statistics that should be calculated and displayed.
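For example, you could try out a customized numeric-column query in a notebook before saving it in Alation, as in the sketch below. The added STDDEV statistic and the schema, table, and column names are hypothetical; in Alation, keep the {schema_name}, {table_name}, and {column_name} placeholders instead of literal names.

# Sketch: a customized numeric profiling query with an extra STDDEV statistic,
# tested against a hypothetical table before being adapted for Alation.
display(spark.sql("""
    SELECT MIN(amount)                AS MIN,
           MAX(amount)                AS MAX,
           AVG(amount)                AS MEAN,
           STDDEV(amount)             AS STDDEV,
           (COUNT(*) - COUNT(amount)) AS `#NULL`
    FROM databricks_demo.sales
"""))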
Troubleshooting¶
Refer to Troubleshooting.