Databricks Unity Catalog OCF Connector: Install and Configure¶
Applies to: Alation Cloud Service and customer-managed instances of Alation
Prerequisites¶
Unity Catalog is enabled in Databricks.
The workspaces have been assigned to the Unity Catalog metastore.
There is a running Unity Catalog-compatible interactive cluster or SQL warehouse that Alation will connect to for metadata extraction.
For Lineage Extraction (Beta) and Query Log Ingestion (Beta), the system schema system.access is enabled.
Network Connectivity¶
Open inbound TCP port 443 to the Databricks Unity Catalog server.
Extraction of Complex Data Types¶
Complex data types, such as map, array, and struct, are extracted.
To enable their representation in the Alation user interface as a tree structure, set the alation_conf parameter alation.feature_flags.enable_generic_nosql_support to True on your Alation instance.
Additionally, you can use the parameter alation.feature_flags.docstore_tree_table_depth to define the depth of the display. By default, three levels are displayed.
Important
After changing the values of these parameters, restart Alation Supervisor from the Alation shell: alation_supervisor restart all.
For information about using alation_conf, refer to Using alation_conf.
Create Service Account¶
Create a Databricks account-level user to be used as a service account in Alation.
Assign the service account the USAGE and SELECT permissions on all catalog, schema, and table objects that you want to catalog in Alation (see the example after this list).
Lineage extraction requires additional permissions. See Permissions for Lineage Extraction below.
Query log ingestion (QLI) requires additional permissions. See Permissions for QLI below.
Assign the service account to workspace(s) using the information in Manage users, service principals, and groups. The service account must be assigned to the same workspace(s) as the cluster or SQL warehouse.
Assign the service account the Can Use permission on the cluster or SQL warehouse.
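For reference, here is a minimal sketch of the catalog-level grants in SQL, assuming a hypothetical service account alation_svc@example.com and hypothetical object names. Unity Catalog expresses the usage privilege as USE CATALOG and USE SCHEMA; the Can Use permission on compute is assigned in the workspace UI, not through SQL.

```sql
-- Hypothetical principal and object names; adjust to your environment.
GRANT USE CATALOG ON CATALOG main TO `alation_svc@example.com`;
GRANT USE SCHEMA ON SCHEMA main.sales TO `alation_svc@example.com`;
GRANT SELECT ON TABLE main.sales.orders TO `alation_svc@example.com`;
```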
Permissions for Lineage Extraction¶
The Unity Catalog lineage feature is currently in Public Preview in Databricks and may require separate enablement. Contact your Databricks administrator about enabling access to this feature. Lineage extraction in Alation uses this functionality and is currently a beta feature.
Lineage extraction requires access to the system catalog, the system.access schema, and the tables in this schema. Grant the Alation service account these permissions:
USE CATALOG on catalog system
USE SCHEMA on schema system.access
SELECT on table system.access.table_lineage
SELECT on table system.access.column_lineage
The service account does not require USE or SELECT permissions for all catalogs, schemas, and tables captured in the lineage records in the system.access lineage tables. All lineage will be extracted. Any objects that are not cataloged but exist in the system.access tables will be marked as temporary (TMP) on Lineage diagrams unless temporary objects have been disabled.
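In SQL, these grants might look like the following sketch; the principal alation_svc@example.com is a placeholder for your service account.

```sql
-- Grants required for direct lineage extraction (placeholder principal).
GRANT USE CATALOG ON CATALOG system TO `alation_svc@example.com`;
GRANT USE SCHEMA ON SCHEMA system.access TO `alation_svc@example.com`;
GRANT SELECT ON TABLE system.access.table_lineage TO `alation_svc@example.com`;
GRANT SELECT ON TABLE system.access.column_lineage TO `alation_svc@example.com`;
```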
Permissions for QLI¶
The Unity Catalog audit log feature is currently in Public Preview in Databricks and may require separate enablement. Contact your Databricks administrator about enabling access to this feature. Query history extraction and ingestion (QLI) in Alation uses this functionality and is currently a beta feature.
QLI requires access to the system catalog, the system.access schema, and the system.access.audit table in this schema. Grant the Alation service account these permissions:
USE CATALOG on catalog system
USE SCHEMA on schema system.access
SELECT on table system.access.audit
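As with lineage, these grants can be issued in SQL; a minimal sketch with a placeholder principal:

```sql
-- Grants required for QLI (placeholder principal).
GRANT USE CATALOG ON CATALOG system TO `alation_svc@example.com`;
GRANT USE SCHEMA ON SCHEMA system.access TO `alation_svc@example.com`;
GRANT SELECT ON TABLE system.access.audit TO `alation_svc@example.com`;
```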
Authentication¶
The connector supports token-based authentication.
Follow the steps in Generate a personal access token in Databricks documentation to generate a personal access token.
JDBC URI¶
If you are using a Databricks cluster, to get the JDBC URI, refer to:
If you are using a Databricks SQL warehouse (SQL endpoints), to get the JDBC URI, refer to:
The JDBC URI string you will need to provide in Alation depends on the connector version:
Versions 2.0.0 and later use the Databricks JDBC driver.
Versions below 2.0.0 use the Spark JDBC driver.
When specifying the JDBC URI in Alation, remove the jdbc: prefix.
Note
The property UseNativeQuery=0 is required for custom query-based sampling and profiling. Without this property in the JDBC URI, custom query-based sampling or profiling will fail. If you are not using custom query-based sampling and profiling in your implementation of this data source type, you can omit this property from the JDBC URI string. Find more information in ANSI SQL-92 query support in JDBC in Azure Databricks documentation.
Connection String for Databricks JDBC Driver¶
Find more information in Databricks JDBC driver in Databricks documentation.
Format¶
databricks://<hostname>:443/default;transportMode=http;ssl=1;httpPath=<databricks_http_path_prefix>/<databricks_cluster_id>;AuthMech=3;UseNativeQuery=0;
Examples¶
Compute Cluster
databricks://dbc-32ak8401-ac16.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/2479012801311837/0612-093241-z79vbfjk;AuthMech=3;UseNativeQuery=0;
SQL Warehouse
databricks://dbc-32am8401-ac16.cloud.databricks.com:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/warehouses/9f5d50hhsaeb0k23;UseNativeQuery=0;
Connection String for Spark JDBC Driver¶
Find more information in JDBC Spark driver in Databricks documentation.
Format¶
spark://<hostname>:443/default;transportMode=http;ssl=1;httpPath=<databricks_http_path_prefix>/<databricks_cluster_id>;AuthMech=3;
Example¶
spark://adb-58175503737864.5.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=/sql/1.0/endpoints/0f38f55be5cbd786;AuthMech=3;
JDBC URI Properties¶
The connector adds some JDBC properties to the connection. These properties do not need to be explicitly included in the JDBC URI connection string in Alation:
RowsFetchedPerBlock—Limits the number of rows returned in each fetch call. It is used to regulate the amount of memory used by the connector and prevent out-of-memory (OOM) errors. Set to 500. The memory utilization of the MDE job is captured in the connector logs when debug logging is enabled.
UserAgentEntry—Identifies driver request calls from the connector in Databricks. Set to alation+unity_catalog.
Configuration in Alation¶
STEP 1: Install the Connector¶
Alation On-Prem¶
Important
Installation of OCF connectors requires Alation Connector Manager to be installed as a prerequisite.
To install an OCF connector:
If this has not been done on your instance, install the Alation Connector Manager: Install Alation Connector Manager.
Ensure that the OCF connector Zip file that you received from Alation is available on your local machine.
Install the connector on the Connectors Dashboard page using the steps in Manage Connectors.
Alation Cloud Service¶
Note
On Alation Cloud Service instances, Alation Connector Manager is available by default.
Ensure that the OCF connector Zip file that you received from Alation is available on your local machine.
Install the connector on the Connectors Dashboard page using the steps in Manage Connectors.
STEP 2: Create and Configure a New Data Source¶
In Alation, add a new data source:
Log in to Alation as a Server Admin.
Expand the Apps menu on the right of the main toolbar and select Sources.
On the Sources page, click +Add on the top right of the page and in the list that opens, click Data Source. This will open the Add a Data Source wizard.
On the first screen of the wizard, specify a name for your data source, assign additional Data Source Admins, if necessary, and click the Continue Setup button on the bottom. The Add a Data Source screen will open.
On the Add a Data Source screen, the only field you should populate is Database Type. From the Database Type dropdown, select the connector name. After that you will be navigated to the Settings page of your new data source.
The name of this connector is Databricks Unity Catalog OCF Connector.
Access¶
On the Access tab, set the data source visibility using these options:
Public Data Source—The data source will be visible to all users of the catalog.
Private Data Source—The data source will be visible to the users allowed access to the data source by Data Source Admins.
You can add new Data Source Admin users in the Data Source Admins section.
General Settings¶
Note
This section describes configuring settings for credentials and connection information stored in the Alation database. If your organization has configured Azure KeyVault or AWS Secrets Manager to hold such information, the user interface for the General Settings page will change to include vault icons to the right of most options.
By default, the database icon is selected. If you use a vault, enter the ID of the secret instead of the actual credential information. See Configure Secrets for OCF Connector Settings for details.
Perform the configuration on the General Settings tab.
Application Settings¶
Specify Application Settings if applicable. Click Save to save the changes after providing the information.
| Parameter | Description |
|---|---|
| BI Connection Info | This parameter is used to generate lineage between the current data source and another source in the catalog, for example a BI source that retrieves data from the underlying database. The parameter accepts the host and port of the corresponding BI data source connection. You can provide multiple values as a comma-separated list. Find more details in BI Connection Info. |
| Disable Automatic Lineage Generation (present in the user interface in connector versions before 2.0.3.6564) | This checkbox enables or disables automatic lineage generation from QLI, MDE, and Compose queries. If you are using connector version 2.0.2.6259, you must disable auto-lineage generation before running QLI by selecting this checkbox. On connector version 2.0.3.6564 or newer, you don't have to manually disable automatic lineage as it's disabled programmatically. |
Connector Settings¶
Populate the data source connection information and save the values by clicking Save in this section.
Data Source Connection¶
| Parameter | Description |
|---|---|
| JDBC URI | Specify the JDBC URI in the required format. |
| Username | For token-based authentication, use the value token. |
| Password | Paste the personal access token for the service account. |
Logging Configuration¶
Select the logging level for the connector logs and save the values by clicking Save in this section. The available log levels are based on the Log4j framework.
| Parameter | Description |
|---|---|
| Log level | Select the log level to generate logs. The available options are INFO, DEBUG, WARN, TRACE, ERROR, FATAL, ALL. |
Obfuscate Literals¶
Skip this section as it’s not applicable to Databricks Unity Catalog data sources.
Test Connection¶
Under Test Connection, click Test to validate network connectivity.
If the connection test fails, make sure the JDBC URI and service account credentials are correct.
Metadata Extraction¶
You can configure metadata extraction (MDE) for an OCF data source on the Metadata Extraction tab of the Settings page. For Databricks Unity Catalog data sources, Alation supports full and selective default MDE. Custom query-based MDE is not supported.
Refer to Configure Metadata Extraction for OCF Data Sources for information about the available configuration options.
Lineage Extraction Job¶
Metadata extraction will trigger a dependent direct lineage extraction job. If you have enabled the system.access tables in Unity Catalog and have provided the service account with access to tables in the system.access schema, lineage data will be extracted into Alation. See Capturing Lineage for more details. If you have not enabled the system.access tables, the downstream lineage extraction job will fail.
You can track the result of the lineage extraction job in the Details log accessible from the Job History table at the bottom of the Metadata Extraction tab. In the Details log, records relevant to the direct lineage extraction job start with the DirectLineageExtraction Info Log record. More detailed logs are available in the connector log.
Lineage Extraction (Beta)¶
For Databricks Unity Catalog data sources, Alation calculates lineage based on lineage metadata extracted from lineage system tables and on data definition language (DDL) query statements from Compose. Lineage extraction is available as a beta feature that is based on system lineage tables in Databricks:
Lineage is extracted directly from these tables (Direct Lineage Extraction).
Direct Lineage Extraction¶
Direct lineage extraction—The connector extracts lineage directly from system tables in Databricks that store lineage. Direct lineage extraction will be automatically triggered as a “downstream” job dependent on the metadata extraction job. Query history, joins, filters, and popularity information will not be available as the result of direct lineage extraction as they require query log ingestion (QLI).
For lineage data to be generated, the service account needs access to the system tables that store lineage. See Permissions for Lineage Extraction.
After you have granted the service account these permissions, lineage will be extracted automatically. No additional configuration is required on the Alation side.
Direct Lineage Feature Flags¶
Direct lineage extraction from OCF data sources is enabled by default. It is controlled by two alation_conf flags:
alation.ocf.mde.direct_lineage.enable_extraction—Enables or disables the direct lineage feature for all data sources in the catalog that support it.
alation.ocf.mde.direct_lineage.incremental_extraction—Enables or disables incremental lineage extraction. This flag only applies if the main feature flag alation.ocf.mde.direct_lineage.enable_extraction is set to True.
For more on alation_conf, see Using alation_conf.
For more details on incremental lineage, see Incremental Lineage Extraction below.
Capturing Lineage¶
On the Databricks side, DDL query runs generate lineage records in the lineage tables in the Unity Catalog metastore: system.access.table_lineage and system.access.column_lineage.
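For example, running a CTAS statement like the hypothetical one below in Databricks produces a table-level record in system.access.table_lineage linking the source and target tables, and column-level records in system.access.column_lineage linking the selected columns:

```sql
-- Hypothetical catalog and schema names. This DDL generates lineage records
-- from main.sales.orders to main.sales.orders_summary.
CREATE TABLE main.sales.orders_summary AS
SELECT customer_id, SUM(amount) AS total_amount
FROM main.sales.orders
GROUP BY customer_id;
```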
In Alation, direct lineage extraction is triggered as a downstream job after metadata extraction (MDE). The direct lineage extraction job reads the system lineage tables and extracts and ingests lineage information into Alation. Ingested lineage will become available on the Lineage tab of the catalog pages of data objects (tables and views) under the Databricks Unity Catalog OCF data source.
Direct lineage extraction depends on the lineage capture feature in Databricks, inheriting its requirements and limitations. If specific records are not available in the system lineage tables due to limitations on the Databricks side, they will not be available in Alation either. Review the requirements and limitations in Databricks documentation:
Note
Lineage records are stored in the system lineage tables in Databricks for 30 days. Dropping a view or table with lineage will not immediately remove its associated lineage. Alation will still show the object on the Lineage diagram, while the corresponding catalog page will display the message This object appears to have been removed from the data source.
Altering the columns of a table or view after lineage has already been created does not change the existing lineage records.
Dataflow Content From Direct Lineage Extraction¶
Dataflow objects generated by lineage extraction will not show the SQL queries. The Dataflow Content field will contain the URL of the Databricks entity that generated the lineage (a notebook, a dashboard, a workflow, or a Databricks SQL query).
For more on Dataflow objects, see Dataflow Objects.
Lineage from Compose Queries¶
In addition to lineage extracted from Databricks, Alation will capture lineage from DDL SQL queries executed in Compose. The Compose queries will be available in the Dataflow Content field on the Lineage diagram.
Important
Use multipart schema names when referencing schemas, tables, and views in Compose for lineage to be captured: catalog.schema.table.
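For example (hypothetical object names), this Compose statement uses fully qualified three-part names throughout, so lineage between the view and the source table is captured:

```sql
-- All objects use the three-level catalog.schema.table form.
CREATE VIEW main.sales.v_large_orders AS
SELECT order_id, amount
FROM main.sales.orders
WHERE amount > 1000;
```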
Incremental Lineage Extraction¶
Incremental lineage extraction is available from connector version 1.1.0.4393 and Alation release 2023.1.4.
Incremental lineage extraction is supplemental to direct lineage extraction. The lineage extraction job creates a timestamp "bookmark" and stores it in Postgres. The bookmark is the latest created_at value found in the system lineage tables. During the next MDE, Alation only extracts lineage records whose created_at value is later than the stored bookmark. For example, if an initial MDE job extracts 50 lineage records into the Alation catalog on day one and creates a bookmark, the next MDE only extracts lineage records created after that bookmark, adding lineage records incrementally. The same extraction job creates a new bookmark to be used during the subsequent MDE.
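Conceptually, the incremental read is equivalent to filtering the lineage tables on the bookmark, as in this illustrative query (not the connector's literal SQL; the timestamp is a placeholder):

```sql
-- The bookmark is the latest created_at value seen during the previous run.
SELECT *
FROM system.access.table_lineage
WHERE created_at > TIMESTAMP '2024-01-15 00:00:00';
```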
If the incremental lineage feature flag is disabled, the MDE job will extract all available lineage records but only ingest the records that were not previously extracted and are not present in Alation. This may increase the duration of the MDE job, depending on how much metadata you are extracting.
See Direct Lineage Feature Flags for information on the feature flag that controls incremental lineage extraction.
Sampling and Profiling¶
Sampling and profiling, including dynamic profiling, are supported from connector version 1.0.2.3423, compatible with Alation version 2022.4 or newer.
For details, see Configure Sampling and Profiling for OCF Data Sources.
Important
To use custom query-based sampling and profiling, ensure that the JDBC URI includes the UseNativeQuery=0 property. If you enable dynamic profiling, users should ensure that their individual connections also include this property.
Query Log Ingestion (Beta)¶
Query Log Ingestion (QLI) is available from connector version 2.0.2. The connector supports query history extraction from the system audit log table:
Prerequisites¶
Before you can perform QLI from your data source in Alation, you will need to:
Enable the system.access schema.
Configure audit log delivery on your Databricks account and workspace.
Enable verbose audit logging to capture additional events in the audit log table. Your logs will include the following additional events:
For notebooks: runCommand
For SQL warehouse: commandStart, commandFinish
Alation requires these events to be available in the audit table to successfully extract and ingest query history.
Grant Permissions for QLI to the Alation service account.
In Alation, ensure that you run metadata extraction before running QLI.
QLI Query¶
Alation runs the following query on the system.access.audit table to retrieve query history:
SELECT request_event_time AS startAt,
username,
queryText,
TIMESTAMPDIFF(SECOND, request_event_time, response_event_time) AS executionTime,
statusCode,
errorMessage,
session_id
FROM (
SELECT
user_identity.email AS username,
request_params.commandText AS queryText,
event_time AS request_event_time,
request_params.commandId
FROM system.access.audit
WHERE action_name = 'commandSubmit'
) submitTable
LEFT JOIN (
SELECT
response.status_code AS statusCode,
response.error_message AS errorMessage,
session_id,
event_time AS response_event_time,
request_params.commandId
FROM system.access.audit
WHERE action_name = 'commandFinish'
) finishTable
WHERE submitTable.commandId = finishTable.commandId
AND submitTable.request_event_time >= "FROM_TIME"
AND submitTable.request_event_time <= "TO_TIME"
AND finishTable.statusCode = 200
AND finishTable.errorMessage IS NULL
UNION ALL
SELECT
event_time AS startAt,
user_identity.email AS username,
request_params.commandText AS queryText,
request_params.executionTime AS executionTime,
response.status_code AS statusCode,
response.error_message AS errorMessage,
session_id
FROM system.access.audit
WHERE action_name in ('runCommand')
AND event_time >= "FROM_TIME"
AND event_time <= "TO_TIME"
AND response.status_code = 200
AND response.error_message IS NULL;
Perform QLI¶
You can either run QLI manually on demand or configure it to run automatically on a schedule.
Note
Alation supports a maximum SQL query length of 100K rows. There is also a 50 KB limitation on query file size. Queries with more than 100K rows or queries larger than 50 KB cannot be parsed and ingested by QLI jobs.
Run QLI Manually¶
To perform QLI manually on demand:
If you are using connector version 2.0.2.6259, you must disable lineage auto-generation before running QLI: open the General Settings tab, select the Disable Automatic Lineage Generation checkbox under Application Settings, and click Save. On connector version 2.0.3.6564 or later, you don't have to disable automatic lineage manually, as it's disabled programmatically and the corresponding control is not present in the user interface.
Note
After every QLI job, the QLI framework in Alation runs a downstream sub-job to calculate lineage based on the query history metadata ingested during QLI. However, for a Databricks Unity Catalog data source, lineage is extracted directly from the system tables. Disabling automatic lineage calculation disables the lineage downstream job and prevents creating duplicate lineage paths from both direct lineage extraction and QLI.
On the Query Log Ingestion tab of the Settings page of your data source, under Automated and Manual Query Log Ingestion, ensure that the Enable Automated Query Log Ingestion toggle is disabled.
Specify the desired date range using the Date Range calendar widgets. You will need to specify the start date and the end date separately.
Click the Preview button to preview the queries that will be ingested. This will run a preview job that will fetch a subset of the queries based on your QLI configuration. The result will be displayed in a table.
Click the Import button. This will start a query log ingestion job.
View Job Status¶
The status of preview or QLI jobs is logged in the Query Log Ingestion Job Status table at the bottom of the page. In this table, click the status link or the View Details link for a job to view the details on its progress.
Schedule QLI¶
To schedule QLI:
On the Query Log Ingestion tab, under Automated and Manual Query Log Ingestion, enable the Enable Automated Query Log Ingestion toggle.
In the Automated Query Log Ingestion Time panel that will appear, specify the values for the job recurrence and time. The values are set in your local time.
Note
The hourly schedule for automated QLI is not supported.
The next QLI will run on the schedule you set.
View Query History in the Catalog¶
After a successful QLI run, you will be able to find and view the following information under your data source:
Note
Joins, filters, and popularity calculation runs as a scheduled task asynchronously with query log ingestion. After running QLI for the first time, you will see popularity, join, and filter information in the catalog on the next day.
Query History¶
The query history information that Alation ingests during QLI is made available on the Queries tab on the catalog pages for schemas and tables.
To view query history:
Navigate to the catalog page of a data source object, such as a schema or a table.
Open the Queries tab.
Select the History section from the list on the left. All the queries ingested for this object during QLI will be listed here.
You can expand the query panels to see the full query statement and the related properties.
Filters, Joins, and Expressions¶
When ingesting query history, Alation parses SQL statements and catalogs joins and filter expressions. The joins and filters can be viewed on the Joins and Filters tabs on catalog pages of the relevant table objects. Filter expressions are captured under the Expressions tab on catalog pages of the relevant column objects.
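For example, if QLI ingests a hypothetical query like the one below, the join appears on the Joins tabs of both tables, the filter on o.status appears on the Filters tab of the table, and the filter expression is captured on the Expressions tab of the status column:

```sql
-- Hypothetical tables; the join and the WHERE filter are cataloged by QLI.
SELECT o.order_id, c.name
FROM main.sales.orders o
JOIN main.sales.customers c ON o.customer_id = c.customer_id
WHERE o.status = 'shipped';
```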
To view the joins and filter information:
Navigate to the catalog page of a table object under your data source.
Open the Joins tab to view the relevant joins.
Open the Filters tab to view the relevant filters.
To view filter expressions:
Navigate to the catalog page of a column object under your data source.
Open the Expressions tab to view the relevant filter expressions.
Popularity¶
Popularity of objects is calculated from the number of mentions an object has in the queries parsed during query history ingestion. Tables that are more frequently accessed by users will have a greater popularity value than tables that are queried rarely. The popularity value is displayed on catalog pages of schema, table, and column objects.
Compose¶
Compose is supported from connector version 1.0.2.3423, compatible with Alation version 2022.4 or newer.
For details about configuring the Compose tab of the Settings page, refer to Configure Compose for OCF Data Sources.
Note
To establish a connection between Compose and Unity Catalog, Compose users will need their own personal access tokens or knowledge of the service account's token.
Incremental MDE from Compose¶
When users create tables and views in Compose, Alation triggers a background extraction job to make the new table and view objects available in the catalog. As a result, users will see the corresponding table or view metadata in the data catalog without re-running MDE on the Settings page of the data source. When users run DDL SQL queries in Compose, incremental MDE also adds lineage information to the affected objects under the data source.
Troubleshooting¶
MDE¶
Problem: MDE fails with error “iterating schema children”¶
The issue may occur with older connector versions that use the legacy Spark driver. This driver sometimes cannot keep the connection alive long enough for MDE to complete.
Solution¶
This issue was resolved starting with connector version 2.0.2.6259, where the driver was changed to the Databricks JDBC driver. See more in JDBC URI.
Lineage¶
Problem: After upgrading the connector to version 1.0.3, MDE fails at ingestion¶
Connector logs will show the error “system.access.table_lineage table does not exist”.
Solution¶
Direct lineage extraction from system tables is a beta feature. The lineage system tables are not enabled by default and need to be enabled by your Databricks admin. See more in Lineage Extraction (Beta).
Problem: No lineage is generated after MDE¶
Solution¶
Verify that lineage data is available in the system lineage tables in Databricks.
QLI¶
Problem: After running QLI, some queries don’t appear in the History section of the Queries tab¶
Explanation¶
Some SQL statements are currently not supported by the Databricks SQL parser that is used to parse SQL statements during QLI, for example:
Partially qualified names in SQL statements are not parsed. Object names in query history must be fully qualified to be ingested.
USE statements are not parsed.
Queries containing CTEs are not parsed.
Queries containing comments (COMMENT, COMMENT ON) are not parsed.
This is a known limitation of the current SQL parser (GSP).
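To illustrate with hypothetical names, the first statement below would be parsed and ingested, while the remaining ones would be skipped:

```sql
-- Parsed: fully qualified three-level name.
SELECT order_id FROM main.sales.orders;

-- Not parsed: a USE statement and a partially qualified name.
USE main.sales;
SELECT order_id FROM orders;
```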
Problem: After QLI, parts of a query appear as separate queries in the catalog¶
Explanation¶
To support proper query parsing, accurate execution time calculation, and lineage resolution, batched queries are split into individual query statements before ingestion.
Problem: Queries that appeared in Preview don’t appear in the catalog after QLI¶
Explanation¶
The Preview is not directly tied to the queries that will appear in the catalog. It returns 100 statements, which may be a subset of all imported queries. For queries to appear in the catalog after QLI:
Queries must contain fully qualified object names with three-level namespace (catalog.schema.table).
The objects referenced in queries must have been cataloged with MDE.
Queries must not duplicate identical queries that were ingested previously. Duplicate queries are not ingested.