Configure Sampling and Profiling for OCF Data Sources¶
Overview¶
Sampling is a data source operation that selects a sample of 10,000 rows from each table in the database to give catalog users a quick preview of the data on the table and column catalog pages. Alation supports a number of ways to retrieve data samples:
Default sampling—Alation runs a default SELECT query against all tables selected for the sampling operation. Default sampling bulk-samples tables and columns and is performed by a Data Source Admin on the data source Settings page. The data samples will be available for viewing to all users with access to the data source.
Custom query-based sampling—The default sampling queries can be overridden with custom queries for all tables and columns or for individual tables and/or columns. If a custom sampling query is provided, Alation will run this query in place of the default query. Custom query-based sampling bulk-samples tables and columns based on the respective custom queries and is performed by a Data Source Admin on the data source Settings page. The data samples will be available for viewing to all users with access to the data source.
Column profiling—Requires the Profiling V2 feature to be enabled. In addition to retrieving data samples from columns, Alation will calculate column value distribution stats.
Dynamic sampling and profiling—Individual Alation users with access to the data source can be allowed to retrieve their own table samples and column profiles after authenticating with their database credentials.
On the Settings page of a data source, parameters for sampling can be found on these three tabs:
Per-Object Parameters
Custom Settings
Data Sampling
Depending on the sampling configuration you wish to apply, you may need to modify the parameters on all three tabs before launching the sampling job:
On the Per-Objects Parameters tab, select specific objects to profile and specify add custom queries for sampling individual tables and columns.
On the Custom Settings tab, if you have enabled Profiling V2, customize the default profiling queries.
On the Data Sampling tab, schedule the sampling job, select the number of tables to sample, and perform bulk-sampling of multiple objects.
See sections below for a detailed description of each tab.
Per-Object Parameters¶
On the Per-Object Parameters tab, you can select sampling options for each level of the data object hierarchy in a data source (Schema > Table > Column). You can include or exclude schemas, tables, or columns to or from sampling in bulk.
On each level, the sampling parameters will slightly differ. The names of schemas and tables are links. Click the name of a schema or table to go down to the child object level.
Make Browsable—Hides or exposes an object and its child objects from/in the catalog search. To hide an object from being discovered, clear the Make Browsable option. Note that when the Make Browsable checkbox is selected or cleared, it may take Alation a few minutes to update the search index before the change is reflected in the search results.
Sample?—Selects objects to be included into sampling when it is performed in bulk on the Data Sampling page. By default, all objects are selected (included). Clear the checkboxes for objects you wish to exclude. Available at the schema and table levels.
Max rows to Scan—A read-only parameter available for your information. Appears on the schema and table levels. For each table selected for sampling, Alation only samples 10,000 rows.
Max rows to Store—Sets the max number of rows to be stored for a table. Available on the schema and table level. If defined on the schema level, the value is propagated to all tables in this schema. If defined on the table level, the setting applies to this specific table only. Default:
100
. Click the number for a specific object to edit it.Skip Views—Allows skipping views during sampling. Selected by default. Applies at the schema level.
Sampling Query (for table objects) or Profiling Query (for column objects)—Available at the table and at column level. See Custom Query-Based Sampling for Tables and Custom Query-Based Profiling for Columns below.
Max values to Store—Allows to set the number of column values to be stored for a sample. Available at the column level. Click the number for a specific column to edit.
Custom Query-Based Sampling for Tables¶
As a Data Source Admin, you can specify a custom query for sampling individual tables:
On the Per-Object Parameters tab, navigate to the table level of your data source by clicking the name of a schema object that contains the table you want to sample based on a custom query.
Locate the table for which you wish to provide a custom query. For this table, in the column Sampling Query, click the pencil icon to open the query editor.
In the query editor that opens, add the query and click Save.
The next sampling job you run will use the saved custom query to sample this specific table.
Note
There is no SQL validation for the custom sampling queries. If the query is incorrect, Alation will throw an error during profiling this table which will be logged in the Sampling Job Status table on the Data Sampling tab. We recommend to test run and troubleshoot the query in Compose before saving it in the catalog and running the sampling job.
Custom Query-Based Profiling for Columns¶
Requires Profiling V2 to be enabled.
You can specify a custom query for profiling individual columns:
1. On the Per-Object Parameters tab, navigate to the column you want to profile by clicking the name of the table object that contains this column. For this column, in the Profiling Query column of the table, click the pencil icon.
In the query editor that opens, specify a custom query and click Save.
The next profile you run for this column will use the saved custom query to profile it.
Note
Individual profiling queries on the Per-Object Parameters tab take precedence over the default profiling query or the custom column profiling query that may be specified for all columns on the Custom Settings tab.
Custom Settings¶
The Custom Settings tab applies if Profiling V2 is enabled. It allows customizing the default query templates for column profiling.
By default, to profile a column, Alation runs a default query. Default queries for column profiling differ depending on the data type: numeric or non-numeric. You can customize these queries to profile a column. When you customize the query, you can also customize the statistics that should be calculated and displayed.
The default query can be modified based on the user requirements and the data source type. The default query templates are given below.
Numeric Columns¶
SELECT
MIN({column_name}) AS MIN,
MAX({column_name}) AS MAX,
AVG({column_name}) AS MEAN,
(COUNT(*) - COUNT({column_name})) AS "#NULL",
(CASE WHEN COUNT(*) > 0 THEN ((COUNT(*) - COUNT({column_name})) * 100.0 / COUNT(*)) ELSE 0.0 END) AS "%NULL"
FROM {schema_name}.{table_name};
Non-numeric Columns¶
SELECT
((SUM(CASE WHEN {column_name} IS NULL THEN 1 ELSE 0 END))) AS "#NULL",
(CASE WHEN COUNT(*) > 0 THEN ((((SUM(CASE WHEN {column_name} IS NULL THEN 1 ELSE 0 END))) * 100.0 / count(*)) ) ELSE 0.0 END ) AS "%NULL",
((SUM(CASE WHEN {column_name} = '' THEN 1 ELSE 0 END)) ) AS "#EMPTY",
(CASE WHEN COUNT(*) > 0 THEN (((SUM(CASE WHEN {column_name} = '' THEN 1 ELSE 0 END)) * 100.0 / count(*))) ELSE 0.0 END) AS "%EMPTY"
FROM {schema_name}.{table_name};
Important
The profiling query templates always contain the {column_name}
, {schema_name}
, and {table_name}
variables.
Admin users who are setting up column profiling must check the compatibility of the default query based on the database type and modify it if required.
The default profiling query calculates the profiling stats that are displayed on the Overview tab of the column catalog page.
Data Sampling¶
On the Data Sampling tab, admin users can perform bulk sampling of multiple objects manually on-demand or enable automated sampling.
Important
If Dynamic Sampling is enabled, Automated and Manual Sampling is not available and becomes disabled in the user interface.
![]()
Manual On-Demand Sampling¶
To perform bulk sampling manually on-demand:
Ensure that the Enable Automated Sampling toggle is off.
From the All tables list, select the number of tables to sample. By default, all tables are sampled based on the configuration on the Per-Object Parameters tab. When a specific number of tables is selected, popular tables that were not sampled previously are prioritized.
Click the Sample button to kick off a sampling job.
Automated Sampling¶
To schedule the sampling job to run automatically:
Enable the Enable Automated Sampling toggle.
Set a schedule specifying values for the recurrence period, day, and time in the Automated Sampling Time section.
From the All tables list, select the number of tables to be sampled. By default, all tables are sampled based on the configuration on the Per-Object Parameters tab. When a specific number of tables is selected, popular tables that were not sampled previously are prioritized.
Sampling will be performed automatically based on the schedule you have set.
Sampling of Individual Tables and Columns by Users¶
As a Data Source Admin, you may choose not to sample catalog objects in bulk on the Data Sampling tab as it is a resource-intensive operation. Instead, catalog users with access to a data source can run data samples on demand on the Samples tab of a table object catalog page and/or on the Overview tab of a column object catalog page. The configuration on the Per-Object Parameters tab applies to the sampling of individual tables and columns by non-admin users too.
Note
If Profiling V2 is enabled on your Alation instance, then for column objects, Alation will calculate column profiles.
When users run samples or profiles for specific tables or columns, they may be required to provide their own database credentials if Dynamic Sampling is enabled for the data source.
Dynamic Sampling¶
Enabling Dynamic Sampling assumes that users should provide their own database credentials when initiating sampling for a specific table or when running a column profile. By default, Dynamic Sampling is disabled.
Important
Enabling Dynamic Sampling disables the ability to perform bulk-sampling or schedule sampling jobs: the Automated and Manual Sampling section of the interface will become disabled.
To enable Dynamic Sampling, enable the Enable dynamic sampling/profiling toggle under the Dynamic Sampling section on the Data Sampling tab.