Redshift OCF Connector: Install and Configure

Preliminaries

Firewall Configuration

  • Open outbound TCP port 5439 to the Redshift server

Driver

The driver for the Redshift connector is available in Alation by default and does not require installation. Refer to the Support Matrix for your Alation release to find the version of the driver available for Redshift.

Create Service Account

Note

Authentication using an AWS IAM account is not supported.

Sample SQL to create an account

CREATE USER alation WITH PASSWORD 'password';
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA schema_name TO alation;
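
If the schema is not public, the service account may also need USAGE on the schema before the table grants take effect. A sketch, reusing the schema_name placeholder from the sample above:

-- Allow the alation account to access objects in the schema
GRANT USAGE ON SCHEMA schema_name TO alation;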

Permissions for Metadata Extraction

grant usage on schema pg_catalog to alation;
grant select on all tables in schema pg_catalog to alation;

  • grant usage on schema pg_catalog to alation; - Required for table and view extraction.

  • grant select on all tables in schema pg_catalog to alation; - Required for table, view, function, and function definition extraction.
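
As an optional check before running extraction, you can confirm the grants with Redshift's built-in privilege functions; this sketch assumes the service account is named alation as in the sample above:

-- Should return true: USAGE on pg_catalog
SELECT HAS_SCHEMA_PRIVILEGE('alation', 'pg_catalog', 'usage');

-- Should return true: SELECT on a pg_catalog table
SELECT HAS_TABLE_PRIVILEGE('alation', 'pg_catalog.pg_tables', 'select');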

Permissions for Table Profiling

GRANT SELECT ON schema_name.table_name TO alation;
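
To profile every table in a schema without granting table by table, Redshift also accepts a bulk grant; a sketch using the same schema_name placeholder:

-- Grants SELECT on every existing table in the schema
GRANT SELECT ON ALL TABLES IN SCHEMA schema_name TO alation;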

Permissions for Query Log Ingestion

GRANT SELECT ON <qli_schema_name>.<qli_table_name> to alation;
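
For example, if the query logs are stored in a view named public.alation_qlog (the name used in the custom QLI query later in this section):

GRANT SELECT ON public.alation_qlog TO alation;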

JDBC URI

When building the URI, include the following components:

  • Hostname or IP of the instance

  • Port Number

  • Database Name

URI format:

redshift://<Host>:<Port>/<Database_Name>

Example:

redshift://redshift-alation-se.ct5o26v.us-east-5.redshift.amazonaws.com:5439/dev

Configuration in Alation

STEP 1: Install the Connector

Important

Installation of the Redshift OCF connector requires the Alation Connector Manager to be installed as a prerequisite.

  1. If this has not been done on your instance, install the Connector Manager: Install Alation Connector Manager.

  2. Make sure that the connector Zip file that you received from Alation is available on your local machine.

  3. Install the connector on the Connectors Dashboard page: refer to Manage Connector Dashboard.

STEP 2: Create and Configure a New Redshift Data Source

  1. Log in to the Alation instance and add a new Redshift source: click Apps > Sources > Add > Data Source.

  2. Provide a Title for the data source and click Continue Setup.

  3. From the Database Type dropdown, select Redshift OCF Connector. You will be navigated to the Settings page of your new Redshift OCF data source.


Access

On the Access tab, set the data source visibility as follows:

  • Public Data Source - The data source will be visible to all users of the catalog.

  • Private Data Source - The data source will be visible to the users allowed access to the data source by Data Source Admins.

Add new Data Source Admin users in the Data Source Admins section.

General Settings

Perform the configuration on the General Settings tab:

  1. Specify Application Settings:

    • BI Connection Info - Not applicable

    • Disable Automatic Lineage Generation - Not applicable

  2. Click Save.

  3. Specify Connector Settings:

    Data Source Connection:

    • JDBC URI - Provide the JDBC URI constructed in the JDBC URI section above.

    • Username - Provide the service account username.

    • Password - Provide the service account password.

    • Enable SSL - Enable or disable SSL authentication by selecting or clearing the Enable SSL checkbox. If the checkbox is selected, upload the SSL certificate using the upload link below it.

    • Truststore Password - Provide the password for the SSL certificate. Note that this password is deleted if the data source connection is deleted.

    Logging Information:

    • Log Level - Select the Log Level to generate logs. The available log levels are based on the log4j framework.

  4. Click Save.

  5. Obfuscate Literals - Enable this toggle to hide literal values in the queries displayed on catalog pages, whether they were ingested via QLI or executed in Compose. This toggle is disabled by default.

  6. Under Test Connection, click Test to validate network connectivity.

Deleting the Data Source

You can delete your data source from the General Settings tab. Under Delete Data Source, click Delete to delete the data source connection.


Metadata Extraction

You can perform a default extraction or configure extraction based on custom queries.

  • Default extraction: This type of MDE is based on default SQL queries that are built in the connector code.

  • Query-based extraction: This type of MDE is based on custom SQL queries that can be provided by a Data Source Admin. See Query Based Extraction below.

Application Settings

  • Enable Raw Metadata Dump or Replay: The options in this drop-down list can be used to dump the extracted metadata into files in order to debug extraction issues before ingesting the metadata into Alation. This feature can be used during testing if MDE runs into issues. It breaks extraction into two steps: first, the extracted metadata is dumped into files where it can be reviewed; second, it is ingested from those files into Alation. It is recommended to keep this feature enabled only while debugging.

    • Enable Raw Metadata Dump: Select this option to save the extracted metadata into a folder for debugging purposes. The dumped data is saved in four files (attribute.dump, function.dump, schema.dump, table.dump) in the folder opt/alation/site/tmp/ inside the Alation shell.

    • Enable Ingestion Replay: Select this option to ingest the metadata from the dump files into Alation.

    • Off: Disables the Raw Metadata Dump or Replay feature. Extracted metadata is ingested directly into Alation.

Connector Settings

Query Based Extraction

Query-Based Extraction allows users to customize metadata extraction from the source down to the level of specific metadata types, such as tables, columns, and views, by using custom queries.

Tables, columns, views, and function definitions are enabled by default at the backend and users cannot disable them.

By default, the following additional metadata types are enabled:

  • Primary keys

  • Foreign keys

  • Functions

Users can disable the metadata types that are not required by clearing the corresponding checkboxes.

System schemas are disabled by default. Users can enable their extraction if required.

To use this feature, you will need to write custom queries to extract the metadata. Alation expects that these queries conform to a specific structure and use the expected reserved identifiers. The sections below provide the expected query structure for each metadata type.

Catalog

Make sure that your query has a column labeled as CATALOG in the SELECT statement.

Example:

SELECT
  distinct CURRENT_DATABASE() AS CATALOG
FROM
  PG_NAMESPACE
WHERE
  lower(nspname) NOT LIKE  'pg_%%'
  AND lower(nspname) NOT IN   ( 'information_schema')
  AND lower(nspname)  NOT IN ('''')
ORDER BY
  CATALOG;

Schema

Make sure that your query has columns labeled as CATALOG and SCHEMA in the SELECT statement.

Example:

SELECT
    CURRENT_DATABASE() AS CATALOG,
    nspname AS SCHEMA
FROM
    PG_NAMESPACE
WHERE
    lower(nspname) NOT LIKE  'pg_%%'
    AND lower(nspname) NOT IN   ( 'information_schema')
    AND lower(nspname)  NOT IN ('''')
ORDER BY
    SCHEMA;

Table

Make sure that your query has columns labeled as CATALOG, SCHEMA, TABLE_NAME, TABLE_TYPE, and REMARKS in the SELECT statement.

Example:

SELECT
  *
FROM
  (
  SELECT
          CURRENT_DATABASE() AS CATALOG,
          A.SCHEMA AS SCHEMA,
          A.TABLE_NAME,
          A.TABLE_OWNER,
          A.TABLE_TYPE,
          pgd.description AS REMARKS
  FROM
          (
          SELECT
                  T.SCHEMANAME AS SCHEMA,
                  T.TABLENAME AS TABLE_NAME,
                  T.TABLEOWNER AS TABLE_OWNER,
                  'TABLE' AS TABLE_TYPE
          FROM
                  PG_TABLES T
  UNION
          SELECT
                  V.SCHEMANAME AS SCHEMA,
                  V.VIEWNAME AS VIEW_NAME,
                  V.VIEWOWNER AS TABLE_OWNER,
                  'VIEW' AS TABLE_TYPE
          FROM
                  PG_VIEWS V ) A
  LEFT JOIN PG_CATALOG.pg_statio_all_tables c on
          (c.relname = A.table_name
          and c.schemaname = A.SCHEMA)
  LEFT JOIN PG_CATALOG.pg_description pgd on
          (pgd.objoid = c.relid)
  WHERE
          A.SCHEMA NOT IN (
          SELECT
                  schemaname
          from
                  svv_external_schemas)
          AND lower(A.SCHEMA) NOT LIKE  'pg_%%'
          AND (pgd.objsubid = 0)
UNION
  SELECT
          CATALOG,
          SCHEMA,
          TABLE_NAME,
          TABLE_OWNER,
          TABLE_TYPE,
          REMARKS
  from
          (
          SELECT
                  CURRENT_DATABASE() AS CATALOG,
                  A.SCHEMA as SCHEMA,
                  A.TABLE_NAME as TABLE_NAME,
                  A.TABLE_OWNER as TABLE_OWNER,
                  A.TABLE_TYPE as TABLE_TYPE,
                  null AS REMARKS,
                  pgd.objoid as objoid
          FROM
                  (
                  SELECT
                          T.SCHEMANAME AS SCHEMA,
                          T.TABLENAME AS TABLE_NAME,
                          T.TABLEOWNER AS TABLE_OWNER,
                          'TABLE' AS TABLE_TYPE
                  FROM
                          PG_TABLES T
          UNION
                  SELECT
                          V.SCHEMANAME AS SCHEMA,
                          V.VIEWNAME AS VIEW_NAME,
                          V.VIEWOWNER AS TABLE_OWNER,
                          'VIEW' AS TABLE_TYPE
                  FROM
                          PG_VIEWS V ) A
          LEFT JOIN PG_CATALOG.pg_statio_all_tables c on
                  (c.relname = A.table_name
                  and c.schemaname = A.SCHEMA)
          LEFT JOIN PG_CATALOG.pg_description pgd on
                  (pgd.objoid = c.relid)
          WHERE
                  A.SCHEMA NOT IN (
                  SELECT
                          schemaname
                  from
                          svv_external_schemas)
                  AND lower(A.SCHEMA) NOT LIKE  'pg_%%' ) tt
  where
          not exists(
          select
                  1
          from
                  PG_CATALOG.pg_description pt
          where
                  pt.objoid = tt.objoid
                  and pt.objsubid = 0)
UNION
  SELECT
          CURRENT_DATABASE() AS CATALOG,
          A.schemaname AS SCHEMA,
          A.tablename AS TABLE_NAME,
          null as TABLE_OWNER,
          'TABLE' as TABLE_TYPE,
          null as REMARKS
  from
          SVV_EXTERNAL_TABLES A
  join svv_external_schemas B on
          A.schemaname = B.schemaname )
WHERE
  lower(SCHEMA) NOT IN ('information_schema')
  AND lower(SCHEMA) NOT IN ('''')
  AND TABLE_TYPE != 'VIEW'
ORDER BY
  TABLE_TYPE,
  SCHEMA,
  TABLE_NAME;

View

Make sure that your query has columns labeled as CATALOG, SCHEMA, VIEW_NAME, VIEW_CREATE_STATEMENT, VIEW_TYPE (selected as the literal 'VIEW'), and REMARKS in the SELECT statement.

Example:

SELECT
  CURRENT_DATABASE() AS CATALOG,
  schemaname AS SCHEMA,
  viewname AS VIEW_NAME,
  CASE
  WHEN lower(definition) like 'create materialized view%' THEN definition
  WHEN lower(definition) like 'create or replace materialized view%' THEN definition
  WHEN lower(definition) like 'create view%' THEN definition
  WHEN lower(definition) like 'create or replace view%' THEN definition
  ELSE 'CREATE OR REPLACE VIEW ' || SCHEMA || '.' || VIEW_NAME || ' AS ' || definition
  END as VIEW_CREATE_STATEMENT,
  'VIEW' AS VIEW_TYPE,
  VIEWOWNER AS TABLE_OWNER,
  null AS REMARKS
FROM
  PG_VIEWS
WHERE
  lower(schemaname) NOT LIKE  'pg_%%'
  AND lower(schemaname) NOT IN ('information_schema')
  AND lower(schemaname) NOT IN ('''')
ORDER BY
  SCHEMA,
  VIEW_NAME;

Column

Make sure that your query has columns labeled as CATALOG, SCHEMA, TABLE_NAME, TYPE_NAME, DATA_TYPE, COLUMN_NAME, ORDINAL_POSITION, IS_NULLABLE, REMARKS, and COLUMN_DEFAULT in the SELECT statement.

Example:

select
  table_catalog as CATALOG,
  table_schema as SCHEMA,
  table_name,
  column_name,
  data_type,
  CASE
          WHEN data_type = 'character varying' THEN 'varchar'
          WHEN data_type = 'timestamp without time zone' THEN 'timestamp'
          WHEN data_type = 'timestamp with time zone' THEN 'timestamptz'
          WHEN data_type = 'double precision' THEN 'FLOAT'
          ELSE data_type
  END as type_name,
  CASE
          WHEN data_type = 'character varying' THEN character_maximum_length
          WHEN data_type = 'numeric' THEN numeric_precision
          WHEN data_type = 'character' THEN character_maximum_length
          ELSE -1
  END as length,
  numeric_scale,
  ordinal_position,
  is_nullable,
  column_default,
  remarks
from
  svv_columns
WHERE
  lower(SCHEMA) NOT LIKE  'pg_%%'
  AND lower(SCHEMA) NOT IN ('information_schema')
  AND lower(SCHEMA) NOT IN ('''')
ORDER BY
  CATALOG,
  SCHEMA,
  ORDINAL_POSITION;

Primary Key

Make sure that your query has columns labeled as CATALOG, SCHEMA, TABLE_NAME, PK_NAME, COLUMN_NAME, and KEY_SEQ in the SELECT statement.

Example:

SELECT
  CURRENT_DATABASE() AS CATALOG,
  S.NSPNAME AS SCHEMA,
  C.RELNAME AS TABLE_NAME,
  CONSTR.CONNAME AS PK_NAME,
  ATTR.ATTNAME AS COLUMN_NAME,
  ATTR.ATTNUM AS KEY_SEQ
FROM
  PG_CONSTRAINT CONSTR
INNER JOIN PG_ATTRIBUTE ATTR ON
  ATTR.ATTNUM = ANY (CONSTR.CONKEY)
  AND ATTR.ATTRELID = CONSTR.CONRELID
INNER JOIN PG_CLASS C ON
  CONSTR.CONRELID = C.OID
INNER JOIN PG_NAMESPACE S ON
  C.RELNAMESPACE = S.OID
WHERE
  lower(S.NSPNAME) NOT LIKE   'pg_%%'
  AND lower(S.NSPNAME) NOT IN ('''')
  AND lower(S.NSPNAME) NOT IN ('information_schema')
  AND CONSTR.CONTYPE = 'p'
ORDER BY
  SCHEMA,
  TABLE_NAME,
  PK_NAME,
  KEY_SEQ;

Foreign Key

Make sure that your query has columns labeled as PK_CATALOG, PK_SCHEMA, PK_TABLE, PK_COLUMN, FK_CATALOG, FK_SCHEMA, FK_TABLE, and FK_COLUMN in the select list.

Example:

SELECT
  CURRENT_DATABASE() AS FK_CATALOG,
  CURRENT_DATABASE() AS PK_CATALOG,
  F.FKTABLE_SCHEM AS FK_SCHEMA,
  F.FKTABLE_NAME AS FK_TABLE,
  F.FK_NAME,
  F.FKCOLUMN_NAME AS FK_COLUMN,
  F.KEY_SEQ,
  P.PKTABLE_SCHEM AS PK_SCHEMA,
  P.PKTABLE_NAME AS PK_TABLE,
  P.PK_NAME,
  P.PKCOLUMN_NAME AS PK_COLUMN
FROM
  (
  SELECT
          S.NSPNAME AS FKTABLE_SCHEM,
          C.RELNAME AS FKTABLE_NAME,
          CONSTR.CONNAME AS FK_NAME,
          ATTR.ATTNAME AS FKCOLUMN_NAME,
          ATTR.ATTNUM AS KEY_SEQ,
          CONSTR.CONFKEY,
          CONSTR.CONFRELID
  FROM
          PG_CONSTRAINT CONSTR
  INNER JOIN PG_ATTRIBUTE ATTR ON
          ATTR.ATTNUM = ANY (CONSTR.CONKEY)
          AND ATTR.ATTRELID = CONSTR.CONRELID
  INNER JOIN PG_CLASS C ON
          CONSTR.CONRELID = C.OID
  INNER JOIN PG_NAMESPACE S ON
          C.RELNAMESPACE = S.OID
          AND CONSTR.CONTYPE = 'f'
  WHERE
          lower(S.NSPNAME) NOT LIKE  'pg_%%'
          AND lower(S.NSPNAME)  NOT IN ('''')
          AND lower(S.NSPNAME) NOT IN   ( 'information_schema')
 ) F
INNER JOIN (
  SELECT
          S.NSPNAME AS PKTABLE_SCHEM,
          C.RELNAME AS PKTABLE_NAME,
          CONSTR.CONNAME AS PK_NAME,
          ATTR.ATTNAME AS PKCOLUMN_NAME,
          ATTR.ATTNUM AS PKEY_SEQ,
          C.OID
  FROM
          PG_CONSTRAINT CONSTR
  INNER JOIN PG_ATTRIBUTE ATTR ON
          ATTR.ATTNUM = ANY (CONSTR.CONKEY)
          AND ATTR.ATTRELID = CONSTR.CONRELID
  INNER JOIN PG_CLASS C ON
          CONSTR.CONRELID = C.OID
  INNER JOIN PG_NAMESPACE S ON
          C.RELNAMESPACE = S.OID
          AND CONSTR.CONTYPE = 'p'
  WHERE
          lower(S.NSPNAME) NOT LIKE   'pg_%%'
          AND lower(S.NSPNAME) NOT IN ('''')
          AND lower(S.NSPNAME) NOT IN   ( 'information_schema')
 ) P ON
  P.PKEY_SEQ = ANY (F.CONFKEY)
  AND P.OID = F.CONFRELID
ORDER BY
  FK_SCHEMA,
  FK_TABLE,
  FK_NAME,
  KEY_SEQ;

Function

Make sure that your query has columns labeled as CATALOG, SCHEMA, FUNCTION_NAME, and REMARKS in the select list.

Example:

SELECT
  CURRENT_DATABASE() AS CATALOG,
  S.NSPNAME AS SCHEMA,
  P.PRONAME AS FUNCTION_NAME,
  NULL AS SPECIFIC_NAME,
  NULL AS REMARKS
FROM
  PG_PROC P
INNER JOIN PG_NAMESPACE S ON
  P.PRONAMESPACE = S.OID
INNER JOIN PG_LANGUAGE L ON
  P.PROLANG = L.OID
WHERE
  (L.LANNAME IN ('internal',
  'sql')
  AND p.prorettype <> 'pg_catalog.trigger'::pg_catalog.regtype
  OR L.LANNAME IN ('plpythonu'))
  AND lower(S.NSPNAME) NOT IN ('''')
  AND lower(S.NSPNAME) NOT LIKE  'pg_%%'
  AND lower(S.NSPNAME) NOT IN ('information_schema')
ORDER BY
  CATALOG,
  SCHEMA,
  FUNCTION_NAME;

Function Definition

Make sure that your query has columns labeled as CATALOG, SCHEMA, FUNCTION_NAME, ARG_NAME, TYPE_NAME, ARG_TYPE, and ARG_DEF in the select list.

Example:

SELECT
  CURRENT_DATABASE() AS CATALOG,
  N.NSPNAME AS SCHEMA,
  P.PRONAME AS FUNCTION_NAME,
  ISNULL(P.PROARGNAMES,'{}')  AS ARG_NAME,
  P.PRONARGS AS NUM_OF_FUNCTION_ARGS,
  ISNULL(STRING_TO_ARRAY(REPLACE(OIDVECTORTYPES(P.PROARGTYPES), '\"', ''), ','),'{}') AS ARG_TYPE,
  FORMAT_TYPE(P.PRORETTYPE, NULL) AS TYPE_NAME,
  P.PROSRC AS ARG_DEF
FROM
  PG_PROC P
INNER JOIN PG_NAMESPACE N ON
  P.PRONAMESPACE = N.OID
INNER JOIN PG_LANGUAGE L ON
  P.PROLANG = L.OID
WHERE
  (L.LANNAME IN ('internal',
  'sql')
  AND p.prorettype <> 'pg_catalog.trigger'::pg_catalog.regtype
  OR L.LANNAME IN ('plpythonu'))
  AND lower(N.NSPNAME) NOT IN ('''')
  AND lower(N.NSPNAME) NOT LIKE 'pg_%%'
  AND lower(N.NSPNAME) NOT IN ('information_schema')
ORDER BY
  CATALOG,
  SCHEMA,
  FUNCTION_NAME;

Selective Extraction

On the Metadata Extraction tab, you can apply a selective extraction filter that includes or excludes a list of schemas.

Enable the Selective Extraction toggle if you want only a subset of schemas to be extracted.

  1. Click Get List of Schemas to fetch the list of schemas from the data source. The status of the Get Schemas action will be logged in the Extraction Job Status table at the bottom of the Metadata Extraction page.

  2. When Schema synchronization is complete, a drop-down list of the Schemas will become enabled.

  3. Select one or more schemas as required.

  4. Check if you are using the desired filter option. Available filter options are described below:

  • Extract all Schemas except - Extract metadata from all schemas except the selected schemas.

  • Extract only these Schemas - Extract metadata only from the selected schemas.

  5. Click Run Extraction Now to extract metadata. The status of the extraction action is also logged in the Job History table at the bottom of the page.

Automated Extraction

If you wish to update the extracted metadata automatically, turn on the Enable Automated Extraction switch under Automated and Manual Extraction and select the day and time when metadata must be extracted. Extraction will then run automatically on the selected schedule.

Compose

On the Compose tab, an admin can enable the use of the Compose tool for this data source.

  1. Before you enable Compose, provide the JDBC URI that Compose will use as the default connection in the Default Connection field, and click Save.

  2. Enable the Enabled in Compose toggle to enable Compose for this data source. Select a Compose Connection Sharing option based on the descriptions below:

    • Shared connections across tabs - This option lets users use the same connection across multiple Compose tabs.

    • Separate connection per tab - Users can use a different connection for each Compose tab, which enables them to run multiple queries at the same time.

  3. Select a Data Uploader option based on the description below:

    • Use Global Setting (True) or Use Global Setting (False) - Use the global setting configured in alation_conf with the alation.data_uploader.enabled flag. If the flag is set to true, users can upload data; if it is set to false, data upload is disabled for all data sources.

    • Enable for this data source - Enable data upload for this data source, overriding the global setting in alation_conf if it is set to false.

    • Disable for this data source - Disable data upload for this data source, overriding the global setting in alation_conf if it is set to true.

    Note

    OAuth connections and SSO authentication are not supported for this data source.

Data Sampling

Automated and Manual Sampling

Users can either perform manual sampling or enable automated sampling:

  1. To perform manual sampling, make sure that the Enable Automated Sampling toggle is Off, then click the Sample button.

  2. Set the Enable Automated Sampling toggle to On to perform automated sampling.

    1. Set a schedule in the corresponding fields of the schedule section by specifying values for week, day, and time.

    2. Choose the number of tables to be sampled by selecting a value in the drop-down list. By default, all tables are sampled. When a specific number of tables is selected, unprofiled and popular tables are prioritized.

    3. Click Sample.

Per-Object Parameters

Refer to Per-Object Parameters.

Query Log Ingestion

You can either create a table for Alation to pull the query logs from or use a custom query to fetch the logs from the Redshift data source.

Before performing QLI, complete the QLI setup. Refer to Redshift QLI Setup.

Connector Settings

Table-Based QLI

In the Table Name field, provide the name of the table or view in which the query logs are available. The name must be provided in the following format:

database.table/viewname
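
For example, if the logs are in a view named alation_qlog in the dev database (hypothetical names), you would enter:

dev.alation_qlog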

Custom Query-Based QLI

When you cannot create a table or view, you can use a Custom QLI Query to perform QLI. Provide a query that follows the structure shown below and click Save:

SELECT
    user_name as username,
    to_char(start_time, 'YYYY-MM-DD HH:MI:SS.US') as starttime,
    text as queryString,
    session_id as sessionid,
    seconds_taken as seconds,
    'false' as cancelled,
    default_database as defaultDatabases,
    split_part(session_id, '/', 2) as sessionstarttime,
    seq
FROM
    public.alation_qlog
WHERE
    starttime BETWEEN STARTTIME AND ENDTIME
    AND queryString IS NOT NULL
    AND queryString <> ''
    AND username != 'rdsdb'
ORDER BY
    starttime,
    username,
    seq;
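
STARTTIME and ENDTIME are placeholders that Alation substitutes with the ingestion time range at run time. To smoke-test the query in a SQL client before saving it, you can run a variant with literal bounds; this sketch uses example timestamps and the underlying column names for portability:

-- Standalone test with literal time bounds in place of STARTTIME and ENDTIME
SELECT
    user_name AS username,
    to_char(start_time, 'YYYY-MM-DD HH:MI:SS.US') AS starttime,
    text AS queryString,
    session_id AS sessionid,
    seconds_taken AS seconds,
    'false' AS cancelled,
    default_database AS defaultDatabases,
    split_part(session_id, '/', 2) AS sessionstarttime,
    seq
FROM
    public.alation_qlog
WHERE
    start_time BETWEEN '2024-01-01 00:00:00' AND '2024-01-02 00:00:00'
    AND text IS NOT NULL
    AND text <> ''
    AND user_name != 'rdsdb'
ORDER BY starttime, username, seq;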

Automated and Manual QLI

Users can either perform manual Query Log Ingestion or enable automated Query Log Ingestion:

  1. To perform manual Query Log Ingestion, make sure that the Enable Automated Query Log Ingestion toggle is Off, then click the Import button.

  2. Set the Enable Automated Query Log Ingestion toggle to On to perform the automated Query Log Ingestion.

    1. Set a schedule in the corresponding fields of the schedule section by specifying values for week, day, and time.

    2. Click Import.

Custom Settings

This configuration option is available if Profiling V2 is enabled.

To profile a column, Alation runs a default query unless a custom query is specified for that column. Default queries for column profiling differ depending on the data type: numeric or non-numeric.

The default query supplied by Alation can be modified based on user requirements and the data source type. The following default query templates can be used to revert from a custom query back to the default:

Numeric columns:

SELECT
  MIN({column_name}) AS MIN,
  MAX({column_name}) AS MAX,
  AVG({column_name}) AS MEAN,
  (COUNT(*) - COUNT({column_name})) AS "#NULL",
  (CASE WHEN COUNT(*) > 0 THEN ((COUNT(*) -  COUNT({column_name})) * 100.0 / COUNT(*)) ELSE 0.0 END) AS "%NULL"
FROM {schema_name}.{table_name};

Non-numeric columns:

SELECT
  (SUM(CASE WHEN {column_name} IS NULL THEN 1 ELSE 0 END)) AS "#NULL",
  (CASE WHEN COUNT(*) > 0 THEN (SUM(CASE WHEN {column_name} IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*)) ELSE 0.0 END) AS "%NULL"
FROM {schema_name}.{table_name};

Important

The profiling queries MUST contain the {column_name}, {schema_name}, and {table_name} placeholders. Users must check the compatibility of the default query with their data source type and modify it if required.

The default profiling query calculates the profiling stats that are displayed on the Overview tab of the Column catalog page. When you customize the query, you can also customize which statistics are calculated and displayed.
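
For example, a customized numeric-column query might add extra statistics. The sketch below extends the default template with standard deviation and a distinct-value count; the extra aliases ("STD DEV", "#DISTINCT") are hypothetical names, not ones mandated by Alation:

SELECT
  MIN({column_name}) AS MIN,
  MAX({column_name}) AS MAX,
  AVG({column_name}) AS MEAN,
  STDDEV({column_name}) AS "STD DEV",
  COUNT(DISTINCT {column_name}) AS "#DISTINCT",
  (COUNT(*) - COUNT({column_name})) AS "#NULL",
  (CASE WHEN COUNT(*) > 0 THEN ((COUNT(*) - COUNT({column_name})) * 100.0 / COUNT(*)) ELSE 0.0 END) AS "%NULL"
FROM {schema_name}.{table_name};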


Troubleshooting

Refer to Troubleshooting.