Schema Path Pattern

In large scale analytics systems or data lakes, data representing a single logical schema is stored as a multi-file dataset. For example:

"/path/to/sample.parquet/part-1.snappy.parquet",
"/path/to/sample.parquet/part-2.snappy.parquet",
.
.
.
"/path/to/sample.parquet/part-n.snappy.parquet",

In this scenario, sample.parquet is a single logical schema with data stored in file part-1 to part-n. Each of these files have the same set of columns. In the default case of Schema extraction, “Use Schema Path pattern” will be unchecked and each of these files will be read to extract columns and columns will be cataloged on corresponding catalog pages for each file. This is a time intensive operation resulting in a lot of redundant data being read (redundant data here is the same set of columns read for each file part-1 to part-n).

Schema Path pattern is the mechanism to discover schemas (example - sample.parquet) as a logical schema optimizing the amount of data read and providing better usability.

To use Schema Path pattern enable the “Use Schema Path Pattern” option and provide a relevant text pattern in the “Schema Path Pattern” field as described.

Populating the “Schema Path Pattern” field according to the file path pattern for your data:

Scenario 1

Single logical schema with multi-file dataset:

container/warehouse/production-database/sales_data/sales_data1.parquet
container/warehouse/production-database/sales_data/sales_data2.parquet
…
container/warehouse/production-database/sales_data/sales_datan.parquet

The text string needs to be provided as input in the “Schema Path Pattern” field for Scenario 1 as per the below examples:

  • sales_data

  • production-database/sales_data

Detected Logical Schema

container/warehouse/production-database/sales_data - Columns for this logical schema will be cataloged on the catalog page for container/warehouse/production-database/sales_data folder by reading the columns for file sales_data1.parquet.

Scenario 2

Multiple logical schemas with corresponding multi-file dataset. All logical schemas at same directory depth:

container/warehouse/production-database/sales_data/sales_data1.parquet
container/warehouse/production-database/sales_data/sales_data2.parquet
...
container/warehouse/production-database/sales_data/sales_datan.parquet

container/warehouse/production-database/product_data/product_data_1.parquet
container/warehouse/production-database/product_data/product_data_2.parquet
...
container/warehouse/production-database/product_data/product_data_n.parquet

Expected behaviour for this kind of scenario is to discover sales_data and product_data as two logical schemas.

The text string needs to be provided as input in the “Schema Path Pattern” field for Scenario 2 as per the below examples:

  • production-database/.*

  • production-database/(.*)_data

  • production-database/(sales|product)_data

Detected Logical Schemas

  • container/warehouse/production-database/sales_data - Columns for this logical schema will be cataloged on the catalog page for container/warehouse/production-database/sales_data folder by reading the columns for file sales_data1.parquet.

  • container/warehouse/production-database/product_data - Columns for this logical schema will be cataloged on the catalog page for container/warehouse/production-database/product_data folder by reading the columns for file product_data_1.parquet.

Scenario 3

Multiple logical schemas with corresponding multi-file dataset. Each unique directory at the deepest level for each multi-file dataset to be discovered as a logical schema.

container/warehouse/production-database/sales_data/sales_data1.parquet
container/warehouse/production-database/sales_data/sales_data2.parquet
...
container/warehouse/production-database/sales_data/sales_datan.parquet

container/warehouse/production-database/product_data/product_data_1.parquet
container/warehouse/production-database/product_data/product_data_2.parquet
...
container/warehouse/production-database/product_data/product_data_n.parquet

container/warehouse/production-database/customer/customer_data/customer_data_1.parquet
container/warehouse/production-database/customer/customer_data/customer_data_1.parquet
...
container/warehouse/production-database/customer/customer_data/customer_data_n.parquet

Expected behaviour for this kind of scenario is to discover sales_data, product_data and customer_data as logical schemas.

The text string needs to be provided as input in the “Schema Path Pattern” field for Scenario 3 as per the below example:

  • production-database/.*

Detected Logical Schemas

  • container/warehouse/production-database/sales_data - Columns for this logical schema will be cataloged on the catalog page for container/warehouse/production-database/sales_data folder by reading the columns for file sales_data1.parquet.

  • container/warehouse/production-database/product_data - Columns for this logical schema will be cataloged on the catalog page for container/warehouse/production-database/product_data folder by reading the columns for file product_data_1.parquet.

  • container/warehouse/production-database/customer/customer_data - Columns for this logical schema will be cataloged on the catalog page for container/warehouse/production-database/customer/customer_data folder by reading the columns for file customer_data_1.parquet.

Scenario 4

Single logical schema with multi-file dataset, data divided by year/month/day.

container/warehouse/production-database/inventory/2021/01/01/data.parquet
container/warehouse/production-database/inventory/2021/01/02/data.parquet
..
container/warehouse/production-database/inventory/2021/01/31/data.parquet
container/warehouse/production-database/inventory/2021/02/01/data.parquet
container/warehouse/production-database/inventory/2021/02/02/data.parquet
...
container/warehouse/production-database/inventory/2021/02/28/data.parquet
...
...
container/warehouse/production-database/inventory/2021/12/31/data.parquet

container/warehouse/production-database/inventory/2022/01/01/data.parquet
container/warehouse/production-database/inventory/2022/01/02/data.parquet
..
container/warehouse/production-database/inventory/2022/01/31/data.parquet
container/warehouse/production-database/inventory/2022/02/01/data.parquet
container/warehouse/production-database/inventory/2022/02/02/data.parquet
...
container/warehouse/production-database/inventory/2022/02/28/data.parquet
...
...
container/warehouse/production-database/inventory/2022/12/31/data.parquet

The text string needs to be provided as input in the “Schema Path Pattern” field for Scenario 4 as per the below examples:

  • production-database/inventory

  • production-database/[a-z0-9A-Z]*

Detected Logical Schemas

container/warehouse/production-database/inventory - Columns for this logical schema will be cataloged on the catalog page for container/warehouse/production-database/inventory folder by reading the columns for file container/warehouse/production-database/inventory/2021/01/01/data.parquet.

Scenario 5

Multiple logical schemas with multi-file dataset, data divided by year/month/day.

container/warehouse/production-database/inventory/2021/01/01/data.parquet
container/warehouse/production-database/inventory/2021/01/02/data.parquet
..
container/warehouse/production-database/inventory/2021/01/31/data.parquet
container/warehouse/production-database/inventory/2021/02/01/data.parquet
container/warehouse/production-database/inventory/2021/02/02/data.parquet
...
container/warehouse/production-database/inventory/2021/02/28/data.parquet
...
...
container/warehouse/production-database/inventory/2021/12/31/data.parquet

container/warehouse/production-database/inventory/2022/01/01/data.parquet
container/warehouse/production-database/inventory/2022/01/02/data.parquet
..
container/warehouse/production-database/inventory/2022/01/31/data.parquet
container/warehouse/production-database/inventory/2022/02/01/data.parquet
container/warehouse/production-database/inventory/2022/02/02/data.parquet
...
container/warehouse/production-database/order/2022/02/28/data.parquet
...
...
container/warehouse/production-database/order/2022/12/31/data.parquet


container/warehouse/production-database/order/2021/01/01/data.parquet
container/warehouse/production-database/order/2021/01/02/data.parquet
..
container/warehouse/production-database/order/2021/01/31/data.parquet
container/warehouse/production-database/order/2021/02/01/data.parquet
container/warehouse/production-database/order/2021/02/02/data.parquet
...
container/warehouse/production-database/order/2021/02/28/data.parquet
...
...
container/warehouse/production-database/order/2021/12/31/data.parquet

container/warehouse/production-database/order/2022/01/01/data.parquet
container/warehouse/production-database/order/2022/01/02/data.parquet
..
container/warehouse/production-database/order/2022/01/31/data.parquet
container/warehouse/production-database/order/2022/02/01/data.parquet
container/warehouse/production-database/order/2022/02/02/data.parquet
...
container/warehouse/production-database/order/2022/02/28/data.parquet
...
...
container/warehouse/production-database/order/2022/12/31/data.parquet

The text string needs to be provided as input in the “Schema Path Pattern” field for Scenario 5 as per the below example:

  • production-database/([a-z0-9A-Z]*)

Detected Logical Schemas

  • container/warehouse/production-database/inventory - Columns for this logical schema will be cataloged on the catalog page for container/warehouse/production-database/inventory folder by reading the columns for file container/warehouse/production-database/inventory/2021/01/01/data.parquet.

  • container/warehouse/production-database/order - Columns for this logical schema will be cataloged on the catalog page for container/warehouse/production-database/order folder by reading the columns for file container/warehouse/production-database/order/2022/02/28/data.parquet.

Scenario 6

Multiple logical schemas with multi-file dataset, each unique directory under a specific directory to be identified as logical schema.

container/folder/engineering/data/emp1_data/2022/01/1.parquet
container/folder/engineering/data/emp1_data/2022/02/1.parquet
container/folder/engineering/data/emp1_data/2022/03/1.parquet
container/folder/engineering/data/emp1_data/2022/04/1.parquet

container/folder/engineering/data/emp2_data/2022/01/1.parquet
container/folder/engineering/data/emp2_data/2022/02/1.parquet
container/folder/engineering/data/emp2_data/2022/03/1.parquet
container/folder/engineering/data/emp2_data/2022/04/1.parquet

The text string needs to be provided as input in the “Schema Path Pattern” field for Scenario 6 as per the below example:

  • engineering/data/([a-z_0-9]*)

Detected Logical Schemas

  • container/folder/engineering/data/emp1_data - Columns for this logical schema will be cataloged on the catalog page for container/folder/engineering/data/emp1_data folder by reading the columns for file container/folder/engineering/data/emp1_data/2022/01/1.parquet.

  • container/folder/engineering/data/emp2_data - Columns for this logical schema will be cataloged on the catalog page for container/folder/engineering/data/emp2_data folder by reading the columns for file container/folder/engineering/data/emp2_data/2022/01/1.parquet.

Scenario 7

Multiple logical schemas with multi-file dataset, complex path pattern.

container/warehouse/production-database/sales_data/sales_data1.parquet
container/warehouse/production-database/sales_data/sales_data2.parquet
...
container/warehouse/production-database/sales_data/sales_datan.parquet

container/warehouse/production-database/product_data/product_data_1.parquet
container/warehouse/production-database/product_data/product_data_2.parquet
...
container/warehouse/production-database/product_data/product_data_n.parquet


container/warehouse/dev-database/sales_data/sales_data1.parquet
container/warehouse/dev-database/sales_data/sales_data2.parquet
...
container/warehouse/dev-database/sales_data/sales_datan.parquet

container/warehouse/dev-database/product_data/product_data_1.parquet
container/warehouse/dev-database/product_data/product_data_2.parquet
...
container/warehouse/dev-database/product_data/product_data_n.parquet

The text string needs to be provided as input in the “Schema Path Pattern” field for Scenario 7 as per the below examples:

  • (production|dev)-database/.*

  • production-database/(sales_data|product_data)|(dev-database/(sales|product)_data)

Detected Logical Schemas

  • container/warehouse/production-database/sales_data

  • container/warehouse/production-database/product_data

  • container/warehouse/dev-database/sales_data

  • container/warehouse/dev-database/product_data

Scenario 8

Multiple logical schemas with multi-file dataset with hive style column partitioning.

container/warehouse/production-database/sales_data/city=IN/state=GJ/sales_data1.parquet
container/warehouse/production-database/sales_data/city=IN/state=GJ/sales_data2.parquet
container/warehouse/production-database/sales_data/city=IN/state=MH/sales_data1.parquet
container/warehouse/production-database/sales_data/city=IN/state=MH/sales_data2.parquet
...
container/warehouse/production-database/sales_data/city=XX/state=XX/sales_datan.parquet

container/warehouse/production-database/product_data/year=2022/month=01/product_data_1.parquet
container/warehouse/production-database/product_data/year=2022/month=01/product_data_2.parquet
container/warehouse/production-database/product_data/year=2022/month=02/product_data_1.parquet
container/warehouse/production-database/product_data/year=2022/month=02/product_data_2.parquet
...
container/warehouse/production-database/product_data/year=XXX/month=XXX/product_data_n.parquet

The text string needs to be provided as input in the “Schema Path Pattern” field for Scenario 8 as per the below examples:

  • production-database/(sales|product)_data

  • production-database/(.*)_data

  • production-database/[^=]*

Detected Logical Schemas

  • container/warehouse/production-database/sales_data - Columns for this logical schema will be cataloged on the catalog page for container/warehouse/production-database/sales_data folder by reading the columns for file container/warehouse/production-database/sales_data/city=IN/state=GJ/sales_data1.parquet. Additionally, column partitions city and state will also be cataloged as columns.

  • container/warehouse/production-database/product_data - Columns for this logical schema will be cataloged on the catalog page for container/warehouse/production-database/product_data folder by reading the columns for file container/warehouse/production-database/product_data/year=2022/month=01/product_data_1.parquet. Additionally, column partitions year and month will also be cataloged as columns.

Scenario 9

container/warehouse/production-database/inventory/spend/xyz/2021/01/01/data.parquet

container/warehouse/production-database/inventory/def/2021/01/01/data.parquet

container/warehouse/production-database/inventory1/2021/01/01/data.parquet

Expectation here is to discover spend, def and inventory1 as logical schemas, however, that is currently not supported. Within a discovered resource set first file (as per the order in the inventory report) with a supported file format is read for columns.

Example 1:

container/warehouse/production-database/sales_data/part1.parquet
container/warehouse/production-database/sales_data/part2.parquet
container/warehouse/production-database/sales_data/part3.parquet

part1.parquet will be read for columns.

Example 2:

container/warehouse/production-database/sales_data/part1.csv
container/warehouse/production-database/sales_data/part1.parquet
container/warehouse/production-database/sales_data/part2.csv
container/warehouse/production-database/sales_data/part1.parquet
container/warehouse/production-database/sales_data/part3.csv
container/warehouse/production-database/sales_data/part1.parquet

part1.csv will be read for columns.

Example 3:

container/warehouse/production-database/sales_data/part1.csv
container/warehouse/production-database/sales_data/part1.psv
container/warehouse/production-database/sales_data/part1.tsv
container/warehouse/production-database/sales_data/part3.parquet

part1.csv will be read for columns.

To do an offline validation of the pattern:

  1. Prepare a set of sample file paths representative of real data:

    container/folder/engineering/data/emp1_data/2022/01/1.parquet
    container/folder/engineering/data/emp1_data/2022/02/1.parquet
    container/folder/engineering/data/emp1_data/2022/03/1.parquet
    container/folder/engineering/data/emp1_data/2022/04/1.parquet
    
    container/folder/engineering/data/emp2_data/2022/01/1.parquet
    container/folder/engineering/data/emp2_data/2022/02/1.parquet
    container/folder/engineering/data/emp2_data/2022/03/1.parquet
    container/folder/engineering/data/emp2_data/2022/04/1.parquet
    
  1. Replace the identified Schema Path Pattern string in this regular expression:

    (.(<replace schema path pattern string here>))/((.).((?i)\bcsv\b|(?i)\btsv\b|(?i)\bpsv\b|(?i)\bparquet\b))

    For example, if the pattern is: create regular expression as: (.(engineering/data/([a-z_0-9]*)))/((.).((?i)\bcsv\b|(?i)\btsv\b|(?i)\bpsv\b|(?i)\bparquet\b))

Use the inputs from point 1 and 2 on a website like regex101, select Java 8 as the FLAVOR. Group 1s in each match will be discovered as logical schemas.