Data Quality Gates Framework#
The term Data Quality Gate refers to a set of data quality checks executed at a particular stage of a Data Product's data propagation flow.
Data Quality Gates can be classified by:
- Stage – e.g., Landing-to-Bronze gate, Bronze-to-Silver gate.
- Functional Purpose – e.g., technical checks gate, business rules gate.
A Data Quality Gate prevents low-quality data from propagating to the final Data Product. It achieves this by:
- Blocking data that fails quality checks and moving it to a data quarantine area for later handling.
- Explicitly marking poor quality records for handling in subsequent gates or view queries.
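The two behaviours above (blocking versus marking) can be sketched as follows. This is a minimal illustration, not a reference implementation: the names `run_gate`, `GateResult`, and the `dq_failed_checks` marker field are hypothetical, and records are assumed to be plain dictionaries.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class GateResult:
    passed: List[dict] = field(default_factory=list)       # records that propagate further
    quarantined: List[dict] = field(default_factory=list)  # records held back for later handling


def run_gate(records: List[dict],
             checks: Dict[str, Callable[[dict], bool]],
             mode: str = "block") -> GateResult:
    """Apply every check to every record.

    mode="block": failing records are moved to quarantine.
    mode="mark":  all records propagate, each carrying the (hypothetical)
                  'dq_failed_checks' field listing the checks it failed.
    """
    if mode not in ("block", "mark"):
        raise ValueError(f"unknown mode: {mode!r}")
    result = GateResult()
    for record in records:
        failed = [name for name, check in checks.items() if not check(record)]
        if mode == "mark":
            result.passed.append({**record, "dq_failed_checks": failed})
        elif failed:
            result.quarantined.append({**record, "dq_failed_checks": failed})
        else:
            result.passed.append(record)
    return result


rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": None}]
checks = {"amount_present": lambda r: r["amount"] is not None}

blocked = run_gate(rows, checks, mode="block")
marked = run_gate(rows, checks, mode="mark")
```

In `block` mode the second record ends up in `blocked.quarantined`; in `mark` mode both records stay in `marked.passed`, and downstream gates or view queries can filter on the marker.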
The arrangement of gates may vary depending on the data ingestion and propagation scenarios.
The sections below describe some commonly used gates. These can be modified as necessary to fit specific Data Product use cases.
Landing Area Delivery Consistency Gate#
Purpose:#
The purpose of this Gate is to ensure the consistency of ingested data.
This is achieved by validating that all entities are delivered to the Landing area within the agreed timelines, as specified in the ingestion contract.
Scenarios:#
- Standard Daily Ingestion
  * The check runs after the ingestion pipeline starts.
  * Status: Pass if entities are delivered within the agreed timeline or before the pipeline starts.
  * Status: Fail if entities arrive after the pipeline starts.
- Late Delivery Detection
  * Handles cases where entities are delivered outside the timeline.
  * Example: For a 9 PM–11 PM contract, entities arriving after the scheduled pipeline (e.g., 6:30 AM) fail the check.
- Manual Re-run or Error Handling
  * During manual ingestion re-runs, checks are ignored to avoid false failures.
  * If a pipeline fails and restarts, checks run based on adjusted parameters.
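The scenarios above can be sketched as a single status function. This is an illustration under stated assumptions: the function name `check_delivery`, the status strings, and the treatment of a never-delivered entity as a failure are all hypothetical, and the adjusted parameters used after a pipeline restart are not modelled.

```python
from datetime import datetime
from typing import Dict, List


def check_delivery(expected: List[str],
                   deliveries: Dict[str, datetime],
                   window_end: datetime,
                   pipeline_start: datetime,
                   manual_rerun: bool = False) -> Dict[str, str]:
    """Return a status per expected entity: 'pass', 'fail' or 'skipped'."""
    if manual_rerun:
        # Checks are ignored during manual re-runs to avoid false failures.
        return {entity: "skipped" for entity in expected}
    statuses = {}
    for entity in expected:
        delivered_at = deliveries.get(entity)  # None if never delivered (assumed to fail)
        on_time = delivered_at is not None and (
            delivered_at <= window_end or delivered_at < pipeline_start
        )
        statuses[entity] = "pass" if on_time else "fail"
    return statuses


# Contract window 9 PM to 11 PM, pipeline scheduled for 6:30 AM the next day.
window_end = datetime(2024, 1, 1, 23, 0)
pipeline_start = datetime(2024, 1, 2, 6, 30)
deliveries = {
    "orders": datetime(2024, 1, 1, 22, 15),    # inside the contract window
    "customers": datetime(2024, 1, 2, 7, 5),   # after the pipeline started
}
expected = ["orders", "customers", "invoices"]

statuses = check_delivery(expected, deliveries, window_end, pipeline_start)
rerun_statuses = check_delivery(expected, deliveries, window_end, pipeline_start,
                                manual_rerun=True)
```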
Landing/Raw Area Structure and Content Gate#
Purpose:#
The purpose of this Gate is to ensure that data snapshots (chunks) comply with structural and content requirements before being propagated further.
Checks#
The checks apply to data in a tabular format. If the source data is in JSON or XML, it must be parsed first.
| Check | Description |
|---|---|
| File Correctness Check | This check ensures the integrity and correctness of the data file by attempting to read it. It identifies corrupted or improperly formatted files that could cause errors during data processing. |
| Column names correctness | Verifies that the list of columns in Raw data matches the entity configuration. Extra, missing, or duplicate columns result in quarantine of the chunk. |
| Emptiness Check: source file | Ensures that the source file is not empty. Empty files lead to quarantine of the entire snapshot. |
| Emptiness Check: mandatory columns | Ensures that mandatory columns contain data. Missing data in mandatory fields leads to quarantine. |
| Schema Validation | Validates data against the expected schema. Invalid file schema leads to quarantine. |
| Types Validation | Ensures all columns can be converted to the target data types. If conversion fails, the entire chunk is quarantined. |
| Full Duplicates | Identifies fully identical records, collapses duplicates into a single record, and passes the result further. |
| Primary keys validity | Checks for non-unique primary keys. Duplicates are moved to Quarantine. |
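Three of the checks in the table (column names, full duplicates, primary keys) can be sketched with the standard library alone. The helper names are hypothetical, rows are assumed to be dictionaries of hashable scalar values, and the sketch assumes that every row sharing a non-unique primary key is quarantined, which the table does not state explicitly.

```python
from collections import Counter
from typing import Dict, List, Tuple


def check_column_names(actual: List[str], expected: List[str]) -> Dict[str, List[str]]:
    """Return the column problems found; all-empty lists mean the chunk passes."""
    counts = Counter(actual)
    return {
        "missing": sorted(set(expected) - set(actual)),
        "extra": sorted(set(actual) - set(expected)),
        "duplicate": sorted(name for name, n in counts.items() if n > 1),
    }


def collapse_full_duplicates(rows: List[dict]) -> List[dict]:
    """Keep the first occurrence of each fully identical record."""
    seen = set()
    unique = []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


def split_on_primary_key(rows: List[dict],
                         key_columns: List[str]) -> Tuple[List[dict], List[dict]]:
    """Return (valid, quarantined); rows with a non-unique key are quarantined."""
    def key_of(row: dict) -> tuple:
        return tuple(row[column] for column in key_columns)

    counts = Counter(key_of(row) for row in rows)
    valid = [row for row in rows if counts[key_of(row)] == 1]
    quarantined = [row for row in rows if counts[key_of(row)] > 1]
    return valid, quarantined


rows = [
    {"id": 1, "amount": 10},
    {"id": 1, "amount": 10},  # full duplicate of the first row
    {"id": 2, "amount": 5},
    {"id": 2, "amount": 7},   # same key, different content
    {"id": 3, "amount": 1},
]
deduped = collapse_full_duplicates(rows)
valid, quarantined = split_on_primary_key(deduped, ["id"])
column_problems = check_column_names(
    actual=["id", "amount", "amount", "note"],
    expected=["id", "amount", "currency"],
)
```

Running the duplicate collapse before the primary key check matters here: otherwise the two identical `id = 1` rows would be quarantined instead of being collapsed into one valid record.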
Aggregated Data area Integrity and Accuracy Etalon Gate#
Etalon (reference) data is pre-aggregated from Landing/Bronze data and serves as the baseline against which processed Silver/Gold Layer data is compared.
Purpose:#
The purpose of this Gate is to ensure that the data snapshot built from the Silver Layer matches the source system snapshot.
It detects and resolves discrepancies between the two, ensuring data integrity and accuracy.
This Gate is particularly valuable for derived Data Products populated by independent pipelines or through different modes (e.g., full, incremental, rolling periods).
Etalon Check Description#
The etalon check compares aggregated metrics (e.g., amount, cost) at a defined granularity for the current period (e.g., year) plus a defined number of past periods (e.g., the last three years). Metrics from Silver are compared against independently extracted etalon metrics, and differences are analyzed.
Pass/Fail Criteria#
Accepted differences are defined as tolerances, e.g. less than 100 EUR or 0.01%. If any metric for a specific entity/month does not match within the tolerance, the entire snapshot fails. Depending on the policies, failed snapshots can either proceed to Gold or require fixing.
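A minimal sketch of the pass/fail rule follows. It assumes that a difference is accepted when it satisfies either the absolute or the relative tolerance, and that a key present on only one side counts as a mismatch. The names `within_tolerance` and `etalon_check`, the key shape `(entity, month)`, and the sample figures are hypothetical.

```python
from decimal import Decimal
from typing import Dict, List, Tuple

ABS_TOLERANCE = Decimal("100")     # e.g. 100 EUR
REL_TOLERANCE = Decimal("0.0001")  # 0.01 %

Key = Tuple[str, str]  # (entity, month)


def within_tolerance(silver: Decimal, etalon: Decimal) -> bool:
    """Accept the difference if it is below either tolerance (assumption)."""
    diff = abs(silver - etalon)
    if diff < ABS_TOLERANCE:
        return True
    return etalon != 0 and diff / abs(etalon) < REL_TOLERANCE


def etalon_check(silver: Dict[Key, Decimal], etalon: Dict[Key, Decimal]) -> List[Key]:
    """Return the keys that do not match; an empty list means the snapshot passes."""
    mismatches = []
    for key in sorted(set(silver) | set(etalon)):
        if key not in silver or key not in etalon:
            mismatches.append(key)  # metric present on one side only
        elif not within_tolerance(silver[key], etalon[key]):
            mismatches.append(key)
    return mismatches


etalon = {
    ("sales", "2024-01"): Decimal("1000000.00"),
    ("sales", "2024-02"): Decimal("2500000.00"),
    ("costs", "2024-01"): Decimal("5000.00"),
}
silver = {
    ("sales", "2024-01"): Decimal("1000050.00"),  # off by 50: below 100 EUR
    ("sales", "2024-02"): Decimal("2500200.00"),  # off by 200: 0.008 %, below 0.01 %
    ("costs", "2024-01"): Decimal("5150.00"),     # off by 150: 3 %, outside both limits
}

mismatches = etalon_check(silver, etalon)
snapshot_passed = not mismatches
```

Because a single mismatching entity/month fails the whole snapshot, `snapshot_passed` is `False` here even though two of the three metrics are within tolerance.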
Common Failure Reasons and Solutions#
- Data in Quarantine
  * Cause: Source data snapshot (chunk) contains incorrect records.
  * Solution: Fix data at the source or correct policies.
- Issues with Incremental Merge
  * Cause: Deleted records not handled correctly.
  * Solution: Tune the incremental processing algorithm.
- Precision of Numerical Metrics
  * Cause: Use of non-precise formats (e.g., float).
  * Solution: Use fixed-precision decimals or extend accuracy limits.
- Non-Synchronous Data Extraction
  * Cause: Etalon and checked data collected at different times.
  * Solution:
    - Reduce extraction window.
    - Use snapshot isolation mode.
    - Switch to incremental mode.
    - Extend allowed limits for recent data.
- Incremental Data Extraction Issues
  * Cause: Incorrect delta calculations.
  * Solution: Tune source-side delta algorithms.
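The "Precision of Numerical Metrics" item above can be illustrated with a small example. It accumulates the same ten amounts once as binary floats and once as fixed-precision decimals; the amounts are made up for illustration.

```python
from decimal import Decimal

# Naive accumulation in binary floating point, as an aggregation engine might do.
float_total = 0.0
for _ in range(10):
    float_total += 0.1

# The same accumulation with fixed-precision decimals.
decimal_total = Decimal("0.00")
for _ in range(10):
    decimal_total += Decimal("0.10")

float_matches = float_total == 1.0                 # False: rounding error accumulates
decimal_matches = decimal_total == Decimal("1.00")  # True: decimal addition is exact here
```

On real data volumes such rounding drift can exceed a tight tolerance, which is why the list suggests either fixed-precision decimals or wider accuracy limits.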
Benchmarking Gate#
Purpose:#
The objective of the Data Quality Benchmarking Gate is to validate data consistency and detect discrepancies or anomalies in a new data load.
This is achieved by comparing the new dataset (snapshot) with benchmarks from the previously successful data load (e.g., total amounts at a defined granularity).
Consistency checks are essential to identifying and addressing data quality issues, thereby enhancing the reliability and trustworthiness of Data Products.
Validation occurs on the pre-curated layer, ensuring that all critical scenarios are checked and only the most accurate data populate the next curated layer.
Check:#
Ensure data consistency by checking scenarios such as the following:
- compare the pre-curated state before and after the refresh,
- detect and explain significant changes in the past, and gaps in data by defined granularity (time spans, business [sub]units),
- explain sudden big fluctuations for the current period,
- assess the total amount trend over years and months.
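A comparison against the previous successful load could look like the sketch below. It assumes totals are already aggregated per period, treats a period that disappears from the new load as a gap, and uses a hypothetical 5% relative-change threshold. The function name `benchmark` and the finding format are illustrative only.

```python
from typing import Dict, List, Optional, Tuple

Finding = Tuple[str, str, Optional[float]]  # (period, kind, relative change)


def benchmark(previous: Dict[str, float],
              current: Dict[str, float],
              max_relative_change: float = 0.05) -> List[Finding]:
    """Compare per-period totals of the new load with the previous successful load."""
    findings: List[Finding] = []
    for period in sorted(previous):
        if period not in current:
            findings.append((period, "gap", None))  # period vanished from the new load
            continue
        old, new = previous[period], current[period]
        if old == 0:
            change = 0.0 if new == 0 else float("inf")
        else:
            change = abs(new - old) / abs(old)
        if change > max_relative_change:
            findings.append((period, "change", change))
    return findings


previous = {"2023-11": 1000.0, "2023-12": 1200.0, "2024-01": 900.0}
current = {"2023-11": 1000.0, "2024-01": 1350.0, "2024-02": 400.0}

findings = benchmark(previous, current)
```

Periods that exist only in the new load (here `2024-02`) are not flagged, since new periods are expected on refresh. Each finding still needs a human or rule-based explanation before the data moves to the curated layer.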
Business Rules Gate#
Purpose:#
The objective of the Data Quality Business Rules Gate is to validate data integrity by establishing specific business rules and metrics, and controlling them by comparing the new data load (snapshot) against predefined thresholds.
Integrity checks are crucial for identifying and resolving data quality issues, improving the reliability and trustworthiness of Data Products.
Validation occurs on the Silver Layer, ensuring all essential scenarios are checked, so the most accurate data advance to the Gold Layer.
Checks:#
- calculate specific metrics on the new data load (snapshot),
- detect and assess the metrics at lower levels of granularity against predefined thresholds,
- investigate and explain notable discrepancies against thresholds for the current snapshot.
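The checks above can be sketched as a metric computed per group and compared with a threshold. The metric (share of rows with a missing customer), the grouping column `business_unit`, the 10% threshold, and the helper names are all hypothetical examples rather than rules prescribed by the framework.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def share_by_group(rows: List[dict],
                   group_column: str,
                   predicate: Callable[[dict], bool]) -> Dict[str, float]:
    """For each group, return the share of rows for which predicate(row) is True."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for row in rows:
        group = row[group_column]
        totals[group] += 1
        if predicate(row):
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


def find_breaches(metric: Dict[str, float], max_allowed: float) -> Dict[str, float]:
    """Return the groups whose metric value exceeds the threshold."""
    return {group: value for group, value in metric.items() if value > max_allowed}


snapshot = [
    {"business_unit": "north", "customer_id": "C1"},
    {"business_unit": "north", "customer_id": None},
    {"business_unit": "north", "customer_id": "C2"},
    {"business_unit": "north", "customer_id": "C3"},
    {"business_unit": "south", "customer_id": "C4"},
    {"business_unit": "south", "customer_id": "C5"},
]

missing_customer_share = share_by_group(
    snapshot, "business_unit", lambda row: row["customer_id"] is None
)
breaches = find_breaches(missing_customer_share, max_allowed=0.10)
```

Computing the metric per business unit rather than for the whole snapshot is what surfaces the problem: the overall share of missing customers is lower than the share in the affected unit.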