Meta Data Management - Bronze to Silver#

Meta data management is an important and relatively big topic in data architecture. On this page we concentrate on meta data essential for the work of data engineer. Using common metadata attributes allows: 1. Reliable pipelines: Metadata enables automated quality gates 2. Historical analysis: Temporal attributes support point-in-time queries 3. Auditability: Complete lineage from source to consumption 4. Efficient processing: Hash keys accelerate deduplication

We are using three types of metadata attributes#

Standardized metadata attributes ensure consistency across bronze and silver layers
System-managed fields automate data lineage and quality tracking
Explicit historization enables temporal analytics and compliance

Each metadata attribute prefixed with sys_ to be identifiable.

Metadata Attributes#

attribute name	type	possible values	applicable to	description
sys_dq_status	tinyint	`0`: Valid(passes all quality checks) `1`:Warning (passes minimum requirements but requires review) `2`:Fatal (fails critical quality rules) `NULL` = Invalidate	all	Tracks the data quality state of each record. Serves as a gating mechanism for data promotion.NULL - Indicates unprocessed records that haven't undergone quality validation. Set to NULL during initial ingestion. Once the checks are done, this field along with sys_validate_at will be updated. Do not expose records to consumers where this field is NULL.
sys_dq_messages	`ARRAY<STRING>`		all	Contains all error messages associated with the record. In future versions, it is recommended to use `ARRAY<STRUCT>` with fields like error_code, error_message, error_severity, and optional affected_fields for better error details.
sys_received_at	timestamp_ntz		all	UTC Timestamp of the last load job that contained the record. If the record wasn’t updated during the load, this still reflects the load’s start time.
sys_modified_at	timestamp_ntz		all	Represents the UTC timestamp of the last change to the record on the source system. Ideally sourced from the business data and indicates transaction datetime for transactional data or the latest modification date for master/reference data. If not available will be calculated during data load. sys_modified_at will be set to the data load job start date time if loaded record version is different from latest available version. in order to compare two record versions we should compare all business relevant attributes - all business attributes except system attributes and timestamp reflecting change of the record content. To speed up version comparison process often used the hash_key - sys_pk - the key is a hash of concatenated business relevant attributes.
sys_validate_at	timestamp_ntz		all	UTC timestamp of the record’s last validation.
By including this identifier in your datasets, you can: Trace data lineage back to system tables (e.g., Databricks system tables like `billing` or `audit logs`).
sys_is_deleted	boolean		all	Indicates if the record is marked as deleted. The value of sys_modified_at for deleted records reflects the timestamp of deletion.
sys_job_id	string		all	sys_job_id is uniquely identifies the job or pipeline run that processed the record. - For individual jobs, this is typically the `run_id`. - For pipelines, this may be an `update_id` or similar execution identifier. By including this identifier in your datasets, you can: 1. Trace data lineage back to system tables (e.g., Databricks system tables like `billing` or `audit logs`). 2. Attribute processing costs at the record level. 3. Enable cost-based optimization by analyzing runtime efficiency and resource consumption.
sys_source_name	string		all	Name of the source system or application from which the record originated.
sys_source_file_name	string		all	name of the landing zone source file the record came from
sys_is_latest_version	boolean		all	flag marking the latest valid record version
sys_valid_from	timestamp_ntz		silver master data that require scd 2 historisation	Start of the record’s validity interval. Used for SCD Type 2 historization. to join using validity interval use condition `transaction_dttm > sys_valid_from and transaction_dttm <= sys_valid_until`. for the new record sys_valid_from = sys_modified_at and `sys_valid_until = timestamp_ntz("9999-12-31 23:59:59.999999")`
sys_valid_until	timestamp_ntz		silver master data that require scd 2 historisation	end of the record validity interval
sys_hash_key	string		optional, bronze	used for effective deduplication hash of all business fields in the record. if 2 hash keys are equal then records are duplicates. We would recommend to use `SHA2({concatenate business columns}) as sys_hash_key`
sys_is_duplicate	boolean		bronze	indicate that record contains duplicate, all duplicate versions of the record except latest one are marked as duplicates. You can use the following SQL to calculate duplicates: `sql ROW_NUMBER() OVER (PARTITION BY sys_hash_key ORDER BY sys_modified_at, sys_received_at DESC) AS row_cnt.` mark as duplicates all records with row_cnt > 1 . Use filter `sql WHERE sys_is_duplicate = FALSE` to avoid counting the same duplicates twice.

Example on how to use meta data columns#

Let say we are loading data into user_profile table in bronze and then propagate it to silver. user_profile has the following business attributes: -userid - unique user identifier. used as primary key at source system -app_version - application version used by user. -created_at - date when user record was created at source system

The following query will be used to send data to silver:

To simplify silver processing, we create an interface view that will make required type and name conversions and select only valid deduplicated records: - sql NOT sys_is_duplicate - deduplicate records - sql sys_dq_status < 2 - filter out records with critical errors

CREATE VIEW <table name>_intf
SELECT <rename attributes and make type conversion to match silver requirements>
FROM <table_name>                           
WHERE NOT sys_is_duplicate AND sys_dq_status < 2'

Depending on historization type we will use different filters to get the records for silver update.
For SCD2 historization we will use the following filter:

SELECT *
FROM <interface_view_name>
WHERE sys_modified_at > '$ {silver_last_load_date}'

This filter allows us to select all valid record versions that have not been sent to silver yet.

FOR SCD1 for example consent status we can use filter by latest version that will be more efficient, especially for large tables.

SELECT * FROM <interface_view_name>                             
WHERE sys_is_latest_version'

We will mark by underscore userid of the record that is sent to silver.

We do not receive from source system modification timestamp for user_profile, so we calculate sys_modified_at based on sys_received_at We receive first record at bronze. We do not have any information about previous changes of the record so we set sys_modified_at = created_at. We set sys_modified_at for the first version of the record based on created_at to make sure that SCD 2 historization on silver will take creation date as a start of the validity period. |userid |sys_modified_at|appVersion|sys_hash_key|created_at|sys_received_at|sys_job_id|sys_is_latest_version|sys_error_status|sys_is_duplicate| |-------- |-----------------|----------|------------|----------|---------------|----------|---------------------|----------------|----------------| |1|01.01.24 |1 | AA |01.01.24 | 02.01.24 |1 | TRUE |0 | FALSE |

Next day we have received a duplicate. We insert a duplicate record and mark previous version as a duplicate. Because nothing has changed on the record we do not change sys_modified_at date |userid|sys_modified_at|appVersion|sys_hash_key|created_at|sys_received_at|sys_job_id|sys_is_latest_version|sys_error_status|sys_is_duplicate| |--------|-----------------|----------|------------|----------|---------------|----------|---------------------|----------------|----------------| |1 |01.01.24 |1 | AA |01.01.24 | 02.01.24 |1 | FALSE |0 | TRUE | |1 |01.01.24 |1 | AA |01.01.24 | 03.01.24 |2 | TRUE |0 | FALSE |

On 04.02.24 we have received new record but as duplicate. Because it is a new record version we set sys_modified_at = sys_received_at for new records We mark one of the duplicated records as duplicate. |userid |sys_modified_at|appVersion|sys_hash_key|created_at|sys_received_at|sys_job_id|sys_is_latest_version|sys_error_status|sys_is_duplicate| |-------- |-----------------|----------|------------|----------|---------------|----------|---------------------|----------------|----------------| |1 |01.01.24 |1 | AA |01.01.24 | 02.01.24 |1 | FALSE |0 | TRUE | |1 |01.01.24 |1 | AA |01.01.24 | 03.01.24 |2 | FALSE |0 | FALSE | |1 |04.01.24 |2 | A1 |01.01.24 | 04.01.24 |3 | FALSE |0 | TRUE | |1|04.01.24 |2 | A1 |01.01.24 | 04.01.24 |3 | TRUE |0 | FALSE |

On 05.02.2025 in job 4 we have received two records with the same PK but different content, we can not say which of the two records contain correct value so both records marked as critical error and will not be delivered to silver. Because content of those records is different from latest version we update sys_modified_at. Because there are no new records there is no processing for silver.

userid	sys_modified_at	appVersion	sys_hash_key	created_at	sys_received_at	sys_job_id	sys_is_latest_version	sys_error_status	sys_is_duplicate
1	01.01.24	1	AA	01.01.24	02.01.24	1	FALSE	0	TRUE
1	01.01.24	1	AA	01.01.24	03.01.24	2	FALSE	0	FALSE
1	04.01.24	2	A1	01.01.24	04.01.24	3	FALSE	0	TRUE
1	04.01.24	2	A1	01.01.24	04.01.24	3	TRUE	0	FALSE
1	05.01.24	3	A2	01.01.24	05.01.24	4	FALSE	2	FALSE
1	05.01.24	4	A3	01.01.24	05.01.24	4	FALSE	2	FALSE

On 06.01.24 We have received a valid record that is a duplicate of error record from previous load. Because it is a valid version of the record we set the sys_modified_at = sys_received_at and update the record.

userid	sys_modified_at	appVersion	sys_hash_key	created_at	sys_received_at	sys_job_id	sys_is_latest_version	sys_error_status	sys_is_duplicate
1	01.01.24	1	AA	01.01.24	02.01.24	1	FALSE	0	TRUE
1	01.01.24	1	AA	01.01.24	03.01.24	2	FALSE	0	FALSE
1	04.01.24	2	A1	01.01.24	04.01.24	3	FALSE	0	TRUE
1	04.01.24	2	A1	01.01.24	04.01.24	3	FALSE	0	FALSE
1	05.01.24	3	A2	01.01.24	05.01.24	4	FALSE	2	TRUE
1	05.01.24	4	A3	01.01.24	05.01.24	4	FALSE	2	FALSE
1	05.01.24	3	A2	01.01.24	06.01.24	4	TRUE	0	FALSE