Skip to content

Insulin

Delta Live Table: File Ingestion and Processing for Insulin Data#

This section explains the end-to-end process of file ingestion using Delta Live Tables (DLT) for insulin data. The pipeline processes raw data from the Landing Zone, validates and enriches it with metadata fields for lineage and auditing, harmonizes attributes, and integrates the data into a Silver layer while maintaining SCD (Slowly Changing Dimensions) Type 1 integrity.

Steps#


Step 1: Load Raw Data from Landing Zone#

Purpose:
Ingest raw JSON data from the Landing Zone into a Bronze Delta table using the DLT streaming framework. Enrich the data with operational metadata for traceability and management.

Significance:
- Facilitates fault-tolerant ingestion with incremental updates.
- Captures operational metadata (sys_job_id, sys_received_at, etc.) for lineage tracking.
- Prepares data for downstream validation and harmonization processes.

Added Metadata Fields: - sys_surrogate_pk: A unique surrogate key generated using uuid() for each record.
- sys_job_id: The pipeline execution ID for tracking.
- sys_source_file_name: The name of the source file the data originated from.
- sys_received_at: Timestamp when the data was ingested into the pipeline.
- sys_modified_at: Captures the modification timestamp for each record using a custom UDF.

Explanation: - Input Data:
- JSON files from the landing zone path. - Operations:
- Incrementally reads raw data using cloud_files(autoloader) - Cross-joins pipeline metadata (sys_pipeline_last_run) for operational context.This is a materialised view that contains the databricks environment information which can be used for further processing.

Hold "Alt" / "Option" to enable Pan & Zoom
alt text

  • The bad records gets loaded to a separate directory of the bad records file path which can be sent to the producers for rectification.
  • Outputs:
  • Raw structured data enriched with system metadata columns in the insulin_raw Delta table.

Reference Code ℹ️#

-- STEP 1 LOAD RAW DATA FROM LANDING ZONE
CREATE OR REFRESH STREAMING TABLE insulin_raw
COMMENT 'Raw data from kakao landing zone for insulin'
TBLPROPERTIES('quality' = 'bronze',
              'source'  = 'kakao')
PARTITIONED BY (sys_source_file_name)
AS
SELECT
   uuid()                            as sys_surrogate_pk,
   data.*,
   pipeline.origin.update_id         as sys_job_id,           -- pipeline run id.
   _metadata.file_name               as sys_source_file_name,
   pipeline.timestamp                as sys_received_at,
   digital_health_catalog_bronze_dev.kakao_dev_dlt.get_modified_at(_metadata.file_name, sys_source_file_name, 'file_name') as sys_modified_at
FROM cloud_files(
                 "${path}insulin//","json",
                 map("schema", "struct<id:int, userId:int, smartPenId:int, duid:string,name:string, insulinType: string, dose:float, injectedAt:string>",
                "badFilesPath","${bad_files_path}")
) as data
CROSS JOIN sys_pipeline_last_run pipeline;

Step 2: Validate Data and Harmonize Attributes#

Data Validation#

Purpose:
Applies data quality constraints to ensure data integrity and relevance. Invalid rows (e.g., missing critical fields) are dropped.

Validation Constraints: 1. mandatory_createdAt: Ensures the created_at column is not null.
2. mandatory_userId: Ensures the user_id column is not null. Rows violating this constraint are dropped.
3. can_not_harmonize_insulin_type: Ensures the insulin_type column is not null. Rows violating this constraint are dropped.
4. harmonized_insulin_type: Ensures the harmonized value (insulin_type) matches the raw attribute value (original_insulin_type).

Attribute Harmonization#

Purpose:
Maps raw attribute values (insulinType) to standardized, harmonized values (insulin_type) using a reference table (insulin_type_harmonization).

Significance:
- Ensures consistent attribute representation across datasets. - Drops or flags rows for which harmonization cannot be performed.

Data Validation and Harmonization Workflow: 1. Reads data incrementally from the Bronze table (insulin_raw). 2. Performs constraints validation and data harmonization with a reference dataset.
3. Outputs validated, harmonized data to the insulin Delta table while excluding invalid records.

Reference Code ℹ️#

-- STEP 2 VALIDATE DATA QUALITY & HARMONIZE
CREATE OR REFRESH STREAMING TABLE insulin
(
  CONSTRAINT mandatory_createdAt EXPECT (created_at IS NOT NULL),
  CONSTRAINT mandatory_userId EXPECT (user_id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT can_not_harmonize_insulin_type EXPECT (insulin_type IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT harmonized_insulin_type EXPECT (insulin_type IS NOT NULL and insulin_type = original_insulin_type)
)
COMMENT 'Validated. harmonized data for insulin'
TBLPROPERTIES('quality' = 'bronze',
              'source'  = 'kakao')
AS
SELECT  
     sys_surrogate_pk,
     up.id, userId as user_id, smartPenId as smart_pen_id, duid as duid, name as name,  dose as dose,injectedAt as injected_at,
     dth.harmonized_value as insulin_type,up.insulinType as original_insulin_type
     ,sys_modified_at, sys_source_file_name, sys_received_at
FROM STREAM(LIVE.insulin_raw) as up
LEFT JOIN common_bronze.insulin_type_harmonization dth ON up.insulinType  = dth.value;

Here is a glimpse of the insulin_type_harmonization table looks like:

Hold "Alt" / "Option" to enable Pan & Zoom
alt text

Step 3: Integrate Validated Data to Silver Layer#

Purpose:
Integrates validated, harmonized, and deduplicated data from the Bronze layer into the Silver layer while maintaining SCD (Slowly Changing Dimension) Type 1 logic.

Significance:
- Ensures data is deduplicated and historized with proper lineage and modification tracking.
- Silver tables serve as a clean layer for downstream analytics with enhanced data quality.
- SCD Type 1 ensures only the most recent version of the data is preserved.

Key Features of SCD Type 1 Implementation: 1. Deduplication and Integration: - Ensures no duplicate rows are inserted. The "APPLY CHANGES" step in the code ensures that. - Integrates harmonized data into the Silver layer - Uses business key (id) for record uniqueness

  1. Data Evolution: - Uses sys_modified_at as the timestamp to track sequential changes - Deletes rows (APPLY AS DELETE) when sys_is_deleted = true - Maintains only the latest version of each record

  2. Stored Columns: - Excludes certain columns (e.g., original_insulin_type) - Retains all business and system columns - Preserves data lineage information

Reference Code ℹ️#

-- STEP 3 INTEGRATE TO SILVER
CREATE OR REFRESH STREAMING TABLE digital_health_catalog_silver_dev.kakao_dev_dlt.insulin
COMMENT 'Validated, harmonized, deduplicated and historised  data for insulin'
TBLPROPERTIES('quality' = 'bronze',
              'source'  = 'kakao');

APPLY CHANGES INTO digital_health_catalog_silver_dev.kakao_dev_dlt.insulin
FROM STREAM(LIVE.insulin)
KEYS(id)
--APPLY AS DELETE WHEN sys_is_deleted = true
SEQUENCE BY sys_modified_at
COLUMNS * EXCEPT (original_insulin_type)
STORED AS SCD TYPE 1;

Summary of Delta Tables in the Pipeline#

  1. insulin_raw: Raw data ingested from the landing zone, enriched with metadata fields.
  2. insulin: Validated and harmonized Bronze table after applying data quality checks and attribute harmonization.
  3. digital_health_catalog_silver_dev.kakao_dev_dlt.insulin: Silver table with validated, harmonized, deduplicated, and historized data stored with SCD Type 1 logic.

Key System Columns#

Column Name Purpose
sys_surrogate_pk Unique identifier for each record.
sys_job_id Tracks the pipeline job execution responsible for the record.
sys_source_file_name Name of the file the record was ingested from.
sys_received_at Timestamp when the record was ingested.
sys_modified_at Timestamp for record-level modifications to maintain data evolution.
insulin_type Harmonized attribute derived from the raw field (up.insulinType).

Final Flow Diagram#

+------------------------+     +------------------------+     +------------------------+     +--------------------------------+
|     Source Data        |     |      Bronze Table     |     |      Bronze Table     |     |         Silver Table           |
|    Landing Zone       ----->|      insulin_raw      ----->|       insulin         ----->|            insulin             |
|     JSON Files        |     |                      |     |                      |     |                              |
+------------------------+     +------------------------+     +------------------------+     +--------------------------------+
         |                              |                              |                                |
         |                              |                              |                                |
    Raw JSON Data               Added Metadata &                Data Validation &                  SCD Type 1
    - id                        System Columns:                 Harmonization:                    Implementation:
    - userId                    - sys_surrogate_pk             - Quality constraints             - Deduplication
    - smartPenId                - sys_job_id                   - Field standardization           - Historization
    - insulinType               - sys_source_file_name         - Reference data mapping          - Latest version only
    - dose                      - sys_received_at              - Business rules                  - Tracked by sys_modified_at
    - injectedAt                - sys_modified_at              - Error handling              - Exclude original_insulin_type

Data Flow Details:

  1. Source Data (Landing Zone) - Format: JSON files - Location: Cloud storage - Schema: Contains insulin-related data fields - Key fields: id, userId, smartPenId, insulinType, etc.

  2. Bronze Table (insulin_raw) - Quality: Bronze - Type: Streaming Delta Table - Added metadata: sys_surrogate_pk, sys_job_id, etc. - Partitioned by: sys_source_file_name

  3. Validated & Harmonized Bronze Table (insulin) - Quality: Bronze - Type: Streaming Delta Table - Validation: Data quality constraints - Harmonization: Standardized insulin_type values - Business logic: Field mapping and transformations

  4. Silver Table (digital_health_catalog_silver_dev.kakao_dev_dlt.insulin) - Quality: Silver - Type: Delta Table with SCD Type 1 - Features: Deduplicated and historized data - Key columns: All except original_insulin_type - Change tracking: Using sys_modified_at

Data Quality Progression: - Landing → Bronze: Raw data preservation with metadata - Bronze → Validated Bronze: Data quality enforcement - Validated Bronze → Silver: Business logic implementation

Return to Main Page