Delta Live Tables (DLT) Overview#
Introduction#
Delta Live Tables (DLT) is Databricks' declarative framework for building reliable data pipelines. It simplifies ETL development by allowing developers to focus on data transformations while automatically handling complex pipeline operations, data quality, and error management.
What are Delta Live Tables?#
Core Concepts#
-
Declarative Development - SQL and Python support - Automatic dependency management - Built-in data quality controls - Self-healing pipelines
-
Key Components - Streaming and batch processing - Automated testing - Data quality management - Pipeline monitoring
-
Architecture Benefits - Reduced development time - Improved reliability - Simplified maintenance - Enhanced data quality
Detailed Implementation Layers#
- Bronze (Raw Layer)
Raw Data Ingestion - Supports multiple file formats (JSON, CSV, Parquet) - Streaming and batch ingestion capabilities - Auto-loading of new data
CREATE OR REFRESH STREAMING LIVE TABLE raw_data
COMMENT 'Raw data from source systems'
AS SELECT *
FROM cloud_files('/data/raw', 'json')
Schema Inference - Automatic schema detection - Schema evolution handling - Data type mapping
CREATE OR REFRESH STREAMING LIVE TABLE raw_data_with_schema
AS SELECT *
FROM cloud_files('/data/raw', 'json',
map('schema', 'id LONG, name STRING, timestamp TIMESTAMP')
)
Metadata Capture - Automatic tracking of file metadata - System column generation - Source tracking
CREATE OR REFRESH STREAMING LIVE TABLE raw_data_with_metadata
AS SELECT
*,
input_file_name() as source_file,
current_timestamp() as ingestion_time,
_metadata.file_path as file_path
FROM cloud_files('/data/raw', 'json')
Error Handling - Bad record path configuration - Error logging and tracking - Rescue data handling
CREATE OR REFRESH STREAMING LIVE TABLE raw_data_with_error_handling
AS SELECT *
FROM cloud_files(
'/data/raw',
'json',
map('cloudFiles.rescuedDataColumn', '_rescued_data')
)
Data Validation - Constraint definitions - Quality expectations
CREATE OR REFRESH STREAMING LIVE TABLE validated_data (
CONSTRAINT valid_id EXPECT (id IS NOT NULL) ON VIOLATION DROP ROW,
CONSTRAINT valid_date EXPECT (date > '2020-01-01'),
CONSTRAINT valid_amount EXPECT (amount > 0) ON VIOLATION FAIL UPDATE
)
AS SELECT *
FROM STREAM(LIVE.raw_data)
Business Rules - Domain-specific validations - Complex rule implementation - Custom validation logic
CREATE OR REFRESH STREAMING LIVE TABLE business_validated_data
AS SELECT *
FROM STREAM(LIVE.validated_data)
WHERE
category IN ('A', 'B', 'C') AND
amount BETWEEN 0 AND 1000000 AND
status IN ('ACTIVE', 'PENDING')
CREATE OR REFRESH STREAMING LIVE TABLE cleaned_data
AS SELECT
id,
TRIM(UPPER(name)) as standardized_name,
COALESCE(amount, 0) as cleaned_amount,
CAST(date as DATE) as formatted_date
FROM STREAM(LIVE.business_validated_data)
CREATE OR REFRESH STREAMING LIVE TABLE quality_enforced_data (
CONSTRAINT row_count EXPECT (COUNT(*) > 0),
CONSTRAINT null_check EXPECT (name IS NOT NULL AND amount IS NOT NULL),
CONSTRAINT value_range EXPECT (amount BETWEEN 0 AND 1000000)
)
AS SELECT *
FROM STREAM(LIVE.cleaned_data)
The Architecture & Pipeline Details provides a comprehensive view of how the actual DLT implementation happens in a project.