Skip to content

Go to Playbook Main Page
Next: File Ingestion Using Autoloader(Python)
Back: Back to previous page

File Ingestion Pipeline: Overview#

Introduction#

The File Ingestion Pipeline architecture is designed to support multiple ingestion methods while ensuring operational consistency, scalability, and observability. This pipeline processes structured, semi-structured, and unstructured data files from the Landing Zone to the Bronze Layer in Databricks, leveraging tools such as Autoloader, Python-based pipelines, and Delta Live Tables (DLT).

The process includes metadata enrichment, validation, and management of both successfully processed files and invalid records. Below is the high-level architecture:

Hold "Alt" / "Option" to enable Pan & Zoom
Architecture Diagram


  1. High-Level Process Flow
    - Landing Zone - Data Validation and Enrichment - Persisting Data in the Bronze Layer - Processed File Management

  2. Key Components
    - Description of essential components such as Landing Zone, Autoloader, Python-Based Pipelines, and more.

  3. Metadata Added by Autoloader
    - A detailed list of critical metadata fields added during the ingestion process.


High-Level Process Flow#

1. Landing Zone#

  • Incoming raw files are stored in an Azure Data Lake (ADLS2 Blob Storage) Landing Zone, which is mapped as a Databricks Volume through Unity Catalog.
  • Supported file types include structured (e.g., CSV), semi-structured (e.g., JSON), and unstructured data.

2. Data Validation and Enrichment#

  • Data is ingested using one of the following methods:
  • Autoloader for incremental, real-time ingestion.
  • Python-based pipelines for batch processing with custom logic.
  • Delta Live Tables (DLT) for declarative workflows with built-in validations.
  • Metadata is added for traceability and governance:
  • ingestion_time: Timestamp when the record was ingested.
  • file_name: Name of the source file.
  • job_id: Unique execution identifier for the pipeline.
  • sys_source_filename: Traceability of file origin.
  • sys_received_at: Timestamp tracking data receipt into the pipeline.
  • Invalid or malformed records are routed to a bad files directory for troubleshooting and later resolution.

3. Persisting Data in the Bronze Layer#

  • The cleaned and enriched data is stored in Delta tables within the Bronze Layer, ensuring ACID compliance.
  • Partitioning is applied as needed for optimized querying and faster reads.
  • Metadata annotations enable lineage tracking and observability, helping users understand data flow and transformations.

4. Processed File Management#

  • Files successfully processed are moved to the loaded directory, ensuring that they are not reprocessed.
  • Error logs and validation results are generated for operational insights and debugging.

Key Components#

Component Description
Landing Zone Staging area in Azure Data Lake where raw JSON files reside, mapped as Databricks Volumes to Unity Catalog.
Autoloader Ingestion Stream-based ingestion tool with schema evolution capabilities, designed for high scalability and efficiency.
Python-Based Pipeline Custom batch ingestion pipeline that enables flexible, user-defined processing logic for specific use cases.
Delta Live Tables (DLT) Declarative framework for building and maintaining managed pipelines with integrated data lineage and quality checks.
Bronze Layer Delta table layer to store raw, validated data with added metadata for traceability and ACID compliance.
File Management Mechanism for moving processed files to prevent reprocessing and routing invalid records to a bad file directory.

Audit Date Added by Autoloader#

When files are ingested using Databricks Autoloader, several critical metadata fields are added automatically to enable traceability, data lineage, and governance. Below is the list of metadata fields:

Metadata Field Description
ingestion_time Timestamp indicating when the data was ingested into the pipeline.
file_name The name of the file from which the data was ingested.
job_id A unique identifier associated with the specific pipeline execution.
sys_source_filename The fully qualified name of the source file for traceability.
sys_received_at Operational timestamp indicating when the data was received in Databricks.
_metadata Autoloader-generated metadata, including file offsets, event time, and schema evolution details.