Ingestion inputs outputs

Go to Playbook Main Page
Go to landing Zone to Bronze Zone Main Page
Go to Modular Components
Next: Go to File Ingestion Using Autoloader

Inputs and Outputs in File Ingestion Pipelines#

In the realm of data engineering, clarity on inputs and outputs forms the cornerstone of an effective file ingestion pipeline. As the first step in the data lifecycle, ingestion often handles a wide variety of file formats, data types, and infrastructure nuances. Without a clear understanding of the exact inputs and outputs required for these pipelines, the process can quickly become error-prone, inefficient, and difficult to maintain.

Knowing precisely what data needs to be ingested (inputs) and what format or structure it must ultimately take (outputs) forms the foundation for designing robust ingestion pipelines. Inputs define the raw data—its source, structure, quality, and relevant metadata—while outputs determine how the ingested data is stored, processed, and made available for downstream systems or analyses. Misalignment in this area can lead to significant issues, such as schema mismatches, incomplete transformations, or unmet business requirements.

By establishing well-defined inputs and outputs, data engineers can:

Ensure Consistency: Define clear data contracts to describe the structure, quality, and expectations for both inputs and outputs, reducing the risk of discrepancies.
Drive Automation: Streamline the ingestion process with automated validation, transformation, and data delivery based on agreed-upon configurations.
Improve Scalability: Create modular pipelines that can easily adapt to changes in data sources or business needs while maintaining downstream reliability.
Enable Collaboration: Provide clarity for stakeholders like data scientists, analysts, and operational teams about the available data and its usability.

Taking the necessary time to map out inputs and outputs transforms ambiguity into action, reduces technical debt, and provides a foundation for reliable, scalable, and future-proof pipelines. The sections below delve deeper into defining inputs and outputs, connecting them to configuration, and handling challenges with diverse file ingestion use cases.

The pipeline execution starts by processing inputs, which drive how the ingestion workflow behaves, and ends by producing outputs, which are consumed downstream for further processing, validation, or reporting. The Config acts as the main source for inputs and dynamically influences outputs.

Inputs#

The inputs are key parameters passed to the pipeline. They control the workflow’s behavior, enabling dynamic adaptability across environments such as dev, test, or prod.

How Configuration Acts as an Input#

Configuration files encapsulate all input definitions, acting as a centralized source to manage input parameters. By abstracting critical pipeline variables, the configuration ensures consistency and flexibility.

Key ways the configuration serves as an input: 1. Dynamic Path Resolution: - Input paths like incoming_path and checkpoint incorporate environment details, allowing seamless switching between dev, test, or prod environments without code changes. - Example:

{
  "landing_volume_path": "abfss://landing@storage.dfs.core.windows.net",
  "incoming_path": "incoming_files",
  "failed_path": "failed_files"
}

The configuration dynamically builds the paths. For example:

incoming_path = f"{config['landing_volume_path']}/{config['incoming_path']}"

Environment Context: - Parameters like env adjust pipeline behavior based on the target environment (e.g., development vs. production). - Example: In production (prod), stricter quality checks may be enforced, while in development (dev), some validations may be skipped.
Schema and Table Specifications: - Configurations define what tables need to be created dynamically and their respective schemas. - Example: The catalog and schema_base organize tables logically in Databricks’ Unity Catalog.

List of Inputs#

Here are the primary inputs used in the pipeline:
1. Pipeline Mode: Batch or Streaming operation.
2. Environment Settings: Information about the runtime environment (e.g., dev, test, or prod).
3. Data Sources and Paths: Incoming file paths, landing zones, and checkpoint directories. 4. Configuration Settings: Flags for validations, catalogs, schemas, and dynamic table generation.

Input Details#

Input Name	Description
mode	Specifies the pipeline Ingestion technique type:Autoloader,Pyspark,DLT streaming.
env	Determines the runtime environment, such as dev, test, or prod.
catalog	Specifies the Databricks catalog in which the tables will be created.
schema_base	The name of the schema within the catalog to organize tables.
json_types	List of raw tables (in JSON schema) to be created dynamically for ingestion.
landing_volume_path	Path to the raw data landing zone where flat files are ingested.
data_provider_path	Subdirectory under the landing zone for handling provider-specific data.
incoming_path	Defines the folder that holds incoming files awaiting processing.
failed_path	Specifies the directory where files failing validation are stored.
checkpoint	Path for maintaining streaming checkpoints to ensure fault tolerance.
quality_checks	Flags for quality validations like schema validation and PK constraints.

Outputs#

The outputs are the final results produced by the pipeline. These outputs are critical for downstream workflows such as analytics, reporting, and auditing.

List of Outputs#

The pipeline generates the following outputs:
1. Delta Tables: Stores processed data for downstream processing.
2. Validation Logs: Tracks schema compliance and data quality checks.
3. Successfully Processed Files: A list of ingested files.
4. Checkpoints: Ensures fault tolerance in streaming pipelines.
5. Bad Records Directory: Holds files that failed validation for auditing.

Output Details#

Output Name	Description
Delta Table	A Delta table populated with validated and enriched records.
Validation Logs	Captures validation results, including schema compliance and data quality statuses.
Successfully Loaded Data	Stores paths of validated and processed files post-ingestion.
Checkpoints Path	A directory for maintaining checkpoints to track ingestion progress, especially in streaming pipelines.
Bad Records Directory	Location for files that didn’t pass ingestion validations, for further inspection or audit.

How Configuration Influences Outputs#

Like inputs, outputs are also defined and influenced dynamically through configuration files. Examples include: - Delta table definitions (catalog, schema_base) map directly into a structured data lake. - Validation logs and bad records location are assigned paths using configuration (failed_path, validation_logs_path). - Streaming checkpoints ensure incremental reads are fault-tolerant.

Example Configuration for Outputs: ℹ️#

  {
    "databricks": {
      "catalog": "prod_catalog",
      "schema_base": "bronze_prod",
      "delta_tables": [
        {
          "name": "transactions",
          "partition_by": "date",
          "storage_path": "transactions/"
        }
      ],
    "failed_path": "failed_files/",
    "checkpoints_path": "checkpoints/"
    }
  }

Click here to access the repository: DHO-PoC1/src/conf/databricks_config.py

For Detailed Explanation of Using Config as input check the config module in Modular components