Index

Go to Playbook Main Page
Next: Data Ingestion Phase - Source to Landing Zone
Next: Data Ingestion Phase - Landing Zone to Bronze

Preparation Phase#

The preparation phase is a critical component of the data ingestion process that sets the foundation for successful pipeline development, testing, and deployment. During this phase, data engineers focus on creating the necessary artifacts and configurations to facilitate seamless testing and management of pipeline logic. This stage ensures that the data landing zone and subsequent layers (Bronze, Silver, and Gold) are properly provisioned to meet business and technical requirements.

In this phase, engineers define test cases, prepare sample test data, and establish outbound interfaces such as views that abstract Delta Lake tables in order to support efficient querying, validation, and downstream data consumption. The preparation phase enhances code quality and workflow automation, providing the confidence to move forward in the ingestion lifecycle. Below is an introduction to the specific topics covered during the preparation phase.

Together, these preparation phase activities ensure a structured, reliable workflow for ingesting, transforming, and presenting data across the pipeline layers, setting the groundwork for continued refinement and scaling.

Process Flow#

Hold "Alt" / "Option" to enable Pan & Zoom

1 - Define Test Cases#

Defining test cases involves determining the scenarios and criteria against which the data ingestion pipelines will be validated. These test cases are designed based on business requirements, expected data transformations, and edge cases, ensuring that the pipeline behaves as expected. By clearly specifying input conditions and expected outputs, engineers have a blueprint for evaluating the functionality, performance, and reliability of their pipeline code.

A Test-Driven Development (TDD) approach in the context of a Databricks pipeline involves designing and implementing the pipeline code by writing tests first — before writing the actual implementation for data ingestion, transformation, or processing. This ensures that each stage and component of the pipeline is both functional and reliable from the beginning, allowing potential issues to be caught early. TDD emphasizes writing small, incremental tests for individual business logic or transformations, progressively building toward a robust and validated pipeline that meets data quality, performance, and reliability requirements.

Steps for Test-Driven Development in Databricks#

Define Requirements and Test Cases:
Create the Test Logic First:
Run Tests and Confirm Failure:

Click here for a template MD file.

2 - Prepare Test Data (Landing, Bronze, Silver)#

Test data preparation is the creation of sample datasets that mimic real-world scenarios. Engineers prepare representative data for the landing zone (raw data receiving area) and Bronze layer (raw data persistence). This data serves as a controlled input for testing pipeline transformations and validations. Preparing high-quality test data enables engineers to identify bugs, corner cases, and data integrity issues early in the development cycle.

This example demonstrates creating and writing test data into a Delta table that serves as the Bronze Layer. It also highlights a standard approach for preparing representative data for testing purposes.

Here’s a technical example of how to prepare test data for a Databricks Bronze Layer ingestion pipeline. This example demonstrates creating and writing test data into a Delta table that serves as the Bronze Layer. It also highlights a standard approach for preparing representative data for testing purposes.

# Step 1: Import Required Libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

# Step 2: Initialize SparkSession (Databricks automatically initializes SparkSession)
spark = SparkSession.builder.appName("BronzeLayerTestDataPreparation").getOrCreate()

# Step 3: Define the Schema for the Bronze Layer Data
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

schema = StructType([
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("email", StringType(), True),
    StructField("event_timestamp", TimestampType(), True)
])

# Step 4: Create the Test Data
data = [
    ("1", "John Doe", "john.doe@example.com", None),   # Valid record
    ("2", "Jane Smith", "jane.smith@example.com", None),  # Valid record
    ("3", "Alice", "alice@@invalid_example_com", None),  # Invalid email format
    ("4", None, "no.name@example.com", None),           # Missing name
    ("5", "Bob Brown", None, None)                      # Missing email
]

# Apply the test data and add a current timestamp for the event
test_data_df = spark.createDataFrame(data, schema=schema) \
    .withColumn("event_timestamp", current_timestamp())

# Step 5: Configure the Bronze Layer Delta Table Path and Write the Test Data
bronze_layer_path = "/mnt/bronze_layer/test_data_bronze"

# Write the test data to the Bronze Layer Delta Table
test_data_df.write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save(bronze_layer_path)

# Step 6: Replace or Create Delta Table in the Databricks Catalog
spark.sql(f"""
CREATE TABLE IF NOT EXISTS bronze_table
USING DELTA
LOCATION '{bronze_layer_path}';
""")

# Step 7: Validate the Test Data in the Bronze Layer
spark.sql("SELECT * FROM bronze_table").show(truncate=False)

Explanation#

Importing Required Libraries: - pyspark is used to handle DataFrame creation and data manipulation. - current_timestamp dynamically adds the current timestamp for testing event_timestamp.
Defining Schema: - A schema is defined for the data to represent the structure of the Bronze Layer table. It matches the raw data ingestion format.
Creating Test Data: - A mock dataset (data) is created to represent various test cases:
- Valid records: Fully populated and correctly formatted data.
- Invalid records: Examples include missing fields, improperly formatted email addresses, or other issues.
- A DataFrame (test_data_df) is created using the test data and augmented with a dynamically generated event_timestamp.
Writing to Delta Lake: - The test data is written to a Delta Lake format in the specified storage path (bronze_layer_path). - The mode("overwrite") ensures that the existing data is replaced for clean testing.
Registering the Delta Table: - The CREATE TABLE SQL statement ensures that the Delta table (bronze_table) is registered in the Databricks metastore, making it queryable.
Validating the Test Data: - The SELECT * FROM bronze_table SQL query allows you to retrieve and visually confirm that the Bronze Layer table contains the correct test data.

Example Query Output#

After successfully running the code, querying the bronze_table could output something like the following:

id	name	email	event_timestamp
1	John Doe	john.doe@example.com	2023-10-01 12:00:00
2	Jane Smith	jane.smith@example.com	2023-10-01 12:00:00
3	Alice	alice@@invalid_example_com	2023-10-01 12:00:00
4	NULL	no.name@example.com	2023-10-01 12:00:00
5	Bob Brown	NULL	2023-10-01 12:00:00

How This Helps#

By preparing this test data: - Data engineers can validate ingestion pipelines for edge cases (e.g., null values, incorrect formats, etc.). - Downstream transformations and validations are tested in a controlled environment. - It ensures that your Bronze Layer ingestion logic can handle real-world production data quirks.

3 - Create Test Services (Landing, Bronze, Silver)#

Creating testing services for Databricks ingestion pipelines using the test data you've prepared earlier provides numerous benefits. A dedicated testing service ensures repeatability, scalability, automation, and robustness in your development and testing process. Below are detailed reasons why you should invest in building such services:

1. Automation of Testing#

Why?: Manually testing ingestion pipelines is time-consuming and prone to human error. When pipelines grow in complexity or when ingestion logic changes, rerunning all test cases manually becomes impractical.
Impact:
- Testing services allow you to automate end-to-end scenarios for validating ingestion, transformations, and outputs.
- They can be triggered programmatically (e.g., via CI/CD pipelines) to verify that changes to the codebase do not break functionality.

2. Regression Testing#

Why?: Ingestion pipelines must evolve over time—whether due to schema changes, new data sources, or performance optimizations. These updates may inadvertently introduce bugs or regressions in existing functionality (e.g., invalid handling of previously valid data).
Impact:
- Testing services ensure previously working functionality continues to perform as expected by running regression tests using your pre-defined test data.
- This minimizes the risk of breaking production pipelines due to code changes.

3. Consistency and Repeatability#

Why?: Without a dedicated service, testing can become ad hoc, inconsistent, or dependent on the system/engineer running the test.
Impact:
- Testing services improve reliability by running repeatable tests under consistent conditions with defined test data.
- This consistency ensures you can reliably compare results over time, regardless of the environment or tester.

4. Edge Case and Error Handling Validation#

Why?: Real-world data is messy and unpredictable—it can contain null values, schema mismatches, invalid formats, duplicated records, or violations of business rules. It's important to test edge cases, not just the "happy path."
Impact:
- Pre-prepared test data covering a variety of edge cases can be systematically validated through the testing service (e.g., null handling, invalid records quarantine, duplicate detection).
- Testing services can simulate scenarios where pipelines fail gracefully and ensure proper error logging, notifications, and retries.

5. Simplification of Complex Testing Scenarios#

Why?: Databricks ingestion pipelines often involve multiple stages (e.g., landing zone → Bronze → Silver → Gold) and distributed data processing. Testing each stage separately as well as in an integrated manner requires orchestrating test data, configurations, and validation logic.
Impact:
- Testing services orchestrate and simplify the execution of multi-stage tests, such as:
- Schema validation in the Bronze layer,
- Data cleansing and deduplication in the Silver layer,
- Aggregations and KPIs in the Gold layer.
- They isolate failures to specific stages for faster debugging.

6. Scalability and Load Testing#

Why?: Pipelines that perform well for small datasets may fail or become bottlenecks when dealing with production-scale data. Testing ingestion pipelines for scalability ensures performance holds under load.
Impact:
- The testing service can simulate large-scale ingestion scenarios using mock data generated programmatically from your test data schema.
- Performance metrics, such as ingestion time, processing time, and memory/CPU usage, can be captured and benchmarked.

7. Support for CI/CD and Agile Development#

Why?: In agile development, pipelines are frequently updated, and those updates need to be tested and released in shorter cycles. A manual testing process slows this down and risks introducing errors into production.
Impact:
- Testing services can be integrated into CI/CD pipelines to automatically run comprehensive tests (e.g., schema validation, data quality checks, transformations) before merging new code or deploying it to production.
- This ensures that only validated code is pushed forward, enabling faster and more confident releases.

8. Data Quality Validation#

Why?: The quality of ingested data forms the foundation for downstream processing and analytics. Issues such as duplicate records, incorrect matches (in MDM), or invalid entries can lead to poor decision-making.
Impact:
- Testing services can systematically evaluate data quality rules built into your pipeline—such as uniqueness, correctness, completeness, and adherence to business rules—using pre-defined rules and test datasets.
- Trends in data quality over time can also be monitored (e.g., reducing missing fields or validation errors).

9. Reusability Across Pipelines#

Why?: Organizations often have multiple pipelines with shared behaviors (e.g., recurring schema validations, similar transformation logic, or business logic). Duplicating test logic for every pipeline wastes effort and increases maintenance.
Impact:
- A centralized testing service allows you to create modular, reusable tests that can be applied across multiple pipelines or data sources.
- For example:
- The same schema validation logic for Bronze can be reused across pipelines ingesting customer or product data.
- Common data quality tests (e.g., invalid email format) can be applied consistently.

10. Compliance and Auditability#

Why?: Many industries, such as finance and healthcare, have regulatory requirements around data usage, quality, and handling. Demonstrating compliance requires maintaining an audit trail of test runs and validation logic used for pipelines.
Impact:
- Testing services can log and version all test results for each ingestion pipeline, providing an auditable trail of validations.
- This is crucial for compliance reviews, audits, or post-mortem analysis in cases of failure.

11. Reduced Risk in Production#

Why?: Production ingestion pipelines are tied to critical business processes, and failures in production can lead to downstream outages, incorrect analytics, or revenue-impacting incidents.
Impact:
- Testing services provide a sandbox-like environment to test and validate pipelines before deploying to production.
- They reduce risks by catching issues like incorrect transformations, schema mismatches, or performance bottlenecks early in the lifecycle.

12. Faster Debugging and Troubleshooting#

Why?: When ingestion pipelines fail, debugging can be difficult without structured testing and insights into where issues are occurring.
Impact:
- Testing services can log detailed insights for test failures, including:
- Input data causing the failure,
- Pipeline logs and error messages,
- Expected vs. actual results.
- This reduces Mean Time to Resolution (MTTR) during incidents.

Example Workflow of a Testing Service#

Prepare Test Data: - Use the pre-prepared datasets you’ve designed for testing edge cases, data quality, and pipeline validation.
Ingest Test Data: - Run the pipeline using the test data in the landing zone.
Validate Results at Each Layer: - Ensure the Bronze Layer confirms schema conformity. - Test transformations in the Silver Layer. - Validate KPIs/aggregations in the Gold Layer.
Log Results: - Automate the process of comparing expected outcomes with actual results. - Log validation passes or failures for reporting.
Visualize Results and Metrics: - Integrate results into dashboards to monitor test cases, performance trends, and data quality metrics.

Tools and Frameworks for Testing Services#

Databricks native tools:
PySpark and Scala for scripting validations.
Databricks SQL for querying and validating results.
Testing libraries:
Pytest, unittest for modular, repeatable Python-based tests.
Delta Expectations or Great Expectations for data quality testing.
CI/CD tools:
Jenkins, GitHub Actions, Azure DevOps for automating test runs.
Monitoring tools:
Databricks Lakehouse monitoring for pipeline insights.

In conclusion, a testing service enables consistent, scalable, and automated testing for Databricks pipelines, minimizes risk, and ensures high-quality data ingestion and transformations. By leveraging your pre-existing test data and automated testing logic, you can significantly improve the reliability of your data pipelines.

In conclusion, a testing service enables consistent, scalable, and automated testing for Databricks pipelines, minimizes risk, and ensures high-quality data ingestion and transformations. By leveraging your pre-existing test data and automated testing logic, you can significantly improve the reliability of your data pipelines.

4 - Create Data Catalog Objects#

The creation of Databricks data catalogue objects involves registering datasets and defining metadata for the ingested layers. This step helps document datasets and ensures discoverability, governance, and standardization. The catalogue provides a centralized way to manage and access the ingested data, enabling ease of use for downstream users and processes.

Below is an example of how you can create Data Catalogue objects in Databricks using Delta Lake. This involves registering a table in the Databricks metastore and defining metadata to make it discoverable and manageable for downstream data engineers and analysts.

-- Step 1: Create a Bronze Delta Table in Databricks Metastore
CREATE TABLE IF NOT EXISTS bronze_table (
    id STRING,
    name STRING,
    email STRING,
    event_timestamp TIMESTAMP
)
COMMENT 'This is the raw data table for storing ingested event data in the Bronze layer.'
USING DELTA
LOCATION '/mnt/bronze_layer/bronze_table';

-- Step 2: Add Additional Metadata to the Table (Optional but recommended)
-- Use COMMENTs or table constraints to document the data
ALTER TABLE bronze_table ADD CONSTRAINT valid_email CHECK (email LIKE '%@%');

-- Step 3: Verify that the Table is Registered in the Data Catalogue
SHOW TABLES IN default;  -- Replace "default" with the database being used

-- Step 4: (Optional) View the Table Metadata in the Data Catalogue
DESCRIBE EXTENDED bronze_table;

Explanation#

Table Registration: - The CREATE TABLE statement registers the table bronze_table into the Databricks Unity Catalog or Hive Metastore, depending on your environment. - The USING DELTA clause ensures that this table is backed by the Delta Lake format, providing transactional guarantees.
Metadata Documentation: - The COMMENT clause describes the purpose of the table in human-readable terms, helping other users understand the business or technical context of the dataset. - Example Comment: "This is the raw data table for storing ingested event data in the Bronze layer."
Defining Constraints: - Adding constraints like CHECK (e.g., validating email format, non-negative values) creates a layer of quality control and further documents the table's purpose.
Discoverability: - The SHOW TABLES command lists registered tables in a given database. - Tools like Unity Catalog enable cataloging tables with meaningful descriptions, ownership, and usage tracking for better governance.
Accessing Metadata: - The DESCRIBE EXTENDED command allows users to view detailed metadata such as table type, schema, storage location, and creation timestamp.
Physical Storage: - The LOCATION clause in the CREATE TABLE command explicitly defines where the Delta table's data is stored in a cloud object storage location (e.g., /mnt/bronze_layer/bronze_table).

By creating these Data Catalogue objects and attaching clear metadata, you enable organized, discoverable, and governed data assets that others can efficiently consume without needing to dive into the raw data structures.

Example Output#

Table Registration Verification

+-----------+-------------+-----------+
| database  | tableName   | isTemporary |
+-----------+-------------+-----------+
| default   | bronze_table| false      |
+-----------+-------------+-----------+

Metadata Example (DESCRIBE EXTENDED)

+-----------------------------+------------------------------+-------+
| col_name                   | data_type                    | comment|
+-----------------------------+------------------------------+-------+
| id                         | string                       |       |
| name                       | string                       |       |
| email                      | string                       |       |
| event_timestamp            | timestamp                    |       |
| # Detailed Table Information                            |       |
| Location:                  | /mnt/bronze_layer/bronze_table|       |
| Serde Library:             | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe|       |
| Table Type:                | Managed                      |       |
| Comment:                   | This is the raw data table...|       |
+-----------------------------+------------------------------+-------+

This process ensures that your data objects are well-documented, accessible, and ready for governance in Databricks or other big data ecosystems.

Sample : Create the data schema objects in the data catalog (ddl scripts)

5 - Provision Bronze Outbound Interface#

The Bronze layer outbound interface is a view created on top of raw Delta Lake tables that abstracts technical complexities and provides a clean, streamlined layer of access. This view is provisioned to expose curated versions of the ingested raw data, making it easier for engineers and analysts to perform validations and initial data exploration. The outbound interface simplifies querying, reduces redundancy, and ensures data accuracy.

Below is a simple example of how you can create a Bronze Layer outbound interface in Databricks using Delta Lake. This example assumes that the Bronze Layer is used to store raw data and that the outbound interface is implemented as a SQL view sitting on top of a Delta Lake Bronze table.

-- Step 1: Create the Bronze Delta Table (if it doesn't already exist)
CREATE TABLE IF NOT EXISTS bronze_table (
    id STRING,
    name STRING,
    email STRING,
    event_timestamp TIMESTAMP
)
USING DELTA
LOCATION '/mnt/bronze_layer/bronze_table';

-- Insert some mock test data into the Bronze Table for testing purposes
INSERT INTO bronze_table VALUES
    ('1', 'John Doe', 'john.doe@example.com', CURRENT_TIMESTAMP),
    ('2', 'Jane Smith', 'jane.smith@example.com', CURRENT_TIMESTAMP);

-- Step 2: Create the Bronze Layer Outbound Interface as a SQL View
CREATE OR REPLACE VIEW bronze_table_view AS
SELECT
    id,
    name,
    email,
    CAST(event_timestamp AS DATE) AS event_date,   -- Example transformation
    event_timestamp
FROM bronze_table;

-- Step 3: Query the View (for Testing)
SELECT * 
FROM bronze_table_view;

Explanation#

Bronze Table: - The bronze_table stores raw ingested data with minimal or no transformations. - It's backed by Delta Lake and is physically stored at the specified location (/mnt/bronze_layer/bronze_table).
Outbound Interface (View): - The bronze_table_view is a SQL view sitting on top of the raw Bronze Delta table. - The view performs minor transformations, such as converting the event_timestamp column into an easier-to-read field (event_date). - By decoupling the table from direct queries, the view offers flexibility for future changes without impacting downstream consumers.
Test Data: - Example test data is inserted into the Bronze table for testing the outbound interface functionality.
Query the View: - A simple SELECT statement retrieves data from the bronze_table_view to ensure it's functioning correctly, encapsulating the logic applied on top of the raw Bronze data.

This example illustrates how a Bronze Layer outbound interface can be created and used in Databricks with Delta Lake. You can enhance this further by introducing more advanced transformations or applying filters to expose only the data needed by downstream processes.

6 - Provision Silver and Gold Outbound Interface#

The Silver and Gold layers represent progressively refined and aggregated versions of the ingested data. In this step, engineers provision views on top of the Delta Lake tables in these layers, creating access points for processed and business-ready data. Silver views provide cleaned, enriched, and validated data, while Gold views may host aggregated, business-critical data customized for analytics or reporting needs. These outbound interfaces serve as trusted sources for downstream applications and end users.

Your silver and gold interfaces would be created as per above.

Go up

Go to Playbook Main Page
Next: Data Ingestion Phase - Source to Landing Zone
Next: Data Ingestion Phase - Landing Zone to Bronze