Data Engineering Guardrails and Best Practices - Data Landing Zones#


Version Control#

Version	Date	Owner	Change Description
0.1	18 March 2025	Gareth Stretch	Populated benefits and guardrails for Landing zone

Definition#

In the context of the Databricks data platform, a landing zone refers to a designated storage area where raw data is initially ingested and stored before any processing or transformation occurs. This concept is crucial for managing data pipelines and ensuring data quality and integrity.

A data landing zone is an intermediate storage area used during the extract, transform, and load (ETL) process. It serves as a staging area where data from various sources is collected and temporarily stored before being processed and moved to its final destination, such as a data warehouse or data lake.

Data Landing Zones and Bronze Layer#

Definition of Landing Zone: The landing zone contains files ingested from external data sources. Data is stored in the file format of the source system and is usually organized by source type, source system, and landing date.

Definition of Bronze Layer: The bronze layer contains the data with the same structure and content as the source system, stored in Delta tables. The bronze layer usually contains additional common metadata attributes that facilitate data handling, observability, and data quality management. [View Recommendation Here] (link to rustems page.)

Having both a landing zone and a Bronze layer in a Databricks platform offers several benefits, enhancing data management, data quality, processing efficiency, and security.
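The idea of adding common metadata attributes to bronze records can be sketched in plain Python. This is only an illustration under stated assumptions: the column names `_source_system`, `_source_file`, and `_ingested_at` and the helper `add_bronze_metadata` are hypothetical, and on Databricks the same step would normally be done on a Spark DataFrame rather than on Python dictionaries.

```python
from datetime import datetime, timezone


def add_bronze_metadata(records, source_system, source_file, ingested_at=None):
    """Return copies of the source records with metadata attributes added.

    The metadata column names are hypothetical; the source columns are
    kept exactly as they arrived.
    """
    ingested_at = ingested_at or datetime.now(timezone.utc)
    return [
        {
            **record,  # source structure and content stay unchanged
            "_source_system": source_system,
            "_source_file": source_file,
            "_ingested_at": ingested_at.isoformat(),
        }
        for record in records
    ]


rows = [{"customer_id": "1", "name": "Ada"}]
bronze_rows = add_bronze_metadata(
    rows,
    source_system="crm",
    source_file="customer_data_20250414_23.json",
    ingested_at=datetime(2025, 4, 14, 23, 0, tzinfo=timezone.utc),
)
print(bronze_rows[0])
```

The input rows are not modified, which mirrors the intent that bronze keeps the source content intact and only adds attributes alongside it.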

Benefits of bronze layer and landing zone#

Bronze Layer Only:

Data Lineage: The Bronze layer maintains a historical archive of raw data, preserving its original structure and metadata. This is crucial for data lineage and auditing.
Reprocessing: It allows for easy reprocessing of data without needing to re-ingest from the source systems, saving time and resources.

Combined Landing Zone and Bronze Layer:

Validation and Cleansing: The landing zone performs initial validation and cleansing, while the Bronze layer ensures data integrity and historical tracking.
Improved Reliability: Together, they enhance the reliability of the data pipeline, ensuring that high-quality data is available for downstream processing.
Efficient Workflow: The separation of raw data ingestion (landing zone) and initial storage (Bronze layer) optimizes the data processing workflow, making it more efficient and manageable.
Scalability and Performance: Both layers can scale independently, ensuring that the data pipeline can handle increasing volumes and complexity without performance degradation.

Guardrails#

Topic	Description	Justification
Bronze Layer - Append-Only Layer	Do not update or delete data in the bronze layer.	Retains full history and can satisfy all consumer requirements (different consumers may want different data versions).
Raw Data Storage	The landing zone stores raw, unprocessed data directly from source systems. This data is often in its original format and may include various types of data such as logs, transaction records, or sensor data.	Storing data in its original format ensures that the data remains unaltered from its source, preserving its integrity and authenticity. Original raw data provides a reliable audit trail, making it easier to trace back to the source and verify the data's accuracy. Raw data can be processed in various ways to meet different analytical needs. By keeping the original format, you retain the flexibility to apply different transformations and analyses as required. If new processing techniques or requirements arise, having the raw data allows you to reprocess it without needing to re-ingest from the source. Storing raw data allows for initial validation and quality checks before any transformations are applied. This helps in identifying and correcting errors early in the data pipeline.
Data Quality Checks	The landing zone is often used to perform initial data quality checks, such as validating data formats, checking for missing values, and identifying duplicates.	Validating raw data helps ensure that the data is accurate and free from errors. This is crucial for making reliable business decisions. Quality checks ensure that the data is in the correct format and structure for subsequent transformations, reducing the risk of processing failures.
Access Controls	Implement strict access controls to ensure that only authorized personnel can access the landing zone. Use role-based access control (RBAC) and encryption to protect sensitive data.	Prevents unauthorized access.
Domain Aligned	Separate the landing zone by business domain or source system domain (e.g., HR, sales, products).	Alignment with Business Processes: This approach aligns data storage with business functions, making it easier for business users to find and understand the data relevant to their domain. Improved Collaboration: Teams working within the same business domain can collaborate more effectively, as all relevant data is centralized.
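The initial data quality checks named in the guardrails (missing values and duplicates) might look like the following minimal sketch. It uses only the Python standard library on an in-memory CSV sample; the function name `check_rows`, the column names, and the sample data are assumptions made for illustration, not part of any platform API.

```python
import csv
import io


def check_rows(rows, required_columns, key_columns):
    """Return a list of human-readable issues found in the rows."""
    issues = []
    seen_keys = set()
    for row_no, row in enumerate(rows, start=1):
        # Missing-value check: a required column is absent or blank.
        for column in required_columns:
            value = row.get(column)
            if value is None or value.strip() == "":
                issues.append(f"row {row_no}: missing value for '{column}'")
        # Duplicate check: the same key combination was seen earlier.
        key = tuple(row.get(column) for column in key_columns)
        if key in seen_keys:
            issues.append(f"row {row_no}: duplicate key {key}")
        seen_keys.add(key)
    return issues


# Hypothetical sample file content: row 2 has no email, row 3 repeats id 1.
sample = "customer_id,email\n1,a@example.com\n2,\n1,c@example.com\n"
rows = list(csv.DictReader(io.StringIO(sample)))
issues = check_rows(
    rows,
    required_columns=["customer_id", "email"],
    key_columns=["customer_id"],
)
for issue in issues:
    print(issue)
```

The sketch only reports issues and leaves the file untouched, which is consistent with keeping raw data unaltered in the landing zone.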

Best Practice#

Topic	Description	Recommendation
Folder Structure	Source-Based Folders: Create separate folders for different data sources (e.g., crm, erp, web_logs)	This separation helps in managing data ingestion processes and ensures that data from different sources and formats is organized logically. NB: Avoid creating too many nested subdirectories. Keep the structure as flat as possible while maintaining organization.
Folder Structure	Data Type Folders: Further organize by data type within each source (e.g., json, csv, parquet)
Folder Structure	Table Name Folders: Further organize by table within each source	More granular access control at the table level.
Folder Structure	Date Hierarchy: Organize data by date, using a hierarchy such as year/month/day (e.g., 2025/04/14)	This approach optimizes query performance and data management, especially for time-series data, by enabling efficient data retrieval and storage.
Retention Policy	Implementing a retention policy for a Databricks landing zone involves several steps to ensure that data is managed efficiently and complies with organizational policies.	Define Retention Requirements: Determine the retention period for different types of data based on regulatory requirements, business needs, and data usage patterns.
Transient	Data in the landing zone is typically transient, meaning it is stored temporarily until it is processed and moved to a more permanent storage location.	Implement a data retention policy and archiving process.
Preserve Original Data	Store raw data as-is, including all metadata such as ingestion timestamps. This ensures that the original data is available for auditing and reprocessing	Original raw data provides a reliable audit trail, making it easier to trace back to the source and verify the data's accuracy. Raw data can be processed in various ways to meet different analytical needs. By keeping the original format, you retain the flexibility to apply different transformations and analyses as required.
Separate Storage	Use a distinct storage location for the landing zone to isolate raw data from processed data. This helps in managing access and security more effectively.	By separating the landing zone, you can apply specific security policies and access controls to raw data, ensuring that sensitive information is protected from unauthorized access. Additionally, different data types may have different compliance requirements. Isolating the landing zone allows you to enforce compliance measures specific to the raw data before it is processed.
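As a rough illustration of the retention recommendation, the sketch below selects files that are older than a retention period so that a separate process could archive or delete them. It runs against a local temporary folder, and the helper `find_expired_files`, the 30-day period, and the use of the file system modification time as the age of a file are all assumptions; a real landing zone on cloud storage would use the storage service's own listing and lifecycle features.

```python
import os
import tempfile
from datetime import datetime, timedelta, timezone
from pathlib import Path


def find_expired_files(folder, retention_days, now=None):
    """Return files under `folder` last modified before the retention cutoff."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    expired = []
    for path in Path(folder).rglob("*"):
        if path.is_file():
            modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
            if modified < cutoff:
                expired.append(path)
    return sorted(expired)


# Demo on a temporary folder: one file is back-dated by 40 days.
with tempfile.TemporaryDirectory() as tmp:
    old_file = Path(tmp) / "old.json"
    new_file = Path(tmp) / "new.json"
    old_file.write_text("{}")
    new_file.write_text("{}")
    forty_days_ago = (datetime.now(timezone.utc) - timedelta(days=40)).timestamp()
    os.utime(old_file, (forty_days_ago, forty_days_ago))
    expired_names = [p.name for p in find_expired_files(tmp, retention_days=30)]

print(expired_names)
```

Different retention periods per data type could be handled by calling the function once per folder with its own `retention_days` value.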

Examples : Bronze / Landing Zone structure#

The structure below separates the data by system, then by table / document name within the system, and finally by data type. The final folder holds all versions of the file, named according to the date and time at which the source system extracted the data, with a sequence number that keeps each file name unique.

NB: You must be able to distinguish between the file generation date and the data modification date.

Autoloader: Cannot monitor the folder structure for completion.
Custom Python: The pickup needs to be handled manually.

*Use a status file as a mechanism to guarantee the transfer of large files.
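One possible shape for the status file mechanism, assuming a custom Python pickup: the sender writes an empty marker file next to each data file once the transfer has finished, and the pickup job only considers data files that have a marker. The `.done` suffix and the helper `files_ready_for_pickup` are hypothetical conventions chosen for this sketch.

```python
import tempfile
from pathlib import Path

STATUS_SUFFIX = ".done"  # hypothetical marker convention, e.g. a.json.done


def files_ready_for_pickup(folder):
    """Return data files in `folder` whose status marker file exists."""
    ready = []
    for marker in Path(folder).glob(f"*{STATUS_SUFFIX}"):
        # Strip the marker suffix to get the data file name.
        data_file = marker.with_name(marker.name[: -len(STATUS_SUFFIX)])
        if data_file.is_file():
            ready.append(data_file)
    return sorted(ready)


# Demo: a.json is complete (marker present), b.json is still transferring.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.json").write_text("{}")
    (Path(tmp) / "a.json.done").write_text("")
    (Path(tmp) / "b.json").write_text("{")
    ready_names = [p.name for p in files_ready_for_pickup(tmp)]

print(ready_names)
```

Because the marker is written last, a partially transferred large file is never picked up, at the cost of requiring the sending side to follow the same convention.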

/base_folder/
    /landing/ --> Landing area for the ingestion job. When the job is completed, the files are moved to the processed area.
        /crm/
            /customer{table_name}/
                /json/
                    /interface_type{full_load,delta_load}/
                        customer_data_20250414_23.json
    /processed/ --> The data pipelines start from here and process to bronze. By processing from the processed folder, you avoid file contention when file loads into the landing folder are not yet complete.
        /crm/
            /customer{table_name}/
                /json/
                    /interface_type{full_load,delta_load}/
                        customer_data_20250414_23.json
    /error/ --> Used when there is an error loading from source to landing.
        /crm/
            /customer{table_name}/
                /json/
                    /interface_type{full_load,delta_load}/
                        customer_data_20250414_23.json
    /archive/ --> Used to store data according to the defined data retention policies. Usually a standalone process is run to copy from processed to archive.
        /crm/
            /customer{table_name}/
                /json/
                    /interface_type{full_load,delta_load}/
                        customer_data_20250414_23.json
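The layout above could be driven by two small helpers: one that builds the relative path of a file, and one that moves a completed file from the landing area to the processed area. This is a local file system sketch under stated assumptions: the helper names are hypothetical, the trailing number in the file name is treated as the sequence number, and cloud storage would need the equivalent move operation of the storage service.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def build_relative_path(system, table, data_type, interface_type, extracted_at, sequence):
    """Build system/table/data_type/interface_type/<file name>."""
    file_name = f"{table}_data_{extracted_at:%Y%m%d}_{sequence}.{data_type}"
    return Path(system) / table / data_type / interface_type / file_name


def promote_to_processed(base_folder, relative_path):
    """Move one file from landing to the same relative path under processed."""
    source = Path(base_folder) / "landing" / relative_path
    target = Path(base_folder) / "processed" / relative_path
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(source), str(target))
    return target


relative = build_relative_path(
    "crm", "customer", "json", "full_load", datetime(2025, 4, 14), 23
)
print(relative.as_posix())

# Demo on a temporary base folder.
with tempfile.TemporaryDirectory() as base:
    landing_file = Path(base) / "landing" / relative
    landing_file.parent.mkdir(parents=True)
    landing_file.write_text("{}")
    target = promote_to_processed(base, relative)
    moved_ok = target.is_file() and not landing_file.exists()

print(moved_ok)
```

Keeping the relative path identical across landing, processed, error, and archive means a file can be traced through every area by name alone.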