Landing Zone Setup

What is a landing zone

A landing zone is the initial, structured area where raw data is collected and stored before it is transformed, processed and analysed in the bronze, silver and gold layers. It acts as a staging ground for incoming data from various sources, ensuring the data is safely captured, catalogued and ready for downstream workflows.

Landing zones in Datacore are implemented as volumes, which can be Managed or External. Please refer here for more information.

A managed volume is part of the Databricks Managed Node. It is created as part of the Managed Node subscription and administered by the platform team. Use a managed volume if the data product team does not already have a landing zone that it would like to keep and has no specific requirements for data lifecycle management, security, backup and recovery.

An external volume is blob storage that sits outside the Databricks Managed Node and is managed by the product team. Both managed and external landing zones are registered as external locations in Unity Catalog and mapped to external volumes defined in the data catalog for the bronze layer.

The landing zone is used only for data ingestion into a data product. It is append-only file storage and cannot be shared directly. It is subject to a data retention policy: data is kept for a specified time and then moved to cheaper archive storage.
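The retention step described above can be pictured with a minimal sketch. It assumes the volumes are reachable as ordinary file paths and that archived files go under a dated folder per data provider, as in the structure section of this page. The function name `archive_expired`, the `retention_days` parameter and the use of file modification time as the age criterion are illustrative assumptions, not the platform's actual retention mechanism.

```python
import os
import shutil
import tempfile
import time
from datetime import date
from pathlib import Path
from typing import List, Optional


def archive_expired(incoming_root: Path, archive_root: Path, retention_days: int,
                    today: Optional[date] = None,
                    now: Optional[float] = None) -> List[Path]:
    """Move files older than retention_days into <archive>/<YYYY_MM_DD>/<provider>."""
    today = today or date.today()
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400  # age threshold in epoch seconds
    dated_root = archive_root / today.strftime("%Y_%m_%d")
    moved = []
    # sorted() materialises each listing before any file is moved
    for provider_dir in sorted(p for p in incoming_root.iterdir() if p.is_dir()):
        for f in sorted(p for p in provider_dir.iterdir() if p.is_file()):
            if f.stat().st_mtime < cutoff:
                target_dir = dated_root / provider_dir.name
                target_dir.mkdir(parents=True, exist_ok=True)
                target = target_dir / f.name
                shutil.move(str(f), str(target))
                moved.append(target)
    return moved


def demo() -> List[str]:
    """Build a throwaway layout, age one file by 40 days, and archive it."""
    with tempfile.TemporaryDirectory() as tmp:
        incoming = Path(tmp) / "incoming"
        archive = Path(tmp) / "archive"
        (incoming / "vendor_a").mkdir(parents=True)  # hypothetical provider name
        old_file = incoming / "vendor_a" / "old.csv"
        new_file = incoming / "vendor_a" / "new.csv"
        old_file.write_text("id\n1\n")
        new_file.write_text("id\n2\n")
        now = time.time()
        forty_days_ago = now - 40 * 86400
        os.utime(old_file, (forty_days_ago, forty_days_ago))
        moved = archive_expired(incoming, archive, retention_days=30,
                                today=date(2024, 3, 7), now=now)
        return [p.relative_to(archive).as_posix() for p in moved]


if __name__ == "__main__":
    print(demo())
```

In a real deployment the move to cheaper storage may be handled by a storage lifecycle policy rather than by code like this; the sketch only shows the intended effect on the folder layout.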

Landing zone structure

The landing zone structure below serves as a reference and can be tailored to specific project or organizational needs. The folder structure should take both failed and successful loads into account.

/Volumes/<catalog_name>/<schema>/<incoming_volume_name>
├──  <data_provider_1>
├──  <data_provider_2>
|    .
|    .
/Volumes/<catalog_name>/<schema>/<archive_volume_name>
└── <YYYY_MM_DD>
    ├── <data_provider_1>
    ├── <data_provider_2>
    |   .
    |   .

For the Azure Databricks deployment, the placeholders in the landing zone structure have the following meaning.

  • catalog_name - Name of the collection that includes metadata about the data sets stored in the landing zone
  • incoming_volume_name - Storage solution (such as Azure Blob Storage, Azure Data Lake Storage, or another storage solution) where the raw data is kept
  • archive_volume_name - Storage solution (such as Azure Blob Storage, Azure Data Lake Storage, or another storage solution) where the archived data is kept
  • data_provider - Directory for a single data provider; each data provider has its own directory
  • YYYY_MM_DD - Archive date
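As a rough illustration of how the placeholders combine, the following sketch builds the incoming and archive paths from their parts. The function names and the example catalog, schema, volume and provider names are hypothetical; only the path layout is taken from the structure above.

```python
from datetime import date
from pathlib import PurePosixPath


def incoming_path(catalog: str, schema: str, volume: str,
                  provider: str) -> PurePosixPath:
    """Directory where a data provider's raw files land."""
    return PurePosixPath("/Volumes") / catalog / schema / volume / provider


def archive_path(catalog: str, schema: str, volume: str,
                 archive_date: date, provider: str) -> PurePosixPath:
    """Dated archive directory for a data provider (<YYYY_MM_DD>/<provider>)."""
    day = archive_date.strftime("%Y_%m_%d")
    return PurePosixPath("/Volumes") / catalog / schema / volume / day / provider


if __name__ == "__main__":
    # Example values are placeholders, not real Datacore names.
    print(incoming_path("my_catalog", "my_schema", "incoming", "vendor_a"))
    print(archive_path("my_catalog", "my_schema", "archive",
                       date(2024, 3, 7), "vendor_a"))
```

Keeping path construction in one place makes it easier to tailor the layout later, for example by adding a further directory level per data set.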

Why use a landing zone

A landing zone serves as the initial staging area in the medallion architecture (landing → bronze → silver → gold layers). It provides:

  • A buffer layer for raw data before processing and transformation
  • Flexibility in ingestion methods - supporting batch, streaming, API, and external data sources
  • Security - especially important when ingesting data from outside the network
  • Data staging - allows for validation and quality checks before moving data to bronze layer
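To make the data staging point concrete, here is a minimal sketch of a check that could run on a landed file before it is promoted to the bronze layer. It assumes CSV input and a known list of required columns; both the function `validate_csv` and the specific checks are illustrative assumptions, since the actual validation rules depend on each data product.

```python
import csv
import io
from typing import List, Sequence


def validate_csv(text: str, required_columns: Sequence[str]) -> List[str]:
    """Return a list of problems; an empty list means the file passed the checks."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    problems = []
    missing = [c for c in required_columns if c not in header]
    if missing:
        problems.append(f"missing columns: {', '.join(missing)}")
    # The header is row 1, so data rows are numbered from 2.
    for row_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append(
                f"row {row_no}: expected {len(header)} fields, got {len(row)}")
    return problems


if __name__ == "__main__":
    print(validate_csv("id,name\n1,a\n2\n", ["id", "name", "amount"]))
```

A file that returns any problems could stay in the landing zone for inspection instead of being loaded into bronze.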

When to use a landing zone

Kindly refer here for the data ingestion design patterns. The patterns that use a landing zone are listed below by pattern number.

  • Batch Data Loads (Pattern 1) - When ingesting large volumes of data in batches from databases or file systems
  • API Integrations (Pattern 4) - When pulling data using external APIs that need a staging area before processing
  • External/Third-Party Data (Pattern 5) - When receiving data from outside the organization's network, where security and validation are critical. This applies in particular to vendor data transfers using Partner Data Transfer (PDT) capabilities
  • Some Streaming Scenarios (Pattern 3) - When using Kafka for streaming, data can go either to the landing zone or directly to bronze, depending on requirements

When NOT to use a landing zone

  • Using Lakeflow Connect (Pattern 2) - This allows streaming data directly to the bronze layer, skipping landing entirely
  • Certain Kafka streaming setups (Pattern 3) - Where real-time processing requirements may justify direct bronze ingestion
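The guidance in the two lists above can be summarised as a small decision helper. This is a sketch only: the `Pattern` enum, the `first_layer` function and the single `needs_real_time` flag are simplifying assumptions, and the real choice for Kafka streaming depends on the requirements of each data product.

```python
from enum import Enum


class Pattern(Enum):
    """Ingestion patterns, numbered as in the lists above."""
    BATCH = 1
    LAKEFLOW_CONNECT = 2
    KAFKA_STREAMING = 3
    API = 4
    EXTERNAL = 5


def first_layer(pattern: Pattern, needs_real_time: bool = False) -> str:
    """Return 'landing' or 'bronze' as the first layer the data reaches."""
    if pattern is Pattern.LAKEFLOW_CONNECT:
        return "bronze"  # skips the landing zone entirely
    if pattern is Pattern.KAFKA_STREAMING:
        # Either target is possible; real-time needs favour direct bronze.
        return "bronze" if needs_real_time else "landing"
    return "landing"  # batch, API and external data are staged first


if __name__ == "__main__":
    for p in Pattern:
        print(p.name, "->", first_layer(p))
```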

The key takeaway is that while landing zones are the "most common pattern" for data ingestion, the architecture allows flexibility to bypass them when real-time processing or specific tooling (like Lakeflow Connect) makes direct bronze ingestion more efficient.