Engineering Gaps and Recommendations - Ingestion

Overview

The diagram and section below give a high-level overview of the Azure Databricks data platform as described in the [to-be](../architecture/target-architecture/to-be-current-architecture-description.md#ingestion-recommendations) architecture deliverable, with the area relevant to these gaps and recommendations highlighted.

[Screenshot: platform overview diagram]

Ingestion Gaps and Recommendations

Type Priority Area Description Recommendation / Justification Reference to standards Workshop Notes
Gap 1 Landing Zone There is documentation on how to use the landing zone. We could not find guidelines or principles stating for which use cases the landing zone can be skipped and when it should be used. Document usage and principles for the landing zone. Document detailed principles and usage.

Without clear guidelines and principles, there is too much temptation to skip the landing zone to save development time. In the long term, such a shortcut will have a negative impact on the overall data product.
Jonatan says we are not using the term "landing zone" consistently; it is not a standalone thing. Everyone who gets a catalogue gets a landing zone. We must have a clear definition of what it is and where it should be. Combination of architecture and engineering.
Gap 1 Landing Zone Standardization of landing zone organization and naming structure. There is an example of the landing zone organization; however, we could not find standards for the organization of the landing zone or for file naming conventions. Document and publish naming standards for the landing zone structure. Define the landing zone file and folder structure and naming conventions.

Landing zone standardization facilitates building reusable pipelines.
View
NN Landing Zone
Recommendation Landing Zone Consider an updated folder structure from what is currently documented. We recommend adjusting the landing zone folder structure to use a separate folder for each table and to include the timestamp in the filename.
Adding the table name to the folder structure lets you grant more granular access at the table level and makes the structure easy to use with Auto Loader and DLT.
The periodic batch process that cleans up the landing zone is simplified because the filenames contain the datetime stamp. The current structure requires more complex scripts, as we need to iterate over the folders recursively.
Recommendation Landing Zone Custom Python code is used to identify files that should be processed. Writing custom file load tracking mechanisms requires extra time and effort. Rely on Auto Loader's capability to track loaded files where possible.
Recommendation 1 Landing Zone Keep all files (pending load, loaded, failed) in the landing zone. Periodically clean the landing zone via a batch process. Use a built-in load tracker such as Auto Loader to identify which files have already been processed. Introduce runtime metadata to record the status of the data ingestion job.

Keeping all the data simplifies error handling and makes it easy to reprocess data.
Recommendation 2 Ingestion Triggering Mechanism to trigger the ingestion pipeline. In some cases data sources are monitored for change by periodic queries instead of relying on trigger events. Use an enterprise-grade third-party tool when orchestration across platforms is required. Consider events / signals to trigger pipelines. These events must be fed into the orchestration layer for processing.

Periodically querying a data source to identify when a pipeline should be executed requires extra compute resources and costs extra money.
View Standards: This topic is not relevant.
Recommendation 2 Metadata management Define standards for common metadata attributes. Introduce standards for metadata tables and attributes; standardize the names of data providers, sources, ingestion types, etc. View Standards and here
Recommendation 2 Metadata management Metadata storage Store metadata in a dedicated schema within Databricks instead of in configuration files; have a common database containing all the metadata for all sources. Currently done using YAML files. This is an effective way to manage versions of this metadata, and it is reusable across different projects. View Standards and here
Recommendation 1 Landing Zone Use 'Add only' mode on the landing zone. Some ingestions use append mode. Use add-only mode in the landing zone. This reduces the complexity of data management by avoiding the need for complex update and delete operations. View
Recommendation 1 Keep all files with both correct and incorrect data in the landing zone. Incorrect data in the landing zone and bronze layer is also data that needs to be stored. Add both correct and incorrect files to the landing zone. Mark incorrect data with common metadata attributes. This is useful View
Recommendation 1 Bronze data storage Store data as close to raw as possible, only appending additional metadata and storing in Delta format. If possible, don't de-duplicate data in bronze: your de-duplication logic may include bugs, or there may be ingestion issues, and keeping the raw data gives you the ability to replay the transform from bronze. While this may have an impact on overall size, the storage cost is small compared to the reprocessing costs. View
Recommendation 3 Full Data loads Avoid full data loads where possible. Full loads create considerable overhead.

Design for incremental loads first.
View
Recommendation 1 Soft Deletes Data history Never physically delete data. If data is deleted, mark the record in bronze as deleted.

Physically deleting the data can break referential integrity.
View
Recommendation 3 Custom Ingestion Using custom Python for ingestion logic increases complexity and is error-prone. Where possible, always use a pre-built ingestion tool such as DLT Hub. View
Recommendation 3 Separate ETL code and DDL creation Create data objects in the catalogue as part of the DDL scripts that execute during code deployment, not as part of the Python code for the pipeline. Run DDL scripts as part of deployment. View
Recommendation 2 Data Versioning Every file in the landing zone should be versioned. Each version reflects the data structure of the file. Include the version as part of the file name.
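
The landing zone recommendations above (one folder per table, a timestamp in the filename, and a version in the filename) can be illustrated with a small naming helper. This is only a sketch under stated assumptions: the layout `<source>/<table>/<table>_v<version>_<UTC timestamp>.<ext>`, the timestamp format, and all function names are hypothetical and are not taken from the current standards.

```python
from __future__ import annotations

import re
from datetime import datetime, timedelta, timezone
from pathlib import PurePosixPath

# Assumed (hypothetical) convention:
#   <source>/<table>/<table>_v<version>_<UTC timestamp>.<ext>
TIMESTAMP_FORMAT = "%Y%m%dT%H%M%SZ"
FILENAME_PATTERN = re.compile(
    r"^(?P<table>[a-z0-9_]+)_v(?P<version>\d+)_(?P<ts>\d{8}T\d{6}Z)\.(?P<ext>[a-z0-9]+)$"
)


def build_landing_path(source: str, table: str, version: int,
                       extracted_at: datetime, ext: str = "csv") -> str:
    """Return a relative landing zone path with one folder per table."""
    if extracted_at.tzinfo is None:
        raise ValueError("extracted_at must be timezone-aware")
    stamp = extracted_at.astimezone(timezone.utc).strftime(TIMESTAMP_FORMAT)
    return str(PurePosixPath(source, table, f"{table}_v{version}_{stamp}.{ext}"))


def parse_landing_filename(path: str) -> dict:
    """Recover table, schema version and extraction time from the filename."""
    name = PurePosixPath(path).name
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"filename does not follow the convention: {name}")
    extracted_at = datetime.strptime(match["ts"], TIMESTAMP_FORMAT).replace(
        tzinfo=timezone.utc
    )
    return {
        "table": match["table"],
        "version": int(match["version"]),
        "extracted_at": extracted_at,
    }


def is_expired(path: str, now: datetime, retention: timedelta) -> bool:
    """Clean-up check based on the filename alone."""
    return now - parse_landing_filename(path)["extracted_at"] > retention
```

With these assumptions, `build_landing_path("crm", "orders", 2, datetime(2024, 1, 31, 12, 0, tzinfo=timezone.utc))` returns `crm/orders/orders_v2_20240131T120000Z.csv`. Because the timestamp is part of the name, a periodic clean-up job can decide what to remove from a flat listing of each table folder instead of walking nested date folders.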

The table below suggests which framework is best suited based on the Data Source Type.

| Tool/Framework | Best Use Case | Data Source Type |
| --- | --- | --- |
| Delta Live Tables (DLT) | Automating ETL processes, handling schema evolution, and data quality checks | Structured data from databases; semi-structured data from JSON, CSV, and Parquet files |
| Auto Loader | Incremental and efficient processing of new data files as they arrive | Cloud object storage (Amazon S3, Azure Data Lake Storage, Google Cloud Storage) |
| Databricks Connectors | Directly reading and writing data between Databricks and databases | Databases (Azure SQL Database, Azure Synapse Analytics, Cosmos DB) |
| DLT Hub | Open-source Python data loading library (dlt, "data load tool") with minimal configuration, simplifying pipeline creation; not part of Delta Live Tables | Structured and semi-structured data from various sources |
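
For the Auto Loader row, and for the recommendations to rely on its built-in file tracking and to keep bronze close to raw, a minimal sketch might look like the following. It assumes a Databricks runtime that supplies the `spark` session. The paths, table name, and the column names `source_file` and `ingested_at` are hypothetical, and placing the schema location under the checkpoint path is just one possible choice.

```python
def autoloader_options(file_format: str, schema_location: str) -> dict:
    """Build the reader options for an Auto Loader (cloudFiles) stream."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
    }


def ingest_to_bronze(spark, landing_path: str, bronze_table: str,
                     checkpoint_path: str, file_format: str = "json"):
    """Append new landing zone files to a bronze Delta table.

    `spark` is the SparkSession provided by the Databricks runtime.
    Auto Loader records processed files in the checkpoint, so no custom
    file-tracking code is needed.
    """
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options(file_format, f"{checkpoint_path}/schema"))
        .load(landing_path)
        # Keep the payload as-is; only append ingestion metadata columns.
        .selectExpr(
            "*",
            "_metadata.file_path AS source_file",
            "current_timestamp() AS ingested_at",
        )
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_path)
        # Process everything that is currently available, then stop.
        .trigger(availableNow=True)
        .toTable(bronze_table)
    )
```

Because the stream only appends, files stay in the landing zone after loading, and a failed run can be retried from the same checkpoint without reprocessing files that were already committed.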

Back to Target Architecture for ingestion
Back to As-Is Architecture for ingestion