Engineering Gaps and Recommendations - Ingestion

Overview

The diagram and section below give a high-level overview of the Azure Databricks data platform, as described in the [to-be](../architecture/target-architecture/to-be-current-architecture-description.md#ingestion-recommendations) architecture deliverable, with the ingestion area highlighted for the gaps and recommendations.

Ingestion Gaps and Recommendations

Type	Priority	Area	Description	Recommendation / Justification	Reference to standards	Workshop Notes
Gap	1	Landing Zone	There is documentation on how to use the landing zone, but we could not find guidelines or principles for when the landing zone can be skipped and when it must be used.	Document the usage and principles for the landing zone in detail. Without clear guidelines and principles, there is too much temptation to skip the landing zone to save development time. In the long term, such a shortcut will have a negative impact on the overall data product.		Jonatan says we are not using the landing zone term consistently; it is not a stand-alone thing. Everyone who gets a catalogue gets a landing zone. We must have a clear clarification of the term and where it should be. Combination of architecture and engineering.
Gap	1	Landing Zone	Standardization of landing zone organization and naming structure. There is an example of the landing zone organization, but we could not find standards for organizing the landing zone or for file naming conventions.	Document and publish naming standards for the landing zone structure. Define the landing zone file and folder structure and naming conventions. Landing zone standardization facilitates building reusable pipelines.	View NN Landing Zone
Recommendation		Landing Zone	Consider an updated folder structure from what is currently documented.	We recommend adjusting the landing zone folder structure to use a separate folder for each table and to include the timestamp in the filename. Adding the table name to the folder structure allows more granular access to be granted at the table level and is easy to use with Auto Loader and DLT. The periodic batch process that cleans up the landing zone is simplified because the filenames contain the datetime stamp. With the current structure, the clean-up scripts are more complex because they need to iterate over the folders recursively.
Recommendation		Landing Zone	Custom Python code is used to identify files that should be processed.	Writing custom file load tracking mechanisms requires extra time and effort. Rely on Auto Loader's capability to track loaded files where possible.
Recommendation	1	Landing Zone	Keep all files (pending load, loaded, failed) in the landing zone. Periodically clean the landing zone via a batch process.	Use a built-in load tracker such as Auto Loader to identify which files have already been processed. Introduce runtime metadata to record the status of the data ingestion job. Keeping all the data simplifies error handling and makes it easy to re-process data.
Recommendation	2	Ingestion Triggering	Mechanism to trigger the ingestion pipeline. In some cases, data sources are monitored for changes by periodic queries instead of relying on trigger events.	Use an enterprise-grade third-party tool when orchestration across platforms is required. Consider events or signals to trigger pipelines. These events must be fed into the orchestration layer for processing. Periodically querying a data source to identify when a pipeline should be executed requires extra compute resources and costs extra money.	View Standards: This topic is not relevant.
Recommendation	2	Metadata management	Define standards for common metadata attributes.	Introduce standards for metadata tables and attributes; standardize the names of data providers, sources, ingestion types, etc.	View Standards and here
Recommendation	2	Metadata management	Metadata storage	Store metadata in a dedicated schema within Databricks instead of configuration files, and have a common database containing all the metadata for all sources. This is currently done using YAML files. This is an effective way to manage versions of this metadata, and it is reusable across different projects.	View Standards and here
Recommendation	1	Landing Zone	Use 'add only' mode on the landing zone. Some ingestions use append mode.	Use add-only mode in the landing zone. This reduces the complexity of data management by avoiding the need for complex update and delete operations.	View
Recommendation	1	Keep all files with both correct and incorrect data in the landing zone	Incorrect data in the landing zone and bronze layer is also data that needs to be stored.	Add both correct and incorrect files to the landing zone. Mark incorrect data with common metadata attributes. This is useful	View
Recommendation	1	Bronze data storage	Store data as close to raw as possible, only appending additional metadata and storing it in Delta format.	If possible, do not de-duplicate data in bronze. The de-duplication logic may contain bugs, or there may be ingestion issues, and keeping the raw data preserves the ability to replay the transform from bronze. While this may increase the overall size, the storage cost is small compared to the reprocessing costs.	View
Recommendation	3	Full Data loads	Avoid full data loads where possible.	Full loads create considerable overhead. Design for incremental loads first.	View
Recommendation	1	Soft Deletes	Data history	Never delete data. If a record is deleted, mark the record in bronze as deleted instead of removing it. Physically deleting data can break referential integrity.	View
Recommendation	3	Custom Ingestion	Using custom Python for ingestion logic increases complexity and is error-prone.	Where possible, use a pre-built ingestion tool such as DLT Hub.	View
Recommendation	3	Separate ETL code and DDL creation	Create data objects in the catalogue as part of the DDL scripts that execute during code deployment, not as part of the Python code for the pipeline.	Run DDL scripts as part of deployment.	View
Recommendation	2	Data Versioning	Every file in the landing zone should be versioned. Each version reflects the data structure of the file.	Include the version as part of the file name.
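
The folder-structure and data-versioning recommendations in the table above can be sketched as a small naming helper. This is a minimal illustration, not an agreed standard: the layout `<source>/<table>/<table>_<UTC timestamp>_v<version>.<ext>`, the timestamp format, and the function names `landing_path`, `parse_landing_name` and `is_expired` are all assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone
from pathlib import PurePosixPath

# Assumed timestamp format: compact, sortable, always UTC.
TS_FORMAT = "%Y%m%dT%H%M%SZ"


def landing_path(source: str, table: str, loaded_at: datetime,
                 version: int, ext: str = "json") -> PurePosixPath:
    """Build <source>/<table>/<table>_<UTC timestamp>_v<version>.<ext>.

    `loaded_at` is expected to be timezone-aware.
    """
    stamp = loaded_at.astimezone(timezone.utc).strftime(TS_FORMAT)
    return PurePosixPath(source) / table / f"{table}_{stamp}_v{version}.{ext}"


def parse_landing_name(path: PurePosixPath) -> tuple[datetime, int]:
    """Recover the load timestamp and structure version from a file name."""
    prefix, version_part = path.stem.rsplit("_v", 1)
    stamp = prefix.rsplit("_", 1)[1]
    loaded_at = datetime.strptime(stamp, TS_FORMAT).replace(tzinfo=timezone.utc)
    return loaded_at, int(version_part)


def is_expired(path: PurePosixPath, now: datetime, retention: timedelta) -> bool:
    """Decide from the file name alone whether a clean-up job may remove the file."""
    loaded_at, _ = parse_landing_name(path)
    return now - loaded_at > retention


example = landing_path(
    "crm", "order_lines",
    datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc),
    version=2,
)
print(example)
```

Because the table name is a folder and the timestamp is in the file name, a clean-up job under these assumptions only needs to list one folder per table and compare names, rather than walk a nested date hierarchy.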

The table below suggests which framework is best suited to each data source type.

Tool/Framework	Best Use Case	Data Source Type
Delta Live Tables (DLT)	Automating ETL processes, handling schema evolution, and data quality checks	Structured data from databases, semi-structured data from JSON, CSV, Parquet files
Auto Loader	Incremental and efficient processing of new data files as they arrive	Cloud object storage (Amazon S3, Azure Data Lake Storage, Google Cloud Storage)
Databricks Connectors	Directly reading and writing data between Databricks and databases	Databases (Azure SQL Database, Azure Synapse Analytics, Cosmos DB)
DLT Hub (dlt)	Open-source Python library for building extract-and-load pipelines with schema inference, reducing custom ingestion code	Structured and semi-structured data from various sources
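
As a rough illustration of the Auto Loader row, and of the recommendation to rely on its built-in file tracking instead of custom Python, the sketch below loads new landing zone files into a bronze Delta table. It assumes a Databricks runtime where a `spark` session is available. The function name `start_landing_to_bronze`, the JSON file format, the added column names and the example paths are all assumptions made for this example.

```python
def start_landing_to_bronze(spark, landing_dir: str, bronze_table: str,
                            checkpoint_dir: str):
    """Incrementally append new landing zone files to a bronze Delta table."""
    raw = (
        spark.readStream.format("cloudFiles")                  # Auto Loader source
        .option("cloudFiles.format", "json")                   # assumed file format
        .option("cloudFiles.schemaLocation", checkpoint_dir)   # where the inferred schema is tracked
        .load(landing_dir)
        # Keep the data as-is and only append metadata columns.
        .selectExpr(
            "*",
            "_metadata.file_path AS source_file",
            "current_timestamp() AS ingested_at",
        )
    )
    return (
        raw.writeStream
        .option("checkpointLocation", checkpoint_dir)  # records which files were already processed
        .trigger(availableNow=True)                    # process the current backlog, then stop
        .toTable(bronze_table)
    )


# Hypothetical usage on Databricks:
# query = start_landing_to_bronze(
#     spark, "/Volumes/landing/crm/order_lines", "bronze.crm.order_lines",
#     "/Volumes/landing/_checkpoints/order_lines",
# )
# query.awaitTermination()
```

Because the checkpoint keeps track of processed files, files can stay in the landing zone after loading and be removed later by a separate periodic clean-up job.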
