Engineering Gaps and Recommendations For Data Engineering Common Components#


Version Control#

Version	Date	Owner	Change Description
0.1	18 March 2025	Gareth Stretch	Initial Framework created
0.2	23 March 2025	Gareth Stretch	Moved Gaps to its own page

Common Components#

The diagram and sections below give a high-level overview of the Azure Databricks data platform as described in the to-be deliverable, with the relevant area highlighted for each set of gaps and recommendations.


Monitoring#

Definition : End-to-end monitoring of all technology components within the data platform, not just Databricks. This includes:

The behavior of all technology components that are used in the pipeline.
The status and result of each pipeline execution.
Telemetry showing the execution time of every individual component.

Cost monitoring is not included in the overall monitoring topic, as it falls within the platform governance and cost area.
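
As a minimal sketch of the telemetry point above, the following Python context manager records the execution time and status of each pipeline component. The names `timed_component` and `TELEMETRY`, as well as the pipeline and component names, are hypothetical; a real platform would send these records to its monitoring tool rather than keep them in a list.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory sink; a real pipeline would write to a monitoring store.
TELEMETRY = []


@contextmanager
def timed_component(pipeline, component):
    """Record duration and status for one component of a pipeline run."""
    start = time.perf_counter()
    status = "succeeded"
    try:
        yield
    except Exception:
        status = "failed"
        raise  # re-raise so the orchestrator still sees the failure
    finally:
        TELEMETRY.append({
            "pipeline": pipeline,
            "component": component,
            "status": status,
            "duration_seconds": time.perf_counter() - start,
        })


# Usage: wrap each step, whether it runs in Databricks or elsewhere.
with timed_component("sales_ingest", "copy_to_landing"):
    time.sleep(0.01)  # stand-in for the real work
```

Because the record is written in the `finally` clause, failed components are captured with their duration as well, which supports reporting on both status and execution time.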

Gaps and Recommendations

Type	Priority	Area	Gap	Recommendation
Recommendation	1	Pipeline Monitoring	Only the Databricks jobs are monitored; there is no end-to-end pipeline monitoring that covers multiple technology components.	Use data monitoring tooling that provides end-to-end monitoring, not only Databricks monitoring. Data monitoring capabilities are often included in end-to-end orchestration tooling such as Dagster or Airflow.
Recommendation	1	Metadata	The metadata needed for observability is not stored, e.g. request ID, file name, source provider name.	Define a metadata strategy to ensure that the metadata needed for end-to-end lineage and monitoring is available. Consider storing metadata in a separate schema on Databricks.
Recommendation	1	Underdeveloped metadata management	There is no common approach or standard for metadata management and governance, e.g. some metadata is stored in YAML, while some ingestions apply no metadata at all.	Define a metadata strategy to ensure that the metadata needed for end-to-end lineage and monitoring is available.
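
To illustrate what a common metadata record might look like, here is a minimal Python sketch. The request ID, file name and source provider name come from the gap described above; the class name `IngestionMetadata`, the `loaded_at` field, the example values and the choice to generate request IDs in the pipeline are assumptions made for illustration only.

```python
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class IngestionMetadata:
    """One row of observability metadata per ingested file (hypothetical schema)."""
    file_name: str
    source_provider_name: str
    # Assumption: the pipeline generates a request ID when none is supplied.
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    # Assumption: a UTC load timestamp is recorded; it is not in the source list.
    loaded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


record = IngestionMetadata(
    file_name="orders_2025_03_18.csv",
    source_provider_name="example_provider",
)
row = asdict(record)  # plain dict, ready to append to a metadata table
```

Keeping the record as one typed structure means every ingestion writes the same attributes, which is the kind of standard the recommendation asks for.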

Data Validation#

Definition : Data validation is the process of identifying and reporting violations of the data quality rules defined in the data contract.

Gaps and Recommendations

Type	Priority	Area	Description	Recommendation
Gap	1	Reusability	While various techniques are recorded for data validation, there is no common, documented, scalable approach, nor are any templates or recommendations provided.	Build a set of reusable templates for the core data validations, covering at least the 'source to landing' and 'landing to bronze' layers.
Gap	2	Data Quality Metrics	Data quality metrics are generally documented in the git repo per solution. However, there is no common definition of data quality metrics or of the actions to take on degraded metrics.	Define and document data quality metrics, including the actions to be taken should a metric drop below the defined standard.
Gap	2	Data Validation tools	Data validation is done manually, which limits scalability.	Use a data validation tool, e.g. the data validation that is included in Delta Live Tables, where appropriate. Lakehouse Monitoring allows you to profile, diagnose, and enforce data quality directly within the Databricks platform. Data Profiling: Automatically profile data to identify trends and anomalies. Quality Enforcement: Set up rules to enforce data quality standards. Dashboards: Visualize data quality metrics and trends using auto-generated dashboards. For more sophisticated data validation, consider external tools such as Great Expectations. If you are using dbt, consider the dbt-expectations package.
Recommendation		Data Quality	- Only basic data quality validation is performed during data load in the pipeline - Data is not validated against acceptable ranges - Uniqueness of business keys is not validated	Use external tooling as suggested above. Store lookup values for referencing the validation rules (acceptable ranges) in a table.
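
As a rough sketch of what a reusable validation template could cover, the Python below checks values against acceptable ranges held in a lookup structure and checks business keys for uniqueness. It uses plain dictionaries so that it runs anywhere; on Databricks the same rules would more likely be applied to DataFrames or expressed in a validation tool. The function names, column names and range values are all hypothetical.

```python
from collections import Counter

# Hypothetical lookup table of acceptable ranges, keyed by column name.
ACCEPTABLE_RANGES = {"quantity": (1, 1000), "unit_price": (0.0, 10000.0)}


def check_ranges(rows, ranges):
    """Return (row_index, column, value) for every value outside its range."""
    violations = []
    for index, row in enumerate(rows):
        for column, (low, high) in ranges.items():
            value = row.get(column)
            if value is None or not (low <= value <= high):
                violations.append((index, column, value))
    return violations


def check_unique_keys(rows, key_columns):
    """Return the business keys that occur more than once."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return [key for key, count in counts.items() if count > 1]


rows = [
    {"order_id": "A1", "quantity": 5, "unit_price": 9.99},
    {"order_id": "A2", "quantity": 0, "unit_price": 9.99},   # quantity too low
    {"order_id": "A1", "quantity": 7, "unit_price": 19.99},  # duplicate key
]
range_violations = check_ranges(rows, ACCEPTABLE_RANGES)
duplicate_keys = check_unique_keys(rows, ["order_id"])
```

Holding the ranges in a lookup structure rather than in the code means a rule can change without redeploying the pipeline.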

Exception Handling#

Definition : Exception handling refers to the methods and practices used to act upon errors and unexpected conditions that occur during the execution of code, particularly in notebooks and jobs.

Gaps and Recommendations

Type	Priority	Area	Description	Recommendation
Recommendation	2	Exception Handling Strategy	A common documented approach for exception handling is missing.	Document a detailed exception handling process. Classify each exception as fatal or non-fatal. Fatal exceptions must terminate the data load and roll the state back to its pre-load condition. Non-fatal exceptions are logged during the data load, both in the log file and in metadata attributes. Dedicated dashboards should provide an overview of both fatal and non-fatal exceptions, defined at the pipeline / source system level. It is recommended to split exceptions and good values into separate data frames for monitoring and further processing.
Recommendation	2	Exception handling coverage	Try / catch blocks only cover parts of the process.	Extend the process to catch and log exceptions from both code execution and data validation.
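
The fatal / non-fatal classification can be sketched in Python as follows. This is an illustration under assumptions, not a prescribed implementation: the exception classes, the `parse_record` and `load` functions and the record layout are hypothetical, and plain lists stand in for the separate data frames of good values and exceptions.

```python
import logging

logger = logging.getLogger("pipeline")


class FatalLoadError(Exception):
    """Hypothetical: an error that must terminate the data load."""


class NonFatalRecordError(Exception):
    """Hypothetical: a single bad record that should be logged and set aside."""


def parse_record(raw):
    """Convert one raw record, classifying any problem as fatal or non-fatal."""
    if raw is None:
        raise FatalLoadError("source returned no data")
    try:
        return {"id": raw["id"], "amount": float(raw["amount"])}
    except (KeyError, ValueError, TypeError) as exc:
        raise NonFatalRecordError(f"bad record {raw!r}: {exc}") from exc


def load(raw_records):
    """Split records into good values and logged exceptions."""
    good, exceptions = [], []
    for raw in raw_records:
        try:
            good.append(parse_record(raw))
        except NonFatalRecordError as exc:
            logger.warning("non-fatal: %s", exc)
            exceptions.append({"raw": raw, "error": str(exc)})
        # FatalLoadError is deliberately not caught: it propagates so the
        # caller can stop the load and roll back to the pre-load state.
    return good, exceptions


good, exceptions = load([{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "n/a"}])
```

The exceptions list keeps the raw record next to the error text, so it can feed a dashboard or be reprocessed later.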

Testing#

Definition : Testing refers to the process of verifying that your code, data pipelines, and data models work as expected. This ensures the reliability, accuracy, and performance of your data workflows.

Gaps and Recommendations

Type	Priority	Area	Gap	Recommendation
Recommendation	2	Comprehensive Testing	Testing is not comprehensive. There are not enough unit tests to cover all possible scenarios, which results in undetected errors.	Adopt Test-Driven Development (TDD) to ensure thorough testing from the outset. TDD alone does not guarantee comprehensive test coverage, so it is further recommended to document and implement both negative and positive test cases.
Recommendation	2	Custom Testing	Testing code is generally hand-rolled and prone to errors.	Use testing frameworks as the standard implementation approach. Refer to the framework list below.
Recommendation	2	Automated Testing	Automated regression testing is required for scalable development.	Define a comprehensive testing strategy that includes automated testing, tooling and frameworks.
Recommendation	1	Test Data	Using production data for testing has adverse effects; production data often covers mostly the positive test case scenarios.	Provide a testing suite that can generate test data for each defined scenario, including both positive and negative test data.

Framework	Strengths	Weaknesses	Cost
pytest	- Widely used and well-documented - Supports fixtures and parameterized tests - Integrates well with CI/CD tools	- May require additional setup for Spark-specific testing	Free
Nutter	- Designed specifically for Databricks notebooks - Easy to integrate with Azure DevOps - Simplifies testing of notebooks	- Limited to Databricks notebooks - Less flexible for non-notebook code	Free
unittest	- Built-in Python library - Simple and easy to use - Good for basic unit testing	- Less feature-rich compared to pytest - May require additional setup for complex tests	Free
doctest	- Allows you to embed tests in docstrings - Simple and easy to use - Good for documentation-driven testing	- Not as powerful as other frameworks - Limited to simple test cases	Free
hypothesis	- Supports property-based testing - Automatically generates test cases - Integrates with pytest	- Can be complex to set up initially - Requires understanding of property-based testing	Free
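
To make the positive / negative test case and test data recommendations concrete, here is a small Python sketch. The validation rule, the `make_orders` generator and the scenario names are hypothetical. The test functions follow the `test_*` naming convention that pytest discovers, but they use only plain `assert` statements, so they do not depend on any framework being installed.

```python
def is_valid_order(order):
    """Function under test (hypothetical rule): quantity is a positive integer."""
    quantity = order.get("quantity")
    return (
        isinstance(quantity, int)
        and not isinstance(quantity, bool)
        and quantity > 0
    )


def make_orders(scenario):
    """Generate test data for a named scenario instead of copying production data."""
    if scenario == "positive":
        return [{"order_id": f"P{i}", "quantity": i} for i in range(1, 4)]
    if scenario == "negative":
        return [
            {"order_id": "N1", "quantity": 0},
            {"order_id": "N2", "quantity": -5},
            {"order_id": "N3", "quantity": None},
            {"order_id": "N4"},  # quantity field missing entirely
        ]
    raise ValueError(f"unknown scenario: {scenario}")


def test_positive_orders_are_accepted():
    assert all(is_valid_order(order) for order in make_orders("positive"))


def test_negative_orders_are_rejected():
    assert not any(is_valid_order(order) for order in make_orders("negative"))
```

Generating the negative cases explicitly covers scenarios, such as missing or null fields, that production data may rarely contain.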

Data Maintenance#

Definition : Data maintenance covers both archiving and compliance requirements that include the right to be forgotten process, GDPR, POPIA and general storage retention topics. Archiving and storing data in cheaper storage tiers on the Databricks platform involves several strategies to optimize costs while maintaining data accessibility when needed. This is relevant to landing zone, bronze, silver and gold layers.

CI/CD / DevOps#

Definition : CI/CD (Continuous Integration and Continuous Delivery) refers to the automated process of developing, testing, and deploying code changes in a consistent and reliable manner. This approach helps streamline the development lifecycle, ensuring that new features and updates are delivered quickly and with high quality. The gaps and recommendations covered are listed below.

Gaps and Recommendations

Type	Area	Gap	Recommendation
Recommendation	Testing automation	Testing is not fully automated, which limits full CI/CD processes.	Define and implement an automated testing strategy and train the data engineers on this process.
Recommendation	Automated Testing	No automated integration or regression testing could be found.	(verify...)
Recommendation	Storage Separation	Separate the compute and storage subscriptions.	Use a common storage subscription for all business areas. This avoids migrating data across subscriptions if ownership changes.