
Engineering Gaps and Recommendations For Data Engineering Common Components#


Version Control#

| Version | Date | Owner | Change Description |
|--- |--- |--- |--- |
| 0.1 | 18 March 2025 | Gareth Stretch | Initial framework created |
| 0.2 | 23 March 2025 | Gareth Stretch | Moved gaps to their own page |

Common Components#

The diagram and section below give a high-level overview of the Azure Databricks data platform as described in the to-be deliverable, with the area covered by these gaps and recommendations highlighted.

screenshot

Monitoring#

Definition : Enabling end-to-end monitoring of all technology components within the data platform, not just Databricks. This includes:

  • The behavior of all technology components used in the pipeline.
  • The status and result of each pipeline execution.
  • Telemetry showing the execution time of every individual component.

Cost monitoring is not included in the overall monitoring topic, as it falls within the platform governance and cost area.
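
The status and per-component timing described above can be sketched in plain Python. This is a minimal illustration, not the platform's implementation: `PipelineRun`, `track`, the pipeline name and the component names are all hypothetical, and a real setup would write these records to a monitoring store rather than keep them in memory.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class PipelineRun:
    """Collects per-component telemetry for one pipeline execution (hypothetical structure)."""
    pipeline: str
    records: list = field(default_factory=list)

    @contextmanager
    def track(self, component: str):
        # Time one component and record its status, whether it succeeds or fails.
        start = time.perf_counter()
        status = "succeeded"
        try:
            yield
        except Exception:
            status = "failed"
            raise
        finally:
            self.records.append({
                "pipeline": self.pipeline,
                "component": component,
                "status": status,
                "duration_s": time.perf_counter() - start,
            })

    @property
    def status(self) -> str:
        # The run as a whole is failed if any tracked component failed.
        failed = any(r["status"] == "failed" for r in self.records)
        return "failed" if failed else "succeeded"


run = PipelineRun(pipeline="daily_sales")  # hypothetical pipeline name
with run.track("extract_from_source"):
    time.sleep(0.01)                       # stand-in for real work
with run.track("load_to_bronze"):
    time.sleep(0.01)

for record in run.records:
    print(record["component"], record["status"], round(record["duration_s"], 3))
```

Because every component reports through the same wrapper, non-Databricks steps (file transfers, API calls) end up in the same telemetry as the Databricks jobs.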

Gaps and Recommendations

| Type | Priority | Area | Gap | Recommendation |
|--- |--- |--- |--- |--- |
| Recommendation | 1 | Pipeline Monitoring | Only the Databricks jobs are monitored; there is no end-to-end pipeline monitoring that covers multiple technology components. | Use data monitoring tooling that provides end-to-end monitoring, not only Databricks monitoring. Data monitoring capabilities are often included in end-to-end orchestration tooling such as Dagster or Airflow. |
| Recommendation | 1 | Metadata | The metadata needed for observability is not stored, e.g. request ID, file name, source provider name. | Define a metadata strategy to ensure that the metadata needed for end-to-end lineage and monitoring is available. Consider storing metadata in a separate schema on Databricks. |
| Recommendation | 1 | Underdeveloped metadata management | There is no common approach to metadata management and governance, e.g. some metadata is stored in YAML and some ingestions apply no metadata at all. There is no standard. | Define a metadata strategy to ensure that the metadata needed for end-to-end lineage and monitoring is available. |
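
As one possible starting point for a metadata strategy, the observability attributes named above (request ID, file name, source provider name) could be captured once per ingestion and attached to every row. The sketch below is illustrative only: `IngestionMetadata`, `tag_rows`, the `_meta_` column prefix and the sample values are assumptions, not an existing standard on the platform.

```python
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class IngestionMetadata:
    """Metadata captured once per ingestion request (hypothetical field set)."""
    file_name: str
    source_provider_name: str
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    ingested_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def tag_rows(rows, meta):
    """Return copies of the rows with the metadata attached as _meta_* columns."""
    meta_cols = {f"_meta_{name}": value for name, value in asdict(meta).items()}
    return [{**row, **meta_cols} for row in rows]


meta = IngestionMetadata(
    file_name="orders_2025-03-18.csv",  # illustrative values
    source_provider_name="erp",
)
tagged = tag_rows([{"order_id": 1}, {"order_id": 2}], meta)
print(tagged[0])
```

Carrying the same `request_id` on every row is what allows a record in a downstream layer to be traced back to the file and provider it came from.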

Data Validation#

Definition : Data validation is the process of identifying and reporting violations of the data quality rules defined in the data contract.

Gaps and Recommendations

| Type | Priority | Area | Description | Recommendation |
|--- |--- |--- |--- |--- |
| Gap | 1 | Reusability | While various techniques are recorded for data validation, there is no common, documented, scalable approach, nor are any templates or recommendations provided. | Build a set of reusable templates for the core data validations, covering at least the 'source to landing' and 'landing to bronze' layers. |
| Gap | 2 | Data Quality Metrics | Data quality metrics are generally documented in the Git repo per solution. However, there is no common definition of data quality metrics or of the actions to take on degraded metrics. | Define and document data quality metrics, including the actions to be taken should a metric drop below the defined standard. |
| Gap | 2 | Data Validation tools | Data validation is done manually, which limits scalability. | Use a data validation tool, e.g. the data validation included in Delta Live Tables, where appropriate.<br>Lakehouse Monitoring allows you to profile, diagnose, and enforce data quality directly within the Databricks platform:<br>- Data Profiling: Automatically profile data to identify trends and anomalies.<br>- Quality Enforcement: Set up rules to enforce data quality standards.<br>- Dashboards: Visualize data quality metrics and trends using auto-generated dashboards.<br>For more sophisticated data validation, consider external tools like Great Expectations. If you are using dbt, consider dbt-expectations, which is available as a dbt package. |
| Recommendation | | Data Quality | - Only basic data quality validation is performed during data load in the pipeline<br>- Data is not validated against acceptable ranges<br>- Uniqueness of business keys is not validated | - Store lookup values for referencing the validation rules (acceptable ranges) in a table.<br>- Use external tooling as suggested above. |
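
The two missing checks named above, acceptable ranges driven by a lookup and uniqueness of business keys, can be sketched without any specific tool. This is a plain-Python illustration under stated assumptions: `ACCEPTABLE_RANGES`, the column names and the sample rows are hypothetical, and in practice the lookup would live in a table and the checks would run in the chosen validation tool.

```python
from collections import Counter

# Hypothetical lookup of acceptable ranges; in practice this could be stored in a table.
ACCEPTABLE_RANGES = {"quantity": (1, 1000), "unit_price": (0.01, 10000.0)}


def range_violations(rows, ranges):
    """Return (row_index, column, value) for each value missing or outside its range."""
    violations = []
    for index, row in enumerate(rows):
        for column, (low, high) in ranges.items():
            value = row.get(column)
            if value is None or not (low <= value <= high):
                violations.append((index, column, value))
    return violations


def duplicate_keys(rows, key_columns):
    """Return the business keys that occur more than once."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


rows = [
    {"order_id": 1, "quantity": 5, "unit_price": 9.99},
    {"order_id": 2, "quantity": 0, "unit_price": 4.50},  # quantity below range
    {"order_id": 2, "quantity": 3, "unit_price": 4.50},  # duplicate business key
]

print(range_violations(rows, ACCEPTABLE_RANGES))
print(duplicate_keys(rows, ["order_id"]))
```

Keeping the ranges in data rather than in code means a rule can be changed without redeploying the pipeline.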

Exception Handling#

Definition : Exception handling refers to the methods and practices used to act upon errors and unexpected conditions that occur during the execution of code, particularly in notebooks and jobs.

Gaps and Recommendations

| Type | Priority | Area | Description | Recommendation |
|--- |--- |--- |--- |--- |
| Recommendation | 2 | Exception Handling Strategy | A common documented approach for exception handling is missing. | Document a detailed exception handling process. Each exception is classified as fatal or non-fatal:<br>- Fatal exceptions must terminate the data load and roll back state to the pre-load condition.<br>- Non-fatal exceptions are logged during the data load, both in the log file and in metadata attributes.<br>Dedicated dashboards should provide an overview of both fatal and non-fatal exceptions, defined at the pipeline / source system level. It is recommended to separate exceptions and good values into separate data frames for monitoring and further processing. |
| Recommendation | 2 | Exception handling coverage | Try / catch blocks only cover parts of the process. | Extend the process to catch and log exceptions from both the code execution process and the data validation process. |
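
The fatal / non-fatal classification above might look like the following in code. This is a sketch under assumptions, not a prescribed design: the exception class names, `parse_record` and `load` are hypothetical, plain lists stand in for data frames, and the rollback to the pre-load condition is left to the caller.

```python
import logging

logger = logging.getLogger("data_load")


class FatalLoadError(Exception):
    """Terminates the load; the caller is expected to roll back to the pre-load state."""


class NonFatalRecordError(Exception):
    """Affects a single record; it is logged and the load continues."""


def parse_record(raw):
    # Hypothetical per-record rule: every record needs an id.
    if "id" not in raw:
        raise NonFatalRecordError(f"missing id: {raw!r}")
    return {"id": int(raw["id"]), "amount": float(raw.get("amount", 0))}


def load(raw_records):
    """Split input into good records and rejected records with their error text."""
    if not raw_records:
        raise FatalLoadError("source delivered no records")
    good, rejected = [], []
    for raw in raw_records:
        try:
            good.append(parse_record(raw))
        except (NonFatalRecordError, ValueError) as exc:
            logger.warning("rejected record: %s", exc)
            rejected.append({"record": raw, "error": str(exc)})
    return good, rejected


good, rejected = load([{"id": "1", "amount": "2.5"}, {"amount": "3"}, {"id": "x"}])
print(len(good), "good,", len(rejected), "rejected")
```

Returning the rejected records alongside the good ones keeps them available for the dashboards and further processing mentioned in the table.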

Testing#

Definition : Testing refers to the process of verifying that your code, data pipelines, and data models work as expected. This ensures the reliability, accuracy, and performance of your data workflows.

Gaps and Recommendations

| Type | Priority | Area | Gap | Recommendation |
|--- |--- |--- |--- |--- |
| Recommendation | 2 | Comprehensive Testing | Lack of comprehensive testing: not having enough unit tests to cover all possible scenarios results in undetected errors. | Adopt Test-Driven Development (TDD): implement TDD practices to ensure thorough testing from the outset. TDD alone does not guarantee comprehensive test coverage, so it is further recommended to document and implement both negative and positive test cases. |
| Recommendation | 2 | Custom Testing | Testing code is generally hand-rolled and prone to errors. | Use testing frameworks as the standard implementation approach. Refer to the list below. |
| Recommendation | 2 | Automated Testing | Automated regression testing is required for scalable development. | Define a comprehensive testing strategy that includes automated testing, tooling and frameworks. |
| Recommendation | 1 | Test Data | Using production data has adverse effects on testing, because production data often covers mostly the positive test case scenarios. | It is recommended to have a testing suite that can generate test data for each defined scenario, including both positive and negative test data. |

| Framework | Strengths | Weaknesses | Cost |
|--- |--- |--- |--- |
| pytest | - Widely used and well-documented<br>- Supports fixtures and parameterized tests<br>- Integrates well with CI/CD tools | - May require additional setup for Spark-specific testing | Free |
| Nutter | - Designed specifically for Databricks notebooks<br>- Easy to integrate with Azure DevOps<br>- Simplifies testing of notebooks | - Limited to Databricks notebooks<br>- Less flexible for non-notebook code | Free |
| unittest | - Built-in Python library<br>- Simple and easy to use<br>- Good for basic unit testing | - Less feature-rich compared to pytest<br>- May require additional setup for complex tests | Free |
| doctest | - Allows you to embed tests in docstrings<br>- Simple and easy to use<br>- Good for documentation-driven testing | - Not as powerful as other frameworks<br>- Limited to simple test cases | Free |
| hypothesis | - Supports property-based testing<br>- Automatically generates test cases<br>- Integrates with pytest | - Can be complex to set up initially<br>- Requires understanding of property-based testing | Free |
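
To illustrate generated test data covering both positive and negative scenarios, here is a small sketch using the built-in `unittest` library from the table above. The rule under test (`is_valid_order`), the scenario generator (`make_test_orders`) and the sample orders are hypothetical; the same pattern carries over to pytest parameterized tests.

```python
import unittest


def is_valid_order(order):
    """Hypothetical rule under test: positive quantity and a non-empty customer."""
    return order.get("quantity", 0) > 0 and bool(order.get("customer"))


def make_test_orders():
    """Generate labelled test data as (scenario, order, expected_result)."""
    return [
        ("valid order", {"quantity": 2, "customer": "acme"}, True),         # positive
        ("zero quantity", {"quantity": 0, "customer": "acme"}, False),      # negative
        ("negative quantity", {"quantity": -1, "customer": "acme"}, False), # negative
        ("missing customer", {"quantity": 2}, False),                       # negative
    ]


class OrderValidationTest(unittest.TestCase):
    def test_generated_scenarios(self):
        # subTest reports each scenario separately if it fails.
        for scenario, order, expected in make_test_orders():
            with self.subTest(scenario=scenario):
                self.assertEqual(is_valid_order(order), expected)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(OrderValidationTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Naming each scenario makes it obvious which negative cases are covered, which is hard to see when tests rely on production data.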

Data Maintenance#

Definition : Data maintenance covers both archiving and compliance requirements that include the right to be forgotten process, GDPR, POPIA and general storage retention topics. Archiving and storing data in cheaper storage tiers on the Databricks platform involves several strategies to optimize costs while maintaining data accessibility when needed. This is relevant to landing zone, bronze, silver and gold layers.

Gaps and Recommendations

| Type | Priority | Area | Gap | Recommendation |
|--- |--- |--- |--- |--- |
| Recommendation | 1 | Partitioning and Filtering | Limited data partitioning strategy | Organize your data into partitions based on time or other relevant criteria. This makes it easier to archive older partitions while keeping recent data in more accessible storage tiers. Use common metadata columns to facilitate implementation of data lifecycle policies. |
| Recommendation | 2 | Data Backups | No common approach for backup and recovery. | Use time travel to back up and restore within 1 week, and use periodic backups to be able to restore historical data outside of the 1-week period. |
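
The two recommendations above reduce to simple date arithmetic once data is partitioned by time. The sketch below is illustrative only: the 90-day `HOT_RETENTION` threshold, the function names and the sample dates are assumptions, and only the 1-week time travel window comes from the recommendation itself.

```python
from datetime import date, timedelta

TIME_TRAVEL_WINDOW = timedelta(days=7)  # the 1-week window from the recommendation
HOT_RETENTION = timedelta(days=90)      # hypothetical threshold for the accessible tier


def partitions_to_archive(partition_dates, today, retention=HOT_RETENTION):
    """Return the date partitions older than the retention period, oldest first."""
    cutoff = today - retention
    return sorted(d for d in partition_dates if d < cutoff)


def restore_method(target_date, today):
    """Pick time travel inside the window, periodic backups outside it."""
    if today - target_date <= TIME_TRAVEL_WINDOW:
        return "time_travel"
    return "periodic_backup"


today = date(2025, 3, 23)
partitions = [date(2024, 11, 1), date(2025, 1, 15), date(2025, 3, 20)]

print(partitions_to_archive(partitions, today))
print(restore_method(date(2025, 3, 20), today))
print(restore_method(date(2025, 3, 1), today))
```

With a common date column on every table, the same two functions could drive lifecycle policies across the landing zone, bronze, silver and gold layers.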

CI/CD / DevOps#

CI/CD (Continuous Integration and Continuous Delivery) refers to the automated process of developing, testing, and deploying code changes in a consistent and reliable manner. This approach helps streamline the development lifecycle, ensuring that new features and updates are delivered quickly and with high quality. The list below is what is covered.

Gaps and Recommendations

| Type | Priority | Area | Gap | Recommendation |
|--- |--- |--- |--- |--- |
| Recommendation | | Testing automation | Testing is not fully automated, which limits full CI/CD processes. | Define and implement an automated testing strategy and train the data engineers on this process. |
| Recommendation | | Automated Testing | Automated integration and regression testing. | Unable to find automated regression testing. (verify...) |
| Recommendation | | Storage Separation | | Separate compute and storage subscriptions: use a common storage subscription for all business areas. This avoids migrating data across subscriptions if ownership changes. |