Skip to content

To be current architecture description

Target Azure Databricks Data Solution - Architecture Description#

Back to Menu

Components#

  • Ingestion: Using Apache Kafka for real-time data streaming and Fivetran for batch data ingestion.
  • Storage: Data lake on Azure Data Lake Storage and data warehouse on Snowflake.
  • Processing: Apache Spark on Databricks for batch processing and real-time analytics.
  • Transformation: ETL processes managed by Airbyte.
  • Orchestration: Apache Airflow for workflow management.
  • Analytics and BI: Power BI for data visualization and reporting.
  • Data Governance: Implementing data quality checks and lineage tracking with Collibra.

Integration#

  • APIs: REST APIs for integrating with external systems.
  • Data Flows: Data flows from ingestion to storage, processing, and analytics.

Security#

  • Access Controls: Role-based access control (RBAC) and fine-grained permissions.
  • Encryption: Data encrypted at rest and in transit using AES-256.
  • Compliance: GDPR and HIPAA compliance ensured through data governance practices.

Naming Standards :#

This section documents the naming standards and recommendations per area that should exist.

  • Landing zone folder structure : click here
  • git repos, pipelines :
  • Database Objects (databases and Schema's) :
  • Tables and Views :
  • Columns :
  • Indexes and Constraints :
  • Stored Procedures and Functions :
  • Triggers :
  • Databricks jobs :
  • Security and roles :
  • service principles :
  • resource groups :
  • python :
  • terraform :

View the guardrails, recommendations and best practices here

Data Ingestion#

Hold "Alt" / "Option" to enable Pan & Zoom
screenshot

Introduction#

Landing Zone and Bronze#

Definition of Ingestion : Ingestion is the process of obtaining data from an external system and storing it in the data platform in exactly the same structure and content as the source system. No modifications / transformations of any kind should be done during ingestion. This ingestion topic usually covers source to landing zone and landing zone to bronzes.

There is no silver bullet tool that will cover every scenario. Evaluate these tools and make a decision on tooling that supports the use cases and engineering preferences.

Using an external tool, you obtain key benefits as listed below. By implementing python you spend a lot of effort writing and wiring in your own plumbing that the existing frameworks provide.

  • Depending on the different sources, the data ingestion tool is selected by te development team.
  • Each ingestion pipeline must have structured exception handling.
  • must support a retry model from where the pipeline failed.
  • reconciliation / verification on completeness must be performed.
  • events to be raised to notify orchestration / external system that ingestion has been completed.
  • To handle maximum through and support multiple parallel ingestion's

Below is a list technologies and tools pertaining to the ingestion topic. These tools can be used to move the data from source to landing zone and bronze layer. Use one of these tools to move data into the landing first.

  • Mulesoft : NovoNordisk Standard for API Ingestion.
  • ADF : Currently not used but offers advanced robust connectors for many datasources that can be utilised.
  • Airbyte : opensource based ETL tool offering many advanced ingestion capabilities including logging, auditing, DQ and observability.
  • fiveTran : Low code ingestion tool - relatively expensive.
  • Databricks : Part of the eco system and can be directly used for ingestion, good for native databricks implementation.
  • DLT Hub : Opensource python based ETL library offering many advanced ingestion capabilities including logging, auditing, DQ and observability.

The selection of the ingestion tools depend on the organization budget and development preferences.

Ingestion recommendations#

  • Use both a landing zone and bronze layer, the pattern for moving from landing zone to bronze will be very similar and lends itself to building re-suable templates. Click here
  • Ingestion to landing zone is triggered / scheduled by external job or using ADF
  • Data movement from landing zone to bronze is done automatically using either autoloader (using serverless) or DLT.
  • If you are are using file types and ingestions that are suited to autoloader, it is recommended to use autoloader to reduce custom implementations where possible.
  • Exceptions : In some cases batch processing can be more cost efficient if data being received by infrequent loads and you need to use the platform only for limited amount of time, e.g once a week / once a day, it may be more efficient to run the cluster once a week.
  • By design the data movement of different data tables / documents is independent of each other, e.g they can be run in parallel.
  • bronze layer to have an additional set meta data columns, columns are the same for all tables, and to be updated automatically during the ingestion processes - view an example here

Gaps addressed by the target architecture :#

Click here view the engineering gaps and recommendations.

Data Transform#

Transformation Options#

There are several options available within Databricks for transformation. There are two types of data transformations: declarative and procedural.

  • Python - This is the most widely used approach for current data transformations. This is the procedural approach.
  • DLT - Not Used This is the simplest way to implement data transformations but it doesn't cover all use cases. Our recommendation is to use DLT when its applicable for your specific use case. refer to [this] () page for target architecture. This is the declarative approach.
  • DBT Framework - Not Used This is the declarative approach.

https://docs.databricks.com/aws/en/data-engineering/data-transformation Declarative data transformation using DLT is best when:

You require rapid development and deployment. Your data pipelines have standard patterns that do not require low-level control over execution. You need built-in data quality checks. Maintenance and readability are top priorities. Procedural data transformation using Apache Spark code is best when:

You are migrating an existing Apache Spark codebase to Databricks. You need fine-grained control over execution. You need access to low-level APIs such as MERGE or foreachBatch. You need to write data to Kafka or external Delta tables.

Transformation for Batch#
Transformation for Near Realtime#

Data Quality Management#

  • Data type conversations
  • Frameworks
  • managing data quality rules

Change Management#

  • changing the data model - silver / gold
  • handling change in business requirements

Deduplication#

  • Merging of records

Error Handling#

Historisation#

Historisation refers to the process of capturing and maintaining historical versions of data records. This is crucial for tracking changes over time , business purposes , auditing, data quality control and ensuring data integrity.

We suggest to identify the type of required historisation for every table at the time of data modelling / data product definition, historisation types should be recorded as metadata (e.g no-historisation, historisation based on data load timestamps, historisation based on business validity attributes (e.g address), bi-temporary historisation (2x dimensional) )

Approach:

  • Define historisation type for each table.
  • Document the requirements including the attributes for identifying validity of the record.
  • Use standard names for these attributes.
  • Implement a developer template to cater for the different types of historisation.
  • Be aware that formal referential integrity constraints cannot be used to verify integrity on history data. *this would only apply to the latest record only.

Retention Policies: Define policies to manage how long historical data is kept. This helps balance storage costs and compliance requirements.

Primary Key Management#

As a recommendation we would recommend to introduce data source agnostic primary keys for the silver layer.

Monitoring#

Hold "Alt" / "Option" to enable Pan & Zoom
screenshot

Data Orchestration#

Hold "Alt" / "Option" to enable Pan & Zoom
screenshot

Source to Landing#

  • Schedule : Databricks jobs or ADF triggering a pipeline - pre determined timeslots to run jobs.
  • Event driven : When the source publishes a signal or event that triggers a pipeline.
  • Data Streaming from Kafka : within databricks or chosen ETL tool
  • Constantly Monitor source for readiness : Anti Pattern - Not the preferred choice.

Landing to Bronze#

  • Trigger automatically upon file arriving

Data Serving#

![screenshot]/../assets/to-be-architecture/nn-datacore-architecture_overview_gaps_serve.drawio.svg)

Serving through API(mulesoft)#

Exporting to Interface Files#

Publishing Data Products to Marketplace#

Publishing data changes to Kafka#

Databricks SQL#

Data Sharing#

Common Components#

Hold "Alt" / "Option" to enable Pan & Zoom
screenshot

Batch Job to clean landing zone data Batch job to clean bronze layer.

Governance Components#

Hold "Alt" / "Option" to enable Pan & Zoom
screenshot

Data Quality#

https://databrickslabs.github.io/dqx/docs/motivation/

Below are the top 4 tools to consider when embedding data quality into your pipelines:

Framework Overview Key Features Integration with Databricks
Great Expectations Open-source framework for data validation, profiling, and documentation View here - Define data expectations in a human-readable format
- Automate data quality checks
- Generate data documentation View here
- Integrate with Databricks to validate data within ETL pipelines View here
Deequ Library built on Apache Spark for scalable data quality checks View here - Calculate data quality metrics
- Define and verify data quality constraints
- Anomaly detection and data profiling View here
- Integrate with Databricks to leverage Spark for large-scale data validation View here
Soda SQL Open-source tool for data testing, monitoring, and profiling View here - Write data quality checks using SQL
- Monitor data quality with automated alerts
- Extract metrics and column profiles View here
- Use with Databricks to run data quality checks on Delta Lake View here
Databricks Labs DQX Python-based framework for data quality checking at scale View here - Data profiling
- Generate data quality rules
- Validate data during processing
- Define criticality levels View here
- Seamlessly integrate with Databricks workflows for efficient data validation View here

These frameworks provide robust tools for ensuring data quality and validation in your Databricks environment, each with its unique strengths and integration capabilities.

Monitoring#

Pipeline Execution Monitoring#

Data Quality Monitoring#

Pipeline Performance Monitoring (telemetry)#

Data Access Monitoring#

  • out of scope

Cost Monitoring#

  • out of scope

Infrastructure Monitoring#

  • out of scope