As Is Architecture - Architecture Description#

As Is Architecture - Architecture Description
Version Control
Architecture Description
Data Source Types
- Naming Standards
Data Landing Zone
- Repo's used for ingestion evaluation :
Data Ingestion
Data Transform
Data Serve
Orchestration and Scheduling
Common Components

Version Control#

Version	Date	Owner	Change Description
0.1	18 March 2025	Gareth Stretch	Initial Framework created
0.2	15 April 2025	Gareth Stretch	Populated Ingest and Transform
0.3	23 April 2025	Gareth Stretch	Added Snowflake Ingestion

Architecture Description#

The architecture description documents the current state of DataCore in terms of tooling and processes aligned to the framework pillars (Ingest, Transform, Serve, Orchestrate and Common Components)

Data Source Types#

The data source types section lists the current data types are the are supported by DataCore. This is not a list of source systems but rather the categories of connection types and data formats.

Data Source Types :

The following data source types / document types / file types are supported through the ingestion processes.

Databricks --> Databricks can also serve data, eg this data in the context of data ingestion reads data from a databricks endpoint, This data lands directly to the bronze layer by-passing the landing zone ( Not Used Yet ).
Database --> Data read from a database is processed using Python scripts and skips the landing zone and goes straight into the lake as a parquet file using the data lake.
Json / XML / Excel / CSV / Text / Parquet --> These files are landed in the landing zone for storage based on life cycle policy, after which the file is processed and ingested in the bronze lake.
API --> Data read from API's is read directly to the landing zone in the original format.
KAFKA --> Data read from kafka lands directly in the landing zone is the original format.

Interface Types :

ODBC/JDBC - Connects to database source systems for ingestion database data
SFTP - Currently used for batch ingestion from local and cloud object storage into the databricks processing engine.
WebServices - Used for data ingestion from external systems. Not Used
Rest API's - Used for data ingestion from external systems. This data lands in the landing zone.
Kafka - Used for data ingestion from Kafka. This data lands in the landing zone. usually connected to the confluent Kafka API endpoint.

Types of Transmission :

Full Load : Full Loads are currently performed on file and database ingestions (they don't have the mechanism to support soft deletes from source system)
Incremental Load: Incremental loads are currently performed using date fields as watermark values
CDC : CDC is a technique for capturing changes in data sources and applying them to target systems. Not Used

ETL Method : listing ELT tools compared to what is currently being used. 1. Delta Live Tables (DLT): Not Used 2. Auto Loader: Not used Auto loader was tried and tested, presented challenges with complex file type, some files were zipped and difficult get working. 3. Python : Used for data ingestion and transformation. varying implementations across different ingestion pipelines. 4. SQL : Used mostly for creating objects in a data catalogue. SQL language is used far less than python. 5. ADF : Not used

Naming Standards#

The following should be documented as part of a data platform naming conventions. This section documents the naming standards as currently set out for the DataCore framework. Content provided based on evidence found. These include the following areas.

Landing zone folder structure : no documented / proposed structure. E.g ingestion pipeline is treated on a case by case basis.
Git repos, pipelines : https://datacore-docs.azurewebsites.net/development/adr/accepted/0002-repo-names/#proposal
Database Objects (databases and Schema's) : https://datacore-docs.azurewebsites.net/resources/reference/dc-ingest-datanowcdb/documentation/design_specification/#ds-dc-ingest-datanowcdb-4-landing-to-bronze
Tables and Views : No documents found
Workspaces : here
Columns: No documents found
Indexes and Constraints : No documents found
Stored Procedures and Functions : No documents found
Triggers : No documents found
Databricks jobs : No documents found
Security and roles : https://datacore-docs.azurewebsites.net/resources/explanation/access/#datacore-role-naming-convention
Service principles : hereand here
Resource groups : https://datacore-docs.azurewebsites.net/development/adr/draft/0003-tag-rgs-with-origin/#resource-group-naming_1

Findings :

DataCore roles / access groups has a documented naming convention:
Repositories proposal was submitted. Not fully adhered to. no evidence of conclusion.
Reference to databricks objects based on databricks recommendation. https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-names

Click here to view the documented gaps for naming standards.

Data Landing Zone#

In some cases, some ingestion pipelines make use of a landing zone and in some cases there is no landing zone. At present the landing zone data is stored in a separate subscription from that of the bronze layer. There is no documented guardrails pertaining to when or when not to use a landing zone, however from observation it appears that sources such as KAFKA, API and File use a landing zone and sources like SQL / Database

There are landing zone designs and pages containing definitions of the landing zone. Additionally there are pipelines designed around the landing principles.

In the context of the Azure Databricks deployment, the following is noted regarding the landing zone design.

Azure Gen2 Storage account for the landing zone.
The landing zone contains a folder called /landing/push-ingest for the source to push to. ("Landing Source")
The landing zone contains a folder called /landing/ ("Landing Process")
The landing zone folder structure considers failed and success structures
The Databricks pipeline ingesting the data has a first task, which is responsible for tracking and processing new data files coming in, using a state table.
If the file is in parquet there is no value in landing in the landing zone. (Databricks recommended)

/Volumes/{catalog}/{schema}/{provider}_volume/
├── table_ingest/
│   ├── succeeded/
│   │   └── YYYYMMDDTHHMMSS/  # Timestamp-based folders for successful ingestion's
│   └── failed/
│       └── YYYYMMDDTHHMMSS/  # Timestamp-based folders for failed ingestion's
└── volume_ingest/
    └── YYYYMMDDTHHMMSS/      # Timestamp-based folders for files arrived in SFTP

Repo's used for ingestion evaluation :#

Code and technology usage where performed using the links below.

https://datacore-docs.azurewebsites.net/resources/reference/dc-ingest-external-data/documentation/design_specification/#ds-dc-ingest-external-data-2-source-to-landing.
https://datacore-docs.azurewebsites.net/resources/reference/dc-ingest-edc/documentation/design_specification/#ds-dc-ingest-edc-3-az-copy-script-to-copy-historical-edc-data-from-cdr-to-datacore-landing-zone
https://datacore-docs.azurewebsites.net/resources/reference/dc-ingest-hot/documentation/design_specification/
https://datacore-docs.azurewebsites.net/resources/reference/dc-ingest-unixsce/documentation/design_specification/#ds-dc-ingest-unixsce-4-landing-to-bronze-task
https://dev.azure.com/novonordiskit/StudyHub%20-%20Data%20and%20Modelling/_git/data_modelling

Click here to view the documented gaps for landing zone standards.

Data Ingestion#

Hold "Alt" / "Option" to enable Pan & Zoom

Definition Bronze Layer : Raw data in delta format - The Bronze layer in the Medallion Architecture is the foundational stage where raw data is ingested and stored. The Bronze layer is designed to capture raw data from various source systems in its original format. Apply minimal transformations, primarily focused on data ingestion and storage. This includes adding metadata columns like load date/time and process ID This layer serves as the initial landing zone for all incoming data.

Maintains the raw state in the structure “as-is”
Data is immutable (read-only)
Delivery-based partitioned tables, i.e., YYYYMMDD
Mostly Parquet. Sometimes other formats
Can be any combination of streaming and batch transactions
May include extra metadata (schema)
May be fed from a “mediation layer”
Used for debugging, testing

Please not that silver and gold layer (transformation) is covered here as they form part of transform and not ingest.

Databricks offers several options for data ingestion pipelines, each suited to different needs: 1) Batch Ingestion: - Traditional Batch Ingestion: Load data in bulk at scheduled intervals or manually. - Incremental Batch Ingestion: Automatically detect and process new records, reducing the load on compute resources.
2) Streaming Ingestion: - Streaming Tables: Use SQL commands to load data incrementally from cloud storage1. - Auto Loader: Efficiently process new data files as they arrive in cloud storage like Amazon S3, Azure Data Lake Storage, or Google Cloud Storage1. 3) Delta Live Tables (DLT): - DLT Pipelines: Create and manage batch and streaming pipelines using SQL and Python23. DLT handles dependencies and automates data quality checks and transformations

This section covers the data ingestion patterns supported as seen by reviewing various pipelines found in github. Ingestion by definition covers the source to landing and landing to bronze.

1 - SFTP / File:#

Example 1 : https://github.com/NovoNordisk-DataCore/dc-ingest-rtsm-prancer/blob/main/src/etl/config.py Example 2 : https://datacore-docs.azurewebsites.net/resources/reference/dc-ingest-unixsce/documentation/design_specification/#ds-dc-ingest-unixsce-4-landing-to-bronze-task Example 3 : https://github.com/NovoNordisk-DataCore/dc-ingest-edc Example 4 : https://github.com/NovoNordisk-DataCore/dc-ingest-rtsm-prancer/blob/main/documentation/design_specification.md

The engineer uses python to load the data from SFTP server into the landing zone. This code runs runs directly on teh databricks cluster and is scheduled as a periodic job.

The ingestion is implemented as a Databricks job containing two tasks written in Python. For transfer of files between the UNIX SCE SFTP server, the Landing Zone, and the Bronze Layer, the following libraries are used:

Area	Example
Types of Data Source	SFTP
Formats	CSV, XML
Ingestion - Batch	Full load of files
Target Storage	Delta Table
Data Validation	Verify the file names to ensure they meet the agreed-upon pattern with TSS,Verify if there are any files to be ingested verify if the file to be ingested is not empty
Error Handling	try/catch / fail pipeline error handling.
Triggering Mechanism	Databricks Job
Frameworks Used	functools, Paramiko for SSH/SFTP operations, Cerberus for validation of configuration files, Databricks Utilities (dbutils) for Azure Blob Storage and Unity Catalog Volumes operations
Languages	Python, SQL

2 - Database : Findings for projects using SQL / Oracle as a source#

SQL / database ingestions skip the landing zone and goes straight into bronze.

Example 1 : https://dev.azure.com/novonordiskit/Project%20Stratus/_git/dc-ingest-dmw?path=/src/etl/main.py Example 2 : https://github.com/NovoNordisk-DataCore/dc-ingest-cdw

Area	Example
Types of Data Source	Oracle , SQL , MongoDB
Formats	external relational database
Ingestion - Batch	Full Load with override
Target Storage	Delta Table
Data Validation	None
Data Transformation	Complex querying from source and landing technically a "silver" dataset into bronze.
Error Handling	None, only basic logging.
Triggering Mechanism	Databricks jobs
Frameworks Used
Languages	Python, SQL

In the scenario above, the way the code is structured is query the source and creating a silver product without landing from source into a bronze. Additionally the queries are making use of lambda functions and regex functions in the joins making the job potentially expensive and more complex.

3 - KAFKA :#

Example 1 : https://dev.azure.com/novonordiskit/Project%20Stratus/_git/dc-ingest-kafka?path=/documentation/design_specification.md&_a=preview

Area	Example
Types of Data Source	Cosmos DB, Kafka Event Streams through API
Formats	Json
Ingestion - Streaming / Realtime	Structured Spark streaming, The topic message json strings are transformed to tabular format by denormalizing the json data.
Target Storage	Spark DataFrames to Delta tables , A merge operation is used to ingest data, ensuring that only the latest version is available
Data Validation
Error Handling	For each streaming task, logging is enabled. All tables are logged individually when starting and completing a streaming activity
Triggering Mechanism	Python code is deployed to run as a job in Databricks. When triggered, the job executes a task per source system, streaming from the source topics to the bronze layer.
Frameworks Used	confluent kafka, pylint, black, mypy, bandit
Languages	Python, SQL

4 - API : Findings for projects using API as a source#

There are two approaches to storing API Data.

1) Pattern 1 - Push to Landing

API is exposed through Mulesoft and data providers call the API. The raw file(usually JSON or XML) is placed in the landing zone for processing. This pattern is currently support in the AWS environment and Muleoft is not deployed in azure.

2) Pattern 2 - Pull to Landing

A custom integration to a 3rd party API lands the data in raw format in either the landing zone or bronze, this is performed using python.

Example 1 : https://dev.azure.com/novonordiskit/Project%20Stratus/_git/dc-ingest-snow?path=/src/dc_ingest_snow/etl.py Example 2 : https://dev.azure.com/novonordiskit/Project%20Stratus/_git/dc-ingest-studybuilder

Area	Example
Types of Data Source	Rest / API / Json
Formats	JSON / XML
Ingestion - Batch	Full loads
Target Storage	Delta Table - Full loads The full load of the SB API data to the bronze layer is performed four times per day
Data Validation
Error Handling
Triggering Mechanism	Databricks jobs
Frameworks Used	aioresult, pydantic_core
Languages	Python

5 - Snowflake : Findings for projects using Snowflake as a source#

Example 1 : https://dev.azure.com/novonordiskit/Project%20Stratus/_git/dc-ingest-cdp?path=/documentation/design_specification.md&_a=preview

Recovering from Pipeline errors :#

Introduction : When a data pipeline fails in Databricks, there are several options available for reloading data to ensure consistency and minimize downtime:

Automatic Retries - Delta Live Tables (DLT): You can configure automatic retries for DLT pipelines using properties like pipelines.maxFlowRetryAttempts and pipelines.numUpdateRetryAttempts1. These settings allow the pipeline to retry a specified number of times before failing completely.
Checkpointing - Streaming Queries: Enable checkpointing for streaming queries. This ensures that the query can restart from the last successful checkpoint, minimizing data loss2.
Manual Reload - Full Refresh: Manually trigger a full refresh of the target tables. This is useful if the pipeline has been down for an extended period and you need to ensure data consistency.
Error Handling and Alerts - Custom Error Handling: Implement custom error handling within your pipeline code to catch and log errors, and potentially trigger alerts or retries based on specific conditions. Monitoring and Alerts: Use Databricks monitoring tools to set up alerts for pipeline failures. This allows you to quickly respond and take corrective actions3.
Data Validation and Quality Checks - Data Quality Rules: Implement data quality rules within your pipeline to validate data before it is processed. This can help catch issues early and prevent pipeline failures.

This section describes how pipeline errors / failures are currently handled.

Most ingestion pipelines use databricks jobs for scheduling ingestions. Retry policies are used (which includes max number of retries and interval between retries). Additionally , there is error handling in the pipelines by means of Try-Catch blocks. There is no consistency in how these are implemented, varying techniques exists. In most cases the try catch logs the error to the logger and bubbles up and the entire pipeline fails. A comprehensive retry strategy needs to be documented and adhered to.

Gaps for Ingestion :#

In general, much of the code and plumbing is hand-rolled and manually written in python by the engineers. Using standard spark libraries will provide for native built in support for database connectivity, querying, configuration management SQL Execution, basic logging, auditing, error handling capabilities.

DDL statements are written directly in the code as part of the ETL process. e.g creation of catalogue, schema ,table : This practice is generally not a good practice. This should be extracted out of the python code and have seperate jobs that as part of the deployment create the required objects.

Click here to view the gaps identified for ingestion.

Data Transform#

Hold "Alt" / "Option" to enable Pan & Zoom

Definition - Transformation : The process of moving the data from bronze to silver to gold. Data transformation is the process of converting, cleansing, and structuring data into a format of the target layer (usually delta tables).

Definition - Silver : Filtered, cleaned and augmented - The Silver layer in the Medallion Architecture is designed to cleanse, conform, and standardize data from the Bronze layer (raw data). This layer ensures that the data is accurate, consistent, and ready for further analysis and reporting.

Uses data quality rules for validation
Usually only functional data
Historization is merged (SCD2)
Efficient storage format; Delta
Versioning for rolling back
Handles missing or incorrect data
Usually enriched with reference data
Source-oriented, although queryable and cluttered around subject areas
Usually used by operational analytical teams

Definition - Gold : Business-level aggregates - The Gold layer in the Medallion Architecture is the final stage where data is highly refined and optimized for business intelligence, analytics, and machine learning. The Gold layer is designed to provide curated, aggregated and optimized datasets that are ready for consumption by business users, analysts, and data scientists.

What enterprises call data products: consumer-ready / user-friendly data
Data is highly governed and well-documented
Historization is applied only for the set of use cases or consumers
Contains complex business rules, such as calculations and enrichments
Efficient storage format; Delta
Versioning for rolling back
Might contain additional sub layers for sharing or distributing data

This section lists below the content for data transformation techniques and tools in general as implemented by Novo Nordisk, it covers silver and gold where applicable.

Transformation Tooling#

There are several options available within Databricks for transformation. There are two types of data transformations: declarative and procedural.

Python - This is the most widely used approach for current data transformations. This is the procedural approach.
DLT - Not Used This is the simplest way to implement data transformations but it doesn't cover all use cases. Our recommendation is to use DLT when its applicable for your specific use case. refer to this page for target architecture. This is the declarative approach.
Spark Structured Streaming
Notebooks : DataCore have decided that executing production workloads should be done using Databricks jobs, rather than Databricks notebooks (or scheduled notebooks)Click here
Workflows
Databricks Machine Learning
DBT Framework - Used by StudyHub This is the declarative approach. Click here

Transformation for Batch#

Currently the most widely used write mode is "Overwrite" with a fully dataset being stored in silver.
CDC - Change data capture is not currently used.

Transformation for Near Realtime#

Structured streaming / Streaming tables is not used.

Data Quality Management#

The data quality management is usually done on stages of the pipeline (landing to bronze, bronze to silver, silver to gold), however the main quality validation processing is performed on the load to silver as one of the main goals of silver is to deliver curated data. The section below documents the findings for the silver layer.

Schema Enforcement : This process happens when the data loaded into bronze. Only basic validation of the data is performed using python.
Constraints: Not Used / Didnt find a plan / strategy pertaining to contstraint management The constraints guarantee the validity of the data but they require extra processing which leads to extra load time and extra cost.
Data Quality Rules : DQX The data rules should be documented in a data contract and then translated into constraints or expectations.Not Used
Monitoring and Alerts : Discuss the tools and processes for monitoring data quality and setting up alerts for anomalies.
Data Quality Framework options
Native - Databricks - DLT Not Used
Native - Databricks - Python - Great expectations Not Used

Change Management#

Change management is the process to modifying a data-pipeline to accommodate either new business requirements or changes in the datasources. One of the biggest challenges in data management is to ensure that existing code will not be broken and new bugs will not be introduced. For efficient change management automated regression testing is key.

Every change has to be assessed based on the impact to database structure schema. Small changes can be implemented by simple alter statements and small additions / changes of pipeline logic. There are changes that may require recreations of tables and as a result reprocessing or reloading of historical data. These changes usually require a dedicated project.

The art of the initial design is to minimize the need for the changes of the second type mentioned above.

Changes fitting the small change category should be separated from the changes fitting the larger change category and in the target state, a detailed change management process should be design which would include automating the process as much as possible.

Deduplication#

The process of de-duplication is avoiding the creation of duplicate records in the target layer. even when the same data is sent to the data platform, it will be inserted into the target layer only once. The most important part of de-duplication is defining the criteria to ensure a record is unique, this begins with an understanding of the data product requirements.

Case 1 : We consider records the same if they have the same primary key fro the source system, this method is used when we are not interested in only the latest version of the record.
Case 2 : We consider records the same if the primary key and a set of business attributes are the same.

Currently the process of de-duplication is by means of "Select DISTINCT" with full table reloads, this will not work for delta loads and CDC. Also "Select DISTINCT" compares all columns defined in the query which may not be the correct criteria of all use cases.

Options to consider : Develop a re-usable template that engineers use to configure and derive uniques for a given dataset. We can add to every dataset and new column (Calculated hashed value ) based on set of attributes that define uniqueness of the record which is then used for uniques checks. This can be defined as UDF which is stored in Unity Catalogue.

Error Handling#

There is a difference between error handling and data validation. Data validation is the process of finding an error / exception with the data record itself, error handling is the process to act upon the exception. There are two types of exceptions, critical and non critical. Critical exceptions lead to the termination of the pipeline and non-critical exception leads to recording reporting an exception. Some non critical exceptions can be automatically fixed by the cleansing process.

Example Pipeline : https://github.com/NovoNordisk-DataCore/dc-ingest-rtsm-prancer/blob/main/src/etl/source_to_bronze.py

Exception handling is performed differently across various projects. In most cases, if an exception is encountered, the pipeline terminates and closes any open connections.

Sample of Exception Handling below is an example of a critical exception which terminates the pipeline. The information of the error is logged to the logger.

  try:
        with ThreadPoolExecutor() as executor:
            conf["svcs"]["executor"] = executor
            result = run_workflow(conf)
    except Exception as err:  # pylint: disable=broad-except
        svcs["logger"].error(err, exc_info=True)
        result["error"] = "failed to run workflow"
    finally:
        sftp_close(conf)
    if "error" in result:
        raise RuntimeError("Task had error(s)")

- What is not defined is the clear separation of critical and non critical exceptions. When logging errors this allows you to catch errors, log them, categorize them and report on them. - Checkpointing is not used. This ensures that the pipeline can restart from the last successful checkpoint, minimizing data loss.

Historisation#

Historisation refers to the process of capturing and maintaining historical versions of data records. This is crucial for tracking changes over time , business purposes , auditing, data quality control and ensuring data integrity.

Currently there is limited Historisation on the bronze layer. Either the latest version of the record is available (no historisation) or data is just appended to the table without additional meta data required for efficient analysis of historical data.

From what has been reviewed, "overwrite mode" was used for full data reload.
Technically the previous version of the data is available and could be restored using databricks time travel, however this option is not designed for "real" Historisation and can rather be used for data recovery. This approach is not ideal for historical data analysis.

Primary Key Management#

Currently primary keys from Source systems are used as the primary key :

Pros's : - Simple implementation. Con's : - Heavy change management in case of source system change. - Keys across systems can collide. - Source systems sometimes only have composite keys which makes the record harder to use.

This is not an issue for DataCore as this is currently only for bronze.

Click here to view the transform gaps.
Click here to view the transform guardrails.

Data Serve#

Hold "Alt" / "Option" to enable Pan & Zoom

View Gaps here

Data serving involves making data available for consumption by applications, users, or other systems. This includes tools such as :

Databricks SQL: Provides a SQL-based interface for querying and serving data, ideal for analytics and reporting. This is the most commonly used approach in DataCore today. Delta Sharing: Delta Sharing is an open protocol for secure data sharing. It allows organizations to share data across different platforms and cloud providers in a secure and efficient manner .: This is not used currently although designs are in place for usage Databricks Model Serving: Allows you to deploy and serve machine learning models as endpoints, supporting both batch and real-time inference : This is not used currently. Query Interfaces via databricks API:

Serving through API#

Not Used - refer to target architecture here

Exporting to Interface Files#

Not Used :

Publishing Data Products to Marketplace#

Not Used While currently not used, POC's have been conducted and work is underway to agree on the processes aligned to data publishing into the Marketplace.

Publishing data changes to Kafka#

Not Used

Databricks SQL#

Serving data through Databricks SQL involves using Databricks SQL to query, analyze, and visualize data stored in your Databricks Lakehouse platform. This is the most frequently used option for Novo Nordisk.

Data sharing refers to the ability to share data securely with other users or organizations. Databricks provides several mechanisms for data sharing, including:

Delta Sharing: An open protocol developed by Databricks for secure data sharing with other organizations, regardless of the computing platforms they use. It allows you to share live data sets, models, dashboards, and notebooks across platforms, clouds, and regions.

Cross Regional Delta sharing is not currently used, however there are plans in place to enable this.

Unity Catalog: Facilitates data sharing by providing centralized governance and fine-grained access controls. It ensures that data sharing is secure and compliant with organizational policies.

Data Sharing Mechanisms today:

Delta Sharing + Cross region: : A data owner creates a share in Unity Catalog, which can include tables, table partitions, views, materialized views, volumes, notebooks, and AI models and then grants access to the share to one or more recipients. NovoAccess enabled Entra-ID groups are used to manage access to data.

The Real World Data Assets team will manage the access to the data. [view here(https://datacore-docs.azurewebsites.net/development/multi-region/#region-to-region-data-sharing)] Delta Sharing currently remains with the DataCore team, as it requires elevated permissions to create shares. - Databricks-to-Databricks Sharing: Databricks-to-Databricks Sharing is designed for sharing data and AI assets between Unity Catalog-enabled Databricks workspaces : Not Used. - Open Sharing Protocol: The Databricks Open Sharing Protocol, part of Delta Sharing, is designed to facilitate secure and real-time data sharing across different platforms and organizations. This is being explored here

Data Governance and Security:

Unity Catalog: Unity Catalog is used to manage and govern shared data, including access controls, auditing, and usage tracking.
Access Controls: DataCore has been managing access to data via RBAC model in dc-interface-agreements. This approach is intended for system to system access, where a system is a single identity. While DataCore itself does not have end users due to its infrastructure nature, business applications built on top of DataCore do have end users (persons), and do need to establish a robust access model for them
Encryption: Encryption is enabled at the storage layer : Data encryption at rest with AES-256 and full disk encryption.

Data Sharing Use Cases:

Internal Collaboration: Outline how data is shared within the organization to facilitate collaboration between different teams and departments.
External Collaboration: Document use cases where data is shared with external partners, clients, or vendors for joint projects or data-driven initiatives. 4. Data Sharing Tools and Interfaces:
Databricks Marketplace: Currently in design and pending implementation.
Clean Rooms: : Databricks Clean Rooms are secure, privacy-centric environments designed for collaborative data analysis and AI without providing direct access to raw data Clean Rooms is not used

Monitoring and Auditing:

Usage Tracking: Document the tools and methods used to track the usage of shared data, including who accessed the data and how it was used.
Audit Logs: Databricks logs across the different nodes are collected for job runs, pipeline runs, there is no central dashboard collating this logging information.

Click here to view the serve gaps.
Click here to view the serve guardrails.

Orchestration and Scheduling#

View Gaps here

Orchestration and scheduling in Databricks involve coordinating and managing the execution of data processing tasks and workflows to ensure they run efficiently and reliably.

Orchestration : Orchestration in Databricks refers to the process of automating and managing complex workflows that involve multiple tasks. These tasks can include data ingestion, transformation, and machine learning model training. (Jobs, Tasks, Control Flow and Integration)

Scheduling : Scheduling in Databricks involves setting up jobs to run at specific times or intervals. This ensures that data processing tasks are executed automatically without manual intervention. (Triggers, Parameters and Notifications)

*Databricks Tools for Orchestration and Scheduling: *

Databricks Jobs : Jobs and tasks are currently used in the context of DataCore Ingestion orchestration and job scheduling, These are generally well documented within each ingestion project repository. Production workloads must be executed on Databricks' jobs, as opposed to Databricks notebooks (or scheduled notebooks). These are documented here for production workloads
Delta Live Tables (DLT) : DLT allows you to define and manage data pipelines with built-in data quality checks and monitoring : Not Used
Databricks Workflows : These are used with very basic implementation.
Integration with External Tools : Databricks can integrate with Apache Airflow, Dagster for more advanced workflow automation and management and For users on Azure, Databricks can be integrated with Azure Data Factory to orchestrate data workflows across various services, however the view on Datafactory is that it is a bloated, buggy tool and the need for an opensource tool is a better option : Not Used :

Click here to view the orchestration gaps.
Click here to view the orchestration guardrails.

Common Components#

Common Components are components that every data platform should have, these common components should be unified tool, templates and processes that each engineering team can leverage in order to affectively implement data solutions and write less code. The components include

Data Validation Components
Exception Handling Components
Pipeline Monitoring Components
Testing Components
Devops Components
Data Maintenance Components

They are discussed below in detail.

Data Validation#

The topics below are considered essential with working with data in databricks. This section lists each topic and then explains the current usage.

Schema Enforcement#

Description : Ensure that data written to a table adheres to a predefined schema. This helps in maintaining data quality by enforcing schema and constraint checks on write .
Implementation : Use Databricks Delta Lake to enforce schema rules and constraints.
As-Is Notes :

Table Constraints#

Description: Apply constraints such as NOT NULL, CHECK, primary key, and foreign key constraints to ensure data integrity.
Implementation: Use SQL commands to add constraints to tables.
As-Is Notes :

Define and Configure Expectations Using Delta Live Tables (DLT)#

Description: Set expectations for data quality when declaring materialized views or streaming tables. Configure these expectations to warn about violations, drop violating records, or fail workloads based on violations.
Implementation: Use DLT to define and manage data quality expectations.
As-Is Notes :

Data Monitoring#

Description: Continuously monitor data quality and performance over time to detect and address issues.
Implementation: Implement monitoring tools and dashboards to track data quality metrics.
As-Is Notes :

Cast Data Types#

Description: Ensure that data types are correctly cast to maintain consistency and prevent errors.
Implementation: Use SQL commands to cast data types appropriately.
As-Is Notes :

Custom Business Logic#

Description: Define custom rules and logic to validate data according to specific business requirements.
Implementation: Implement custom validation logic using SQL or Python.
As-Is Notes :

Pattern Enforcement#

Description: Use regular expressions (regex) to enforce expected patterns in data fields, ensuring data adheres to specific formats.
Implementation: Use REGEXP or RLIKE functions in SQL to enforce patterns.
As-Is Notes :

Value Enforcement#

Description: Enforce value ranges on columns to ensure only valid values are inserted or updated.
Implementation: Use CHECK constraints in SQL to define valid value ranges.
As-Is Notes :

Handling Null or Missing Values#

Description: Enforce NOT NULL constraints to prevent null values in critical columns.
Implementation: Apply NOT NULL constraints to columns where null values are not acceptable.
As-Is Notes :

Data Validation Strategy#

Description: Develop a comprehensive strategy for data validation, including user acceptance criteria and test strategies.
Implementation: Define validation steps and criteria early in the project lifecycle.
As-Is Notes :

Exception Handling#

Custom Error Handling Functions#

Description: Utilize custom error handling functions to capture and log exceptions, enabling better troubleshooting and analysis.
Implementation: Define custom functions in your code to handle specific types of errors and log detailed information.
As-Is Notes :

Automatic Retry Mechanisms#

Description: Implement retry logic to handle transient failures gracefully and ensure job resilience.
Implementation: Use libraries like retrying to automatically retry operations that fail due to temporary issues.
As-Is Notes :

Comprehensive Logging#

Description: Integrate logging functionality to record detailed error messages, stack traces, and execution context for improved debugging.
Implementation: Use logging frameworks to capture and store logs systematically.
As-Is Notes :

Try-Catch Blocks#

Description: Use try-catch blocks to handle exceptions and prevent them from causing the entire job to fail.
Implementation: Wrap critical code sections in try-catch blocks to manage exceptions locally.
As-Is Notes :

Graceful Degradation#

Description: Ensure that your application can continue to operate in a degraded mode when certain errors occur.
Implementation: Design your workflows to handle partial failures and continue processing unaffected parts.
As-Is Notes :

Alerting and Monitoring#

Description: Set up alerting mechanisms to notify you when exceptions occur, and monitor the system for unusual patterns.
Implementation: Use Databricks monitoring tools and integrate with external alerting systems like PagerDuty or Slack.
As-Is Notes :

Exception Handling in Notebooks#

Description: Include exception handling in notebooks to manage errors during interactive data analysis.
Implementation: Use try-except blocks in notebooks and ensure proper error handling for each cell.
As-Is Notes :

Resource Cleanup#

Description: Ensure that resources are properly cleaned up after an exception occurs to prevent resource leaks.
Implementation: Use finally blocks or context managers to release resources like database connections or file handles.

Custom Exception Classes#

Description: Define custom exception classes to represent specific error conditions in your application.
Implementation: Create custom exceptions that inherit from the base exception class and use them to signal specific error conditions.
As-Is Notes :

Documentation and Training#

Description: Document your exception handling strategy and train your team on best practices.
Implementation: Maintain comprehensive documentation and conduct training sessions to ensure everyone understands how to handle exceptions effectively.
As-Is Notes :

Data Pipeline Monitoring#

Testing#

In the context of creating data pipelines and data products within Databricks, testing refers to the systematic process of validating and verifying that the data workflows, transformations, and integrations function correctly and meet the desired quality standards.

This involves several key activities, including unit testing to ensure individual components work as expected, integration testing to verify that different parts of the pipeline interact correctly, and end-to-end testing to validate the entire data flow from ingestion to final output. Additionally, testing encompasses data quality checks to ensure the accuracy, consistency, and completeness of the data, performance testing to assess the efficiency and scalability of the pipelines, and regression testing to confirm that new changes do not introduce errors or degrade existing functionality.

Effective testing strategies also involve the use of continuous integration/continuous deployment (CI/CD) pipelines to automate testing and deployment processes, comprehensive logging and monitoring to track pipeline performance and detect issues, and the use of version control to manage changes and maintain traceability. Overall, testing in Databricks ensures that data pipelines and products are reliable, maintainable, and capable of delivering high-quality data to support business needs.

Unit Testing#

Description: Implement unit tests to validate individual functions and components of your code .
Implementation: Use frameworks like pytest for Python, test that for R, and ScalaTest for Scala. Write unit tests within notebooks to ensure code correctness early in the development process.
As-Is Notes :

Integration Testing#

Description: Perform integration tests to ensure that different components of your pipeline work together as expected .
Implementation: Test the interactions between various modules, including data ingestion, transformation, and loading processes.
As-Is Notes :

End-to-End Testing#

Description: Conduct end-to-end tests to validate the entire data pipeline from raw data ingestion to final data delivery.
Implementation: Simulate real-world scenarios and workflows to ensure the pipeline operates correctly under expected conditions.
As-Is Notes :

Data Quality Checks#

Description: Integrate data quality checks to verify the accuracy, consistency, and completeness of data.
Implementation: Use Delta Live Tables (DLT) to define and enforce data quality expectations, and monitor data quality metrics.
As-Is Notes :

Performance Testing#

Description: Conduct performance tests to evaluate the efficiency and scalability of your data pipelines.
Implementation: Measure execution times, resource utilization, and throughput to identify bottlenecks and optimize performance.
As-Is Notes :

Continuous Integration/Continuous Deployment (CI/CD)#

Description: Set up CI/CD pipelines to automate testing, integration, and deployment processes.
Implementation: Use GitHub Actions, or Azure DevOps to automate the build, test, and deployment workflows.
As-Is Notes :

Version Control#

Description: Use version control systems to manage changes to your code and pipeline configurations.
Implementation: Integrate with Git to track code changes, collaborate with team members, and maintain a history of modifications.
As-Is Notes :

Data Maintenance and Archiving#

Devops#

View here

View Gaps here

Data Quality Monitoring#

Implementing data quality monitoring in Databricks involves leveraging several built-in features and tools to ensure the integrity, accuracy, and reliability of your data, eg Statistical Properties: Track data integrity, distribution, and drift over time. while the data quality metrics within each ingestion is defined in the functional specification, there is no consistent pattern or template for affectively measuring data quality. things like consistency, accuracy, validity, completeness, timeliness and uniqueness. As currently observed, these topics are address directly within each implementation in pockets. E.g there may be a select DISINCT clause in the SQL query, but no end to end data quality monitoring framework or implemenation.

Data Access Monitoring#

Audit Logs: Monitor user activities, workspace changes, and security events - out of scope

Cost Monitoring#

out of scope

Infrastructure Monitoring#

out of scope

View Gaps for Common Components here
View Guardrails for Common Components here

As Is Architecture - Architecture Description#

Version Control#

Architecture Description#

Data Source Types#

Naming Standards#

Data Landing Zone#

Repo's used for ingestion evaluation :#

Data Ingestion#

1 - SFTP / File:#

2 - Database : Findings for projects using SQL / Oracle as a source#

3 - KAFKA :#

4 - API : Findings for projects using API as a source#

5 - Snowflake : Findings for projects using Snowflake as a source#

Recovering from Pipeline errors :#

Gaps for Ingestion :#

Data Transform#

Transformation Tooling#

Transformation for Batch#

Transformation for Near Realtime#

Data Quality Management#

Change Management#

Deduplication#

Error Handling#

Historisation#

Primary Key Management#

Data Serve#

Serving through API#

Exporting to Interface Files#

Publishing Data Products to Marketplace#

Publishing data changes to Kafka#

Databricks SQL#

Data Sharing#

Orchestration and Scheduling#

Common Components#

Data Validation#

Schema Enforcement#

Table Constraints#

Define and Configure Expectations Using Delta Live Tables (DLT)#

Data Monitoring#

Cast Data Types#

Custom Business Logic#

Pattern Enforcement#

Value Enforcement#

Handling Null or Missing Values#

Data Validation Strategy#

Exception Handling#

Custom Error Handling Functions#

Automatic Retry Mechanisms#

Comprehensive Logging#

Try-Catch Blocks#

Graceful Degradation#

Alerting and Monitoring#

Exception Handling in Notebooks#

Resource Cleanup#

Custom Exception Classes#

Documentation and Training#

Data Pipeline Monitoring#

Testing#

Unit Testing#

Integration Testing#

End-to-End Testing#

Data Quality Checks#

Performance Testing#

Continuous Integration/Continuous Deployment (CI/CD)#

Version Control#

Data Maintenance and Archiving#

Devops#

Data Quality Monitoring#

Data Access Monitoring#

Cost Monitoring#

Infrastructure Monitoring#