As Is Architecture - Architecture Description#
- As Is Architecture - Architecture Description
- Version Control
- Architecture Description
- Data Source Types
- Data Landing Zone
- Data Ingestion
- Data Transform
- Data Serve
- Orchestration and Scheduling
- Common Components
Version Control#
| Version | Date | Owner | Change Description |
|---|---|---|---|
| 0.1 | 18 March 2025 | Gareth Stretch | Initial Framework created |
| 0.2 | 15 April 2025 | Gareth Stretch | Populated Ingest and Transform |
| 0.3 | 23 April 2025 | Gareth Stretch | Added Snowflake Ingestion |
Architecture Description#
The architecture description documents the current state of DataCore in terms of tooling and processes aligned to the framework pillars (Ingest, Transform, Serve, Orchestrate and Common Components)
Data Source Types#
The data source types section lists the current data types are the are supported by DataCore. This is not a list of source systems but rather the categories of connection types and data formats.
Data Source Types :
The following data source types / document types / file types are supported through the ingestion processes.
- Databricks --> Databricks can also serve data, eg this data in the context of data ingestion reads data from a databricks endpoint, This data lands directly to the bronze layer by-passing the landing zone ( Not Used Yet ).
- Database --> Data read from a database is processed using Python scripts and skips the landing zone and goes straight into the lake as a parquet file using the data lake.
- Json / XML / Excel / CSV / Text / Parquet --> These files are landed in the landing zone for storage based on life cycle policy, after which the file is processed and ingested in the bronze lake.
- API --> Data read from API's is read directly to the landing zone in the original format.
- KAFKA --> Data read from kafka lands directly in the landing zone is the original format.
Interface Types :
- ODBC/JDBC - Connects to database source systems for ingestion database data
- SFTP - Currently used for batch ingestion from local and cloud object storage into the databricks processing engine.
- WebServices - Used for data ingestion from external systems. Not Used
- Rest API's - Used for data ingestion from external systems. This data lands in the landing zone.
- Kafka - Used for data ingestion from Kafka. This data lands in the landing zone. usually connected to the confluent Kafka API endpoint.
Types of Transmission :
- Full Load : Full Loads are currently performed on file and database ingestions (they don't have the mechanism to support soft deletes from source system)
- Incremental Load: Incremental loads are currently performed using date fields as watermark values
- CDC : CDC is a technique for capturing changes in data sources and applying them to target systems. Not Used
ETL Method : listing ELT tools compared to what is currently being used. 1. Delta Live Tables (DLT): Not Used 2. Auto Loader: Not used Auto loader was tried and tested, presented challenges with complex file type, some files were zipped and difficult get working. 3. Python : Used for data ingestion and transformation. varying implementations across different ingestion pipelines. 4. SQL : Used mostly for creating objects in a data catalogue. SQL language is used far less than python. 5. ADF : Not used
Naming Standards#
The following should be documented as part of a data platform naming conventions. This section documents the naming standards as currently set out for the DataCore framework. Content provided based on evidence found. These include the following areas.
- Landing zone folder structure : no documented / proposed structure. E.g ingestion pipeline is treated on a case by case basis.
- Git repos, pipelines : https://datacore-docs.azurewebsites.net/development/adr/accepted/0002-repo-names/#proposal
- Database Objects (databases and Schema's) : https://datacore-docs.azurewebsites.net/resources/reference/dc-ingest-datanowcdb/documentation/design_specification/#ds-dc-ingest-datanowcdb-4-landing-to-bronze
- Tables and Views : No documents found
- Workspaces : here
- Columns: No documents found
- Indexes and Constraints : No documents found
- Stored Procedures and Functions : No documents found
- Triggers : No documents found
- Databricks jobs : No documents found
- Security and roles : https://datacore-docs.azurewebsites.net/resources/explanation/access/#datacore-role-naming-convention
- Service principles : hereand here
- Resource groups : https://datacore-docs.azurewebsites.net/development/adr/draft/0003-tag-rgs-with-origin/#resource-group-naming_1
Findings :
- DataCore roles / access groups has a documented naming convention:
- Repositories proposal was submitted. Not fully adhered to. no evidence of conclusion.
- Reference to databricks objects based on databricks recommendation. https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-names
Click here to view the documented gaps for naming standards.
Data Landing Zone#
In some cases, some ingestion pipelines make use of a landing zone and in some cases there is no landing zone. At present the landing zone data is stored in a separate subscription from that of the bronze layer. There is no documented guardrails pertaining to when or when not to use a landing zone, however from observation it appears that sources such as KAFKA, API and File use a landing zone and sources like SQL / Database
There are landing zone designs and pages containing definitions of the landing zone. Additionally there are pipelines designed around the landing principles.
In the context of the Azure Databricks deployment, the following is noted regarding the landing zone design.
- Azure Gen2 Storage account for the landing zone.
- The landing zone contains a folder called /landing/push-ingest for the source to push to. ("Landing Source")
- The landing zone contains a folder called /landing/ ("Landing Process")
- The landing zone folder structure considers failed and success structures
- The Databricks pipeline ingesting the data has a first task, which is responsible for tracking and processing new data files coming in, using a state table.
- If the file is in parquet there is no value in landing in the landing zone. (Databricks recommended)
/Volumes/{catalog}/{schema}/{provider}_volume/
├── table_ingest/
│ ├── succeeded/
│ │ └── YYYYMMDDTHHMMSS/ # Timestamp-based folders for successful ingestion's
│ └── failed/
│ └── YYYYMMDDTHHMMSS/ # Timestamp-based folders for failed ingestion's
└── volume_ingest/
└── YYYYMMDDTHHMMSS/ # Timestamp-based folders for files arrived in SFTP
Repo's used for ingestion evaluation :#
Code and technology usage where performed using the links below.
- https://datacore-docs.azurewebsites.net/resources/reference/dc-ingest-external-data/documentation/design_specification/#ds-dc-ingest-external-data-2-source-to-landing.
- https://datacore-docs.azurewebsites.net/resources/reference/dc-ingest-edc/documentation/design_specification/#ds-dc-ingest-edc-3-az-copy-script-to-copy-historical-edc-data-from-cdr-to-datacore-landing-zone
- https://datacore-docs.azurewebsites.net/resources/reference/dc-ingest-hot/documentation/design_specification/
- https://datacore-docs.azurewebsites.net/resources/reference/dc-ingest-unixsce/documentation/design_specification/#ds-dc-ingest-unixsce-4-landing-to-bronze-task
- https://dev.azure.com/novonordiskit/StudyHub%20-%20Data%20and%20Modelling/_git/data_modelling
Click here to view the documented gaps for landing zone standards.
Data Ingestion#
Definition Bronze Layer : Raw data in delta format - The Bronze layer in the Medallion Architecture is the foundational stage where raw data is ingested and stored. The Bronze layer is designed to capture raw data from various source systems in its original format. Apply minimal transformations, primarily focused on data ingestion and storage. This includes adding metadata columns like load date/time and process ID This layer serves as the initial landing zone for all incoming data.
- Maintains the raw state in the structure “as-is”
- Data is immutable (read-only)
- Delivery-based partitioned tables, i.e., YYYYMMDD
- Mostly Parquet. Sometimes other formats
- Can be any combination of streaming and batch transactions
- May include extra metadata (schema)
- May be fed from a “mediation layer”
- Used for debugging, testing
Please not that silver and gold layer (transformation) is covered here as they form part of transform and not ingest.
Databricks offers several options for data ingestion pipelines, each suited to different needs:
1) Batch Ingestion:
- Traditional Batch Ingestion: Load data in bulk at scheduled intervals or manually.
- Incremental Batch Ingestion: Automatically detect and process new records, reducing the load on compute resources.
2) Streaming Ingestion:
- Streaming Tables: Use SQL commands to load data incrementally from cloud storage1.
- Auto Loader: Efficiently process new data files as they arrive in cloud storage like Amazon S3, Azure Data Lake Storage, or Google Cloud Storage1.
3) Delta Live Tables (DLT):
- DLT Pipelines: Create and manage batch and streaming pipelines using SQL and Python23. DLT handles dependencies and automates data quality checks and transformations
This section covers the data ingestion patterns supported as seen by reviewing various pipelines found in github. Ingestion by definition covers the source to landing and landing to bronze.
1 - SFTP / File:#
Example 1 : https://github.com/NovoNordisk-DataCore/dc-ingest-rtsm-prancer/blob/main/src/etl/config.py Example 2 : https://datacore-docs.azurewebsites.net/resources/reference/dc-ingest-unixsce/documentation/design_specification/#ds-dc-ingest-unixsce-4-landing-to-bronze-task Example 3 : https://github.com/NovoNordisk-DataCore/dc-ingest-edc Example 4 : https://github.com/NovoNordisk-DataCore/dc-ingest-rtsm-prancer/blob/main/documentation/design_specification.md
The engineer uses python to load the data from SFTP server into the landing zone. This code runs runs directly on teh databricks cluster and is scheduled as a periodic job.
The ingestion is implemented as a Databricks job containing two tasks written in Python. For transfer of files between the UNIX SCE SFTP server, the Landing Zone, and the Bronze Layer, the following libraries are used:
| Area | Example |
|---|---|
| Types of Data Source | SFTP |
| Formats | CSV, XML |
| Ingestion - Batch | Full load of files |
| Target Storage | Delta Table |
| Data Validation | Verify the file names to ensure they meet the agreed-upon pattern with TSS,Verify if there are any files to be ingested verify if the file to be ingested is not empty |
| Error Handling | try/catch / fail pipeline error handling. |
| Triggering Mechanism | Databricks Job |
| Frameworks Used | functools, Paramiko for SSH/SFTP operations, Cerberus for validation of configuration files, Databricks Utilities (dbutils) for Azure Blob Storage and Unity Catalog Volumes operations |
| Languages | Python, SQL |
2 - Database : Findings for projects using SQL / Oracle as a source#
SQL / database ingestions skip the landing zone and goes straight into bronze.
Example 1 : https://dev.azure.com/novonordiskit/Project%20Stratus/_git/dc-ingest-dmw?path=/src/etl/main.py Example 2 : https://github.com/NovoNordisk-DataCore/dc-ingest-cdw
| Area | Example |
|---|---|
| Types of Data Source | Oracle , SQL , MongoDB |
| Formats | external relational database |
| Ingestion - Batch | Full Load with override |
| Target Storage | Delta Table |
| Data Validation | None |
| Data Transformation | Complex querying from source and landing technically a "silver" dataset into bronze. |
| Error Handling | None, only basic logging. |
| Triggering Mechanism | Databricks jobs |
| Frameworks Used | |
| Languages | Python, SQL |
- In the scenario above, the way the code is structured is query the source and creating a silver product without landing from source into a bronze. Additionally the queries are making use of lambda functions and regex functions in the joins making the job potentially expensive and more complex.
3 - KAFKA :#
Example 1 : https://dev.azure.com/novonordiskit/Project%20Stratus/_git/dc-ingest-kafka?path=/documentation/design_specification.md&_a=preview
| Area | Example |
|---|---|
| Types of Data Source | Cosmos DB, Kafka Event Streams through API |
| Formats | Json |
| Ingestion - Streaming / Realtime | Structured Spark streaming, The topic message json strings are transformed to tabular format by denormalizing the json data. |
| Target Storage | Spark DataFrames to Delta tables , A merge operation is used to ingest data, ensuring that only the latest version is available |
| Data Validation | |
| Error Handling | For each streaming task, logging is enabled. All tables are logged individually when starting and completing a streaming activity |
| Triggering Mechanism | Python code is deployed to run as a job in Databricks. When triggered, the job executes a task per source system, streaming from the source topics to the bronze layer. |
| Frameworks Used | confluent kafka, pylint, black, mypy, bandit |
| Languages | Python, SQL |
4 - API : Findings for projects using API as a source#
There are two approaches to storing API Data.
1) Pattern 1 - Push to Landing
- API is exposed through Mulesoft and data providers call the API. The raw file(usually JSON or XML) is placed in the landing zone for processing. This pattern is currently support in the AWS environment and Muleoft is not deployed in azure.
2) Pattern 2 - Pull to Landing
- A custom integration to a 3rd party API lands the data in raw format in either the landing zone or bronze, this is performed using python.
Example 1 : https://dev.azure.com/novonordiskit/Project%20Stratus/_git/dc-ingest-snow?path=/src/dc_ingest_snow/etl.py Example 2 : https://dev.azure.com/novonordiskit/Project%20Stratus/_git/dc-ingest-studybuilder
| Area | Example |
|---|---|
| Types of Data Source | Rest / API / Json |
| Formats | JSON / XML |
| Ingestion - Batch | Full loads |
| Target Storage | Delta Table - Full loads The full load of the SB API data to the bronze layer is performed four times per day |
| Data Validation | |
| Error Handling | |
| Triggering Mechanism | Databricks jobs |
| Frameworks Used | aioresult, pydantic_core |
| Languages | Python |
5 - Snowflake : Findings for projects using Snowflake as a source#
Example 1 : https://dev.azure.com/novonordiskit/Project%20Stratus/_git/dc-ingest-cdp?path=/documentation/design_specification.md&_a=preview
To Be Finished : Added 23 | Area | Example | |--- |--- | | Types of Data Source | Rest / API / Json | | Formats | JSON / XML | | Ingestion - Batch | Full loads | | Target Storage | Delta Table| | Data Validation | | | Error Handling | | | Triggering Mechanism| Databricks jobs | | Frameworks Used| aioresult, pydantic_core | | Languages | Python, SQL |
Recovering from Pipeline errors :#
Introduction : When a data pipeline fails in Databricks, there are several options available for reloading data to ensure consistency and minimize downtime:
- Automatic Retries - Delta Live Tables (DLT): You can configure automatic retries for DLT pipelines using properties like pipelines.maxFlowRetryAttempts and pipelines.numUpdateRetryAttempts1. These settings allow the pipeline to retry a specified number of times before failing completely.
- Checkpointing - Streaming Queries: Enable checkpointing for streaming queries. This ensures that the query can restart from the last successful checkpoint, minimizing data loss2.
- Manual Reload - Full Refresh: Manually trigger a full refresh of the target tables. This is useful if the pipeline has been down for an extended period and you need to ensure data consistency.
- Error Handling and Alerts - Custom Error Handling: Implement custom error handling within your pipeline code to catch and log errors, and potentially trigger alerts or retries based on specific conditions. Monitoring and Alerts: Use Databricks monitoring tools to set up alerts for pipeline failures. This allows you to quickly respond and take corrective actions3.
- Data Validation and Quality Checks - Data Quality Rules: Implement data quality rules within your pipeline to validate data before it is processed. This can help catch issues early and prevent pipeline failures.
This section describes how pipeline errors / failures are currently handled.
Most ingestion pipelines use databricks jobs for scheduling ingestions. Retry policies are used (which includes max number of retries and interval between retries). Additionally , there is error handling in the pipelines by means of Try-Catch blocks. There is no consistency in how these are implemented, varying techniques exists. In most cases the try catch logs the error to the logger and bubbles up and the entire pipeline fails. A comprehensive retry strategy needs to be documented and adhered to.
Gaps for Ingestion :#
In general, much of the code and plumbing is hand-rolled and manually written in python by the engineers. Using standard spark libraries will provide for native built in support for database connectivity, querying, configuration management SQL Execution, basic logging, auditing, error handling capabilities.
DDL statements are written directly in the code as part of the ETL process. e.g creation of catalogue, schema ,table : This practice is generally not a good practice. This should be extracted out of the python code and have seperate jobs that as part of the deployment create the required objects.
Click here to view the gaps identified for ingestion.
Data Transform#
Definition - Transformation : The process of moving the data from bronze to silver to gold. Data transformation is the process of converting, cleansing, and structuring data into a format of the target layer (usually delta tables).
Definition - Silver : Filtered, cleaned and augmented - The Silver layer in the Medallion Architecture is designed to cleanse, conform, and standardize data from the Bronze layer (raw data). This layer ensures that the data is accurate, consistent, and ready for further analysis and reporting.
- Uses data quality rules for validation
- Usually only functional data
- Historization is merged (SCD2)
- Efficient storage format; Delta
- Versioning for rolling back
- Handles missing or incorrect data
- Usually enriched with reference data
- Source-oriented, although queryable and cluttered around subject areas
- Usually used by operational analytical teams
Definition - Gold : Business-level aggregates - The Gold layer in the Medallion Architecture is the final stage where data is highly refined and optimized for business intelligence, analytics, and machine learning. The Gold layer is designed to provide curated, aggregated and optimized datasets that are ready for consumption by business users, analysts, and data scientists.
- What enterprises call data products: consumer-ready / user-friendly data
- Data is highly governed and well-documented
- Historization is applied only for the set of use cases or consumers
- Contains complex business rules, such as calculations and enrichments
- Efficient storage format; Delta
- Versioning for rolling back
- Might contain additional sub layers for sharing or distributing data
This section lists below the content for data transformation techniques and tools in general as implemented by Novo Nordisk, it covers silver and gold where applicable.
Transformation Tooling#
There are several options available within Databricks for transformation. There are two types of data transformations: declarative and procedural.
- Python - This is the most widely used approach for current data transformations. This is the procedural approach.
- DLT - Not Used This is the simplest way to implement data transformations but it doesn't cover all use cases. Our recommendation is to use DLT when its applicable for your specific use case. refer to this page for target architecture. This is the declarative approach.
- Spark Structured Streaming
- Notebooks : DataCore have decided that executing production workloads should be done using Databricks jobs, rather than Databricks notebooks (or scheduled notebooks)Click here
- Workflows
- Databricks Machine Learning
- DBT Framework - Used by StudyHub This is the declarative approach. Click here
Transformation for Batch#
- Currently the most widely used write mode is "Overwrite" with a fully dataset being stored in silver.
- CDC - Change data capture is not currently used.
Transformation for Near Realtime#
- Structured streaming / Streaming tables is not used.
Data Quality Management#
The data quality management is usually done on stages of the pipeline (landing to bronze, bronze to silver, silver to gold), however the main quality validation processing is performed on the load to silver as one of the main goals of silver is to deliver curated data. The section below documents the findings for the silver layer.
- Schema Enforcement : This process happens when the data loaded into bronze. Only basic validation of the data is performed using python.
- Constraints: Not Used / Didnt find a plan / strategy pertaining to contstraint management The constraints guarantee the validity of the data but they require extra processing which leads to extra load time and extra cost.
- Data Quality Rules : DQX The data rules should be documented in a data contract and then translated into constraints or expectations.Not Used
- Monitoring and Alerts : Discuss the tools and processes for monitoring data quality and setting up alerts for anomalies.
- Data Quality Framework options
- Native - Databricks - DLT Not Used
- Native - Databricks - Python - Great expectations Not Used
Change Management#
Change management is the process to modifying a data-pipeline to accommodate either new business requirements or changes in the datasources. One of the biggest challenges in data management is to ensure that existing code will not be broken and new bugs will not be introduced. For efficient change management automated regression testing is key.
Every change has to be assessed based on the impact to database structure schema. Small changes can be implemented by simple alter statements and small additions / changes of pipeline logic. There are changes that may require recreations of tables and as a result reprocessing or reloading of historical data. These changes usually require a dedicated project.
The art of the initial design is to minimize the need for the changes of the second type mentioned above.
Changes fitting the small change category should be separated from the changes fitting the larger change category and in the target state, a detailed change management process should be design which would include automating the process as much as possible.
Deduplication#
The process of de-duplication is avoiding the creation of duplicate records in the target layer. even when the same data is sent to the data platform, it will be inserted into the target layer only once. The most important part of de-duplication is defining the criteria to ensure a record is unique, this begins with an understanding of the data product requirements.
- Case 1 : We consider records the same if they have the same primary key fro the source system, this method is used when we are not interested in only the latest version of the record.
- Case 2 : We consider records the same if the primary key and a set of business attributes are the same.
Currently the process of de-duplication is by means of "Select DISTINCT" with full table reloads, this will not work for delta loads and CDC. Also "Select DISTINCT" compares all columns defined in the query which may not be the correct criteria of all use cases.
Options to consider : Develop a re-usable template that engineers use to configure and derive uniques for a given dataset. We can add to every dataset and new column (Calculated hashed value ) based on set of attributes that define uniqueness of the record which is then used for uniques checks. This can be defined as UDF which is stored in Unity Catalogue.
Error Handling#
There is a difference between error handling and data validation. Data validation is the process of finding an error / exception with the data record itself, error handling is the process to act upon the exception. There are two types of exceptions, critical and non critical. Critical exceptions lead to the termination of the pipeline and non-critical exception leads to recording reporting an exception. Some non critical exceptions can be automatically fixed by the cleansing process.
Example Pipeline : https://github.com/NovoNordisk-DataCore/dc-ingest-rtsm-prancer/blob/main/src/etl/source_to_bronze.py
Exception handling is performed differently across various projects. In most cases, if an exception is encountered, the pipeline terminates and closes any open connections.
Sample of Exception Handling below is an example of a critical exception which terminates the pipeline. The information of the error is logged to the logger.
try:
with ThreadPoolExecutor() as executor:
conf["svcs"]["executor"] = executor
result = run_workflow(conf)
except Exception as err: # pylint: disable=broad-except
svcs["logger"].error(err, exc_info=True)
result["error"] = "failed to run workflow"
finally:
sftp_close(conf)
if "error" in result:
raise RuntimeError("Task had error(s)")
Historisation#
Historisation refers to the process of capturing and maintaining historical versions of data records. This is crucial for tracking changes over time , business purposes , auditing, data quality control and ensuring data integrity.
Currently there is limited Historisation on the bronze layer. Either the latest version of the record is available (no historisation) or data is just appended to the table without additional meta data required for efficient analysis of historical data.
- From what has been reviewed, "overwrite mode" was used for full data reload.
- Technically the previous version of the data is available and could be restored using databricks time travel, however this option is not designed for "real" Historisation and can rather be used for data recovery. This approach is not ideal for historical data analysis.
Primary Key Management#
Currently primary keys from Source systems are used as the primary key :
Pros's : - Simple implementation. Con's : - Heavy change management in case of source system change. - Keys across systems can collide. - Source systems sometimes only have composite keys which makes the record harder to use.
This is not an issue for DataCore as this is currently only for bronze.
Click here to view the transform gaps. Click here to view the transform guardrails.
Data Serve#
View Gaps here
Data serving involves making data available for consumption by applications, users, or other systems. This includes tools such as :
Databricks SQL: Provides a SQL-based interface for querying and serving data, ideal for analytics and reporting. This is the most commonly used approach in DataCore today. Delta Sharing: Delta Sharing is an open protocol for secure data sharing. It allows organizations to share data across different platforms and cloud providers in a secure and efficient manner .: This is not used currently although designs are in place for usage Databricks Model Serving: Allows you to deploy and serve machine learning models as endpoints, supporting both batch and real-time inference : This is not used currently. Query Interfaces via databricks API:
Serving through API#
Not Used - refer to target architecture here
Exporting to Interface Files#
Not Used :
Publishing Data Products to Marketplace#
Not Used While currently not used, POC's have been conducted and work is underway to agree on the processes aligned to data publishing into the Marketplace.
Publishing data changes to Kafka#
Not Used
Databricks SQL#
Serving data through Databricks SQL involves using Databricks SQL to query, analyze, and visualize data stored in your Databricks Lakehouse platform. This is the most frequently used option for Novo Nordisk.
Data Sharing#
Data sharing refers to the ability to share data securely with other users or organizations. Databricks provides several mechanisms for data sharing, including:
Delta Sharing: An open protocol developed by Databricks for secure data sharing with other organizations, regardless of the computing platforms they use. It allows you to share live data sets, models, dashboards, and notebooks across platforms, clouds, and regions.
Cross Regional Delta sharing is not currently used, however there are plans in place to enable this.
Unity Catalog: Facilitates data sharing by providing centralized governance and fine-grained access controls. It ensures that data sharing is secure and compliant with organizational policies.
- Data Sharing Mechanisms today:
- Delta Sharing + Cross region: : A data owner creates a share in Unity Catalog, which can include tables, table partitions, views, materialized views, volumes, notebooks, and AI models and then grants access to the share to one or more recipients. NovoAccess enabled Entra-ID groups are used to manage access to data.
The Real World Data Assets team will manage the access to the data. [view here(https://datacore-docs.azurewebsites.net/development/multi-region/#region-to-region-data-sharing)] Delta Sharing currently remains with the DataCore team, as it requires elevated permissions to create shares. - Databricks-to-Databricks Sharing: Databricks-to-Databricks Sharing is designed for sharing data and AI assets between Unity Catalog-enabled Databricks workspaces : Not Used. - Open Sharing Protocol: The Databricks Open Sharing Protocol, part of Delta Sharing, is designed to facilitate secure and real-time data sharing across different platforms and organizations. This is being explored here
- Data Governance and Security:
- Unity Catalog: Unity Catalog is used to manage and govern shared data, including access controls, auditing, and usage tracking.
- Access Controls: DataCore has been managing access to data via RBAC model in dc-interface-agreements. This approach is intended for system to system access, where a system is a single identity. While DataCore itself does not have end users due to its infrastructure nature, business applications built on top of DataCore do have end users (persons), and do need to establish a robust access model for them
- Encryption: Encryption is enabled at the storage layer : Data encryption at rest with AES-256 and full disk encryption.
- Data Sharing Use Cases:
- Internal Collaboration: Outline how data is shared within the organization to facilitate collaboration between different teams and departments.
-
External Collaboration: Document use cases where data is shared with external partners, clients, or vendors for joint projects or data-driven initiatives. 4. Data Sharing Tools and Interfaces:
-
Databricks Marketplace: Currently in design and pending implementation.
- Clean Rooms: : Databricks Clean Rooms are secure, privacy-centric environments designed for collaborative data analysis and AI without providing direct access to raw data Clean Rooms is not used
- Monitoring and Auditing:
- Usage Tracking: Document the tools and methods used to track the usage of shared data, including who accessed the data and how it was used.
- Audit Logs: Databricks logs across the different nodes are collected for job runs, pipeline runs, there is no central dashboard collating this logging information.
Click here to view the serve gaps. Click here to view the serve guardrails.
Orchestration and Scheduling#
View Gaps here
Orchestration and scheduling in Databricks involve coordinating and managing the execution of data processing tasks and workflows to ensure they run efficiently and reliably.
Orchestration : Orchestration in Databricks refers to the process of automating and managing complex workflows that involve multiple tasks. These tasks can include data ingestion, transformation, and machine learning model training. (Jobs, Tasks, Control Flow and Integration)
Scheduling : Scheduling in Databricks involves setting up jobs to run at specific times or intervals. This ensures that data processing tasks are executed automatically without manual intervention. (Triggers, Parameters and Notifications)
*Databricks Tools for Orchestration and Scheduling: *
- Databricks Jobs : Jobs and tasks are currently used in the context of DataCore Ingestion orchestration and job scheduling, These are generally well documented within each ingestion project repository. Production workloads must be executed on Databricks' jobs, as opposed to Databricks notebooks (or scheduled notebooks). These are documented here for production workloads
- Delta Live Tables (DLT) : DLT allows you to define and manage data pipelines with built-in data quality checks and monitoring : Not Used
- Databricks Workflows : These are used with very basic implementation.
- Integration with External Tools : Databricks can integrate with Apache Airflow, Dagster for more advanced workflow automation and management and For users on Azure, Databricks can be integrated with Azure Data Factory to orchestrate data workflows across various services, however the view on Datafactory is that it is a bloated, buggy tool and the need for an opensource tool is a better option : Not Used :
Click here to view the orchestration gaps. Click here to view the orchestration guardrails.
Common Components#
Common Components are components that every data platform should have, these common components should be unified tool, templates and processes that each engineering team can leverage in order to affectively implement data solutions and write less code. The components include
- Data Validation Components
- Exception Handling Components
- Pipeline Monitoring Components
- Testing Components
- Devops Components
- Data Maintenance Components
They are discussed below in detail.
Data Validation#
The topics below are considered essential with working with data in databricks. This section lists each topic and then explains the current usage.
Schema Enforcement#
Description : Ensure that data written to a table adheres to a predefined schema. This helps in maintaining data quality by enforcing schema and constraint checks on write . Implementation : Use Databricks Delta Lake to enforce schema rules and constraints. As-Is Notes :
Table Constraints#
Description: Apply constraints such as NOT NULL, CHECK, primary key, and foreign key constraints to ensure data integrity. Implementation: Use SQL commands to add constraints to tables. As-Is Notes :
Define and Configure Expectations Using Delta Live Tables (DLT)#
Description: Set expectations for data quality when declaring materialized views or streaming tables. Configure these expectations to warn about violations, drop violating records, or fail workloads based on violations. Implementation: Use DLT to define and manage data quality expectations. As-Is Notes :
Data Monitoring#
Description: Continuously monitor data quality and performance over time to detect and address issues. Implementation: Implement monitoring tools and dashboards to track data quality metrics. As-Is Notes :
Cast Data Types#
Description: Ensure that data types are correctly cast to maintain consistency and prevent errors. Implementation: Use SQL commands to cast data types appropriately. As-Is Notes :
Custom Business Logic#
Description: Define custom rules and logic to validate data according to specific business requirements. Implementation: Implement custom validation logic using SQL or Python. As-Is Notes :
Pattern Enforcement#
Description: Use regular expressions (regex) to enforce expected patterns in data fields, ensuring data adheres to specific formats. Implementation: Use REGEXP or RLIKE functions in SQL to enforce patterns. As-Is Notes :
Value Enforcement#
Description: Enforce value ranges on columns to ensure only valid values are inserted or updated. Implementation: Use CHECK constraints in SQL to define valid value ranges. As-Is Notes :
Handling Null or Missing Values#
Description: Enforce NOT NULL constraints to prevent null values in critical columns. Implementation: Apply NOT NULL constraints to columns where null values are not acceptable. As-Is Notes :
Data Validation Strategy#
Description: Develop a comprehensive strategy for data validation, including user acceptance criteria and test strategies. Implementation: Define validation steps and criteria early in the project lifecycle. As-Is Notes :
Exception Handling#
Custom Error Handling Functions#
Description: Utilize custom error handling functions to capture and log exceptions, enabling better troubleshooting and analysis. Implementation: Define custom functions in your code to handle specific types of errors and log detailed information. As-Is Notes :
Automatic Retry Mechanisms#
Description: Implement retry logic to handle transient failures gracefully and ensure job resilience. Implementation: Use libraries like retrying to automatically retry operations that fail due to temporary issues. As-Is Notes :
Comprehensive Logging#
Description: Integrate logging functionality to record detailed error messages, stack traces, and execution context for improved debugging. Implementation: Use logging frameworks to capture and store logs systematically. As-Is Notes :
Try-Catch Blocks#
Description: Use try-catch blocks to handle exceptions and prevent them from causing the entire job to fail. Implementation: Wrap critical code sections in try-catch blocks to manage exceptions locally. As-Is Notes :
Graceful Degradation#
Description: Ensure that your application can continue to operate in a degraded mode when certain errors occur. Implementation: Design your workflows to handle partial failures and continue processing unaffected parts. As-Is Notes :
Alerting and Monitoring#
Description: Set up alerting mechanisms to notify you when exceptions occur, and monitor the system for unusual patterns. Implementation: Use Databricks monitoring tools and integrate with external alerting systems like PagerDuty or Slack. As-Is Notes :
Exception Handling in Notebooks#
Description: Include exception handling in notebooks to manage errors during interactive data analysis. Implementation: Use try-except blocks in notebooks and ensure proper error handling for each cell. As-Is Notes :
Resource Cleanup#
Description: Ensure that resources are properly cleaned up after an exception occurs to prevent resource leaks. Implementation: Use finally blocks or context managers to release resources like database connections or file handles.
Custom Exception Classes#
Description: Define custom exception classes to represent specific error conditions in your application. Implementation: Create custom exceptions that inherit from the base exception class and use them to signal specific error conditions. As-Is Notes :
Documentation and Training#
Description: Document your exception handling strategy and train your team on best practices. Implementation: Maintain comprehensive documentation and conduct training sessions to ensure everyone understands how to handle exceptions effectively. As-Is Notes :
Data Pipeline Monitoring#
Testing#
In the context of creating data pipelines and data products within Databricks, testing refers to the systematic process of validating and verifying that the data workflows, transformations, and integrations function correctly and meet the desired quality standards.
This involves several key activities, including unit testing to ensure individual components work as expected, integration testing to verify that different parts of the pipeline interact correctly, and end-to-end testing to validate the entire data flow from ingestion to final output. Additionally, testing encompasses data quality checks to ensure the accuracy, consistency, and completeness of the data, performance testing to assess the efficiency and scalability of the pipelines, and regression testing to confirm that new changes do not introduce errors or degrade existing functionality.
Effective testing strategies also involve the use of continuous integration/continuous deployment (CI/CD) pipelines to automate testing and deployment processes, comprehensive logging and monitoring to track pipeline performance and detect issues, and the use of version control to manage changes and maintain traceability. Overall, testing in Databricks ensures that data pipelines and products are reliable, maintainable, and capable of delivering high-quality data to support business needs.
Unit Testing#
Description: Implement unit tests to validate individual functions and components of your code . Implementation: Use frameworks like pytest for Python, test that for R, and ScalaTest for Scala. Write unit tests within notebooks to ensure code correctness early in the development process. As-Is Notes :
Integration Testing#
Description: Perform integration tests to ensure that different components of your pipeline work together as expected . Implementation: Test the interactions between various modules, including data ingestion, transformation, and loading processes. As-Is Notes :
End-to-End Testing#
Description: Conduct end-to-end tests to validate the entire data pipeline from raw data ingestion to final data delivery. Implementation: Simulate real-world scenarios and workflows to ensure the pipeline operates correctly under expected conditions. As-Is Notes :
Data Quality Checks#
Description: Integrate data quality checks to verify the accuracy, consistency, and completeness of data. Implementation: Use Delta Live Tables (DLT) to define and enforce data quality expectations, and monitor data quality metrics. As-Is Notes :
Performance Testing#
Description: Conduct performance tests to evaluate the efficiency and scalability of your data pipelines. Implementation: Measure execution times, resource utilization, and throughput to identify bottlenecks and optimize performance. As-Is Notes :
Continuous Integration/Continuous Deployment (CI/CD)#
Description: Set up CI/CD pipelines to automate testing, integration, and deployment processes. Implementation: Use GitHub Actions, or Azure DevOps to automate the build, test, and deployment workflows. As-Is Notes :
Version Control#
Description: Use version control systems to manage changes to your code and pipeline configurations. Implementation: Integrate with Git to track code changes, collaborate with team members, and maintain a history of modifications. As-Is Notes :
Data Maintenance and Archiving#
Devops#
View here
View Gaps here
Data Quality Monitoring#
Implementing data quality monitoring in Databricks involves leveraging several built-in features and tools to ensure the integrity, accuracy, and reliability of your data, eg Statistical Properties: Track data integrity, distribution, and drift over time. while the data quality metrics within each ingestion is defined in the functional specification, there is no consistent pattern or template for affectively measuring data quality. things like consistency, accuracy, validity, completeness, timeliness and uniqueness. As currently observed, these topics are address directly within each implementation in pockets. E.g there may be a select DISINCT clause in the SQL query, but no end to end data quality monitoring framework or implemenation.
Data Access Monitoring#
Audit Logs: Monitor user activities, workspace changes, and security events - out of scope
Cost Monitoring#
- out of scope
Infrastructure Monitoring#
- out of scope
View Gaps for Common Components here View Guardrails for Common Components here