Engineering Gaps and Recommendations#

Definition: A Data Platform Engineering Gaps and Recommendations document highlights shortcomings in the current data platform's architecture, processes, and tools, and provides recommendations to address them.


Version Control#

| Version | Date | Owner | Contributors | Change Description |
| --- | --- | --- | --- | --- |
| 0.1 | 8 April 25 | Gareth Stretch | Gareth Stretch | Populated Current architecture section. |

Table of Contents
  1. References
  2. General gaps and recommendations
  3. Ingestion gaps and recommendations
  4. Transform gaps and recommendations
  5. Serve gaps and recommendations
  6. Orchestration gaps and recommendations
  7. Common Components gaps and recommendations

Target Audience#

| Audience | Comment |
| --- | --- |
| Data Architects | They need to understand the current architecture to design and implement improvements or new solutions. |
| Data Engineering | They use the blueprint to comprehend the existing data pipelines, storage solutions, and processing frameworks in order to maintain and optimize them. |
| Compliance Officers | They ensure that the architecture complies with regulatory requirements and data governance policies. |
| IT Administrators | They ensure that the platform is running smoothly and securely, adhering to the architecture's guidelines and constraints. *Excluded from this scope. |
| Data Analysts | They need to understand how to manage data quality, compliance, integrity, and consistency. |
| Business Analysts | They use the blueprint to understand how data flows through the organization and how it can be leveraged for insights and decision-making. |

As-is Architecture#

The diagram below is a high-level overview of the Azure Databricks data platform, as highlighted in the as-is deliverable found here.

[Diagram: as-is architecture of the Azure Databricks data platform]

To-be Architecture#

The diagram below is a high-level overview of the Azure Databricks data platform, as highlighted in the to-be deliverable found here.

[Diagram: to-be architecture of the Azure Databricks data platform]

Christian : The as-is picture is somewhat old. We need to workshop the updates, as current pipelines have closed a lot of the gaps. Greyed-out boxes, such as ADF, should be called out specifically and documented through ADRs, e.g. limiting the number of tools.

Gudurn : The term Data Core seems to be mapped to Azure Databricks, but it is mixed with Data Hub. We need to align on what we mean by Data Core.

Jonatan : We should focus on clarifying what this is and what it means. Align the target architecture with the Engineering Playbook. Remove what is not relevant. Show what is in scope of the Data Engineering Playbook.

Jyotshna : We should be careful about our as-is picture, as engineering is across cloud. The diagram should only show what the playbook covers.

Neelesh : He feels Unity Catalog is at the centre of our diagram. Can we have foreign data sources logged in Unity Catalog? Jonatan says CDP is sharing data to study hub through Unity Catalog. These are supported and actively being explored.

Tim : Target architecture is causing confusion and is not multi-cloud.

General Gaps and Recommendations#

This section covers gaps and recommendations identified for general topics outside the pre-defined architecture pillars.

| Type | Priority | Area | Description | Recommendation / Justification | Workshop Notes |
| --- | --- | --- | --- | --- | --- |
| Gap | 1 | Naming Standard | Publish naming standards | Document and publish naming standards for all Databricks artifacts. Introduce a process to validate naming standards automatically. Maintain a central repository that documents all naming conventions grouped by topic, e.g. Azure data platform (subscriptions, resource groups, service principals, storage accounts, etc.) and then data platform (databases, schemas, views, tables, etc.). | Catalogue naming: data sources and the workspace name; catalogue name is enforced. Jonatan says it is a question of when you create a new catalogue; it is more about how you work with a catalogue. |
| Recommendation | 2 | Infrastructure | Lower the barrier to entry | First optimize the structure of the infrastructure request YAML file. The different types of infrastructure (managed node, managed node with external landing zone, self-managed node) should be covered in one request file. This will simplify and speed up the deployment process. Validate naming standards in the infrastructure request file. Consider automatically generating workspace names, catalogue names, and service principals based on the application name. This is for the platform engineering / orchestration team. | Jonatan says this is not a data engineering concern but he agrees with |
| Recommendation | 1 | Data Contract | Manual data contract definition is not scalable. An average data model contains several hundred tables, which can lead to a data contract definition file of tens of thousands of lines. Such files cannot be maintained manually. | Create the data model in erwin first. Generate part of the data contract automatically from the erwin model. Extend erwin to include data contract metadata. Generate DDL scripts automatically from the data model. | erwin doesn't make sense for bronze, as it maps to domain models (silver and above). Jonatan says we have not landed on a data contract. Christian agrees; there is a group working on it from a data modelling perspective. Data.world: start with the meta model, then the physical model, and this is tied to the data contract. TKEO + SLVA for data modelling. |
| Recommendation | 3 | Usage of inconsistent data | There is no mechanism to prevent the use of inconsistent data (e.g. data read in the middle of the load process, before the load has fully completed). | Provide access only to the fully loaded version of the dataset, either through interface views or using paradigms like blue/green deployment. | Jyotshna says this is a lower priority and should be handled through the CI/CD process. Oracle themselves don't honour their data consistency. |
| Recommendation | 2 | Historization | There is no common approach to historization on the Bronze, Silver and Gold layers. | Introduce a historization concept, with common attributes and metadata to support it. During the data modelling process, assign the required historization type to each table. Introduce common templates for the different historization types. | |
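
To make the historization recommendation more concrete, the sketch below shows what one common template could look like. It assumes a "Type 2" style of historization (close the current version, append a new one), which is only one of the possible historization types and is not prescribed by this document. The attribute names `valid_from`, `valid_to` and `is_current`, the `HistorizedRow` class and the `apply_scd2` function are all hypothetical, chosen for illustration.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class HistorizedRow:
    """One version of a record plus hypothetical common historization attributes."""
    business_key: str
    payload: dict
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None means the version is still open
    is_current: bool = True


def apply_scd2(history, business_key, payload, load_ts):
    """Return a new history list with a Type 2 style change applied.

    If the current version for the key has the same payload, nothing changes.
    Otherwise the current version is closed at load_ts and a new one is appended.
    """
    result = []
    changed = True
    for row in history:
        if row.business_key == business_key and row.is_current:
            if row.payload == payload:
                changed = False  # incoming data matches the open version
                result.append(row)
            else:
                # Close the open version instead of overwriting it.
                result.append(replace(row, valid_to=load_ts, is_current=False))
        else:
            result.append(row)
    if changed:
        result.append(HistorizedRow(business_key, payload, valid_from=load_ts))
    return result


# Usage: two loads for the same key with different payloads.
t1, t2 = datetime(2025, 1, 1), datetime(2025, 2, 1)
history = apply_scd2([], "customer-1", {"city": "Oslo"}, t1)
history = apply_scd2(history, "customer-1", {"city": "Bergen"}, t2)
```

After the second load, `history` holds two versions for the key: the first closed at `t2` and the second still open. On the platform itself the same pattern would more likely be expressed as a table merge, but the choice of attributes and the close-then-append logic are what a shared template would need to standardize.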