Data Engineering Playbook
Introduction
This section of the playbook guides data engineers through ingesting data into Databricks within DataCore. It provides detailed instructions, best practices, and practical examples for building efficient and reliable ingestion workflows. Topics include the fundamentals of data ingestion, setting up the Databricks environment, connecting to data sources (databases, APIs, Kafka streaming, and SFTP), and implementing both batch and streaming ingestion. The playbook also covers data transformation, quality assurance, and performance optimization. By following it, data engineers can learn to build their data pipelines in a structured way with the DataCore Framework.
Disclaimer
This engineering playbook assumes you have experience working with Databricks and some knowledge of Python coding, DevOps processes, CI/CD processes, and executing pipelines.
This playbook currently caters to NN DataCore (Databricks) only. Coverage of tools such as Snowflake, NNEDH, and NNEDL is planned.
The playbook covers the lifecycle of Data Product creation from an engineering perspective. Creation begins after the initial design phase, as depicted below, and this playbook covers only the processes within the "Creation" block.
It assumes that the design phase and its related documents and processes have been completed.
Data Engineer Development Process
Goals
By the end of this chapter, you should be familiar with the following topics:
- Setting up the Databricks environment
- Setting up an API to receive external data into the landing zone
- Implementing the pipeline that ingests data from the landing zone into Bronze
- Implementing the Bronze-to-Silver pipeline
- Implementing data validation and error handling
- Setting up orchestration and scheduling using Databricks Jobs
- Publishing the Data Product
Prerequisites
Before you begin, make sure you have completed the prerequisites. Click here to start your checklist.
Assumptions
Before you start building your Data Product, be aware of the following assumptions:
- The Data Product design phase is already complete, because its outputs feed directly into the build (the Data Contract, environment setups, the source system types that inform the ingestion patterns, and so on). Click here for a detailed data product development playbook.
- You and your team have been granted access to the source systems.
- Your Databricks environments are set up and configured. For guidance on the environment setup, click here.
- Your local development environment is set up (if not, refer to the section below).
- You already have basic knowledge of the following topics: Python programming, Databricks core components, working with GitHub, and using Visual Studio. Click here for more information.
Data Engineering Process
The workflow below depicts, at a high level, the process you can expect to follow when creating a data product. Not all steps are mandatory. For example:
- If you are the ingestion engineer, you will typically stop when the data reaches the bronze layer.
- If you are building silver or gold layer data products, you will typically start with the bronze-to-silver pipeline and skip the ingestion processes.
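As a rough illustration of how a role determines where you enter and leave the workflow, the sketch below models the pipeline steps as an ordered list and returns the slice relevant to each role. The stage names, role labels, and the `stages_for` helper are hypothetical placeholders chosen for this example; they are not DataCore Framework identifiers, and cross-cutting steps such as validation and scheduling are left out for brevity.

```python
from enum import Enum


class Role(Enum):
    # Hypothetical role labels, used only for this illustration.
    INGESTION = "ingestion"
    SILVER_GOLD = "silver_gold"


# Hypothetical stage names, ordered as the pipeline steps appear in the goals.
STAGES = [
    "landing_zone_api",    # receive external data into the landing zone
    "landing_to_bronze",   # ingest from the landing zone into bronze
    "bronze_to_silver",    # refine bronze data into silver
]


def stages_for(role: Role) -> list[str]:
    """Return the pipeline stages a given role would typically work on."""
    if role is Role.INGESTION:
        # Ingestion engineers typically stop once data reaches bronze.
        return STAGES[: STAGES.index("landing_to_bronze") + 1]
    # Silver/gold builders typically skip ingestion and start at bronze-to-silver.
    return STAGES[STAGES.index("bronze_to_silver"):]


if __name__ == "__main__":
    for role in Role:
        print(role.value, "->", stages_for(role))
```

Keeping the stages in a single ordered list makes the "not all steps are mandatory" idea explicit: each role works on a contiguous slice of the same workflow rather than on a separate process.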
The features of this playbook are aligned with the Databricks architecture pillars and cover these topics.