As Is Architecture - Continuous Integration#
Updated by Gareth Stretch / 2025.03.13
Version Control#
| Version | Date | Owner | Change Description |
|---|---|---|---|
| 0.1 | 18 March 2025 | Gareth Stretch | Gareth Stretch |
Definition#
Documenting the DevOps practices within our Databricks platform is essential for ensuring consistency, reliability, and efficiency in our development and operational processes. By clearly outlining the continuous integration and continuous delivery (CI/CD) pipelines, we provide a framework for automated testing and deployment, which minimizes errors and accelerates the release cycle. Detailed documentation of source code management practices, including version control and repository management, facilitates collaboration and maintains code integrity. This documentation serves as a valuable resource for onboarding new team members, troubleshooting issues, and maintaining best practices, ultimately fostering a culture of continuous improvement and innovation.
DEVOPS - Continuous Integration#
This section lists the processes specific to DevOps: CI/CD and Git / ADO integration.
Development happens in git repositories in ADO or GitHub(*). A repository defines a "component": the source code, documents, tests, and CI/CD configurations that are released together.
We use trunk-based development on the main branch, with squash merges from pull requests. Develop new functionality in feature branches and open a pull request once the work is ready to merge. After the initial release, main represents the latest releasable version of the solution.
CI/CD Pipelines#
Explain the continuous integration and continuous delivery (CI/CD) pipelines in place for code deployment
Deployments from CI/CD should be done using the ADO template or the GitHub template in dc-terraform-bootstrap. This guarantees that:
- remote state is used
- the remote state file does not collide with an existing state file from another repository

Furthermore, you should run terraform plan against PRD as part of the CI/CD on PRs, so that the actual impact of the pull request can be evaluated against what is currently deployed in PRD. If you do this, you must use the plan ADO template or the plan GitHub template in dc-terraform-bootstrap, as it does not lock or refresh the state file (so it does not impact PRD state).
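To illustrate why a plan that neither locks nor refreshes the state is safe to run against PRD, the sketch below assembles such an invocation. It does not reproduce the dc-terraform-bootstrap templates: the function names, the state-key naming scheme, and the repository name `my-component` are assumptions made for this example. The `-lock=false`, `-refresh=false`, and `-backend-config` options are standard Terraform CLI flags.

```python
from __future__ import annotations


def state_key_for(repository: str, environment: str) -> str:
    """Derive a remote state key that is unique per repository and environment.

    Hypothetical naming scheme; the real templates may use a different one.
    """
    return f"{repository}/{environment}.tfstate"


def read_only_plan_commands(repository: str, environment: str = "prd") -> list[list[str]]:
    """Return the terraform commands for a plan that leaves the state untouched."""
    init = [
        "terraform", "init", "-input=false",
        # Point the backend at this repository's own state file.
        f"-backend-config=key={state_key_for(repository, environment)}",
    ]
    plan = [
        "terraform", "plan", "-input=false",
        "-lock=false",     # do not take the state lock
        "-refresh=false",  # do not refresh the state against real infrastructure
    ]
    return [init, plan]


if __name__ == "__main__":
    for command in read_only_plan_commands("my-component"):
        print(" ".join(command))
```

Keying the state file on the repository name is one simple way to keep two repositories from writing to the same state file.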
CI/CD encompasses all code and configuration used to establish a controlled ADO pipeline or GitHub workflow to deploy the solution to PRD. It must use the latest main from dc-release.
- Its execution:
  - should be done from the shared agents provided by ADO or GitHub
  - may be done from a dedicated corp-connected agent for testing flows that require corp network access
- Unit tests:
  - must only be executed in DEV
- Integration tests:
  - may be executed in DEV
  - may be executed in TST
  - must be executed in VAL
  - may be executed in PRD

Ensure that the ADO job's displayName or the GitHub name correctly reflects the id of the test in the test strategy, so that the logs produce useful test reports.
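The execution rules above can be written down as data and checked automatically. The following is a minimal sketch, not an existing tool in our pipelines: the names `ALLOWED`, `REQUIRED`, and `violations` are hypothetical, and it reads "must only be executed in DEV" as "not allowed outside DEV".

```python
from __future__ import annotations

# Environments in which each test type is allowed to run.
ALLOWED = {
    "unit": {"DEV"},
    "integration": {"DEV", "TST", "VAL", "PRD"},
}

# Environments in which each test type has to run.
REQUIRED = {
    "unit": set(),
    "integration": {"VAL"},
}


def violations(planned: dict[str, set[str]]) -> list[str]:
    """Compare a planned test layout against the rules and list every breach."""
    problems = []
    # Tests scheduled in an environment where they are not allowed.
    for test_type, envs in planned.items():
        for env in sorted(envs - ALLOWED[test_type]):
            problems.append(f"{test_type} tests must not run in {env}")
    # Mandatory environments that the plan leaves out.
    for test_type, required in REQUIRED.items():
        for env in sorted(required - planned.get(test_type, set())):
            problems.append(f"{test_type} tests must run in {env}")
    return problems


if __name__ == "__main__":
    plan = {"unit": {"DEV", "TST"}, "integration": {"TST"}}
    for problem in violations(plan):
        print(problem)
```

A check like this could run early in a pipeline so that a misconfigured stage fails fast, before any test is executed.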
Source Code Management:#
Code is currently stored in Azure DevOps. Some projects are starting to migrate to GitHub.
- Release Management : View details here
Environment Promotion#
Describe the processes to promote the code and data pipelines to higher environments.
At DataCore, we have decided that executing production workloads should be done using Databricks jobs, rather than Databricks notebooks (or scheduled notebooks).
This choice was made to ensure that code is organized into distinct, testable modules, thereby establishing and maintaining high standards of code quality throughout the development process. We found that when creating packages as notebooks, there was a tendency to write them in a top-down manner, impacting not only the quality and efficiency but also the readability, maintainability, and documentation. As a result, we opted to use Databricks jobs deployed through Databricks bundles, enabling us to structure our workloads according to industry standards and best practices. The objective is to facilitate easier collaboration, debugging, and long-term scalability of projects.
Unit Testing#
Tests encompass all code that is not deployed to PRD but is executed during a release to establish assurance that the solution works, as described in the solution's test strategy.
Python code should be:
- unit tested using:
  - pytest to run them, with code coverage
  - databricks-connect against mock data (e.g. without interacting with Unity Catalog, via spark.createDataFrame)
- integration tested using:
  - Gherkin and behave to write and run them
  - databricks-sdk to interact with workspace, jobs and runs programmatically
  - databricks-connect to interact with data in the UC from the CI/CD environment

In general, use pytest and databricks-connect to test code, and integration tests to run end-to-end tests where data must be read from and/or written to Unity Catalog (acceptance verification).
View Gaps on testing.