Setting Up Your Project#
This guide will help you quickly scaffold a new data engineering project using the Data Pipeline Templates.
Overview#
Copier is a library and command-line tool for rendering project templates. It allows you to generate a new project structure from the template repositories, making it easy to start new projects with consistent structure and best practices.
The Data Engineering templates follow a modular architecture where you apply templates in sequence based on your project needs:
- dc-template-core: Base project structure with CI/CD skeleton and Databricks Asset Bundle
- dc-template-python-project: Python project setup with package management
- dc-template-data-product: Data product publishing and quality validation capabilities
What You Will Learn#
- How to scaffold a complete data engineering project using templates
- How to configure GitHub Actions for automated deployment
- How to connect your project to Azure Databricks
Prerequisites#
Required Access#
- Access to the innersource-nn GitHub organization, where the template repositories reside (NovoAccess Entitlement GitHub: NN-GitHub-Users)
- Access to a Databricks workspace where you can deploy and run your pipelines. For experimentation, you can use ToyHub; for your project, follow this guide to create or get access to an NNDCP Databricks environment.
- Service Principal with permissions to deploy to Databricks. This is created by the NNDCP team when you create a managed Databricks environment. Read more here.
Required Tools#
- Python 3.11+ - Python runtime environment
- pipx - Tool installer for Python applications
- hatch - Used for deployment
- Azure CLI - For Azure authentication
- Databricks CLI - Databricks related operations
- Copier 9.x - Template rendering tool
- Git CLI - Version control system
- GitHub CLI - For easy repository creation
Windows
Python
The recommended Python version is 3.11 or later.
pipx
py -m pip install --user pipx
py -m pipx ensurepath
:: Restart your Terminal.
Hatch:
pipx install "hatch<1.16.0"
pipx install hatch-pip-compile
Azure CLI:
pip install azure-cli
Databricks CLI:
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
Copier:
pipx install copier=="9.*"
pipx inject copier copier-templates-extensions
Good to have:
- Git CLI - Version control system
- GitHub CLI - For easy repository creation
Mac
Python
The recommended Python version is 3.11 or later.
pipx
brew install pipx
pipx ensurepath
source ~/.zshrc
Hatch:
pipx install "hatch<1.16.0"
pipx install hatch-pip-compile
Azure CLI:
brew install azure-cli
Databricks CLI:
brew tap databricks/tap
brew install databricks
Copier:
pipx install copier=="9.*"
pipx inject copier copier-templates-extensions
Good to have:
- Git CLI - Version control system
- GitHub CLI - For easy repository creation
Required Setup#
Follow the setup outlined in the dc-release user guide:
- A -requirements repo to capture the business solution requirements
- A -release-log repo where the release notes will be published by the CI/CD pipeline
- 2 GitHub apps/service principals: one with read permissions on your repos and the other with write permissions on the -release-log repo to publish release notes
Knowledge Prerequisites#
- Basic Git commands
- GitHub Actions and workflows
- Understanding of Python virtual environments
- Familiarity with Databricks
Step-by-Step Process#
Quick Trial (Sample/Training Use Only)
If you want to try out the templates quickly without provisioning a Databricks environment, follow the Quickstart Guide to use the templates in the ToyHub environment.
Step 1: Verify Copier Installation#
Confirm Copier is installed and available (see Required Tools above for installation):
copier --version
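Beyond Copier itself, the other tools listed under Required Tools can be checked in one pass. The following is a minimal sketch, assuming each tool is installed under its usual executable name (`az` for Azure CLI, `gh` for GitHub CLI, and so on); the helper name `find_missing_tools` is hypothetical and not part of the templates.

```python
import shutil

# Assumption: these are the usual executable names for the required tools.
REQUIRED_TOOLS = ["pipx", "hatch", "az", "databricks", "copier", "git", "gh"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools are available on PATH.")
```

This only confirms that an executable is discoverable; it does not check versions such as Copier 9.x or Python 3.11+.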
Step 2: Generate Core Project Structure#
Create your project directory and apply the core template:
copier copy --trust git@github.com:innersource-nn/dc-template-core.git my-project
Replace my-project with your preferred local path.
Copier will prompt you for project-specific information:
- project_name: Your project and repository name (lowercase, hyphens only)
- databricks_host: Your Databricks workspace URL (e.g., https://adb-xxx.azuredatabricks.net/)
- classic_compute: Whether you need classic Databricks compute or Serverless compute
- databricks_runtime_version: Verify the package versions based on the compute type selected
- databricks_runtime: Default version selected based on your input for databricks_runtime_version. Keep the default properties and do not modify them.
- development_target__manage_group_name: Azure AD group name that should have CAN_MANAGE permissions for development bundles.
- use_microsoft_teams_notifications: Would you like to receive Databricks job status notifications on Microsoft Teams?
- development_target__teams_webhook_id: Microsoft Teams webhook ID for Databricks job run notifications during development.
- apply_other_templates: Whether to initialise other templates; takes a boolean value (default is true)
- templates: If apply_other_templates is true, you can choose which other templates (e.g. python and marketplace) to initialise for your project
Example Prompts
🎤 project_name
cp-ingest-coupa
🎤 databricks_host
https://adb-4321798407796733.13.azuredatabricks.net/
🎤 classic_compute
No
🎤 databricks_runtime_version
Serverless environment version 4
🎤 development_target__manage_group_name
your_ad_group_name
🎤 use_microsoft_teams_notifications
true
🎤 development_target__teams_webhook_id
123456
🎤 apply_other_templates
true
🎤 templates
[dc-template-python-project, dc-template-data-product]
Expected Outcome: A project directory with core structure, GitHub Actions workflows, Databricks Asset Bundle file, and documentation templates. If you selected additional templates, they will be automatically applied.
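Some answers can be supplied up front rather than typed at the prompt, since the Copier CLI accepts `--data VARIABLE=VALUE` pairs. The sketch below only builds the argument list and checks project_name first. The regular expression is an assumption about what "lowercase, hyphens only" means (lowercase letters and digits separated by single hyphens); the template's own validation may differ, and `build_copier_command` is a hypothetical helper.

```python
import re
import shlex

# Assumption: lowercase letters/digits separated by single hyphens.
PROJECT_NAME_PATTERN = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")

TEMPLATE_URL = "git@github.com:innersource-nn/dc-template-core.git"


def build_copier_command(destination, answers):
    """Build a `copier copy` argument list with answers passed via --data."""
    name = answers.get("project_name", "")
    if not PROJECT_NAME_PATTERN.fullmatch(name):
        raise ValueError(f"Invalid project_name: {name!r}")
    command = ["copier", "copy", "--trust"]
    for key, value in answers.items():
        command += ["--data", f"{key}={value}"]
    return command + [TEMPLATE_URL, destination]


if __name__ == "__main__":
    command = build_copier_command(
        "my-project",
        {
            "project_name": "cp-ingest-coupa",
            "databricks_host": "https://adb-xxx.azuredatabricks.net/",
        },
    )
    # Print the command for review instead of executing it.
    print(shlex.join(command))
```

Copier still prompts for any question not covered by `--data`. Only simple string answers are shown here; how list-valued answers such as templates should be passed is not covered by this sketch.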
Step 3: Apply Python Project Template#
Note: If you selected dc-template-python-project during the Core Template setup (Step 2), you can skip this step.
Navigate to your project and apply the Python project template if you haven't already:
cd my-project
copier copy --trust git@github.com:innersource-nn/dc-template-python-project.git .
This adds:
- Python project configuration (pyproject.toml)
- Package management setup
- Databricks Asset Bundle configuration for deploying Python wheels
- Development dependencies and tools
Template Answers
The Python template automatically reads configuration from
.copier-answers.core.yml, so you won't be prompted again for previously
entered values.
Expected Outcome: Enhanced project with Python packaging capabilities and wheel deployment configuration.
Step 4: Apply Data Product Template#
Note: If you selected dc-template-data-product during the Core Template setup (Step 2), you can skip this step.
If you plan to publish data products to NN Data Marketplace or use data quality validation features, and haven't applied this template yet:
copier copy --trust git@github.com:innersource-nn/dc-template-data-product.git .
Copier will prompt you for project-specific information regarding marketplace and data quality features:
- enable_marketplace_publishing: Enable publishing to NNDM Data Marketplace? (Y/n) (Adds marketplace workflows and data contracts). If you choose Y, you'll be prompted for:
    - servicenow_business_application_id: ServiceNow Business Application ID
    - product_name, product_id, product_description: Basic marketplace metadata
    - data_contract_team_id, data_contract_owner_contact_email, data_contract_owner_contact_name: Ownership information
- enable_quality_validation: Include data quality validation capabilities? (Y/n) (Adds validation scripts, libraries, and optional features). If you choose Y, you'll be prompted for:
    - include_soda_checks: Include Soda Core for additional quality checks? (Y/n)
    - enable_databricks_job: Generate Databricks Asset Bundle for automated validation jobs? (Y/n)
    - include_example_notebooks: Include example Databricks notebooks for validation? (Y/n)
- novo_initials: Your Novo Nordisk initials
- project_description: Brief description of your data contracts
- catalog_base_name, schema_name: Databricks catalog and schema configuration
This adds:
- Data quality framework and Soda validation
- Data contract templates (if marketplace enabled)
- GitHub Actions workflow for NNDM publishing (if marketplace enabled)
- Configuration for data product metadata
NNDM Prerequisites
Before using marketplace publishing features, ensure you've been onboarded to the NNDM platform and have received your API key and CI tool credentials.
Expected Outcome: Project configured for publishing data products with contracts to NN Data Marketplace and data quality validation capabilities.
Step 5: Initialize Git and Create GitHub Repository#
Initialize Git in your project directory:
git init
git add .
git commit -m "Initial commit from data engineering templates"
Create a new repository on GitHub using the GitHub CLI:
gh repo create <org-name>/my-project --private --source=. --push
Or create manually through the GitHub web interface and then:
git remote add origin git@github.com:<org-name>/my-project.git
git branch -M main
git push -u origin main
Expected Outcome: Your project is now version-controlled and pushed to GitHub.
Step 6: Configure GitHub Apps, Secrets and Variables#
The CI/CD pipeline from the templates requires a few secrets and variables for deployment.
We recommend using GitOps (Configuration as Code) to manage secrets and variables, ensuring that changes are peer-reviewed and version-controlled.
GxP Compliance Note
Manual changes to variables through the GitHub Web UI are not recommended for regulated environments as they lack a clear audit trail.
Manage your GitHub repository configurations via DataCore's NovoNordisk-DataCore/github management repository by following these steps:
- Fork the NovoNordisk-DataCore/github repository.
- Add your repository configuration as a YAML file in /service_requests.
- Submit a Pull Request.
Please see an example configuration file in the NovoNordisk-DataCore/github repository for reference.
Required GitHub Apps & Secrets#
You need two service principals on GitHub (called GitHub Apps), which we call reader and release-log-writer.
- reader is able to read from your organization and requires the following permissions:
    - actions: read
    - contents: read
    - metadata: read
    - pull_requests: read
    - deployments: read
    - variables: read
- release-log-writer is able to write to release-log and requires the following permissions:
    - contents: read_and_write
    - metadata: read
| Secret Name | Description | How to Obtain |
|---|---|---|
| GH_APP_ID | reader GitHub App ID | Create a GitHub App in your organization with the necessary permissions* |
| GH_PRIVATE_KEY | reader GitHub App private key | Generate a private key after creating the GitHub App** |
| GH_RELEASE_LOG_APP_ID | release-log-writer GitHub App ID | Create a GitHub App in your organization with the necessary permissions* |
| GH_RELEASE_LOG_PRIVATE_KEY | release-log-writer GitHub App private key | Generate a private key after creating the GitHub App** |
| NNDM_API_KEY | NNDM API key | Provided by the NNDM team during onboarding |
| ACR_USERNAME | Created for NNDM CI-Tool permissions | Provided by NNDM team |
| ACR_PASSWORD | Created for NNDM CI-Tool permissions | Provided by NNDM team |
* To create a GitHub App, go to GitHub → User Settings → Developer settings → GitHub Apps → New GitHub App (Ref. GitHub Docs). Select the appropriate permissions as described above.

** After creating the GitHub App, generate a private key by navigating to the app's settings → Private keys → Generate new key. The .pem file will be downloaded automatically. Run the following command to generate the base64-encoded private key:

cat <path-to-pem-file> | base64

The encoded key will be printed to your terminal. Copy it and delete the private key file. Next, create two new organization secrets, one for the App ID and the other for the encoded private key. Name them according to the table above. Finally, in "Install App", select your organization to install the app.
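The same encoding can be produced from Python, which is convenient on machines without a `base64` command. This is a minimal sketch, not part of the templates; `encode_private_key` is a hypothetical helper name. It returns the whole key on a single line, whereas GNU `base64` wraps its output at 76 characters by default.

```python
import base64
from pathlib import Path


def encode_private_key(pem_path):
    """Return the PEM file's bytes as a single-line base64 string."""
    return base64.b64encode(Path(pem_path).read_bytes()).decode("ascii")


if __name__ == "__main__":
    import sys

    # Usage: python encode_key.py <path-to-pem-file>
    if len(sys.argv) == 2:
        print(encode_private_key(sys.argv[1]))
```

As with the shell command, delete the .pem file once the encoded value has been stored as a secret.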
Required Variables#
| Variable Name | Description | Example Value |
|---|---|---|
| RELEASE_CONFIGURATION | dc-release framework configuration (YAML) | See example below |
| AZURE_IAC_AGENT | Azure Infrastructure as Code Agent configuration (JSON) | See example below |
| DATABRICKS_DATACORE | Databricks DataCore workspace configuration (JSON) | See example below |
RELEASE_CONFIGURATION Variable
version: 2.0.0
name: YourApplicationName
github_organization:
  owner: YourGitHubOrg
requirements:
  git_system: github
  repository_name: your-requirements
logs:
  git_system: github
  repository_name: your-release-log
  branch: log
AZURE_IAC_AGENT Variable
{
"dev": {
"tenant_id": "12345678-1234-1234-1234-123456789abc",
"client_id": "87654321-4321-4321-4321-abcdef123456"
},
"tst": {
"tenant_id": "12345678-1234-1234-1234-123456789abc",
"client_id": "abcdef12-3456-7890-abcd-ef1234567890"
}
}
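A malformed ID in this variable tends to surface only when a workflow tries to authenticate, so it can help to check the JSON before submitting it. The sketch below assumes that every environment entry must carry a `tenant_id` and a `client_id` in UUID form, as in the example above; `check_iac_agent` is a hypothetical helper.

```python
import json
import uuid


def check_iac_agent(raw_json):
    """Return a list of problems found in an AZURE_IAC_AGENT value."""
    problems = []
    config = json.loads(raw_json)
    for env, settings in config.items():
        for key in ("tenant_id", "client_id"):
            value = settings.get(key)
            try:
                # uuid.UUID raises ValueError for badly formed strings.
                uuid.UUID(str(value))
            except ValueError:
                problems.append(f"{env}.{key} is not a valid UUID: {value!r}")
    return problems
```

An empty list means every environment passed the check; it does not confirm that the IDs exist in Azure.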
DATABRICKS_DATACORE Variable
{
"dev": {
"workspaces": {
"main": {
"host": "https://adb-1234567890123456.16.azuredatabricks.net/",
"teams_webhook_id": ""
}
},
"manage_group_name": "DC-NN-staffs_datacore-DEV_Contributor",
"service_principals": {
"ingest": "12345678-1234-1234-1234-123456789abc"
}
},
"tst": {
"workspaces": {
"main": {
"host": "https://adb-1234567890123456.16.azuredatabricks.net/",
"teams_webhook_id": ""
}
},
"manage_group_name": "DC-NN-staffs_datacore-TST_Contributor",
"service_principals": {
"ingest": "87654321-4321-4321-4321-abcdef123456"
}
}
}
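The shape of DATABRICKS_DATACORE can be checked in a similar way. The required keys below are inferred from the example above and are an assumption rather than a published schema; `check_databricks_datacore` is a hypothetical helper.

```python
import json
from urllib.parse import urlparse

# Assumption: keys inferred from the example value, not from a formal schema.
REQUIRED_KEYS = ("workspaces", "manage_group_name", "service_principals")


def check_databricks_datacore(raw_json):
    """Return a list of structural problems in a DATABRICKS_DATACORE value."""
    problems = []
    for env, settings in json.loads(raw_json).items():
        for key in REQUIRED_KEYS:
            if key not in settings:
                problems.append(f"{env}: missing {key}")
        for name, workspace in settings.get("workspaces", {}).items():
            host = urlparse(workspace.get("host", ""))
            if host.scheme != "https" or not host.netloc:
                problems.append(
                    f"{env}.workspaces.{name}: host must be an https URL"
                )
    return problems
```

Running it against the value you plan to submit gives an early signal before the Pull Request is reviewed.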
Legacy Approach (Not Recommended)#
While possible, adding secrets and variables through the GitHub Web UI (Settings → Secrets and variables → Actions → Secrets/Variables) should only be used for temporary testing.
Expected Outcome: Your project has all the necessary configurations and credentials set up.
Step 7: Set Up Azure Federated Credentials#
Configure federated authentication for GitHub Actions to deploy to Azure:
- Navigate to Azure Portal → App Registrations
- Select your Databricks service principal
- Go to Certificates & secrets → Federated Credentials → Add Credential
- Configure the credential:
    - Federated credential scenario: Other issuer
    - Issuer: https://token.actions.githubusercontent.com
    - Subject identifier: repo:<org-name>/<repo-name>:ref:refs/heads/*
    - Name: Descriptive name (e.g., github-actions-main)
    - Audience: api://AzureADTokenExchange
Permissions Required
Only App Owners can add federated credentials. Contact your Azure administrator if you don't have access.
Expected Outcome: GitHub Actions can authenticate to Azure with your service principal.
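The credential values from this step can also be assembled programmatically, for example to review them or keep them alongside other configuration. The sketch below builds a dictionary using the field names of the Microsoft Graph federated identity credential resource (`name`, `issuer`, `subject`, `audiences`). The helper `federated_credential` is hypothetical, and whether your tenant accepts this payload as-is is an assumption to verify with your Azure administrator.

```python
import json


def federated_credential(org, repo, name="github-actions-main"):
    """Build the federated credential values described in Step 7."""
    return {
        "name": name,
        "issuer": "https://token.actions.githubusercontent.com",
        # Subject pattern copied from the step above.
        "subject": f"repo:{org}/{repo}:ref:refs/heads/*",
        "audiences": ["api://AzureADTokenExchange"],
    }


if __name__ == "__main__":
    # "my-org" is a placeholder for <org-name>.
    print(json.dumps(federated_credential("my-org", "my-project"), indent=2))
```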
Step 8: Push to GitHub#
If you didn't push during step 5, add the GitHub repository as your remote origin:
git remote add origin <repository-url>
Replace <repository-url> with your GitHub repository URL (either HTTPS or SSH format).
Push your local commits to the remote repository:
git push -u origin main
Note
The -u flag sets the upstream branch, so future pushes can be done with just git push.
Expected Outcome: Your changes are pushed to the remote GitHub repository.
Success Metrics & Checkpoints#
- Project Created: Core and Python templates successfully applied
- Git Initialized: Local repository created with initial commit
- Remote Repository: Code pushed to GitHub successfully
- Secrets Configured: All required GitHub secrets and variables set
- Azure Authentication: Federated credentials configured for service principal
- CI/CD Ready: GitHub Actions workflows can trigger successfully
Project Structure Overview#
After completing these steps, your project will have the following structure:
my-project/
├── .github/
│   └── workflows/
│       ├── main.yaml
│       ├── deploy_and_test.yaml
│       ├── check_code_quality.yaml
│       └── acceptance_test.yaml
├── bundle/
│   ├── mappings/
│   ├── bundle.yaml
│   ├── permissions.yaml
│   ├── targets.yaml
│   └── variables.yaml
├── documentation/              # dc-release documentation
│   ├── functional_specification.md
│   ├── design_specification.md
│   ├── test_strategy.md
│   ├── risk_assessment.md
│   ├── operations_manual.md
│   └── recovery_procedure.md
├── databricks.yml              # Databricks Asset Bundle
├── .copier-answers.core.yml    # Core template answers
├── .copier-answers.python.yml  # Python template answers
└── README.md
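To confirm that a generated project matches this layout, a few of the listed paths can be checked with a short script. This is a sketch only: the path list is a small subset taken from the structure above, and `missing_paths` is a hypothetical helper.

```python
from pathlib import Path

# A subset of the paths shown in the project structure above.
EXPECTED_PATHS = [
    ".github/workflows/main.yaml",
    "databricks.yml",
    ".copier-answers.core.yml",
    "README.md",
]


def missing_paths(project_dir, expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under project_dir."""
    root = Path(project_dir)
    return [path for path in expected if not (root / path).exists()]


if __name__ == "__main__":
    missing = missing_paths("my-project")
    print("Missing:", missing if missing else "none")
```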
Common Challenges & Solutions
Challenge: Copier Command Not Found
Solution
Install Copier using pipx install copier=="9.*" (the quotes prevent the shell from expanding the *).
Ensure pipx is installed first: python3 -m pip install --user pipx
Challenge: GitHub Actions Failing with Authentication Errors
Solution
Verify federated credentials are correctly configured in Azure Portal
Challenge: ADO Authentication Error during copier copy
```text
error: Failed to fetch: https://pkgs.dev.azure.com/novonordiskit/_packaging/datacraft/pypi/simple/databricks-sdk/
Caused by: Missing credentials for ...
```
Solution
- Create a Personal Access Token (PAT) with the Packaging: Read scope (see instructions).
- Export your token before running Copier:
export AZURE_DEVOPS_EXT_PAT="your Token"
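If Copier is launched from a script, the presence of the token can be checked first so that the failure is reported before any template is fetched. This is a minimal sketch; `pat_is_set` is a hypothetical helper, and it only checks that the variable is non-empty, not that the token is valid.

```python
import os


def pat_is_set(environ=os.environ):
    """Return True when AZURE_DEVOPS_EXT_PAT is present and non-empty."""
    return bool(environ.get("AZURE_DEVOPS_EXT_PAT", "").strip())


if __name__ == "__main__":
    if not pat_is_set():
        raise SystemExit("AZURE_DEVOPS_EXT_PAT is not set; export it before running Copier.")
    print("AZURE_DEVOPS_EXT_PAT is set.")
```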
Challenge: Keyring prompt blocks Copier
> datacraft:
password_env_var: AZURE_DEVOPS_EXT_PAT
url: https://pkgs.dev.azure.com/novonordiskit/_packaging/datacraft/pypi/simple/
username_for_keyring: VssSessionToken
Solution
Press Esc and then Enter to dismiss the prompt.
Next Steps#
Now that your project is set up, you can proceed with:
- Work with your project locally - Set up and deploy your Databricks Pipeline from your local machine
- Build Your Data Product - Implement your data pipelines
- Deploy to Databricks - Learn about the deployment process