Setting Up Your Project

This guide will help you quickly scaffold a new data engineering project using the Data Pipeline Templates.

Overview

Copier is a library and command-line tool for rendering project templates. It allows you to generate a new project structure from the template repositories, making it easy to start new projects with consistent structure and best practices.

The Data Engineering templates follow a modular architecture where you apply templates in sequence based on your project needs:

dc-template-core: Base project structure with CI/CD skeleton and Databricks Asset Bundle
dc-template-python-project: Python project setup with package management
dc-template-data-product: Data product publishing and quality validation capabilities

What You Will Learn

How to scaffold a complete data engineering project using templates
How to configure GitHub Actions for automated deployment
How to connect your project to Azure Databricks

Prerequisites

Required Access

Access to the innersource-nn GitHub organization, where the template repositories reside (NovoAccess Entitlement GitHub : NN-GitHub-Users)
Access to a Databricks workspace where you can deploy and run your pipelines. For experimentation, you can use ToyHub; for your project, follow this guide to create or get access to an NNDCP Databricks environment.
A service principal with permissions to deploy to Databricks. This is created by the NNDCP team when you create a managed Databricks environment. Read more here.

Required Tools

Python 3.11+ - Python runtime environment
pipx - Tool installer for Python applications
hatch - Used for deployment
Azure CLI - For Azure authentication
Databricks CLI - Databricks related operations
Copier 9.x - Template rendering tool
Git CLI - Version control system
GitHub CLI - For easy repository creation
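As a quick preflight, the following Python sketch reports which of the tools listed above can be found on your PATH. The executable names (az for Azure CLI, gh for GitHub CLI, databricks for Databricks CLI) are the usual defaults and are an assumption about your installation; the helper name find_tools is illustrative only.

```python
import shutil

# Assumption: these are the default executable names for the tools listed above.
REQUIRED_TOOLS = ["pipx", "hatch", "az", "databricks", "copier", "git", "gh"]


def find_tools(names):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools(REQUIRED_TOOLS).items():
        print(f"{name:<12} {path or 'NOT FOUND'}")
```

Any tool reported as NOT FOUND can be installed with the platform-specific commands in this section.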

Windows

Python

The recommended Python version is 3.11 or later.

pipx

py -m pip install --user pipx
py -m pipx ensurepath
::   Restart your Terminal.

Hatch:

pipx install "hatch<1.16.0"
pipx install hatch-pip-compile

Azure CLI:

pip install azure-cli

Databricks CLI:

curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

Copier:

pipx install copier=="9.*"
pipx inject copier copier-templates-extensions

Good to have:

Git CLI - Version control system
GitHub CLI - For easy repository creation

Mac

Python

The recommended Python version is 3.11 or later.

pipx

brew install pipx
pipx ensurepath
source ~/.zshrc

Hatch:

pipx install "hatch<1.16.0"
pipx install hatch-pip-compile

Azure CLI:

brew install azure-cli

Databricks CLI:

brew tap databricks/tap
brew install databricks

Copier:

pipx install copier=="9.*"
pipx inject copier copier-templates-extensions

Good to have:

Git CLI - Version control system
GitHub CLI - For easy repository creation

Required Setup

Follow the setup outlined in the dc-release user guide. You will need:

A -requirements repo to capture the business solution requirements
A -release-log repo where the CI/CD pipeline publishes the release notes
Two GitHub Apps (service principals): one with read permissions on your repos and the other with write permissions on the -release-log repo to publish release notes

Knowledge Prerequisites

Basic Git commands
GitHub Actions and workflows
Understanding of Python virtual environments
Familiarity with Databricks

Step-by-Step Process

Quick Trial (Sample/Training Use Only)

If you want to try out the templates without provisioning a Databricks environment, follow the Quickstart Guide to use the templates in the ToyHub environment.

Step 1: Verify Copier Installation

Confirm Copier is installed and available (see Required Tools above for installation):

copier --version
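If you want to script this check, the sketch below runs copier --version and confirms that the major version is 9, matching the Copier 9.x requirement. The exact output format of the command is an assumption here (something like "copier 9.4.1"), so the parser only looks for the first dotted number; both function names are illustrative.

```python
import re
import subprocess


def parse_major_version(version_output: str) -> int:
    """Extract the major version from text such as 'copier 9.4.1'."""
    match = re.search(r"(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def copier_is_9x() -> bool:
    """Run `copier --version` and check that the major version is 9."""
    result = subprocess.run(
        ["copier", "--version"], capture_output=True, text=True, check=True
    )
    return parse_major_version(result.stdout) == 9
```

Calling copier_is_9x() raises FileNotFoundError when Copier is not on your PATH, which is itself a useful signal to revisit the installation steps.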

Step 2: Generate Core Project Structure

Create your project directory and apply the core template:

copier copy --trust git@github.com:innersource-nn/dc-template-core.git my-project

Replace my-project with your preferred local path.

Copier will prompt you for project-specific information:

project_name: Your project and repository name (lowercase, hyphens only)
databricks_host: Your Databricks workspace URL (e.g., https://adb-xxx.azuredatabricks.net/)
classic_compute: Whether you need classic Databricks compute rather than serverless compute
databricks_runtime_version: The runtime (or serverless environment) version to use; verify that its package versions suit the compute you selected
databricks_runtime: Defaults to a value derived from your databricks_runtime_version answer. Keep the default properties and do not modify them
development_target__manage_group_name: Azure AD group name that should have CAN_MANAGE permissions for development bundles.
use_microsoft_teams_notifications: Would you like to receive Databricks job status notifications on Microsoft Teams?
development_target__teams_webhook_id: Microsoft Teams webhook ID for Databricks job run notifications during development.
apply_other_templates: Whether to apply additional templates now (boolean, default is true)
templates: If apply_other_templates is true, the additional templates to apply to your project (e.g. python and marketplace)

Example Prompts

🎤 project_name
   cp-ingest-coupa
🎤 databricks_host
   https://adb-4321798407796733.13.azuredatabricks.net/
🎤 classic_compute
   No
🎤 databricks_runtime_version
   Serverless environment version 4
🎤 development_target__manage_group_name
   your_ad_group_name
🎤 use_microsoft_teams_notifications
   true
🎤 development_target__teams_webhook_id
   123456
🎤 apply_other_templates
   true
🎤 templates
   [dc-template-python-project, dc-template-data-product]

Expected Outcome: A project directory with core structure, GitHub Actions workflows, Databricks Asset Bundle file, and documentation templates. If you selected additional templates, they will be automatically applied.
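Before answering the prompts, it can help to sanity-check the two values that are easiest to get wrong. The sketch below is not part of the templates. It assumes that "lowercase, hyphens only" means lowercase letters and digits separated by single hyphens (as in cp-ingest-coupa), and that an Azure Databricks host is an https URL under azuredatabricks.net. Both function names are hypothetical.

```python
import re
from urllib.parse import urlparse

# Assumption: lowercase letters and digits, separated by single hyphens.
PROJECT_NAME_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_project_name(name: str) -> bool:
    """Check a candidate project_name against the assumed naming rule."""
    return PROJECT_NAME_PATTERN.fullmatch(name) is not None


def is_azure_databricks_host(url: str) -> bool:
    """Check that the URL uses https and points at an azuredatabricks.net host."""
    parsed = urlparse(url)
    hostname = parsed.hostname or ""
    return parsed.scheme == "https" and hostname.endswith(".azuredatabricks.net")


if __name__ == "__main__":
    print(is_valid_project_name("cp-ingest-coupa"))
    print(is_azure_databricks_host("https://adb-4321798407796733.13.azuredatabricks.net/"))
```

The template itself may enforce stricter or looser rules, so treat a failure here as a prompt to double-check rather than a definitive rejection.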

Step 3: Apply Python Project Template

Note: If you selected dc-template-python-project during the Core Template setup (Step 2), you can skip this step.

Navigate to your project and apply the Python project template if you haven't already:

cd my-project
copier copy --trust git@github.com:innersource-nn/dc-template-python-project.git .

This adds:

Python project configuration (pyproject.toml)
Package management setup
Databricks Asset Bundle configuration for deploying Python wheels
Development dependencies and tools

Template Answers

The Python template automatically reads configuration from .copier-answers.core.yml, so you won't be prompted again for previously entered values.

Expected Outcome: Enhanced project with Python packaging capabilities and wheel deployment configuration.

Step 4: Apply Data Product Template

Note: If you selected dc-template-data-product during the Core Template setup (Step 2), you can skip this step.

If you plan to publish data products to NN Data Marketplace or use data quality validation features, and haven't applied this template yet:

copier copy --trust git@github.com:innersource-nn/dc-template-data-product.git .

Copier will prompt you for project-specific information regarding marketplace and data quality features:

enable_marketplace_publishing: Enable publishing to NNDM Data Marketplace? (Y/n) Adds marketplace workflows and data contracts. If you choose Y, you'll be prompted for:
- servicenow_business_application_id: ServiceNow Business Application ID
- product_name, product_id, product_description: Basic marketplace metadata
- data_contract_team_id, data_contract_owner_contact_email, data_contract_owner_contact_name: Ownership information
enable_quality_validation: Include data quality validation capabilities? (Y/n) Adds validation scripts, libraries, and optional features. If you choose Y, you'll be prompted for:
- include_soda_checks: Include Soda Core for additional quality checks? (Y/n)
- enable_databricks_job: Generate Databricks Asset Bundle for automated validation jobs? (Y/n)
- include_example_notebooks: Include example Databricks notebooks for validation? (Y/n)
novo_initials: Your Novo Nordisk initials
project_description: Brief description of your data contracts
catalog_base_name, schema_name: Databricks catalog and schema configuration

This adds:

Data quality framework and Soda validation
Data contract templates (if marketplace enabled)
GitHub Actions workflow for NNDM publishing (if marketplace enabled)
Configuration for data product metadata

NNDM Prerequisites

Before using marketplace publishing features, ensure you've been onboarded to the NNDM platform and have received your API key and CI tool credentials.

Expected Outcome: Project configured for publishing data products with contracts to NN Data Marketplace and data quality validation capabilities.

Step 5: Initialize Git and Create GitHub Repository

Initialize Git in your project directory:

git init
git add .
git commit -m "Initial commit from data engineering templates"

Create a new repository on GitHub using the GitHub CLI:

gh repo create <org-name>/my-project --private --source=. --push

Or create manually through the GitHub web interface and then:

git remote add origin git@github.com:<org-name>/my-project.git
git branch -M main
git push -u origin main

Expected Outcome: Your project is now version-controlled and pushed to GitHub.

Step 6: Configure GitHub Apps, Secrets and Variables

The CI/CD pipeline from the templates requires a few secrets and variables for deployment.

We recommend using GitOps (Configuration as Code) to manage secrets and variables, ensuring that changes are peer-reviewed and version-controlled.

GxP Compliance Note

Manual changes to variables through the GitHub Web UI are not recommended for regulated environments as they lack a clear audit trail.

Manage your GitHub repository configurations via DataCore's NovoNordisk-DataCore/github management repository by following these steps:

Fork the NovoNordisk-DataCore/github repository.
Add your repository configuration as a YAML file in /service_requests.
Submit a Pull Request.

Please see an example configuration file in the NovoNordisk-DataCore/github repository for reference.

Required GitHub Apps & Secrets

You need two service principals on GitHub (called GitHub Apps), referred to here as reader and release-log-writer.

reader is able to read from your organization and requires the following permissions:
- actions: read
- contents: read
- metadata: read
- pull_requests: read
- deployments: read
- variables: read
release-log-writer is able to write to the -release-log repo and requires the following permissions:
- contents: read_and_write
- metadata: read

Secret Name	Description	How to Obtain
`GH_APP_ID`	`reader` GitHub App ID	Create a GitHub App in your organization with the necessary permissions*
`GH_PRIVATE_KEY`	`reader` GitHub App private key	Generate a private key after creating the GitHub App**
`GH_RELEASE_LOG_APP_ID`	`release-log-writer` GitHub App ID	Create a GitHub App in your organization with the necessary permissions*
`GH_RELEASE_LOG_PRIVATE_KEY`	`release-log-writer` GitHub App private key	Generate a private key after creating the GitHub App**
`NNDM_API_KEY`	NNDM API key	Provided by the NNDM team during onboarding
`ACR_USERNAME`	Created for NNDM CI-Tool permissions	Provided by NNDM team
`ACR_PASSWORD`	Created for NNDM CI-Tool permissions	Provided by NNDM team

* To create a GitHub App, go to GitHub → User Settings → Developer settings → GitHub Apps → New GitHub App. (Ref. GitHub Docs) Select appropriate permissions as described above.

** After creating the GitHub App, generate a private key by navigating to the app's settings → Private keys → Generate new key. The .pem file will be automatically downloaded. Run the following command to generate the base64-encoded private key:

cat <path-to-pem-file> | base64

The encoded token will be printed to your terminal. Copy it and delete the private key file. Next, create two new organization secrets, one for the App ID, and the other for the encoded private key. Name them according to the above table. Finally, in "Install App", select your organization to install the app.
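Some base64 implementations (GNU coreutils, for example) wrap their output across several lines by default. If you prefer a single-line value to paste into the secret, a minimal Python alternative is sketched below. The function name and the file path in the usage note are placeholders, and whether the pipeline requires an unwrapped value is not stated in this guide.

```python
import base64
from pathlib import Path


def encode_private_key(pem_path: str) -> str:
    """Return the contents of the .pem file as a single-line base64 string."""
    return base64.b64encode(Path(pem_path).read_bytes()).decode("ascii")
```

For example, print(encode_private_key("reader-app.private-key.pem")) prints the encoded key, where the file name stands in for the path of your downloaded key.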

Required Variables

Variable Name	Description	Example Value
`RELEASE_CONFIGURATION`	dc-release framework configuration (YAML)	See example below
`AZURE_IAC_AGENT`	Azure Infrastructure as Code Agent configuration (JSON)	See example below
`DATABRICKS_DATACORE`	Databricks DataCore workspace configuration (JSON)	See example below

RELEASE_CONFIGURATION Variable

version: 2.0.0
name: YourApplicationName
github_organization:
  owner: YourGitHubOrg
requirements:
  git_system: github
  repository_name: your-requirements
logs:
  git_system: github
  repository_name: your-release-log
  branch: log

AZURE_IAC_AGENT Variable

{
  "dev": {
    "tenant_id": "12345678-1234-1234-1234-123456789abc",
    "client_id": "87654321-4321-4321-4321-abcdef123456"
  },
  "tst": {
    "tenant_id": "12345678-1234-1234-1234-123456789abc",
    "client_id": "abcdef12-3456-7890-abcd-ef1234567890"
  }
}
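Because this variable is stored as raw JSON, a typo is easy to miss until a deployment fails. The following sketch checks a candidate value before you submit it. It assumes, based only on the example above, that every environment carries a tenant_id and a client_id in UUID form; the function name is hypothetical.

```python
import json
import uuid


def validate_azure_iac_agent(raw: str) -> list[str]:
    """Return a list of problems found; an empty list means no issues were detected."""
    problems = []
    config = json.loads(raw)
    for env, settings in config.items():
        for key in ("tenant_id", "client_id"):
            value = settings.get(key)
            try:
                uuid.UUID(str(value))
            except ValueError:
                problems.append(f"{env}.{key} is not a valid UUID: {value!r}")
    return problems
```

An empty result only means the shape looks right; it does not confirm that the IDs exist in your Azure tenant.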

DATABRICKS_DATACORE Variable

{
  "dev": {
    "workspaces": {
      "main": {
        "host": "https://adb-1234567890123456.16.azuredatabricks.net/",
        "teams_webhook_id": ""
      }
    },
    "manage_group_name": "DC-NN-staffs_datacore-DEV_Contributor",
    "service_principals": {
      "ingest": "12345678-1234-1234-1234-123456789abc"
    }
  },
  "tst": {
    "workspaces": {
      "main": {
        "host": "https://adb-1234567890123456.16.azuredatabricks.net/",
        "teams_webhook_id": ""
      }
    },
    "manage_group_name": "DC-NN-staffs_datacore-TST_Contributor",
    "service_principals": {
      "ingest": "87654321-4321-4321-4321-abcdef123456"
    }
  }
}
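A similar check can be sketched for DATABRICKS_DATACORE. It assumes that the keys shown in the example (workspaces, manage_group_name, service_principals, and a main workspace with an https host) are all required. That is an inference from the example, not a documented schema, and the function name is illustrative.

```python
import json

# Assumption: every environment needs the top-level keys shown in the example.
REQUIRED_KEYS = ("workspaces", "manage_group_name", "service_principals")


def check_databricks_datacore(raw: str) -> list[str]:
    """Return a list of problems found in a DATABRICKS_DATACORE JSON value."""
    problems = []
    for env, settings in json.loads(raw).items():
        for key in REQUIRED_KEYS:
            if key not in settings:
                problems.append(f"{env}: missing '{key}'")
        # The example nests the workspace URL under workspaces.main.host.
        host = settings.get("workspaces", {}).get("main", {}).get("host", "")
        if not host.startswith("https://"):
            problems.append(f"{env}: workspaces.main.host should be an https URL")
    return problems
```

An empty teams_webhook_id, as in the example, is accepted by this sketch.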

Legacy Approach (Not Recommended)

While possible, adding secrets and variables through the GitHub Web UI (Settings → Secrets and variables → Actions → Secrets/Variables) should only be used for temporary testing.

Expected Outcome: Your project has all the necessary configurations and credentials set up.

Step 7: Set Up Azure Federated Credentials

Configure federated authentication for GitHub Actions to deploy to Azure:

Navigate to Azure Portal → App Registrations
Select your Databricks service principal
Go to Certificates & secrets → Federated Credentials → Add Credential
Configure the credential:
- Federated credential scenario: Other issuer
- Issuer: https://token.actions.githubusercontent.com
- Subject identifier: repo:<org-name>/<repo-name>:ref:refs/heads/*
- Name: Descriptive name (e.g., github-actions-main)
- Audience: api://AzureADTokenExchange
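The credential fields listed above can also be assembled programmatically, which helps keep the subject identifier consistent across repositories. The sketch below builds them as a dictionary. The key names (name, issuer, subject, audiences) follow the JSON shape that, to our understanding, az ad app federated-credential create --parameters accepts; treat that mapping as an assumption and verify it against the Azure CLI documentation. The organization and repository names are placeholders.

```python
import json


def build_federated_credential(org: str, repo: str, name: str) -> dict:
    """Assemble the federated credential fields for a GitHub repository."""
    return {
        "name": name,
        "issuer": "https://token.actions.githubusercontent.com",
        # Subject pattern copied from this guide; org and repo are filled in.
        "subject": f"repo:{org}/{repo}:ref:refs/heads/*",
        "audiences": ["api://AzureADTokenExchange"],
    }


if __name__ == "__main__":
    # "my-org" and "my-project" are hypothetical values for illustration.
    credential = build_federated_credential("my-org", "my-project", "github-actions-main")
    print(json.dumps(credential, indent=2))
```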

Permissions Required

Only App Owners can add federated credentials. Contact your Azure administrator if you don't have access.

Expected Outcome: GitHub Actions can authenticate to Azure with your service principal.

Step 8: Push to GitHub

If you didn't push during step 5, add the GitHub repository as your remote origin:

git remote add origin <repository-url>

Replace <repository-url> with your GitHub repository URL (either HTTPS or SSH format).

Push your local commits to the remote repository:

git push -u origin main

Note

The -u flag sets the upstream branch, so future pushes can be done with just git push.

Expected Outcome: Your changes are pushed to the remote GitHub repository.

Success Metrics & Checkpoints

Project Created: Core and Python templates successfully applied
Git Initialized: Local repository created with initial commit
Remote Repository: Code pushed to GitHub successfully
Secrets Configured: All required GitHub secrets and variables set
Azure Authentication: Federated credentials configured for service principal
CI/CD Ready: GitHub Actions workflows can trigger successfully

Project Structure Overview

After completing these steps, your project will have the following structure:

my-project/
├── .github/
│   └── workflows/
│       ├── main.yaml
│       ├── deploy_and_test.yaml
│       ├── check_code_quality.yaml
│       └── acceptance_test.yaml
├── bundle/
│   └── mappings/
│       ├── bundle.yaml
│       ├── permissions.yaml
│       ├── targets.yaml
│       └── variables.yaml
├── documentation/                     # dc-release documentation
│   ├── functional_specification.md
│   ├── design_specification.md
│   ├── test_strategy.md
│   ├── risk_assessment.md
│   ├── operations_manual.md
│   └── recovery_procedure.md
├── databricks.yml                    # Databricks Asset Bundle
├── .copier-answers.core.yml          # Core template answers
├── .copier-answers.python.yml        # Python template answers
└── README.md
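To confirm that the templates were applied, you could compare your project directory against a few entries from the tree above. The sketch below checks only a small sample of paths taken from that tree. The list is deliberately incomplete, the Python answers file is present only if you applied the Python template, and the function name is illustrative.

```python
from pathlib import Path

# A sample of paths from the structure above, not an exhaustive list.
EXPECTED_PATHS = [
    ".github/workflows/main.yaml",
    "databricks.yml",
    ".copier-answers.core.yml",
    ".copier-answers.python.yml",
    "README.md",
]


def missing_paths(project_dir: str) -> list[str]:
    """Return the expected paths that do not exist under project_dir."""
    root = Path(project_dir)
    return [path for path in EXPECTED_PATHS if not (root / path).exists()]


if __name__ == "__main__":
    for path in missing_paths("my-project"):
        print(f"missing: {path}")
```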

Common Challenges & Solutions

Challenge: Copier Command Not Found

Solution

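Install Copier using pipx install "copier==9.*" (the quotes stop shells such as zsh from expanding the *)

Ensure pipx is installed first: python3 -m pip install --user pipx, then run pipx ensurepath and restart your terminal so that tools installed with pipx are on your PATH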

Challenge: GitHub Actions Failing with Authentication Errors

Solution

Verify federated credentials are correctly configured in Azure Portal

Challenge: ADO Authentication Error during copier copy

```text
error: Failed to fetch: https://pkgs.dev.azure.com/novonordiskit/_packaging/datacraft/pypi/simple/databricks-sdk/
Caused by: Missing credentials for ...
```

Solution

Create a Personal Access Token (PAT) with the Packaging: Read scope (see instructions).
Export your token before running Copier:

export AZURE_DEVOPS_EXT_PAT="<your-token>"
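If you wrap Copier in a script, you may want to fail fast when the token is missing instead of hitting the fetch error partway through. A minimal sketch, with a hypothetical helper name; it only checks that the variable is set and non-empty, not that the token is valid or has the right scope:

```python
import os


def pat_is_set(env=None) -> bool:
    """Report whether AZURE_DEVOPS_EXT_PAT is present and non-empty."""
    env = os.environ if env is None else env
    return bool(env.get("AZURE_DEVOPS_EXT_PAT", "").strip())


if __name__ == "__main__":
    if not pat_is_set():
        raise SystemExit("AZURE_DEVOPS_EXT_PAT is not set; export it before running Copier.")
```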

Challenge: Keyring prompt blocks Copier

> datacraft:
   password_env_var: AZURE_DEVOPS_EXT_PAT
   url: https://pkgs.dev.azure.com/novonordiskit/_packaging/datacraft/pypi/simple/
   username_for_keyring: VssSessionToken

Solution

Press Esc and then Enter to dismiss the prompt.

Next Steps

Now that your project is set up, you can proceed with:

Work with your project locally - Set up and deploy your Databricks Pipeline from your local machine
Build Your Data Product - Implement your data pipelines
Deploy to Databricks - Learn about the deployment process
