Setting Up Your Project

This guide will help you quickly scaffold a new data engineering project using the Data Pipeline Templates.

Overview

Copier is a library and command-line tool for rendering project templates. It allows you to generate a new project structure from the template repositories, making it easy to start new projects with consistent structure and best practices.

The Data Engineering templates follow a modular architecture where you apply templates in sequence based on your project needs:

  1. dc-template-core: Base project structure with CI/CD skeleton and Databricks Asset Bundle
  2. dc-template-python-project: Python project setup with package management
  3. dc-template-data-product: Data product publishing and quality validation capabilities

What You Will Learn

  • How to scaffold a complete data engineering project using templates
  • How to configure GitHub Actions for automated deployment
  • How to connect your project to Azure Databricks

Prerequisites

Required Access

  • Access to the innersource-nn GitHub organization, where the template repositories reside (NovoAccess Entitlement GitHub : NN-GitHub-Users)
  • Access to a Databricks workspace where you can deploy and run your pipelines. For experimentation, you can use ToyHub; for your project, follow this guide to create or get access to an NNDCP Databricks environment.
  • A Service Principal with permissions to deploy to Databricks. The NNDCP team creates it when you create a managed Databricks environment. Read more here.

Required Tools

Windows

Python:

The recommended Python version is 3.11 or later.

pipx:

py -m pip install --user pipx
py -m pipx ensurepath
::   Restart your Terminal.

Hatch:

pipx install "hatch<1.16.0"
pipx install hatch-pip-compile

Azure CLI:

pip install azure-cli

Databricks CLI:

curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

Copier:

pipx install copier=="9.*"
pipx inject copier copier-templates-extensions

Good to have:

Mac

Python:

The recommended Python version is 3.11 or later.

pipx:

brew install pipx
pipx ensurepath
source ~/.zshrc

Hatch:

pipx install "hatch<1.16.0"
pipx install hatch-pip-compile

Azure CLI:

brew install azure-cli

Databricks CLI:

brew tap databricks/tap
brew install databricks

Copier:

pipx install copier=="9.*"
pipx inject copier copier-templates-extensions

Good to have:

Required Setup

Follow the setup outlined in the dc-release user guide. You will need:

  • A -requirements repo to capture the business solution requirements
  • A -release-log repo where the CI/CD pipeline publishes the release notes
  • Two GitHub Apps (service principals): one with read permissions on your repos and one with write permissions on the -release-log repo to publish release notes

Knowledge Prerequisites

  • Basic Git commands
  • GitHub Actions and workflows
  • Understanding of Python virtual environments
  • Familiarity with Databricks

Step-by-Step Process

Quick Trial (Sample/Training Use Only)

If you want to try out the templates without provisioning a Databricks environment, follow the Quickstart Guide to use the templates in the ToyHub environment.

Step 1: Verify Copier Installation

Confirm Copier is installed and available (see Required Tools above for installation):

copier --version
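The same check can be extended to the other required tools. The sketch below is a minimal POSIX shell helper, not part of the templates; `missing_tools` is a hypothetical name introduced here, and the tool names in the usage comment are assumed to match the executables installed in Required Tools (on Windows the Python launcher is `py` rather than `python3`).

```shell
# Hypothetical helper: print the name of every listed tool that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage (prints nothing when everything is installed):
#   missing_tools python3 pipx hatch az databricks copier
```

An empty result means every listed tool was found; any name printed points back to the matching install command in Required Tools.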

Step 2: Generate Core Project Structure

Create your project directory and apply the core template:

copier copy --trust git@github.com:innersource-nn/dc-template-core.git my-project

Replace my-project with your preferred local path.

Copier will prompt you for project-specific information:

  • project_name: Your project and repository name (lowercase, hyphens only)
  • databricks_host: Your Databricks workspace URL (e.g., https://adb-xxx.azuredatabricks.net/)
  • classic_compute: Whether you need classic Databricks compute or serverless compute
  • databricks_runtime_version: The runtime version to use; verify that its package versions suit the compute you selected
  • databricks_runtime: Derived from your answer to databricks_runtime_version; keep the default value and do not modify it
  • development_target__manage_group_name: Azure AD group name that should have CAN_MANAGE permissions for development bundles
  • use_microsoft_teams_notifications: Whether to receive Databricks job status notifications on Microsoft Teams
  • development_target__teams_webhook_id: Microsoft Teams webhook ID for Databricks job run notifications during development
  • apply_other_templates: Whether to initialise other templates as part of this run (boolean, default is true)
  • templates: If apply_other_templates is true, the other templates to initialise for your project (e.g., dc-template-python-project and dc-template-data-product)

Example Prompts

🎀 project_name
   cp-ingest-coupa
🎀 databricks_host
   https://adb-4321798407796733.13.azuredatabricks.net/
🎀 classic_compute
   No
🎀 databricks_runtime_version
   Serverless environment version 4
🎀 development_target__manage_group_name
   your_ad_group_name
🎀 use_microsoft_teams_notifications
   true
🎀 development_target__teams_webhook_id
   123456
🎀 apply_other_templates
   true
🎀 templates
   [dc-template-python-project, dc-template-data-product]

Expected Outcome: A project directory with core structure, GitHub Actions workflows, Databricks Asset Bundle file, and documentation templates. If you selected additional templates, they will be automatically applied.
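For repeatable setups, some of the prompts above can be pre-answered on the command line. The sketch below assumes Copier's `--data` option, which sets one answer as `name=value`, and reuses the prompt names listed in this step. `scaffold_core` and the `COPIER` override are hypothetical helpers introduced here for illustration; whether the template accepts answers supplied this way has not been verified, and Copier still prompts for any answer you do not pass.

```shell
# Hypothetical helper: apply dc-template-core with two answers pre-filled.
# Arguments: <destination-path> <project-name> <databricks-host>
# COPIER can be overridden (e.g. COPIER=echo) to preview the command without running it.
scaffold_core() {
  "${COPIER:-copier}" copy --trust \
    --data "project_name=$2" \
    --data "databricks_host=$3" \
    git@github.com:innersource-nn/dc-template-core.git "$1"
}

# Preview only (prints the arguments that would be passed to copier):
#   COPIER=echo scaffold_core my-project cp-ingest-coupa https://adb-xxx.azuredatabricks.net/
```

Previewing with `COPIER=echo` first is a cheap way to confirm the arguments before the template is actually rendered.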

Step 3: Apply Python Project Template

Note: If you selected dc-template-python-project during the Core Template setup (Step 2), you can skip this step.

Navigate to your project and apply the Python project template if you haven't already:

cd my-project
copier copy --trust git@github.com:innersource-nn/dc-template-python-project.git .

This adds:

  • Python project configuration (pyproject.toml)
  • Package management setup
  • Databricks Asset Bundle configuration for deploying Python wheels
  • Development dependencies and tools

Template Answers

The Python template automatically reads configuration from .copier-answers.core.yml, so you won't be prompted again for previously entered values.

Expected Outcome: Enhanced project with Python packaging capabilities and wheel deployment configuration.

Step 4: Apply Data Product Template

Note: If you selected dc-template-data-product during the Core Template setup (Step 2), you can skip this step.

If you plan to publish data products to NN Data Marketplace or use data quality validation features, and haven't applied this template yet:

copier copy --trust git@github.com:innersource-nn/dc-template-data-product.git .

Copier will prompt you for project-specific information regarding marketplace and data quality features:

  • enable_marketplace_publishing: Enable publishing to NNDM Data Marketplace? (Y/n)
    (Adds marketplace workflows and data contracts). If you choose Y, you'll be prompted for:

    • servicenow_business_application_id: ServiceNow Business Application ID
    • product_name, product_id, product_description: Basic marketplace metadata
    • data_contract_team_id, data_contract_owner_contact_email, data_contract_owner_contact_name: Ownership information
  • enable_quality_validation: Include data quality validation capabilities? (Y/n)
    (Adds validation scripts, libraries, and optional features). If you choose Y, you'll be prompted for:

    • include_soda_checks: Include Soda Core for additional quality checks? (Y/n)
    • enable_databricks_job: Generate Databricks Asset Bundle for automated validation jobs? (Y/n)
    • include_example_notebooks: Include example Databricks notebooks for validation? (Y/n)
  • novo_initials: Your Novo Nordisk initials

  • project_description: Brief description of your data contracts
  • catalog_base_name, schema_name: Databricks catalog and schema configuration

This adds:

  • Data quality framework and Soda validation
  • Data contract templates (if marketplace enabled)
  • GitHub Actions workflow for NNDM publishing (if marketplace enabled)
  • Configuration for data product metadata

NNDM Prerequisites

Before using marketplace publishing features, ensure you've been onboarded to the NNDM platform and have received your API key and CI tool credentials.

Expected Outcome: Project configured for publishing data products with contracts to NN Data Marketplace and data quality validation capabilities.

Step 5: Initialize Git and Create GitHub Repository

Initialize Git in your project directory:

git init
git add .
git commit -m "Initial commit from data engineering templates"

Create a new repository on GitHub using the GitHub CLI:

gh repo create <org-name>/my-project --private --source=. --push

Or create manually through the GitHub web interface and then:

git remote add origin git@github.com:<org-name>/my-project.git
git branch -M main
git push -u origin main

Expected Outcome: Your project is now version-controlled and pushed to GitHub.

Step 6: Configure GitHub Apps, Secrets and Variables

The CI/CD pipeline from the templates requires a few secrets and variables for deployment.

We recommend using GitOps (Configuration as Code) to manage secrets and variables, ensuring that changes are peer-reviewed and version-controlled.

GxP Compliance Note

Manual changes to variables through the GitHub Web UI are not recommended for regulated environments as they lack a clear audit trail.

Manage your GitHub repository configurations via DataCore's NovoNordisk-DataCore/github management repository by following these steps:

  1. Fork the NovoNordisk-DataCore/github repository.
  2. Add your repository configuration as a YAML file in /service_requests.
  3. Submit a Pull Request.

Please see an example configuration file in the NovoNordisk-DataCore/github repository for reference.

Required GitHub Apps & Secrets

You need two service principals on GitHub (GitHub Apps), referred to here as the reader and the release-log-writer.

  • reader is able to read from your organization and requires the below permissions:
    • actions: read
    • contents: read
    • metadata: read
    • pull_requests: read
    • deployments: read
    • variables: read
  • release-log-writer is able to write to release-log and requires the below permissions:
    • contents: read_and_write
    • metadata: read

| Secret Name | Description | How to Obtain |
|---|---|---|
| GH_APP_ID | reader GitHub App ID | Create a GitHub App in your organization with the necessary permissions* |
| GH_PRIVATE_KEY | reader GitHub App private key | Generate a private key after creating the GitHub App** |
| GH_RELEASE_LOG_APP_ID | release-log-writer GitHub App ID | Create a GitHub App in your organization with the necessary permissions* |
| GH_RELEASE_LOG_PRIVATE_KEY | release-log-writer GitHub App private key | Generate a private key after creating the GitHub App** |
| NNDM_API_KEY | NNDM API key | Provided by the NNDM team during onboarding |
| ACR_USERNAME | Created for NNDM CI-Tool permissions | Provided by NNDM team |
| ACR_PASSWORD | Created for NNDM CI-Tool permissions | Provided by NNDM team |

* To create a GitHub App, go to GitHub → User Settings → Developer settings → GitHub Apps → New GitHub App (Ref. GitHub Docs). Select the appropriate permissions as described above.

** After creating the GitHub App, generate a private key by navigating to the app's settings → Private keys → Generate new key. The .pem file is downloaded automatically. Run the following command to generate the base64-encoded private key:

cat <path-to-pem-file> | base64

The encoded token will be printed to your terminal. Copy it and delete the private key file. Next, create two new organization secrets, one for the App ID, and the other for the encoded private key. Name them according to the above table. Finally, in "Install App", select your organization to install the app.
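One detail worth checking when copying the encoded key: GNU `base64`, the default on most Linux distributions, wraps its output at 76 characters, so the result can span several lines. If the secret is expected to be a single line (an assumption; the guide does not say either way), a sketch like the following strips the line breaks. `encode_pem` is a hypothetical helper name introduced here.

```shell
# Hypothetical helper: print the base64 encoding of a file as one unwrapped line.
encode_pem() {
  base64 < "$1" | tr -d '\n'
}

# Usage:
#   encode_pem <path-to-pem-file>
```

Removing newlines with `tr` behaves the same on Linux and macOS, which avoids relying on platform-specific `base64` flags.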

Required Variables

| Variable Name | Description | Example Value |
|---|---|---|
| RELEASE_CONFIGURATION | dc-release framework configuration (YAML) | See example below |
| AZURE_IAC_AGENT | Azure Infrastructure as Code Agent configuration (JSON) | See example below |
| DATABRICKS_DATACORE | Databricks DataCore workspace configuration (JSON) | See example below |

RELEASE_CONFIGURATION Variable

version: 2.0.0
name: YourApplicationName
github_organization:
  owner: YourGitHubOrg
requirements:
  git_system: github
  repository_name: your-requirements
logs:
  git_system: github
  repository_name: your-release-log
  branch: log

AZURE_IAC_AGENT Variable

{
  "dev": {
    "tenant_id": "12345678-1234-1234-1234-123456789abc",
    "client_id": "87654321-4321-4321-4321-abcdef123456"
  },
  "tst": {
    "tenant_id": "12345678-1234-1234-1234-123456789abc",
    "client_id": "abcdef12-3456-7890-abcd-ef1234567890"
  }
}

DATABRICKS_DATACORE Variable

{
  "dev": {
    "workspaces": {
      "main": {
        "host": "https://adb-1234567890123456.16.azuredatabricks.net/",
        "teams_webhook_id": ""
      }
    },
    "manage_group_name": "DC-NN-staffs_datacore-DEV_Contributor",
    "service_principals": {
      "ingest": "12345678-1234-1234-1234-123456789abc"
    }
  },
  "tst": {
    "workspaces": {
      "main": {
        "host": "https://adb-1234567890123456.16.azuredatabricks.net/",
        "teams_webhook_id": ""
      }
    },
    "manage_group_name": "DC-NN-staffs_datacore-TST_Contributor",
    "service_principals": {
      "ingest": "87654321-4321-4321-4321-abcdef123456"
    }
  }
}
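Because AZURE_IAC_AGENT and DATABRICKS_DATACORE are stored as raw JSON, a stray comma or quote is easy to miss. As a minimal sketch, assuming you keep each value in a local file before submitting it and that `python3` is on your PATH (on Windows, substitute the `py` launcher), Python's built-in `json.tool` module can confirm that a file parses. `is_valid_json` is a hypothetical helper name introduced here.

```shell
# Hypothetical helper: succeed only if the given file parses as JSON.
# Assumes python3 is on PATH; json.tool is part of the Python standard library.
is_valid_json() {
  python3 -m json.tool "$1" >/dev/null 2>&1
}

# Usage (the file name is a placeholder for wherever you saved the value):
#   is_valid_json databricks_datacore.json && echo "valid" || echo "invalid"
```

This checks syntax only; it does not verify that the keys match what the CI/CD workflows expect.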

While possible, adding secrets and variables through the GitHub Web UI (Settings → Secrets and variables → Actions → Secrets/Variables) should only be used for temporary testing.

Expected Outcome: Your project has all the necessary configurations and credentials set up.

Step 7: Set Up Azure Federated Credentials

Configure federated authentication for GitHub Actions to deploy to Azure:

  1. Navigate to Azure Portal → App Registrations
  2. Select your Databricks service principal
  3. Go to Certificates & secrets → Federated Credentials → Add Credential
  4. Configure the credential:
    • Federated credential scenario: Other issuer
    • Issuer: https://token.actions.githubusercontent.com
    • Subject identifier: repo:<org-name>/<repo-name>:ref:refs/heads/*
    • Name: Descriptive name (e.g., github-actions-main)
    • Audience: api://AzureADTokenExchange
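The portal steps above can also be scripted. The sketch below assumes the Azure CLI command `az ad app federated-credential create`, which takes a JSON parameters file, and only builds that JSON from the values listed in step 4. `federated_credential_json` is a hypothetical helper introduced here, `<app-id>` is a placeholder for your app registration, and the example uses a single branch (`main`) rather than the pattern shown above.

```shell
# Hypothetical helper: print the JSON body for a GitHub Actions federated credential.
# Arguments: <org-name> <repo-name> <branch> <credential-name>
federated_credential_json() {
  printf '{"name": "%s", "issuer": "https://token.actions.githubusercontent.com", "subject": "repo:%s/%s:ref:refs/heads/%s", "audiences": ["api://AzureADTokenExchange"]}\n' \
    "$4" "$1" "$2" "$3"
}

# Usage (not run here):
#   federated_credential_json <org-name> <repo-name> main github-actions-main > credential.json
#   az ad app federated-credential create --id <app-id> --parameters credential.json
```

Generating the file first lets you review the issuer, subject, and audience before anything is changed in Azure.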

Permissions Required

Only App Owners can add federated credentials. Contact your Azure administrator if you don't have access.

Expected Outcome: GitHub Actions can authenticate to Azure with your service principal.

Step 8: Push to GitHub

If you didn't push during Step 5, add the GitHub repository as your remote origin:

git remote add origin <repository-url>

Replace <repository-url> with your GitHub repository URL (either HTTPS or SSH format).

Push your local commits to the remote repository:

git push -u origin main

Note

The -u flag sets the upstream branch, so future pushes can be done with just git push.

Expected Outcome: Your changes are pushed to the remote GitHub repository.

Success Metrics & Checkpoints

  • Project Created: Core and Python templates successfully applied
  • Git Initialized: Local repository created with initial commit
  • Remote Repository: Code pushed to GitHub successfully
  • Secrets Configured: All required GitHub secrets and variables set
  • Azure Authentication: Federated credentials configured for service principal
  • CI/CD Ready: GitHub Actions workflows can trigger successfully

Project Structure Overview

After completing these steps, your project will have the following structure:

my-project/
├── .github/
│   └── workflows/
│       ├── main.yaml
│       ├── deploy_and_test.yaml
│       ├── check_code_quality.yaml
│       └── acceptance_test.yaml
├── bundle/
│   └── mappings/
│       ├── bundle.yaml
│       ├── permissions.yaml
│       ├── targets.yaml
│       └── variables.yaml
├── documentation/                     # dc-release documentation
│   ├── functional_specification.md
│   ├── design_specification.md
│   ├── test_strategy.md
│   ├── risk_assessment.md
│   ├── operations_manual.md
│   └── recovery_procedure.md
├── databricks.yml                    # Databricks Asset Bundle
├── .copier-answers.core.yml          # Core template answers
├── .copier-answers.python.yml        # Python template answers
└── README.md

Common Challenges & Solutions

Challenge: Copier Command Not Found

Solution

Install Copier using pipx install "copier==9.*" (the quotes keep the shell from expanding the *).

Ensure pipx is installed first: python3 -m pip install --user pipx

Challenge: GitHub Actions Failing with Authentication Errors

Solution

Verify federated credentials are correctly configured in Azure Portal

Challenge: ADO Authentication Error during copier copy

```text
error: Failed to fetch: https://pkgs.dev.azure.com/novonordiskit/_packaging/datacraft/pypi/simple/databricks-sdk/
Caused by: Missing credentials for ...
```

Solution

  1. Create a Personal Access Token (PAT) with the Packaging: Read scope (see instructions).
  2. Export your token before running Copier:

export AZURE_DEVOPS_EXT_PAT="<your-token>"

Challenge: Keyring prompt blocks Copier

> datacraft:
   password_env_var: AZURE_DEVOPS_EXT_PAT
   url: https://pkgs.dev.azure.com/novonordiskit/_packaging/datacraft/pypi/simple/
   username_for_keyring: VssSessionToken

Solution

Press Esc and then Enter to dismiss the prompt.

Next Steps

Now that your project is set up, you can proceed with:

  1. Work with your project locally - Set up and deploy your Databricks Pipeline from your local machine
  2. Build Your Data Product - Implement your data pipelines
  3. Deploy to Databricks - Learn about the deployment process

Additional Resources