
Databricks Asset Bundle#

A Databricks Asset Bundle is a tool for packaging all components of a data or AI project—including source code, Databricks resources (like jobs and pipelines), and infrastructure configurations—into a single, deployable unit defined by source files and YAML configurations. Bundles enable software engineering best practices such as source control, code review, testing, and CI/CD, making it easier to manage, develop, and deploy projects consistently across different environments like development, staging, and production.

Why use Asset Bundles?#

  • Standardization: Unified framework for assets, code, and configurations
  • Reproducibility: Seamless transitions between development, staging, and production
  • Collaboration: Enables all team members to contribute and track changes
  • Automation: Supports validation, deployment, and CI/CD integration for operational efficiency
  • Compliance: Meets versioning, audit, and regulatory requirements

Core concepts and components#

Centralized configuration with databricks.yml#

The databricks.yml file is the single source of truth for your bundle, defining your jobs, tasks, clusters, environment targets, and artifacts.

Bundle name definition

bundle:
    name: dab-demo

The bundle section defines the name of your project. For example, dab-demo serves as a clear identifier for this Databricks bundle.

Permissions configuration#

The permissions section in the YAML defines access control for the Databricks project. It specifies who can manage or run the resources in the bundle.

Permissions configuration in databricks.yml

permissions:
    - user_name: adminmixg@novonordisk.com  # Assigns permissions to a specific user
      level: CAN_MANAGE                     # Full control (create, modify, delete, run)
#   - group_name: NN-MY-GROUP              # (Optional) Assign permissions to a group for scalability
#     level: CAN_MANAGE                    # Full control for the group

Permission levels:

  • CAN_MANAGE: Full control (create, modify, delete, run)
  • CAN_RUN: Run only
  • CAN_VIEW: Read-only

This ensures proper access control and supports user or group-specific permissions.
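A small local check can catch typos in the permissions list before the bundle is validated. The sketch below assumes the permissions block has already been loaded into plain Python dictionaries (for example with a YAML parser). The helper name `check_permissions` is hypothetical and not part of any Databricks library, and the allowed levels are only the three listed above.

```python
# Hypothetical helper, not part of the Databricks CLI or SDK.
ALLOWED_LEVELS = {"CAN_MANAGE", "CAN_RUN", "CAN_VIEW"}  # levels listed in this section
PRINCIPAL_KEYS = {"user_name", "group_name"}  # principal keys used in the example above


def check_permissions(entries):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for index, entry in enumerate(entries):
        # Each entry should name exactly one principal (extend PRINCIPAL_KEYS if needed).
        principals = PRINCIPAL_KEYS.intersection(entry)
        if len(principals) != 1:
            problems.append(f"entry {index}: expected exactly one of {sorted(PRINCIPAL_KEYS)}")
        level = entry.get("level")
        if level not in ALLOWED_LEVELS:
            problems.append(f"entry {index}: unknown level {level!r}")
    return problems
```

An empty result means every entry names one principal and uses one of the documented levels.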

Project resources#

Databricks allows you to define all project resources—jobs, clusters, tasks, and even notebooks—under the resources section of the databricks.yml file. This keeps your project well-organized and easy to manage, all in one place.

Defining jobs and notebooks in databricks.yml

resources:
    jobs:
        etl_job:
            name: etl_job
            job_clusters:
                - job_cluster_key: job_cluster
                  new_cluster:
                      spark_version: 15.4.x-scala2.12
                      node_type_id: Standard_DS3_v2
                      num_workers: 1
                      data_security_mode: USER_ISOLATION
            tasks:
                - task_key: task_landing_bronze
                  job_cluster_key: job_cluster
                  python_wheel_task:
                      package_name: pipelines
                      entry_point: landing_to_bronze
                  libraries:
                      - whl: ./dist/*.whl
    notebooks:
        - path: /path/to/notebook_folder/notebook_1.py
          format: SOURCE
        - path: /path/to/notebook_folder/notebook_2.ipynb
          format: JUPYTER

  • Jobs: Define jobs with clusters, tasks, and dependencies. The example above runs a Python function from a packaged wheel file.
  • Notebooks: List notebooks as resources, specifying their format (SOURCE or JUPYTER).
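For the python_wheel_task above to run, the pipelines package has to expose a callable named landing_to_bronze. The sketch below shows one assumed shape for it. The module path, the flag names, and the default paths are hypothetical, and the Spark logic is left out so the example runs anywhere.

```python
# pipelines/main.py  (hypothetical module inside the `pipelines` package)
import argparse


def parse_args(argv=None):
    """Parse task parameters; the flag names and defaults are illustrative only."""
    parser = argparse.ArgumentParser(description="Copy landing data to the bronze layer")
    parser.add_argument("--source", default="/landing")
    parser.add_argument("--destination", default="/bronze")
    return parser.parse_args(argv)


def build_message(args):
    """Describe the copy that a real implementation would perform with Spark."""
    return f"Would copy {args.source} -> {args.destination}"


def landing_to_bronze(argv=None):
    """Callable referenced by `entry_point: landing_to_bronze` in the job task."""
    args = parse_args(argv)
    print(build_message(args))
```

With Poetry, such a function is typically registered under [tool.poetry.scripts] in pyproject.toml, for example landing_to_bronze = "pipelines.main:landing_to_bronze" (the module path here is an assumption).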

Artifacts#

Package Python code as wheels (using Poetry or similar), ensuring consistent deployment of code and dependencies. These artifacts guarantee that the same code, dependencies, and logic are used across different tasks and environments.

Artifacts configuration in databricks.yml

artifacts:
    default:
        type: whl
        build: poetry build  # Uses 'poetry build' to create the .whl file in dist/
        path: .              # Code to package is in the current directory (where pyproject.toml/setup.py lives)
        # You can replace 'poetry build' with other tools like Hatch or uv
        # The build process generates a .whl file in dist/, referenced in jobs/tasks elsewhere in your Databricks project
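
Because tasks reference the wheel through the ./dist/*.whl pattern, wheels left over from earlier builds match that pattern too. A minimal sketch of a local check follows, assuming the default dist/ output directory; the helper names are hypothetical.

```python
from pathlib import Path


def find_wheels(project_dir="."):
    """Return the .whl files under <project_dir>/dist, sorted by name."""
    return sorted(Path(project_dir, "dist").glob("*.whl"))


def require_single_wheel(project_dir="."):
    """Return the only wheel in dist/, or raise if there are none or several."""
    wheels = find_wheels(project_dir)
    if len(wheels) != 1:
        raise RuntimeError(f"expected exactly one wheel in dist/, found {len(wheels)}")
    return wheels[0]
```

Running a check like this after the build step is one way to notice stale wheels before deployment.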

Environment targets#

Enable multiple deployment environments (dev, staging, prod), with parameters and versioning to keep environments in sync.

Environment targets configuration in databricks.yml

targets:
    dev:
        mode: development
        default: true
        presets:
            artifacts_dynamic_version: true
        workspace:
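            host: https://adb-2611332145766578.18.azuredatabricks.net/

  • mode: The deployment mode for the target. With development, deployed resources are prefixed with the deploying user's name and job schedules are paused
  • default: Marks dev as the target used when no --target is given
  • host: The workspace URL where the environment is set up
  • artifacts_dynamic_version: Updates the wheel version dynamically on each deployment, so a rebuilt wheel is picked up without bumping the version by hand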

This allows you to switch between environments easily, ensuring proper testing and promotions.
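To illustrate how default: true interacts with an explicitly requested target, here is a simplified model of the selection rule. It is a sketch only, not the CLI's actual implementation; `resolve_target` is a hypothetical name, and `targets` mirrors the targets mapping as plain dictionaries.

```python
def resolve_target(targets, requested=None):
    """Pick the target name to deploy to.

    `targets` mirrors the `targets` mapping in databricks.yml as plain dicts.
    """
    if requested is not None:
        # An explicit request wins, but it must name a configured target.
        if requested not in targets:
            raise KeyError(f"unknown target {requested!r}; known: {sorted(targets)}")
        return requested
    # Otherwise fall back to the single target flagged with `default: true`.
    defaults = [name for name, cfg in targets.items() if cfg.get("default")]
    if len(defaults) != 1:
        raise ValueError(f"expected exactly one default target, found {len(defaults)}")
    return defaults[0]
```

With the configuration above, calling it without a requested target returns "dev".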

Secrets management#

Secrets like passwords, API keys, and other sensitive information can be stored securely in Databricks Secret Scopes and referenced directly in your code.

Accessing secrets in your code

# dbutils is provided by the Databricks notebook environment; no import is needed there
secret_value = dbutils.secrets.get(scope="my-secret-scope", key="my-key")

Secrets are stored in scopes (e.g., my-secret-scope) and retrieved at runtime with dbutils.secrets.get, so sensitive values never need to be hardcoded in code or configuration files.
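Outside a notebook, dbutils is not defined, which makes code that calls it directly hard to run locally. One possible pattern, sketched below, is to pass dbutils in explicitly and fall back to environment variables when it is absent. The function name and the environment variable naming scheme are assumptions for illustration, not a Databricks convention.

```python
import os


def get_secret(scope, key, dbutils=None):
    """Read a secret from Databricks when `dbutils` is supplied, else from the environment.

    The environment variable name (SCOPE_KEY, upper-cased, dashes replaced by
    underscores) is a hypothetical convention for local runs only.
    """
    if dbutils is not None:
        return dbutils.secrets.get(scope=scope, key=key)
    env_name = f"{scope}_{key}".upper().replace("-", "_")
    try:
        return os.environ[env_name]
    except KeyError:
        raise KeyError(f"secret not found; set {env_name} for local runs") from None
```

In a notebook this would be called as get_secret("my-secret-scope", "my-key", dbutils=dbutils); locally, the same call without dbutils reads MY_SECRET_SCOPE_MY_KEY.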

Testing and documentation#

Include unit/integration tests and markdown docs in your bundle to improve reliability and ease onboarding.
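As a minimal sketch of what such a unit test might look like, the example below tests a small pure function with the standard library's unittest module. The function `normalize_column_names` is a hypothetical stand-in for real pipeline logic; keeping logic in plain functions like this lets tests run without a cluster.

```python
import unittest


def normalize_column_names(columns):
    """Lower-case column names, trim whitespace, and replace spaces with underscores."""
    return [name.strip().lower().replace(" ", "_") for name in columns]


class NormalizeColumnNamesTest(unittest.TestCase):
    def test_spaces_and_case(self):
        self.assertEqual(
            normalize_column_names([" Order ID", "Amount"]),
            ["order_id", "amount"],
        )

    def test_empty_input(self):
        self.assertEqual(normalize_column_names([]), [])
```

Tests written this way can be run locally with python -m unittest, and in a CI pipeline before deployment.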

Deployment methods#

Once you have defined your Databricks Asset Bundle with all necessary components, you can deploy and manage it using two primary methods: the Command-Line Interface (CLI) and the Databricks User Interface (UI).

Command-Line Interface (CLI)#

The Databricks CLI is a powerful command-line tool that simplifies the management of Databricks resources by wrapping the REST APIs into easy-to-use commands. Using the CLI allows for automated, repeatable, and consistent workflows, making it ideal for team collaboration and CI/CD integration.

Why use the CLI?

  • Automation: Speed up repetitive tasks and reduce manual effort
  • Consistency: Maintain uniformity across deployments in different environments (dev, staging, prod)
  • Flexibility: Use it across various operating systems and integrate seamlessly with CI/CD pipelines
  • Control: Gain full access to nearly every function in the Databricks workspace

Common CLI commands#

| Command | Description |
| --- | --- |
| databricks bundle init | Initialize a new asset bundle project using templates |
| databricks bundle validate | Validate your bundle configuration to check for errors before deployment |
| databricks bundle deploy --target <env> | Deploy your bundle to the target environment (e.g., dev, staging, prod) |
| databricks bundle run <resource-key> | Execute a specific job, pipeline, or task defined in your bundle |
| databricks bundle summary | Get a summary of the bundle's resources and their status |

Note: The CLI is well-suited for automated deployment pipelines, providing reliability and scalability in production environments.
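When these commands are called from scripts, it can help to build the argument list in one place. The sketch below wraps the commands from the table using Python's subprocess module; `bundle_command` and `run_bundle` are hypothetical helper names, and the Databricks CLI is assumed to be installed and authenticated wherever `run_bundle` is actually invoked.

```python
import subprocess


def bundle_command(action, *args, target=None):
    """Build the argv for a `databricks bundle` command without running it."""
    argv = ["databricks", "bundle", action, *args]
    if target is not None:
        argv += ["--target", target]
    return argv


def run_bundle(action, *args, target=None):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(bundle_command(action, *args, target=target), check=True)
```

For example, bundle_command("run", "etl_job", target="dev") produces the same arguments as typing databricks bundle run etl_job --target dev.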

User Interface (UI)#

The Databricks workspace UI offers an interactive alternative for deploying and managing bundles. This is especially useful for new users or when visual inspection and direct edits are needed.

Features and usage:

  • Quick testing or iteration for small updates or debugging
  • Visual inspection to review bundle contents and ensure correctness
  • Learning and demonstrations for onboarding or showcasing workflows
  • Upload bundles: Import asset bundles directly from your local machine
  • Edit configurations: Modify cluster settings, permissions, or tasks within the workspace
  • Resource monitoring: Visualize and monitor resource usage and job execution

Workflow for bundle creation and deployment#

Databricks Asset Bundle Workflow

The recommended approach uses Copier to bootstrap the project structure and jumpstart the development process. It's ideal for teams looking to standardize their workflows and minimize redundant setup tasks. For those who need to customize templates for specific requirements, see working with templates locally.

Workflow steps:

  1. Bootstrap with Copier Template: Quickly generate the required project files and structure
  2. Review or Modify Configuration Files: Validate and customize settings like databricks.yml to add specific requirements (e.g., pipelines, clusters)
  3. Proceed to Local Development: Focus on coding Python libraries, notebooks, and other project requirements
  4. Validate and Deploy: Use the Databricks CLI for validation and deployment into the environment
  5. CI/CD Integration: Automate validation, testing, and deployment pipelines for greater scalability
  6. Monitor and Maintain: Continuously monitor Databricks jobs and resources while optimizing configurations
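Steps 4 and 5 can be sketched as a fail-fast sequence: validate, then deploy, then run, stopping at the first command that fails. The function names below are hypothetical, the job key is whatever resource key the bundle defines, and the runner is injectable so the ordering logic can be exercised without the Databricks CLI installed.

```python
import subprocess


def workflow_commands(target, job_key):
    """Return the validate -> deploy -> run commands for one target, in order."""
    return [
        ["databricks", "bundle", "validate", "--target", target],
        ["databricks", "bundle", "deploy", "--target", target],
        ["databricks", "bundle", "run", job_key, "--target", target],
    ]


def run_workflow(commands, runner=None):
    """Run commands in order, stop at the first failure, return those that succeeded."""
    if runner is None:
        # Default runner executes the command and reports its exit code.
        runner = lambda argv: subprocess.run(argv).returncode
    completed = []
    for argv in commands:
        if runner(argv) != 0:
            break
        completed.append(argv)
    return completed
```

A CI job could compare the returned list with the input to decide whether the pipeline succeeded.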

Manual approach: Custom setup#

For projects with unique requirements or a need for extra flexibility, manual setup gives complete control over the structure and configuration. However, it takes more initial effort and relies entirely on developers to keep projects consistent.

Workflow steps:

  1. Start with Manual Setup: Begin by manually creating the necessary directories and structure for the project
  2. Create Configuration Files: Define essential files like databricks.yml and manage all settings manually without templating
  3. Proceed to Local Development: Work on adding scripts, notebooks, and dependencies
  4. Validate and Deploy: Test configurations and deploy using the Databricks CLI
  5. CI/CD Integration: As in the recommended approach, automation can be added later for enhanced efficiency
  6. Monitor and Maintain: Check job performance, refine workflows, and ensure the smooth operation of deployed bundles
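Steps 1 and 2 of the manual approach can be sketched as a short script that creates a skeleton and a minimal databricks.yml. The directory names (src, tests) and the function name are assumptions for illustration; only the bundle name block comes from the configuration shown earlier on this page.

```python
from pathlib import Path

# Minimal configuration: only the bundle name, as in the first example on this page.
DATABRICKS_YML = """\
bundle:
    name: {name}
"""


def scaffold_bundle(root, name):
    """Create a minimal bundle skeleton; the directory names are illustrative."""
    root = Path(root)
    for sub in ("src", "tests"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    config = root / "databricks.yml"
    if not config.exists():  # never overwrite an existing configuration
        config.write_text(DATABRICKS_YML.format(name=name), encoding="utf-8")
    return config
```

Resources, artifacts, and targets would then be added to the generated file by hand.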

References#