Databricks Asset Bundle
A Databricks Asset Bundle is a tool for packaging all components of a data or AI project—including source code, Databricks resources (like jobs and pipelines), and infrastructure configurations—into a single, deployable unit defined by source files and YAML configurations. Bundles enable software engineering best practices such as source control, code review, testing, and CI/CD, making it easier to manage, develop, and deploy projects consistently across different environments like development, staging, and production.
Why use Asset Bundles?
- Standardization: Unified framework for assets, code, and configurations
- Reproducibility: Seamless transitions between development, staging, and production
- Collaboration: Enables all team members to contribute and track changes
- Automation: Supports validation, deployment, and CI/CD integration for operational efficiency
- Compliance: Meets versioning, audit, and regulatory requirements
Core concepts and components
Centralized configuration with databricks.yml
The databricks.yml file is the single source of truth for your bundle, defining your jobs, tasks, clusters, environment targets, and artifacts.
Bundle name definition
    bundle:
      name: dab-demo
The bundle section defines the name of your project. For example, dab-demo serves as a clear identifier for this Databricks bundle.
Permissions configuration
The permissions section in the YAML defines access control for the Databricks project. It specifies who can manage or run the resources in the bundle.
Permissions configuration in databricks.yml
    permissions:
      - user_name: adminmixg@novonordisk.com  # Assigns permissions to a specific user
        level: CAN_MANAGE                      # Full control (create, modify, delete, run)
      # - group_name: NN-MY-GROUP              # (Optional) Assign permissions to a group for scalability
      #   level: CAN_MANAGE                    # Full control for the group
Permission levels:
- CAN_MANAGE: Full control (create, modify, delete, run)
- CAN_RUN: Run only
- CAN_VIEW: Read-only
This ensures proper access control and supports user or group-specific permissions.
Project resources
Databricks allows you to define all project resources—jobs, clusters, tasks, and even notebooks—under the resources section of the databricks.yml file. This keeps your project well-organized and easy to manage, all in one place.
Defining jobs and notebooks in databricks.yml
    resources:
      jobs:
        etl_job:
          name: etl_job
          job_clusters:
            - job_cluster_key: job_cluster
              new_cluster:
                spark_version: 15.4.x-scala2.12
                node_type_id: Standard_DS3_v2
                num_workers: 1
                data_security_mode: USER_ISOLATION
          tasks:
            - task_key: task_landing_bronze
              job_cluster_key: job_cluster
              python_wheel_task:
                package_name: pipelines
                entry_point: landing_to_bronze
              libraries:
                - whl: ./dist/*.whl
      notebooks:
        - path: /path/to/notebook_folder/notebook_1.py
          format: SOURCE
        - path: /path/to/notebook_folder/notebook_2.ipynb
          format: JUPYTER
- Jobs: Define jobs with clusters, tasks, and dependencies. The example above runs a Python function from a packaged wheel file.
- Notebooks: List notebooks as resources, specifying their format (SOURCE or JUPYTER).
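The job above points its python_wheel_task at an entry point named landing_to_bronze in the pipelines package. As a rough sketch of what such an entry point might look like, assuming task parameters arrive as command-line arguments and that the function is registered as an entry point in the package metadata. The --source and --target-table parameter names are hypothetical, and the actual read/write logic is left as a placeholder:

```python
import argparse
import sys


def parse_args(argv):
    """Parse task parameters (hypothetical names, not taken from the bundle above)."""
    parser = argparse.ArgumentParser(prog="landing_to_bronze")
    parser.add_argument("--source", required=True, help="Landing path to read from")
    parser.add_argument("--target-table", required=True, help="Bronze table to write to")
    return parser.parse_args(argv)


def landing_to_bronze(argv=None):
    """Entry point for the wheel task; real ingestion logic would replace the placeholder."""
    args = parse_args(sys.argv[1:] if argv is None else argv)
    # Placeholder: a real implementation would read from args.source
    # and write to args.target_table here.
    message = f"Would load {args.source} into {args.target_table}"
    print(message)
    return message
```

Keeping argument parsing in its own function makes the entry point easy to exercise locally without a cluster.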
Artifacts
Package Python code as wheels (using Poetry or similar), ensuring consistent deployment of code and dependencies. These artifacts guarantee that the same code, dependencies, and logic are used across different tasks and environments.
Artifacts configuration in databricks.yml
    artifacts:
      default:
        type: whl
        build: poetry build  # Uses 'poetry build' to create the .whl file in dist/
        path: .              # Code to package is in the current directory (where pyproject.toml/setup.py lives)
    # You can replace 'poetry build' with other tools like Hatch or uv
    # The build process generates a .whl file in dist/, referenced in jobs/tasks elsewhere in your Databricks project
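Because the job references the wheel through a glob (./dist/*.whl), stale builds left in dist/ would also match. A minimal sketch of a pre-deployment check, assuming the project layout described above; the helper names are hypothetical and not part of any Databricks tooling:

```python
from pathlib import Path


def find_wheels(project_dir="."):
    """Return the wheel files under <project_dir>/dist, sorted by name."""
    return sorted(Path(project_dir, "dist").glob("*.whl"))


def require_single_wheel(project_dir="."):
    """Fail early if the build produced no wheel or left more than one in dist/."""
    wheels = find_wheels(project_dir)
    if len(wheels) != 1:
        raise RuntimeError(f"Expected exactly one wheel in dist/, found {len(wheels)}")
    return wheels[0]
```

A check like this could run between the build step and deployment in a CI job.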
Environment targets
Enable multiple deployment environments (dev, staging, prod), with parameters and versioning to keep environments in sync.
Environment targets configuration in databricks.yml
    targets:
      dev:
        mode: development
        default: true
        presets:
          artifacts_dynamic_version: true
        workspace:
          host: https://adb-2611332145766578.18.azuredatabricks.net/
- mode: The deployment mode of the target (development here, for the target named dev)
- host: The workspace URL where the environment is set up
- artifacts_dynamic_version: Automatic versioning for artifacts to keep track of deployments
This allows you to switch between environments easily, ensuring proper testing and promotions.
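The target marked default: true is the one used when no target is named explicitly. The selection idea can be sketched over a plain dictionary as follows. This is an illustration only, not the CLI's implementation, and the prod entry is a hypothetical second target:

```python
def resolve_target(targets, requested=None):
    """Pick the requested target, or fall back to the one flagged as default."""
    if requested is not None:
        if requested not in targets:
            raise KeyError(f"Unknown target: {requested}")
        return requested
    defaults = [name for name, cfg in targets.items() if cfg.get("default")]
    if len(defaults) != 1:
        raise ValueError(f"Expected exactly one default target, found {len(defaults)}")
    return defaults[0]


# Mirrors the dev target above; "prod" is hypothetical.
example_targets = {
    "dev": {"mode": "development", "default": True},
    "prod": {"mode": "production"},
}
selected = resolve_target(example_targets)
```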
Secrets management
Secrets like passwords, API keys, and other sensitive information can be stored securely in Databricks Secret Scopes and referenced directly in your code.
Accessing secrets in your code
    # Accessing secrets in your code:
    dbutils.secrets.get(scope="my-secret-scope", key="my-key")
Secrets are stored in scopes (e.g., my-secret-scope) and retrieved using dbutils.secrets.get. This ensures sensitive data is not hardcoded in your configurations, keeping your project safe.
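Since dbutils is only available inside a Databricks runtime, code that also runs locally may need a fallback. One possible sketch, assuming the caller passes dbutils in when it exists; the environment-variable naming convention is an assumption of this example, not a Databricks feature:

```python
import os


def get_secret(scope, key, dbutils=None):
    """Fetch a secret from a Databricks secret scope, or from an env var locally."""
    if dbutils is not None:
        return dbutils.secrets.get(scope=scope, key=key)
    # Local fallback (assumed convention): my-secret-scope / my-key
    # maps to the variable MY_SECRET_SCOPE__MY_KEY.
    env_name = f"{scope}__{key}".upper().replace("-", "_")
    try:
        return os.environ[env_name]
    except KeyError:
        raise KeyError(f"Secret not found; set {env_name} for local runs") from None
```

In a notebook or job this might be called as get_secret("my-secret-scope", "my-key", dbutils), while local tests rely on the environment variable.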
Testing and documentation
Include unit/integration tests and markdown docs in your bundle to improve reliability and ease onboarding.
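As an illustration of the kind of unit test that can live alongside the bundle, here is a small sketch using the standard unittest module. The normalize_column_names function is a hypothetical transformation helper, not part of the original project:

```python
import unittest


def normalize_column_names(columns):
    """Trim, lower-case, and replace spaces with underscores in column names."""
    return [c.strip().lower().replace(" ", "_") for c in columns]


class NormalizeColumnNamesTest(unittest.TestCase):
    def test_spaces_and_case(self):
        self.assertEqual(
            normalize_column_names([" Order ID", "Customer Name "]),
            ["order_id", "customer_name"],
        )

    def test_empty_input(self):
        self.assertEqual(normalize_column_names([]), [])
```

Tests written this way need no cluster, so they can run with python -m unittest on a developer machine or in CI before deployment.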
Deployment methods
Once you have defined your Databricks Asset Bundle with all necessary components, you can deploy and manage it using two primary methods: the Command-Line Interface (CLI) and the Databricks User Interface (UI).
Command-Line Interface (CLI)
The Databricks CLI is a powerful command-line tool that simplifies the management of Databricks resources by wrapping the REST APIs into easy-to-use commands. Using the CLI allows for automated, repeatable, and consistent workflows, making it ideal for team collaboration and CI/CD integration.
Why use the CLI?
- Automation: Speed up repetitive tasks and reduce manual effort
- Consistency: Maintain uniformity across deployments in different environments (dev, staging, prod)
- Flexibility: Use it across various operating systems and integrate seamlessly with CI/CD pipelines
- Control: Gain full access to nearly every function in the Databricks workspace
Common CLI commands
| Command | Description |
|---|---|
| `databricks bundle init` | Initialize a new asset bundle project using templates |
| `databricks bundle validate` | Validate your bundle configuration to check for errors before deployment |
| `databricks bundle deploy --target <env>` | Deploy your bundle to the target environment (e.g., dev, staging, prod) |
| `databricks bundle run <resource-key>` | Execute a specific job, pipeline, or task defined in your bundle |
| `databricks bundle summary` | Get a summary of the bundle's resources and their status |
Note: The CLI is well-suited for automated deployment pipelines, providing reliability and scalability in production environments.
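In an automated pipeline, the commands above are typically chained so that a failed validation stops the deployment. A minimal sketch of that sequencing, assuming the databricks CLI is installed and authenticated on the machine running the script; the function names are hypothetical:

```python
import subprocess


def bundle_commands(target, resource_key=None):
    """Build the CLI invocations for validate, deploy, and (optionally) run."""
    commands = [
        ["databricks", "bundle", "validate", "--target", target],
        ["databricks", "bundle", "deploy", "--target", target],
    ]
    if resource_key is not None:
        commands.append(["databricks", "bundle", "run", "--target", target, resource_key])
    return commands


def run_pipeline(target, resource_key=None, runner=subprocess.run):
    """Run each command in order; check=True stops at the first failure."""
    for command in bundle_commands(target, resource_key):
        runner(command, check=True)
```

For example, run_pipeline("dev", "etl_job") would validate, deploy to dev, and then trigger the job defined earlier. The runner parameter exists so the sequence can be tested without invoking the CLI.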
User Interface (UI)
The Databricks workspace UI offers an interactive alternative for deploying and managing bundles. This is especially useful for new users or when visual inspection and direct edits are needed.
Features and usage:
- Quick testing or iteration for small updates or debugging
- Visual inspection to review bundle contents and ensure correctness
- Learning and demonstrations for onboarding or showcasing workflows
- Upload bundles: Import asset bundles directly from your local machine
- Edit configurations: Modify cluster settings, permissions, or tasks within the workspace
- Resource monitoring: Visualize and monitor resource usage and job execution
Workflow for bundle creation and deployment
Recommended approach: Copier template
This approach leverages Copier to bootstrap the project structure and jumpstart the development process. It's ideal for teams looking to standardize their workflows and minimize redundant setup tasks. For those who need to customize templates for specific requirements, see working with templates locally.
Workflow steps:
- Bootstrap with Copier Template: Quickly generate the required project files and structure
- Review or Modify Configuration Files: Validate and customize settings like databricks.yml to add specific requirements (e.g., pipelines, clusters)
- Proceed to Local Development: Focus on coding Python libraries, notebooks, and other project requirements
- Validate and Deploy: Use the Databricks CLI for validation and deployment into the environment
- CI/CD Integration: Automate validation, testing, and deployment pipelines for greater scalability
- Monitor and Maintain: Continuously monitor Databricks jobs and resources while optimizing configurations
Manual approach: Custom setup
For projects with unique requirements or a need for extra flexibility, manual setup gives complete control over the structure and configuration. However, it takes more initial effort and relies entirely on developer discipline for consistency.
Workflow steps:
- Start with Manual Setup: Begin by manually creating the necessary directories and structure for the project
- Create Configuration Files: Define essential files like databricks.yml and manage all settings manually without templating
- Proceed to Local Development: Work on adding scripts, notebooks, and dependencies
- Validate and Deploy: Test configurations and deploy using the Databricks CLI
- CI/CD Integration: As in the recommended approach, automation can be added later for enhanced efficiency
- Monitor and Maintain: Check job performance, refine workflows, and ensure the smooth operation of deployed bundles