Orchestration & Monitoring#
This guide covers how orchestration, scheduling, and monitoring can be configured in Databricks using Databricks Asset Bundles.
Reader's Guide: This guide covers Databricks orchestration fundamentals (jobs, tasks, and triggers), followed by practical examples that use Databricks Asset Bundles to create and deploy workflows with Python. Databricks Asset Bundles let you define Databricks resources in YAML, JSON, or Python files.
The guide consists of two main sections: the first section covering the core concepts, and the second section covering orchestration with Databricks Asset Bundles.
Core Concepts#
Lakeflow Jobs#
Lakeflow Jobs (jobs) are used to create and automate workflows in Databricks. A job can be as simple as a single scheduled task or as involved as a multi-task workflow.
You can specify different properties for a job, such as the ones below:
- Trigger - this defines when to run the job.
- Parameters - run-time parameters that are automatically pushed to tasks within the job.
- Notifications - emails or webhooks sent when a job fails or takes too long, which can be used for monitoring.
- Git - version control of the source code that the job/task should run.
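As a rough illustration of how these properties fit together, the sketch below writes them as a plain Python dictionary in the same style as the job definition shown later in this guide. The job name, parameter, e-mail address, repository URL, and one-hour threshold are hypothetical placeholders, and the field names are assumed to follow the Jobs API, so check them against the official reference before relying on them.

```python
# Sketch of job-level settings as a plain dictionary.
# All values are hypothetical placeholders, not taken from a real workspace.
job_settings = {
    "name": "example_job",
    # Trigger: when to run the job (here: once a week).
    "trigger": {"periodic": {"interval": 1, "unit": "WEEKS"}},
    # Parameters: run-time values that are pushed to the tasks within the job.
    "parameters": [{"name": "environment", "default": "dev"}],
    # Notifications: who to e-mail when a run fails or runs too long.
    "email_notifications": {
        "on_failure": ["team@example.com"],
        "on_duration_warning_threshold_exceeded": ["team@example.com"],
    },
    # Health rule (assumed field names): a run longer than one hour is "too long".
    "health": {
        "rules": [
            {"metric": "RUN_DURATION_SECONDS", "op": "GREATER_THAN", "value": 3600}
        ]
    },
    # Git: the repository and branch holding the source code the tasks run.
    "git_source": {
        "git_url": "https://github.com/example-org/example-repo",
        "git_provider": "gitHub",
        "git_branch": "main",
    },
}

# List the configured properties.
for key in sorted(job_settings):
    print(key)
```

Keeping the settings in a dictionary makes it easy to inspect or validate them before handing them to a job definition.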
Task#
A task is a single unit of work within a job. There are many types of tasks; here are a few examples:
- Notebook - run a notebook as a task.
- Python Wheel - run a Python wheel as a task.
- dbt - run one or more dbt commands as a task.
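The three task types listed above could look roughly like this when written as Python dictionaries. The task keys, the package name, and the dbt commands are hypothetical, and the helper task_type is only an illustration that relies on the assumption that the type-specific key ends in "_task"; it is not part of any Databricks library.

```python
# Three task definitions as plain dictionaries (names and paths are placeholders).
notebook_task = {
    "task_key": "run_notebook",
    "notebook_task": {"notebook_path": "src/notebook.ipynb"},
}
wheel_task = {
    "task_key": "run_wheel",
    "python_wheel_task": {"package_name": "my_package", "entry_point": "main"},
}
dbt_task = {
    "task_key": "run_dbt",
    "dbt_task": {"commands": ["dbt deps", "dbt run"]},
}


def task_type(task: dict) -> str:
    """Return the key that identifies the task type, e.g. 'notebook_task'.

    Assumes (for this sketch only) that the type-specific key ends in '_task'.
    """
    types = [key for key in task if key.endswith("_task")]
    if len(types) != 1:
        raise ValueError(f"expected exactly one task type, found {types}")
    return types[0]


for task in (notebook_task, wheel_task, dbt_task):
    print(task["task_key"], "->", task_type(task))
```

Each task carries exactly one type-specific block next to its task_key, which is what the helper checks.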
Trigger#
A trigger starts the execution of a job. It can be either scheduled, so the job runs at a fixed interval, or event-based, so the job runs when, for example, a new file arrives in cloud storage.
- Event Based
  file_arrival:
    url: "/Volumes/catalog/stg/sources/file_location"
    min_time_between_triggers_seconds: 60 # 60 seconds minimum between runs
    wait_after_last_change_seconds: 60 # Wait 60 seconds after last file change
- Scheduled
  periodic:
    interval: 1
    unit: WEEKS
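Since the second half of this guide defines jobs in Python, here is a minimal sketch of the same two triggers as Python dictionaries. The helper trigger_kind is hypothetical and only distinguishes the two trigger types shown in this guide.

```python
# The two trigger styles from the list above, expressed as Python dictionaries.
file_arrival_trigger = {
    "file_arrival": {
        "url": "/Volumes/catalog/stg/sources/file_location",
        "min_time_between_triggers_seconds": 60,  # 60 seconds minimum between runs
        "wait_after_last_change_seconds": 60,  # wait 60 seconds after last file change
    }
}
periodic_trigger = {"periodic": {"interval": 1, "unit": "WEEKS"}}


def trigger_kind(trigger: dict) -> str:
    """Classify a trigger dictionary as 'event-based' or 'scheduled'."""
    if "file_arrival" in trigger:
        return "event-based"
    if "periodic" in trigger:
        return "scheduled"
    raise ValueError(f"unsupported trigger keys: {sorted(trigger)}")


print(trigger_kind(file_arrival_trigger))
print(trigger_kind(periodic_trigger))
```

Either dictionary can be placed under the "trigger" key of a job definition, as the periodic one is in the example later in this guide.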
Orchestration with Databricks Asset Bundles#
The second part of the guide describes how to create a basic workflow with two tasks using Databricks Asset Bundles with Python.
1. Before creating a new Databricks project, authenticate against your workspace.
Authenticate by running databricks configure or databricks auth login --host https://adb-XXXX.azuredatabricks.net
2. Now create a new project by running the command below:
databricks bundle init experimental-jobs-as-code
When prompted to include the different stubs, answer yes to all of them.
3. Deploy the bundle to the dev target by running databricks bundle deploy --target dev
In the Databricks UI, under 'Jobs & Pipelines', you can see that the job has been deployed. Click on the job to see the workflow.
Open the file resources\jobs_as_code_project_job.py to see the code that defines the job.
4. Taking a closer look at the job, we can see that it consists of two tasks: one that runs a notebook and one that runs a Python wheel. It is scheduled to run every day, as defined by the trigger key.
The Python job definition is shown below:
Example: Python Job Definition Code
from databricks.bundles.jobs import Job

"""
The main job for jobs_as_code_project.
"""

jobs_as_code_project_job = Job.from_dict(
    {
        "name": "jobs_as_code_project_job",
        "trigger": {
            # Run this job every day, exactly one day from the last run; see https://docs.databricks.com/api/workspace/jobs/create#trigger
            "periodic": {
                "interval": 1,
                "unit": "DAYS",
            },
        },
        # "email_notifications": {
        #     "on_failure": [
        #         "xxxx@novonordisk.com",
        #     ],
        # },
        "tasks": [
            {
                "task_key": "notebook_task",
                "job_cluster_key": "job_cluster",
                "notebook_task": {
                    "notebook_path": "src/notebook.ipynb",
                },
            },
            {
                "task_key": "main_task",
                "depends_on": [
                    {
                        "task_key": "notebook_task",
                    },
                ],
                "job_cluster_key": "job_cluster",
                "python_wheel_task": {
                    "package_name": "jobs_as_code_project",
                    "entry_point": "main",
                },
                "libraries": [
                    # By default we just include the .whl file generated for the jobs_as_code_project package.
                    # See https://docs.databricks.com/dev-tools/bundles/library-dependencies.html
                    # for more information on how to add other libraries.
                    {
                        "whl": "dist/*.whl",
                    },
                ],
            },
        ],
        "job_clusters": [
            {
                "job_cluster_key": "job_cluster",
                "new_cluster": {
                    "spark_version": "15.4.x-scala2.12",
                    "node_type_id": "Standard_D3_v2",
                    "data_security_mode": "SINGLE_USER",
                    "autoscale": {
                        "min_workers": 1,
                        "max_workers": 4,
                    },
                },
            },
        ],
    }
)
Here is a breakdown of the key orchestration features used in the example above:
Scheduling:
- The trigger section uses periodic scheduling with interval: 1 and unit: "DAYS".
- The job runs every day, exactly one day after the previous run.
Task Dependencies:
- main_task has a depends_on configuration referencing notebook_task.
- This creates an orchestration dependency: main_task only runs after notebook_task succeeds.
- Dependencies ensure the proper execution order and prevent processing of incomplete data.
Task Types:
- Notebook Task: runs the notebook at src/notebook.ipynb.
- Python Wheel Task: executes packaged Python code with entry point main.
- Both tasks share the same cluster via job_cluster_key.
Cluster Configuration:
- An auto-scaling cluster (1-4 workers) is defined in job_clusters.
- A shared cluster reduces costs while maintaining task isolation.
- The cluster terminates automatically after the job completes.
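To make the dependency and cluster-sharing points concrete, the sketch below derives the execution order from the depends_on entries and checks that every job_cluster_key used by a task is defined in job_clusters. It uses only the Python standard library and copies just the relevant fields from the job definition above. The helper names execution_order and undefined_clusters are hypothetical, and the sketch illustrates the idea rather than how Databricks resolves dependencies internally.

```python
from graphlib import TopologicalSorter

# Only the fields relevant to ordering and cluster sharing, copied from the example.
tasks = [
    {"task_key": "notebook_task", "job_cluster_key": "job_cluster"},
    {
        "task_key": "main_task",
        "depends_on": [{"task_key": "notebook_task"}],
        "job_cluster_key": "job_cluster",
    },
]
job_clusters = [{"job_cluster_key": "job_cluster"}]


def execution_order(tasks: list[dict]) -> list[str]:
    """Return task keys in an order that respects every depends_on entry."""
    # Map each task to the set of tasks it depends on (its predecessors).
    graph = {
        task["task_key"]: {dep["task_key"] for dep in task.get("depends_on", [])}
        for task in tasks
    }
    return list(TopologicalSorter(graph).static_order())


def undefined_clusters(tasks: list[dict], job_clusters: list[dict]) -> set[str]:
    """Return cluster keys referenced by tasks but missing from job_clusters."""
    defined = {cluster["job_cluster_key"] for cluster in job_clusters}
    used = {task["job_cluster_key"] for task in tasks if "job_cluster_key" in task}
    return used - defined


print(execution_order(tasks))  # ['notebook_task', 'main_task']
print(undefined_clusters(tasks, job_clusters))  # set()
```

A check like this can be run locally before deploying a bundle to catch a misspelled task_key or job_cluster_key early.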
This concludes the guide.
Official Docs:
[1] https://docs.databricks.com/aws/en/jobs/
[2] https://docs.databricks.com/aws/en/dev-tools/bundles/pipelines-tutorial
[3] https://docs.databricks.com/aws/en/dev-tools/bundles/python/