Overview

Introduction#

Source system types fall into two categories, Pull Mechanism and Push Mechanism, based on how their data is ingested into Databricks.

This chapter covers bringing in or receiving data from an external system into the landing zone. There are two ways to get data into the landing zone:

  • 1) Create an integration within a Databricks pipeline that connects to the source and fetches the data
  • 2) Have the data pushed into the landing zone

This chapter discusses approach 2, where a web API is designed and published so that data providers can push their data into the landing zone.
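
As a rough illustration of approach 2, the sketch below shows a minimal push endpoint built only on the Python standard library. It is not the playbook's actual API: the route `/ingest/<provider>`, the `LANDING_ROOT` directory, the file naming scheme, and the helper `land_payload` are all assumptions made for this example, and authentication is deliberately left out.

```python
import json
import tempfile
import uuid
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

# Hypothetical landing-zone root; a real deployment would point at cloud storage.
LANDING_ROOT = Path(tempfile.gettempdir()) / "landing_zone_demo"


def land_payload(root: Path, provider: str, body: bytes) -> Path:
    """Write one pushed payload, unmodified, under <root>/<provider>/<UTC date>/."""
    if not provider.isalnum():
        # Rejects empty names and path tricks such as "../".
        raise ValueError("provider must be alphanumeric")
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    target_dir = root / provider / day
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{uuid.uuid4().hex}.json"
    target.write_bytes(body)
    return target


class PushHandler(BaseHTTPRequestHandler):
    """Accepts POST /ingest/<provider> with a JSON body (assumed route)."""

    def do_POST(self):
        parts = self.path.strip("/").split("/")
        if len(parts) != 2 or parts[0] != "ingest":
            self.send_error(404, "unknown route")
            return
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            json.loads(body)  # only checks that the body parses as JSON
            stored = land_payload(LANDING_ROOT, parts[1], body)
        except ValueError:
            self.send_error(400, "invalid provider or JSON body")
            return
        self.send_response(201)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps({"stored_as": stored.name}).encode())


def serve(port: int = 8080) -> None:
    """Run the sketch API until interrupted."""
    HTTPServer(("0.0.0.0", port), PushHandler).serve_forever()
```

The payload is stored exactly as received, so a later landing-to-bronze pipeline can do the parsing and validation. Calling `serve()` starts the endpoint; a provider would then POST JSON to `/ingest/<provider>`.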

Pull Mechanism#

In this scenario, Databricks actively initiates data retrieval through APIs, queries, or batch reads from any of the source system types below.

| Source System Types | Examples |
| --- | --- |
| Databases (Relational and NoSQL) | Oracle, SQL Server, PostgreSQL |
| Data Warehouses | Amazon Redshift, Snowflake, SAP BW |
| File-based Sources | CSV, JSON, XML, Parquet, Avro, ORC, Excel stored on cloud or local file systems (e.g., HDFS) |
| Data Lakes | Amazon S3, Azure Data Lake Storage (ADLS), Hadoop Distributed File System |
| APIs (REST or SOAP) | Salesforce, SAP, ServiceNow |
| Public Datasets or Open Data Platforms | Government Open Data Portals |
| Distributed Systems | Read directly from Hadoop DFS or other distributed storage |
| Cloud Services/Platforms (Batch Reads) | Azure Blob Storage, Amazon Glacier |
| Virtualized Sources | Query via Presto |
| Proprietary Systems (custom solutions) | Custom applications where Databricks queries data using APIs or database connectors |
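
To make the pull mechanism concrete, here is a minimal sketch of a batch pull from a relational database using Spark's JDBC reader. The helper names `jdbc_options` and `pull_table`, the PostgreSQL URL shape, and the choice to land the result as Parquet are assumptions for illustration; the `spark` argument is expected to be the active `SparkSession`.

```python
def jdbc_options(host, port, database, table, user, password, fetch_size=10_000):
    """Build the options for Spark's JDBC reader (PostgreSQL assumed here)."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "fetchsize": str(fetch_size),  # rows fetched per round trip
    }


def pull_table(spark, options, landing_path):
    """Databricks initiates the read, then appends the rows to the landing zone."""
    df = spark.read.format("jdbc").options(**options).load()
    df.write.mode("append").parquet(landing_path)
    return df
```

In a Databricks notebook the password would normally come from a secret scope (for example via `dbutils.secrets.get`) rather than being hard-coded.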

Push Mechanism#

In this scenario, Databricks listens for and consumes data streams or pushes from sources. The table below lists the types of sources that can be used with a push mechanism.

| Source System Types | Examples |
| --- | --- |
| Streaming Data Sources | Apache Kafka, Azure Event Hubs, Apache Flink |
| Message Queues | Amazon SQS |
| Enterprise Data Integration Platforms (Push) | Informatica, Fivetran |
| Cloud Services (Streaming) | Confluent |
| Delta Sharing (Push model) | A Databricks-native platform for secure sharing of real-time data |
| SaaS Applications (Push Events) | Salesforce (event notifications), SuccessFactors |
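
For a streaming source such as Apache Kafka, consuming pushed data might look like the sketch below, which uses Spark Structured Streaming. The helper names, the checkpoint path, and the target table are hypothetical, and the stream is written straight to a Delta table purely as an example of one possible destination.

```python
def kafka_source_options(bootstrap_servers, topic, starting_offsets="latest"):
    """Build the options for Spark's Kafka streaming source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }


def start_push_ingestion(spark, options, checkpoint_path, target_table):
    """Subscribe to the topic and continuously append incoming events to a table."""
    stream = spark.readStream.format("kafka").options(**options).load()
    # Kafka delivers key and value as binary; cast them to strings for storage.
    decoded = stream.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "timestamp",
    )
    return (
        decoded.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .toTable(target_table)  # starts the query and returns a StreamingQuery
    )
```

The checkpoint location lets the query resume from its last processed offsets after a restart.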

Process Flow#

The process flow describes the steps an engineer should follow when designing the ingestion pipeline. Depending on the desired configuration, you would implement one or more of these pipelines:

  • 1 - Implement Pipeline - Source to Landing (uses the pull mechanism; if your solution does not pull data, skip this pipeline, because the solution provides another means of getting data into the landing zone, e.g. an API)
  • 2 - Implement Pipeline - Landing Zone to Bronze (used only if your solution stores data in the landing zone)
  • 3 - Implement Pipeline - Direct Ingestion Into Bronze (the only pipeline to implement if your solution does not need to store data in the landing zone)
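
The selection rules in the list above can be written down as a small decision function. This is only one reading of those rules; the `IngestionConfig` fields and the function name are hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass

SOURCE_TO_LANDING = "1 - Source to Landing"
LANDING_TO_BRONZE = "2 - Landing Zone to Bronze"
DIRECT_TO_BRONZE = "3 - Direct Ingestion Into Bronze"


@dataclass(frozen=True)
class IngestionConfig:
    pulls_from_source: bool  # True if Databricks fetches the data itself
    uses_landing_zone: bool  # True if raw data is stored in the landing zone first


def pipelines_to_implement(cfg: IngestionConfig) -> list:
    """Return the pipelines a solution needs, following the three rules above."""
    if not cfg.uses_landing_zone:
        # Without a landing zone, only direct ingestion applies.
        return [DIRECT_TO_BRONZE]
    pipelines = []
    if cfg.pulls_from_source:
        pipelines.append(SOURCE_TO_LANDING)
    # Data in the landing zone always needs a pipeline into bronze.
    pipelines.append(LANDING_TO_BRONZE)
    return pipelines
```

For example, a solution whose providers push data through an API into the landing zone (`pulls_from_source=False`, `uses_landing_zone=True`) needs only pipeline 2.
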
When should I use or skip the landing zone?

Process Diagram