Overview
Introduction
Source systems fall into two categories, Pull Mechanism and Push Mechanism, based on how their data is ingested into Databricks.
This chapter covers bringing in or receiving data from an external system into the landing zone. There are two ways to get data into the landing zone:
- 1) Create an integration within a Databricks pipeline that connects to the source and fetches the data
- 2) Have the data pushed into the landing zone
This chapter discusses approach 2, where a web API is designed and published so that data providers can push their data into the landing zone.
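As a rough illustration of approach 2, the sketch below shows how a web API might accept a pushed JSON payload and store it unchanged in the landing zone. The names `land_payload`, `PushHandler`, and `LANDING_ROOT`, as well as the folder layout (provider/year/month/day), are assumptions made for this example and are not part of the playbook. A real deployment would most likely sit behind an authenticated gateway and write to cloud storage rather than a local path.

```python
import json
import re
import uuid
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler
from pathlib import Path

# Hypothetical landing-zone root; in practice this would be a cloud storage location.
LANDING_ROOT = Path("/tmp/landing")


def land_payload(landing_root: Path, provider: str, body: bytes) -> Path:
    """Validate a pushed JSON body and store it as-is in the landing zone."""
    # Reject provider names that could escape the landing folder.
    if not re.fullmatch(r"[A-Za-z0-9_-]+", provider):
        raise ValueError(f"invalid provider name: {provider!r}")
    json.loads(body)  # raises ValueError if the body is not valid JSON

    now = datetime.now(timezone.utc)
    target_dir = landing_root / provider / now.strftime("%Y/%m/%d")
    target_dir.mkdir(parents=True, exist_ok=True)

    # A timestamp plus a random suffix keeps concurrent pushes from colliding.
    target = target_dir / f"{now.strftime('%H%M%S')}_{uuid.uuid4().hex}.json"
    target.write_bytes(body)
    return target


class PushHandler(BaseHTTPRequestHandler):
    """Minimal POST endpoint: the URL path is treated as the provider name."""

    def do_POST(self):
        provider = self.path.strip("/")
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            land_payload(LANDING_ROOT, provider, body)
        except ValueError:
            self.send_response(400)  # bad provider name or malformed JSON
        else:
            self.send_response(201)  # payload stored in the landing zone
        self.end_headers()
```

To try the handler locally, it can be served with `http.server.HTTPServer(("localhost", 8080), PushHandler).serve_forever()`. The payload is stored without transformation so that downstream pipelines always work from the data exactly as the provider sent it.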
Pull Mechanism
In this scenario, Databricks actively initiates data retrieval via APIs, queries, or batch reads from any of the source system types below.
| Source System Types | Examples |
|---|---|
| Databases (Relational and NoSQL) | Oracle, SQL Server, PostgreSQL |
| Data Warehouses | Amazon Redshift, Snowflake, SAP BW |
| File-based Sources | CSV, JSON, XML, Parquet, Avro, ORC, Excel stored on cloud or local file systems (e.g., HDFS) |
| Data Lakes | Amazon S3, Azure Data Lake Storage (ADLS), Hadoop Distributed File System |
| APIs (REST or SOAP) | Salesforce, SAP, ServiceNow |
| Public Datasets or Open Data Platforms | Government Open Data Portals |
| Distributed Systems | Read directly from Hadoop DFS or other distributed storage |
| Cloud Services/Platforms (Batch Reads) | Azure Blob Storage, Amazon Glacier |
| Virtualized Sources | Query via Presto |
| Proprietary Systems (custom solutions) | Custom applications where Databricks queries data using APIs or database connectors |
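A minimal sketch of a pull, assuming a relational source: the pipeline opens the connection, issues the query, and writes the result to a landing file in batches. To keep the example self-contained it uses Python's built-in `sqlite3` as a stand-in for the source database; on Databricks the read would more typically go through a Spark connector. The function name `pull_to_landing`, the JSON Lines output format, and the `batch_size` parameter are assumptions made for illustration.

```python
import json
import sqlite3
from pathlib import Path


def pull_to_landing(conn: sqlite3.Connection, table: str,
                    landing_file: Path, batch_size: int = 1000) -> int:
    """Batch-read every row of `table` and write it to `landing_file` as JSON Lines."""
    # The table name comes from trusted pipeline configuration, not from user input.
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]

    landing_file.parent.mkdir(parents=True, exist_ok=True)
    rows_written = 0
    with landing_file.open("w", encoding="utf-8") as out:
        while True:
            # fetchmany keeps memory bounded for large source tables.
            batch = cursor.fetchmany(batch_size)
            if not batch:
                break
            for row in batch:
                out.write(json.dumps(dict(zip(columns, row))) + "\n")
                rows_written += 1
    return rows_written
```

The key property of the pull mechanism is visible here: the pipeline, not the source system, decides when the read happens and how much data is fetched at a time.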
Push Mechanism
In this scenario, Databricks listens for and consumes data streams or pushes from sources. The table below breaks down the types of sources that can be used with a push mechanism.
| Source System Types | Examples |
|---|---|
| Streaming Data Sources | Apache Kafka, Azure Event Hubs, Apache Flink |
| Message Queues | Amazon SQS |
| Enterprise Data Integration Platforms (Push) | Informatica, Fivetran |
| Cloud Services (Streaming) | Confluent |
| Delta Sharing (Push model) | A Databricks-native platform for secure sharing of real-time data |
| SaaS Applications (Push Events) | Salesforce (event notifications), SuccessFactors |
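The listen-and-consume pattern can be sketched as a micro-batch loop: the source decides when messages arrive, and the consumer drains whatever is available. The example below uses Python's in-process `queue.Queue` purely as a stand-in for a message broker; the function name `consume_micro_batch` and its parameters are hypothetical and do not correspond to any Databricks or broker API.

```python
import queue


def consume_micro_batch(source: queue.Queue, max_messages: int,
                        timeout_s: float = 0.1) -> list:
    """Collect up to `max_messages` pushed messages, stopping early if the source goes quiet."""
    batch = []
    while len(batch) < max_messages:
        try:
            # Block briefly waiting for the source to push the next message.
            batch.append(source.get(timeout=timeout_s))
        except queue.Empty:
            break  # nothing more arrived within the timeout
    return batch
```

Capping the batch size and using a short timeout means the consumer neither waits indefinitely on a quiet source nor builds an unbounded batch when the source is busy.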
Process Flow
The process flow describes the activities and steps an engineer should follow when designing the ingestion pipeline. Depending on the desired configuration, you would implement one or more of these pipelines:
- 1 - Implement Pipeline - Source to Landing (this uses the pull mechanism; if your solution does not pull data, you will not implement this pipeline, because the solution provides another means of getting data into the landing zone, e.g. an API)
- 2 - Implement Pipeline - Landing Zone to Bronze (this is only used if your solution stores data in the landing zone)
- 3 - Implement Pipeline - Direct Ingestion Into Bronze (this is the only pipeline you implement if your solution does not need to store data in the landing zone)
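The selection rules in the list above can be sketched as a small decision function. The function name `pipelines_to_implement`, the two boolean flags, and the constant names are assumptions introduced for this example; they simply encode the conditions stated for each pipeline.

```python
SOURCE_TO_LANDING = "Source to Landing"
LANDING_TO_BRONZE = "Landing Zone to Bronze"
DIRECT_TO_BRONZE = "Direct Ingestion Into Bronze"


def pipelines_to_implement(pulls_from_source: bool,
                           stores_in_landing_zone: bool) -> list:
    """Return the pipelines a solution needs, following the process flow rules."""
    # Without a landing zone, direct ingestion is the only pipeline.
    if not stores_in_landing_zone:
        return [DIRECT_TO_BRONZE]

    pipelines = []
    # Source to Landing is only needed when the solution itself pulls the data;
    # otherwise something else (e.g. an API) delivers data to the landing zone.
    if pulls_from_source:
        pipelines.append(SOURCE_TO_LANDING)
    pipelines.append(LANDING_TO_BRONZE)
    return pipelines
```

For example, a solution where providers push files through an API into the landing zone would call `pipelines_to_implement(False, True)` and implement only the Landing Zone to Bronze pipeline.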