Skip to content

Data Engineering Guardrails and Best Practices - Data Ingestion and Integration#

Back to Menu

Version Control#

Version Date Owner Change Description
0.1 18 March 2025 Gareth Stretch Gareth Stretch, Shahed Mirza
0.2 26 March 2025 Gareth Stretch Split out each section to its own file :
0.3 30 April 2025 Gareth Stretch Capture MFT Guardrails and best practices :

Definition#

When documenting guardrails and best practices for a Databricks data platform, a good description of data ingestion patterns must include the following:

Batch Ingestion: This pattern involves loading data in discrete chunks at scheduled intervals. It's suitable for scenarios where real-time processing isn't critical, such as nightly updates or periodic data synchronization.

Streaming Ingestion: This pattern is used for real-time or near-real-time data processing. It continuously ingests data from sources like IoT devices, logs, or event streams, enabling immediate analysis and action.

Hybrid Ingestion: A combination of batch and streaming ingestion, this pattern is useful for systems that require both real-time insights and periodic comprehensive updates.

File-Based Ingestion / MFT: This involves ingesting data from files stored in cloud storage or on-premises systems. It's often used for structured or semi-structured data like CSV, JSON, or Parquet files.

Database Ingestion: This pattern extracts data directly from databases using connectors or APIs. It's ideal for integrating transactional data into the Databricks platform.

API-Based Ingestion: Data is ingested through APIs, often from external systems or third-party services. This pattern is common for integrating diverse data sources.

Event-Driven Ingestion: This pattern triggers data ingestion based on specific events, such as a new file upload or a database update. It's efficient for handling dynamic and unpredictable data flows

Each pattern should be accompanied by guardrails, such as ensuring data quality, monitoring for errors, and implementing security measures like encryption and access controls.

Data Ingestion and Integration#

Batch Ingestion#

Definition : Batch Data Ingestion refers to the process of loading data into the platform in sets or batches of rows or documents/files. This can be done on a scheduled basis (e.g., daily) or triggered manually. It represents the "extract" piece of traditional extract, transform, load (ETL) use cases.

Batch data ingestion is crucial for efficiently handling large volumes of data where timeliness is not crucial/critical

Guardrails#

Topic Description Justification
Batch Size Limit the batch size Optimal batch size guarantees optimal price / performance ratio, reduce infrastructure failures and bottlenecks
File Formats Encourage the Use of Open Interfaces and Open Formats It ensures interoperability and re-usability of ingested data
Naming Conventions Use common naming conventions for file names and directory names
Functional separation Files of same functionality or types can be put in the same directories and sub-folders. Do not mix files of different functionalities or types in the same directory/folders
File Metadata Maintain metadata for each ingested files clearly documenting name of file, file provider, file creation time, file integrity information etc

Standards#

Topic Description Justification
Data Volume Limits Set maximum limits for data ingestion per hour to ensure system stability. For example, ingesting up to 200 GB of data per hour using batch sources like Azure Blob, Amazon S3, and SFTP Setting limits can help in optimal system performance, Processing needs, identify maximum scaling, predict costs
Batch Size Define acceptable batch sizes, typically between 256 MB and 100 GB, to optimize performance and avoid system overload Setting limits can help in optimal system performance,Processing needs, identify maximum scaling, predict costs
File Limits Limit the number of files per batch to prevent excessive load on the system. For instance, a maximum of 1500 files per batch Setting limits can help in optimal system performance, Processing needs, identify maximum scaling, predict costs
Batch Frequency Set limits on the number of batches ingested per day to maintain system performance. A recommended limit might be 90 batches per day Setting limits can help in optimal system performance, Processing needs, identify maximum scaling, predict costs
Encryption Ensure encrypted data files do not exceed a certain size, such as 1 GB per file, to maintain security and performance Encrypted files need processing power to de-crypt while trying to load them and can cost for each file being decrypted

Best Practice#

Topic Description Recommendation
Data Partitioning Organize data into partitions to improve query performance and manageability. This helps in efficiently processing large datasets. 1.) Keep records to below 2 million, 2.) define partition strategies per data topic,eg you would apply a different partition strategy for products as you would for sales transactions. NB. Partition the data on attributes that equally distribute data within the partition
Implement Incremental data loads You should by default design with an incremental load strategy for all data ingestion in order to reduce processing time and platform overheads. While sometimes it is un-avoidable to use full load, developers and architects should look for opportunities to do incremental load. Benefits include reduction in data movement, data processing and if there is no change in delta, there is no point in re-processing such files
Implement Full data loads You should only implement full data loads in scenarios where there is a requirement to do an initial load, or bring a table back in sync that requires a reload. Full data loads are needed when there is migration from one system to another or there is a need to sync source and target due to missed updates/deletes/inserts
Compression Use compression techniques to reduce the size of data being ingested, which can lead to faster transfer times and lower storage costs Open text format files like .CSV,.JSON are good candidates for compressions. When transferring such files from one cloud to another or from cloud to on-premise, make sure to compress them before transfer. This will reduce egress charges from the cloud provider
Monitoring and Logging Implement monitoring solutions to track ingestion performance and identify bottlenecks. Logging helps in troubleshooting and maintaining data quality Always keep track of ingestion performance and look for ways to improve ingestion timings and reduce processing costs
Schema Evolution Handle schema changes gracefully by using tools like Delta Lake, which supports schema evolution and enforcement While evolution is not supported by traditional systems like Hive storage, advanced technologies like Delta lake, iceberg help schema evolution
Data Validation Perform data validation checks to ensure data quality before ingestion. This includes checking for null values, duplicates, and data consistency Basic data quality checks in earlier stages of ingestion helps reduce major errors later in a pipelines. Certain checks before ingestion helps identify errors early and helps in system stability
Batch Scheduling Schedule batch jobs during off-peak hours to minimize impact on system performance and ensure timely data availability As a general practice, load data during low peak hours. Data loading might need excessive resources and can impact user interactions with the systems
Error Handling Implement robust error handling mechanisms to manage and recover from ingestion failures Errors should be gracefully handled and must have a clear path to next steps in case of a failure. e.g. whether to continue, abort, skip etc
Single Record failure handling You need to implement a strategy to handle a scenario where only a few records fail out of the batch Create a new file to store the failed records for later processing, continue the job.
Whole file failure handling You need to implement a strategy to handle the scenario where all records fail in the batch. When a complete file fails in a batch job, error handling should have clear path to completion. E.g. if an MDM file has failed, it is not a good idea to continue, as other files/tables depend on it
Corrupted File failure handling Corrupted files can be shared from sources either due system writing failures, network errors and others. Such files should be identified early and moved to separate folders to avoid re-processing Corrupted files should be identified using algorithm on a case be case basis. E.g. an in-valid xml file cannot be loaded into data lake and should be moved as error file and systems error catalog should be updated

Streaming Ingestion#

Definition :#

Streaming ingestion refers to the process of continuously ingesting data in near real-time from various sources into Databricks. This is achieved using microbatches, where data is processed in small, frequent batches, ensuring that the data is always up-to-date.

Streaming ingestion is essential for applications that require immediate data processing and analysis, providing businesses with the ability to make timely and informed decisions based on the most current data available

Guardrails#

Topic Description Justification
Detective Guardrail Make sure stream files received from stream service are within acceptable size range and would not be empty, to huge or too small. E.g. a Json file for IoT device can be between 2kb to 20kb, other files received can be error or corrupted Early detection of errors reduces major problems later in a pipeline
Corrective Guardrail Stream data should be within acceptable character sets e.g. characters and numbers, other characters like special characters are out of range. Such files should either be corrected or errored out Have a defined character-set that is allowed within a file e.g. [A-Z][a-z][0-1] puntuations:[] {} (), files containing characters outside a characterset are either generated in error by data provider or are corrupted files
Preventive Guardrail Once a stream file is formed and is ready for processing, make basic checks on validity, size and content of the file before proceeding to processing.Errored files should be marked as invalid and moved to error folders Sometimes files are part formed during stream process and may be corrupted. basic validity checks in early stages of a pipeline help reducing major errors later in a pipeline
Minimum Lag Data consumer lag should be near zero, meaning delay between data producer and data consumers should be minimum normally calculated as lag = messages / (consume rate per second - produce rate per second) Always monitor data lag between data producer and data consumers. This should be near zero. Less the data lag, more the data is realtime
Message brokers Monitor your brokers for network throughput—both transmit (TX) and receive (RX)—as well as disk I/O, disk space, and CPU usage. Messages received in realtime should be monitored for network lag/delays and should also monitor usage as far as CPU, Disk i/o and Disk space usage
Data Quality Monitoring This guardrail includes Schema Validation, Data Cleanliness, Time-based validations Schema validation helps in maintaining data consistency, Data cleanliness helps avoid duplicates, incomplete or erroneous data thus improving quality of data lake
Scalability and Load management Scalability and load management typically involves auto-scaling and backpressure handling Stream ingestion should auto-scale up or scale down depending on increasing or decreasing load. Such functionalities are provided by almost all cloud service providers.Certain spikes in incoming data can overwhelm Streaming systems. Ensure such scenarios are handled gracefully without losing any data
Latency and Real-time processing Streaming systems should always try for near zero latency and provide real-time analytics While Streaming can never be real-time due to certain issues like lag, disk i/o and network speeds, always strive for near-zero latency. Real-time pipeline ingestions help in real-time insights on data as it's ingested
Fault Tolerence and Reliability Ingestions systems should be fault tolerance and data should be idempotent Use appropriate data replication mechanisms so that in case of failures, copies of source data can be re-submitted. Duplicate data is safely processed without adverse impact on the system

Best Practice#

Topic Description Recommendation
Same type of Data If streams contains same type of data, it is better to perform incremental refresh. in databricks, a job can be schedule to run every 5minutes/half hour/hourly/etc using trigger.Once of Trigger.AvailableNow to process incrementable date. This will reduce processing costs and always ON streaming If data from a stream is same type and some delay is accepted, it is better to use triggered file pulls. This will reduce always ON processing costs for new data
Different types of data/files Stream data received in different file formats viz. CSV,JSON etc can be uploaded using databricks COPY INTO commands. This command can be repeatedly run landing zone and it will only process new files, taking care of incremental data COPY INTO only loads new files and based on file types, orchestrations can be designed to help load different types of files into different tables for further processing
Large amount of files with different types Above techniques works well with limited number of files received, but if there are large number of files, it is advisable to use AutoLoader When there is a large number of files received from the source, a COPY INTO command may slow down. AutoLoader functionality will take care of large number and huge files upload
DLT pipelines Delta Live tables is new feature offered by databricks allows to build pipelines for streaming data.This is an efficient way of loading data from streams DLT tables offers simplified pipeline management and enhanced data quality checks.It uses declarative approach, unified batch and streaming experience, incremental processing and comes with built-in monitoring and management.
Modular and decoupled architecture Separate ingestion process into modules based on functionality e.g. ingestion, transformation and storage. Use microservices or serverless architectures to create independent, scalable components for different parts of the pipeline Modular design helps better scalability, maintainability and fault isolation while microservices or serverless models helps independent processing and error handling
Partitioning and Sharding Partition ingested data based on logical categories e.g. hourly partition or source data provider etc. High volumes of streams can be handled using sharding Partitioning help improve query performance in a data lake and allows parallel processing. Sharding divides large volumes of data into smaller manageable chunks for processing
Data Transformation and enrichment Deploy real-time ETL processes to clean, filter and enrich incoming data before storing it in the data lake. Event based enrichment allows enriching incoming events with metadata or lookups In both real-time and event based data enrichment helps in producing more meaningful data before storing it
Monitoring and Alerting Monitoring helps in keeping an eye on the health of streaming systems and Alerts helps in providing immediate attention to failures, delays and inconsistencies in data Implementing monitoring dashboards to track the health and performance of Streams can be monitored by operations teams. Cloud providers provide inbuilt services for monitoring activities. Alerting can be setup to create Incident tickets, email notification or even Phone notifications in case of failures and other issues in pipeline processing
Data Storage and Management High frequency real-time data should be segregated from infrequent historical data. Data formats should be selected which is efficient and provide good query performance High Frequency real-time data should be stored separated from historical data e.g. in AWS S3 Glacier storage or Azure archive storage. Columnar data storage formats like Parquet or ORC helps efficient storage and query performance. On top of that, Compression techniques can help reducing data storage costs
Cost management Cost-effective services like managed services e.g. AWS Kinesis or Azure Even hub are useful in smaller streaming applications while Self-managed solutions e.g. Apache Kafka are cost-effective at scale. Auto shrinking/scaling for storage helps in cost based on usage pattens Managed services are useful small-scale streaming applications while self-managed services are cost effective at large scale. Moving data from live storage to archive storage helps reduce storage costs on high availability disks
Data Lineage and Traceabilty Stream data should be able to traced back to its source. Changes to the source data should be maintained by immutable log of events Medallion architecture helps keep trace of stream data through different layers. Attaching file/stream name and timestamp to a record of data also helps in tracing back to original file/stream. Any changes on the data can be maintained in read-only logs file/table

API Ingestion#

Definition :#

API ingestion refers to the process of importing data into Databricks from various external sources using APIs (rest, web services, custom api's). This method allows for the integration of data from third-party applications, and other external systems.

API ingestion is crucial for integrating diverse and real-time data sources into Databricks, enabling efficient data processing and analysis.

Guardrails#

Topic Description Justification
REST API/CURL requests Make sure REST API calls are for authentic sources and vetted by Data security and compliance teams. Appropriate ports and access should be opened within network to access REST API systems REST API Sources should be authentic, any changes in URLs should be vetted by data security and compliance teams. If correct ports are no open, the requests may fail or will try for indefinite period of time
REST API/CURL requests Method to PUSH (POST) or PULL (GET) should be clearly defined and the connection to the hosts should be steady until complete URL response is received Make sure while making REST API/Curl requests, connection is live and there is enough network bandwidth available to handle response
Storage of original dataset store the data in the original format Allows you to reprocess the dataset
Security Protocol Always insist on using secure HTTPS protocol for handshake and data transfer.Make sure SSL certificates are valid and up-to date Using HTTPS protocols help evade man in the middle type of attacks when connecting over the internet and transferring sensitive data
Secure API Keys Store API keys and credentials securely using Databricks secrets management. Avoid hardcoding sensitive information in notebooks
Schema Validation Validate the schema of the incoming data to ensure it matches the expected format
Retry Logic Implement retry mechanisms to handle transient errors and ensure data is eventually ingested successfully. Use exponential backoff strategies to manage retries

Best Practice#

Topic Description Recommendation
Compute Cluster Configuration For Batch ingestion configure standard compute to be active only during ingestion time Cost saving.
Resource Naming Use well defined names for APIs so as not to confuse data consumers.Use nouns for resource endpoints and nesting urls in case of variable interface Using well defined and distinguished endpoint names helps reducing endpoint ambiguity and provide clear guidance
HTTP Protocol Transferring data over the network through API calls should always be secured with SSL certificates. Using HTTPS protocols helps encrypting data transfer reducing risks of man in the middle type and certificate pinning types of cyber attacks Secure protocol while sending and receiving data over internet masks data and important information like credentials
Authentication API end points should always ask for credentials in order to provide access to data. Other techniques like certificate authentication are also secure An API endpoint without authentication is vulnerable to many attacks like SQL Injection,Cross-Site Scripting (XSS),Cross-Site Request Forgery (CSRF),Broken Access Control,Insecure Direct Object References,Security Misconfiguration etc
Handling Errors Handle errors gracefully on both sender and receiver side. Once an error occurred, connections should be disconnected and appropriate course of actions to be followed Keeping connections open to the URL will impact provider throughput, and also impact receiver network bandwidth
Documentation Maintain clear and concise documentation of all the APIs available from the provider with well defined use cases of those APIs Good documentation is always helpful to writing well formed API calls and helps new comers to get up to speed with APIs
Efficient Data Processing Batch vs. Streaming: Choose between batch and streaming ingestion based on the use case. For real-time data, use Databricks Structured Streaming Cost Saving
Landing Zone Store raw API data in a landing zone (e.g., a dedicated storage account or directory) before processing. his helps in isolating raw data from processed data.

Managed File Transfer#

Definition :#

Managed file transfer (MFT) is a technology platform that allows organizations to reliably exchange electronic data between systems and people, within and outside the enterprise, securely and in compliance with applicable regulations.

Guardrails#

Guardrail Description
Secure Transfer Protocols Use secure protocols like SFTP or FTPS to ensure data is encrypted in transit.
Access Control Implement strict access controls to ensure only authorized users can access data.
Data Validation Validate the schema and format of incoming data to ensure it meets expectations.
Error Logging Implement robust error logging to capture and analyze issues during ingestion.
Retry Mechanisms Implement retry mechanisms to handle transient failures and ensure data ingestion.
Monitoring Set up monitoring to track the performance and health of the ingestion pipeline.
Compliance Ensure the data transfer process complies with relevant regulations and standards.
Documentation Maintain comprehensive documentation of the ingestion process and data lineage.

Best Practices#

Best Practice Description
Landing Zone Store raw data in a landing zone before processing to isolate raw from processed data.
Bronze, Silver, Gold Layers Organize data into Bronze (raw), Silver (cleaned), and Gold (aggregated) layers.
Incremental Loads Use incremental loading techniques to process only new or changed data.
Parallel Processing Leverage Databricks' distributed computing capabilities for parallel processing.
Performance Tuning Optimize performance by tuning cluster configurations and using efficient file formats.
Automation Use Databricks Jobs or Workflows to automate the ingestion process.
Data Quality Checks Implement checks for data completeness, accuracy, and consistency.
Scalability Design the ingestion pipeline to scale horizontally to handle increasing data volumes.

By adhering to these guardrails and best practices, you can ensure a robust, secure, and efficient data ingestion process from Managed File Transfer systems into Databricks.

Back to As-Is Architecture for ingestion
Back to Target Architecture for ingestion
View Gaps and Recommendations