Skip to content

Data Engineering Guardrails and Best Practices - Data Processing and Transformation#

Back to Menu

Version Control#

Version Date Owner Change Description
0.1 18 March 2025 Gareth Stretch Gareth Stretch, Shahed Mirza

Definition

Data processing and data transformation are some of the activities where we extract data from different systems, merge it, clean it and use it in subsequent data analyses. From different piles, we usually make one big but usable pile of data. After the data analyses, the visualizations take a turn.

Data transformation is a critical part of the data integration process in which raw data is converted into a unified format or structure. Data transformation ensures compatibility with target systems and enhances data quality and usability. It is an essential aspect of data management practices including data wrangling, data analysis and data warehousing.

Data Processing and Transformation#

Data Enrichment#

Definition : Data enrichment is a critical process in data management that enhances raw data’s utility, accuracy, and richness. Data enrichment, meaning merging data from external sources with an existing database, helps refine, elaborate upon, or otherwise improve the original dataset. This process is essential in maintaining data integrity for businesses and organizations that rely on data for decision-making, strategizing, and maintaining competitive advantages.

Guardrails#

Topic Description Recommendation
Data Cleansing Before enriching, ensure your existing data is clean, accurate, and consistent Data should not contain invalid data, NULLS and undefined characters. Such data errors must be removed before enrichment process can begin
Remove Duplicates and Errors Identify and rectify inaccuracies, and standardize data formats unless duplicates are necessary, it is best practice to remove duplicates from source. Enriching involves updating existing recordset, and duplicates results in multiple record updates
Data Munging Use data munging techniques to enhance dataset quality Also called data wrangling helps in cleaning, transforming and preparing raw data into stable format for analysis and insights

Best Practice#

Topic Description Recommendation
Choose Reliable Sources Select reputable and trustworthy data providers offering accurate, up-to-date, and relevant information If same data is available from multiple sources, a choice of most reliable, reputable and trustworthy data providers is recommended
Evaluate Data Quality Consider factors like data quality, coverage, and compliance with privacy regulations Stringent data privacy laws and regulations require maintaining accurate, up-to-date data. Data enrichment helps ensure compliance by updating outdated information and validating the accuracy of datasets, which reduces legal and financial risks
Validate Data Against Trusted Sources Ensure the credibility of your data sources by validating them against trusted external data Enriched data can be compared with other external trusted sources for validity, correctness and upto date information
Set Specific Goals Determine clear objectives for data enrichment, like enhancing customer profiles or improving segmentation, with measurable targets Goals of data enrichment should be clear precise and measurable
Identify Data Needs Understand what information is missing from your existing datasets and what data would be most valuable for your business goals It is always a good practice to identify how an existing dataset can be enriched with missing information that will help for achieving business goals
Prioritize Data Focus on adding data that adds significant insight and value, such as behavioral patterns to refine marketing tactics Priority should be given to enrichments that adds significant insight and business value to a dataset

Data Type Conversions#

Definition : Performing data type conversions in a Medallion architecture is crucial for several reasons, particularly as data progresses through the Bronze, Silver, and Gold layers. By performing data type conversions at appropriate stages in the Medallion architecture, you can ensure that your data is clean, consistent, and ready for high-quality analysis and reporting.

  1. Data Quality and Consistency Standardization: Converting data types ensures that data is standardized across different sources, making it consistent and reliable. Error Reduction: Proper data type conversions help in identifying and correcting data type mismatches early, reducing errors in downstream processing.
  2. Improved Data Processing Efficiency: Converting data to appropriate types can improve the efficiency of data processing operations, such as joins, aggregations, and transformations. Compatibility: Ensures compatibility with various data processing tools and frameworks, facilitating smoother data integration2.
  3. Enhanced Data Analysis Accurate Analysis: Ensures that numerical data is treated as numbers, dates as dates, etc., leading to more accurate analysis and insights. Advanced Analytics: Enables the use of advanced analytical techniques and machine learning models that require specific data types.
  4. Data Governance and Compliance Data Integrity: Maintains data integrity by ensuring that data conforms to expected formats and types2. Regulatory Compliance: Helps in meeting regulatory requirements by ensuring data is stored and processed in compliant formats.
  5. Simplified Data Management Ease of Use: Simplifies data management by ensuring that data is in a predictable and usable format throughout its lifecycle. Maintenance: Reduces the complexity of maintaining data pipelines by avoiding the need for repeated type conversions.
  6. Performance Optimization Query Performance: Optimizes query performance by ensuring that data is stored in the most efficient format for the underlying database or data lake. Resource Utilization: Improves resource utilization by reducing the overhead associated with on-the-fly type conversions.

Guardrails#

Topic Description Recommendation
Conversion Type Always use explicit conversions (like CAST or CONVERT) for clarity and control While implicit type conversion are easy and require less effort, explicit type conversion helps others to understand what the code is doing and helps in error handling
Errors in Conversion Handle potential errors gracefully (using TRY_CAST or TRY_CONVERT) In case for converting different data types, like character to intergers, it is always a good practice to handle type conversion errors gracefully
Implicit Conversion Use implicit conversion when you are sure the source data is always of desired format and does not contains NULLs Write comments in the code, wherever implicit conversion is being used

Best Practice#

Topic Description Recommendation
Source and Target Data Types Before converting, thoroughly understand the characteristics of both the source and target data types, including precision, measurement scale, and other relevant attributes As a best practice load all data as character/varchar into a dataset and do explicit conversion later on to identify and remediate type conversion errors
Explicit Conversion Functions Always use explicit conversion functions like CAST or CONVERT to ensure clarity and control over data conversions.This makes the conversion process transparent and easier to understand and maintain Explicit conversions are helpful in understanding code and error handling
Handle Conversion Errors Gracefully Use functions like TRY_CAST or TRY_CONVERT to safely handle potential conversion errors without halting queries.These functions allow you to gracefully handle cases where a value cannot be converted to the target type Explicit type conversion allows you to handle gracefully and provide an alternate route in case of failure during conversion

Data Masking#

Definition: Data masking is the process of hiding data by modifying its original letters and numbers. Due to regulatory and privacy requirements, organizations must protect the sensitive data they collect about their customers and operations.

Guardrails#

Topic Description Recommendation
Data Discovery before you can protect your data, you need to have a grasp of the data you are holding, and distinguish the various types of information with varying degrees of sensitivity. Security and business experts typically collaborate to produce an exhaustive record of all the data components across an enterprise Data objects should be classified as Confidential, Internal and public. Having such classification helps in identifying level of clearance to access data, Level of data masking and distribution strategies
Survey of circumstances the security director responsible for determining the availability of sensitive data should oversee the circumstances in which the data is stored and used, and decide on the appropriate concealing strategy for each type of data There should be a different teams responsible for data security and masking and circumstances when these techniques should be used
Veiling actualization for large enterprises, it is not realistic to apply a single data masking technique across all datasets. Each type of data has to be considered in terms of the appropriate arrangement, engineering and usage needs Different types of data needs different type of data masking, so a single data masking technique is not advisable for all types of data
Veiling testing This involves testing the results of data veiling techniques. The QA and testing teams must guarantee that the data masking techniques used offer the desired outcomes. In the event that a masking technique falls short of expectations, the DBA must restore the database to the original, unmasked state and apply a new masking procedure with new calculations Data Masking techniques should be thoroughly tested in lower environments and signed off, before proceeding to deployment in production
Techniques for Data masking Define clear techniques to be used for datamasking and for which type of data e.g. Data Psuedonymazation, Data Anonymization, Lookup substitution, Encryption, Redaction, Averaging, Shuffling, Date Switching etc. There should be clear defined techniques for each type of data use case
Irreversibility The algorithms must be designed such that once data has been masked, you can't back out the original values or reverse engineer the masked data. Choose techniques and data that requires appropriate techniques. Some data masking techniques are easier to decrypt than others. For Confidential or PII data, it is always desireable to have irreversible data masking techniques
Repeatable Masking is not a one-time process. , it Organizations should perform data masking should happen repeatedly as data changes over time. It needs to be fast and automatic while allowing integration with your workflows, such as SDLC or DevOps processes As the data grows and changes, data masking should be performed repeatedly over changed data and new data
Automated Many data masking solutions often add operational overhead and prolongs test cycles for a company. But with an automated approach, teams can easily identify sensitive information such as names, email addresses, and payment information to provide an enterprise-wide view of risk and to pinpoint targets for masking. Modern data masking tools provide efficient and automated approach to data masking
Policy Based With a policy-based approach, your data can be tokenized and reversibly or irreversibly masked in accordance with internal standards and privacy regulations such as GDPR, CCPA, and HIPAA. Taken together, these capabilities allow businesses to define, manage, and apply security policies from a single point of control across large, complex data estates in real-time. Apply data masking based on policy, rules and regulations such as GDPR, CCPA and HIPAA

Best Practice#

Topic Description Recommendation
Data Masking Project Scope In order to effectively perform data masking, companies should know what information needs to be protected, who is authorized to see it, which applications use the data, and where it resides, both in production and non-production domains. While this may seem easy on paper, due to the complexity of operations and multiple lines of business, this process may require a substantial effort and must be planned as a separate stage of the project. Scope definition clearly identifies systems, services and processes that will consume, store and process PII, confidential data. Sensitive systems can be protected well when they are clearly defined within scope
Refrential Integrity Referential integrity means that each “type” of information coming from a business application must be masked using the same algorithm During masking process, some of the fields like ID in reference tables may be missed out, resulting in failure of break downm in refrential integrity. Make sure same "type" of columns are masked using same algorithm, so refrential intigrity is maintained
Data Masking Algorithms It is critical to consider how to protect the data making algorithms, as well as alternative data sets or dictionaries used to scramble the data Data Masking algorithms are very sensitive and should be tighly secured, and should be known only to the authorised personnel. If any un-authorised person get details of such algorithms, they can reverse engineer the masked data

Change Data Capture#

Definition : Change Data Capture (CDC) is a method used in databases to track and record changes made to data. It captures modifications like inserts, updates, and deletes, and stores them for analysis or replication. CDC helps maintain data consistency across different systems by keeping track of alterations in real-time. It's like having a digital detective that monitors changes in a database and keeps a log of what happened and when.

Guardrails#

Topic Description Recommendation
CDC Method Choose correct and efficient CDC method, viz. Log-based CDC,Trigger-based CDC, Query-based CDC One CDC technique may not work for all type of CDC use cases, so it is advisable to decide best CDC technique on a use case basis
CDC Performance Measure CDC Performance by monitoring key metrics, adjust CDC parameters like maxscan, maxtrans, and pollinginterval CDC requires lot of processing, disk i/o and cost. Monitoring perfomance metrics helps in fine tuning CDC performance
CDC Batch processing Using batch processing for Change data capture reduces load on the source database and is often used When the need for changes is not realtime, it is better to perform CDC in batches e.g. hourly, daily etc. This not only reduces load on source systems but also reduces costs on target systems
Database Design Design target databases with CDC as requirements such as using appropriate data types and indexing/partitioning/clustering Database design is crucial in faster processing of CDC. Poor database design could take longer time to process and could result in higher costs and delays
Data Transformation Layer Consider using a data transformation layer to handle complex transformations without impacting the source database It is always a best practice to use data transformation layer (Silver or Governed) to do CDC, making it independant of source and target layer
Stream Processing Utilize stream processing technologies for real-time data processing Stream processing Technologies like Snowflake's Snowpipe, Databricks Delta live tables (DLT) and Azure event hub/Azure IoT hub provides real-time data processing and change data capture
Security and Compliance Considerations Implement robust security measures to protect data during transfer and storage Robust security measures must be implemented for data in transit and at rest. Data regulation compliance should be maintained at all stages of CDC
Data Encryption Use encryption methods and secure communication channels to prevent data breaches when transferring data within organization or outside, usee data encryption channels like SSL and TLS1.2
Integrity Checks Implement integrity checks to ensure data accuracy and prevent unauthorized access Changing data in a table should be checked for not breaking any refrential integrity and joins with external tables
Modern Databases/Datawarehouse/Data lake Many modern databases, datawarehouse and data lakes provide efficient CDC functionalities like Databricks Delta Lake, Snowflake Time travel etc In databricks, change versions are maintained when there is an update, delete or insert. Data can be accessed from original state before CDC was applied.Snowflake also provide similar functionalities

Best Practice#

Topic Description Recommendation
Tools Utilize CDC tools and platforms specifically designed for integrating with existing data pipelines. These tools often provide out-of-the-box connectors and adapters for popular databases and messaging systems, simplifying the integration process There are many tools available within clouds and COTS products which are efficient in providing CDC functionalities. CDC techniques can also
Deletes during CDC Define a well established criteria for handling deleted records in the source A hard delete permanently removes specific data records from the destination system. After a hard delete, users cannot recover or restore the data without a backup.A soft delete doesn’t actually delete information. Instead, it marks records as deleted with a “deleted” flag or by updating a “yes/no delete status" field: connected to the record. Unlike hard deletes, soft-deleted information stays in the system. This gives users the choice to either filter out, call, or undelete the soft-deleted records at any time.
Incorporate monitoring and logging processes Ensure you have proper monitoring and logging mechanisms to track the quality and performance of the CDC tools. Setting up alerts for data anomalies and errors is also a good idea. Always monitor and track quality and performance of CDC tools, alerts in case anomalies or errors help in timely remediation of issues
Data Consistency and Integrity CDC ensures that no changes are missed or duplicated, guaranteeing the accuracy of downstream analytics and reporting Always maintain data integrity during CDC process and avoid creating duplicate records, unless it is absolutely necessary
Scalability and Performance Make sure your CDC architecture is robust enough to offer scalability and handle data as it grows. Businesses choose horizontal scaling options, load balancing, and optimizing query performance regarding massive datasets. Design CDC architecture which is robust enough for future workloads and handle large dataset. This can be tested during Load testing phase of CDC architecture

Performance Optimization#

Definition : Performance optimization in Databricks, especially in the context of data transformations, involves a series of best practices and strategies aimed at improving the efficiency, speed, and resource utilization of data processing tasks.

Category Description
Cluster Configuration - Cluster Sizing: Choose the right size and type of cluster based on the workload. Use autoscaling to adjust the number of nodes dynamically based on the demand.
- Instance Types: Select appropriate instance types (e.g., memory-optimized, compute-optimized) based on the nature of the transformations.
- Spot Instances: Consider using spot instances for cost savings, but ensure that your workload can handle potential interruptions.
Data Partitioning - Partitioning Strategy: Partition data based on the columns that are frequently used in filters and joins. This reduces the amount of data scanned during queries.
- Repartitioning: Use repartition or coalesce to adjust the number of partitions. Repartitioning can help in balancing the load across nodes, but avoid excessive repartitioning as it can be costly.
Caching and Persistence - Caching: Cache intermediate results using cache() or persist() to avoid recomputation of expensive transformations. Choose the appropriate storage level (e.g., MEMORY_ONLY, MEMORY_AND_DISK) based on the available memory.
- Unpersisting: Unpersist dataframes that are no longer needed to free up memory.
Efficient Data Formats - Columnar Formats: Use columnar data formats like Parquet or ORC for storage. These formats are optimized for read-heavy operations and support efficient compression and encoding.
- Compression: Apply appropriate compression techniques to reduce storage costs and improve I/O performance.
Optimized Joins and Aggregations - Broadcast Joins: Use broadcast joins for small tables to avoid shuffling large datasets. Use the broadcast function to explicitly mark a dataframe for broadcasting.
- Skew Handling: Address data skew by using techniques like salting or repartitioning to ensure even distribution of data across nodes.
- Aggregation Pushdown: Push down aggregations to the data source whenever possible to reduce the amount of data transferred and processed.
Query Optimization - Predicate Pushdown: Ensure that filters are pushed down to the data source to minimize the amount of data read.
- Vectorized Execution: Use vectorized execution engines like Apache Arrow to speed up data processing.
- Adaptive Query Execution (AQE): Enable AQE to dynamically optimize query plans based on runtime statistics.
Resource Management - Task Parallelism: Increase the level of parallelism by adjusting the number of tasks. Ensure that the number of tasks is a multiple of the number of cores in the cluster.
- Executor Memory: Allocate sufficient memory to executors to handle large datasets and avoid out-of-memory errors.
Monitoring and Tuning - Spark UI: Use the Spark UI to monitor job execution, identify bottlenecks, and understand the performance characteristics of your transformations.
- Ganglia and Datadog: Integrate with monitoring tools like Ganglia or Datadog to track cluster performance metrics and set up alerts for critical issues.
- Job Metrics: Collect and analyze job metrics to identify slow stages and optimize them.
Code Optimization - Efficient Code Practices: Write efficient Spark code by avoiding unnecessary shuffles, using map-side operations, and minimizing the use of UDFs (User Defined Functions).
- Code Review: Regularly review and refactor code to ensure it follows best practices and is optimized for performance.

Guardrails#

Category Description
Cluster Configuration - Cluster Sizing: Ensure clusters are appropriately sized to handle the workload without over-provisioning.
- Instance Types: Use suitable instance types based on the nature of the transformations.
- Spot Instances: Use spot instances cautiously, ensuring workloads can handle interruptions.
Data Partitioning - Partitioning Strategy: Partition data based on frequently used columns in filters and joins to reduce data scan.
- Repartitioning: Avoid excessive repartitioning to prevent unnecessary costs.
Caching and Persistence - Caching: Cache intermediate results judiciously to avoid recomputation of expensive transformations.
- Unpersisting: Unpersist dataframes that are no longer needed to free up memory.
Efficient Data Formats - Columnar Formats: Use columnar data formats like Parquet or ORC for storage to optimize read-heavy operations.
- Compression: Apply appropriate compression techniques to reduce storage costs and improve I/O performance.
Optimized Joins and Aggregations - Broadcast Joins: Use broadcast joins for small tables to avoid shuffling large datasets.
- Skew Handling: Address data skew to ensure even distribution of data across nodes.
- Aggregation Pushdown: Push down aggregations to the data source to reduce data transfer and processing.
Query Optimization - Predicate Pushdown: Ensure filters are pushed down to the data source to minimize data read.
- Vectorized Execution: Use vectorized execution engines like Apache Arrow to speed up data processing.
- Adaptive Query Execution (AQE): Enable AQE to dynamically optimize query plans based on runtime statistics.
Resource Management - Task Parallelism: Adjust the number of tasks to increase parallelism, ensuring it matches the number of cores in the cluster.
- Executor Memory: Allocate sufficient memory to executors to handle large datasets and avoid out-of-memory errors.
Monitoring and Tuning - Spark UI: Use the Spark UI to monitor job execution and identify bottlenecks.
- Ganglia and Datadog: Integrate with monitoring tools to track cluster performance metrics and set up alerts.
- Job Metrics: Collect and analyze job metrics to identify and optimize slow stages.
Code Optimization - Efficient Code Practices: Write efficient Spark code by avoiding unnecessary shuffles and minimizing the use of UDFs.
- Code Review: Regularly review and refactor code to ensure it follows best practices and is optimized for performance.

Best Practice#

Category Description
Cluster Configuration - Cluster Sizing: Choose the right size and type of cluster based on the workload. Use autoscaling to adjust the number of nodes dynamically based on the demand.
- Instance Types: Select appropriate instance types (e.g., memory-optimized, compute-optimized) based on the nature of the transformations.
- Spot Instances: Consider using spot instances for cost savings, but ensure that your workload can handle potential interruptions.
Data Partitioning - Partitioning Strategy: Partition data based on the columns that are frequently used in filters and joins. This reduces the amount of data scanned during queries.
- Repartitioning: Use repartition or coalesce to adjust the number of partitions. Repartitioning can help in balancing the load across nodes, but avoid excessive repartitioning as it can be costly.
Caching and Persistence - Caching: Cache intermediate results using cache() or persist() to avoid recomputation of expensive transformations. Choose the appropriate storage level (e.g., MEMORY_ONLY, MEMORY_AND_DISK) based on the available memory.
- Unpersisting: Unpersist dataframes that are no longer needed to free up memory.
Efficient Data Formats - Columnar Formats: Use columnar data formats like Parquet or ORC for storage. These formats are optimized for read-heavy operations and support efficient compression and encoding.
- Compression: Apply appropriate compression techniques to reduce storage costs and improve I/O performance.
Optimized Joins and Aggregations - Broadcast Joins: Use broadcast joins for small tables to avoid shuffling large datasets. Use the broadcast function to explicitly mark a dataframe for broadcasting.
- Skew Handling: Address data skew by using techniques like salting or repartitioning to ensure even distribution of data across nodes.
- Aggregation Pushdown: Push down aggregations to the data source whenever possible to reduce the amount of data transferred and processed.
Query Optimization - Predicate Pushdown: Ensure that filters are pushed down to the data source to minimize the amount of data read.
- Vectorized Execution: Use vectorized execution engines like Apache Arrow to speed up data processing.
- Adaptive Query Execution (AQE): Enable AQE to dynamically optimize query plans based on runtime statistics.
Resource Management - Task Parallelism: Increase the level of parallelism by adjusting the number of tasks. Ensure that the number of tasks is a multiple of the number of cores in the cluster.
- Executor Memory: Allocate sufficient memory to executors to handle large datasets and avoid out-of-memory errors.
Monitoring and Tuning - Spark UI: Use the Spark UI to monitor job execution, identify bottlenecks, and understand the performance characteristics of your transformations.
- Ganglia and Datadog: Integrate with monitoring tools like Ganglia or Datadog to track cluster performance metrics and set up alerts for critical issues.
- Job Metrics: Collect and analyze job metrics to identify slow stages and optimize them.
Code Optimization - Efficient Code Practices: Write efficient Spark code by avoiding unnecessary shuffles, using map-side operations, and minimizing the use of UDFs (User Defined Functions).
- Code Review: Regularly review and refactor code to ensure it follows best practices and is optimized for performance.

Data Quality#

Definition :

Guardrails#

Category Description
Validation Rules - Consistency Checks: Ensure data consistency across different datasets and transformations.
- Range Checks: Validate that data falls within expected ranges.
- Format Checks: Ensure data adheres to specified formats (e.g., date formats, numeric formats).
Error Handling - Logging: Implement comprehensive logging for errors encountered during data processing.
- Alerting: Set up alerts for critical errors to ensure timely intervention.
- Retry Mechanisms: Implement retry mechanisms for transient errors.
Data Profiling - Baseline Profiling: Establish baseline data profiles to monitor deviations.
- Anomaly Detection: Implement techniques to detect anomalies in data.
- Regular Audits: Conduct regular audits to ensure data quality standards are maintained.
Data Governance - Ownership: Clearly define data ownership and stewardship roles.
- Accountability: Ensure accountability for data quality management.
- Policies: Develop and enforce policies for data quality management.
Compliance - Regulatory Adherence: Ensure data quality practices comply with relevant regulations (e.g., GDPR, HIPAA).
- Auditability: Maintain audit trails for data quality checks to support compliance audits.

Best Practices#

Category Description
Validation Rules - Automated Validation: Implement automated validation rules to ensure data quality.
- Custom Rules: Develop custom validation rules tailored to specific data requirements.
- Continuous Monitoring: Continuously monitor data quality using validation rules.
Error Handling - Comprehensive Logging: Implement detailed logging to capture all errors and issues during data processing.
- Proactive Alerting: Set up proactive alerts for potential data quality issues.
- Robust Retry Mechanisms: Implement robust retry mechanisms to handle transient errors effectively.
Data Profiling - Regular Profiling: Conduct regular data profiling to understand data characteristics and quality.
- Advanced Anomaly Detection: Use advanced techniques (e.g., machine learning) for anomaly detection.
- Continuous Audits: Perform continuous audits to ensure data quality standards are maintained.
Data Governance - Clear Ownership: Define clear ownership and stewardship roles for data quality management.
- Accountability Framework: Establish an accountability framework for data quality management.
- Policy Development: Develop and enforce comprehensive policies for data quality management.
Compliance - Regular Compliance Audits: Conduct regular compliance audits to ensure adherence to regulations.
- Detailed Documentation: Maintain detailed documentation of data quality practices and policies.
- Training Programs: Implement training programs to educate stakeholders on data quality and compliance.

Data Lineage#

Definition : Data lineage refers to the tracking and visualization of the flow of data through various stages of its lifecycle, from its origin to its final destination. It provides a detailed map of how data moves, transforms, and is utilized within an organization. This includes:

  • Source Identification: Identifying the original source of the data.
  • Transformation Tracking: Documenting all transformations and processes the data undergoes.
  • Movement Mapping: Mapping the movement of data across different systems and environments.
  • Usage Documentation: Recording how and where the data is used, including any outputs or reports generated from it.

Guardrails#

Category Description
Tracking Data Transformations - Consistency: Ensure all data transformations are consistently tracked across the pipeline.
- Accuracy: Maintain accurate records of data transformations to ensure traceability.
- Completeness: Ensure that all steps in the data transformation process are documented.
Metadata Management - Standardization: Use standardized metadata formats to ensure compatibility and ease of integration.
- Security: Protect metadata from unauthorized access and modifications.
- Retention: Implement policies for the retention and archiving of metadata.
Compliance - Regulatory Adherence: Ensure data lineage practices comply with relevant regulations (e.g., GDPR, HIPAA).
- Auditability: Maintain audit trails for all data transformations to support compliance audits.
Data Governance - Ownership: Clearly define data ownership and stewardship roles.
- Accountability: Ensure accountability for data transformations and lineage tracking.
Tool Integration - Compatibility: Ensure data lineage tools are compatible with existing data processing and storage systems.
- Interoperability: Facilitate interoperability between different tools and platforms.

Best Practices#

Category Description
Tracking Data Transformations - Automated Tracking: Implement automated tracking mechanisms to capture data transformations.
- Granularity: Track data transformations at a granular level to ensure detailed lineage.
- Visualization: Use visualization tools to represent data lineage clearly and intuitively.
Metadata Management - Centralized Repository: Maintain a centralized repository for metadata to ensure easy access and management.
- Enrichment: Enrich metadata with additional context to improve understanding and usability.
- Versioning: Implement version control for metadata to track changes over time.
Compliance - Regular Audits: Conduct regular audits of data lineage practices to ensure compliance with regulations.
- Documentation: Maintain comprehensive documentation of data lineage processes and policies.
Data Governance - Policy Development: Develop and enforce policies for data lineage and governance.
- Training: Provide training to stakeholders on data lineage practices and tools.
Tool Integration - Unified Platform: Use a unified platform for data lineage to streamline integration and management.
- API Utilization: Leverage APIs to integrate data lineage tools with other systems and platforms.

Using Databricks notebooks#

Definition : Using Databricks notebooks offers several advantages over writing pure Python code, especially in the context of data engineering, data science, and machine learning.

Guardrails#

Topic Description Recommendation
Single Machine vs Cluster Regular Python code run externally to databricks work on a single machine while run through databricks run on a cluster of machines (if installed with Dask or Ray on clusters). This make code faster in databricks as compared to single machine Databricks clusters now support Ray which can help Python code use distributed computing and processing (as PySpark) and Dask is Python
Distributed work load Use python for distributed workloads, making full use of cluster. For single machine workloads, python code will work, but will have limited resources to run the code Additional libraries like Ray or Dask help python use full use of Clusters, where as Pyspark uses Databricks cluster for distributing workload
Dataframes - Pandas Pandas Dataframe able to Data Manipulation such as indexing, renaming, sorting, merging data frame. Updating, adding, and deleting columns are quite easier using Pandas. Pandas Dataframe supports multiple file formats Processing Time is too high due to the inbuilt function. Dis-Advantages: Manipulation becomes complex while we use a Huge dataset. Processing time can be slow during manipulation. While Pandas dataframes comes with lot of exciting functions, large dataset become slower while processing
Dataframes - PySpark Spark carry easy to use API for operation large dataset. It not only supports ‘MAP’ and ‘reduce’, Machine learning (ML), Graph algorithms, Streaming data, SQL queries, etc. Spark uses in-memory(RAM) for computation. It offers 80 high-level operators to develop parallel applications. Dis-Advantages : No automatic optimization process, Very few Algorithms, Small Files Issue PySpark works well with large dataframes but with limited number of functions. They are good for Data processing and Transformation

Best Practice#

Python Vs PySpark#
PySpark Python
PySpark is easy to write and also very easy to develop parallel programming Python is a cross-platform programming language, and one can easily handle it
One does not have proper and efficient tools for Scala implementation As python is a very productive language, one can easily handle data in an efficient way
It provides the algorithm which is already implemented so that one can easily integrate it As python language is flexible, one can easily do the analysis of data
It is a memory computation It uses internal memory and nonobjective memory as well
It only provides R-related and data science-related libraries It supports R programming-related libraries with data science, machine learning, etc libraries too
It allows distributed processing It allows to implementation a single thread
It can process the data in real-time It can also process data in real-time with huge amounts
Before implementation, one requires to have Spark and Python fundamental knowledge Before implementation, one must know the fundamentals of any programming language
It shows low latency It is easy to learn and use
It is immutable It is a cross-platform language
It is fault tolerant It is easy to maintain
It supports Spark, Yarn, and Mesos cluster managers It is dynamically typed
It has ANSI SQL support It has large community support
It is dynamic in nature It has extensible features
It is hard to express It might be slower because it is an interpreted language
Less efficient Threading of Python is not optimal due to Global Interpreter Lock
If one requires streaming, then the user has to switch from Python to Scala It consumes a lot of memory
It is not supported by Android or iOS