Skip to content

Building Data Product#

Overview#

The Building Data Product phase transforms your approved requirements into a fully functional data product through systematic development, testing, and deployment. This phase focuses on implementing robust data pipelines, establishing observability, and ensuring your data product meets all technical and business requirements.

What you will learn#

After completing this chapter, you will understand how to:

  • Implement secure data access patterns and connectivity to source systems
  • Build scalable data pipelines using modern ETL/ELT frameworks
  • Establish comprehensive observability and monitoring for your data products
  • Implement data quality controls and automated testing
  • Create proper documentation and data lineage tracking
  • Set up output ports for downstream consumption
  • Follow CI/CD best practices for data product deployment

Key Personas & Stakeholders - RACI Matrix#

Activity Data Product Manager Data Engineer Solution Architect DevOps Engineer Data Analyst Business Stakeholder
Pipeline Implementation A R C C I I
Data Quality Setup C R C I A C
Observability Implementation C R A R I I
Documentation Creation A R C I C C
Output Port Configuration C R A C C I
CI/CD Pipeline Setup C R C A I I
Testing & Validation A R C C R C

R = Responsible, A = Accountable, C = Consulted, I = Informed

Prerequisites#

Step-by-Step Process#

Step 1: Secure Data Source Access#

Objective: Establish secure, reliable connections to all required data sources and validate data availability.

  1. Identify and Access Data Sources:

    Here are some common data sources you may need to connect to:

    • Database systems: SQL Server, PostgreSQL, Oracle, MySQL
    • Cloud storage: Azure Blob Storage, AWS S3, Google Cloud Storage
    • APIs and web services: REST APIs, GraphQL endpoints, SOAP services
    • File shares and data lakes: Network file systems, HDFS, Delta Lake
    • Message queues: Kafka, Azure Service Bus, AWS SQS
  2. Configure Authentication and Authorization:

    • Set up service accounts with principle of least privilege
    • Configure OAuth 2.0, API keys, or certificate-based authentication
    • Avoid using Personal Access Tokens (PATs), use service principals or managed identities
  3. Test Connectivity and Data Availability:

    • Test authentication mechanisms in target environments
    • Validate data schema and sample data
    • Confirm data refresh schedules and availability windows
    • Implement error handling for connection failures
  4. Document Access Patterns and Requirements:

    Define and document the following:

    • Connection strings and authentication methods
    • Data refresh schedules and availability windows
    • Rate limits and usage quotas
    • Error handling and retry policies
    • Performance benchmarks and SLA requirements

The above should be part of your technical design specifications. You can document these in:

  • Manual setup: Create design_specifications.md under the documentation/ folder in your repository
  • Template-based setup: Use dc-template-core which provides structured documentation templates and placeholders for design specifications
  • Deployment documentation: Follow dc-release documentation guide for environment-specific configuration and deployment specifications

Security Best Practice

Use Databricks secrets or similar secret management services to store connection credentials. Never hardcode sensitive information in your code.

Example

Hold "Alt" / "Option" to enable Pan & Zoom
alt text

Tip

The dc-template-python-project provides a quick start template with a deployable python project setup where you can easily build these type of transformations for your use case.

Expected Outcome: Verified access to all data sources with documented connection patterns

Step 2: Implement Data Pipeline Architecture#

Objective: Build robust, scalable data pipelines following established patterns.

Source to Landing Zone Implementation#

The landing zone serves as a buffer layer that ingests raw data from source systems without transformation. This stage is essential when you need to decouple data ingestion from processing, handle varying data arrival patterns, or maintain an immutable copy of source data for audit and reprocessing purposes.

When to implement Source to Landing:

  • When source systems have limited availability windows
  • When you need to process large volumes requiring different processing schedules
  • When regulatory requirements mandate raw data preservation
  • When multiple downstream consumers need access to the same source data
  1. Set up raw data ingestion pipelines: - Configure scheduled data extraction jobs - Implement streaming ingestion for real-time sources - Set up file monitoring for batch file arrivals - Handle different data formats (JSON, CSV, Parquet, Avro)

  2. Implement change data capture (CDC) where applicable: - Configure CDC for transactional systems - Set up incremental extraction based on timestamps - Implement watermarking for streaming data - Handle deletes and updates appropriately

  3. Add data validation at ingestion point: - Verify file integrity and completeness - Validate basic schema compliance - Check for required fields and data types - Implement data quality alerts for critical failures

Example

Hold "Alt" / "Option" to enable Pan & Zoom
alt text

Landing to Bronze (Source-Aligned Data Products)#

The bronze layer serves as your source of truth for raw, validated data. Here's how to implement the transformation:

  1. Apply data cleansing and standardization: - Remove or fix malformed records - Standardize date formats and data types - Handle null values and missing data - Apply consistent naming conventions

  2. Implement data partitioning and bucketing strategies: - Partition by date/time for time-series data - Use appropriate partition sizes (avoid small files) - Implement bucketing for large datasets - Optimize for downstream query patterns

  3. Implement load strategies: - Full refresh: Complete dataset replacement (for small, static datasets) - Incremental load: Process only new/changed records (for large, dynamic datasets) - Merge/Upsert: Handle updates and inserts (for transactional data) - Configure appropriate checkpointing and recovery mechanisms

  4. Add audit columns for tracking: - Source system identifiers - Ingestion timestamps - Data lineage information - Processing batch/job identifiers

Example

Hold "Alt" / "Option" to enable Pan & Zoom
alt text

Tip

The dc-template-python-project provides a quick start template with a deployable python project setup where you can easily build these type of transformations for your use case.

Bronze to Silver/Gold Transformations#

The silver and gold layers apply business logic and create consumption-ready datasets. Silver typically contains cleaned, conformed data suitable for analytics, while gold provides aggregated, domain-specific datasets optimized for specific business use cases.

  1. Create Physical Model if needed: Read here to know more

  2. Apply business logic transformations: - Implement business rules and calculations - Create derived fields and computed columns - Apply data masking for sensitive information - Standardize business terminology and definitions

  3. Implement aggregations and calculations: - Create pre-computed metrics and KPIs - Build rolling averages, cumulative sums, and trends - Implement time-based aggregations (daily, weekly, monthly) - Generate statistical summaries and data profiles

  4. Join multiple data sources as required: - Combine related datasets using appropriate join types - Create comprehensive customer or product views

Example

Hold "Alt" / "Option" to enable Pan & Zoom
alt text

Tip

The dc-template-python-project provides a quick start template with a deployable python project setup where you can easily build these type of transformations for your use case.

Expected Outcome: Functional data pipelines processing data through all required layers

Step 3: Establish Comprehensive Observability#

Objective: Implement monitoring, alerting, and performance tracking for your data pipelines.

  1. Define Key Metrics:

    • Data Quality Metrics: Completeness, accuracy, consistency
    • Performance Metrics: Processing time, throughput, resource utilization
    • Business Metrics: Record counts, data freshness, SLA compliance
  2. Implement Pipeline Monitoring:

  3. Set up Alerting Rules:

    • Configure webhook notifications via Databricks Asset Bundles using dc-template-core and dc-template-python-project templates.
    • Data quality threshold violations
    • Performance degradation alerts
    • Data freshness SLA breaches
  4. Create Monitoring Dashboards:

    • Real-time pipeline status
    • Historical performance trends
    • Data quality scorecards
    • Resource utilization metrics

Expected Outcome: Comprehensive monitoring system providing visibility into pipeline health and performance

Example

Hold "Alt" / "Option" to enable Pan & Zoom
alt text

Hold "Alt" / "Option" to enable Pan & Zoom
alt text

Refer Additional Resources for how to setup observability.

Step 4: Implement Data Quality Controls#

Objective: Ensure data reliability through automated quality checks and validation.

  1. Define Quality Rules: - Completeness checks - Accuracy validations - Consistency rules - Uniqueness constraints

  2. Implement Quality Checkpoints: - Source data validation - Transformation validation - Output data validation

  3. Create Quality Metrics Dashboard: - Quality score trends - Rule violation details - Data profiling results

  4. Set up Quality-based Alerts: - Critical quality failures - Quality score degradation - Anomaly detection

Expected Outcome: Automated data quality monitoring with defined thresholds and alerting

Example

Hold "Alt" / "Option" to enable Pan & Zoom
alt text

Refer below Additional Resources to see reference implementation

Step 5: Create Data Lineage and Documentation#

Objective: Provide transparency and traceability for your data product.

  1. Implement Automated Lineage Tracking:

  2. Generate Technical Documentation: - Data schema documentation - Transformation logic descriptions - Pipeline architecture diagrams - Configuration and deployment guides

  3. Create Business Documentation: - Business glossary updates - KPI calculation explanations - Usage guidelines - SLA definitions

  4. Maintain Version Control:

Expected Outcome: Comprehensive documentation and automated lineage tracking

Example

Hold "Alt" / "Option" to enable Pan & Zoom
alt text

Refer below Additional Resources to understand lineage tracking on a tool like Purview.

Step 6: Configure Output Ports and Data Sharing#

Objective: Establish secure, efficient interfaces for downstream data consumption.

  1. Define Output Interfaces:
  • Direct table access (for analytical workloads)
  • API endpoints (for application integration)
  • File exports (for external systems)
  • Data shares (Snowflake secure shares, Databricks Delta Sharing)
  1. Configure Access Controls:
  • Role-based access permissions
  • Data classification labels
  • Usage monitoring and auditing
  1. Create Consumer Documentation:
  • API specifications
  • Schema definitions
  • Usage examples
  • Support contact information

Expected Outcome: Well-defined, secure output ports ready for consumer access

Refer below to see reference implementation

  • [Source Aligned data products]link to come
  • [Customer Aligned data products]link to come
  • [Aggregate data products]link to come

Success Metrics & Checkpoints#

  • Source Connectivity: All data sources accessible with documented connection patterns
  • Pipeline Functionality: End-to-end data flow processing correctly from source to output
  • Quality Controls: Data quality rules implemented with automated monitoring and alerting
  • Observability: Comprehensive monitoring dashboards and alerting systems operational
  • Documentation: Technical and business documentation complete and accessible
  • Output Ports: Consumer interfaces configured with proper access controls
  • CI/CD Integration: Automated deployment pipeline operational with quality gates
  • Performance Validation: Pipeline meets defined SLA requirements under expected load

Common Challenges & Solutions#

  • Challenge: Source system connectivity issues or changing APIs
  • Solution: Implement robust retry logic and connection pooling; establish SLAs with source system owners
  • Prevention: Build comprehensive error handling and establish monitoring for source system availability

  • Challenge: Performance bottlenecks in data processing

  • Solution: Implement incremental processing, optimize joins, and partition data appropriately
  • Prevention: Conduct performance testing during development and establish baseline metrics

  • Challenge: Data quality issues not detected until downstream consumption

  • Solution: Implement comprehensive quality checks at each pipeline stage with immediate alerting
  • Prevention: Define quality rules during requirements phase and implement validation early

  • Challenge: Complex transformation logic difficult to maintain and troubleshoot

  • Solution: Modularize transformations, add comprehensive logging, and create unit tests
  • Prevention: Follow coding best practices and conduct regular code reviews

Next Steps#

During the final stages of implementation, prepare for production launch:

  1. Finalize Training Materials: Create user guides, training presentations, and FAQ documents
  2. Schedule User Training: Organize sessions for different user groups (analysts, business users, technical consumers)
  3. Plan Launch Communications: Coordinate announcements and stakeholder notifications
  4. Prepare Support Structure: Establish escalation procedures and support contact information
  5. Validate Production Readiness: Conduct final end-to-end testing in production environment

The next chapter will guide you through Data Product Verification to ensure your implementation meets all requirements before launch.

Additional Resources#

  • Templates

    • If you want to accelerate your development, you can use these templates to quickly and easily scaffold the project structures, CI/CD pipelines, Databricks Asset Bundle configurations
      • dc-template-core - Core template with project scaffolding and best practices for CICD and Databricks workflows.
      • dc-template-python-project - Python project structure with testing, packaging, and CI/CD. Modify the data processing modules for your use case.
  • Reference Implementations:

  • Observability & Monitoring:

    • How to setup observability: Coming Soon
    • How to setup lineage tracking using Purview: Coming Soon