Go to Playbook Main Page
Go back to Playbook Design Page
Data Ingestion for MDM#
Data ingestion is the process of collecting and importing data from various sources into the MDM system. MDM requires curated data that fits its data model.This means that data in source systems can't be directly written to MDM and requires transformations and checks. As such an ingestion layer becomes a must have. Once data has been profiled and data model is decided, one or more ingestion modes can be chosed
Note
Data can also be ingested directly in MDM via User Interface or UI based file import options.
Data creation /updation via UI is generally governed using workflows .This funtionality is typically used by Data Stewards.
During file import , the system applies data quality rules defined in the data model. This functionality is typically used by super users or application administrators.
Key Ideas
- Ability to handle large volumes without bottlenecks.
- Support both batch and real-time/near real-time methods.
- Implement retry mechanisms.
- Loading should be idempotent — if you replay messages or files, it should not create duplicates or corrupt the master.
- Ideally use a business key or composite key to match-up existing records.
- Use Modular Layers - separate responsibilities into RAW, LANDING,STAGING and LOAD layers to enable easier debugging, recovery, and pipeline maintenance.
- Handle inserts, updates, deletions gracefully.Support delta detection from sources to minimize latency and processing effort.
- Stick to open, well-supported formats (JSON, CSV, Avro, Parquet).
- Compress and partition large files by key or timestamp to aid processing.
- Implement proper authentication and authorization (API tokens, service accounts) for ingest.
- Handle PII and GDPR by masking or encrypting sensitive fields at the earliest point.
- Ensure proper error handling by categorizing and logging the errors to a central error store.
Ingestion Patterns#
Data ingestion modes can be classified based on two main criteria: data type. and ingestion frequency
Ingestion Based on Data Type#
This classification considers the format and structure of the data being ingested, impacting the method used for data collection.
- Database Sources:
-
This pattern extracts data directly from databases using connectors.
-
File-Based Sources:
-
This involves ingesting data from files stored in cloud storage or on-premises systems. It's often used for structured or semi-structured data like CSV, JSON,Excel or Parquet files
-
API-Based Sources:
-
Data is ingested through APIs, often from external systems or third-party services. This pattern is common for integrating diverse data sources.
-
Event-Driven Sources:
- This pattern triggers data ingestion based on specific events, such as a new file upload or a database update. It's efficient for handling dynamic and unpredictable data flows
Ingestion Based on Frequency#
This classification focuses on how often data is ingested
- Batch Mode:
Batch mode is the most common option and should be chosen for bulk loads. Any process reading more than a few thousand records that need to participate in match and merge should use this mode. Examples of batch-mode processes include:
- Initial Data Loads.
- Near-full refresh of a base object entity due to changes in source systems.
- Bulk data manipulations in source systems, such as generating batch numbers and assigning them to multiple products - where processes run automatically without human intervention.
It is important here to avoid race condition - hence ensure to design the ingestion pipelines to handle repeated inputs or updates gracefully without affecting data integrity
- Event-Driven and Near Real-Time Mode:
Use this mode when Master Data must update in response to events from source systems. Events can occur at any frequency, and the MDM system needs to process changes promptly. The distinction between near real-time and true real-time depends on how quickly updated data is required and how master data is consumed across the organization.
Examples of Event-Driven Use case: 1. A customer updates their email in a subscription portal. MDM must ingest and propagate this change quickly, for example, to a promotional system that sends nightly offers. 2. A product has an updated attribute that can drive sales. This update needs to be sent to MDM ASAP so all systems using this product can consume the updated attribute. E.g., a face cream gains a UV advantage due to a new formula. This information can be sent to marketing systems, which can quickly launch a new campaign and directly affect sales.
- Real-time Mode:
Real-time integration is highly use case-specific. In most scenarios (around 90%), near real-time or event-driven approaches offer a more cost-effective solution. Real-time should be used only when there is a need to push data to or from MDM in direct response to an end-user action.
Examples of Real time use case:
1. A sales representative is creating a customer in the CRM. They need to check if the customer already exists. In this case, real-time API calls can be made to search and, if necessary, create the customer in MDM using minimal input (e.g., mobile number). This ensures that customer data originates in MDM and avoids duplicate creation by another rep in the future.
2. An organization works with suppliers that may operate from multiple locations within a region. If a supplier updates their warehouse address, the supplier manager makes the change, which is pushed to MDM. From there, it can be propagated to logistics systems to trigger alerts and avoid potential disruptions. While near real-time could work in this scenario, real-time integration is ideal—allowing for instantaneous decisions such as call-backs or rerouting at distribution hubs.
Choosing the Right Pattern#
Factors to Consider:#
- Frequency of Updates/Data Velocity: - Real-time ingestion for immediate updates needed for user-driven decisions or transactional systems. - Near Real-timeFrequent events where data freshness is important but not immediate. - Batch ingestion for infrequent or periodic updates.
- Volume of Data: - Batch ingestion is more efficient for large datasets. - - Manual ingestion via workflows for very small data sets.
- Processing Complexity: - Batch mode allows for complex transformations on entire datasets. - Real time api and events are typically meant for simpler, ongoing processes.
- Business Need: - Real-time event-driven ingestion for critical time-sensitive processes. - Hybrid solutions for complex environments with multiple source types.
- Regulatory or Compliance Constraints: - Secure file-based ingestion or encrypted API-based integration for sensitive data (e.g., drug trial data or patient PII). - Ingestion via workflows for data that require manual checks .
Technology Choices#
Informatica IDMC#
IDMC provides microservices for both Cloud Application Integration (CAI) and Cloud Data Integration (CDI). The choice between them depends on the specific use case and integration requirements.
When to Use CDI (Cloud Data Integration)
Use Case: Data movement, transformation, and synchronization — primarily in batch or micro-batch modes.
Common Scenarios:
- Loading master data from Internal Data Bases, files, Data warehouse or lake loading (Snowflake, Redshift, BigQuery). etc., into MDM SaaS.
- ETL/ELT jobs to integrate, cleanse, and transform data.
- Initial and periodic bulk data ingestion into MDM.
- Performing complex data mapping/transformation logic.
- Schedule-based jobs or triggered via APIs/Taskflows.
Characteristics:
- High-volume data processing.
- Supports mapping, parameterization, and reusability.
- Optimized for performance and scalability.
- Can be embedded in MDM SaaS ingestion jobs.
When to Use CAI (Cloud Application Integration)
Though CAI provides API-centric integration capabilities, it is recommended to use Mulesoft for integration with external systems . CAI however should be used for operations , integrations of process flows,exception handling inside of MDM
Other Tools#
- MuleSoft: For API-led ingestion and integration patterns,listening to events,building composite APIs
- PySpark/Flink with Kafka: For high-volume real-time or streaming ingestion.
- SAP Event Mesh: For event-driven architectures in SAP-integrated ecosystems.
Tip: The right tool depends on latency needs, volume, and system compatibility. Use Mulesoft for API/event orchestration, CDI for batch jobs, and Kafka/Flink for high-throughput stream processing.
Key Components of an Ingestion Layer#
1. Raw Layer /Pre-Landing Layer#
- Function: To read and store the data AS-IS in MDM.
- No transformation
- Typically stores data as-is (in blobs, strings, or semi-structured formats)
- May include metadata like filename, ingestion time, batch ID
-
High-Level Processes:
-
Source Delta Detection : Identifying the delta from source by maintaining the last datetime of source data comsumption.
- Schema Validation: Ensure the data conforms to expected schemas (e.g., column names, data types).
- Initial Event Logging: Log events like ingestion errors for troubleshooting.
- Error Handling: Route invalid data to an error file or table for reprocessing.
-
Temporary Cleanup: Normalize raw data formatting (e.g., character encoding, trimming white spaces).
-
Use Case
| Scenario | Why Raw Layer Is Used |
|---|---|
| Multiple source formats (CSV, JSON, XML, API) | To capture unprocessed source data exactly as received |
| Audit/regulatory needs (21 CFR Part 11, IDMP, GxP) | Enables full traceability back to the original record |
| Data from unstable/untrusted systems | Acts as a backup to recover from ingestion or transformation errors |
| Reprocessing needs (replay logic) | Allows re-ingestion without recalling from source |
-
Output: Unstandardized data moved to the landing layer for refinement and further processing.
-
Naming Standards: (prefix
RAW_) for components belonging to this layer
2. Landing Layer#
-
Function: Prepares, cleanses, de-duplicates, and transforms source data to be ready for mastering. When reading data from a variety of source systems, it must be transformed to align with the logical data model defined for MDM. The landing layer serves as the initial restructuring stage, mapping raw data into a format that mirrors the logical MDM model. This includes handling dependencies between sources, combining records, and applying business-rule-driven transformations.
-
High-Level Processes:
-
Data Cleansing and Standardization: Before data is written into MDM, it must meet specific quality standards. This includes validation, cleansing, and standardization processes, which may involve third-party services. The Quality Layer is often tightly integrated with the Landing Layer and may operate in parallel. It should implement the following controls:
- Apply standard formats (e.g., dates in
YYYY-MM-DD, consistent case for text). - Address formatting discrepancies.
- Phone number standardadization
- Leverage existing data quality rules and frameworks to ensure alignment with organizational standards.
- Apply standard formats (e.g., dates in
-
Use enterprise reference data where available to enforce consistency.
-
Data Transformation:
- Apply mapping rules to convert source data to target (MDM Logical Data Model) formats.
- Enrich records with derived data or simple calculations.
- May include status flags (e.g., is_valid, processed, rejected
-
Delta Detection and Search Before Create:
- Inorder to limit the number of records ingested to MDM and improve performance , it is recommended to identify true delta (records which have not changed as per the MDM Data model format ) and remove duplicates .
- Identify and remove exact duplicates.
- Enable Search before create by searching against the master collection to preclude duplicates at the point of entry.
- Pre-group similar records for further mastering.
-
Error Handling:
- Route problematic records to an error log.
- Generate exception reports for review.
-
- Reject records that are missing mandatory or business-critical fields, with appropriate error logging and notification.
- Log errors in Data Quality tables (prefix
LND_DQ_) with clear messages and timestamps.
Different source systems may have separate ingestion pipelines, but all should ultimately populate a common set of landing tables, designed according to the logical data model.
This layer should include:
- Landing tables (prefix
LND_) aligned to the logical model. - Data pipelines (ETL, streams, or APIs) from each source system writing to those landing tables.
- Use intermediate tables (INT_LND_) where needed to reduce complexity. - Quality and standardization orchestration, with rules to reject invalid data early.
-
Audit tracking for traceability and compliance.
- Use Case
| Scenario | Why this Layer Is Used |
|---|---|
| Apply data quality validations (null checks, format rules) | Prepares data for business-level transformations in MDM |
| Handle source-to-mdm logical model mapping | Converts source schema into MDM-ready structure |
| Support match & merge logic | Feeds normalized data to MDM matching engine |
| Filter duplicates or invalid rows | Ensures only clean, structured data enters MDM |
Examples: - Standardizing and augmenting the address data with AVC score via Informatica Address Verification Service - Splitting a full name field into first, middle, and last names to enforce naming standards.
Output: Standardized, cleansed, and pre-processed data prepared for the mastering/unification layer.
History Layer (Optional)#
The History Layer is used to track changes over time for auditability, rollback, or advanced analytics. While optional, it becomes essential in regulated environments or when historical context is valuable. Key aspects include:
- Versioned storage of records for change tracking.
- Timestamps and change reason metadata.
- Comparison logic to detect what changed and when.
Data Placement Across Layers:#
-
Regardless of mode, all data should pass through:
-
Optional Raw Layer: Store AS-IS from source
- Landing Layer: Logical model alignment and Validation and standardization -
- Optional History Layer: If change tracking is needed
- Load Layer: Final stage before ingestion into MDM