Data unification in mdm
Data Unification in MDM#
The unification layer in MDM consolidates and harmonizes data from multiple source systems to create a single source of truth. In this layer, cleansed and standardized data from the landing area is loaded into MDM for unification. The layer focuses on designing jobs that load data into the MDM data model format, defining data quality and validation rules, configuring match and survivorship rules, enabling fields for elastic search indexing, performing reference data checks, defining a user- and role-centric UI, configuring workflows for data create/update, and resolving potential matches.
Match and Merge#
Key Ideas
- Match and merge algorithms must minimize false positives/overmatching (incorrect matches) and false negatives/undermatching (missed matches).
- Define robust survivorship rules to choose the most trusted source for conflicting data during golden record creation.
- Ensure manual review workflows for suspected duplicates that fall below the confidence threshold and for borderline cases.
- Optimize match and merge processes to handle large datasets efficiently. Support incremental matching (identify matches only on new or updated records). Use partitioning, parallel processing, and indexing strategies for large datasets.
- Ensure that every match/merge decision is auditable and traceable. Maintain logs of matched records and provide match lineage (details of which data sources contributed to the golden record).
- Allow reversals of incorrect merges (unmerges).
- Ensure matching rules account for multilingual data. Handle case insensitivity, regional naming conventions, and transliterations (e.g., "José" vs. "Jose").
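The multilingual point above can be illustrated with a small pre-match normalization step. This is a minimal sketch using only the Python standard library, not the standardization that CDQ or MDM SaaS performs; the function name `normalize_name` is hypothetical. It covers case and accent differences only, so true transliteration between scripts (for example Cyrillic to Latin) would need a dedicated library or service.

```python
import unicodedata


def normalize_name(value: str) -> str:
    """Build a comparison key: accents stripped, case folded, whitespace collapsed."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    # Drop the combining marks so that "é" compares equal to "e".
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower() intended for caseless comparison.
    return " ".join(without_marks.casefold().split())


same_person = normalize_name("José") == normalize_name("JOSE ")
```

The normalized key is meant for comparison only; the original value should still be stored so that the golden record keeps the correct spelling.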
Matching#
| Best Practices | Guardrails |
|---|---|
| Standardize data before matching. Use CDQ or pre-processing steps to clean names, addresses, and emails. | Don’t match unstandardized or raw data; it reduces match accuracy. |
| Use hierarchical match rules. Start with exact, then fuzzy rules. | Avoid overly loose fuzzy match rules at the top — it increases false positives. |
| Leverage ML Matching (AutoML). Enable AI-assisted matches for better flexibility in fuzzy scenarios. | Don’t rely solely on fuzzy logic without human validation in high-risk domains. |
| Include match confidence scores. Use thresholds to route low-confidence matches for manual review. | Avoid auto-merging matches below an accepted confidence threshold. |
| Use key identifiers early. Prioritize NPI ID, HIN, DEA number, Email, or Phone in top match rules. | Don’t depend only on demographic fields for entity matching (e.g., Name + City). |
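The practices in the table above (identifiers first, then fuzzy rules, with a confidence threshold that routes uncertain pairs to a steward) can be sketched as follows. This is an illustration under stated assumptions rather than how MDM SaaS scores matches: the field names, the `REVIEW_THRESHOLD` value, and the use of `difflib.SequenceMatcher` as the similarity measure are all hypothetical choices.

```python
from difflib import SequenceMatcher

# Assumed value for illustration; calibrate against steward-reviewed samples.
REVIEW_THRESHOLD = 0.75
# Hypothetical field names for the key identifiers used in the top rules.
EXACT_KEYS = ("npi", "email")


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def classify_pair(rec_a: dict, rec_b: dict) -> tuple:
    """Return (decision, score) for a candidate pair."""
    # Guardrail: conflicting critical identifiers block the match outright.
    npi_a, npi_b = rec_a.get("npi"), rec_b.get("npi")
    if npi_a and npi_b and npi_a != npi_b:
        return "no_match", 0.0

    # Rule 1 (exact): a shared key identifier is the strongest evidence.
    for key in EXACT_KEYS:
        if rec_a.get(key) and rec_a.get(key) == rec_b.get(key):
            return "auto_merge", 1.0

    # Rule 2 (fuzzy): name evidence alone never auto-merges in this sketch;
    # it can only send the pair to manual review.
    name_a, name_b = rec_a.get("name", ""), rec_b.get("name", "")
    if not name_a or not name_b:
        return "no_match", 0.0
    score = name_similarity(name_a, name_b)
    if score >= REVIEW_THRESHOLD:
        return "manual_review", score
    return "no_match", score
```

Keeping the fuzzy rule out of the auto-merge path is a deliberate design choice here: it reflects the guardrail against relying on demographic fields alone.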
Merging#
| Best Practices | Guardrails |
|---|---|
| Define clear merge policies: Passive (no auto-merge) or Active (auto-merge). | Don’t mix auto-merge and manual merge without governance. |
| Use survivorship-friendly merge strategy. Ensure merge decisions align with survivorship logic. | Avoid merging records with conflicting critical data (e.g., NPI ID mismatch). |
| Enable audit trail for merge events. Keep logs of who merged what and why. | Don’t allow anonymous or unlogged merges in GxP/sensitive environments. |
| Set merge thresholds. Require steward intervention below certain confidence levels. | Don’t auto-merge if confidence scoring isn’t calibrated. |
| Validate merged records. Use post-merge validation checks (e.g., uniqueness, completeness). | Don’t skip validation after merge; it may propagate bad data. |
Survivorship#
| Best Practices | Guardrails |
|---|---|
| Survivorship at attribute level. Each attribute (email, name, address) should have its own rule. | Avoid "one-rule-fits-all" survivorship logic. |
| Use source priority and trust score. Rank systems like ERP > CRM > Web. | Don’t treat all systems equally if data quality varies. |
| Include fallback logic. E.g., most recent non-null, or longest value. | Don’t leave attributes blank if the preferred source is null. Use validation rules. |
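Attribute-level survivorship with source priority and a null fallback can be sketched as below. The source ranking follows the ERP > CRM > Web example from the table; the record layout, the rule assignments per attribute, and all function names are assumptions made for illustration.

```python
from datetime import date

# Ranking taken from the ERP > CRM > Web example; higher number wins.
SOURCE_PRIORITY = {"ERP": 3, "CRM": 2, "WEB": 1}


def by_source_priority(candidates):
    return max(candidates, key=lambda c: SOURCE_PRIORITY.get(c["source"], 0))


def most_recent(candidates):
    return max(candidates, key=lambda c: c["updated"])


def longest_value(candidates):
    return max(candidates, key=lambda c: len(c["value"]))


# Each attribute gets its own rule instead of one rule for the whole record.
ATTRIBUTE_RULES = {
    "name": by_source_priority,
    "email": most_recent,
    "address": longest_value,
}


def build_golden_record(records):
    golden = {}
    for attribute, rule in ATTRIBUTE_RULES.items():
        # Fallback: null or empty values are excluded before the rule runs,
        # so a lower-ranked source can still supply the attribute.
        candidates = [
            {"value": r[attribute], "source": r["source"], "updated": r["updated"]}
            for r in records
            if r.get(attribute)
        ]
        golden[attribute] = rule(candidates)["value"] if candidates else None
    return golden


RECORDS = [
    {"source": "ERP", "updated": date(2023, 1, 1), "name": "Acme Corp",
     "email": None, "address": "1 Main St"},
    {"source": "CRM", "updated": date(2024, 3, 1), "name": "ACME",
     "email": "old@acme.example", "address": "1 Main Street, Suite 200"},
    {"source": "WEB", "updated": date(2024, 6, 1), "name": "Acme",
     "email": "new@acme.example", "address": None},
]
golden = build_golden_record(RECORDS)
```

In this sample the name survives from ERP, the email from the most recently updated non-null source, and the address from the longest value.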
Match and Merge Configuration in IDMC MDM SaaS#
The following topics describe the capabilities and options available in IDMC MDM SaaS and highlight use cases for each option.
Key terms used in MDM SaaS configuration:
| Term | Description |
|---|---|
| Match | To identify potential duplicate records (fuzzy or exact) using match rules. Match rules dictate the logical algorithm to be used in the identification of duplicates. In MDM SaaS, you can have declarative rules or machine learning rules or both |
| Merge | To consolidate duplicates identified in the match step into a single golden record using survivorship rules. |
| Cluster | To group similar records into potential duplicates before final merge |
IDMC MDM SaaS configuration starts with defining the match model.
Match Model = Match Rules + Merge Settings
Define Match Model#
Match Model Properties#
| Component | Description |
|---|---|
| Model Name | Distinct name of the model. Refer to the naming-standards.md for the match model name. |
| Model Objective | Dropdown. Set Resolve duplicates. |
| Population | The match model is governed by the population file, which defines the characteristics of the data to match and the logic used to generate match keys for the selected country and language. The default population is USA; update it to suit the data set being consolidated. |
Candidate Selection Criteria#
Select the candidates according to the requirements and the profiling report.
| Component | Description |
|---|---|
| Field name / Match Column | Fields used to compare (e.g., name, email, phone). Match tokens are generated based on the selected field names. |
| Field Type | Specifies the type of data that is contained in the business field name, such as Person_Name, Address_part1, Code. Refer to the Informatica MDM documentation for a full list of supported data types and field classifications. |
| Key generation level | Defines the level of thoroughness used by the system to generate the match key for candidate selection, such as standard, limited, extended. This determines the number of keys that get generated for the record. |
| Candidate search level | Specifies how thoroughly the system searches for potential match candidates. Options include: narrow, typical, exhaustive. |
| Weights | Importance of each column in computing score. |
Key Selection Criteria
| Key level | Match Token Quantity | Source Data Volume | Source Data Quality | Time taken for key generation |
|---|---|---|---|---|
| Extended | Large | Not Large | Poor | Large |
| Limited | Small | Very Large | Good | Less |
| Standard | Medium | Medium | Medium | Medium |
Candidate Search Level Criteria
| Search level | Number of match candidates found | Source Data Volume | Criticality | Time taken to match |
|---|---|---|---|---|
| Extreme | Most | Small | Very high | Very high |
| Exhaustive | Many | Small | High | High |
| Typical | Moderate | Medium to large | Medium | Medium |
| Narrow | Moderate | Very large | Low | Fast |
Design Match Rule Sets in MDM SaaS#
Declarative Match Rules#
| Component | Description |
|---|---|
| Match Strategy | Exact (Deterministic) or Fuzzy (Probabilistic). |
| Match Criteria | Defines the business purpose for match and drives the evaluation logic such as resident, organization, contact. Refer to the Informatica MDM documentation for all available options. |
| Match Level | Indicates how much variability is allowed between records that still match, for example conservative, typical, or loose. |
| Rank | Sets the order in which the match rule should run. |
| Merge Strategy | Automated merge (records merged without review), manual merge (human review and approval), skip (match is ignored), threshold (fuzzy rule with score ranges for each type of merge). |
You can chain multiple rules (e.g., match on email OR (name + phone)).
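The chaining described above, together with the Rank and Merge Strategy components, can be pictured as an ordered list of rules in which the first rule that fires decides the outcome. In MDM SaaS this is configured declaratively; the sketch below only illustrates the logic, and the rule names, field names, and strategy assignments are hypothetical.

```python
# Hypothetical rule set: "email OR (name + phone)", evaluated in rank order.
MATCH_RULES = [
    {"rank": 1, "name": "exact_email", "fields": ("email",), "merge_strategy": "automated"},
    {"rank": 2, "name": "name_and_phone", "fields": ("name", "phone"), "merge_strategy": "manual"},
]


def same(rec_a: dict, rec_b: dict, field: str) -> bool:
    """True only when both records carry the same non-empty value."""
    value_a, value_b = rec_a.get(field), rec_b.get(field)
    return bool(value_a) and value_a == value_b


def evaluate(rec_a: dict, rec_b: dict):
    """Return (rule name, merge strategy) of the first rule that matches, else None."""
    for rule in sorted(MATCH_RULES, key=lambda r: r["rank"]):
        # All fields inside one rule must agree (AND); separate rules act as OR.
        if all(same(rec_a, rec_b, field) for field in rule["fields"]):
            return rule["name"], rule["merge_strategy"]
    return None
```

Ordering by rank means the strictest rule gets the first chance to decide, which is why loose rules belong further down the list.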
Iterative Finalization of the Match Model#
Finalizing the match model and rules should be treated as an iterative process. Always run the match model on the complete dataset if possible, and collaborate closely with Business Analysts and stakeholders to evaluate outcomes. Assess whether the model is under-matching or over-matching, and adjust configurations accordingly to strike the right balance between precision and recall.
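One way to put numbers on under-matching and over-matching during each iteration is to compare the pairs the model proposes against a sample of pairs that stewards have confirmed. The sketch below assumes such a reviewed sample exists; the function name and the pair representation are illustrative.

```python
def precision_recall(predicted_pairs, true_pairs):
    """Compare proposed match pairs with steward-confirmed pairs.

    Low precision suggests over-matching (false positives);
    low recall suggests under-matching (missed matches).
    """
    # frozenset makes (a, b) and (b, a) count as the same pair.
    predicted = {frozenset(pair) for pair in predicted_pairs}
    actual = {frozenset(pair) for pair in true_pairs}
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(actual) if actual else 1.0
    return precision, recall


precision, recall = precision_recall(
    predicted_pairs=[("a", "b"), ("c", "d"), ("e", "f")],
    true_pairs=[("b", "a"), ("c", "d"), ("g", "h"), ("i", "j")],
)
```

In this sample two of the three proposed pairs are correct (precision 2/3) and two of the four true pairs were found (recall 1/2), which points to under-matching as the larger problem.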
Machine Learning (ML)#
Machine learning models in MDM can be trained with help from business users, who label record pairs as matches or non-matches. This human feedback helps the model learn how to identify duplicate records accurately.
Key Factors for Building an Effective ML Match Model#
- High-Quality Training Data
    Select the right data for the training set. Ideally, the training set should include at least 1,000 record pairs representing a diverse range of match scenarios. Include:
    - Clear matches
    - Clear non-matches
    - Close-but-should-match cases
    - Close-but-should-not-match cases
    Pull real production data wherever possible to ensure realistic examples.
- User Input and Labeling
    Business users label records during training using the following options:
    - Match
    - Not a Match
    - Not Sure
    - Need More Data
    The accuracy of these labels is critical to the quality and performance of the resulting model.
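Before training, it can help to check a labeled set against the guidance above. The sketch below is a hypothetical pre-check, not an MDM SaaS feature: it counts the four label options, verifies the 1,000-pair minimum, and reports how many labels are decisive (Match or Not a Match).

```python
from collections import Counter

MIN_PAIRS = 1000  # minimum training set size suggested above
LABELS = ("Match", "Not a Match", "Not Sure", "Need More Data")


def summarize_labels(labels):
    """Summarize a list of steward labels, one entry per record pair."""
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"Unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    decisive = counts["Match"] + counts["Not a Match"]
    return {
        "total": total,
        "enough_pairs": total >= MIN_PAIRS,
        # Share of labels that give the model a clear signal.
        "decisive_share": decisive / total if total else 0.0,
        "counts": {label: counts[label] for label in LABELS},
    }
```

A low decisive share may indicate that the sampled pairs lack enough attributes for users to decide, which is worth resolving before training.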
| Component | Description |
|---|---|
| Match Strategy | Exact (Deterministic) or Fuzzy (Probabilistic). |
| Match Criteria | Defines the business purpose for match and drives the evaluation logic such as resident, organization, contact. Refer to the Informatica MDM documentation for all available options. |
| Match Level | Indicates how much variability is allowed between records that still match, for example conservative, typical, or loose. |
| Rank | Sets the order in which the match rule should run. |
| Merge Strategy | Automated merge (records merged without review), manual merge (human review and approval), skip (match is ignored), threshold (fuzzy rule with score ranges for each type of merge). |
Define Merge Rules (Survivorship)#
Note: For grouped fields such as Addresses, survivorship rules are applied to the entire group—not to individual fields within the group. This means the selection logic determines which full address set survives, rather than field-by-field evaluation.
- Rules always override the source rank configuration.
- Decay only applies at the time of merge, un-merge, or update event.
- Use logical trust buckets when there is a large number of source systems.
- Informatica 360 source should always have the highest maximum trust and quickest decay for data steward changes.
| Component | Description |
|---|---|
| Source Ranking | Ranks source systems by reliability. Data from the highest-ranked source system is treated as more reliable than data from the other source systems. |
| Rule | Data quality validation rules can be used to downgrade the calculated trust score. These rules should test for conditions that warrant a downgrade, such as null values, values that match a pattern, or values that are not active. |
| Decay | The decay rate determines how fast the trust score decreases over time. |
At the end of trust score calculations, if multiple records have the same score, the record with the most recent Last Update Date (LUD) is selected.
If there is still a tie, the most recent Create Date is used as the final tiebreaker.
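The tie-breaking order described above (trust score, then most recent Last Update Date, then most recent Create Date) can be sketched as a single sort key. The decay formula below is an assumption made purely for illustration: it lowers trust linearly from a maximum to a minimum over a decay period, and the actual calculation in the product may differ. All field names are hypothetical.

```python
from datetime import date


def trust_score(max_trust, min_trust, decay_days, last_update, as_of):
    """Assumed linear decay from max_trust to min_trust over decay_days."""
    age_days = (as_of - last_update).days
    fraction = min(max(age_days / decay_days, 0.0), 1.0)
    return max_trust - (max_trust - min_trust) * fraction


def pick_survivor(candidates, as_of):
    """Highest trust wins; ties fall back to last update, then create date."""
    def sort_key(candidate):
        score = trust_score(
            candidate["max_trust"], candidate["min_trust"],
            candidate["decay_days"], candidate["last_update"], as_of,
        )
        # Tuples compare element by element, which gives the tiebreak order.
        return (score, candidate["last_update"], candidate["created"])

    return max(candidates, key=sort_key)
```

Here `as_of` stands for the date of the merge, unmerge, or update event, since decay is only applied at those points.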
| Field | Merge Rule Type | Priority Order |
|---|---|---|
| | Most Recent | Based on update timestamp |
| Phone | Longest Value | Prefer landline |
| Middle Name | Most Complete | Longest non-null value |
| Country | Trusted Source | Prefer Source System A |
| Birthdate | Earliest | Prefer earliest valid birthdate |
Define Business Events#
!!! success "Key Ideas"
* Design reusable workflows for recurring operational tasks.
* Assign tasks dynamically to relevant users (e.g., data stewards, data owners) based on rules or roles.
* Ensure workflows reflect real-time changes to master data.
* Workflows must include mechanisms to manage exceptions like duplicate records or validation failures. Allow logs or comments to capture why exceptions were approved or rejected.
* Ensure all workflow actions are logged for compliance and traceability. Capture metadata like who completed each task, timestamps, and comments. Provide audit trails for regulatory compliance.
* Provide clear indicators of pending, in-progress, and completed tasks.
* Implement undo options to reverse accidental changes.
* Allow bulk approval or rejection actions for efficiency.
* Implement timely notifications for pending tasks and deadline escalations. Use email or system notifications to alert users of task assignments or overdue actions.
- In MDM SaaS, workflows are handled using Business Events. A business event can trigger a workflow that consists of multiple tasks linked together. Business Events also support human interaction.
- There are two primary types of business events:
    - User-triggered events - Actions performed directly by users, such as creating or updating records or managing hierarchies.
    - System-generated events - Automated actions performed by the system, such as detecting potential record matches that require manual review.
To configure a business event in MDM SaaS, consider the following components:
- User Roles
    Define which user roles are authorized to trigger the event.
- Actions
    - User-triggered: create, update, delete, unmerge
    - System-generated: resolve potential matches
    (These are the currently supported actions in MDM SaaS.)
- Assets
    Identify the master data entity on which the event is triggered (e.g., Person, Organization).
- Workflows
    Select a supported workflow type:
    - One-step workflow
    - Two-step workflow
- Tasks
    Define the tasks that will be created as part of the workflow.
- Task Properties
    Configure what users will see in their task inbox:
    - Title - Make it dynamic by including the record name and business event name to ensure clarity.
    - Priority - Set based on business urgency.
    - Due Date - Define expected completion timelines.
Note - For customization of Business Events, CAI workflows can be leveraged.
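The task properties described above (a dynamic title, a priority, and a due date) can be pictured with a small template object. Business events are configured in the MDM SaaS UI rather than in code, so this is only a conceptual sketch; the class name, field names, and title pattern are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class TaskTemplate:
    """Hypothetical stand-in for the task properties of a business event."""
    title_pattern: str   # placeholders: {event} and {record}
    priority: str        # set based on business urgency
    due_in_days: int     # expected completion timeline

    def render(self, record_name: str, event_name: str, today: date) -> dict:
        return {
            # A dynamic title tells the steward what the task is about at a glance.
            "title": self.title_pattern.format(event=event_name, record=record_name),
            "priority": self.priority,
            "due_date": today + timedelta(days=self.due_in_days),
        }


template = TaskTemplate(title_pattern="{event}: {record}", priority="High", due_in_days=3)
task = template.render("Jane Doe", "Resolve potential matches", date(2024, 5, 1))
```

Including both the event name and the record name in the title keeps the task inbox readable when many tasks of the same type are pending.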
MDM UI#
!!! success "Key Ideas"
* Design a user-centric UI. Prioritize the needs of different user roles and customize views accordingly (e.g., data stewards see pending tasks, data managers see review tasks).
* In the UI, present master data in a format that allows users to quickly assess quality, relationships, and completeness.
* Allow users to search for records or filter tasks based on context-specific criteria.
* Restrict data visibility and actions to authorized users. Implement role-based access control (RBAC) for sensitive data.
* Display real-time updates on workflows, data quality metrics, and exceptions.
* Include widgets for KPI tracking. Provide alerts for critical issues.
* Configure and manage shared datasets and controlled vocabularies as reference data in R360 so that these critical data sets are aligned and governed across the enterprise.
Reference Data Management#
!!! success "Key Ideas"
* Provide a central hub or domain for all codes and classifications (like country codes, industry codes, status values).
* Implement workflows and roles to control who can add, modify, or deactivate codes.
* Model parent-child relationships (like Country > States, Categories > Subcategories).
* Handle future-dated and inactive codes gracefully.
* Provide well-documented API endpoints for retrieving and validating codes in real time.
* Provide a specialized UI for data stewards to manage codes safely.
* Support search, filter, add, deactivate, and approve operations.
* Support for **Scope - Global/Local** and **Code List** Reference Data Sets
    * The data model should allow a “Scope” indicator (like `global` or `local`) to be attached to each code to accommodate both enterprise-wide standards and local variation.
    * It should be a hierarchical model in which the parent (global) set is defined first and the local set extends or overrides it only where necessary.
    * Global and local reference data sets should be controlled by separate roles and workflows.
    * The API should allow **filter by scope**, e.g.:
      `GET /reference-codes?scope=global`
      `GET /reference-codes?scope=local&region=EU`
    * `global` reference data should be controlled by a central enterprise team and `local` reference data by regional or departmental stewards.
    * For controlled data sets that are NOT enterprise standards but are required for a specific project, it is recommended to use a separate `CODE LIST` reference data set.
    * The `CODE LIST` reference data type should follow a naming convention of `CODE_LIST_<>` and should be owned and governed by the project team. The `CODE LIST` reference data set should not be exposed via the generic APIs.
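The scope filter and the "local extends or overrides global" rule can be sketched as follows. The `/reference-codes` path and the `scope` and `region` parameters mirror the example requests above; the function names and the code-to-label dictionary layout are assumptions for illustration, and no real endpoint is called.

```python
from urllib.parse import urlencode


def reference_codes_url(scope, region=None):
    """Build the query string for a scope-filtered request."""
    params = {"scope": scope}
    if region:
        params["region"] = region
    return "/reference-codes?" + urlencode(params)


def resolve_codes(global_codes, local_codes):
    """Start from the global set, then let local entries extend or override it."""
    resolved = dict(global_codes)
    resolved.update(local_codes)
    return resolved


global_status = {"A": "Active", "I": "Inactive"}
eu_status = {"I": "Inactive (archived)", "P": "Pending review"}
effective = resolve_codes(global_status, eu_status)
```

Resolving the effective set on read keeps the global set untouched, so a local override never leaks into other regions.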
| Role | Responsibilities |
|---|---|
| Reference Data Owner | Owns the set of codes. Approves additions, modifications, and deletions. Ensures codes reflect business needs. |
| Reference Data Steward | Initializes requests for new codes, reviews submissions for completeness and consistency, performs lightweight validation before approval. |
| MDM Administrator | Maintains the configuration in MDM UI, performs import/export, assists in technical implementation, and controls API, roles, and permissions. |
| Governance Council / Approval Team | Final approval for enterprise-wide codes or when a “local” set might affect enterprise; reviews for potential overlap, redundancy, or future enterprise use. |
| Consumer Application Owner | Consumes the codes and must be kept informed of additions or deletions; may raise requests if a code set needs to be updated. |
| Project Team | Owns and maintains the CODE LIST reference data. |