Data Publish from MDM
Key Ideas
- Always publish golden or mastered data, not raw or intermediate data, to avoid confusion and inconsistencies downstream.
- The canonical view should reflect the current best representation of the entity with resolved duplicates, cleaned attributes, and enriched information.
- Push messages to Kafka or an API when changes or new records become available.
- Adopt a schema-first approach: define your data structure in a schema and validate messages against this schema to avoid schema drift and maintain compatibility across consumers.
- Design your messages to be backward compatible by adding fields with defaults, not removing or renaming existing fields immediately, and providing fallback strategies for older consumers.
- Provide a unique identifier or a composite key in messages.
- Consumers should be able to safely apply messages multiple times without side effects (important for retry scenarios).
- Include metadata fields in messages: timestamp of change, source of change, version number, and trace or message IDs for diagnostics.
- Ideally, messages should reflect a transactionally consistent view of the record: all related fields should reflect the state at a single point in time.
- Push messages at the level of an entity or master record, not an undifferentiated "bulk" event.
- Clearly reflect CRUD operations in messages (such as CREATE, UPDATE, DELETE) if applicable.
- Filter and mask sensitive fields (like PII) if not required by all consumers.
- Handle authorization and authentication at the delivery layer (API, Kafka).
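Several of the ideas above (a unique key, metadata fields, an explicit operation, and idempotent consumption) can be illustrated with a small sketch. The envelope layout and every name in it (`PublishEnvelope`, `entity_key`, `IdempotentStore`, and so on) are assumptions made for illustration, not a format prescribed by MDM or Kafka.

```python
import json
import uuid
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class PublishEnvelope:
    """Hypothetical envelope around one mastered record."""
    entity_key: str   # unique identifier of the master record
    operation: str    # CREATE, UPDATE or DELETE
    version: int      # increases with every change to the record
    source: str       # system that caused the change
    payload: dict     # golden-record attributes, already filtered/masked
    changed_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def to_json(self) -> str:
        return json.dumps(asdict(self))


class IdempotentStore:
    """Consumer-side store that ignores duplicates and stale versions."""

    def __init__(self) -> None:
        self.records = {}  # entity_key -> (version, payload or None)

    def apply(self, message: dict) -> bool:
        """Apply a message; return True only if the stored state changed."""
        key, version = message["entity_key"], message["version"]
        current = self.records.get(key)
        if current is not None and current[0] >= version:
            return False  # duplicate or out-of-order delivery: no side effect
        if message["operation"] == "DELETE":
            self.records[key] = (version, None)  # keep a tombstone
        else:
            self.records[key] = (version, message["payload"])
        return True
```

Because `apply` compares versions per key, a retried delivery of the same message leaves the consumer state unchanged, which is the property the retry scenario requires.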
Why Publish?
- Integrate with other systems to provide them with master data.
- Notify source systems about any improvements or corrections made to their submitted data during MDM processing.
When to Publish?
- A new record is created in MDM
- An existing record in MDM is updated
- A record is merged or unmerged
- A record is soft deleted
Publish Patterns
Data from MDM is generally converted to a publish format defined in Data Contracts, Avro Schemas, API Specifications, etc., and made available to consumers in two modes: batch and real-time. The right integration method depends on the requirements of the consuming system, but when developing a product it is best to design both batch and real-time interfaces and let the consuming systems choose the approach that fits their use cases.
Batch Mode
Batch integration is suitable when
- Real-time data updates are not critical and some latency/delayed synchronization is acceptable between updates in the MDM system and the consuming applications.
- Volume is high. MDM supports both full extract and incremental load strategies, depending on system capabilities and integration frequency.
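The difference between a full extract and an incremental load can be sketched as a watermark-based selection. This is a minimal illustration only: the `updated_at` field and the function name are hypothetical and do not describe how an MDM egress job is actually configured.

```python
from datetime import datetime, timezone


def select_for_extract(records, last_watermark=None):
    """Return (records to publish, new watermark).

    Without a watermark the call behaves as a full extract; otherwise only
    records changed after the watermark are selected (incremental load).
    """
    if last_watermark is None:
        selected = list(records)
    else:
        selected = [r for r in records if r["updated_at"] > last_watermark]
    # Keep the old watermark when nothing new was selected
    new_watermark = max((r["updated_at"] for r in selected),
                        default=last_watermark)
    return selected, new_watermark


records = [
    {"id": "SUB-1", "updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": "SUB-2", "updated_at": datetime(2024, 2, 1, tzinfo=timezone.utc)},
]
full, watermark = select_for_extract(records)             # full extract
delta, watermark = select_for_extract(records, watermark)  # nothing new yet
```

Storing the returned watermark after each successful run lets the next scheduled run pick up only the records changed since then.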
Real-time / Near Real-time Mode
For use cases that require:
- 'Zero' latency
- Trigger/Event Based
- Queuing or Orchestration
| Publication Method | Best Use Case | Data Delivery | Advantages | Challenges | How is it achieved |
|---|---|---|---|---|---|
| Via REST API | - Real-time or on-demand data retrieval. - Interactive system integrations (e.g., CRM). | On-demand via Mulesoft API | - Real-time access to master data. - Supports REST APIs for flexible integrations. - Fine-grained control over data consumption. | - Requires API design and management. - Higher initial setup complexity. - Dependency on API consumers for performance. | - REST APIs of MDM are onboarded to Mulesoft MAP to allow consumers to read data in real time. - CAI workflows can be used to transform the MDM data into the target API specification format. |
| Via Events | - Real-time, event-driven data delivery. - Distributed systems or microservices requiring synchronized updates. - Integration with the Analytics Platform. | Real-time messaging | - High throughput with near zero latency. - Decouples producer (MDM) and consumers. - Supports scalable, event-driven architectures. - Suitable for handling high-frequency updates. | - Requires Kafka infrastructure setup and maintenance. - Message schema requires careful definition. - Event management can become complex in distributed systems. | - MDM business events are configured to publish creates/updates of the MDM data to Kafka topics. - CAI workflows can be used to transform the MDM data into the target publish schema format. |
| Via ODS | - Batch-oriented data integration. - Suitable for ETL, data warehousing, or reporting workflows. | Scheduled batch | - Simple, widely used approach. - Compatible with ETL workflows and BI tools. - Easy integration with legacy systems. - Data can be directly consumed from tables. | - No real-time capability. - Requires monitoring for large dataset loads. - Table schema changes can impact downstream consumers. | - Data from the MDM base object collection is exported via an egress job to Postgres tables. Materialized views are created on the tables, and consumers can read the data either via the materialized views or via APIs. - CDI jobs can be used to transform the MDM data into the target publish schema format. |
| Via Files | - Batch-oriented data integration. - Suitable for external systems. | Scheduled batch | - Simple, widely used approach. - Compatible with ETL workflows and BI tools. - Easy integration with legacy systems. - Data can be directly consumed from files. | - No real-time capability. - Requires monitoring for large dataset loads. - Files need to be maintained/archived regularly. | - Data from the MDM base object collection is exported via an egress job to files in the publish format in an S3 bucket or ADLS. - CDI jobs can be used to transform the MDM data into the target publish schema format. |
Data Contract
A Data Contract is a critical component that must be defined to enable seamless integration between MDM and the downstream systems while supporting compliance, reducing errors, and enabling effective change management.
A data contract is a formal, reusable definition that specifies:
- The schema,
- The constraints,
- The rules,
- The types of messages that will be exchanged.
It guarantees:
- All producers and consumers conform to this format.
- There's a clear, documented "agreement" between the MDM team (publisher) and the consuming services (subscribers or API clients).
As a technical document, the Data Contract plays a key role in sustaining data quality. Adhering to the recommended specification contributes significantly to achieving FAIR principles.
The specified format, currently adopted for Data Marketplace purposes, was developed by the Data Mesh Architecture organization (datacontract.com).
The Data Contract Specification enables extensive documentation of various aspects of a Data Product, including:
- Definitions
- Background and Context
- Ownership
- Terms of Use
- Servers
- Data Models, including:
  - Attribute definitions
  - Data quality requirements
  - Lineage
  - Sensitivity
  - Examples
  - And more
- SLAs
- And more
* Data Contract Example
* Data Contract Specification
Example of Data Contract for NNDM (country)
dataContractSpecification: 0.9.3
id: urn:data:source:0012293(product:0x1e5e04b1.contract:0x3e7f1374)
info:
  title: API - Country List
  version: 0.0.1
  status: active
  owner: urn:team:source:0012293(domain:root)
  contact:
    name: Søren Skibelund
    email: OSKI@novonordisk.com
  Data Quality Description: "In country data source, we prioritize quality by verifying country records while inserting them into the Informatica Reference 360. Only verified details will be approved and others will be rejected before inserting into Reference 360 by the designated approver. This helps to have the consistent and valid country data as per Novo standards. Manual Quality Assurance: Designated approver will check and approve the data before inserting or updating in Reference 360 based on the set workflow"
terms:
  usage: |
    Unlimited number of queries per day
  limitations: |
    Not suitable for real-time use cases.
models:
  countries:
    description: One record per country. ISO 3166-1 alpha-2.
    type: table
    fields:
      Code:
        type: string
        format: ISO 3166-1 alpha-2
        required: true
        unique: true
        primaryKey: true
      fields:
        type: object
        fields:
          Code:
            type: string
            required: true
          status:
            type: string
            required: true
          Name:
            type: string
            required: true
          Description:
            type: string
          CodeTertiary1:
            type: string
          CodeSecondary1:
            type: string
servicelevels:
  availability:
    description: Mulesoft availability
    percentage: 99.9%
  frequency:
    description: "Real-time Updates: Inclusion of countries and update on an existing country fields (Description, Name, Secondary Code, Tertiary Code etc.,) will be updated in Reference 360 based on the set workflow."
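To show how a contract like this can be enforced, the sketch below checks a record against the required/type rules of the country model. The rules are transcribed by hand from the example (a real pipeline would more likely load the YAML and use dedicated data contract tooling), and the `validate_record` helper is a hypothetical name.

```python
# Field rules copied by hand from the nested field list of the example
# contract; only the "string" type appears there.
COUNTRY_FIELDS = {
    "Code": {"type": "string", "required": True},
    "status": {"type": "string", "required": True},
    "Name": {"type": "string", "required": True},
    "Description": {"type": "string"},
    "CodeTertiary1": {"type": "string"},
    "CodeSecondary1": {"type": "string"},
}

PYTHON_TYPES = {"string": str}


def validate_record(record, fields=COUNTRY_FIELDS):
    """Return a list of contract violations (empty when the record conforms)."""
    errors = []
    for name, rule in fields.items():
        value = record.get(name)
        if value is None:
            if rule.get("required"):
                errors.append(f"missing required field: {name}")
            continue
        if not isinstance(value, PYTHON_TYPES[rule["type"]]):
            errors.append(f"{name}: expected {rule['type']}")
    return errors


ok = validate_record({"Code": "DK", "status": "Active", "Name": "Denmark"})
bad = validate_record({"Code": "DK", "Name": 42})
```

Running such a check on the publisher side before a record leaves MDM, and again on the consumer side, is one way to make the "agreement" in the contract verifiable.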
Avro Schema
When MDM publishes a data set (for example, substance data to Kafka), it typically emits messages serialized in Avro format.
- Avro is a lightweight, schema-centric format designed for fast, compact messages.
- The Avro schema explicitly specifies:
  - The type of record (like "Substance"),
  - The fields it contains (such as substanceId, stage, origin, nnSubstanceName),
  - The type of each field (string, int, array, enum, etc.).
This lets Kafka consumers know exactly how to parse the messages, and it helps keep the messages forward- and backward-compatible as the schema evolves.
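The compatibility point can be made concrete with a deliberately simplified check on two versions of a record schema. It only covers the rules named in the key ideas (new fields need defaults, existing fields should not disappear); full Avro schema resolution also handles type promotion, aliases, and unions, and is normally enforced by a schema registry. Both function names are hypothetical.

```python
def added_fields_without_default(old_schema, new_schema):
    """Fields new in new_schema that a reader could not fill for old data."""
    old_names = {f["name"] for f in old_schema["fields"]}
    return [f["name"] for f in new_schema["fields"]
            if f["name"] not in old_names and "default" not in f]


def removed_fields(old_schema, new_schema):
    """Fields that existing consumers may still expect but are now gone."""
    new_names = {f["name"] for f in new_schema["fields"]}
    return [f["name"] for f in old_schema["fields"]
            if f["name"] not in new_names]


v1 = {"type": "record", "name": "substance", "fields": [
    {"name": "substanceGlobalID", "type": "string"},
    {"name": "stage", "type": "string"},
]}
v2 = {"type": "record", "name": "substance", "fields": [
    {"name": "substanceGlobalID", "type": "string"},
    {"name": "inn", "type": ["null", "string"], "default": None},  # safe
    {"name": "origin", "type": "string"},                          # no default
]}

problems = added_fields_without_default(v1, v2) + removed_fields(v1, v2)
```

Here `inn` is a safe addition because it carries a default, while `origin` (added without a default) and `stage` (removed) would both be flagged.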
Example (substance)
Substance
{
"fields": [
{
"name": "substance",
"type": {
"fields": [
{
"name": "substanceGlobalID",
"type": "string"
},
{
"name": "stage",
"type": "string"
},
{
"name": "origin",
"type": "string"
},
{
"name": "inNNCD",
"type": "string"
},
{
"name": "nnSubstanceID",
"type": "string"
},
{
"name": "nnSubstanceName",
"type": "string"
},
{
"name": "inn",
"type": [
"null",
"string"
]
},
{
"name": "analyteNumber",
"type": [
"null",
"string"
]
},
{
"name": "developmentSubstanceEVCode",
"type": [
"null",
"string"
]
},
{
"name": "innSubstanceEVCode",
"type": [
"null",
"string"
]
},
{
"name": "createDateTime",
"type": {
"logicalType": "timestamp-millis",
"type": "long"
}
},
{
"name": "latestUpdateDateTime",
"type": {
"logicalType": "timestamp-millis",
"type": "long"
}
},
{
"default": null,
"doc": "substanceAlternateNames Details",
"name": "substanceAlternateNames",
"type": [
"null",
{
"fields": [
{
"default": null,
"name": "substanceNameType",
"type": [
"null",
"string"
]
},
{
"default": null,
"name": "substanceName",
"type": [
"null",
"string"
]
},
{
"default": null,
"name": "effectiveFromDateTime",
"type": [
"null",
{
"logicalType": "timestamp-millis",
"type": "long"
}
]
}
],
"name": "substanceAlternateNames",
"type": "record"
}
]
},
{
"doc": "referenceSubstance Details",
"name": "referenceSubstance",
"type": [
"null",
{
"fields": [
{
"name": "substanceGlobalID",
"type": "string"
},
{
"name": "stage",
"type": "string"
},
{
"name": "origin",
"type": "string"
},
{
"name": "inNNCD",
"type": "string"
},
{
"name": "nnSubstanceID",
"type": "string"
},
{
"name": "nnSubstanceName",
"type": "string"
},
{
"name": "inn",
"type": [
"null",
"string"
]
},
{
"name": "analyteNumber",
"type": [
"null",
"string"
]
},
{
"name": "developmentSubstanceEVCode",
"type": [
"null",
"string"
]
},
{
"name": "innSubstanceEVCode",
"type": [
"null",
"string"
]
}
],
"name": "referenceSubstance",
"type": "record"
}
]
}
],
"name": "substance",
"type": "record"
}
}
],
"name": "substanceMessage",
"namespace": "com.novonordisk.prodex.operations.substance",
"type": "record"
}
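The schema carries createDateTime and latestUpdateDateTime as timestamp-millis values (a long counting milliseconds since the Unix epoch, UTC), whereas the REST sample responses in this document show the same fields as ISO 8601 strings. A minimal conversion sketch, with hypothetical helper names, could look like this:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def iso_to_millis(text):
    """Convert an ISO 8601 UTC string such as '...38.608Z' to epoch millis."""
    # Replace the trailing "Z" so older Python versions can parse the value
    parsed = datetime.fromisoformat(text.replace("Z", "+00:00"))
    return (parsed - EPOCH) // timedelta(milliseconds=1)


def millis_to_iso(millis):
    """Convert epoch millis back to an ISO 8601 string with a 'Z' suffix."""
    moment = EPOCH + timedelta(milliseconds=millis)
    return moment.isoformat(timespec="milliseconds").replace("+00:00", "Z")


created = iso_to_millis("2023-12-11T08:45:38.608Z")
round_trip = millis_to_iso(created)
```

Using integer arithmetic on timedelta values avoids the rounding issues that floating-point timestamps can introduce at millisecond precision.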
API Specification
Your API specification (typically defined in OpenAPI/Swagger) describes how clients can retrieve or manipulate this master data.
This covers:
- The API endpoint URL, e.g. GET /v1/substance/{id}
- The path parameters, e.g. id.
- The query parameters, if applicable.
- The API response format, which often uses JSON Schema to describe the payload. This is analogous to the Avro schema, but tailored for a REST API context.
Example (OpenAPI)
openapi: 3.0.0
info:
  title: Substance API
  version: "1.0"
paths:
  /v1/substance/{id}:
    get:
      summary: Retrieve a Substance by its ID
      parameters:
        - name: id
          in: path
          required: true
          schema:
            type: string
      responses:
        "200":
          description: Successful response
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/Substance"
components:
  schemas:
    Substance:
      type: object
      properties:
        Substance Global-Id:
          type: string
        NN Substance ID:
          type: string
        Origin:
          type: string
        In NNC:
          type: string
Sample Response
Header
client_id
client_secret
Body
[
{
"substanceGlobalID": "SUB-0000687",
"NNSubstanceId": "NNC0098-0000-1179",
"origin": "External",
"inNNCD": "Yes",
"stage": "Released",
"nnSubstanceName": "NNC0098-1179",
"inn": "IN1179",
"analyteNumber": "AN000079",
"developmentSubstanceEVCode": "SUB001179",
"innSubstanceEVCode": "SUB210224",
"createDateTime": "2023-12-11T08:45:38.608Z",
"latestUpdateDateTime": "2024-02-21T10:07:27.488Z",
"referenceSubstance": [
{
"substanceGlobalID": "SUB-0000684",
"nnSubstanceID": "NNC0098-0000-1174",
"origin": "Internal",
"inNNCD": "Yes",
"stage": "Released",
"nnSubstanceName": "NNC0098-1174",
"inn": null,
"analyteNumber": "AN000074",
"developmentSubstanceEVCode": "SUB000174",
"innSubstanceEVCode": "SUB000174"
}
]
}
]
Sample Response - 400
{
"success": false,
"apiName": "mdm-prodex-substance-prc",
"version": "v1",
"correlationId": "961101b4-d8c6-453e-81aa-0c0b47e7011e",
"timestamp": "2025-05-08T09:25:31.686339702Z",
"errorDetails": "The searched field has no data in MDM"
}
Sample Response - 401
{
"success": false,
"apiName": null,
"version": null,
"correlationId": "8aa2cb70-d7d6-11eb-b4d1-02018dd03ba2",
"timestamp": "2021-06-28T02:09:22.436-04:00",
"errorDetails": [
{
"code": 401,
"reason": "HTTP:UNAUTHORIZED",
"message": "HTTP POST on resource 'https://..........' failed: unauthorized (401)"
}
]
}
Sample Response - 422
{
"success": false,
"apiName": "mdm-prodex-substance-prc",
"version": "v1",
"correlationId": "ee006b70-c6b6-4981-b0a5-3ef3166490aa",
"timestamp": "2025-05-08T09:26:20.83062079Z",
"errorDetails": "The Search Criteria is incorrect"
}
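A consumer calling this API might prepare the request and normalise error payloads roughly as follows. This is a sketch under stated assumptions: the base URL is a placeholder, the helper names are hypothetical, and the only details taken from the samples above are the endpoint path, the client_id/client_secret headers, and the fact that errorDetails arrives either as a string or as a list of objects.

```python
import json
import urllib.parse
import urllib.request


def build_substance_request(base_url, substance_id, client_id, client_secret):
    """Prepare (but do not send) GET /v1/substance/{id} with auth headers."""
    path = "/v1/substance/" + urllib.parse.quote(substance_id, safe="")
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        headers={"client_id": client_id, "client_secret": client_secret},
        method="GET",
    )


def error_messages(body):
    """Flatten errorDetails, which is a string in some samples, a list in others."""
    details = json.loads(body).get("errorDetails")
    if isinstance(details, str):
        return [details]
    return [f"{d.get('code')} {d.get('reason')}: {d.get('message')}"
            for d in details or []]


# Placeholder host; pass the prepared request to urllib.request.urlopen to send it
request = build_substance_request(
    "https://api.example.com", "SUB-0000687", "my-id", "my-secret")
```

Normalising both error shapes in one place keeps the calling code independent of which layer (the API itself or the gateway in front of it) produced the error.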