Nvidia open sources the synthetic data framework used to build Nemotron datasets
alexwatson405 Wednesday, December 03, 2025
NVIDIA just open sourced NeMo Data Designer, the synthetic data framework used internally to build both pre-training and post-training datasets for Nemotron.
It lets you define an entire synthetic data pipeline directly in Python: structured outputs, statistical samplers, LLM-generated columns, dependency-aware field relationships, Python/SQL/remote validators, and optional LLM-as-judge scoring. It also supports a quick preview mode for fast iteration before scaling up.
Install:
```
pip install data-designer
```
A minimal example:
```
from data_designer.essentials import *

data_designer = DataDesigner()
config = DataDesignerConfigBuilder()

# Sampler column: product_category is drawn from a fixed list of values.
config.add_column(
    SamplerColumnConfig(
        name="product_category",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=["Electronics", "Clothing", "Home & Kitchen", "Books"]
        ),
    )
)

# LLM-generated column: the prompt references the product_category column.
config.add_column(
    LLMTextColumnConfig(
        name="review",
        model_alias="nvidia-text",
        prompt="Write a short product review for a {{ product_category }} item.",
    )
)

# Generate a quick preview and show one sample record.
preview = data_designer.preview(config_builder=config)
preview.display_sample_record()
```
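To make the "dependency-aware" part concrete: the `review` prompt refers to `product_category`, so that column has to be generated first. The snippet below is a rough, library-independent sketch of that idea using only the Python standard library. It is not how Data Designer is implemented and it does not use its API. The helper names (`referenced_columns`, `generation_order`, `generate_record`) and the dict-based column spec are made up for illustration, and the LLM call is replaced by plain string substitution.

```python
import random
import re
from graphlib import TopologicalSorter  # Python 3.9+

# Matches template placeholders such as {{ product_category }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def referenced_columns(template: str) -> set[str]:
    """Return the column names a template string refers to."""
    return set(PLACEHOLDER.findall(template))


def generation_order(columns: dict) -> list[str]:
    """Order columns so each comes after the columns it references."""
    # Hypothetical spec: a str is a template column, a list is a sampler column.
    graph = {
        name: referenced_columns(spec) if isinstance(spec, str) else set()
        for name, spec in columns.items()
    }
    return list(TopologicalSorter(graph).static_order())


def generate_record(columns: dict, rng: random.Random) -> dict:
    """Generate one record, filling columns in dependency order."""
    record = {}
    for name in generation_order(columns):
        spec = columns[name]
        if isinstance(spec, str):
            # Template column: substitute values generated earlier.
            record[name] = PLACEHOLDER.sub(
                lambda m: str(record[m.group(1)]), spec
            )
        else:
            # Sampler column: draw uniformly from the listed values.
            record[name] = rng.choice(spec)
    return record


# The template column is declared first on purpose; ordering still works.
columns = {
    "review_prompt": "Write a short product review for a {{ product_category }} item.",
    "product_category": ["Electronics", "Clothing", "Home & Kitchen", "Books"],
}
record = generate_record(columns, random.Random(0))
print(record["review_prompt"])
```

The declaration order of the columns does not matter here because the topological sort places `product_category` ahead of anything that references it. A circular reference would raise `graphlib.CycleError`.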
This release also incorporates the synthetic data tech my team originally built at Gretel (now part of NVIDIA), which is now generally available for anyone to use or extend.
Repo: https://github.com/NVIDIA-NeMo/DataDesigner