Guide · 15 min read

Synthesize Data from Your CSV

Upload a real dataset and let RadMah AI train a generative model on your data. Generate thousands of high-fidelity synthetic rows that preserve the statistical properties of your original — with a cryptographic evidence bundle proving it.

What you'll build

By the end of this guide you will have uploaded a CSV dataset, trained a generative model on it, synthesized 10,000 new rows that match the original distributions and correlations, and received a sealed evidence bundle with utility metrics proving fidelity.

Prerequisites

| Requirement | Details |
| --- | --- |
| RadMah AI account | A funded account with GPU credits. Sign up here. |
| API key | Create one in Settings → API Keys. See the Authentication guide. |
| CSV dataset | A well-formed CSV with headers. Minimum 500 rows recommended for quality results. Maximum 50 MB per upload. |
| Language runtime | Python 3.10+ (or any HTTP client for the REST API) |

How synthesis differs from mock generation

| Aspect | Mock generation | Synthesis |
| --- | --- | --- |
| Input | Plain English description | Your real CSV dataset |
| Model | Rule-based fabrication | Generative model trained on your data |
| Fidelity | Realistic but generic | Preserves distributions, correlations, and statistical structure |
| Time | Seconds | 5–30 minutes (training + generation) |
| Use case | Prototypes, testing, demos | Privacy-safe sharing, ML augmentation, analytics |
Step 1: Install the SDK

Choose your language and install the RadMah AI SDK.

Install
pip install radmah-sdk
Step 2: Authenticate

Initialize the client with your API key. All requests are authenticated via the X-API-Key header.

Initialize client
from radmah_sdk import RadMahClient

client = RadMahClient(
    api_key="sl_live_your_key_here",
    base_url="https://api.radmah.ai"  # optional
)

Keep your API key secret

Never commit API keys to source control. Use environment variables or a secrets manager in production.
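
A common pattern is to read the key from an environment variable at startup so it never appears in your code or version history. A minimal sketch — the `RADMAH_API_KEY` variable name is our own convention for this example, not something the SDK mandates:

```python
import os

# Hypothetical convention: keep the key in a RADMAH_API_KEY environment
# variable and fail loudly if it is missing, rather than hardcoding it.
api_key = os.environ.get("RADMAH_API_KEY", "")
if not api_key:
    print("RADMAH_API_KEY is not set; refusing to start")

# client = RadMahClient(api_key=api_key)  # then construct the client as above
```

In production, a secrets manager that injects the variable at deploy time serves the same purpose.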

Step 3: Upload your CSV dataset

Upload the CSV file you want to synthesize from. The platform stores it securely and uses it as training data for the generative model.

Upload dataset
# Upload your CSV — returns a file reference
upload = client.upload_file("customers_2024.csv")
file_id = upload.id
print(f"Uploaded: {file_id} ({upload.row_count} rows, {upload.column_count} columns)")

Supported formats

CSV files with a header row. UTF-8 encoding. Columns can contain numeric, categorical, datetime, and text data. The platform automatically detects column types during upload.
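
Before uploading, you can sanity-check the file locally so a malformed CSV fails fast on your machine rather than mid-job. A sketch using only the standard library — the 500-row recommendation and 50 MB cap come from the prerequisites table above; the function itself is our own helper, not part of the SDK:

```python
import csv
import os

def validate_csv(path, min_rows=500, max_bytes=50 * 1024 * 1024):
    """Check file size, header presence, and row count before upload."""
    if os.path.getsize(path) > max_bytes:
        raise ValueError("file exceeds the 50 MB upload limit")
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if not header or any(not col.strip() for col in header):
            raise ValueError("missing or empty header row")
        n_rows = sum(1 for _ in reader)
    if n_rows < min_rows:
        print(f"warning: only {n_rows} rows; 500+ recommended for quality")
    return header, n_rows
```

Run it on your file first, then call `client.upload_file(...)` only if it passes.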

Step 4: Create a session and request synthesis

Open a chat session and tell the platform what you want. Reference the uploaded file and specify how many synthetic rows to generate. The AI orchestrator builds the full execution plan automatically.

Request synthesis
# Create a chat session
session = client.create_chat_session()

# Request synthesis from your uploaded dataset
response = client.send_chat_message(
    session.id,
    f"Synthesize 10,000 rows from my uploaded dataset {file_id}. "
    "Preserve all column distributions and correlations."
)

# The orchestrator creates an agent project
project_id = response["results"][0]["data"]["project_id"]
print(f"Project created: {project_id}")

Step 5: Review and approve the execution plan

Before training begins, RadMah AI shows you the full execution plan including estimated credits, GPU tier, training duration, and generation parameters. Nothing runs without your explicit approval.

Approve execution plan
import time, json

# Wait for the plan to be ready
while True:
    project = client.get_agent_project(project_id)
    if project.status != "planning":
        break
    time.sleep(2)

# Inspect the plan — synthesis includes train + generate + verify
plan = json.loads(project.plan) if isinstance(project.plan, str) else project.plan
for step in plan:
    print(f"  Step {step['step_index']}: {step['tool_name']}")

# Review cost and GPU estimate
print(f"Estimated cost: {project.cost_summary}")
print(f"GPU tier: {project.gpu_tier}")

# Approve — this starts training
if project.status == "awaiting_approval":
    client.approve_agent_project(project_id)
    print("Plan approved — training starting")

Execution Pipeline

train_model → synthesize_data → verify

The orchestrator runs this pipeline automatically. The platform trains a generative model on your data, generates the requested number of synthetic rows, then verifies statistical fidelity and seals the evidence bundle.

Training time

Training typically takes 5 to 30 minutes depending on dataset size, column count, and the GPU tier allocated. Larger datasets with complex correlations take longer but produce higher-fidelity output. You can close your browser and check back later — the platform sends a notification when the job completes.

Step 6: Wait for training and generation to complete

Poll the project status until it reaches complete. The status will progress through training, generating, and verifying before completing.

Poll for completion
# Poll until complete — synthesis takes longer than mock generation
for i in range(600):
    project = client.get_agent_project(project_id)
    steps = " | ".join(
        f"{s.tool_name}={s.status}" for s in (project.steps or [])
    )
    if i % 10 == 0:
        print(f"  [{project.status}] {steps}")

    if project.status in ("complete", "failed", "blocked"):
        break
    time.sleep(5)

print(f"Final status: {project.status}")

Step 7: Download your synthetic data and evidence

Once complete, download the synthetic CSV and the full evidence bundle. The evidence includes utility metrics that show how closely the synthetic data matches your original distributions and correlations.

Download artifacts
import os

os.makedirs("output", exist_ok=True)

for step in project.steps or []:
    job_id = getattr(step, "job_run_id", None)
    if not job_id:
        continue

    # List artifacts for this step
    artifacts = client.list_artifacts(str(job_id))
    for artifact in artifacts:
        data = client.download_artifact(str(job_id), str(artifact.id))
        filename = f"output/{step.tool_name}_{artifact.name}"
        with open(filename, "wb" if isinstance(data, bytes) else "w") as f:
            f.write(data)
        print(f"Saved: {filename}")

    # Download evidence bundle
    evidence = client.get_evidence_data(str(job_id))
    print(f"Evidence: {evidence.get('seal_status')}")
    print(f"Artifacts: {len(evidence.get('artifact_index', {}).get('artifacts', []))}")

Interpreting Utility Metrics

The evidence bundle includes a utility metrics report that quantifies how closely the synthetic data matches your original. Key metrics to look for:

| Metric | What it measures | Good range |
| --- | --- | --- |
| Column distribution similarity | Per-column statistical distance between original and synthetic | > 0.90 |
| Correlation preservation | How well inter-column relationships are maintained | > 0.85 |
| ML utility score | Train-on-synthetic, test-on-real classifier accuracy ratio | > 0.85 |
| Privacy distance | Nearest-neighbor distance between synthetic and real records | > threshold |
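
The exact formulas behind the report are not documented here, but you can reproduce a comparable per-column check yourself. A sketch of one common similarity measure, 1 minus the total variation distance between two categorical columns — this is our own metric choice for illustration, not necessarily the one the platform computes:

```python
from collections import Counter

def distribution_similarity(original, synthetic):
    """1 minus the total variation distance between two categorical columns.

    Returns 1.0 for identical distributions and 0.0 for disjoint ones.
    """
    p, q = Counter(original), Counter(synthetic)
    n_p, n_q = len(original), len(synthetic)
    categories = set(p) | set(q)
    tvd = 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in categories)
    return 1.0 - tvd

# Identical category frequencies score a perfect 1.0
print(distribution_similarity(["a", "b", "b"], ["b", "a", "b"]))  # → 1.0
```

Running this over each column of the original and synthetic CSVs gives you an independent check against the report's per-column scores.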

Higher fidelity from more data

The generative model learns better with more training data. Datasets with 5,000+ rows typically achieve column distribution similarity above 0.95. If your scores are lower than expected, try increasing the original dataset size or reducing the number of requested synthetic rows.

Evidence Bundle

Every synthesis job produces a signed evidence bundle. A cryptographic seal binds every artifact together — tampering with any one invalidates the entire bundle.

| # | Artifact | Purpose |
| --- | --- | --- |
| 1 | Sealed Contract | Machine-readable generation specification |
| 2 | Run Manifest | Execution telemetry — rows, timing, entity counts |
| 3 | Constraint Report | Hard/soft violation counts and details |
| 4 | Determinism Proof | Cryptographic hash proving reproducibility (same seed = same output) |
| 5 | Privacy Report | Differential privacy and re-identification risk analysis |
| 6 | Utility Metrics | Statistical fidelity — distribution accuracy, correlation preservation |
| 7 | Artifact Manifest | Index of every artifact with cryptographic hashes |
| 8 | Timing Telemetry | Per-step execution timing and resource usage |
| 9 | Evidence Seal | Cryptographic seal binding every artifact |
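
Because the artifact manifest indexes every file with a cryptographic hash, you can re-verify downloaded artifacts offline. A sketch assuming SHA-256 hex digests and a manifest shaped like `{"artifacts": [{"name": ..., "sha256": ...}]}` — the real bundle's schema may differ, so treat the field names here as placeholders:

```python
import hashlib
import json
import os

def verify_artifacts(manifest_path, artifact_dir):
    """Recompute each artifact's SHA-256 and compare it to the manifest.

    Assumes a manifest of the form {"artifacts": [{"name": ..., "sha256": ...}]};
    the actual evidence-bundle schema may differ.
    """
    with open(manifest_path) as f:
        manifest = json.load(f)
    results = {}
    for entry in manifest["artifacts"]:
        path = os.path.join(artifact_dir, entry["name"])
        with open(path, "rb") as artifact:
            digest = hashlib.sha256(artifact.read()).hexdigest()
        results[entry["name"]] = digest == entry["sha256"]
    return results
```

Any `False` in the result means an artifact no longer matches what the platform sealed, which is exactly the tamper-evidence the bundle is designed to provide.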

What you get

Synthetic CSV

High-fidelity rows that preserve the statistical structure of your original data. Safe to share, publish, or use for ML training.

Evidence Bundle

Signed evidence bundle proving integrity, determinism, privacy, and fidelity. Audit-ready.

Utility Report

Quantitative comparison of original vs. synthetic distributions, correlations, and ML utility scores.

REST API (cURL)

Prefer raw HTTP? Every SDK method maps directly to a REST endpoint.

REST API
# 1. Upload your CSV
curl -X POST https://api.radmah.ai/v1/client/files \
  -H "X-API-Key: sl_live_your_key_here" \
  -F "file=@customers_2024.csv"

# 2. Create a chat session
curl -X POST https://api.radmah.ai/v1/client/chat/sessions \
  -H "X-API-Key: sl_live_your_key_here" \
  -H "Content-Type: application/json"

# 3. Request synthesis
curl -X POST https://api.radmah.ai/v1/client/chat/sessions/{session_id}/messages \
  -H "X-API-Key: sl_live_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{"content": "Synthesize 10,000 rows from my uploaded dataset {file_id}"}'

# 4. Approve the plan
curl -X POST https://api.radmah.ai/v1/client/agent/projects/{project_id}/approve \
  -H "X-API-Key: sl_live_your_key_here"

# 5. Poll status (repeat until status is "complete")
curl https://api.radmah.ai/v1/client/agent/projects/{project_id} \
  -H "X-API-Key: sl_live_your_key_here"

# 6. Download artifacts
curl https://api.radmah.ai/v1/client/jobs/{job_id}/artifacts \
  -H "X-API-Key: sl_live_your_key_here"

Next steps