Guide · 15 min read

Synthesize Data from Your CSV

Upload a real dataset and let RadMah AI train a generative model on your data. Generate thousands of high-fidelity synthetic rows that preserve the statistical properties of your original — with a cryptographic evidence bundle proving it.

What you'll build

By the end of this guide you will have uploaded a CSV dataset, trained a generative model on it, synthesized 10,000 new rows that match the original distributions and correlations, and received a sealed evidence bundle with utility metrics proving fidelity.

Prerequisites

| Requirement | Details |
| --- | --- |
| RadMah AI account | A funded account with GPU credits. Sign up here. |
| API key | Create one in Settings → API Keys. See the Authentication guide. |
| CSV dataset | A well-formed CSV with headers. Minimum 500 rows recommended for quality results. Maximum 50 MB per upload. |
| Language runtime | Python 3.10+ (or any HTTP client for the REST API) |

How synthesis differs from mock generation

| Aspect | Mock generation | Synthesis |
| --- | --- | --- |
| Input | Plain English description | Your real CSV dataset |
| Model | Rule-based fabrication | Generative model trained on your data |
| Fidelity | Realistic but generic | Preserves distributions, correlations, and statistical structure |
| Time | Seconds | 5–30 minutes (training + generation) |
| Use case | Prototypes, testing, demos | Privacy-safe sharing, ML augmentation, analytics |
Step 1: Install the SDK

Choose your language and install the RadMah AI SDK.

Install
pip install radmah-sdk
Step 2: Authenticate

Initialize the client with your API key. All requests are authenticated via the X-API-Key header.

Initialize client
from radmah_sdk import RadMahClient

client = RadMahClient(
    api_key="sl_live_your_key_here",
    base_url="https://api.radmah.ai"  # optional
)

Keep your API key secret

Never commit API keys to source control. Use environment variables or a secrets manager in production.
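
A common pattern is to read the key from an environment variable at startup so it never appears in your code or version history. A minimal sketch — the `RADMAH_API_KEY` variable name is our own convention for this example, not something the SDK mandates:

```python
import os

# Hypothetical convention: keep the key in a RADMAH_API_KEY environment
# variable and fail loudly if it is missing, rather than hardcoding it.
api_key = os.environ.get("RADMAH_API_KEY", "")
if not api_key:
    print("RADMAH_API_KEY is not set; refusing to start")

# client = RadMahClient(api_key=api_key)  # then construct the client as above
```

In production, a secrets manager that injects the variable at deploy time serves the same purpose.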

Step 3: Upload your CSV dataset

Upload the CSV file you want to synthesize from. The platform stores it securely and uses it as training data for the generative model.

Upload dataset
# Upload your CSV — returns a file reference
upload = client.upload_file("customers_2024.csv")
file_id = upload.id
print(f"Uploaded: {file_id} ({upload.row_count} rows, {upload.column_count} columns)")

Supported formats

CSV files with a header row. UTF-8 encoding. Columns can contain numeric, categorical, datetime, and text data. The platform automatically detects column types during upload.
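
Before uploading, you can sanity-check the file locally so a malformed CSV fails fast on your machine rather than mid-job. A sketch using only the standard library — the 500-row recommendation and 50 MB cap come from the prerequisites table above; the function itself is our own helper, not part of the SDK:

```python
import csv
import os

def validate_csv(path, min_rows=500, max_bytes=50 * 1024 * 1024):
    """Check file size, header presence, and row count before upload."""
    if os.path.getsize(path) > max_bytes:
        raise ValueError("file exceeds the 50 MB upload limit")
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if not header or any(not col.strip() for col in header):
            raise ValueError("missing or empty header row")
        n_rows = sum(1 for _ in reader)
    if n_rows < min_rows:
        print(f"warning: only {n_rows} rows; 500+ recommended for quality")
    return header, n_rows
```

Run it on your file first, then call `client.upload_file(...)` only if it passes.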

Step 4: Create a session and request synthesis

Open a chat session and tell the platform what you want. Reference the uploaded file and specify how many synthetic rows to generate. The AI orchestrator builds the full execution plan automatically.

Request synthesis
# Create a chat session
session = client.create_chat_session()

# Request synthesis from your uploaded dataset
response = client.send_chat_message(
    session.id,
    f"Synthesize 10,000 rows from my uploaded dataset {file_id}. "
    "Preserve all column distributions and correlations."
)

# The orchestrator creates an agent project
project_id = response["results"][0]["data"]["project_id"]
print(f"Project created: {project_id}")

Step 5: Review and approve the execution plan

Before training begins, RadMah AI shows you the full execution plan including estimated credits, GPU tier, training duration, and generation parameters. Nothing runs without your explicit approval.

Approve execution plan
import time, json

# Wait for the plan to be ready
while True:
    project = client.get_agent_project(project_id)
    if project.status != "planning":
        break
    time.sleep(2)

# Inspect the plan — synthesis includes train + generate + verify
plan = json.loads(project.plan) if isinstance(project.plan, str) else project.plan
for step in plan:
    print(f"  Step {step['step_index']}: {step['tool_name']}")

# Review cost and GPU estimate
print(f"Estimated cost: {project.cost_summary}")
print(f"GPU tier: {project.gpu_tier}")

# Approve — this starts training
if project.status == "awaiting_approval":
    client.approve_agent_project(project_id)
    print("Plan approved — training starting")

Execution Pipeline

train_model → synthesize_data → verify

The orchestrator runs this pipeline automatically. The platform trains a generative model on your data, generates the requested number of synthetic rows, then verifies statistical fidelity and seals the evidence bundle.

Training time

Training typically takes 5 to 30 minutes depending on dataset size, column count, and the GPU tier allocated. Larger datasets with complex correlations take longer but produce higher-fidelity output. You can close your browser and check back later — the platform sends a notification when the job completes.

Step 6: Wait for training and generation to complete

Poll the project status until it reaches complete. The status will progress through training, generating, and verifying before completing.

Poll for completion
# Poll until complete — synthesis takes longer than mock generation
for i in range(600):
    project = client.get_agent_project(project_id)
    steps = " | ".join(
        f"{s.tool_name}={s.status}" for s in (project.steps or [])
    )
    if i % 10 == 0:
        print(f"  [{project.status}] {steps}")

    if project.status in ("complete", "failed", "blocked"):
        break
    time.sleep(5)

print(f"Final status: {project.status}")

Step 7: Download your synthetic data and evidence

Once complete, download the synthetic CSV and the full evidence bundle. The evidence includes utility metrics that show how closely the synthetic data matches your original distributions and correlations.

Download artifacts
import os

os.makedirs("output", exist_ok=True)

for step in project.steps or []:
    job_id = getattr(step, "job_run_id", None)
    if not job_id:
        continue

    # List artifacts for this step
    artifacts = client.list_artifacts(str(job_id))
    for artifact in artifacts:
        data = client.download_artifact(str(job_id), str(artifact.id))
        filename = f"output/{step.tool_name}_{artifact.name}"
        with open(filename, "wb" if isinstance(data, bytes) else "w") as f:
            f.write(data)
        print(f"Saved: {filename}")

    # Download evidence bundle
    evidence = client.get_evidence_data(str(job_id))
    print(f"Evidence: {evidence.get('seal_status')}")
    print(f"Artifacts: {len(evidence.get('artifact_index', {}).get('artifacts', []))}")

Interpreting Utility Metrics

The evidence bundle includes a utility metrics report that quantifies how closely the synthetic data matches your original. Key metrics to look for:

| Metric | What it measures | Good range |
| --- | --- | --- |
| Column distribution similarity | Per-column statistical distance between original and synthetic | > 0.90 |
| Correlation preservation | How well inter-column relationships are maintained | > 0.85 |
| ML utility score | Train-on-synthetic, test-on-real classifier accuracy ratio | > 0.85 |
| Privacy distance | Nearest-neighbor distance between synthetic and real records | > threshold |
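
The exact formulas behind the report are not documented here, but you can reproduce a comparable per-column check yourself. A sketch of one common similarity measure, 1 minus the total variation distance between two categorical columns — this is our own metric choice for illustration, not necessarily the one the platform computes:

```python
from collections import Counter

def distribution_similarity(original, synthetic):
    """1 minus the total variation distance between two categorical columns.

    Returns 1.0 for identical distributions and 0.0 for disjoint ones.
    """
    p, q = Counter(original), Counter(synthetic)
    n_p, n_q = len(original), len(synthetic)
    categories = set(p) | set(q)
    tvd = 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in categories)
    return 1.0 - tvd

# Identical category frequencies score a perfect 1.0
print(distribution_similarity(["a", "b", "b"], ["b", "a", "b"]))  # → 1.0
```

Running this over each column of the original and synthetic CSVs gives you an independent check against the report's per-column scores.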

Higher fidelity from more data

The generative model learns better with more training data. Datasets with 5,000+ rows typically achieve column distribution similarity above 0.95. If your scores are lower than expected, try increasing the original dataset size or reducing the number of requested synthetic rows.

Evidence Bundle

Every synthesis job produces a signed evidence bundle. A cryptographic seal binds every artifact together — tampering with any one invalidates the entire bundle.

| # | Artifact | Purpose |
| --- | --- | --- |
| 1 | Sealed Contract | Machine-readable generation specification |
| 2 | Run Manifest | Execution telemetry — rows, timing, entity counts |
| 3 | Constraint Report | Hard/soft violation counts and details |
| 4 | Determinism Proof | Cryptographic hash proving reproducibility (same seed = same output) |
| 5 | Privacy Report | Differential privacy and re-identification risk analysis |
| 6 | Utility Metrics | Statistical fidelity — distribution accuracy, correlation preservation |
| 7 | Artifact Manifest | Index of every artifact with cryptographic hashes |
| 8 | Timing Telemetry | Per-step execution timing and resource usage |
| 9 | Evidence Seal | Cryptographic seal binding every artifact |
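
Because the artifact manifest indexes every file with a cryptographic hash, you can re-verify downloaded artifacts offline. A sketch assuming SHA-256 hex digests and a manifest shaped like `{"artifacts": [{"name": ..., "sha256": ...}]}` — the real bundle's schema may differ, so treat the field names here as placeholders:

```python
import hashlib
import json
import os

def verify_artifacts(manifest_path, artifact_dir):
    """Recompute each artifact's SHA-256 and compare it to the manifest.

    Assumes a manifest of the form {"artifacts": [{"name": ..., "sha256": ...}]};
    the actual evidence-bundle schema may differ.
    """
    with open(manifest_path) as f:
        manifest = json.load(f)
    results = {}
    for entry in manifest["artifacts"]:
        path = os.path.join(artifact_dir, entry["name"])
        with open(path, "rb") as artifact:
            digest = hashlib.sha256(artifact.read()).hexdigest()
        results[entry["name"]] = digest == entry["sha256"]
    return results
```

Any `False` in the result means an artifact no longer matches what the platform sealed, which is exactly the tamper-evidence the bundle is designed to provide.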

What you get

Synthetic CSV

High-fidelity rows that preserve the statistical structure of your original data. Safe to share, publish, or use for ML training.

Evidence Bundle

Signed evidence bundle proving integrity, determinism, privacy, and fidelity. Audit-ready.

Utility Report

Quantitative comparison of original vs. synthetic distributions, correlations, and ML utility scores.

REST API (cURL)

Prefer raw HTTP? Every SDK method maps directly to a REST endpoint.

REST API
# 1. Upload your CSV
curl -X POST https://api.radmah.ai/v1/client/files \
  -H "X-API-Key: sl_live_your_key_here" \
  -F "file=@customers_2024.csv"

# 2. Create a chat session
curl -X POST https://api.radmah.ai/v1/client/chat/sessions \
  -H "X-API-Key: sl_live_your_key_here" \
  -H "Content-Type: application/json"

# 3. Request synthesis
curl -X POST https://api.radmah.ai/v1/client/chat/sessions/{session_id}/messages \
  -H "X-API-Key: sl_live_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{"content": "Synthesize 10,000 rows from my uploaded dataset {file_id}"}'

# 4. Approve the plan
curl -X POST https://api.radmah.ai/v1/client/agent/projects/{project_id}/approve \
  -H "X-API-Key: sl_live_your_key_here"

# 5. Poll status (repeat until status is "complete")
curl https://api.radmah.ai/v1/client/agent/projects/{project_id} \
  -H "X-API-Key: sl_live_your_key_here"

# 6. Download artifacts
curl https://api.radmah.ai/v1/client/jobs/{job_id}/artifacts \
  -H "X-API-Key: sl_live_your_key_here"

Next steps