Synthesize Data from Your CSV
Upload a real dataset and let RadMah AI train a generative model on your data. Generate thousands of high-fidelity synthetic rows that preserve the statistical properties of your original — with a cryptographic evidence bundle proving it.
✦What you'll build
By the end of this guide you will have uploaded a CSV dataset, trained a generative model on it, synthesized 10,000 new rows that match the original distributions and correlations, and received a sealed evidence bundle with utility metrics proving fidelity.
Prerequisites
| Requirement | Details |
|---|---|
| RadMah AI account | A funded account with GPU credits. Sign up here. |
| API key | Create one in Settings → API Keys. See Authentication guide. |
| CSV dataset | A well-formed CSV with headers. Minimum 500 rows recommended for quality results. Maximum 50 MB per upload. |
| Language runtime | Python 3.10+ (or any HTTP client for the REST API) |
How synthesis differs from mock generation
| Aspect | Mock generation | Synthesis |
|---|---|---|
| Input | Plain English description | Your real CSV dataset |
| Model | Rule-based fabrication | Generative model trained on your data |
| Fidelity | Realistic but generic | Preserves distributions, correlations, and statistical structure |
| Time | Seconds | 5 – 30 minutes (training + generation) |
| Use case | Prototypes, testing, demos | Privacy-safe sharing, ML augmentation, analytics |
Install the SDK
Choose your language and install the RadMah AI SDK.
```bash
pip install radmah-sdk
```
Authenticate
Initialize the client with your API key. All requests are authenticated via the X-API-Key header.
```python
from radmah_sdk import RadMahClient

client = RadMahClient(
    api_key="sl_live_your_key_here",
    base_url="https://api.radmah.ai"  # optional
)
```
⚠Keep your API key secret
Never commit API keys to source control. Use environment variables or a secrets manager in production.
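One common pattern is to read the key from an environment variable at startup. A minimal sketch; the variable name `RADMAH_API_KEY` is illustrative, not mandated by the platform:

```python
import os

# Read the key from the environment instead of hard-coding it.
# RADMAH_API_KEY is an illustrative name, not required by the SDK.
api_key = os.environ.get("RADMAH_API_KEY", "sl_live_placeholder")

# Then pass it to the client as in the snippet above:
# client = RadMahClient(api_key=api_key)
```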
Upload your CSV dataset
Upload the CSV file you want to synthesize from. The platform stores it securely and uses it as training data for the generative model.
```python
# Upload your CSV — returns a file reference
upload = client.upload_file("customers_2024.csv")
file_id = upload.id
print(f"Uploaded: {file_id} ({upload.row_count} rows, {upload.column_count} columns)")
```
ℹSupported formats
CSV files with a header row. UTF-8 encoding. Columns can contain numeric, categorical, datetime, and text data. The platform automatically detects column types during upload.
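Before uploading, you can sanity-check the file against the limits in the prerequisites table (header row, 500+ rows recommended, 50 MB cap). A minimal stdlib sketch; the limits come from this guide, and nothing here suggests the SDK enforces them client-side:

```python
import csv
import os

MAX_BYTES = 50 * 1024 * 1024   # 50 MB upload cap from the prerequisites table
MIN_ROWS = 500                 # recommended minimum for quality results

def check_csv(path: str) -> dict:
    """Return basic stats and warnings for a CSV before upload."""
    size = os.path.getsize(path)
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        rows = sum(1 for _ in reader)
    return {
        "size_ok": size <= MAX_BYTES,
        "has_header": bool(header),
        "columns": len(header),
        "rows": rows,
        "enough_rows": rows >= MIN_ROWS,
    }
```

Running this before `client.upload_file` catches truncated or oversized files without spending an upload.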
Create a session and request synthesis
Open a chat session and tell the platform what you want. Reference the uploaded file and specify how many synthetic rows to generate. The AI orchestrator builds the full execution plan automatically.
```python
# Create a chat session
session = client.create_chat_session()

# Request synthesis from your uploaded dataset
response = client.send_chat_message(
    session.id,
    f"Synthesize 10,000 rows from my uploaded dataset {file_id}. "
    "Preserve all column distributions and correlations."
)

# The orchestrator creates an agent project
project_id = response["results"][0]["data"]["project_id"]
print(f"Project created: {project_id}")
```
Review and approve the execution plan
Before training begins, RadMah AI shows you the full execution plan including estimated credits, GPU tier, training duration, and generation parameters. Nothing runs without your explicit approval.
```python
import time, json

# Wait for the plan to be ready
while True:
    project = client.get_agent_project(project_id)
    if project.status != "planning":
        break
    time.sleep(2)

# Inspect the plan — synthesis includes train + generate + verify
plan = json.loads(project.plan) if isinstance(project.plan, str) else project.plan
for step in plan:
    print(f"  Step {step['step_index']}: {step['tool_name']}")

# Review cost and GPU estimate
print(f"Estimated cost: {project.cost_summary}")
print(f"GPU tier: {project.gpu_tier}")

# Approve — this starts training
if project.status == "awaiting_approval":
    client.approve_agent_project(project_id)
    print("Plan approved — training starting")
```
Execution Pipeline
The orchestrator runs this pipeline automatically. The platform trains a generative model on your data, generates the requested number of synthetic rows, then verifies statistical fidelity and seals the evidence bundle.
ℹTraining time
Training typically takes 5 to 30 minutes depending on dataset size, column count, and the GPU tier allocated. Larger datasets with complex correlations take longer but produce higher-fidelity output. You can close your browser and check back later — the platform sends a notification when the job completes.
Wait for training and generation to complete
Poll the project status until it reaches `complete`. The status progresses through `training`, `generating`, and `verifying` before completing.
```python
# Poll until complete — synthesis takes longer than mock generation
for i in range(600):
    project = client.get_agent_project(project_id)
    steps = " | ".join(
        f"{s.tool_name}={s.status}" for s in (project.steps or [])
    )
    if i % 10 == 0:
        print(f"  [{project.status}] {steps}")
    if project.status in ("complete", "failed", "blocked"):
        break
    time.sleep(5)

print(f"Final status: {project.status}")
```
Download your synthetic data and evidence
Once complete, download the synthetic CSV and the full evidence bundle. The evidence includes utility metrics that show how closely the synthetic data matches your original distributions and correlations.
```python
import os

os.makedirs("output", exist_ok=True)

for step in project.steps or []:
    job_id = getattr(step, "job_run_id", None)
    if not job_id:
        continue

    # List artifacts for this step
    artifacts = client.list_artifacts(str(job_id))
    for artifact in artifacts:
        data = client.download_artifact(str(job_id), str(artifact.id))
        filename = f"output/{step.tool_name}_{artifact.name}"
        with open(filename, "wb" if isinstance(data, bytes) else "w") as f:
            f.write(data)
        print(f"Saved: {filename}")

    # Download evidence bundle
    evidence = client.get_evidence_data(str(job_id))
    print(f"Evidence: {evidence.get('seal_status')}")
    print(f"Artifacts: {len(evidence.get('artifact_index', {}).get('artifacts', []))}")
```
Interpreting Utility Metrics
The evidence bundle includes a utility metrics report that quantifies how closely the synthetic data matches your original. Key metrics to look for:
| Metric | What it measures | Good range |
|---|---|---|
| Column distribution similarity | Per-column statistical distance between original and synthetic | > 0.90 |
| Correlation preservation | How well inter-column relationships are maintained | > 0.85 |
| ML utility score | Train-on-synthetic, test-on-real classifier accuracy ratio | > 0.85 |
| Privacy distance | Nearest-neighbor distance between synthetic and real records | > threshold |
✦Higher fidelity from more data
The generative model learns better with more training data. Datasets with 5,000+ rows typically achieve column distribution similarity above 0.95. If your scores are lower than expected, try increasing the original dataset size or reducing the number of requested synthetic rows.
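The exact formulas behind the utility report are not specified in this guide, but the idea behind column distribution similarity can be illustrated with a simple measure such as one minus the total variation distance between two empirical distributions of a categorical column. This is an assumption for illustration only, not the platform's actual metric:

```python
from collections import Counter

def distribution_similarity(original, synthetic):
    """1 - total variation distance between two empirical distributions.

    Returns 1.0 for identical distributions, approaching 0.0 for
    disjoint ones. Illustrative only; the platform's utility report
    may use different measures.
    """
    p = Counter(original)
    q = Counter(synthetic)
    n, m = len(original), len(synthetic)
    keys = set(p) | set(q)
    tvd = 0.5 * sum(abs(p[k] / n - q[k] / m) for k in keys)
    return 1.0 - tvd

real = ["gold", "gold", "silver", "bronze"]
synth = ["gold", "silver", "gold", "bronze"]
print(distribution_similarity(real, synth))  # 1.0: same empirical distribution
```

A score near 1.0 means the synthetic column draws values in almost the same proportions as the original, which is the property the "Good range > 0.90" row in the table refers to.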
Evidence Bundle
Every synthesis job produces a signed evidence bundle. A cryptographic seal binds every artifact together — tampering with any one invalidates the entire bundle.
| # | Artifact | Purpose |
|---|---|---|
| 1 | Sealed Contract | Machine-readable generation specification |
| 2 | Run Manifest | Execution telemetry — rows, timing, entity counts |
| 3 | Constraint Report | Hard/soft violation counts and details |
| 4 | Determinism Proof | Cryptographic hash proving reproducibility (same seed = same output) |
| 5 | Privacy Report | Differential privacy and re-identification risk analysis |
| 6 | Utility Metrics | Statistical fidelity — distribution accuracy, correlation preservation |
| 7 | Artifact Manifest | Index of every artifact with cryptographic hashes |
| 8 | Timing Telemetry | Per-step execution timing and resource usage |
| 9 | Evidence Seal | Cryptographic seal binding every artifact |
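Because the Artifact Manifest indexes every artifact with cryptographic hashes, you can re-verify downloaded files yourself. A sketch assuming SHA-256 hex digests and a hypothetical manifest layout (`{"artifacts": [{"name": ..., "sha256": ...}]}`); consult the evidence bundle documentation for the real schema:

```python
import hashlib
import os

def verify_artifacts(manifest: dict, directory: str) -> dict:
    """Compare each file's SHA-256 digest against its manifest entry.

    The manifest layout here is a hypothetical example, not the
    documented bundle schema.
    """
    results = {}
    for entry in manifest.get("artifacts", []):
        path = os.path.join(directory, entry["name"])
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        results[entry["name"]] = (digest == entry["sha256"])
    return results
```

Any `False` entry means the file on disk no longer matches what the seal covers.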
What you get
Synthetic CSV
High-fidelity rows that preserve the statistical structure of your original data. Safe to share, publish, or use for ML training.
Evidence Bundle
Signed evidence bundle proving integrity, determinism, privacy, and fidelity. Audit-ready.
Utility Report
Quantitative comparison of original vs. synthetic distributions, correlations, and ML utility scores.
REST API (cURL)
Prefer raw HTTP? Every SDK method maps directly to a REST endpoint.
```bash
# 1. Upload your CSV
curl -X POST https://api.radmah.ai/v1/client/files \
  -H "X-API-Key: sl_live_your_key_here" \
  -F "file=@customers_2024.csv"

# 2. Create a chat session
curl -X POST https://api.radmah.ai/v1/client/chat/sessions \
  -H "X-API-Key: sl_live_your_key_here" \
  -H "Content-Type: application/json"

# 3. Request synthesis
curl -X POST https://api.radmah.ai/v1/client/chat/sessions/{session_id}/messages \
  -H "X-API-Key: sl_live_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{"content": "Synthesize 10,000 rows from my uploaded dataset {file_id}"}'

# 4. Approve the plan
curl -X POST https://api.radmah.ai/v1/client/agent/projects/{project_id}/approve \
  -H "X-API-Key: sl_live_your_key_here"

# 5. Poll status (repeat until status is "complete")
curl https://api.radmah.ai/v1/client/agent/projects/{project_id} \
  -H "X-API-Key: sl_live_your_key_here"

# 6. Download artifacts
curl https://api.radmah.ai/v1/client/jobs/{job_id}/artifacts \
  -H "X-API-Key: sl_live_your_key_here"
```
Next steps
- Generate data from a plain English description — no CSV required
- API keys, JWT tokens, and multi-factor authentication
- Deep dive into the signed evidence bundle
- Understand sealed contract — the typed specification behind every generation
- Complete endpoint documentation with request/response schemas