
Building a Mock Data Generator Powered by LLMs

Oct. 28, 2025

Lars Cornelissen
Data Science Engineer, Owner

Mock data generation has always been a necessary evil in data engineering. We need data that looks real, behaves statistically like production data, and is safe to share - but handcrafting that data or relying on static faker libraries rarely gives realistic results.

Recently, I built a system that uses Large Language Models (LLMs) to write the code that generates mock data for me. The result is a tool that analyzes a dataset, asks an LLM to produce Python code that mimics it, and then executes that code in a sandbox. In this post, I’ll walk you through exactly how that works — including the prompt design, architecture, and ready-to-run code snippets you can try yourself.

Why Use an LLM for Mock Data?

Traditional mock data tools are deterministic and template-based. They’re good at creating names, addresses, and emails, but not so good at matching real statistical patterns or relationships between columns.

An LLM, on the other hand, understands context. It can look at a dataset and realize, for instance, that “age correlates with income” or “these timestamps follow a daily pattern.” Instead of manually defining random distributions, we can let the model generate the Python code that simulates those patterns.

So the key idea is this:

Don’t ask the model for the data — ask it to write the Python that creates the data.

That small difference unlocks a scalable, privacy-safe, and adaptable mock data system.

How It Works

The generator follows a simple flow:

  1. Upload a dataset (CSV, for example).
  2. Analyze it — detect column types, numeric distributions, and category frequencies.
  3. Prompt the LLM to write a Python function main(source_data) that returns a DataFrame with the same structure.
  4. Execute the generated code inside a controlled namespace.
  5. Return the mock dataset and show the generated code.

This architecture lets the LLM focus on logic, not data storage. It never sees real sensitive values — just the analysis results.
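
As an illustration, the analysis step can be as small as the sketch below. The helper name analyze_dataframe and the exact summary fields are my own choices here, not a fixed API:

# analysis.py (illustrative)
import pandas as pd
import numpy as np

def analyze_dataframe(df: pd.DataFrame, top_n: int = 10) -> dict:
    """Summarize a DataFrame without exposing any raw rows."""
    summary = {}
    for col in df.columns:
        s = df[col].dropna()
        if np.issubdtype(s.dtype, np.number):
            summary[col] = {
                "type": "numeric",
                "mean": float(s.mean()),
                "std": float(s.std()),
                "min": float(s.min()),
                "max": float(s.max()),
            }
        else:
            freq = s.astype(str).value_counts(normalize=True).head(top_n)
            summary[col] = {
                "type": "categorical",
                "top_values": freq.round(3).to_dict(),
            }
    return summary

This summary dictionary is what later gets serialized into the prompt; the raw values never leave your machine.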

The Core Executor

At the heart of the system is the executor that runs the code returned by the LLM inside a controlled namespace.
It expects every snippet to define a function named main(source_data).

# executor.py
import pandas as pd

class CodeExecutionError(RuntimeError):
    pass

def run_mockdata_code(code: str, source_df: pd.DataFrame) -> pd.DataFrame:
    """
    Executes LLM-generated code that must define main(source_data)
    and returns its resulting DataFrame.
    """
    exec_globals = {}  # fresh namespace for every run
    try:
        exec(code, exec_globals)
        if "main" not in exec_globals or not callable(exec_globals["main"]):
            raise CodeExecutionError("No `main(source_data)` function defined.")
        result = exec_globals["main"](source_df)
        return pd.DataFrame(result)
    except CodeExecutionError:
        raise
    except Exception as e:
        raise CodeExecutionError(f"Execution failed: {e}") from e

This ensures that:

  • The code must define a single entry point, main(source_data).
  • Any errors are caught and re-raised as a single, predictable exception type.
  • No state persists across executions, because each run gets a fresh namespace.

The Prompt That Makes It All Work

Getting the prompt right was crucial.
The model needs to know what to do — and what not to do.

Here’s the exact prompt format used:

You are a Python data generator.
Output only valid Python code.

Define:
    def main(source_data):
        """
        - source_data is a pandas.DataFrame
        - Return a new DataFrame with the same columns, datatypes,
          and similar distributions and category frequencies.
        - Generate exactly {n_rows} rows.
        - Preserve numeric min/max bounds.
        - Use pandas and numpy only.
        """
Constraints:
- No print statements.
- No file I/O.
- No markdown fences, just code.

Dataset summary:
{analysis_result}

The {analysis_result} is filled in by the system — it includes column names, types, means, and frequencies. The LLM sees what the data looks like, not the data itself.
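
Filling in the template is plain string formatting. A minimal sketch, assuming the prompt text above lives in a file called prompt_template.txt (both the file name and the build_prompt helper are illustrative):

# prompting.py (illustrative)
import json

def build_prompt(analysis_result: dict, n_rows: int) -> str:
    """Fill the prompt template with the dataset summary and row count."""
    with open("prompt_template.txt", "r", encoding="utf-8") as f:
        template = f.read()
    return template.format(
        n_rows=n_rows,
        analysis_result=json.dumps(analysis_result, indent=2),
    )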

What the Model Returns

When the prompt runs, the LLM outputs pure code.
Here’s an example of a real output that works as-is:

import pandas as pd
import numpy as np

def main(source_data: pd.DataFrame) -> pd.DataFrame:
    rng = np.random.default_rng(42)
    n_rows = len(source_data)
    out = {}

    for col in source_data.columns:
        s = source_data[col].dropna()

        if np.issubdtype(s.dtype, np.number):
            mean, std = float(s.mean()), float(s.std() or 1)
            vals = rng.normal(mean, std, n_rows)
            vals = np.clip(vals, float(s.min()), float(s.max()))
            out[col] = np.round(vals, 2)

        elif s.dtype == bool:
            p_true = (s == True).mean() or 0.5
            out[col] = rng.choice([True, False], size=n_rows, p=[p_true, 1 - p_true])

        else:
            freq = s.astype(str).value_counts(normalize=True)
            cats, probs = freq.index.values, freq.values
            probs = probs / probs.sum()
            out[col] = rng.choice(cats, size=n_rows, p=probs)

    return pd.DataFrame(out)

That’s the magic — the LLM writes a fully working generator based on the data’s statistical description.
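
You can sanity-check that output end to end by feeding it straight into the executor from earlier. A quick sketch, where the toy DataFrame and the generated_code variable are only for illustration:

import pandas as pd
from executor import run_mockdata_code

# assume `generated_code` holds the LLM output shown above, as a string
source = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [32000.0, 41000.0, 58000.0, 61000.0],
    "city": ["Utrecht", "Amsterdam", "Utrecht", "Rotterdam"],
})

mock = run_mockdata_code(generated_code, source)
print(mock.dtypes)
print(mock.head())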

Retry Logic and Error Handling

Even the best models sometimes hallucinate code that doesn’t compile.
To make the system reliable, it retries up to five times before giving up:

MAX_ATTEMPTS = 5

def generate_with_retries(llm_func, source_df, analysis_result, n_rows, extra_description=""):
    attempt = 0
    last_error = None
    feedback = extra_description
    while attempt < MAX_ATTEMPTS:
        try:
            code = llm_func(analysis_result, n_rows, feedback)
            df = run_mockdata_code(code, source_df)
            return code, df
        except Exception as e:
            attempt += 1
            last_error = e
            # Feed the failure back so the next attempt gets more context
            feedback = f"{extra_description}\nThe previous attempt failed with: {e}"
    raise RuntimeError(f"Failed after {MAX_ATTEMPTS} attempts: {last_error}")

Each retry asks the model again with a bit more context, improving success rates dramatically.
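
For completeness, here is what a real llm_func could look like with the OpenAI Python client; any chat-completion provider works the same way, and build_prompt and analyze_dataframe are the helpers sketched earlier (the model name and wiring are illustrative, not prescriptive):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def openai_llm_func(analysis_result, n_rows, extra_description=""):
    prompt = build_prompt(analysis_result, n_rows)
    if extra_description:
        prompt += f"\n\nAdditional context from previous attempts:\n{extra_description}"
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Wiring it together (df is the uploaded source DataFrame):
analysis = analyze_dataframe(df)
code, mock = generate_with_retries(openai_llm_func, df, analysis, n_rows=len(df))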

A Full Example: Energy Usage Simulation

Here’s a more advanced example the model can produce when asked for time-series data — no hand-tuning needed.

Prompt:
“Generate Python code that creates hourly energy usage data for 100 meters over 30 days.”

And the model might output:

import pandas as pd
import numpy as np

def main(source_data=None):
    timestamps = pd.date_range("2025-01-01", periods=30*24, freq="H")
    data = []

    for meter_id in range(100):
        base = np.random.uniform(0.5, 2.0)
        pattern = np.sin(np.linspace(0, 2*np.pi, 24)) + 1
        noise = np.random.normal(0, 0.1, len(timestamps))
        usage = np.tile(pattern, 30) * base + noise
        df = pd.DataFrame({
            "timestamp": timestamps,
            "meter_id": meter_id,
            "energy_kWh": usage.clip(0)
        })
        data.append(df)

    return pd.concat(data, ignore_index=True)

This creates realistic daily consumption patterns for every simulated smart meter — all inferred from one natural-language prompt.

Adding a Simple Streamlit Frontend

To make the tool accessible, a lightweight Streamlit UI wraps the entire process.

import streamlit as st
import pandas as pd
from executor import run_mockdata_code

def fake_llm_func(analysis_result, n_rows, _=""):
    # Normally you’d call your LLM here
    return """
import pandas as pd
import numpy as np
def main(source_data):
    rng = np.random.default_rng(1)
    n_rows = len(source_data)
    out = {}
    for col in source_data.columns:
        s = source_data[col].dropna()
        if np.issubdtype(s.dtype, np.number):
            mu, sigma = s.mean(), s.std() or 1
            vals = rng.normal(mu, sigma, n_rows)
            vals = np.clip(vals, s.min(), s.max())
            out[col] = np.round(vals, 3)
        else:
            freq = s.astype(str).value_counts(normalize=True)
            cats, probs = freq.index.values, freq.values
            probs = probs / probs.sum()
            out[col] = rng.choice(cats, size=n_rows, p=probs)
    return pd.DataFrame(out)
"""

st.title("LLM-Powered Mock Data Generator")

file = st.file_uploader("Upload your CSV")
if file:
    df = pd.read_csv(file)
    st.write("Source Preview", df.head())

    if st.button("Generate Mock Data"):
        code = fake_llm_func("(analysis result here)", len(df))
        mock = run_mockdata_code(code, df)
        st.write("Mock Data Preview", mock.head())
        st.download_button("Download Mock CSV", mock.to_csv(index=False), "mock.csv")
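
To try it locally, save the snippet as app.py (any file name works) and launch it with streamlit run app.py; Streamlit will open the app in your browser.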

This provides a clean interface where anyone can upload a dataset and instantly get a generated twin.

Lessons Learned

  • Prompt clarity beats prompt complexity.
    The simpler and stricter the contract (“return code that defines main(source_data)”), the more reliable the results.
  • Execution must be sandboxed.
    Never execute arbitrary LLM code without isolation and validation (a minimal hardening sketch follows this list).
  • Analysis before generation is key.
    The model doesn’t need raw data — just a description of it.
  • Retries and feedback loops make it robust.
    A single failure doesn’t mean the idea doesn’t work. The system should retry and adapt.
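
On that sandboxing point: the exec-based executor above keeps each run in a fresh namespace, but it does not restrict what the generated code can actually do. A minimal hardening sketch, not a full sandbox, is to run the executor in a separate process with a timeout so runaway or crashing code cannot take down the main application (run_isolated is an illustrative helper, not part of the tool above):

# isolation.py (illustrative)
import multiprocessing as mp

from executor import CodeExecutionError, run_mockdata_code

def run_isolated(code: str, source_df, timeout: float = 30.0):
    """Run the generated code in a worker process and kill it after `timeout` seconds."""
    with mp.Pool(processes=1) as pool:
        async_result = pool.apply_async(run_mockdata_code, (code, source_df))
        try:
            # A CodeExecutionError raised in the worker propagates through .get()
            return async_result.get(timeout=timeout)
        except mp.TimeoutError:
            raise CodeExecutionError(f"Generated code did not finish within {timeout}s.")
        # Leaving the with-block terminates the pool, so a hung worker does not linger

For anything user-facing you would still want container- or VM-level isolation on top of this, but it already removes the worst failure modes.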

The Big Picture

This approach blurs the line between AI and automation. The LLM doesn’t just produce static text — it writes executable logic that adapts to your data. Each run can produce unique, statistically aligned mock datasets without leaking private information.

It’s a step toward an intelligent data simulation engine — one where you describe what you need, and the model builds the generator for you.

Mock data is no longer something you fake.
It’s something you generate intelligently.