Training a Statistical Methods Advisor from Scratch
111 methods, one knowledge graph, a fine-tuned LLM.
I built this interactive statistics reference a while back — 111 method cards organized into a decision tree. You describe your data, it walks you through a series of questions, and it spits out the right test or model. It works. But it’s basically a lookup table with branching logic. I wanted to see if I could train a small language model to do the reasoning part — take a natural-language description of a data scenario and return structured recommendations without needing a decision tree at all.
So I started with this.
This is the original wizard. 111 statistical methods across 9 domains — infrastructure, data wrangling, EDA, assumption checking, statistical tests, modeling, model selection, causal inference, and missing data. You can click through the decision tree or search by keyword. It’s self-contained HTML, no backend.
The thing I kept bumping into: the wizard is great when the user’s question maps cleanly onto one branch of the tree. But real questions are messy. “I have 500 patients, some are censored, I want to compare survival between three treatment arms, and my proportional hazards assumption might be violated.” That touches multiple branches. A decision tree doesn’t handle ambiguity well. A model might.
The knowledge graph.
The first step was extracting the wizard’s knowledge into something a training pipeline could consume. The wizard’s data lives as JavaScript constants embedded in the HTML — arrays of section objects, card objects, decision tree nodes, and synonym groups. I wrote a Node.js script (training/extract_kg.js) that pulls these out and flattens them into a single JSON file.
The result is data/stat-kg.json: 111 method cards, 24 decision tree nodes, 18 synonym groups, and 9 top-level domains. Each card has a title, R and Python code, assumptions, what to do when assumptions fail, a plain-English explanation of when to use it, related methods, and searchable tags. That’s the entire knowledge base — static, curated, version-controlled.
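To make the structure concrete, here is what one card might look like. This is a hedged sketch — the field names and values are illustrative, not the repo's exact schema:

```python
import json

# Illustrative sketch of one card in data/stat-kg.json.
# Field names here are assumptions, not the actual schema.
card = {
    "id": "two-sample-t-test",
    "title": "Two-sample t-test",
    "domain": "statistical-tests",
    "when": "Compare the means of two independent, roughly normal groups.",
    "assumptions": ["independent observations", "approximate normality",
                    "similar variances (or use Welch's correction)"],
    "if_assumptions_fail": ["mann-whitney-u", "welch-t-test"],
    "r_code": "t.test(bp ~ group, data = df)",
    "python_code": "scipy.stats.ttest_ind(a, b, equal_var=False)",
    "related": ["paired-t-test", "anova-one-way"],
    "tags": ["two groups", "means", "continuous outcome"],
}

print(json.dumps(card, indent=2)[:60])
```

Everything downstream — question generation, answer grounding, eval — keys off these fields, which is why keeping them curated and version-controlled matters.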
Generating training data.
The model needs to learn a mapping: natural-language question → structured JSON response. So I needed hundreds of (question, answer) pairs grounded in the knowledge graph.
I wrote a template expansion system (training/generate_data.py) that generates realistic questions from card metadata. For each of the 111 cards, it fills in scenario templates with domain-specific slot values — outcome types, sample sizes, study designs, distributional notes. A t-test card might produce: “I have blood pressure measured in two independent groups of 50 patients each. Data looks roughly normal. What test should I use?” The answer is a structured JSON object with reasoning steps, ranked recommendations, assumption checklists, and fallback methods.
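The expansion logic is simple enough to sketch in a few lines. This is an illustrative reconstruction, not the actual generate_data.py — the template text, slot values, and answer fields are made up:

```python
import itertools

# Minimal sketch of the slot-filling idea: one template, a few slots,
# and the cross-product of slot values becomes the question set.
TEMPLATE = ("I have {outcome} measured in two independent groups of "
            "{n} patients each. {dist_note} What test should I use?")

SLOTS = {
    "outcome": ["blood pressure", "cholesterol", "reaction time"],
    "n": ["30", "50", "200"],
    "dist_note": ["Data looks roughly normal.", "The data is heavily skewed."],
}

def expand(template, slots):
    keys = list(slots)
    for combo in itertools.product(*(slots[k] for k in keys)):
        question = template.format(**dict(zip(keys, combo)))
        # Each question is paired with a structured answer grounded
        # in the source card (fields here are illustrative).
        answer = {"recommended": ["two-sample-t-test"],
                  "fallbacks": ["mann-whitney-u"]}
        yield {"question": question, "answer": answer}

examples = list(expand(TEMPLATE, SLOTS))
print(len(examples))  # 3 * 3 * 2 = 18 combinations
```

Multiply a handful of templates by a handful of slot values per card and the example count grows quickly — which is exactly the coverage-over-diversity trade-off templates make.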
Templates alone yield 888 examples across the 111 cards. The script also supports a teacher distillation mode — you load a larger open-source model (Qwen3-8B on a free Colab T4) and have it generate diverse, realistic questions for each card. No API keys, no cost. The hybrid approach combines both: teacher-distilled questions for diversity, templates for coverage.
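The hybrid combination step can be sketched as a dedup-and-merge. This is illustrative only — the real script's merging logic may differ:

```python
# Sketch of combining the two sources: teacher questions for diversity,
# template questions for coverage, deduplicated by normalized text.
def combine(template_examples, teacher_examples):
    seen, merged = set(), []
    # Teacher examples first, so their phrasing wins on collisions.
    for ex in teacher_examples + template_examples:
        key = " ".join(ex["question"].lower().split())
        if key not in seen:
            seen.add(key)
            merged.append(ex)
    return merged

templates = [{"question": "Two groups, normal data?", "card": "t-test"}]
teacher = [{"question": "two groups,  normal data?", "card": "t-test"},
           {"question": "I ran an RCT with two arms and a continuous outcome.",
            "card": "t-test"}]

print(len(combine(templates, teacher)))  # 2 after dedup
```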
Every example gets validated against the JSON schema and the knowledge graph’s card IDs before it hits the training set. No hallucinated method names.
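The validation gate might look something like this — field names and card IDs are hypothetical, but the idea is exactly as described: reject any example naming a method that isn't in the graph.

```python
# Sketch of the pre-training validation step. In the real pipeline the
# valid IDs would be loaded from data/stat-kg.json.
KG_CARD_IDS = {"two-sample-t-test", "mann-whitney-u", "welch-t-test"}

REQUIRED_KEYS = {"question", "answer"}

def validate(example, valid_ids):
    if not REQUIRED_KEYS <= example.keys():
        return False
    recs = example["answer"].get("recommended", [])
    # At least one recommendation, and every ID must exist in the graph.
    return bool(recs) and all(r in valid_ids for r in recs)

good = {"question": "Two groups, normal data?",
        "answer": {"recommended": ["two-sample-t-test"]}}
bad = {"question": "Which test is strongest?",
       "answer": {"recommended": ["super-duper-test"]}}

print(validate(good, KG_CARD_IDS), validate(bad, KG_CARD_IDS))  # True False
```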
Training.
The whole pipeline runs in a single Colab notebook on a free T4 GPU. Qwen3-8B generates the teacher-distilled data, then we fine-tune Qwen2.5-1.5B-Instruct with QLoRA — freeze the base model in 4-bit precision and only train a small set of low-rank adapter weights. The key settings: LoRA rank 32, NF4 quantization, cosine learning rate schedule, effective batch size 16, 3 epochs, seed 42.
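For orientation, the settings above map roughly onto the transformers/peft/bitsandbytes stack like this. A hedged config sketch, not the notebook's exact code — lora_alpha and target_modules in particular are my assumptions, not stated in the pipeline:

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

# 4-bit NF4 quantization for the frozen base model (QLoRA).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # T4 has no bfloat16
)

# Low-rank adapters; only these weights train.
lora = LoraConfig(
    r=32,                     # LoRA rank 32, as in the text
    lora_alpha=64,            # assumption: alpha = 2r is a common default
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # 4 * 4 = effective batch size 16
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    seed=42,
    fp16=True,
)
```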
After training, the LoRA adapter gets merged back into the base model and the whole thing gets quantized to GGUF format (Q4_K_M) via llama.cpp. The final artifact is a single ~1 GB file that runs anywhere — Ollama, llama.cpp, even in-browser via WebLLM.
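The merge-and-quantize step, sketched in outline — paths are placeholders, llama.cpp script names vary by version, and this won't run without the trained adapter:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the frozen base, attach the trained adapter, fold it into the weights.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
merged = PeftModel.from_pretrained(base, "out/adapter").merge_and_unload()
merged.save_pretrained("merged/")
AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct").save_pretrained("merged/")

# Then, with a llama.cpp checkout (script/binary names vary by version):
#   python convert_hf_to_gguf.py merged/ --outfile stat-wiz-f16.gguf
#   ./llama-quantize stat-wiz-f16.gguf stat-wiz-Q4_K_M.gguf Q4_K_M
```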
```shell
# register with Ollama
ollama create stat-wiz -f model/Modelfile

# ask it something
ollama run stat-wiz "I have two independent groups of 50 patients each, and I want to compare their blood pressure. Data looks roughly normal."
```
Temperature is set to 0 — same question, same answer, every time. No API keys, no retrieval pipeline, no vector database. The knowledge lives in the weights.
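The determinism comes from the Modelfile. A minimal sketch — the SYSTEM line is illustrative, not the repo's actual prompt:

```
FROM ./stat-wiz-Q4_K_M.gguf
PARAMETER temperature 0
SYSTEM You are a statistical methods advisor. Respond with structured JSON.
```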
Does it work?
(This section will be updated with eval results after training. Run python training/eval.py --backend ollama --model stat-wiz to populate data/eval_report.json.)
The evaluation framework tests JSON validity (does the model return parseable JSON?), reasoning chain presence (does it show its work?), top-1 and top-3 accuracy (does it recommend the right method?), and card ID validity (does it hallucinate methods that don’t exist in the knowledge graph?).
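Those four checks are easy to sketch. Metric names mirror the prose, but the real training/eval.py may structure this differently:

```python
import json

# Card IDs would come from data/stat-kg.json; these are stand-ins.
KG_CARD_IDS = {"two-sample-t-test", "mann-whitney-u"}

def score(raw_output, expected_id, k=3):
    try:
        out = json.loads(raw_output)
    except json.JSONDecodeError:
        # Unparseable output fails everything.
        return {"json_valid": False, "has_reasoning": False,
                "top1": False, "topk": False, "ids_valid": False}
    recs = out.get("recommended", [])
    return {
        "json_valid": True,
        "has_reasoning": bool(out.get("reasoning")),       # shows its work?
        "top1": bool(recs) and recs[0] == expected_id,     # top-1 accuracy
        "topk": expected_id in recs[:k],                   # top-3 accuracy
        "ids_valid": all(r in KG_CARD_IDS for r in recs),  # no hallucinated cards
    }

demo = json.dumps({"reasoning": ["two groups", "continuous, roughly normal"],
                   "recommended": ["two-sample-t-test", "mann-whitney-u"]})
print(score(demo, "two-sample-t-test"))
```

The last check is the one the schema validation in training was designed to guarantee — the model should never emit a card ID outside the graph.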
Reflecting on the approach.
The interesting thing here isn’t the model itself — 1.5 billion parameters is small, and template-generated training data has a diversity ceiling. The interesting thing is the pipeline: you take a hand-curated, inspectable decision tree and compress it into model weights. The knowledge that used to live in if-else branches now lives in floating-point matrices.
Is that better? The wizard is deterministic and inspectable — you can trace exactly why it recommended a t-test. The model is also deterministic (temperature 0), but its reasoning chain is generated, not traced. It’s a simulation of transparency, not the real thing. On the other hand, the model handles ambiguity in ways the wizard can’t. It can reason about overlapping domains, express uncertainty, and ask follow-up questions.
What worked: using the knowledge graph as a training data source. Having 111 curated cards with structured metadata makes it trivial to generate grounded examples. QLoRA makes the compute accessible — free tier Colab is enough. GGUF makes the deployment portable — one file, many runtimes.
What I’d do differently: a larger student model. 1.5B parameters is the minimum viable size for structured JSON output. A 3B or 7B base would probably be worth the extra compute for production use — still free on Colab, just slower to train.
The code is all in the repo if you want to try it — training/ has the full pipeline from knowledge graph to GGUF.