| title | OrthoRL — Orthodontic Treatment Planning Environment |
|---|---|
| emoji | 🦷 |
| colorFrom | blue |
| colorTo | green |
| sdk | docker |
| app_port | 7860 |
| tags | |

Can an LLM agent learn to plan orthodontic aligner treatment, stage by stage, under delayed bone biomechanics?
The first reinforcement-learning environment for per-stage clear-aligner trajectory planning. 24 sequential decisions. 28 teeth in SE(3) space. 1,063 real patient profiles from Beijing Stomatological Hospital. Five algorithmic reward functions. Zero LLM-as-judge.
The live dashboard. Drag the 3D ellipsoids to inspect a real Tsinghua patient's pre-treatment and post-treatment tooth poses.
Clear-aligner planning is an $8 B → $56 B market with a brutal failure mode: only 6% of patients finish on the original plan, and 1 in 6 switches to braces because the digital plan never tracked. The mechanism is structural — planners draw straight-line SLERP paths between tooth poses, but bone remodelling is delayed and biological, so teeth lose tracking and plans require mid-course corrections.
OrthoRL is the first reinforcement-learning environment for per-stage orthodontic aligner trajectory planning. An LLM agent diagnoses the case, picks a treatment strategy, and plans 24 sequential aligner stages for 28 teeth in SE(3) space under a pharmacokinetic force decay model that delays the actual displacement by ~2 stages. Five algorithmic reward functions. Zero LLM-as-judge.
We trained Qwen2.5-3B end-to-end through a 5-stage pipeline (3 SFT sub-stages → GRPO → RFT) on 1,063 real Tsinghua patients. Result: on 250 unseen patients, mean reward rises from 0.72 (SLERP baseline) to 0.86 (trained agent), and the left tail of failures (reward < 0.6) shrinks from ~25% to ~0%.
| Resource | Link |
|---|---|
| 🌐 Live HF Space | https://grimoors-orthorl.hf.space |
| 🤗 Repo on Hugging Face | https://huggingface.co/spaces/Grimoors/orthorl |
| 🤖 Trained model (LoRA adapter) | https://huggingface.co/sri-manikanta/orthorl-grpo |
| 📺 Demo video | https://youtu.be/MK6gfw3ZKz0 |
| 📖 Full blog | BLOG.md |
| 📊 OpenAPI docs | https://grimoors-orthorl.hf.space/docs |
| ❤️ Health probe | https://grimoors-orthorl.hf.space/health |
| 📦 GitHub | https://github.com/mehular0ra/orthorl |
Every year 12 million patients start clear-aligner treatment. Today's planning software defines a target occlusion and runs straight-line interpolation (SLERP) from the patient's current poses to the target. It looks beautiful in software. It breaks in the mouth.
Why? Bone is not glass. The periodontal ligament responds over weeks. Forces applied at stage N don't produce their full effect until stage N+2 or N+3 (Proffit, Contemporary Orthodontics Ch. 8). A trajectory that looks perfect on a CAD viewer overshoots by ~30% on hard cases. Teeth lose tracking. Plans need refinement scans. Only 6% of patients finish on the original plan. 1 in 6 switches to braces.
That gap is why OrthoRL exists.
Two years ago you couldn't have built this. Three things had to land first:
- Real patient data at scale. Wang et al.'s 2024 Nature Scientific Data release of 1,063 Tsinghua patients with pre- and post-treatment tooth landmarks. We use 200 of them with 8 crown landmarks per tooth in FDI notation as the SE(3) ground truth.
- A verifiable reward. Andrews' Six Keys to Normal Occlusion (1972) gives 9 algorithmic geometric metrics computed directly from SE(3) poses. Combined with Kelvin-Voigt periodontal-ligament biomechanics, mesh-based collision detection, and an empirical anchorage prior mined from the data itself — five independent reward functions, all deterministic given seed.
- Group-Relative Policy Optimisation. GRPO eliminates the value network and uses group-relative baselines, making it tractable to train an LLM agent through 24-step rollouts at 4 generations per prompt.
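The group-relative baseline in the last bullet can be illustrated with a minimal sketch. It assumes the common formulation in which each completion's reward is standardised against the other completions sampled for the same prompt. The function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are illustrative choices, not taken from the OrthoRL training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardise rewards within one group of completions for a single prompt.

    No value network is involved: the group mean acts as the baseline.
    `eps` (an assumed constant) avoids division by zero when all rewards tie.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of G = 4 completions for the same prompt (made-up rewards).
advantages = group_relative_advantages([0.6, 0.7, 0.8, 0.9])
```

Completions that score above the group mean get a positive advantage and those below get a negative one, so with G = 4 each prompt supplies its own baseline.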
OrthoRL puts an LLM agent in the orthodontist's chair: diagnose the case, pick a strategy, plan 24 aligner stages, get penalised for collisions and over-anchored molars, get rewarded for clinically realistic incisor-first staging — and learn the lead pattern that real orthodontists do intuitively (apply force 2 stages ahead of the desired tooth position).
After the SFT → GRPO → RFT chain, on 250 held-out Tsinghua patients the agent never saw:
| Policy | Mean reward | Std | Left-tail (< 0.6) |
|---|---|---|---|
| SLERP baseline | 0.72 | 0.12 | ~25 % |
| OrthoRL trained | 0.86 | 0.04 | ~0 % |
The trained agent doesn't just shift the average — it eliminates the catastrophic failures that drive the refinement trap. Tighter whiskers, no left tail, clinically realistic staging strategy choices.
OrthoRL is a textbook Theme 3.1 fit. The agent interacts with real tools (8 of them: diagnose, simulate, check collisions, commit, rollback, and measure crowding/overbite/Angle class), maintains internal state across 24 steps, and updates its beliefs through tool calls and observations. The reward verifies professional task quality (Andrews' Six Keys, biomechanical compliance, treatment-strategy correctness, anchorage realism). The same initial state admits multiple valid trajectories with real clinical tradeoffs (Class II → distalisation vs extraction vs Class II elastics).
With pharmacokinetic force decay, OrthoRL is, to our knowledge, the only RL environment with a non-Markov delayed-reward mechanism grounded in human biology. Other envs use delayed rewards as abstractions; OrthoRL models the actual 0–8 week bone-remodelling impulse response as a 5-tap convolution kernel [0.10, 0.30, 0.40, 0.15, 0.05] (Proffit Ch. 8 + Cattaneo 2005). The agent must learn temporal credit assignment over a 2–3 stage horizon, the exact skill that reduces refinement rates in clinical practice.
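To make the delay concrete, here is a minimal sketch of a causal convolution with the 5-tap kernel quoted above. Only the kernel values come from this README. The function name `realised_displacement` and the treatment of the input as a per-stage scalar displacement are assumptions for illustration; the actual `ForceDecayModel` may differ.

```python
# Kernel values as quoted in the text; everything else is illustrative.
KERNEL = [0.10, 0.30, 0.40, 0.15, 0.05]


def realised_displacement(commanded):
    """Causal convolution: the movement seen at stage n is a weighted sum
    of the displacements commanded at stages n, n-1, ..., n-4."""
    out = []
    for n in range(len(commanded)):
        total = 0.0
        for lag, weight in enumerate(KERNEL):
            if n - lag >= 0:
                total += weight * commanded[n - lag]
        out.append(total)
    return out


# Impulse response: a single unit command at stage 0.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
response = realised_displacement(impulse)
peak_lag = max(range(len(response)), key=response.__getitem__)
```

The impulse response reproduces the kernel and peaks at a lag of 2 stages, which is why a command issued at stage N is mostly felt around stage N+2.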
OpenAI's GDPval benchmark measured frontier models against human experts on 44 economically-valuable expert occupations in the top-9 GDP-contributing US sectors. Dentistry is not on that list. OrthoRL is the training environment GDPval skipped — same head-to-head methodology (trained agent vs SLERP baseline on identical held-out patients), applied to a domain where AI automation is documented at ~80× planning speedup and reducing refinements saves ~20% of aligners per case.
Agent receives observation:
┌─────────────────────────────────────────────┐
│ 28 teeth × [qw,qx,qy,qz, tx,ty,tz] │
│ Current poses + Target poses │
│ Per-tooth progress (0–100%) │
│ Stage number (1–24) │
│ Reward breakdown from previous stage │
└─────────────────────────────────────────────┘
│
▼
Agent emits an action (JSON):
{
"strategy": "anterior_first",
"tooth_groups": [
{"teeth": [11,12,21,22], "fraction": 0.6, "priority": "high"},
{"teeth": [16,17,26,27], "fraction": 0.2, "priority": "low"}
]
}
│
▼
Environment applies the action:
┌─────────────────────────────────────────────┐
│ 1. Parse fractions → SLERP interpolation │
│ 2. Enforce 0.25 mm / 2° per-stage limits │
│ 3. Apply force-decay convolution │
│ 4. Check inter-tooth mesh collisions │
│ 5. Compute PDL stress feasibility │
│ 6. Grade via 5 reward functions │
│ 7. Return new observation + reward │
└─────────────────────────────────────────────┘
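Steps 1 and 2 of the pipeline above can be sketched as follows. This is an illustration under stated assumptions, not the environment's code: poses are 7-tuples in the `[qw,qx,qy,qz, tx,ty,tz]` layout shown in the observation, translations are in millimetres, and the per-stage limits are enforced by shrinking the requested fraction so that rotation and translation stay coupled. The names `slerp` and `stage_step` are hypothetical.

```python
import math

MAX_MM, MAX_DEG = 0.25, 2.0  # per-stage limits quoted in the diagram


def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions (w first)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:  # take the shorter arc
        q1, dot = tuple(-c for c in q1), -dot
    theta = math.acos(min(1.0, dot))
    if theta < 1e-8:  # poses already coincide
        return tuple(q0)
    s = math.sin(theta)
    w0 = math.sin((1.0 - t) * theta) / s
    w1 = math.sin(t * theta) / s
    return tuple(w0 * a + w1 * b for a, b in zip(q0, q1))


def stage_step(pose, target, fraction):
    """Move `pose` toward `target` by `fraction`, capped by the stage limits."""
    q0, t0 = tuple(pose[:4]), tuple(pose[4:])
    q1, t1 = tuple(target[:4]), tuple(target[4:])
    dist = math.dist(t0, t1)
    dot = min(1.0, abs(sum(a * b for a, b in zip(q0, q1))))
    angle_deg = math.degrees(2.0 * math.acos(dot))  # remaining rotation
    f = max(0.0, min(1.0, fraction))
    if dist > 0.0:
        f = min(f, MAX_MM / dist)
    if angle_deg > 0.0:
        f = min(f, MAX_DEG / angle_deg)
    q = slerp(q0, q1, f)
    t = tuple(a + f * (b - a) for a, b in zip(t0, t1))
    return q + t


# A tooth 10 degrees (about z) and 1 mm (along x) away from its target.
half = math.radians(10.0) / 2.0
start = (1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
goal = (math.cos(half), 0.0, 0.0, math.sin(half), 1.0, 0.0, 0.0)
new_pose = stage_step(start, goal, 0.6)
```

In this example the requested fraction of 0.6 is cut to 0.2 by the 2° limit, so the tooth rotates 2° and translates 0.2 mm in that stage.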
| Prior work | What it does | What OrthoRL adds |
|---|---|---|
| Li & Wang 2025 (Sci Rep) — only published RL paper | Coarse clinical decisions (extraction vs surgery) | Per-stage trajectory planning over 24 commits |
| CLIK-Diffusion 2025 (MedIA) | Diffusion model predicts the final alignment | The 24 intermediate stages between init and target |
| Dong & Chen 2025 (ICCV) | Transformer + collision constraints | Multi-step decision making with delayed reward |
| TAlignDiff 2024 | Diffusion target prediction | A Gymnasium-compatible RL environment, not a one-shot model |
┌──────────────────────────────────────────────────────────────────┐
│ OrthoRL System │
│ │
│ ┌─────────────┐ ┌──────────────────────────────────────┐ │
│ │ FastAPI │ │ RL Environment (server/) │ │
│ │ Server │────▶│ DentalAlignerEnvironment │ │
│ │ /reset │ │ ├─ SyntheticCaseGenerator │ │
│ │ /step │ │ ├─ LandmarkLoader (real patients) │ │
│ │ /demo_run │ │ ├─ ForceDecayModel (5-tap kernel) │ │
│ │ /docs │ │ ├─ AdversarialJitter │ │
│ └─────────────┘ │ └─ ClinicalProfileManager │ │
│ │ │ │
│ ┌─────────────┐ │ Grading Stack │ │
│ │ GRPO │ │ ├─ AlignerGrader (4 components) │ │
│ │ Trainer │────▶│ ├─ OcclusionScorer (9 Andrews) │ │
│ │ (embedded) │ │ ├─ PDLModel (Kelvin-Voigt) │ │
│ │ Qwen2.5-3B │ │ ├─ CollisionDetector (mesh) │ │
│ │ + LoRA r=16│ │ └─ RewardScaler ([-2, +8]) │ │
│ └─────────────┘ └──────────────────────────────────────┘ │
│ ┌─────────────┐ ┌──────────────────────────────────────┐ │
│ │ Datasets │ │ Evaluation │ │
│ │ Tsinghua │────▶│ 3-tier held-out: │ │
│ │ 1,063 pts │ │ Tier 1: 250 Tsinghua test │ │
│ │ OFJ 17 pts │ │ Tier 2: 17 Open-Full-Jaw │ │
│ └─────────────┘ │ Tier 3: 40 Bits2Bites │ │
│ └──────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
Each server/ module is self-contained — imports only from dental_constants and quaternion_utils (shared definitional constants and math). No cross-module dependencies. 227 passing tests as the contract.
OrthoRL uses 5 algorithmic reward functions — no LLM-as-judge. Every reward is deterministic given the seed.
| # | Reward | Range | What it measures |
|---|---|---|---|
| 1 | Terminal | [-2, +8] | 4-component score (final accuracy 40 % · smoothness 20 % · compliance 20 % · staging quality 20 %) scaled with hard-fail overrides for collisions / PDL stress |
| 2 | Andrews' Occlusion | [0, 1] | 9 metrics from Andrews' Six Keys: molar relationship, overjet, overbite, crown angulation/inclination, rotations, contact tightness, curve of Spee, arch symmetry |
| 3 | Strategy | {0, 0.5, 1} | 1.2× / 1.0× / 0.6× multiplier for optimal / neutral / contraindicated treatment strategy given diagnosis |
| 4 | Format | [0, 1] | Partial-credit JSON validity check across 5 sub-checks |
| 5 | Anchorage | [0, 1] | Empirical prior mined from 195 real patients (n=5,089 tooth-class observations) — penalises unrealistic molar displacement above the data-driven 90th percentile |
Anti-hacking: five independent rewards (gaming one doesn't help with the others) + hard-fail overrides for collisions and PDL stress + sample completions logged every 10 steps for inspection + algorithmic-only grading (no prompt-injectable judge).
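As an illustration of reward 1, the sketch below combines the four component weights from the table and maps the result onto the stated [-2, +8] range. The linear scaling, the clamping of components to [0, 1], and the choice to return the range minimum on a hard fail are assumptions of this sketch. The behaviour of `server/grader.py` is not reproduced here, and the dictionary keys are hypothetical.

```python
# Weights from the reward table; key names are illustrative.
WEIGHTS = {"accuracy": 0.40, "smoothness": 0.20, "compliance": 0.20, "staging": 0.20}
REWARD_MIN, REWARD_MAX = -2.0, 8.0


def terminal_reward(components, collision=False, pdl_overstress=False):
    """Weighted 4-component score scaled to [-2, +8].

    Assumption: any hard fail overrides the score with the range minimum.
    """
    if collision or pdl_overstress:
        return REWARD_MIN
    score = sum(
        weight * min(1.0, max(0.0, components[name]))
        for name, weight in WEIGHTS.items()
    )
    return REWARD_MIN + (REWARD_MAX - REWARD_MIN) * score


perfect = {"accuracy": 1.0, "smoothness": 1.0, "compliance": 1.0, "staging": 1.0}
best = terminal_reward(perfect)
failed = terminal_reward(perfect, collision=True)
```

Under these assumptions a plan that scores perfectly on every component still receives the minimum reward if it collides, which is the point of a hard-fail override.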
Stage 0: Format SFT ──→ Model learns JSON output structure
│
Stage 1: Tool-use SFT ──→ Model learns tool-calling grammar
│
Stage 2: BC SFT ──→ Model imitates expert staging oracle
│
Stage 3: GRPO ──→ Model improves via 5 reward signals
│
Stage 4: Rejection FT ──→ Variance reduction on best completions
Each stage emits a checkpoint with a gate_passed.json that the next stage validates before starting — no stage runs on a broken predecessor. The interactive Training Stages dashboard renders per-stage curves in three click-to-switch panels.
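A gate check of this kind could look roughly like the sketch below. Only the file name `gate_passed.json` comes from this README. The `"passed"` key and the functions `gate_passed` and `require_gate` are hypothetical, since the file's actual schema is not described here.

```python
import json
from pathlib import Path


def gate_passed(checkpoint_dir):
    """Return True only if the previous stage left a valid, passing gate file.

    The `"passed": true` field is an assumed schema for illustration.
    """
    gate = Path(checkpoint_dir) / "gate_passed.json"
    if not gate.is_file():
        return False
    try:
        payload = json.loads(gate.read_text())
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("passed") is True


def require_gate(checkpoint_dir):
    """Refuse to start a stage whose predecessor did not pass its gate."""
    if not gate_passed(checkpoint_dir):
        raise RuntimeError(f"gate not passed for {checkpoint_dir}; refusing to start")
```

A stage would call `require_gate` on its predecessor's checkpoint directory before loading any weights, so a missing or malformed gate file fails fast.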
| Parameter | Value |
|---|---|
| Base model | Qwen2.5-3B-Instruct (4-bit via Unsloth) |
| LoRA | r = 16, α = 32, target = q/k/v/o + gate/up/down_proj |
| GRPO steps | 300 |
| Generations per prompt (G) | 4 |
| Learning rate | 5 × 10⁻⁶ (warmup + linear decay) |
| Reward range | [-2, +8] |
| Compute | Azure ML — ND96asr v4 (A100 80 GB) |
| Cost | ~$3.50 / full pipeline run |
| Policy | Mean reward | Std | Left-tail (< 0.6) |
|---|---|---|---|
| SLERP baseline | 0.72 | 0.12 | ~25 % |
| OrthoRL trained | 0.86 | 0.04 | ~0 % |
The grey distribution (SLERP) has a long left tail of refinement-trap failures. The green distribution (OrthoRL) is narrower, shifted right, and the entire left tail is gone. Robustness, not just average performance.
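The three columns of the results table can be computed from per-patient rewards as in this minimal sketch. The 0.6 threshold is the one used in the table header. The function name `summarise`, the use of the population standard deviation, and the example numbers are illustrative and are not evaluation results.

```python
from statistics import mean, pstdev


def summarise(rewards, tail_threshold=0.6):
    """Mean, spread, and fraction of episodes below the left-tail threshold."""
    tail = sum(r < tail_threshold for r in rewards) / len(rewards)
    return {"mean": mean(rewards), "std": pstdev(rewards), "left_tail": tail}


# Made-up rewards for illustration only, not data from the evaluation.
example_rewards = [0.55, 0.70, 0.80, 0.95]
stats = summarise(example_rewards)
```

Reporting the left-tail fraction alongside the mean separates a policy that is better on average from one that also avoids the worst outcomes.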
- Anterior-first staging. Real orthodontists move incisors before molars; the trained agent does the same despite no rule saying so.
- Empirical anchorage. Median molar displacement matches the 0.89 mm population median mined from 195 real treatments.
- Diagnosis-to-strategy alignment. Class II cases preferentially get distalisation; Class III gets reverse-anchorage.
- Force-lead pattern. The agent applies more force in early stages on hard cases — the same compensation real clinicians make for delayed bone response.
Training plots (images not reproduced here): SFT loss collapse · GRPO per-component reward · GRPO loss + mean reward · Policy entropy (no mode collapse)
The full per-stage diagnostic picture lives in BLOG.md, with all 9 plots and stage-by-stage commentary.
git clone https://github.com/mehular0ra/orthorl
cd orthorl
uv sync
uv run python -m server.app # FastAPI on :7860
make test # 227 tests pass in ~60 s

Open the dashboard in a browser at the local URL the server prints.
# Health probe
curl https://grimoors-orthorl.hf.space/health
# {"status":"healthy"}
# Reset to a real Tsinghua patient
curl -X POST https://grimoors-orthorl.hf.space/reset_stepwise \
-H 'Content-Type: application/json' \
-d '{"task_id":"tsinghua/0001","mode":"eval","tier":1,"eval_idx":0}'

# end-to-end on a GPU host (~3.5 h on A10G ≈ $3.50; ~8.5 h on T4 ≈ $3.40)
bash scripts/run_full_pipeline.sh
# or just one stage (re-run GRPO without re-running SFT)
STAGES="3" bash scripts/run_full_pipeline.sh
# evaluate against the locked SLERP baseline
uv run python eval.py --policy checkpoints/grpo --tier 1 --seeds 3
uv run python eval.py --policy checkpoints/rft --tier 1 --seeds 3
# regenerate slide-ready plots
uv run python scripts/build_demo_plots.py
uv run python scripts/build_ablation_matrix.py

Colab driver: notebooks/colab_a10g.py.
| Artefact | Frequency | Content |
|---|---|---|
| `logs/run_<TS>.log` | streaming (tee) | every stdout line of every stage |
| `logs/grpo_samples.jsonl` | every 10 steps | step / loss / lr / per-reward-fn means / completion length |
| `logs/grpo_completions.jsonl` | every 10 steps | full completion text + slim breakdown |
| `checkpoints/grpo/trainer_state.json` | every TRL step | TRL log_history with all reward fns |
| `research/results.tsv` | once at training end | autolog row per multiautoresearch discipline |
| `assets/static/training/*.png` | once at training end | matplotlib plots, embedded in the dashboard |
| W&B run URL | every step | live cloud dashboard (auto-on if WANDB_API_KEY is set) |
# Dockerfile uses python:3.11-slim + uv
# Serves OpenEnv FastAPI + dashboard on port 7860
huggingface-cli login # paste a write-scoped token
git remote add hf https://huggingface.co/spaces/<user>/orthorl
git push hf main:main

The Space frontmatter lives in this README's YAML header (sdk: docker, app_port: 7860). Every push triggers an automatic Docker rebuild. The current live deployment is at https://huggingface.co/spaces/Grimoors/orthorl.
orthorl/
├── server/ # OpenEnv FastAPI environment
│ ├── app.py # create_fastapi_app + dashboard HTML
│ ├── dental_environment.py # StepwiseDentalEnvironment (24-step)
│ ├── grader.py # IMMUTABLE — terminal reward
│ ├── occlusion_scorer.py # Andrews' Six Keys (9 metrics)
│ ├── pdl_model.py # Kelvin-Voigt PDL biomechanics
│ ├── force_decay.py # 5-tap pharmacokinetic kernel
│ ├── collision_detector.py # Oriented mesh / ellipsoid
│ ├── adversarial.py # Patient non-compliance
│ ├── landmark_loader.py # 200 real Tsinghua patients
│ ├── clinical_profiles.py # 1,063 patient profiles
│ ├── eval_split.py # Frozen 250 + 17 + 40 IDs
│ └── ... # (31 modules, all self-contained)
├── scripts/ # SFT / GRPO / RFT / eval drivers
├── tests/ # 227 pytest cases
├── data/ # SFT JSONL datasets
├── datasets/ # Tsinghua + OFJ + Bits2Bites
├── assets/static/ # Dashboard plots + 3D HTMLs (deployed)
├── notebooks/colab_a10g.py # One-cell Colab launcher
├── train_grpo.py # Main GRPO entrypoint
├── eval.py # Held-out evaluation CLI
├── prepare.py # IMMUTABLE benchmark
├── BLOG.md # Full hackathon write-up
├── README.md # ← you are here
├── Dockerfile # python:3.11-slim + uv
├── pyproject.toml + uv.lock # Locked deps
└── openenv.yaml # OpenEnv manifest
- Algorithmic rewards only. No LLM-as-judge anywhere in the reward chain. Every reward function is deterministic given the seed, eliminating prompt-injection reward hacking and making training reproducible.
- Read-only benchmark. `prepare.py` and `server/grader.py` are immutable during experiments. Once a result is recorded in `research/results.tsv`, the benchmark cannot change retroactively.
- One hypothesis per training run. Each completed run is appended to `research/results.tsv` (append-only); promotion to `research/live/master.json` requires beating the current master metric.
- Three-tier held-out eval. 250 Tsinghua + 17 OFJ + 40 Bits2Bites IDs are frozen in `server/eval_split.py`, and `EvalRegistry.assert_training_legal()` raises on any leakage in train mode (CI-pinned). Tier 2 and Tier 3 are entirely different datasets: true generalisation, not just a train/test split.
- Module independence. Each `server/` module imports only from `dental_constants` and `quaternion_utils`. No cross-module state. Makes the codebase testable (227 passing tests) and easy to reason about.
- Embedded environment for training. During GRPO, the env runs in-process (no FastAPI hop). Episode results are cached per (completion, seed) so all 5 reward functions share one rollout instead of paying for it 5 times.
make test # 227 passing in ~60 s
make fast-check # high-value subset (<5 s); pre-commit hook
make install-precommit # repo-local hook so the next regression doesn't ship

- The PDL biomechanical model is simplified (Kelvin-Voigt spring, not full FEA).
- Collision detection has both ellipsoid (fast) and mesh-based (accurate) modes; the synthetic case generator uses ellipsoids only.
- The dataset has crown landmarks but not root segmentation — root-resorption risk can only be approximated.
- Single-agent MDP; real treatment planning involves clinician-patient negotiation and multi-visit feedback.
- Wang et al. (2024). A 3D dental model dataset with pre/post-orthodontic treatment for automatic tooth alignment. Nature Scientific Data 11:1277. DOI: 10.1038/s41597-024-04138-7
- Andrews LF (1972). The six keys to normal occlusion. Am J Orthod 62(3):296–309.
- Proffit WR. Contemporary Orthodontics 6th ed., Ch. 8 — bone-remodelling timeline.
- Cattaneo PM et al. (2005). Moment-to-force ratio, centre of rotation, and force level. Am J Orthod Dentofacial Orthop.
- Shao Z et al. (2024). DeepSeekMath (GRPO). arXiv:2402.03300
- Yuan Z et al. (2023). Scaling Relationship on Learning Mathematical Reasoning with Large Language Models (RFT). arXiv:2308.01825
- Patwardhan S et al. (2025). GDPval. arXiv:2510.04374
- Grand View Research (2025). Clear Aligners Market Report — $8.29 B market, projected $56.81 B by 2033.
MIT. Patient data inherits the original Tsinghua / Open-Full-Jaw / Bits2Bites licenses.






