🦷 OrthoRL — battisiBot v2

title	OrthoRL — Orthodontic Treatment Planning Environment
emoji	🦷
colorFrom	blue
colorTo	green
sdk	docker
app_port	7860

Can an LLM agent learn to plan orthodontic aligner treatment, stage by stage, under delayed bone biomechanics?

The first reinforcement-learning environment for per-stage clear-aligner trajectory planning. 24 sequential decisions. 28 teeth in SE(3) space. 1,063 real patient profiles from Beijing Stomatological Hospital. Five algorithmic reward functions. Zero LLM-as-judge.

The live dashboard. Drag the 3D ellipsoids to inspect a real Tsinghua patient's pre-treatment and post-treatment tooth poses.

⚡ TL;DR

Clear-aligner planning is an $8 B → $56 B market with a brutal failure mode: only 6% of patients finish on the original plan, and 1 in 6 switches to braces because the digital plan never tracked. The mechanism is structural — planners draw straight-line SLERP paths between tooth poses, but bone remodelling is delayed and biological, so teeth lose tracking and plans require mid-course corrections.

OrthoRL is the first reinforcement-learning environment for per-stage orthodontic aligner trajectory planning. An LLM agent diagnoses the case, picks a treatment strategy, and plans 24 sequential aligner stages for 28 teeth in SE(3) space under a pharmacokinetic force decay model that delays the actual displacement by ~2 stages. Five algorithmic reward functions. Zero LLM-as-judge.

We trained Qwen2.5-3B end-to-end through a 5-stage pipeline (3 SFT sub-stages → GRPO → RFT) on 1,063 real Tsinghua patients. Result: 0.72 → 0.86 mean reward on 250 unseen patients, with the entire left-tail of failures eliminated.

🔗 Quick links


🌐 Live HF Space	https://grimoors-orthorl.hf.space
🤗 Repo on Hugging Face	https://huggingface.co/spaces/Grimoors/orthorl
🤖 Trained model (LoRA adapter)	https://huggingface.co/sri-manikanta/orthorl-grpo
📺 Demo video	https://youtu.be/MK6gfw3ZKz0
📖 Full blog	BLOG.md
📊 OpenAPI docs	https://grimoors-orthorl.hf.space/docs
❤️ Health probe	https://grimoors-orthorl.hf.space/health
📦 GitHub	https://github.com/mehular0ra/orthorl

The Story: From Straight Lines to Surgical Precision

Act 1: The Refinement Trap

Every year 12 million patients start clear-aligner treatment. Today's planning software defines a target occlusion and runs straight-line interpolation (SLERP) from the patient's current poses to the target. It looks beautiful in software. It breaks in the mouth.

Why? Bone is not glass. The periodontal ligament responds over weeks. Forces applied at stage N don't produce their full effect until stage N+2 or N+3 (Proffit, Contemporary Orthodontics Ch. 8). A trajectory that looks perfect on a CAD viewer overshoots by ~30% on hard cases. Teeth lose tracking. Plans need refinement scans. Only 6% of patients finish on the original plan. 1 in 6 switches to braces.

That gap is why OrthoRL exists.

Act 2: A Verifiable Long-Horizon RL Problem

Two years ago you couldn't have built this. Three things had to land first:

Real patient data at scale. Wang et al.'s 2024 Nature Scientific Data release of 1,063 Tsinghua patients with pre- and post-treatment tooth landmarks. We use 200 of them with 8 crown landmarks per tooth in FDI notation as the SE(3) ground truth.
A verifiable reward. Andrews' Six Keys to Normal Occlusion (1972) gives 9 algorithmic geometric metrics computed directly from SE(3) poses. Together with Kelvin-Voigt periodontal-ligament biomechanics, mesh-based collision detection, and an empirical anchorage prior mined from the data itself, these feed five independent reward functions, all deterministic given the seed.
Group-Relative Policy Optimisation. GRPO eliminates the value network and uses group-relative baselines, making it tractable to train an LLM agent through 24-step rollouts at 4 generations per prompt.
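
As a rough illustration of the group-relative baseline, the sketch below normalises each completion's reward against the other completions sampled for the same prompt, following the mean/std form described in the GRPO paper. The function name, the epsilon, and the toy reward values are assumptions for illustration; this is not the trainer's actual code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each reward against its own group's mean and std (no value network)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# G = 4 completions sampled for one prompt (toy reward values, not real results)
advantages = group_relative_advantages([0.70, 0.82, 0.64, 0.88])
```

Completions that beat their group's average get a positive advantage and the rest get a negative one, so no learned critic is needed to provide the baseline.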

OrthoRL puts an LLM agent in the orthodontist's chair: diagnose the case, pick a strategy, plan 24 aligner stages, get penalised for collisions and over-anchored molars, and get rewarded for clinically realistic incisor-first staging. Along the way it has to learn the lead pattern that real orthodontists apply intuitively: apply force about 2 stages ahead of the desired tooth position.

Act 3: The Agent Beats SLERP By Eliminating Failures

After the SFT → GRPO → RFT chain, on 250 held-out Tsinghua patients the agent never saw:

Policy	Mean reward	Std	Left-tail (< 0.6)
SLERP baseline	0.72	0.12	~25 %
OrthoRL trained	0.86	0.04	~0 %

The trained agent doesn't just shift the average — it eliminates the catastrophic failures that drive the refinement trap. Tighter whiskers, no left tail, clinically realistic staging strategy choices.

Problem Statements Addressed

Primary: Theme 3.1 — World Modeling for Professional Tasks

OrthoRL is a textbook Theme 3.1 fit. The agent interacts with 8 real tools (diagnose, simulate, check collisions, commit, rollback, and measurements of crowding, overbite, and Angle class), maintains internal state across 24 steps, and updates its beliefs through tool calls and observations. The reward verifies professional task quality (Andrews' Six Keys, biomechanical compliance, treatment-strategy correctness, anchorage realism). The same initial state admits multiple valid trajectories with real clinical tradeoffs (for a Class II case: distalisation vs extraction vs Class II elastics).

Secondary: Theme 5 — Wild Card

OrthoRL is, to our knowledge, the only RL environment with a non-Markov delayed-reward mechanism grounded in human biology: pharmacokinetic force decay. Other envs use delayed rewards as abstractions; OrthoRL models the actual 0–8 week bone-remodelling impulse response as a 5-tap convolution kernel [0.10, 0.30, 0.40, 0.15, 0.05] (Proffit Ch. 8 + Cattaneo 2005). The kernel peaks at its third tap, so most of a commanded movement is realised about 2 stages after it is applied. The agent must learn temporal credit assignment over a 2–3 stage horizon, the exact skill that reduces refinement rates in clinical practice.
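
To make the delay concrete, here is a minimal sketch of a causal convolution with the kernel quoted above. It assumes the first tap acts in the same stage as the command and that the kernel is applied to per-stage displacement; the function name is hypothetical and the real ForceDecayModel may differ.

```python
KERNEL = [0.10, 0.30, 0.40, 0.15, 0.05]  # kernel quoted above; taps sum to 1.0

def realised_displacement(planned):
    """Causal convolution: movement commanded at stage n is realised over stages n..n+4."""
    out = [0.0] * (len(planned) + len(KERNEL) - 1)
    for n, p in enumerate(planned):
        for k, w in enumerate(KERNEL):
            out[n + k] += p * w
    return out

# A single 0.25 mm command at stage 0 (assumed tap alignment: tap 0 = same stage)
impulse = realised_displacement([0.25])
peak_stage = max(range(len(impulse)), key=impulse.__getitem__)
```

Under these assumptions the largest share of the movement lands two stages after the command, which is why a plan that ignores the lag tends to fall behind its own schedule.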

Why dentistry, why now

OpenAI's GDPval benchmark measured frontier models against human experts on 44 economically-valuable expert occupations in the top-9 GDP-contributing US sectors. Dentistry is not on that list. OrthoRL is the training environment GDPval skipped — same head-to-head methodology (trained agent vs SLERP baseline on identical held-out patients), applied to a domain where AI automation is documented at ~80× planning speedup and reducing refinements saves ~20% of aligners per case.

How It Works

The loop

Agent receives observation:
  ┌─────────────────────────────────────────────┐
  │  28 teeth × [qw,qx,qy,qz, tx,ty,tz]         │
  │  Current poses  +  Target poses             │
  │  Per-tooth progress (0–100%)                │
  │  Stage number (1–24)                        │
  │  Reward breakdown from previous stage       │
  └─────────────────────────────────────────────┘
              │
              ▼
Agent emits an action (JSON):
  {
    "strategy": "anterior_first",
    "tooth_groups": [
      {"teeth": [11,12,21,22], "fraction": 0.6, "priority": "high"},
      {"teeth": [16,17,26,27], "fraction": 0.2, "priority": "low"}
    ]
  }
              │
              ▼
Environment applies the action:
  ┌─────────────────────────────────────────────┐
  │  1. Parse fractions → SLERP interpolation   │
  │  2. Enforce 0.25 mm / 2° per-stage limits   │
  │  3. Apply force-decay convolution           │
  │  4. Check inter-tooth mesh collisions       │
  │  5. Compute PDL stress feasibility          │
  │  6. Grade via 5 reward functions            │
  │  7. Return new observation + reward         │
  └─────────────────────────────────────────────┘
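
Steps 1 and 2 of the environment box can be sketched in plain Python as follows. This is an assumption-laden illustration: it caps the requested fraction so that one stage never exceeds 2° of rotation or 0.25 mm of translation, whereas the real environment may clamp differently. The function names are hypothetical.

```python
import math

MAX_ROT_DEG = 2.0    # per-stage rotation limit from the diagram
MAX_TRANS_MM = 0.25  # per-stage translation limit from the diagram

def slerp(q0, q1, t):
    """Spherical interpolation between unit quaternions (qw, qx, qy, qz)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:                      # take the shorter arc
        q1, dot = [-c for c in q1], -dot
    theta = math.acos(min(dot, 1.0))   # half of the rotation angle between the poses
    if theta < 1e-9:
        return list(q0)
    s = math.sin(theta)
    w0, w1 = math.sin((1 - t) * theta) / s, math.sin(t * theta) / s
    return [w0 * a + w1 * b for a, b in zip(q0, q1)]

def clamped_step(q_cur, t_cur, q_tgt, t_tgt, fraction):
    """Move `fraction` of the way to the target, capped at the per-stage limits."""
    dot = min(abs(sum(a * b for a, b in zip(q_cur, q_tgt))), 1.0)
    angle_deg = math.degrees(2.0 * math.acos(dot))
    dist = math.dist(t_cur, t_tgt)
    f_rot = fraction if angle_deg == 0 else min(fraction, MAX_ROT_DEG / angle_deg)
    f_tr = fraction if dist == 0 else min(fraction, MAX_TRANS_MM / dist)
    q_new = slerp(q_cur, q_tgt, f_rot)
    t_new = [c + f_tr * (g - c) for c, g in zip(t_cur, t_tgt)]
    return q_new, t_new

# Tooth needs 10 degrees about z and 1 mm along x; the agent asks for 60 % in one stage.
q_target = [math.cos(math.radians(5)), 0.0, 0.0, math.sin(math.radians(5))]
q_new, t_new = clamped_step([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0],
                            q_target, [1.0, 0.0, 0.0], fraction=0.6)
```

In this example the request is cut back to 2° and 0.25 mm, so an over-ambitious fraction costs the agent nothing physically but also buys no extra movement.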

What makes this different from existing AI-in-orthodontics work

Prior work	What it does	What OrthoRL adds
Li & Wang 2025 (Sci Rep) — only published RL paper	Coarse clinical decisions (extraction vs surgery)	Per-stage trajectory planning over 24 commits
CLIK-Diffusion 2025 (MedIA)	Diffusion model predicts the final alignment	The 24 intermediate stages between init and target
Dong & Chen 2025 (ICCV)	Transformer + collision constraints	Multi-step decision making with delayed reward
TAlignDiff 2024	Diffusion target prediction	A Gymnasium-compatible RL environment, not a one-shot model

Architecture

┌──────────────────────────────────────────────────────────────────┐
│                        OrthoRL System                            │
│                                                                  │
│  ┌─────────────┐     ┌──────────────────────────────────────┐    │
│  │  FastAPI    │     │     RL Environment (server/)         │    │
│  │  Server     │────▶│  DentalAlignerEnvironment            │    │
│  │  /reset     │     │    ├─ SyntheticCaseGenerator         │    │
│  │  /step      │     │    ├─ LandmarkLoader (real patients) │    │
│  │  /demo_run  │     │    ├─ ForceDecayModel (5-tap kernel) │    │
│  │  /docs      │     │    ├─ AdversarialJitter              │    │
│  └─────────────┘     │    └─ ClinicalProfileManager         │    │
│                      │                                      │    │
│  ┌─────────────┐     │  Grading Stack                       │    │
│  │  GRPO       │     │    ├─ AlignerGrader (4 components)   │    │
│  │  Trainer    │────▶│    ├─ OcclusionScorer (9 Andrews)    │    │
│  │  (embedded) │     │    ├─ PDLModel (Kelvin-Voigt)        │    │
│  │  Qwen2.5-3B │     │    ├─ CollisionDetector (mesh)       │    │
│  │  + LoRA r=16│     │    └─ RewardScaler ([-2, +8])        │    │
│  └─────────────┘     └──────────────────────────────────────┘    │
│  ┌─────────────┐     ┌──────────────────────────────────────┐    │
│  │  Datasets   │     │     Evaluation                       │    │
│  │  Tsinghua   │────▶│  3-tier held-out:                    │    │
│  │  1,063 pts  │     │    Tier 1: 250 Tsinghua test         │    │
│  │  OFJ 17 pts │     │    Tier 2: 17 Open-Full-Jaw          │    │
│  └─────────────┘     │    Tier 3: 40 Bits2Bites             │    │
│                      └──────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────────┘

Each server/ module is self-contained: it imports only from dental_constants and quaternion_utils (shared definitional constants and math), with no cross-module dependencies. The 227 passing tests serve as the contract.

Reward Engineering

OrthoRL uses 5 algorithmic reward functions — no LLM-as-judge. Every reward is deterministic given the seed.

#	Reward	Range	What it measures
1	Terminal	[-2, +8]	4-component score (final accuracy 40 % · smoothness 20 % · compliance 20 % · staging quality 20 %) scaled with hard-fail overrides for collisions / PDL stress
2	Andrews' Occlusion	[0, 1]	9 metrics from Andrews' Six Keys: molar relationship, overjet, overbite, crown angulation/inclination, rotations, contact tightness, curve of Spee, arch symmetry
3	Strategy	{0, 0.5, 1}	1.2× / 1.0× / 0.6× multiplier for optimal / neutral / contraindicated treatment strategy given diagnosis
4	Format	[0, 1]	Partial-credit JSON validity check across 5 sub-checks
5	Anchorage	[0, 1]	Empirical prior mined from 195 real patients (n=5,089 tooth-class observations) — penalises unrealistic molar displacement above the data-driven 90th percentile

Anti-hacking: five independent rewards (gaming one doesn't help with the others) + hard-fail overrides for collisions and PDL stress + sample completions logged every 10 steps for inspection + algorithmic-only grading (no prompt-injectable judge).
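
One way to read the terminal reward row is sketched below. The component weights come from the table; the linear map from a [0, 1] score onto [-2, +8] and the choice to pin hard failures to the floor are assumptions, and the function name is hypothetical rather than the grader's API.

```python
WEIGHTS = {"accuracy": 0.40, "smoothness": 0.20, "compliance": 0.20, "staging": 0.20}
R_MIN, R_MAX = -2.0, 8.0

def terminal_reward(components, collision=False, pdl_overstress=False):
    """Weighted 4-component score in [0, 1], mapped linearly onto [-2, +8]."""
    if collision or pdl_overstress:      # hard-fail override (assumed to pin the floor)
        return R_MIN
    score = sum(WEIGHTS[k] * components[k] for k in WEIGHTS)
    return R_MIN + (R_MAX - R_MIN) * score

example = terminal_reward(
    {"accuracy": 0.9, "smoothness": 0.8, "compliance": 1.0, "staging": 0.7}
)
```

With these toy component values the weighted score is 0.86, which lands at about 6.6 on the scaled range; a collision on the same plan would return the floor regardless of the other components.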

Training Signal

5-stage progressive pipeline

Stage 0: Format SFT     ──→  Model learns JSON output structure
         │
Stage 1: Tool-use SFT   ──→  Model learns tool-calling grammar
         │
Stage 2: BC SFT         ──→  Model imitates expert staging oracle
         │
Stage 3: GRPO           ──→  Model improves via 5 reward signals
         │
Stage 4: Rejection FT   ──→  Variance reduction on best completions

Each stage emits a checkpoint with a gate_passed.json that the next stage validates before starting — no stage runs on a broken predecessor. The interactive Training Stages dashboard renders per-stage curves in three click-to-switch panels.
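
A gate check of this kind could look roughly like the sketch below. Only the gate_passed.json filename comes from this README; the "passed" key and the function name are hypothetical, since the file's schema is not documented here.

```python
import json
from pathlib import Path

def assert_gate_passed(checkpoint_dir):
    """Refuse to start a stage unless the previous stage's gate file says it passed."""
    gate_file = Path(checkpoint_dir) / "gate_passed.json"
    if not gate_file.is_file():
        raise RuntimeError(f"missing gate file: {gate_file}")
    gate = json.loads(gate_file.read_text())
    if not gate.get("passed", False):    # "passed" is an assumed key, not a known schema
        raise RuntimeError(f"previous stage failed its gate: {gate}")
    return gate
```

Calling it at the top of each stage script, with the predecessor's checkpoint directory, makes a broken or missing predecessor fail fast instead of silently training on bad weights.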

Configuration

Parameter	Value
Base model	Qwen2.5-3B-Instruct (4-bit via Unsloth)
LoRA	r = 16, α = 32, target = q/k/v/o + gate/up/down_proj
GRPO steps	300
Generations per prompt (G)	4
Learning rate	5 × 10⁻⁶ (warmup + linear decay)
Reward range	[-2, +8]
Compute	Azure ML — ND96asr v4 (A100 80 GB)
Cost	~$3.50 / full pipeline run

Results

Headline: SLERP vs OrthoRL on 250 unseen patients

Policy	Mean reward	Std	Left-tail (< 0.6)
SLERP baseline	0.72	0.12	~25 %
OrthoRL trained	0.86	0.04	~0 %

Robustness — left-tail elimination

The grey distribution (SLERP) has a long left tail of refinement-trap failures. The green distribution (OrthoRL) is narrower, shifted right, and the entire left tail is gone. Robustness, not just average performance.
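
The three columns of the headline table can be computed from a list of per-patient rewards along these lines. The reward values below are made up for illustration, and using the population standard deviation is an assumption about how eval.py aggregates.

```python
from statistics import mean, pstdev

def summarise(rewards, tail_threshold=0.6):
    """Mean, std, and fraction of episodes below the left-tail threshold."""
    tail = sum(r < tail_threshold for r in rewards) / len(rewards)
    return {"mean": mean(rewards), "std": pstdev(rewards), "left_tail": tail}

toy = [0.45, 0.58, 0.70, 0.74, 0.80, 0.83, 0.85, 0.90]  # made-up values, not real results
stats = summarise(toy)
```

Reporting the left-tail fraction next to the mean is what separates a policy that is better on average from one that also stops failing badly.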

What the agent learned (from reward signal alone)

Anterior-first staging. Real orthodontists move incisors before molars; the trained agent does the same despite no rule saying so.
Empirical anchorage. Median molar displacement matches the 0.89 mm population median mined from 195 real treatments.
Diagnosis-to-strategy alignment. Class II cases preferentially get distalisation; Class III gets reverse-anchorage.
Force-lead pattern. The agent applies more force in early stages on hard cases — the same compensation real clinicians make for delayed bone response.

Per-stage diagnostics

Four plots: SFT loss collapse, GRPO per-component reward, GRPO loss + mean reward, and policy entropy (no mode collapse).

The full per-stage diagnostic picture lives in BLOG.md, with all 9 plots and stage-by-stage commentary.

Quick Start

git clone https://github.com/mehular0ra/orthorl
cd orthorl
uv sync
uv run python -m server.app          # FastAPI on :7860
make test                            # 227 tests pass in ~60 s

Open the dashboard in a browser at the local URL the server prints.

Try the live deployment

# Health probe
curl https://grimoors-orthorl.hf.space/health
# {"status":"healthy"}

# Reset to a real Tsinghua patient
curl -X POST https://grimoors-orthorl.hf.space/reset_stepwise \
  -H 'Content-Type: application/json' \
  -d '{"task_id":"tsinghua/0001","mode":"eval","tier":1,"eval_idx":0}'
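
The same reset call can be issued from Python with only the standard library. This sketch mirrors the curl payload above; the helper name is hypothetical, and treating the response body as JSON is an assumption (check /docs for the actual schema).

```python
import json
import urllib.request

BASE_URL = "https://grimoors-orthorl.hf.space"

def build_reset_request(task_id="tsinghua/0001", eval_idx=0):
    """Build the same POST /reset_stepwise request as the curl call above."""
    payload = {"task_id": task_id, "mode": "eval", "tier": 1, "eval_idx": eval_idx}
    return urllib.request.Request(
        f"{BASE_URL}/reset_stepwise",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_reset_request()
# To actually send it (needs network access):
# with urllib.request.urlopen(req, timeout=30) as resp:
#     observation = json.loads(resp.read())
```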

Training (with HF TRL + Unsloth)

# end-to-end on a GPU host (~3.5 h on A10G ≈ $3.50; ~8.5 h on T4 ≈ $3.40)
bash scripts/run_full_pipeline.sh

# or just one stage (re-run GRPO without re-running SFT)
STAGES="3" bash scripts/run_full_pipeline.sh

# evaluate against the locked SLERP baseline
uv run python eval.py --policy checkpoints/grpo --tier 1 --seeds 3
uv run python eval.py --policy checkpoints/rft  --tier 1 --seeds 3

# regenerate slide-ready plots
uv run python scripts/build_demo_plots.py
uv run python scripts/build_ablation_matrix.py

Colab driver: notebooks/colab_a10g.py.

Logging — what gets captured automatically

Artefact	Frequency	Content
`logs/run_<TS>.log`	streaming (tee)	every stdout line of every stage
`logs/grpo_samples.jsonl`	every 10 steps	step / loss / lr / per-reward-fn means / completion length
`logs/grpo_completions.jsonl`	every 10 steps	full completion text + slim breakdown
`checkpoints/grpo/trainer_state.json`	every TRL step	TRL log_history with all reward fns
`research/results.tsv`	once at training end	autolog row per multiautoresearch discipline
`assets/static/training/*.png`	once at training end	matplotlib plots — embedded in the dashboard
W&B run URL	every step	live cloud dashboard (auto-on if `WANDB_API_KEY` is set)

Deployment on HF Spaces

# Dockerfile uses python:3.11-slim + uv
# Serves OpenEnv FastAPI + dashboard on port 7860
huggingface-cli login                                    # paste a write-scoped token
git remote add hf https://huggingface.co/spaces/<user>/orthorl
git push hf main:main

The Space frontmatter lives in the YAML header at the top of this README (sdk: docker, app_port: 7860). Every push triggers an automatic Docker rebuild. The current live deployment is at https://huggingface.co/spaces/Grimoors/orthorl.

Project Structure

orthorl/
├── server/                      # OpenEnv FastAPI environment
│   ├── app.py                   #   create_fastapi_app + dashboard HTML
│   ├── dental_environment.py    #   StepwiseDentalEnvironment (24-step)
│   ├── grader.py                #   IMMUTABLE — terminal reward
│   ├── occlusion_scorer.py      #   Andrews' Six Keys (9 metrics)
│   ├── pdl_model.py             #   Kelvin-Voigt PDL biomechanics
│   ├── force_decay.py           #   5-tap pharmacokinetic kernel
│   ├── collision_detector.py    #   Oriented mesh / ellipsoid
│   ├── adversarial.py           #   Patient non-compliance
│   ├── landmark_loader.py       #   200 real Tsinghua patients
│   ├── clinical_profiles.py     #   1,063 patient profiles
│   ├── eval_split.py            #   Frozen 250 + 17 + 40 IDs
│   └── ...                      #   (31 modules, all self-contained)
├── scripts/                     # SFT / GRPO / RFT / eval drivers
├── tests/                       # 227 pytest cases
├── data/                        # SFT JSONL datasets
├── datasets/                    # Tsinghua + OFJ + Bits2Bites
├── assets/static/               # Dashboard plots + 3D HTMLs (deployed)
├── notebooks/colab_a10g.py      # One-cell Colab launcher
├── train_grpo.py                # Main GRPO entrypoint
├── eval.py                      # Held-out evaluation CLI
├── prepare.py                   # IMMUTABLE benchmark
├── BLOG.md                      # Full hackathon write-up
├── README.md                    # ← you are here
├── Dockerfile                   # python:3.11-slim + uv
├── pyproject.toml + uv.lock     # Locked deps
└── openenv.yaml                 # OpenEnv manifest

Key Design Decisions

Algorithmic rewards only. No LLM-as-judge anywhere in the reward chain. Every reward function is deterministic given the seed, eliminating prompt-injection reward hacking and making training reproducible.
Read-only benchmark. prepare.py and server/grader.py are immutable during experiments. Once a result is recorded in research/results.tsv, the benchmark cannot change retroactively.
One hypothesis per training run. Each completed run is appended to research/results.tsv (append-only); promotion to research/live/master.json requires beating the current master metric.
Three-tier held-out eval. 250 Tsinghua + 17 OFJ + 40 Bits2Bites IDs are frozen in server/eval_split.py, and EvalRegistry.assert_training_legal() raises on any leakage in train mode (CI-pinned). Tier 2 and Tier 3 are entirely different datasets — true generalisation, not just train/test split.
Module independence. Each server/ module imports only from dental_constants and quaternion_utils. No cross-module state. Makes the codebase testable (227 passing tests) and easy to reason about.
Embedded environment for training. During GRPO, the env runs in-process (no FastAPI hop). Episode results are cached per (completion, seed) so all 5 reward functions share one rollout instead of paying for it 5 times.

Tests

make test                # 227 passing in ~60 s
make fast-check          # high-value subset (<5 s); pre-commit hook
make install-precommit   # repo-local hook so the next regression doesn't ship

Limitations

The PDL biomechanical model is simplified (Kelvin-Voigt spring, not full FEA).
Collision detection has both ellipsoid (fast) and mesh-based (accurate) modes; the synthetic case generator uses ellipsoids only.
The dataset has crown landmarks but not root segmentation — root-resorption risk can only be approximated.
Single-agent MDP; real treatment planning involves clinician-patient negotiation and multi-visit feedback.

References

Wang et al. (2024). A 3D dental model dataset with pre/post-orthodontic treatment for automatic tooth alignment. Nature Scientific Data 11:1277. DOI: 10.1038/s41597-024-04138-7
Andrews LF (1972). The six keys to normal occlusion. Am J Orthod 62(3):296–309.
Proffit WR. Contemporary Orthodontics 6th ed., Ch. 8 — bone-remodelling timeline.
Cattaneo PM et al. (2005). Moment-to-force ratio, centre of rotation, and force level. Am J Orthod Dentofacial Orthop.
Shao Z et al. (2024). DeepSeekMath (GRPO). arXiv:2402.03300
Yuan Z et al. (2023). Scaling Relationship on Learning Mathematical Reasoning with Large Language Models (RFT). arXiv:2308.01825
Patwardhan S et al. (2025). GDPval. arXiv:2510.04374
Grand View Research (2025). Clear Aligners Market Report — $8.29 B market, projected $56.81 B by 2033.

License

MIT. Patient data inherits the original Tsinghua / Open-Full-Jaw / Bits2Bites licenses.
