---
title: OrthoRL — Orthodontic Treatment Planning Environment
emoji: 🦷
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
  - openenv
  - dental
  - orthodontics
  - reinforcement-learning
  - se3
  - medical-ai
---

🦷 OrthoRL — battisiBot v2

Can an LLM agent learn to plan orthodontic aligner treatment, stage by stage, under delayed bone biomechanics?

The first reinforcement-learning environment for per-stage clear-aligner trajectory planning. 24 sequential decisions. 28 teeth in SE(3) space. 1,063 real patient profiles from Beijing Stomatological Hospital. Five algorithmic reward functions. Zero LLM-as-judge.



OpenEnv Hackathon India 2026 · Theme 3.1 · Theme 5


OrthoRL dashboard hero

The live dashboard. Drag the 3D ellipsoids to inspect a real Tsinghua patient's pre-treatment and post-treatment tooth poses.


⚡ TL;DR

Clear-aligner planning is an $8 B → $56 B market with a brutal failure mode: only 6% of patients finish on the original plan, and 1 in 6 switches to braces because the digital plan never tracked. The mechanism is structural — planners draw straight-line SLERP paths between tooth poses, but bone remodelling is delayed and biological, so teeth lose tracking and plans require mid-course corrections.

OrthoRL is the first reinforcement-learning environment for per-stage orthodontic aligner trajectory planning. An LLM agent diagnoses the case, picks a treatment strategy, and plans 24 sequential aligner stages for 28 teeth in SE(3) space under a pharmacokinetic force decay model that delays the actual displacement by ~2 stages. Five algorithmic reward functions. Zero LLM-as-judge.

We trained Qwen2.5-3B end-to-end through a 5-stage pipeline (3 SFT sub-stages → GRPO → RFT) on 1,063 real Tsinghua patients. Result: mean reward rises from 0.72 (SLERP baseline) to 0.86 on 250 unseen patients, and the left tail of failures (reward < 0.6) drops from ~25% to ~0%.


🔗 Quick links

🌐 Live HF Space https://grimoors-orthorl.hf.space
🤗 Repo on Hugging Face https://huggingface.co/spaces/Grimoors/orthorl
🤖 Trained model (LoRA adapter) https://huggingface.co/sri-manikanta/orthorl-grpo
📺 Demo video https://youtu.be/MK6gfw3ZKz0
📖 Full blog BLOG.md
📊 OpenAPI docs https://grimoors-orthorl.hf.space/docs
❤️ Health probe https://grimoors-orthorl.hf.space/health
📦 GitHub https://github.com/mehular0ra/orthorl

The Story: From Straight Lines to Surgical Precision

Act 1: The Refinement Trap

Every year 12 million patients start clear-aligner treatment. Today's planning software defines a target occlusion and runs straight-line interpolation (SLERP) from the patient's current poses to the target. It looks beautiful in software. It breaks in the mouth.

Why? Bone is not glass. The periodontal ligament responds over weeks. Forces applied at stage N don't produce their full effect until stage N+2 or N+3 (Proffit, Contemporary Orthodontics Ch. 8). A trajectory that looks perfect on a CAD viewer overshoots by ~30% on hard cases. Teeth lose tracking. Plans need refinement scans. Only 6% of patients finish on the original plan. 1 in 6 switches to braces.

That gap is why OrthoRL exists.

Act 2: A Verifiable Long-Horizon RL Problem

Two years ago you couldn't have built this. Three things had to land first:

  1. Real patient data at scale. Wang et al.'s 2024 Nature Scientific Data release of 1,063 Tsinghua patients with pre- and post-treatment tooth landmarks. We use 200 of them with 8 crown landmarks per tooth in FDI notation as the SE(3) ground truth.
  2. A verifiable reward. Andrews' Six Keys to Normal Occlusion (1972) gives 9 algorithmic geometric metrics computed directly from SE(3) poses. Combining these with Kelvin-Voigt periodontal-ligament biomechanics, mesh-based collision detection, and an empirical anchorage prior mined from the data itself yields five independent reward functions, all deterministic given the seed.
  3. Group-Relative Policy Optimisation. GRPO eliminates the value network and uses group-relative baselines, making it tractable to train an LLM agent through 24-step rollouts at 4 generations per prompt.
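The group-relative baseline in item 3 can be illustrated with a minimal sketch. It assumes each of the G = 4 completions for a prompt has already been scored with one scalar reward, and it normalises with the population standard deviation; `group_relative_advantages` and the reward values are hypothetical and are not taken from the OrthoRL training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each completion relative to the other samples for the same prompt.

    No value network is involved: the group mean acts as the baseline and the
    group standard deviation rescales the signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some trainers use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, G = 4 sampled completions (illustrative rewards, not real results)
advantages = group_relative_advantages([0.62, 0.71, 0.80, 0.87])
```

Completions that beat their group mean get a positive advantage and the rest get a negative one, so the update only needs rewards for the sampled group.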

OrthoRL puts an LLM agent in the orthodontist's chair: diagnose the case, pick a strategy, plan 24 aligner stages, get penalised for collisions and over-anchored molars, get rewarded for clinically realistic incisor-first staging, and learn the lead pattern that real orthodontists apply intuitively (apply force 2 stages ahead of the desired tooth position).

Act 3: The Agent Beats SLERP By Eliminating Failures

After the SFT → GRPO → RFT chain, on 250 held-out Tsinghua patients the agent never saw:

| Policy | Mean reward | Std | Left-tail (< 0.6) |
|---|---|---|---|
| SLERP baseline | 0.72 | 0.12 | ~25 % |
| OrthoRL trained | 0.86 | 0.04 | ~0 % |

The trained agent doesn't just shift the average — it eliminates the catastrophic failures that drive the refinement trap. Tighter whiskers, no left tail, clinically realistic staging strategy choices.


Problem Statements Addressed

Primary: Theme 3.1 — World Modeling for Professional Tasks

OrthoRL is a textbook Theme 3.1 fit. The agent interacts with real tools (8 of them: diagnose, simulate, check collisions, commit, rollback, and measure crowding/overbite/Angle class), maintains internal state across 24 steps, and updates its beliefs through tool calls and observations. The reward verifies professional task quality (Andrews' Six Keys, biomechanical compliance, treatment-strategy correctness, anchorage realism). The same initial state admits multiple valid trajectories with real clinical tradeoffs (for example, a Class II case can be treated by distalisation, extraction, or Class II elastics).

Secondary: Theme 5 — Wild Card

With pharmacokinetic force decay, OrthoRL is, to our knowledge, the only RL environment with a non-Markov delayed-reward mechanism grounded in human biology. Other envs use delayed rewards as abstractions; OrthoRL models the actual 0–8 week bone-remodelling impulse response as a 5-tap convolution kernel [0.10, 0.30, 0.40, 0.15, 0.05] (Proffit Ch. 8 + Cattaneo 2005). The agent must learn temporal credit assignment over a 2–3 stage horizon, the exact skill that reduces refinement rates in clinical practice.
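To see why the kernel produces a delay of roughly two stages, here is a small sketch of a causal convolution over per-stage commanded displacement. Only the kernel values come from this README; how `server/force_decay.py` actually applies them is an assumption, and `realised_displacement` is a hypothetical name.

```python
# Kernel values quoted above; index = number of stages since the force was applied
KERNEL = [0.10, 0.30, 0.40, 0.15, 0.05]


def realised_displacement(commanded):
    """Causal convolution: what actually moves at stage t is a weighted sum of
    the displacement commanded at stages t, t-1, ..., t-4."""
    realised = []
    for t in range(len(commanded)):
        total = 0.0
        for lag, weight in enumerate(KERNEL):
            if t - lag >= 0:
                total += weight * commanded[t - lag]
        realised.append(total)
    return realised


# A single unit of movement commanded at stage 0 and nothing afterwards
impulse = [1.0] + [0.0] * 7
response = realised_displacement(impulse)
peak_stage = response.index(max(response))  # stage with the largest realised movement
```

The impulse response simply reproduces the kernel, so the largest share of a commanded movement arrives two stages after it was requested. That is the lag the agent has to plan around.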

Why dentistry, why now

OpenAI's GDPval benchmark measured frontier models against human experts on 44 economically valuable occupations in the top 9 GDP-contributing US sectors. Dentistry is not on that list. OrthoRL is the training environment GDPval skipped: the same head-to-head methodology (trained agent vs SLERP baseline on identical held-out patients), applied to a domain where AI automation is documented at ~80× planning speedup and where reducing refinements saves ~20% of aligners per case.


How It Works

The loop

Agent receives observation:
  ┌─────────────────────────────────────────────┐
  │  28 teeth × [qw,qx,qy,qz, tx,ty,tz]         │
  │  Current poses  +  Target poses             │
  │  Per-tooth progress (0–100%)                │
  │  Stage number (1–24)                        │
  │  Reward breakdown from previous stage       │
  └─────────────────────────────────────────────┘
              │
              ▼
Agent emits an action (JSON):
  {
    "strategy": "anterior_first",
    "tooth_groups": [
      {"teeth": [11,12,21,22], "fraction": 0.6, "priority": "high"},
      {"teeth": [16,17,26,27], "fraction": 0.2, "priority": "low"}
    ]
  }
              │
              ▼
Environment applies the action:
  ┌─────────────────────────────────────────────┐
  │  1. Parse fractions → SLERP interpolation   │
  │  2. Enforce 0.25 mm / 2° per-stage limits   │
  │  3. Apply force-decay convolution           │
  │  4. Check inter-tooth mesh collisions       │
  │  5. Compute PDL stress feasibility          │
  │  6. Grade via 5 reward functions            │
  │  7. Return new observation + reward         │
  └─────────────────────────────────────────────┘
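Steps 1 and 2 of the loop (SLERP interpolation, then the 0.25 mm / 2° per-stage limits) can be sketched for a single tooth as follows. This is an illustration under assumptions: quaternions are unit length in `[qw, qx, qy, qz]` order as in the observation, and the limit is enforced by shrinking the requested fraction until both bounds hold. The environment may clamp differently, and `limited_step` is a hypothetical helper.

```python
import math


def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions [qw, qx, qy, qz]."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:  # q and -q are the same rotation; take the shorter arc
        q1 = [-c for c in q1]
        dot = -dot
    theta = math.acos(min(1.0, dot))
    if theta < 1e-9:  # orientations already coincide
        return list(q0)
    s = math.sin(theta)
    w0 = math.sin((1.0 - t) * theta) / s
    w1 = math.sin(t * theta) / s
    return [w0 * a + w1 * b for a, b in zip(q0, q1)]


def rotation_angle_deg(q0, q1):
    """Angle of the rotation that takes q0 to q1, in degrees."""
    dot = abs(sum(a * b for a, b in zip(q0, q1)))
    return math.degrees(2.0 * math.acos(min(1.0, dot)))


def limited_step(q_cur, t_cur, q_tgt, t_tgt, fraction, max_mm=0.25, max_deg=2.0):
    """Move `fraction` of the way to the target, reduced so that neither the
    translation nor the rotation limit is exceeded in this stage."""
    f = max(0.0, min(1.0, fraction))
    dist = math.dist(t_cur, t_tgt)
    angle = rotation_angle_deg(q_cur, q_tgt)
    if dist * f > max_mm:
        f = max_mm / dist
    if angle * f > max_deg:
        f = max_deg / angle
    q_new = slerp(q_cur, q_tgt, f)
    t_new = [a + f * (b - a) for a, b in zip(t_cur, t_tgt)]
    return q_new, t_new, f


# Tooth needs 1 mm of translation and a 10 degree rotation about z; agent asks for 60 %
half_angle = math.radians(5.0)
q_start = [1.0, 0.0, 0.0, 0.0]
q_goal = [math.cos(half_angle), 0.0, 0.0, math.sin(half_angle)]
q_new, t_new, used = limited_step(q_start, [0.0, 0.0, 0.0], q_goal, [1.0, 0.0, 0.0], 0.6)
```

In this example the rotation limit is the binding one, so the requested 60 % is cut to 20 % of the remaining path (0.2 mm and 2°).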

What makes this different from existing AI-in-orthodontics work

| Prior work | What it does | What OrthoRL adds |
|---|---|---|
| Li & Wang 2025 (Sci Rep), only published RL paper | Coarse clinical decisions (extraction vs surgery) | Per-stage trajectory planning over 24 commits |
| CLIK-Diffusion 2025 (MedIA) | Diffusion model predicts the final alignment | The 24 intermediate stages between init and target |
| Dong & Chen 2025 (ICCV) | Transformer + collision constraints | Multi-step decision making with delayed reward |
| TAlignDiff 2024 | Diffusion target prediction | A Gymnasium-compatible RL environment, not a one-shot model |

Architecture

┌──────────────────────────────────────────────────────────────────┐
│                        OrthoRL System                            │
│                                                                  │
│  ┌─────────────┐     ┌──────────────────────────────────────┐    │
│  │  FastAPI    │     │     RL Environment (server/)         │    │
│  │  Server     │────▶│  DentalAlignerEnvironment            │    │
│  │  /reset     │     │    ├─ SyntheticCaseGenerator         │    │
│  │  /step      │     │    ├─ LandmarkLoader (real patients) │    │
│  │  /demo_run  │     │    ├─ ForceDecayModel (5-tap kernel) │    │
│  │  /docs      │     │    ├─ AdversarialJitter              │    │
│  └─────────────┘     │    └─ ClinicalProfileManager         │    │
│                      │                                      │    │
│  ┌─────────────┐     │  Grading Stack                       │    │
│  │  GRPO       │     │    ├─ AlignerGrader (4 components)   │    │
│  │  Trainer    │────▶│    ├─ OcclusionScorer (9 Andrews)    │    │
│  │  (embedded) │     │    ├─ PDLModel (Kelvin-Voigt)        │    │
│  │  Qwen2.5-3B │     │    ├─ CollisionDetector (mesh)       │    │
│  │  + LoRA r=16│     │    └─ RewardScaler ([-2, +8])        │    │
│  └─────────────┘     └──────────────────────────────────────┘    │
│  ┌─────────────┐     ┌──────────────────────────────────────┐    │
│  │  Datasets   │     │     Evaluation                       │    │
│  │  Tsinghua   │────▶│  3-tier held-out:                    │    │
│  │  1,063 pts  │     │    Tier 1: 250 Tsinghua test         │    │
│  │  OFJ 17 pts │     │    Tier 2: 17 Open-Full-Jaw          │    │
│  └─────────────┘     │    Tier 3: 40 Bits2Bites             │    │
│                      └──────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────────┘

Each server/ module is self-contained — imports only from dental_constants and quaternion_utils (shared definitional constants and math). No cross-module dependencies. 227 passing tests as the contract.


Reward Engineering

OrthoRL uses 5 algorithmic reward functions and no LLM-as-judge. Every reward is deterministic given the seed.

| # | Reward | Range | What it measures |
|---|---|---|---|
| 1 | Terminal | [-2, +8] | 4-component score (final accuracy 40 % · smoothness 20 % · compliance 20 % · staging quality 20 %) scaled with hard-fail overrides for collisions / PDL stress |
| 2 | Andrews' Occlusion | [0, 1] | 9 metrics from Andrews' Six Keys: molar relationship, overjet, overbite, crown angulation/inclination, rotations, contact tightness, curve of Spee, arch symmetry |
| 3 | Strategy | {0, 0.5, 1} | 1.2× / 1.0× / 0.6× multiplier for optimal / neutral / contraindicated treatment strategy given diagnosis |
| 4 | Format | [0, 1] | Partial-credit JSON validity check across 5 sub-checks |
| 5 | Anchorage | [0, 1] | Empirical prior mined from 195 real patients (n=5,089 tooth-class observations); penalises unrealistic molar displacement above the data-driven 90th percentile |

Anti-hacking: five independent rewards (gaming one doesn't help with the others) + hard-fail overrides for collisions and PDL stress + sample completions logged every 10 steps for inspection + algorithmic-only grading (no prompt-injectable judge).
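As a rough illustration of how the terminal reward could be assembled, the sketch below combines the four weighted components from the table and maps the result onto [-2, +8]. The weights and range come from this README. The linear mapping, the [0, 1] component scale, and returning the floor on a hard fail are assumptions; `terminal_reward` is a hypothetical function, not the contents of `server/grader.py`.

```python
# Component weights from the reward table
WEIGHTS = {
    "final_accuracy": 0.40,
    "smoothness": 0.20,
    "compliance": 0.20,
    "staging_quality": 0.20,
}


def terminal_reward(components, collision=False, pdl_overstress=False,
                    low=-2.0, high=8.0):
    """Weighted 4-component score, assumed to lie in [0, 1], rescaled to [low, high].

    Assumption: a collision or PDL over-stress overrides the score with the
    minimum reward; the real override rule is not described in this README.
    """
    if collision or pdl_overstress:
        return low
    score = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return low + (high - low) * score


perfect = terminal_reward({name: 1.0 for name in WEIGHTS})
mediocre = terminal_reward({name: 0.5 for name in WEIGHTS})
collided = terminal_reward({name: 1.0 for name in WEIGHTS}, collision=True)
```

Under these assumptions a plan that scores well on every component still receives the minimum reward if it collides, which is what keeps a single component from being gamed.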


Training Signal

5-stage progressive pipeline

Stage 0: Format SFT     ──→  Model learns JSON output structure
         │
Stage 1: Tool-use SFT   ──→  Model learns tool-calling grammar
         │
Stage 2: BC SFT         ──→  Model imitates expert staging oracle
         │
Stage 3: GRPO           ──→  Model improves via 5 reward signals
         │
Stage 4: Rejection FT   ──→  Variance reduction on best completions

Each stage emits a checkpoint with a gate_passed.json that the next stage validates before starting — no stage runs on a broken predecessor. The interactive Training Stages dashboard renders per-stage curves in three click-to-switch panels.
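A gate check of this kind could look like the sketch below. The file name `gate_passed.json` is from this README, but its schema is not documented here, so the `"passed": true` field, the directory name, and `gate_is_open` are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def gate_is_open(checkpoint_dir):
    """Return True only if the previous stage left a readable, passing gate file."""
    gate_file = Path(checkpoint_dir) / "gate_passed.json"
    if not gate_file.is_file():
        return False
    try:
        payload = json.loads(gate_file.read_text())
    except json.JSONDecodeError:
        return False  # a corrupt gate file counts as a failed gate
    # Assumption: the gate file is a JSON object with a boolean "passed" field
    return isinstance(payload, dict) and payload.get("passed") is True


# Demo in a throwaway directory standing in for a stage checkpoint
with tempfile.TemporaryDirectory() as tmp:
    stage_dir = Path(tmp) / "stage0_format_sft"
    stage_dir.mkdir()
    before = gate_is_open(stage_dir)  # no gate file written yet
    (stage_dir / "gate_passed.json").write_text(json.dumps({"passed": True}))
    after = gate_is_open(stage_dir)
```

Treating a missing or unreadable file as a closed gate means a crashed stage can never be mistaken for a successful one.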

Configuration

| Parameter | Value |
|---|---|
| Base model | Qwen2.5-3B-Instruct (4-bit via Unsloth) |
| LoRA | r = 16, α = 32, target = q/k/v/o + gate/up/down_proj |
| GRPO steps | 300 |
| Generations per prompt (G) | 4 |
| Learning rate | 5 × 10⁻⁶ (warmup + linear decay) |
| Reward range | [-2, +8] |
| Compute | Azure ML, ND96asr v4 (A100 80 GB) |
| Cost | ~$3.50 / full pipeline run |

Results

Headline: SLERP vs OrthoRL on 250 unseen patients

SLERP vs OrthoRL

| Policy | Mean reward | Std | Left-tail (< 0.6) |
|---|---|---|---|
| SLERP baseline | 0.72 | 0.12 | ~25 % |
| OrthoRL trained | 0.86 | 0.04 | ~0 % |

Robustness — left-tail elimination

Per-patient reward distribution shift

The grey distribution (SLERP) has a long left tail of refinement-trap failures. The green distribution (OrthoRL) is narrower, shifted right, and the entire left tail is gone. Robustness, not just average performance.
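The three columns of the headline table (mean, std, left-tail share below 0.6) can be computed from a list of per-patient rewards as in this sketch. The reward values are made-up placeholders, not the project's evaluation data, and using the sample standard deviation is an assumption; `summarise` is a hypothetical helper rather than part of `eval.py`.

```python
from statistics import mean, stdev


def summarise(rewards, tail_threshold=0.6):
    """Mean, sample standard deviation, and share of patients below the threshold."""
    left_tail = sum(1 for r in rewards if r < tail_threshold) / len(rewards)
    return {"mean": mean(rewards), "std": stdev(rewards), "left_tail": left_tail}


# Placeholder per-patient rewards for two policies (illustrative only)
baseline_like = [0.45, 0.55, 0.70, 0.75, 0.80, 0.82, 0.85, 0.88]
trained_like = [0.80, 0.82, 0.84, 0.86, 0.86, 0.88, 0.90, 0.92]

baseline_stats = summarise(baseline_like)
trained_stats = summarise(trained_like)
```

Reporting the left-tail share next to the mean is what separates a policy that is better on average from one that has also stopped failing badly on hard cases.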

What the agent learned (from reward signal alone)

  • Anterior-first staging. Real orthodontists move incisors before molars; the trained agent does the same despite no rule saying so.
  • Empirical anchorage. Median molar displacement matches the 0.89 mm population median mined from 195 real treatments.
  • Diagnosis-to-strategy alignment. Class II cases preferentially get distalisation; Class III gets reverse-anchorage.
  • Force-lead pattern. The agent applies more force in early stages on hard cases — the same compensation real clinicians make for delayed bone response.

Per-stage diagnostics

[Plots: SFT loss collapse · GRPO per-component reward · GRPO loss + mean reward · Policy entropy (no mode collapse)]

The full per-stage diagnostic picture lives in the BLOG.md with all 9 plots and stage-by-stage commentary.


Quick Start

git clone https://github.com/mehular0ra/orthorl
cd orthorl
uv sync
uv run python -m server.app          # FastAPI on :7860
make test                            # 227 tests pass in ~60 s

Open the dashboard in a browser at the local URL the server prints.

Try the live deployment

# Health probe
curl https://grimoors-orthorl.hf.space/health
# {"status":"healthy"}

# Reset to a real Tsinghua patient
curl -X POST https://grimoors-orthorl.hf.space/reset_stepwise \
  -H 'Content-Type: application/json' \
  -d '{"task_id":"tsinghua/0001","mode":"eval","tier":1,"eval_idx":0}'
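The same reset call can be prepared from Python using only the standard library. This sketch mirrors the curl command above: the URL, path, and payload fields are copied from it. The response schema is not documented here, so the request is built but not sent or parsed; `build_reset_request` is a hypothetical helper.

```python
import json
import urllib.request

BASE_URL = "https://grimoors-orthorl.hf.space"


def build_reset_request(task_id="tsinghua/0001", mode="eval", tier=1, eval_idx=0):
    """Build the POST /reset_stepwise request with the same JSON body as the curl example."""
    body = json.dumps(
        {"task_id": task_id, "mode": mode, "tier": tier, "eval_idx": eval_idx}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/reset_stepwise",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_reset_request()

# To actually send it (requires network access to the live Space):
# with urllib.request.urlopen(request, timeout=30) as response:
#     observation = json.load(response)
```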

Training (with HF TRL + Unsloth)

# end-to-end on a GPU host (~3.5 h on A10G ≈ $3.50; ~8.5 h on T4 ≈ $3.40)
bash scripts/run_full_pipeline.sh

# or just one stage (re-run GRPO without re-running SFT)
STAGES="3" bash scripts/run_full_pipeline.sh

# evaluate against the locked SLERP baseline
uv run python eval.py --policy checkpoints/grpo --tier 1 --seeds 3
uv run python eval.py --policy checkpoints/rft  --tier 1 --seeds 3

# regenerate slide-ready plots
uv run python scripts/build_demo_plots.py
uv run python scripts/build_ablation_matrix.py

Colab driver: notebooks/colab_a10g.py.

Logging — what gets captured automatically

| Artefact | Frequency | Content |
|---|---|---|
| logs/run_<TS>.log | streaming (tee) | every stdout line of every stage |
| logs/grpo_samples.jsonl | every 10 steps | step / loss / lr / per-reward-fn means / completion length |
| logs/grpo_completions.jsonl | every 10 steps | full completion text + slim breakdown |
| checkpoints/grpo/trainer_state.json | every TRL step | TRL log_history with all reward fns |
| research/results.tsv | once at training end | autolog row per multiautoresearch discipline |
| assets/static/training/*.png | once at training end | matplotlib plots, embedded in the dashboard |
| W&B run URL | every step | live cloud dashboard (auto-on if WANDB_API_KEY is set) |

Deployment on HF Spaces

# Dockerfile uses python:3.11-slim + uv
# Serves OpenEnv FastAPI + dashboard on port 7860
huggingface-cli login                                    # paste a write-scoped token
git remote add hf https://huggingface.co/spaces/<user>/orthorl
git push hf main:main

The Space frontmatter lives in the YAML header at the top of this README (sdk: docker, app_port: 7860). Every push triggers an automatic Docker rebuild. The current live deployment is at https://huggingface.co/spaces/Grimoors/orthorl.


Project Structure

orthorl/
├── server/                      # OpenEnv FastAPI environment
│   ├── app.py                   #   create_fastapi_app + dashboard HTML
│   ├── dental_environment.py    #   StepwiseDentalEnvironment (24-step)
│   ├── grader.py                #   IMMUTABLE — terminal reward
│   ├── occlusion_scorer.py      #   Andrews' Six Keys (9 metrics)
│   ├── pdl_model.py             #   Kelvin-Voigt PDL biomechanics
│   ├── force_decay.py           #   5-tap pharmacokinetic kernel
│   ├── collision_detector.py    #   Oriented mesh / ellipsoid
│   ├── adversarial.py           #   Patient non-compliance
│   ├── landmark_loader.py       #   200 real Tsinghua patients
│   ├── clinical_profiles.py     #   1,063 patient profiles
│   ├── eval_split.py            #   Frozen 250 + 17 + 40 IDs
│   └── ...                      #   (31 modules, all self-contained)
├── scripts/                     # SFT / GRPO / RFT / eval drivers
├── tests/                       # 227 pytest cases
├── data/                        # SFT JSONL datasets
├── datasets/                    # Tsinghua + OFJ + Bits2Bites
├── assets/static/               # Dashboard plots + 3D HTMLs (deployed)
├── notebooks/colab_a10g.py      # One-cell Colab launcher
├── train_grpo.py                # Main GRPO entrypoint
├── eval.py                      # Held-out evaluation CLI
├── prepare.py                   # IMMUTABLE benchmark
├── BLOG.md                      # Full hackathon write-up
├── README.md                    # ← you are here
├── Dockerfile                   # python:3.11-slim + uv
├── pyproject.toml + uv.lock     # Locked deps
└── openenv.yaml                 # OpenEnv manifest

Key Design Decisions

  1. Algorithmic rewards only. No LLM-as-judge anywhere in the reward chain. Every reward function is deterministic given the seed, eliminating prompt-injection reward hacking and making training reproducible.
  2. Read-only benchmark. prepare.py and server/grader.py are immutable during experiments. Once a result is recorded in research/results.tsv, the benchmark cannot change retroactively.
  3. One hypothesis per training run. Each completed run is appended to research/results.tsv (append-only); promotion to research/live/master.json requires beating the current master metric.
  4. Three-tier held-out eval. 250 Tsinghua + 17 OFJ + 40 Bits2Bites IDs are frozen in server/eval_split.py, and EvalRegistry.assert_training_legal() raises on any leakage in train mode (CI-pinned). Tier 2 and Tier 3 are entirely different datasets — true generalisation, not just train/test split.
  5. Module independence. Each server/ module imports only from dental_constants and quaternion_utils. No cross-module state. Makes the codebase testable (227 passing tests) and easy to reason about.
  6. Embedded environment for training. During GRPO, the env runs in-process (no FastAPI hop). Episode results are cached per (completion, seed) so all 5 reward functions share one rollout instead of paying for it 5 times.

Tests

make test                # 227 passing in ~60 s
make fast-check          # high-value subset (<5 s); pre-commit hook
make install-precommit   # repo-local hook so the next regression doesn't ship

Limitations

  • The PDL biomechanical model is simplified (Kelvin-Voigt spring, not full FEA).
  • Collision detection has both ellipsoid (fast) and mesh-based (accurate) modes; the synthetic case generator uses ellipsoids only.
  • The dataset has crown landmarks but not root segmentation — root-resorption risk can only be approximated.
  • Single-agent MDP; real treatment planning involves clinician-patient negotiation and multi-visit feedback.

References

  1. Wang et al. (2024). A 3D dental model dataset with pre/post-orthodontic treatment for automatic tooth alignment. Nature Scientific Data 11:1277. DOI: 10.1038/s41597-024-04138-7
  2. Andrews LF (1972). The six keys to normal occlusion. Am J Orthod 62(3):296–309.
  3. Proffit WR. Contemporary Orthodontics 6th ed., Ch. 8 — bone-remodelling timeline.
  4. Cattaneo PM et al. (2005). Moment-to-force ratio, centre of rotation, and force level. Am J Orthod Dentofacial Orthop.
  5. Shao Z et al. (2024). DeepSeekMath (GRPO). arXiv:2402.03300
  6. Yuan Z et al. (2023). Scaling Relationship on Learning Mathematical Reasoning with Large Language Models (RFT). arXiv:2308.01825
  7. Patwardhan S et al. (2025). GDPval. arXiv:2510.04374
  8. Grand View Research (2025). Clear Aligners Market Report — $8.29 B market, projected $56.81 B by 2033.

License

MIT. Patient data inherits the original Tsinghua / Open-Full-Jaw / Bits2Bites licenses.
