Reproduction of Ben Burtenshaw's HuggingFace fine-tuning challenge (Claude Code vs Codex). Fine-tuned using SFT on the solutions_py subset of open-r1/codeforces-cots.
| Model | Score | Problems Passed |
|---|---|---|
| Base (Qwen3-0.6B) | 40.24% | 66/164 |
| Fine-tuned | 40.85% | 67/164 |
| Improvement | +0.61 points | +1 problem |
Same script, same hardware, 3 parallel runs:
| Run | Score | Result |
|---|---|---|
| 1 | 40.85% | Win (+1) |
| 2 | 40.24% | Tie |
| 3 | 39.63% | Loss (-1) |
Fine-tuning is stochastic: three identical runs landed one problem above, level with, and one problem below the base model. Expect to need multiple attempts.
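To put the tables above in perspective, here is a minimal sketch of the arithmetic behind them. The per-run pass counts (67, 66, 65) are inferred from the reported percentages, and the binomial standard error is only a rough yardstick that assumes the 164 problems behave like independent trials; it is not part of the original evaluation.

```python
import math

N_PROBLEMS = 164  # number of HumanEval problems, as in the tables above


def pass_rate(passed, total=N_PROBLEMS):
    """Return the pass rate as a percentage."""
    return 100.0 * passed / total


def binomial_std_error(passed, total=N_PROBLEMS):
    """Rough standard error of the pass rate, in percentage points.

    Assumption: each problem is treated as an independent Bernoulli trial.
    """
    p = passed / total
    return 100.0 * math.sqrt(p * (1.0 - p) / total)


base = pass_rate(66)
# Pass counts inferred from the reported scores of runs 1, 2 and 3.
for run, passed in enumerate((67, 66, 65), start=1):
    score = pass_rate(passed)
    print(f"run {run}: {score:.2f}% ({score - base:+.2f} points vs base)")

print(f"one problem is worth {pass_rate(1):.2f} points")
print(f"rough std error at the base score: {binomial_std_error(66):.2f} points")
```

Under that assumption the standard error comes out at roughly 3.8 points, several times larger than the 0.61-point difference a single problem makes, which is consistent with the win/tie/loss spread across the three runs.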
The default codeforces-cots dataset is ~90% C++. Training on it for a Python benchmark (HumanEval) hurt performance in early attempts. Using the solutions_py subset doubled the baseline from ~18% to 40%.
Domain alignment > data quantity.
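Selecting the Python-only data could look something like the sketch below. It assumes the subset is exposed as the `solutions_py` config with a `train` split, and that the 500 examples are simply the first 500 rows; the card does not say how the examples were chosen, so `split_spec` and `load_python_subset` are hypothetical helper names, not the author's code.

```python
def split_spec(n_examples):
    """Build a split string selecting the first n_examples training rows."""
    return f"train[:{n_examples}]"


def load_python_subset(n_examples=500):
    """Load a slice of the Python-only subset of open-r1/codeforces-cots.

    Assumptions: config name "solutions_py", split "train", first-N selection.
    """
    # Imported lazily so split_spec can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "open-r1/codeforces-cots",
        "solutions_py",
        split=split_spec(n_examples),
    )
```

Calling `load_python_subset()` downloads the data from the Hub, so it needs network access and the `datasets` package.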
- open-r1/codeforces-cots (solutions_py subset) - 500 examples
- HF_TOKEN as a secret

hf jobs uv run \
--flavor a10g-small \
--timeout 14400 \
--secrets HF_TOKEN \
"https://huggingface.co/passagereptile455/training-scripts/resolve/main/train_humaneval_clean.py"
Script: passagereptile455/training-scripts
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("passagereptile455/qwen3-humaneval-sft")
tokenizer = AutoTokenizer.from_pretrained("passagereptile455/qwen3-humaneval-sft")
messages = [{"role": "user", "content": "Write a Python function to check if a number is prime"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
What broke along the way (5 attempts total):
- Use processing_class, not tokenizer
- Use HfApi(token=), not login()
- Pass token= to push_to_hub() explicitly

Most time was spent on infrastructure debugging, not ML.
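The three fixes above might fit together roughly as follows. This is a sketch under stated assumptions, not the actual training script: `get_hf_token` and `train_and_push` are hypothetical names, the `SFTConfig` is left at defaults apart from an illustrative `output_dir`, and it assumes a recent TRL release in which `SFTTrainer` accepts `processing_class`.

```python
import os


def get_hf_token():
    """Read the token that `--secrets HF_TOKEN` exposes as an environment variable."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; pass it with --secrets HF_TOKEN")
    return token


def train_and_push(model, tokenizer, train_dataset, repo_id):
    """Hypothetical helper: fine-tune with SFT and upload the result."""
    # Imported lazily so get_hf_token can be used without these packages.
    from huggingface_hub import HfApi
    from trl import SFTConfig, SFTTrainer

    token = get_hf_token()

    # Authenticate with an explicit client instead of calling login().
    api = HfApi(token=token)
    api.create_repo(repo_id, exist_ok=True)

    trainer = SFTTrainer(
        model=model,
        args=SFTConfig(output_dir="outputs"),  # illustrative value
        train_dataset=train_dataset,
        processing_class=tokenizer,  # not tokenizer=...
    )
    trainer.train()

    # Pass the token explicitly rather than relying on cached credentials.
    trainer.model.push_to_hub(repo_id, token=token)
    tokenizer.push_to_hub(repo_id, token=token)
```

Failing fast in `get_hf_token` surfaces a missing secret before any GPU time is spent, rather than at upload time after training has finished.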
Based on Ben Burtenshaw's challenge comparing Claude Code vs Codex for fine-tuning tasks.