Qwen3-0.6B Fine-tuned on Codeforces-CoTS (Python)

Reproduction of Ben Burtenshaw's HuggingFace fine-tuning challenge (Claude Code vs Codex). Fine-tuned using SFT on the solutions_py subset of open-r1/codeforces-cots.

Results on HumanEval

Model	Score	Problems Passed
Base (Qwen3-0.6B)	40.24%	66/164
Fine-tuned	40.85%	67/164
Improvement	+0.61 pp	+1 problem
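
The percentages in the table follow directly from the pass counts over HumanEval's 164 problems. A minimal sketch of that arithmetic (the name `pass_rate` is illustrative and not taken from the training script):

```python
HUMANEVAL_PROBLEMS = 164  # total problems, from the table above


def pass_rate(passed: int, total: int = HUMANEVAL_PROBLEMS) -> float:
    """Return the pass rate as a percentage."""
    return 100.0 * passed / total


base = pass_rate(66)   # base Qwen3-0.6B
tuned = pass_rate(67)  # fine-tuned model
print(f"base: {base:.2f}%  fine-tuned: {tuned:.2f}%  delta: {tuned - base:+.2f} pp")
```

The difference is in percentage points: one extra problem out of 164.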

Variance Across Runs

Same script, same hardware, 3 parallel runs:

Run	Score	Result
1	40.85%	Win (+1)
2	40.24%	Tie
3	39.63%	Loss (-1)

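Fine-tuning is stochastic: three identical runs landed one problem above, level with, and one problem below the base model, so the +1 result sits within run-to-run noise. Expect to need multiple attempts.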
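
To put the one-problem spread in context, the sketch below averages the three runs and compares a single problem with the binomial standard error of a 164-problem benchmark. Treating each problem as an independent pass/fail trial is an assumption made here for illustration; the original evaluation does not state it.

```python
import math
from statistics import mean

TOTAL = 164                          # HumanEval problems
BASE_PASSED = 66                     # base model, from the results table
run_scores = [40.85, 40.24, 39.63]   # percentages from the three runs

mean_score = mean(run_scores)

# Binomial standard error of a pass rate p measured on TOTAL problems
# (assumes independent problems; an illustrative simplification).
p = BASE_PASSED / TOTAL
std_err_pp = 100.0 * math.sqrt(p * (1.0 - p) / TOTAL)
one_problem_pp = 100.0 / TOTAL

print(f"mean over runs: {mean_score:.2f}%")
print(f"one problem = {one_problem_pp:.2f} pp, standard error = {std_err_pp:.2f} pp")
```

Under that assumption, a single problem is a small fraction of the benchmark's own sampling error, which is consistent with the win/tie/loss pattern in the table.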

Key Insight

The default codeforces-cots dataset is ~90% C++. Training on it for a Python benchmark (HumanEval) hurt performance in early attempts. Switching to the solutions_py subset roughly doubled the HumanEval score, from ~18% to ~40%.

Domain alignment > data quantity.
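
A minimal sketch of selecting the Python-only subset with the `datasets` library. The `train` split name and taking the first 500 rows are assumptions; the actual training script may sample differently.

```python
DATASET_ID = "open-r1/codeforces-cots"
PYTHON_SUBSET = "solutions_py"   # Python-only configuration named above
NUM_EXAMPLES = 500               # example count listed under Training Details


def load_python_subset(n: int = NUM_EXAMPLES):
    """Load the first n rows of the Python subset.

    Requires `pip install datasets` and network access.
    """
    from datasets import load_dataset  # imported lazily

    # Assumption: the subset exposes a "train" split.
    ds = load_dataset(DATASET_ID, PYTHON_SUBSET, split="train")
    return ds.select(range(min(n, len(ds))))
```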

Training Details

Dataset: open-r1/codeforces-cots (solutions_py subset) - 500 examples
Method: LoRA (r=8, alpha=16, dropout=0.05)
Steps: 150
Learning Rate: 5e-6
Batch Size: 2 (gradient accumulation: 4)
Hardware: a10g-small (~$0.75/hr)
Runtime: ~1 hour
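
The settings above imply an effective batch size and a rough number of passes over the data. The sketch below works that out, assuming a single GPU and one training example per sequence (no packing); neither is stated in the list, and the class name `TrainSetup` is illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainSetup:
    num_examples: int = 500
    max_steps: int = 150
    per_device_batch_size: int = 2
    grad_accum_steps: int = 4
    num_devices: int = 1  # assumption: a10g-small exposes a single GPU

    @property
    def effective_batch_size(self) -> int:
        return self.per_device_batch_size * self.grad_accum_steps * self.num_devices

    @property
    def sequences_seen(self) -> int:
        return self.max_steps * self.effective_batch_size

    @property
    def approx_epochs(self) -> float:
        # Assumes one example per sequence (no packing).
        return self.sequences_seen / self.num_examples


setup = TrainSetup()
print(setup.effective_batch_size, setup.sequences_seen, setup.approx_epochs)
```

Under these assumptions the run uses an effective batch of 8 and sees 1,200 sequences, about 2.4 passes over the 500 examples.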

Reproduction

1. Get a HuggingFace Pro account
2. Set HF_TOKEN as a secret
3. Run:

hf jobs uv run \
  --flavor a10g-small \
  --timeout 14400 \
  --secrets HF_TOKEN \
  "https://huggingface.co/passagereptile455/training-scripts/resolve/main/train_humaneval_clean.py"

Script: passagereptile455/training-scripts

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model and its tokenizer from the Hub
model = AutoModelForCausalLM.from_pretrained("passagereptile455/qwen3-humaneval-sft")
tokenizer = AutoTokenizer.from_pretrained("passagereptile455/qwen3-humaneval-sft")

# Build a chat-formatted prompt
messages = [{"role": "user", "content": "Write a Python function to check if a number is prime"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate and print the decoded sequence (prompt plus completion)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Debugging Notes

What broke along the way (5 attempts total):

TRL API changed: the trainer takes processing_class=, not tokenizer=
Auth changed: use HfApi(token=...) instead of login()
Upload: pass token= to push_to_hub() explicitly
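
The three fixes can be sketched as below. This is an illustration under assumptions, not the actual training script: the helper names `hub_token` and `make_api` are hypothetical, and the commented usage assumes the script uses TRL's `SFTTrainer`.

```python
import os


def hub_token(env=None) -> str:
    """Read the Hub token from the environment, failing early if it is missing."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; pass it with --secrets HF_TOKEN")
    return token


def make_api(token: str):
    """Build an authenticated client (requires `pip install huggingface_hub`)."""
    from huggingface_hub import HfApi  # imported lazily

    return HfApi(token=token)


# Hypothetical usage inside a training script:
#   token = hub_token()
#   api = make_api(token)                                   # instead of login()
#   trainer = SFTTrainer(..., processing_class=tokenizer)   # not tokenizer=
#   model.push_to_hub(repo_id, token=token)                 # explicit token
```

Reading the token once and passing it explicitly avoids depending on cached credentials inside the job container.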

Most time was spent on infrastructure debugging, not ML.

Acknowledgments

Based on Ben Burtenshaw's challenge comparing Claude Code vs Codex for fine-tuning tasks.

