DeepSeek-V4 @ ~130M (TinyStories)
Trained autonomously by the ml-intern Claude Code skill on a single RTX A6000.
This is a ~130M-parameter down-scale of the DeepSeek-V4 architecture (MQA + 512-head_dim semantics, mHC manifold-constrained hyper-connections, hash routing, sqrtsoftplus gate, swiglu_limit, O grouped low-rank, MTP, attention-sink), trained on the roneneldan/TinyStories dataset.
The whole pipeline (paper extraction, architectural down-scale, training script, bug-fixing, self-verification, and this very repo) was produced by a Claude Code session running the ml-intern skill, with no human in the loop after the initial prompt. During the v1 run the session caught a real causal mask leak via the CSA Compressor and patched it in v2 by disabling the compressor.
Sample generation
Prompt: "Once upon a time, the little girl"
Once upon a time, the little girl was walking through the woods. She saw lots of interesting things and started to explore. Suddenly, she saw a big rock! She was so excited and said, "Wow! I'm so hungry!" Just then, a friendly fox came buzzing by. The fox said, "I have something..."
(More samples in gen_samples.log.)
Verified load test
The snippet below is the one in load_test.py in this repo. It was run end-to-end against this published repo on a fresh machine — python load_test.py prints LOAD_TEST: PASS and the sample above.
import importlib.util, json, torch
from huggingface_hub import snapshot_download
from safetensors.torch import load_file
from transformers import AutoTokenizer
REPO = "AlexWortega/ml-intern-v4-100m-tinystories-20260512-1721"
local = snapshot_download(repo_id=REPO)
spec = importlib.util.spec_from_file_location("ds_v4", f"{local}/model.py")
mod = importlib.util.module_from_spec(spec); spec.loader.exec_module(mod)
cfg_dict = json.loads(open(f"{local}/config.json").read())
cfg_dict.pop("_model_class", None)
config = mod.DeepSeekV4Config(**cfg_dict)
model = mod.DeepSeekV4(config)
model.load_state_dict(load_file(f"{local}/model.safetensors"))
model.eval()
# 1. forward pass
x = torch.randint(0, config.vocab_size, (1, 64))
logits = model(x)
assert logits.shape == (1, 64, config.vocab_size)
assert torch.isfinite(logits).all()
# 2. generation
tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok.encode("Once upon a time, the little girl")
x = torch.tensor([ids])
for _ in range(60):
    next_logits = model(x[:, -config.max_seq_len:])[0, -1] / 0.8
    nxt = torch.multinomial(torch.softmax(next_logits, -1), 1).item()
    x = torch.cat([x, torch.tensor([[nxt]])], dim=1)
print(tok.decode(x[0].tolist()))
Actual stdout from the run on eva02:
[2/5] importing DeepSeekV4 from local model.py ...
config: dim=512 n_layers=12 n_heads=8 vocab_size=50257 max_seq_len=512
[3/5] building model + loading safetensors ...
loaded 521 tensors, total params = 128,846,136 (~128.8M)
[4/5] forward pass on random tokens ...
logits shape=(1, 64, 50257) dtype=torch.float32 finite=True
[5/5] generation from 'Once upon a time, the little girl' ...
LOAD_TEST: PASS
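The tensor and parameter totals reported in step [3/5] can be recomputed from any state dict. The following is a minimal sketch, not code from load_test.py: `summarize_state_dict` is a hypothetical helper, the toy dict stands in for the result of `load_file(...)`, and the count assumes every stored tensor is counted once (tied weights are not de-duplicated).

```python
import torch

def summarize_state_dict(state_dict):
    """Return (tensor_count, total_elements) for a name -> tensor mapping."""
    n_tensors = len(state_dict)
    n_params = sum(t.numel() for t in state_dict.values())
    return n_tensors, n_params

# Toy stand-in for load_file(f"{local}/model.safetensors") (hypothetical names).
toy = {"embed.weight": torch.zeros(10, 4), "norm.weight": torch.ones(4)}
n_tensors, n_params = summarize_state_dict(toy)
print(f"loaded {n_tensors} tensors, total params = {n_params:,} (~{n_params / 1e6:.1f}M)")
```

Passing the real state dict instead of `toy` should give figures comparable to the log line above, provided the checkpoint stores no shared tensors twice.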
Run summary
| metric | value |
|---|---|
| architecture | DeepSeek-V4 (down-scaled, compressor disabled in v2 fix) |
| total parameters | 128,846,136 (~128.8M) |
| dataset | roneneldan/TinyStories (streaming) |
| tokenizer | gpt2 (vocab 50,257) |
| sequence length | 256 |
| batch size | 78 (largest fit at bf16 on A6000 after binary-search probe) |
| precision | bf16, AdamW with fp32 moments |
| optimizer | AdamW lr=3e-4, β=(0.9, 0.95), weight_decay=0.1, grad_clip=1.0 |
| LR schedule | linear warmup 1000 steps → cosine decay to 3e-5 |
| init train loss | 10.94 |
| eval loss @ step 8000 | 1.328 |
| eval-train gap | ~0.04 (healthy tracking) |
| throughput | ~18,400 tok/s on RTX A6000 |
| peak GPU mem | ~45 GB |
| hardware | 1× NVIDIA RTX A6000 (46 GB) |
| skill version | AlexWortega/claude-ml-intern-skill |
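
The LR schedule row can be read as the function below. This is a sketch under stated assumptions rather than the code in train_v2.py: the total of 25,000 steps is taken from the "planned ~25k steps" figure in the Status section, and the exact warmup start value and step indexing are guesses.

```python
import math

PEAK_LR = 3e-4         # from the run summary
MIN_LR = 3e-5          # from the run summary
WARMUP_STEPS = 1000    # from the run summary
TOTAL_STEPS = 25_000   # assumption: "planned ~25k steps"

def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        # assumption: warmup ramps linearly and reaches the peak at the last warmup step
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine

for step in (0, 500, 1000, 8000, 25_000):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the step-8000 snapshot was taken while the learning rate was still in the upper part of the cosine curve.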
Configuration
See config.json. The corresponding dataclass is in model.py as DeepSeekV4Config. Key fields:
vocab_size = 50257 # gpt2 tokenizer
dim = 512
n_layers = 12
n_heads = 8
head_dim = 64
q_lora_rank = 128
o_lora_rank = 64
o_groups = 4
rope_head_dim = 16
window_size = 32
compress_ratios = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) # CSA disabled (v2 fix)
hc_mult = 4
hc_sinkhorn_iters = 3
n_routed_experts = 6
n_shared_experts = 1
n_activated_experts = 2
moe_inter_dim = 512
n_hash_layers = 1
score_func = "sqrtsoftplus"
swiglu_limit = 10.0
n_mtp_layers = 1
max_seq_len = 512
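To make the routing fields more concrete, here is a hedged sketch of what score_func = "sqrtsoftplus" and swiglu_limit might compute. It is not copied from model.py: the formula sqrt(softplus(logit)), the renormalisation of the top-k weights, and the clamping scheme are all assumptions, and both function names are hypothetical.

```python
import torch
import torch.nn.functional as F

def sqrtsoftplus_topk(router_logits, k=2):
    """Assumed gate: score = sqrt(softplus(logit)), keep top-k, renormalise."""
    scores = torch.sqrt(F.softplus(router_logits))          # non-negative scores
    weights, expert_ids = scores.topk(k, dim=-1)            # k = n_activated_experts
    weights = weights / weights.sum(dim=-1, keepdim=True)   # assumption: weights sum to 1
    return weights, expert_ids

def clamped_swiglu(gate, up, limit=10.0):
    """Assumed swiglu_limit: clamp both branches before silu(gate) * up."""
    gate = gate.clamp(max=limit)
    up = up.clamp(min=-limit, max=limit)
    return F.silu(gate) * up

torch.manual_seed(0)
logits = torch.randn(4, 6)   # 4 tokens, n_routed_experts = 6
weights, expert_ids = sqrtsoftplus_topk(logits, k=2)
print(weights.shape, expert_ids.shape)
```

Each token ends up with two expert indices and two mixing weights; the shared expert (n_shared_experts = 1) would be applied to every token in addition to these.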
Files in this repo
| file | what it is |
|---|---|
| model.safetensors | float32 weights (converted from step_8000.pt) |
| config.json | actual runtime architecture config |
| model.py | self-contained PyTorch implementation of the V4 architecture |
| load_test.py | end-to-end verified load test (snapshot_download → forward → generate) |
| train_v2.py | the training script that produced this checkpoint |
| train.log | per-step loss, lr, throughput |
| eval.log | per-eval-window evaluation losses |
| gen_samples.log | generation samples at steps 1000 / 5000 / … |
| DEBUG.md | post-mortem of the v1 causal-mask-leak bug and its fix |
| TASK.md | restated task as understood by the agent |
| README.md | this card |
Status
This is a mid-training snapshot at step 8000 of an ongoing full-epoch run (planned ~25k steps). When this card was written, the run had passed step 10000 with eval_loss=1.30 and the curve was still improving monotonically. Later checkpoints will be published under adjacent ml-intern timestamps in the same namespace.
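For a rough sense of scale, the step counts can be converted to tokens and wall-clock time using the batch size, sequence length, and throughput from the run summary. The sketch assumes no gradient accumulation, full-length sequences, and constant throughput, and it ignores time spent on evaluation and checkpointing; none of this is confirmed by train_v2.py.

```python
BATCH_SIZE = 78       # sequences per optimizer step (run summary)
SEQ_LEN = 256         # tokens per sequence (run summary)
THROUGHPUT = 18_400   # tokens per second (run summary, approximate)

def tokens_seen(steps):
    """Tokens processed after `steps` optimizer steps (assumes no grad accumulation)."""
    return steps * BATCH_SIZE * SEQ_LEN

def hours_needed(steps):
    """Training time in hours at constant throughput."""
    return tokens_seen(steps) / THROUGHPUT / 3600

for steps in (8_000, 25_000):
    print(f"step {steps:>6}: {tokens_seen(steps) / 1e6:.1f}M tokens, ~{hours_needed(steps):.1f} h")
```

Under these assumptions the published snapshot has seen roughly 160M tokens, and the planned run would cover roughly 500M.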
Caveats
- TinyStories is a toy distribution. Don't expect this model to generalize beyond simple children-story prose.
- The model uses the gpt2 tokenizer rather than the DeepSeek tokenizer (the V4 tokenizer is gated and not needed at 130M for TinyStories).
- The architecture is faithful to the V4 paper's structural deltas (mHC, hash routing, sqrtsoftplus, swiglu_limit, MTP, attention-sink) but down-scaled aggressively: the MoE has 6 routed experts (vs. 384 in V4-Pro), the model has 12 layers (vs. 61), and the CSA Compressor is disabled (compress_ratios=(0,)*12) because v1 attempts produced a causal mask leak via the compressor's pooling kernel; see DEBUG.md.
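
A causal mask leak of the kind mentioned above can be detected with a perturbation test: change only the tokens from position t onward and check that the logits before t stay the same. The sketch below is a generic check, not the procedure documented in DEBUG.md. `first_leaky_position` and both toy models are hypothetical, and the block-pooling toy only imitates the general idea of a pooling layer that mixes in future tokens.

```python
import torch

def first_leaky_position(forward, vocab_size, seq_len=32, t=12, atol=1e-5):
    """First position < t whose logits change when tokens >= t change, else None."""
    torch.manual_seed(0)
    x = torch.randint(0, vocab_size, (1, seq_len))
    y = x.clone()
    y[:, t:] = (x[:, t:] + 1) % vocab_size       # guaranteed to differ from x
    with torch.no_grad():
        a, b = forward(x), forward(y)
    diff = (a[:, :t] - b[:, :t]).abs().amax(dim=(0, 2))  # max change per position
    leaky = (diff > atol).nonzero()
    return None if leaky.numel() == 0 else int(leaky[0])

emb = torch.nn.Embedding(100, 8)
head = torch.nn.Linear(8, 100)

def causal_toy(ids):
    return head(emb(ids).cumsum(dim=1))          # position i sees tokens <= i only

def leaky_toy(ids, block=8):
    h = emb(ids)                                 # (1, T, D)
    b, T, d = h.shape
    pooled = h.view(b, T // block, block, d).mean(dim=2, keepdim=True)
    pooled = pooled.expand(-1, -1, block, -1).reshape(b, T, d)
    return head(h + pooled)                      # block mean includes later tokens

print(first_leaky_position(causal_toy, 100))
print(first_leaky_position(leaky_toy, 100))
```

The causal toy reports no leak, while the pooled toy is flagged at the start of the block that straddles t. With the objects from the load test, the same check would be `first_leaky_position(model, config.vocab_size)`.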
Reproduce the training run
curl -fsSL https://raw.githubusercontent.com/AlexWortega/claude-ml-intern-skill/main/install.sh | bash
# fill ~/.claude/skills/ml-intern/.env with HF_TOKEN (+ optional TG/Slack), then in any claude session:
claude -p --permission-mode bypassPermissions <<'EOF'
/ml-intern train a DeepSeek-V4 architecture at ~130M parameters on roneneldan/TinyStories for one epoch
EOF
License
Apache 2.0.