DeepSeek-V4 @ ~130M (TinyStories)

Trained autonomously by the ml-intern Claude Code skill on a single RTX A6000.

This is a down-scale of the DeepSeek-V4 architecture to roughly 130M parameters (128.8M actual), trained on the roneneldan/TinyStories dataset. It keeps the main V4 components: MQA with 512-head_dim semantics, manifold-constrained hyper-connections (mHC), hash routing, the sqrtsoftplus gate, swiglu_limit, a grouped low-rank O projection, MTP, and attention sinks.

The whole pipeline (paper extraction, architectural down-scale, training script, bug-fixing, self-verification, and this very repo) was produced by a Claude Code session running the ml-intern skill, with no human in the loop after the initial prompt. The bug-fixing was not hypothetical: a real causal mask leak via the CSA Compressor was caught during a v1 run and patched in v2 by disabling the compressor.

Sample generation

Prompt: "Once upon a time, the little girl"

Once upon a time, the little girl was walking through the woods. She saw lots of interesting things and started to explore. Suddenly, she saw a big rock! She was so excited and said, "Wow! I'm so hungry!" Just then, a friendly fox came buzzing by. The fox said, "I have something..."

(More samples in gen_samples.log.)

Verified load test

The snippet below is the one in load_test.py in this repo. It was run end-to-end against this published repo on a fresh machine — python load_test.py prints LOAD_TEST: PASS and the sample above.

import importlib.util, json, torch
from huggingface_hub import snapshot_download
from safetensors.torch import load_file
from transformers import AutoTokenizer

REPO = "AlexWortega/ml-intern-v4-100m-tinystories-20260512-1721"
local = snapshot_download(repo_id=REPO)

spec = importlib.util.spec_from_file_location("ds_v4", f"{local}/model.py")
mod  = importlib.util.module_from_spec(spec); spec.loader.exec_module(mod)

cfg_dict = json.loads(open(f"{local}/config.json").read())
cfg_dict.pop("_model_class", None)
config = mod.DeepSeekV4Config(**cfg_dict)
model  = mod.DeepSeekV4(config)
model.load_state_dict(load_file(f"{local}/model.safetensors"))
model.eval()

# 1. forward pass
x = torch.randint(0, config.vocab_size, (1, 64))
logits = model(x)
assert logits.shape == (1, 64, config.vocab_size)
assert torch.isfinite(logits).all()

# 2. generation
tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok.encode("Once upon a time, the little girl")
x = torch.tensor([ids])
for _ in range(60):
    next_logits = model(x[:, -config.max_seq_len:])[0, -1] / 0.8
    nxt = torch.multinomial(torch.softmax(next_logits, -1), 1).item()
    x = torch.cat([x, torch.tensor([[nxt]])], dim=1)
print(tok.decode(x[0].tolist()))

Actual stdout from the run on eva02:

[2/5] importing DeepSeekV4 from local model.py ...
      config: dim=512 n_layers=12 n_heads=8 vocab_size=50257 max_seq_len=512
[3/5] building model + loading safetensors ...
      loaded 521 tensors, total params = 128,846,136 (~128.8M)
[4/5] forward pass on random tokens ...
      logits shape=(1, 64, 50257) dtype=torch.float32 finite=True
[5/5] generation from 'Once upon a time, the little girl' ...
LOAD_TEST: PASS

Run summary

metric	value
architecture	DeepSeek-V4 (down-scaled, compressor disabled in v2 fix)
total parameters	128,846,136 (~128.8M)
dataset	`roneneldan/TinyStories` (streaming)
tokenizer	`gpt2` (vocab 50,257)
sequence length	256
batch size	78 (largest fit at bf16 on A6000 after binary-search probe)
precision	bf16, AdamW with fp32 moments
optimizer	AdamW lr=3e-4, β=(0.9, 0.95), weight_decay=0.1, grad_clip=1.0
LR schedule	linear warmup 1000 steps → cosine decay to 3e-5
init train loss	10.94
eval loss @ step 8000	1.328
eval-train gap	~0.04 (healthy tracking)
throughput	~18,400 tok/s on RTX A6000
peak GPU mem	~45 GB
hardware	1× NVIDIA RTX A6000 (46 GB)
skill version	`AlexWortega/claude-ml-intern-skill`
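
The LR schedule row above can be sketched as a plain function. This is a minimal illustration, not the code in train_v2.py: the helper name `lr_at` is hypothetical, the warmup is assumed to start from zero, and the decay horizon is assumed to be the planned ~25k steps mentioned under Status.

```python
import math

PEAK_LR = 3e-4        # from the run summary
MIN_LR = 3e-5         # cosine floor from the run summary
WARMUP_STEPS = 1000   # from the run summary
TOTAL_STEPS = 25_000  # assumption: the "planned ~25k steps" figure


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (hypothetical helper)."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 up to PEAK_LR (assumed to start at zero).
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


for step in (0, 500, 1000, 8000, 25_000):
    print(f"step {step:>6}: lr = {lr_at(step):.3e}")
```

If the real run decays over a different horizon, only `TOTAL_STEPS` needs to change; the peak and floor values come straight from the table.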

Configuration

See config.json. The corresponding dataclass is in model.py as DeepSeekV4Config. Key fields:

vocab_size       = 50257     # gpt2 tokenizer
dim              = 512
n_layers         = 12
n_heads          = 8
head_dim         = 64
q_lora_rank      = 128
o_lora_rank      = 64
o_groups         = 4
rope_head_dim    = 16
window_size      = 32
compress_ratios  = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)   # CSA disabled (v2 fix)
hc_mult          = 4
hc_sinkhorn_iters = 3
n_routed_experts = 6
n_shared_experts = 1
n_activated_experts = 2
moe_inter_dim    = 512
n_hash_layers    = 1
score_func       = "sqrtsoftplus"
swiglu_limit     = 10.0
n_mtp_layers     = 1
max_seq_len      = 512
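
A few relationships between these fields can be checked mechanically. The sketch below copies the values from the listing above into a plain dict; `check_config` is a hypothetical helper, and the rules it enforces (one compress ratio per layer, top-k no larger than the expert pool, and so on) are assumptions about what a consistent config looks like, not checks taken from model.py.

```python
# Values copied from the "Key fields" listing above.
cfg = {
    "dim": 512, "n_layers": 12, "n_heads": 8, "head_dim": 64,
    "rope_head_dim": 16, "window_size": 32, "max_seq_len": 512,
    "compress_ratios": (0,) * 12,
    "n_routed_experts": 6, "n_shared_experts": 1, "n_activated_experts": 2,
}


def check_config(c: dict) -> list:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if len(c["compress_ratios"]) != c["n_layers"]:
        problems.append("compress_ratios needs one entry per layer")
    if c["n_activated_experts"] > c["n_routed_experts"]:
        problems.append("cannot activate more experts than are routed")
    if c["rope_head_dim"] > c["head_dim"]:
        problems.append("rope_head_dim exceeds head_dim")
    if c["window_size"] > c["max_seq_len"]:
        problems.append("window_size exceeds max_seq_len")
    return problems


# Derived quantities that follow directly from the listed values.
attn_width = cfg["n_heads"] * cfg["head_dim"]
active_experts = cfg["n_shared_experts"] + cfg["n_activated_experts"]
total_experts = cfg["n_shared_experts"] + cfg["n_routed_experts"]
compressor_off = all(r == 0 for r in cfg["compress_ratios"])

print("problems:", check_config(cfg))
print(f"attention width = {attn_width}, experts per token = "
      f"{active_experts}/{total_experts}, compressor disabled = {compressor_off}")
```

For this config the attention width equals `dim`, and each token passes through 3 of the 7 experts per MoE layer (1 shared plus 2 routed).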

Files in this repo

file	what it is
`model.safetensors`	float32 weights (converted from `step_8000.pt`)
`config.json`	actual runtime architecture config
`model.py`	self-contained PyTorch implementation of the V4 architecture
`load_test.py`	end-to-end verified load test (snapshot_download → forward → generate)
`train_v2.py`	the training script that produced this checkpoint
`train.log`	per-step loss, lr, throughput
`eval.log`	per-eval-window evaluation losses
`gen_samples.log`	generation samples at steps 1000 / 5000 / …
`DEBUG.md`	post-mortem of the v1 causal-mask-leak bug and its fix
`TASK.md`	restated task as understood by the agent
`README.md`	this card

Status

This is a mid-training snapshot taken at step 8000 of an ongoing full-epoch run (planned ~25k steps). When this card was written, the run had passed step 10,000 with eval_loss=1.30, and the eval curve was still improving monotonically. Later checkpoints will be published under adjacent ml-intern timestamps in the same namespace.
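
To put the step counts in terms of tokens and wall-clock time, the figures from the run summary can be combined as below. This is back-of-the-envelope arithmetic under two assumptions that the card does not state: each optimizer step consumes exactly one batch (no gradient accumulation), and throughput stays at the reported ~18,400 tok/s for the whole run. The helper name `budget` is hypothetical.

```python
BATCH_SIZE = 78          # from the run summary
SEQ_LEN = 256            # from the run summary
TOKENS_PER_SEC = 18_400  # approximate throughput from the run summary
SNAPSHOT_STEP = 8_000    # this checkpoint
PLANNED_STEPS = 25_000   # "planned ~25k steps"

# Assumption: one batch per optimizer step, no gradient accumulation.
tokens_per_step = BATCH_SIZE * SEQ_LEN


def budget(steps: int):
    """Return (tokens seen, estimated GPU-hours) after `steps` steps."""
    tokens = steps * tokens_per_step
    hours = tokens / TOKENS_PER_SEC / 3600
    return tokens, hours


for label, steps in (("snapshot", SNAPSHOT_STEP), ("planned run", PLANNED_STEPS)):
    tokens, hours = budget(steps)
    print(f"{label}: {tokens / 1e6:.1f}M tokens, about {hours:.1f} GPU-hours")
```

Under these assumptions the step-8000 snapshot corresponds to roughly 160M training tokens and about 2.4 hours on the A6000, and the full planned run to roughly 500M tokens.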

Caveats

TinyStories is a toy distribution. Don't expect this model to generalize beyond simple children-story prose.
The model uses a gpt2 tokenizer rather than the DeepSeek tokenizer (the V4 tokenizer is gated and not needed at 130M for TinyStories).
The architecture is faithful to the V4 paper's structural deltas (mHC, hash routing, sqrtsoftplus, swiglu_limit, MTP, attention-sink) but is down-scaled aggressively: the MoE has 6 routed experts (vs. 384 in V4-Pro) and the model has 12 layers (vs. 61). The CSA Compressor is disabled (compress_ratios=(0,)*12) because v1 attempts produced a causal mask leak via the compressor's pooling kernel; see DEBUG.md.
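
A causal mask leak of the kind mentioned above can be detected with a perturbation test: change one token and confirm that no output at an earlier position moves. The sketch below is a generic illustration in pure Python, not the check the agent ran (DEBUG.md documents the actual bug). Both toy models and the helper `first_leak` are hypothetical; the block-pooling toy only assumes the general shape of the failure, where a pooled summary includes tokens that come later in the same block.

```python
import random


def causal_prefix_mean(tokens):
    """Toy causal model: output t is the mean of tokens[0..t]."""
    out, total = [], 0
    for t, tok in enumerate(tokens):
        total += tok
        out.append(total / (t + 1))
    return out


def leaky_block_pool(tokens, block=4):
    """Toy leaky model: output t is the mean of t's whole block,
    which includes tokens that come after position t."""
    out = []
    for t in range(len(tokens)):
        start = (t // block) * block
        chunk = tokens[start:start + block]
        out.append(sum(chunk) / len(chunk))
    return out


def first_leak(model_fn, seq_len=16, vocab=100, trials=20, seed=0):
    """Return (t, j) if output position t changed when only token j > t
    was modified, or None if no such leak was found."""
    rng = random.Random(seed)
    for _ in range(trials):
        tokens = [rng.randrange(vocab) for _ in range(seq_len)]
        base = model_fn(tokens)
        for j in range(1, seq_len):
            perturbed = list(tokens)
            perturbed[j] = (tokens[j] + 1) % vocab  # always a different token
            out = model_fn(perturbed)
            for t in range(j):
                if out[t] != base[t]:
                    return (t, j)
    return None


print("causal model leak:", first_leak(causal_prefix_mean))
print("block-pool model leak:", first_leak(leaky_block_pool))
```

To apply the same idea to a real network, `model_fn` would wrap a forward pass in eval mode and return per-position logits, and the exact `!=` comparison would be replaced by a small numerical tolerance.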

Reproduce the training run

curl -fsSL https://raw.githubusercontent.com/AlexWortega/claude-ml-intern-skill/main/install.sh | bash
# fill ~/.claude/skills/ml-intern/.env with HF_TOKEN (+ optional TG/Slack), then in any claude session:
claude -p --permission-mode bypassPermissions <<'EOF'
/ml-intern train a DeepSeek-V4 architecture at ~130M parameters on roneneldan/TinyStories for one epoch
EOF

License

Apache 2.0.
