
DeepSeek-V4 @ ~130M (TinyStories)

Trained autonomously by the ml-intern Claude Code skill on a single RTX A6000.

This is a ~130M-parameter down-scale of the DeepSeek-V4 architecture (MQA + 512-head_dim semantics, mHC manifold-constrained hyper-connections, hash routing, sqrtsoftplus gate, swiglu_limit, O grouped low-rank, MTP, attention-sink), trained on the roneneldan/TinyStories dataset.

The whole pipeline was produced by a Claude Code session running the ml-intern skill, with no human in the loop after the initial prompt. That covers paper extraction, the architectural down-scale, the training script, bug-fixing, self-verification, and this very repo. The bug-fixing included a real causal mask leak via the CSA Compressor, which was caught during a v1 run and patched in v2 by disabling the compressor.

Sample generation

Prompt: "Once upon a time, the little girl"

Once upon a time, the little girl was walking through the woods. She saw lots of interesting things and started to explore. Suddenly, she saw a big rock! She was so excited and said, "Wow! I'm so hungry!" Just then, a friendly fox came buzzing by. The fox said, "I have something..."

(More samples in gen_samples.log.)

Verified load test

The snippet below is the one in load_test.py in this repo. It was run end-to-end against the published repo on a fresh machine: python load_test.py prints LOAD_TEST: PASS and the sample above.

import importlib.util, json, torch
from huggingface_hub import snapshot_download
from safetensors.torch import load_file
from transformers import AutoTokenizer

REPO = "AlexWortega/ml-intern-v4-100m-tinystories-20260512-1721"
local = snapshot_download(repo_id=REPO)

spec = importlib.util.spec_from_file_location("ds_v4", f"{local}/model.py")
mod  = importlib.util.module_from_spec(spec); spec.loader.exec_module(mod)

cfg_dict = json.loads(open(f"{local}/config.json").read())
cfg_dict.pop("_model_class", None)
config = mod.DeepSeekV4Config(**cfg_dict)
model  = mod.DeepSeekV4(config)
model.load_state_dict(load_file(f"{local}/model.safetensors"))
model.eval()

# 1. forward pass
x = torch.randint(0, config.vocab_size, (1, 64))
logits = model(x)
assert logits.shape == (1, 64, config.vocab_size)
assert torch.isfinite(logits).all()

# 2. generation
tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok.encode("Once upon a time, the little girl")
x = torch.tensor([ids])
for _ in range(60):
    next_logits = model(x[:, -config.max_seq_len:])[0, -1] / 0.8
    nxt = torch.multinomial(torch.softmax(next_logits, -1), 1).item()
    x = torch.cat([x, torch.tensor([[nxt]])], dim=1)
print(tok.decode(x[0].tolist()))

Actual stdout from the run on eva02:

[2/5] importing DeepSeekV4 from local model.py ...
      config: dim=512 n_layers=12 n_heads=8 vocab_size=50257 max_seq_len=512
[3/5] building model + loading safetensors ...
      loaded 521 tensors, total params = 128,846,136 (~128.8M)
[4/5] forward pass on random tokens ...
      logits shape=(1, 64, 50257) dtype=torch.float32 finite=True
[5/5] generation from 'Once upon a time, the little girl' ...
LOAD_TEST: PASS

Run summary

| metric | value |
| --- | --- |
| architecture | DeepSeek-V4 (down-scaled, compressor disabled in v2 fix) |
| total parameters | 128,846,136 (~128.8M) |
| dataset | roneneldan/TinyStories (streaming) |
| tokenizer | gpt2 (vocab 50,257) |
| sequence length | 256 |
| batch size | 78 (largest fit at bf16 on A6000 after binary-search probe) |
| precision | bf16, AdamW with fp32 moments |
| optimizer | AdamW lr=3e-4, β=(0.9, 0.95), weight_decay=0.1, grad_clip=1.0 |
| LR schedule | linear warmup 1000 steps → cosine decay to 3e-5 |
| init train loss | 10.94 |
| eval loss @ step 8000 | 1.328 |
| eval-train gap | ~0.04 (healthy tracking) |
| throughput | ~18,400 tok/s on RTX A6000 |
| peak GPU mem | ~45 GB |
| hardware | 1× NVIDIA RTX A6000 (46 GB) |
| skill version | AlexWortega/claude-ml-intern-skill |
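The LR schedule row can be read as the function below. This is a minimal sketch, not the code in train_v2.py: the helper name `lr_at`, the 25,000-step total (the card only says "planned ~25k steps"), and the choice to ramp linearly from zero are all assumptions.

```python
import math

PEAK_LR = 3e-4        # from the run summary
MIN_LR = 3e-5         # cosine floor from the run summary
WARMUP_STEPS = 1000   # from the run summary
TOTAL_STEPS = 25_000  # assumption: the card only says "planned ~25k steps"


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR (hypothetical helper)."""
    if step < WARMUP_STEPS:
        # Assumption: warmup ramps linearly from 0 at step 0.
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


if __name__ == "__main__":
    for s in (0, 500, 1000, 8000, 25_000):
        print(f"step {s:>6}: lr = {lr_at(s):.2e}")
```

Under these assumptions the rate peaks at 3e-4 at step 1000 and reaches the 3e-5 floor at the final step. The per-step values logged in train.log are the authoritative record.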

Configuration

See config.json. The corresponding dataclass is in model.py as DeepSeekV4Config. Key fields:

vocab_size       = 50257     # gpt2 tokenizer
dim              = 512
n_layers         = 12
n_heads          = 8
head_dim         = 64
q_lora_rank      = 128
o_lora_rank      = 64
o_groups         = 4
rope_head_dim    = 16
window_size      = 32
compress_ratios  = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)   # CSA disabled (v2 fix)
hc_mult          = 4
hc_sinkhorn_iters = 3
n_routed_experts = 6
n_shared_experts = 1
n_activated_experts = 2
moe_inter_dim    = 512
n_hash_layers    = 1
score_func       = "sqrtsoftplus"
swiglu_limit     = 10.0
n_mtp_layers     = 1
max_seq_len      = 512
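To make the routing fields concrete, here is a hedged, pure-Python sketch of top-k gating with a sqrtsoftplus score. It assumes that score_func = "sqrtsoftplus" means sqrt(softplus(x)) and that the selected weights are renormalized to sum to 1. The helper names `sqrtsoftplus` and `route` are hypothetical, hash routing (n_hash_layers) is not covered, and model.py remains the authoritative implementation.

```python
import math

N_ROUTED_EXPERTS = 6     # n_routed_experts from the config
N_ACTIVATED_EXPERTS = 2  # n_activated_experts from the config


def sqrtsoftplus(x: float) -> float:
    """Assumed meaning of score_func="sqrtsoftplus": sqrt(log(1 + exp(x)))."""
    # Numerically stable softplus: max(x, 0) + log1p(exp(-|x|)).
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return math.sqrt(softplus)


def route(logits: list[float], k: int = N_ACTIVATED_EXPERTS) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their weights (hypothetical helper)."""
    scores = [sqrtsoftplus(v) for v in logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    if total == 0.0:
        # Degenerate case: all scores underflowed to zero, so fall back to uniform weights.
        return [(i, 1.0 / k) for i in top]
    return [(i, scores[i] / total) for i in top]


if __name__ == "__main__":
    # One router logit per routed expert (made-up values for illustration).
    example_logits = [0.3, -1.2, 2.0, 0.0, 1.1, -0.5]
    assert len(example_logits) == N_ROUTED_EXPERTS
    for expert, weight in route(example_logits):
        print(f"expert {expert}: weight {weight:.3f}")
```

With the made-up logits above, experts 2 and 4 are selected because they have the two largest logits, and their weights sum to 1.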

Files in this repo

| file | what it is |
| --- | --- |
| model.safetensors | float32 weights (converted from step_8000.pt) |
| config.json | actual runtime architecture config |
| model.py | self-contained PyTorch implementation of the V4 architecture |
| load_test.py | end-to-end verified load test (snapshot_download → forward → generate) |
| train_v2.py | the training script that produced this checkpoint |
| train.log | per-step loss, lr, throughput |
| eval.log | per-eval-window evaluation losses |
| gen_samples.log | generation samples at steps 1000 / 5000 / … |
| DEBUG.md | post-mortem of the v1 causal-mask-leak bug and its fix |
| TASK.md | restated task as understood by the agent |
| README.md | this card |

Status

This is a mid-training snapshot at step 8000 of an ongoing full-epoch run (planned ~25k steps). At the time this card was written, the run had passed step 10000 with eval_loss=1.30 and an eval curve that was still improving monotonically. Later checkpoints will be published in the same namespace under adjacent ml-intern timestamps.
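As a rough sanity check on where this snapshot sits in the run, the figures from the run summary can be combined as below. This is back-of-the-envelope arithmetic, not output from the training script. It assumes one optimizer step consumes exactly one batch of 78 sequences of 256 tokens (no gradient accumulation), a flat 25,000-step plan, and a constant ~18,400 tok/s.

```python
BATCH_SIZE = 78          # run summary
SEQ_LEN = 256            # run summary
TOKENS_PER_SEC = 18_400  # run summary, approximate
SNAPSHOT_STEP = 8_000    # step of this checkpoint
PLANNED_STEPS = 25_000   # assumption: the card says "planned ~25k steps"

# Assumption: one optimizer step == one batch (no gradient accumulation).
tokens_per_step = BATCH_SIZE * SEQ_LEN
tokens_at_snapshot = tokens_per_step * SNAPSHOT_STEP
seconds_per_step = tokens_per_step / TOKENS_PER_SEC
hours_remaining = (PLANNED_STEPS - SNAPSHOT_STEP) * seconds_per_step / 3600

print(f"tokens per step:        {tokens_per_step:,}")
print(f"tokens seen at step 8k: {tokens_at_snapshot:,}")
print(f"seconds per step:       {seconds_per_step:.2f}")
print(f"hours left after 8k:    {hours_remaining:.1f}")
```

Under these assumptions, each step covers 19,968 tokens, the snapshot has seen 159,744,000 tokens (about 160M), and the remaining 17,000 steps would take roughly 5.1 hours.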

Caveats

  • TinyStories is a toy distribution. Don't expect this model to generalize beyond simple children-story prose.
  • The model uses a gpt2 tokenizer rather than the DeepSeek tokenizer (the V4 tokenizer is gated and not needed at 130M for TinyStories).
  • The architecture is faithful to the V4 paper's structural deltas (mHC, hash routing, sqrtsoftplus, swiglu_limit, MTP, attention-sink) but aggressively down-scaled: the MoE has 6 routed experts (vs. 384 in V4-Pro) and the model has 12 layers (vs. 61). The CSA Compressor is disabled (compress_ratios=(0,)*12) because v1 attempts produced a causal mask leak via the compressor's pooling kernel; see DEBUG.md.
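The kind of leak mentioned in the last caveat can be illustrated with a toy example. The sketch below is not the repo's compressor and does not reproduce the actual bug, which is documented in DEBUG.md. It only shows, under simplified assumptions, how block-wise pooling can let a position read later positions, and how a perturbation test detects that. All function names are hypothetical.

```python
def block_mean_pool(xs: list[float], ratio: int) -> list[float]:
    """Leaky toy compressor: each position reads the mean of its whole block,
    including positions that come after it."""
    out = []
    for i in range(len(xs)):
        start = (i // ratio) * ratio
        block = xs[start:start + ratio]
        out.append(sum(block) / len(block))
    return out


def causal_block_mean_pool(xs: list[float], ratio: int) -> list[float]:
    """Causal variant: position i reads only its block up to and including i."""
    out = []
    for i in range(len(xs)):
        start = (i // ratio) * ratio
        block = xs[start:i + 1]
        out.append(sum(block) / len(block))
    return out


def is_causal(fn, length: int = 16) -> bool:
    """Perturb each position j in turn and check that outputs before j are unchanged."""
    xs = [float(i) for i in range(length)]
    base = fn(xs)
    for j in range(1, length):
        perturbed = list(xs)
        perturbed[j] += 1.0
        if fn(perturbed)[:j] != base[:j]:
            return False  # an earlier output depended on a later input
    return True


if __name__ == "__main__":
    print("block pooling causal? ", is_causal(lambda xs: block_mean_pool(xs, 4)))
    print("prefix pooling causal?", is_causal(lambda xs: causal_block_mean_pool(xs, 4)))
```

The first check reports False and the second True. The same perturbation idea carries over to a real model: change a token at position j and confirm that the logits at positions before j stay the same.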

Reproduce the training run

curl -fsSL https://raw.githubusercontent.com/AlexWortega/claude-ml-intern-skill/main/install.sh | bash
# fill ~/.claude/skills/ml-intern/.env with HF_TOKEN (+ optional TG/Slack), then in any claude session:
claude -p --permission-mode bypassPermissions <<'EOF'
/ml-intern train a DeepSeek-V4 architecture at ~130M parameters on roneneldan/TinyStories for one epoch
EOF

License

Apache 2.0.
