Bharat MiniGPT 350M (3.5B tokens Experiment)

Bharat MiniGPT 350M is a custom GPT-style causal language model trained from scratch by Harshvardhan Mishra. It uses modern LLM architecture components such as RoPE, RMSNorm, SwiGLU, and scaled dot-product attention (SDPA).

This is not a fine-tuned GPT-2 or LLaMA variant. The architecture and training pipeline were implemented manually in PyTorch and later integrated into the HuggingFace ecosystem.

An improved version trained on more tokens, and a fine-tuned version, are coming soon.

Best suited for:

  • Knowledge completion
  • Educational prompts
  • Article continuation
  • English prose generation

This is a pretrained foundation model and is not instruction tuned yet.

Explore More: https://iotbyhvm.ooo/bharat-minigpt-350m-a-custom-gpt-style-llm-built-from-scratch-in-india/

Model Details

  • Model Name: Bharat MiniGPT 350M
  • Parameters: ~350 Million
  • Architecture: Decoder-only Transformer
  • Training Tokens: 3.5 Billion
  • Framework: PyTorch + Custom Hugging Face Transformers integration
  • Developer: Harshvardhan Mishra
  • Organization: HVM Smart Solutions

Architecture

| Component | Details |
|---|---|
| Layers | 24 Transformer Blocks |
| Heads | 16 Attention Heads |
| Embedding Size | 1024 |
| Context Length | 768 Tokens |
| Vocabulary Size | 50,257 |
| Position Encoding | RoPE (Rotary Position Embedding) |
| Normalization | RMSNorm |
| Feed Forward | SwiGLU |
| Attention | SDPA / Flash Attention Compatible |
| Weight Tying | Yes |
| Precision | FP16 Training |
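As a sanity check on the table above, the parameter count can be estimated from the listed dimensions. This is only a rough sketch: the card does not state the SwiGLU hidden size or whether linear layers carry biases, so `ffn_hidden` (defaulting to about 8/3 of the embedding size, a common choice) and the bias-free layers are assumptions, and `estimate_parameters` is a hypothetical helper, not part of the model code.

```python
# Rough parameter estimate from the Architecture table.
# ASSUMPTIONS (not stated in the model card): bias-free linear layers and a
# SwiGLU hidden size of int(8/3 * d_model). RoPE adds no learned parameters.

def estimate_parameters(
    n_layers=24,
    d_model=1024,
    vocab_size=50_257,
    ffn_hidden=None,       # hypothetical; defaults to int(8 * d_model / 3)
    tie_embeddings=True,   # the card lists weight tying as "Yes"
):
    if ffn_hidden is None:
        ffn_hidden = int(8 * d_model / 3)
    attention = 4 * d_model * d_model    # Q, K, V and output projections
    swiglu = 3 * d_model * ffn_hidden    # gate, up and down projections
    norms = 2 * d_model                  # two RMSNorm weight vectors per block
    per_block = attention + swiglu + norms
    embeddings = vocab_size * d_model
    # With weight tying the output head reuses the embedding matrix.
    lm_head = 0 if tie_embeddings else vocab_size * d_model
    final_norm = d_model
    return n_layers * per_block + embeddings + lm_head + final_norm


total = estimate_parameters()
print(f"estimated parameters: {total / 1e6:.1f}M")
print(f"per-head dimension: {1024 // 16}")
```

Under these assumptions the estimate lands close to the stated ~350 million. Passing `tie_embeddings=False` shows how much weight tying saves at this vocabulary size.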

Training Data

The model was trained using a weighted mixture of:

| Dataset | Weight |
|---|---|
| HuggingFaceFW/fineweb (sample-10BT) | 40% |
| HuggingFaceFW/fineweb-edu (sample-10BT) | 30% |
| Wikimedia Wikipedia (20231101.en) | 30% |
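One way to realize a weighted mixture like this is to pick the source of each training document at random according to the weights. The sketch below illustrates that idea only. The card does not say whether the weights apply per document or per token, or how the streams were interleaved, so per-document sampling is an assumption and `sample_sources` is a hypothetical helper.

```python
import random
from collections import Counter

# Mixture weights copied from the Training Data table.
MIXTURE = {
    "HuggingFaceFW/fineweb (sample-10BT)": 0.40,
    "HuggingFaceFW/fineweb-edu (sample-10BT)": 0.30,
    "Wikimedia Wikipedia (20231101.en)": 0.30,
}


def sample_sources(n_documents, seed=0):
    """Pick a source dataset for each of n_documents draws.

    ASSUMPTION: weights are applied per document; the actual pipeline
    may weight by tokens or interleave shards differently.
    """
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=n_documents)


draws = sample_sources(100_000)
counts = Counter(draws)
for name, weight in MIXTURE.items():
    observed = counts[name] / len(draws)
    print(f"{name}: target {weight:.2f}, observed {observed:.3f}")
```

With enough draws the observed fractions approach the target weights, which is the property a mixture sampler needs.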

Training Setup

| Setting | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 3e-4 |
| Min LR | 3e-5 |
| Warmup Steps | 51,200 |
| LR Scheduler | Cosine Decay |
| Gradient Accumulation | 128 |
| Mixed Precision | FP16 |
| Gradient Clipping | 1.0 |
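The learning-rate settings above describe a warmup followed by cosine decay. A minimal sketch of such a schedule is shown here, using the peak rate, minimum rate, and warmup length from the table. The linear warmup shape and the total step count are assumptions (the card states neither), and `lr_at` and `TOTAL_STEPS` are hypothetical names.

```python
import math


def lr_at(step, total_steps, max_lr=3e-4, min_lr=3e-5, warmup_steps=51_200):
    """Learning rate at a given step: linear warmup, then cosine decay.

    ASSUMPTION: warmup is linear and decay ends at total_steps; the
    original training script may differ in both respects.
    """
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + cosine * (max_lr - min_lr)


TOTAL_STEPS = 1_000_000  # hypothetical; not given in the model card

for step in (0, 25_600, 51_200, 500_000, TOTAL_STEPS):
    print(f"step {step:>9,}: lr = {lr_at(step, TOTAL_STEPS):.2e}")
```

The card does not say whether the 51,200 warmup steps count micro-batches or optimizer updates. If they count micro-batches, then with gradient accumulation of 128 the warmup would span 51,200 / 128 = 400 optimizer updates.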

Features

  • Custom GPT architecture
  • RoPE positional embeddings
  • RMSNorm normalization
  • SwiGLU feed-forward layers
  • Flash Attention compatible SDPA
  • HuggingFace generate() support
  • KV-cache compatible
  • Weight tying support
  • Gradient checkpointing during training
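To illustrate what RoPE positional embeddings do, the sketch below rotates pairs of vector components by a position-dependent angle and checks that the attention score between a query and a key depends only on their relative offset. It is a pure-Python illustration, not the model's implementation: the interleaved pairing of components and the base of 10000 are assumptions, and the custom model code may use a different layout.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply a rotary position embedding to an even-length vector.

    ASSUMPTION: components (0,1), (2,3), ... form the rotated pairs and
    the frequency base is 10000; the model's own code may differ.
    """
    dim = len(vec)
    rotated = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)   # angle for this pair
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        rotated.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


query = [0.5, -1.0, 0.25, 2.0]
key = [1.5, 0.3, -0.7, 0.9]

# Same relative offset (3 positions) at two different absolute positions.
score_near = dot(rope_rotate(query, 5), rope_rotate(key, 2))
score_far = dot(rope_rotate(query, 105), rope_rotate(key, 102))
print(abs(score_near - score_far) < 1e-9)
```

Because the two scores match up to floating-point error, the attention logit carries relative rather than absolute position information, which is the main reason RoPE is used in place of learned position embeddings.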

Benchmark Results

Evaluated using: EleutherAI LM Evaluation Harness

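| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| arc_easy | 1 | none | 0 | acc | 0.3312 | ± 0.0097 |
| arc_easy | 1 | none | 0 | acc_norm | 0.3413 | ± 0.0097 |
| hellaswag | 1 | none | 0 | acc | 0.2650 | ± 0.0044 |
| hellaswag | 1 | none | 0 | acc_norm | 0.2636 | ± 0.0044 |
| piqa | 1 | none | 0 | acc | 0.5631 | ± 0.0116 |
| piqa | 1 | none | 0 | acc_norm | 0.5533 | ± 0.0116 |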

Notes:

  • Results are from the current 3B tokens pretrained base checkpoint.
  • This model is not instruction-tuned yet.
  • Further tokenizer and training improvements are planned.
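To put the zero-shot accuracies in context, they can be compared with random guessing. The sketch below computes each task's margin over chance and expresses it in units of the reported standard error. The chance levels are approximations supplied here, not values from the evaluation output: HellaSwag has four candidate endings and PIQA has two, while ARC-Easy questions mostly have four options, so 0.25 is only an approximate baseline for it.

```python
# (accuracy, standard error) from the benchmark table, plus an
# ASSUMED chance level per task (approximate for arc_easy).
RESULTS = {
    "arc_easy": (0.3312, 0.0097, 0.25),
    "hellaswag": (0.2650, 0.0044, 0.25),
    "piqa": (0.5631, 0.0116, 0.50),
}


def margins_over_chance(results):
    """Return {task: (margin, margin in standard errors)}."""
    summary = {}
    for task, (accuracy, stderr, chance) in results.items():
        margin = accuracy - chance
        summary[task] = (margin, margin / stderr)
    return summary


for task, (margin, z_score) in margins_over_chance(RESULTS).items():
    print(f"{task:10s} margin {margin:+.4f}  ({z_score:.1f} standard errors)")
```

All three tasks sit above their assumed chance level by more than three standard errors, with HellaSwag the closest to chance.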

Installation

pip install transformers torch

Usage

Load Model

from transformers import AutoTokenizer, AutoModelForCausalLM

# trust_remote_code=True is required because the architecture is custom
# and its modeling code is loaded from the model repository.
model = AutoModelForCausalLM.from_pretrained(
    "Harshhvm/bharat-minigpt-350m-pretrain-3b-tokens",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(
    "Harshhvm/bharat-minigpt-350m-pretrain-3b-tokens",
    trust_remote_code=True,
)

Generate Text

import torch

prompt = "India is a land of"

inputs = tokenizer(prompt, return_tensors="pt")

# Sample a continuation without tracking gradients.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=80,
        temperature=0.45,
        top_p=0.82,
        top_k=40,
        repetition_penalty=1.35,
        no_repeat_ngram_size=4,
        do_sample=True,
        use_cache=True,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
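Since the model was trained with a 768-token context, it can help to cap the number of generated tokens so that prompt plus continuation stays inside that window. The helper below is a hedged sketch of that bookkeeping. `max_new_tokens_budget` is a hypothetical function, and the card does not say how the custom model code behaves beyond 768 tokens, so treating 768 as a hard limit is an assumption.

```python
CONTEXT_LENGTH = 768  # from the Architecture section of this card


def max_new_tokens_budget(prompt_token_count, requested_new_tokens,
                          context_length=CONTEXT_LENGTH):
    """Clamp the requested generation length to the remaining context.

    ASSUMPTION: prompt and generated tokens together should not exceed
    the training context length.
    """
    if prompt_token_count >= context_length:
        raise ValueError(
            f"prompt has {prompt_token_count} tokens, which leaves no room "
            f"in a {context_length}-token context"
        )
    remaining = context_length - prompt_token_count
    return min(requested_new_tokens, remaining)


print(max_new_tokens_budget(6, 80))     # short prompt: full request fits
print(max_new_tokens_budget(700, 80))   # long prompt: request is clamped
```

In the generation example, the prompt length would come from the tokenized input (for instance `inputs["input_ids"].shape[1]`), and the returned value would be passed as `max_new_tokens`.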

Safetensors checkpoint: model size 0.4B params, tensor type F32