  • BMO
  • Toronto, Canada

aserputov/README.md


Gmail GitHub

> whoami

Software Engineer building LLM infrastructure at scale. I design autonomous agent frameworks, optimize inference pipelines, and build the systems that make large language models actually work in production.

Currently architecting LLM agent systems with Model Context Protocol (MCP), from code-generation compilers to root-cause analysis agents powered by the Claude API. Experienced in shipping high-throughput distributed systems with ML-driven workloads.

When I'm not at work, I'm benchmarking inference engines, profiling KV cache bottlenecks, and quantizing models on Apple Metal.


What I'm Building

LLM Inference Benchmark

Benchmarking framework measuring inference throughput (tokens/sec), per-token latency, and memory footprint across model sizes. Covers INT4 and INT8 quantization on Apple Metal GPUs, thread-scaling analysis, and the contrast between sub-linear OPS scaling for decode and near-linear scaling for prefill.

Python llama.cpp Metal GGUF
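
As a rough illustration of the two headline metrics named above (throughput in tokens/sec and per-token latency), here is a minimal sketch. It is not taken from the benchmark repository: `BenchResult`, `benchmark`, and the `generate` callable are hypothetical names, and the generator stands in for whatever inference backend is being measured.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class BenchResult:
    """Raw measurements from one generation run (hypothetical container)."""
    tokens: int
    total_seconds: float

    @property
    def tokens_per_second(self) -> float:
        # Throughput: how many tokens were produced per wall-clock second.
        return self.tokens / self.total_seconds if self.total_seconds > 0 else 0.0

    @property
    def seconds_per_token(self) -> float:
        # Mean per-token latency, averaged over the whole run.
        return self.total_seconds / self.tokens if self.tokens > 0 else 0.0


def benchmark(
    generate: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> BenchResult:
    """Time one run of `generate`, counting every token it yields.

    `generate` is an assumed interface: any zero-argument callable that
    yields tokens one at a time. `clock` is injectable so the timing
    logic can be checked without a real model.
    """
    start = clock()
    count = 0
    for _ in generate():
        count += 1
    return BenchResult(tokens=count, total_seconds=clock() - start)
```

Averaging over the whole run mixes prefill and decode time, so a fuller harness would likely time the two phases separately.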

Next-Token Prediction Engine

Frequency-based language model implementing core next-token prediction from scratch: n-gram statistics, vocabulary search, and probability ranking. A beam-search-inspired ranking algorithm surfaces high-probability completions in O(log n) time.

Python NLP Probability Beam Search
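
To make the idea of frequency-based next-token prediction concrete, here is a minimal bigram sketch. It is an assumption-laden illustration, not the project's code: `BigramModel` is a hypothetical name, tokenization is plain whitespace splitting, and ranking uses a heap-based top-k selection rather than the beam-search-inspired algorithm described above.

```python
import heapq
from collections import Counter, defaultdict


class BigramModel:
    """Counts which token follows which, then ranks followers by frequency."""

    def __init__(self):
        # counts[prev][nxt] = number of times `nxt` followed `prev` in training text.
        self.counts = defaultdict(Counter)

    def train(self, text: str) -> None:
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        for prev, nxt in zip(tokens, tokens[1:]):
            self.counts[prev][nxt] += 1

    def predict(self, prev: str, k: int = 3):
        """Return up to k (token, probability) pairs, most likely first."""
        followers = self.counts.get(prev.lower())
        if not followers:
            return []
        total = sum(followers.values())
        # Heap-based top-k selection over the observed followers.
        top = heapq.nlargest(k, followers.items(), key=lambda kv: kv[1])
        return [(token, count / total) for token, count in top]
```

For example, after `train("the cat sat on the mat the cat ran")`, calling `predict("the", k=2)` ranks `cat` ahead of `mat`, because `cat` followed `the` twice and `mat` once.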

Pinned

  1. infra-test (Public)

    C# 1

  2. realtimechat (Public)

    Kafka, Node.js, and sockets.

    JavaScript 1

  3. iotLightBulb (Public)

    C++ 1

  4. inference-engine (Public)

    From-scratch inference engine for GPT-2. KV-cache (2.6x speedup), streaming generation, benchmarks. No inference libraries.

    Python

  5. llm-inference-benchmark (Public)

    Systematic LLM inference benchmarking framework using llama.cpp. Measures throughput, latency, and memory across model sizes and configurations.

    Python

  6. vllm-project/vllm (Public)

    A high-throughput and memory-efficient inference and serving engine for LLMs

    Python 83.3k 18.2k
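
The inference-engine entry above mentions a KV-cache. As a rough sketch of why caching helps, the toy single-head attention below compares recomputing key/value projections for the whole prefix at every decode step against projecting only the newest token and reusing stored results. Everything here is hypothetical and unrelated to the repository's actual code: `ToyHead`, `project`, and `attend` are illustrative names, the weights are plain nested lists, and the sketch makes no attempt to reproduce the reported speedup.

```python
import math


def project(x, weight):
    """Toy linear projection: multiply vector x by a matrix given as rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over all keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    norm = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / norm for i in range(dim)]


class ToyHead:
    """One attention head that counts how many K/V projections it performs."""

    def __init__(self, wq, wk, wv):
        self.wq, self.wk, self.wv = wq, wk, wv
        self.kv_projections = 0

    def _key_value(self, x):
        self.kv_projections += 1
        return project(x, self.wk), project(x, self.wv)

    def step_uncached(self, prefix):
        # Re-project every token in the prefix on each decode step.
        pairs = [self._key_value(x) for x in prefix]
        keys = [k for k, _ in pairs]
        values = [v for _, v in pairs]
        return attend(project(prefix[-1], self.wq), keys, values)

    def step_cached(self, x, cache):
        # Project only the newest token; earlier (key, value) pairs are reused.
        cache.append(self._key_value(x))
        keys = [k for k, _ in cache]
        values = [v for _, v in cache]
        return attend(project(x, self.wq), keys, values)
```

Decoding T tokens without the cache performs 1 + 2 + ... + T key/value projections, while the cached path performs T, and both produce the same attention output at each step.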