vLLM

A high-throughput and memory-efficient inference and serving engine

This is an exact mirror of the vLLM project, hosted at https://github.com/vllm-project/vllm. SourceForge is not affiliated with vLLM.

Downloads: 19 This Week

Last Update: 18 hours ago

vLLM is a fast and easy-to-use library for LLM inference and serving. It provides high-throughput serving with a range of decoding algorithms, including parallel sampling and beam search.

Features

State-of-the-art serving throughput
Efficient management of attention key and value memory with PagedAttention
Continuous batching of incoming requests
Optimized CUDA kernels
Seamless integration with popular HuggingFace models
Tensor parallelism support for distributed inference
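
The PagedAttention feature listed above manages attention key and value memory in fixed-size blocks rather than one contiguous buffer per request. The sketch below is a toy illustration of that bookkeeping idea only, not vLLM's implementation: the names BlockAllocator, Sequence, block_table, and slot are hypothetical and chosen for this example, and no real tensors or GPU memory are involved.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class BlockAllocator:
    """Hypothetical pool of fixed-size physical KV-cache blocks."""

    def __init__(self, num_blocks: int) -> None:
        # Every block starts out free; ids stand in for real memory regions.
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV-cache blocks")
        return self.free.pop()

    def release(self, block_ids: list[int]) -> None:
        # Finished requests hand their blocks back for reuse.
        self.free.extend(block_ids)


@dataclass
class Sequence:
    """Hypothetical per-request mapping from logical to physical blocks."""

    block_size: int
    num_tokens: int = 0
    block_table: list[int] = field(default_factory=list)

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is claimed only when the last one is full,
        # so at most one partially filled block exists per sequence.
        if self.num_tokens % self.block_size == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def slot(self, token_index: int) -> tuple[int, int]:
        """Return (physical block id, offset inside the block) for a token."""
        if not 0 <= token_index < self.num_tokens:
            raise IndexError("token index out of range")
        block = self.block_table[token_index // self.block_size]
        return block, token_index % self.block_size


if __name__ == "__main__":
    allocator = BlockAllocator(num_blocks=8)
    seq = Sequence(block_size=4)
    for _ in range(6):
        seq.append_token(allocator)
    print("blocks in use:", len(seq.block_table))
    print("free blocks left:", len(allocator.free))
```

In this sketch a six-token sequence with a block size of four occupies two blocks, and releasing them returns the pool to its full size. Because blocks need not be contiguous, memory freed by one request can be reused by any other, which is the property the feature list attributes to PagedAttention.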

License

Apache License V2.0

Additional Project Details

Programming Language

Python

Related Categories

Python Large Language Models (LLM), Python LLM Inference Tool

Registered

2023-08-21

Similar Business Software

LM-Kit.NET

LM-Kit.NET is a complete local AI runtime for .NET that lets engineering teams ship AI-powered features without cloud dependencies, per-token costs, or data leaving the network. Most .NET AI integrations stop at inference. LM-Kit.NET covers the full range of capabilities production...

Gemini Enterprise Agent Platform

Gemini Enterprise Agent Platform is a comprehensive solution from Google Cloud designed to help organizations build, scale, govern, and optimize AI agents. It represents the evolution of Vertex AI, combining advanced model development with new capabilities for agent orchestration and...

Google AI Studio

Google AI Studio is a unified development platform that helps teams explore, build, and deploy applications using Google’s most advanced AI models, including Gemini 3.5. It brings text, image, audio, and video models together in one interactive playground. With vibe coding, developers can use...

vLLM

vLLM is a high-performance library designed to facilitate efficient inference and serving of Large Language Models (LLMs). Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry. It offers...

RunPod

RunPod offers a cloud-based platform designed for running AI workloads, focusing on providing scalable, on-demand GPU resources to accelerate machine learning (ML) model training and inference. With its diverse selection of powerful GPUs like the NVIDIA A100, RTX 3090, and H100, RunPod supports...

Ministral 3B

Mistral AI introduced two state-of-the-art models for on-device computing and edge use cases, named "les Ministraux": Ministral 3B and Ministral 8B. These models set a new frontier in knowledge, commonsense reasoning, function-calling, and efficiency in the sub-10B category. They can be used or...


Recommended Projects

tiny-llm
A course on LLM inference serving on Apple Silicon
Infinity
Low-latency REST API for serving text-embeddings
RTP-LLM
Alibaba's high-performance LLM inference engine for diverse apps
Mosec
A high-performance ML model serving framework that offers dynamic batching
Text Generation Inference
Large Language Model Text Generation Inference