High-Performance LLM Inference, Serving & Training Engine
A Rust-powered engine with hybrid Mamba-3 SSM + Transformer + MoE architectures, GRPO reinforcement learning, TurboQuant KV cache compression, and advanced test-time inference — portable across CPU, CUDA, x86_64 and aarch64.
SwiftLLM combines continuous batching, PagedAttention, TurboQuant KV cache compression, and optimized CUDA kernels to deliver maximum throughput with minimum latency.
Whether you need an OpenAI-style API server, batch processing, interactive chat, RL fine-tuning with GRPO, or hybrid Mamba-3 model architectures, SwiftLLM provides a unified, production-ready solution.
Rust core via PyO3 + maturin abi3 wheels for Python 3.8–3.12. PyTorch is optional — the SDK imports cleanly without it
CPU is default — CUDA opt-in with --gpu. Ships x86_64 and aarch64 wheels for Linux and macOS
Bundle wheels, source, and models via airgap-bundle.sh. SHA256 verification + cross-architecture support
install.sh, update.sh, uninstall.sh — auto-detect GPU, switch branches/tags, purge cleanly
Constant-time API key auth, CORS from config, authenticated /metrics, CUDA error checking on every kernel launch
```python
from swiftllm import LLM, SamplingParams

# Load model with automatic GPU detection
llm = LLM("meta-llama/Llama-3-8B")

# Configure sampling
params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256,
)

# Generate with self-consistency voting
results = llm.generate_with_self_consistency(
    "What is 12 x 15?",
    num_samples=8,
)
print(results[0].answer)

# Or standard generation
outputs = llm.generate(
    ["Explain quantum computing"],
    params,
)
for output in outputs:
    print(output.text)
```
A complete inference, serving, and training stack — from a single pip install.
Continuous batching with preemption-capable scheduling. New requests slot into the next decoder step without blocking long-running generations.
Block-allocator KV cache with copy-on-write. Dynamic block tables eliminate fragmentation and enable near-100% VRAM utilization.
Draft-model acceleration with O(n) quickselect sampling. Reduces latency by validating multiple draft tokens per forward pass.
Tensor parallelism and pipeline parallelism scale seamlessly across NVIDIA GPUs. Automatic sharding with configurable --tensor-parallel.
Drop-in replacement with /v1/completions, /v1/chat/completions, streaming SSE, API key auth, CORS, and Prometheus /metrics. Default bind is 127.0.0.1 with security hardening.
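Because the endpoints are OpenAI-compatible, any standard OpenAI client can talk to a SwiftLLM server. A minimal sketch using the official `openai` Python package; the base URL, port, and model name are illustrative, and the API key must match whatever you configured on the server:

```python
# Illustrative only: point the stock `openai` client at a local SwiftLLM server.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",   # default bind is 127.0.0.1; port is an assumption
    api_key="sk-local-example",             # must match the server's configured API key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3-8B",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,                             # delivered as SSE by the server
)
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```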
HuggingFace repos, GGUF quantized models (via llama-cpp-python), SafeTensors, and PyTorch .bin formats — all with auto-detection and seamless loading.
Create portable bundles with airgap-bundle.sh — source, wheels, rustup-init, and optional models in one tarball. SHA256 verification and --arch cross-compilation.
Ships portable abi3 wheels for manylinux2014_x86_64, manylinux2014_aarch64, macosx_10_15_x86_64, and macosx_11_0_arm64. Works on Graviton, Jetson, Raspberry Pi, and Apple Silicon.
install.sh auto-detects GPU and builds from source. update.sh switches branches/tags and rebuilds. uninstall.sh cleanly removes everything with --purge or --keep-models.
LoRA/QLoRA fine-tuning, GRPO reinforcement learning, multi-format dataset ingestion, and HuggingFace Hub integration.
Memory-efficient adapter training or full parameter updates. Muon (Newton-Schulz orthogonalization), AdamW (decoupled weight decay), and SGD (Nesterov momentum) optimizers with linear, cosine, and constant LR schedulers.
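For intuition on the Muon step, the sketch below shows the commonly published quintic Newton-Schulz iteration that orthogonalizes a gradient matrix before it is applied as an update. It is a minimal NumPy illustration, not SwiftLLM's internal implementation:

```python
import numpy as np

def newton_schulz_orthogonalize(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize G via the quintic Newton-Schulz iteration (as used by Muon)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)        # scale so singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                                # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X                      # one quintic Newton-Schulz step
    return X.T if transposed else X
```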
Group Relative Policy Optimization (DeepSeekMath; Shao et al., 2024) — RL fine-tuning without a critic model. Generates G rollouts per prompt, computes group-relative advantages, and applies a PPO-style clipped policy gradient with a KL penalty.
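The group-relative advantage is just each rollout's reward standardized within its group, which is what removes the need for a critic. A small illustrative sketch (not SwiftLLM's code):

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """rewards: shape (G,), one scalar reward per rollout of the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: 4 rollouts for one prompt
advantages = group_relative_advantages(np.array([1.0, 0.0, 1.0, 0.5]))
```

These advantages are then plugged into the clipped policy-gradient objective, with the KL penalty pulling the policy back toward the reference model.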
Curriculum-Guided Adaptive Recursion — progressive-depth training from shallow to full depth, with smooth Hermite interpolation at phase boundaries. Up to 1.71x speedup vs. training at full depth from the start.
Step-level reasoning feedback via RulePrm (heuristic) or NeuralPrm (learned verifier). Five aggregation strategies: Min, Mean, Product, LastStep, WeightedMean. Blended with outcome rewards.
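The five strategies named above reduce a list of per-step PRM scores to a single scalar reward. The sketch below is illustrative only; in particular, the WeightedMean weights are an assumption:

```python
from statistics import mean

def aggregate(step_scores, strategy="Mean", weights=None):
    """Collapse per-step process-reward scores into one scalar."""
    if strategy == "Min":
        return min(step_scores)
    if strategy == "Mean":
        return mean(step_scores)
    if strategy == "Product":
        result = 1.0
        for s in step_scores:
            result *= s
        return result
    if strategy == "LastStep":
        return step_scores[-1]
    if strategy == "WeightedMean":
        weights = weights or list(range(1, len(step_scores) + 1))  # assumed: later steps weighted more
        return sum(w * s for w, s in zip(weights, step_scores)) / sum(weights)
    raise ValueError(f"unknown strategy: {strategy}")
```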
Per-token relative NLL information gain vs. a frozen reference model. Provides dense token-level rewards for long-context tasks — +9% improvement on LongBench v2 in the original paper.
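Read as "how much better the policy explains each token than the frozen reference", this presumably amounts to a per-token log-likelihood difference; a tiny sketch under that assumption:

```python
# Assumed formulation: reward_t = NLL under the frozen reference minus NLL under
# the current policy (positive when the policy explains token t better).
def token_rewards(policy_logprobs, reference_logprobs):
    return [lp_pi - lp_ref for lp_pi, lp_ref in zip(policy_logprobs, reference_logprobs)]
```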
One command converts directories of .txt, .md, .py, .rs, .pdf, .docx, .csv, .html, .jsonl and 40+ formats into JSONL. Four output schemas, SHA-256 dedup, code-aware chunking.
Pull any dataset from the Hub with --hf-dataset or HuggingFaceSource. Auto-detects Alpaca, ShareGPT, OpenAI-messages, prompt/completion, Q&A, and plain text. Streaming mode for large corpora.
Pass a directory or HuggingFace dataset to Trainer or fine_tune() — ingestion fires transparently. No separate preprocessing step needed.
Save/load checkpoints with configurable save_total_limit rotation. EarlyStoppingConfig monitors validation loss. Rolling-window metrics and perplexity tracking throughout training.
Six research-derived inference enhancements, TurboQuant memory compression, and hybrid SSM architectures — all accessible via Python API and CLI.
ICLR 2026 (Zandieh et al.) online vector quantization for KV cache compression. Random rotation via fast Walsh-Hadamard transform with deterministic sign flips, then scalar quantization using precomputed Beta-distribution Max-Lloyd codebooks at 1-4 bits per channel. TurboQuantMse (MSE-optimal) and TurboQuantProd (unbiased inner products) variants. 3-5x memory reduction.
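The rotate-then-quantize idea can be sketched in a few lines: flip signs deterministically, spread energy across channels with a fast Walsh-Hadamard transform, then quantize each channel to a few bits. The uniform quantizer below is a stand-in for the precomputed Max-Lloyd codebooks, and the whole snippet is an illustration rather than the actual kernel:

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Fast Walsh-Hadamard transform over the last axis (length must be a power of 2)."""
    x = x.copy()
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)                      # orthonormal scaling

def turboquant_sketch(kv: np.ndarray, bits: int = 4, seed: int = 0):
    """Rotate a KV block, then scalar-quantize it to `bits` bits per channel."""
    n = kv.shape[-1]
    signs = np.where(np.random.default_rng(seed).random(n) < 0.5, -1.0, 1.0)  # fixed seed -> deterministic flips
    rotated = fwht(kv * signs)                 # random rotation spreads energy across channels
    levels = 2 ** bits
    lo, step = rotated.min(), (rotated.max() - rotated.min()) / (levels - 1)
    codes = np.round((rotated - lo) / step).astype(np.uint8)   # stand-in uniform quantizer
    return codes, (lo, step, signs)            # metadata needed to dequantize later
```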
Selective SSM with MIMO multi-head scan, complex-valued states, and exponential-trapezoidal discretization. LatentMoeConfig enables compress-dispatch-expand MoE with aux-loss-free dynamic-bias load balancing (87.5% less inter-GPU traffic). Jamba-style interleaved Attention + Mamba blocks.
Generates N independent reasoning chains at temperature > 0, extracts a final answer via configurable extractor (regex, sentinel, JSON, freeform), and returns the plurality-majority answer. Ties broken by mean sequence log-probability. (Wang et al., 2022)
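The voting rule is a plurality count with the mean sequence log-probability as tie-breaker; a short illustrative version (not the Rust implementation):

```python
from collections import defaultdict

def majority_answer(samples):
    """samples: list of (extracted_answer, mean_logprob) pairs from N sampled chains."""
    groups = defaultdict(list)
    for answer, logprob in samples:
        groups[answer].append(logprob)
    # most votes wins; ties broken by the higher mean sequence log-probability
    return max(groups.items(),
               key=lambda kv: (len(kv[1]), sum(kv[1]) / len(kv[1])))[0]

print(majority_answer([("180", -1.2), ("180", -0.9), ("170", -0.5)]))  # -> "180"
```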
Iterative critique-revision loop. Each round: model critiques its own output, then produces a revised version. Stops when improvement (normalized Levenshtein edit distance) falls below threshold or round limit is reached. O(min(m,n)) 2-row DP. (Madaan et al., 2023)
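The stopping criterion relies on a two-row edit-distance DP; a sketch of that O(min(m, n))-space computation, with normalization by the longer string assumed:

```python
def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance between two revisions, normalized to [0, 1], using two rows of DP state."""
    if len(a) < len(b):
        a, b = b, a                               # keep the inner dimension small
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            curr[j] = min(prev[j] + 1,            # deletion
                          curr[j - 1] + 1,        # insertion
                          prev[j - 1] + (ca != cb))  # substitution
        prev = curr
    return prev[-1] / max(len(a), len(b), 1)
```

When this distance between consecutive revisions drops below the configured threshold, the refine loop stops early.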
Generates N candidate responses, scores each using rule-based heuristic, neural PRM, ensemble, or sequence log-prob strategy, then returns the highest-scoring candidate. Mirrors verify_and_rank() in the Rust core.
Separates compute-bound prefill and bandwidth-bound decode into dedicated, independently-scaled worker pools (Splitwise/DistServe). Round-robin, least-loaded, and locality-aware scheduling. optimal_worker_ratio() auto-sizes pools.
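One back-of-envelope way to size the two pools is to keep both phases equally saturated, so the prefill:decode worker ratio tracks the time a request spends in each phase. This is only an illustration, not the actual optimal_worker_ratio() logic:

```python
def worker_ratio(prefill_s_per_req: float, decode_s_per_req: float) -> float:
    """Rough pool sizing: prefill workers per decode worker, assuming equal saturation."""
    return prefill_s_per_req / decode_s_per_req

# e.g. 0.4 s of prefill and 1.6 s of decode per request -> 1 prefill worker per 4 decode workers
print(worker_ratio(0.4, 1.6))  # 0.25
```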
Bounded recursive self-calling for complex sub-problem decomposition. REPL sandbox with four step types: Assign, Compute, Verify, Recurse. Variable binding table via soft-attention key-value store. Complexity-classifier MLP with early exit. Modes: DISABLED, SHALLOW (depth=1), REASONING (depth=3), AGENTIC (depth=5).
Cross-attention draft-token scoring against embedded REPL execution traces. Per-token and per-step confidence scores via sigmoid projection. Strategies: SCORE_ONLY (always accept), GATE (reject below threshold), GATE_AND_REGEN (reject and regenerate up to max attempts).
Dedicated .cu kernels for Mamba-3 parallel scan, MoE sparse dispatch, dense verification cross-attention, RLM operations, and half-precision matrix multiplications — all with check_cuda_last_error() after every launch.
Run the most popular open-source model families out of the box — from 130M-parameter Mamba SSMs to 100B+ MoE hybrids. Load from HuggingFace, GGUF, or SafeTensors.
| Architecture | Models | Notes |
|---|---|---|
| LLaMA | LLaMA 3, Code Llama | |
| Mistral | Mistral 7B, Mixtral 8x7B, Devstral | Mixtral uses MoE FFN |
| Qwen | Qwen 3, Qwen 3.5 | |
| Phi | Phi-3, Phi-4 | |
| DeepSeek | R1, V4 | |
| Gemma | Gemma | |
| Mamba | Mamba-130M through Mamba-3B | Phase 1 — pure SSM |
| Jamba | Jamba-v0.1, custom hybrids | Phase 1 — Attention + Mamba + MoE |
Formats supported: HuggingFace transformers repos, GGUF quantized models (via llama-cpp-python), SafeTensors, and PyTorch .bin checkpoints with automatic detection.
Modular Rust workspace with clean separation of concerns. Each crate is independently testable; the workspace builds with cargo build --workspace.
Python SDK wraps everything via PyO3 bindings. The top-level swiftllm crate produces a cdylib wheel with abi3-py38 — one wheel covers Python 3.8 through 3.12. Release profile uses thin LTO, codegen-units = 1, and opt-level = 3 for maximum performance.
Portable abi3 wheels for Python 3.8–3.12. CPU is the default; CUDA 11.8+ is opt-in. Build on your dev laptop, deploy to Graviton, Raspberry Pi, Jetson, or Apple Silicon.
| Platform | x86_64 | aarch64 / arm64 |
|---|---|---|
| Linux | manylinux2014_x86_64 | manylinux2014_aarch64 |
| macOS | macosx_10_15_x86_64 | macosx_11_0_arm64 (M1+) |
--cpu CPU-only · --gpu force CUDA · --venv ~/sllm custom venv · --no-venv system install · --model-dir set cache · --airgap offline
airgap-bundle.sh creates portable archives with source, pip wheels, rustup-init, and optional models. --arch flag cross-compiles for different targets. SHA256 verification on all components.
update.sh --branch / --tag to switch releases and rebuild. uninstall.sh --purge removes everything or --keep-models to preserve downloads.
```bash
# Quick install (auto-detects GPU)
$ ./install.sh

# CPU-only for ARM deployment
$ ./install.sh --cpu

# Create air-gap bundle
$ ./airgap-bundle.sh --arch aarch64 \
    --model "Qwen/Qwen2.5-0.5B-Instruct-GGUF"

# Deploy on air-gapped host
$ ./install.sh --airgap

# Runtime offline mode
$ export SWIFTLLM_OFFLINE=1
$ swiftllm generate -m ./model.gguf -p "Hello"
```
From clone to production serving in four commands. Fine-tune, ingest datasets, or run advanced inference with just a few more.
Installer auto-detects GPU and CUDA toolkit, creates a Python venv, installs Rust if needed, builds the wheel with maturin, and installs [serve] extras for the API server.
Pull any supported model from HuggingFace Hub — standard weights, GGUF quantized variants, or SafeTensors checkpoints. Supports --revision pinning and HF transfer acceleration.
Batch generate, interactive swiftllm chat REPL, or pipe prompts from a file. Add --self-consistency 8 for majority voting, --best-of-n 4 for dense verification, or --rlm 3 for recursive reasoning.
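Putting those flags together, a few illustrative invocations (the model path and prompts are placeholders):

```bash
# Majority voting over 8 sampled chains
$ swiftllm generate -m ./model.gguf -p "What is 12 x 15?" --self-consistency 8

# Generate 4 candidates and return the best-scoring one
$ swiftllm generate -m ./model.gguf -p "Summarize this repo" --best-of-n 4

# Recursive reasoning with depth 3
$ swiftllm generate -m ./model.gguf -p "Plan a refactor" --rlm 3
```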
OpenAI-compatible HTTP server with API key auth (via SWIFTLLM_API_KEY), CORS, SSE streaming, /health, and Prometheus /metrics. Default bind is 127.0.0.1 for security.
Fine-tune with LoRA, run GRPO reinforcement learning, or ingest datasets from HuggingFace and local files.
LoRA adapter training with configurable rank, alpha, and dropout. Supports Muon, AdamW, and SGD optimizers with cosine/linear/constant LR scheduling.
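A hypothetical sketch of what a fine_tune() call might look like, pulling together the LoRA, optimizer, and scheduler options listed in this section; the exact argument names and import path are assumptions, so check the API docs:

```python
# Hypothetical sketch only: argument names are illustrative, not the real signature.
from swiftllm import fine_tune

fine_tune(
    model="meta-llama/Llama-3-8B",
    dataset="./data/",              # local directory or HuggingFace dataset id
    lora_rank=16,                   # illustrative LoRA hyperparameters
    lora_alpha=32,
    lora_dropout=0.05,
    optimizer="muon",               # "muon" | "adamw" | "sgd"
    lr_scheduler="cosine",          # "cosine" | "linear" | "constant"
)
```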
Convert directories of text, code, PDF, DOCX, CSV, HTML, JSON (and 40+ more formats) into JSONL training data — or pull directly from HuggingFace Hub with auto-detected schemas.
Group Relative Policy Optimization — RL fine-tuning without a critic model. Combine with CGAR curriculum scheduling, Process Reward Models, and LongR dense rewards.
Pull Alpaca, ShareGPT, OpenAI-messages, or any custom schema directly from the Hub. Streaming mode for large corpora like FineWeb and RedPajama.
SwiftLLM v2.0 ships hybrid Mamba-3 architectures, GRPO training, TurboQuant KV cache compression, and six test-time inference enhancements — all in one Rust-powered package.