High-Performance LLM Inference, Serving & Training Engine
A Rust-powered engine with hybrid Mamba-3 SSM + Transformer + MoE architectures, GRPO reinforcement learning, TurboQuant KV cache compression, and advanced test-time inference — portable across CPU, CUDA, x86_64 and aarch64.
SwiftLLM combines continuous batching, PagedAttention, TurboQuant KV cache compression, and optimized CUDA kernels to deliver maximum throughput with minimum latency.
Whether you need an OpenAI-style API server, batch processing, interactive chat, RL fine-tuning with GRPO, or hybrid Mamba-3 model architectures, SwiftLLM provides a unified, production-ready solution.
Rust core via PyO3 + maturin abi3 wheels for Python 3.8–3.12. PyTorch is optional — the SDK imports cleanly without it
CPU is default — CUDA opt-in with --gpu. Ships x86_64 and aarch64 wheels for Linux and macOS
Bundle wheels, source, and models via airgap-bundle.sh. SHA256 verification + cross-architecture support
install.sh, update.sh, uninstall.sh — auto-detect GPU, switch branches/tags, purge cleanly
Constant-time API key auth, CORS from config, authenticated /metrics, CUDA error checking on every kernel launch
```python
from swiftllm import LLM, SamplingParams

# Load model with automatic GPU detection
llm = LLM("meta-llama/Llama-3-8B")

# Configure sampling
params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256,
)

# Generate with self-consistency voting
results = llm.generate_with_self_consistency(
    "What is 12 x 15?",
    num_samples=8,
)
print(results[0].answer)

# Or standard generation
outputs = llm.generate(
    ["Explain quantum computing"],
    params,
)
for output in outputs:
    print(output.text)
```
A complete inference, serving, and training stack — from a single pip install.
Continuous batching with preemption-capable scheduling. New requests slot into the next decoder step without blocking long-running generations.
Block-allocator KV cache with copy-on-write. Dynamic block tables eliminate fragmentation and enable near-100% VRAM utilization.
Draft-model acceleration with O(n) quickselect sampling. Reduces latency by validating multiple draft tokens per forward pass.
Tensor parallelism and pipeline parallelism scale seamlessly across NVIDIA GPUs. Automatic sharding with configurable --tensor-parallel.
Drop-in replacement with /v1/completions, /v1/chat/completions, streaming SSE, API key auth, CORS, and Prometheus /metrics. Default bind is 127.0.0.1 with security hardening.
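Because the endpoints are OpenAI-compatible, any standard OpenAI client can talk to a SwiftLLM server. A minimal sketch using the official `openai` Python package; the base URL, port, and model name are illustrative, and the API key must match whatever you configured on the server:

```python
# Illustrative only: point the stock `openai` client at a local SwiftLLM server.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",   # default bind is 127.0.0.1; port is an assumption
    api_key="sk-local-example",             # must match the server's configured API key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3-8B",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,                             # delivered as SSE by the server
)
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```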
HuggingFace repos, GGUF quantized models (via llama-cpp-python), SafeTensors, and PyTorch .bin formats — all with auto-detection and seamless loading.
Create portable bundles with airgap-bundle.sh — source, wheels, rustup-init, and optional models in one tarball. SHA256 verification and --arch cross-compilation.
Ships portable abi3 wheels for manylinux2014_x86_64, manylinux2014_aarch64, macosx_10_15_x86_64, and macosx_11_0_arm64. Works on Graviton, Jetson, Raspberry Pi, and Apple Silicon.
install.sh auto-detects GPU and builds from source. update.sh switches branches/tags and rebuilds. uninstall.sh cleanly removes everything with --purge or --keep-models.
LoRA/QLoRA fine-tuning, GRPO reinforcement learning, multi-format dataset ingestion, and HuggingFace Hub integration.
Memory-efficient adapter training or full parameter updates. Muon (Newton-Schulz orthogonalization), AdamW (decoupled weight decay), and SGD (Nesterov momentum) optimizers with linear, cosine, and constant LR schedulers.
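For intuition on the Muon step, the sketch below shows the commonly published quintic Newton-Schulz iteration that orthogonalizes a gradient matrix before it is applied as an update. It is a minimal NumPy illustration, not SwiftLLM's internal implementation:

```python
import numpy as np

def newton_schulz_orthogonalize(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize G via the quintic Newton-Schulz iteration (as used by Muon)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)        # scale so singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                                # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X                      # one quintic Newton-Schulz step
    return X.T if transposed else X
```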
Group Relative Policy Optimization (DeepSeekMath; Shao et al., 2024) — RL fine-tuning without a critic model. Generates G rollouts per prompt, computes group-relative advantages, and applies a PPO-style clipped policy gradient with a KL penalty.
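The group-relative advantage is just each rollout's reward standardized within its group, which is what removes the need for a critic. A small illustrative sketch (not SwiftLLM's code):

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """rewards: shape (G,), one scalar reward per rollout of the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: 4 rollouts for one prompt
advantages = group_relative_advantages(np.array([1.0, 0.0, 1.0, 0.5]))
```

These advantages are then plugged into the clipped policy-gradient objective, with the KL penalty pulling the policy back toward the reference model.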
Curriculum-Guided Adaptive Recursion — progressive-depth training from shallow to full depth, with smooth Hermite interpolation at phase boundaries. Up to 1.71x speedup vs. training at full depth from the start.
Step-level reasoning feedback via RulePrm (heuristic) or NeuralPrm (learned verifier). Five aggregation strategies: Min, Mean, Product, LastStep, WeightedMean. Blended with outcome rewards.
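The five strategies named above reduce a list of per-step PRM scores to a single scalar reward. The sketch below is illustrative only; in particular, the WeightedMean weights are an assumption:

```python
from statistics import mean

def aggregate(step_scores, strategy="Mean", weights=None):
    """Collapse per-step process-reward scores into one scalar."""
    if strategy == "Min":
        return min(step_scores)
    if strategy == "Mean":
        return mean(step_scores)
    if strategy == "Product":
        result = 1.0
        for s in step_scores:
            result *= s
        return result
    if strategy == "LastStep":
        return step_scores[-1]
    if strategy == "WeightedMean":
        weights = weights or list(range(1, len(step_scores) + 1))  # assumed: later steps weighted more
        return sum(w * s for w, s in zip(weights, step_scores)) / sum(weights)
    raise ValueError(f"unknown strategy: {strategy}")
```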
Per-token relative NLL information gain vs. a frozen reference model. Provides dense token-level rewards for long-context tasks — +9% improvement on LongBench v2 in the original paper.
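Read as "how much better the policy explains each token than the frozen reference", this presumably amounts to a per-token log-likelihood difference; a tiny sketch under that assumption:

```python
# Assumed formulation: reward_t = NLL under the frozen reference minus NLL under
# the current policy (positive when the policy explains token t better).
def token_rewards(policy_logprobs, reference_logprobs):
    return [lp_pi - lp_ref for lp_pi, lp_ref in zip(policy_logprobs, reference_logprobs)]
```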
One command converts directories of .txt, .md, .py, .rs, .pdf, .docx, .csv, .html, .jsonl and 40+ formats into JSONL. Four output schemas, SHA-256 dedup, code-aware chunking.
Pull any dataset from the Hub with --hf-dataset or HuggingFaceSource. Auto-detects Alpaca, ShareGPT, OpenAI-messages, prompt/completion, Q&A, and plain text. Streaming mode for large corpora.
Pass a directory or HuggingFace dataset to Trainer or fine_tune() — ingestion fires transparently. No separate preprocessing step needed.
Save/load checkpoints with configurable save_total_limit rotation. EarlyStoppingConfig monitors validation loss. Rolling-window metrics and perplexity tracking throughout training.
Six research-derived inference enhancements, TurboQuant memory compression, and hybrid SSM architectures — all accessible via Python API and CLI.
ICLR 2026 (Zandieh et al.) online vector quantization for KV cache compression. Random rotation via fast Walsh-Hadamard transform with deterministic sign flips, then scalar quantization using precomputed Beta-distribution Max-Lloyd codebooks at 1-4 bits per channel. TurboQuantMse (MSE-optimal) and TurboQuantProd (unbiased inner products) variants. 3-5x memory reduction.
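The rotate-then-quantize idea can be sketched in a few lines: flip signs deterministically, spread energy across channels with a fast Walsh-Hadamard transform, then quantize each channel to a few bits. The uniform quantizer below is a stand-in for the precomputed Max-Lloyd codebooks, and the whole snippet is an illustration rather than the actual kernel:

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Fast Walsh-Hadamard transform over the last axis (length must be a power of 2)."""
    x = x.copy()
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)                      # orthonormal scaling

def turboquant_sketch(kv: np.ndarray, bits: int = 4, seed: int = 0):
    """Rotate a KV block, then scalar-quantize it to `bits` bits per channel."""
    n = kv.shape[-1]
    signs = np.where(np.random.default_rng(seed).random(n) < 0.5, -1.0, 1.0)  # fixed seed -> deterministic flips
    rotated = fwht(kv * signs)                 # random rotation spreads energy across channels
    levels = 2 ** bits
    lo, step = rotated.min(), (rotated.max() - rotated.min()) / (levels - 1)
    codes = np.round((rotated - lo) / step).astype(np.uint8)   # stand-in uniform quantizer
    return codes, (lo, step, signs)            # metadata needed to dequantize later
```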
Selective SSM with MIMO multi-head scan, complex-valued states, and exponential-trapezoidal discretization. LatentMoeConfig enables compress-dispatch-expand MoE with aux-loss-free dynamic-bias load balancing (87.5% less inter-GPU traffic). Jamba-style interleaved Attention + Mamba blocks.
Generates N independent reasoning chains at temperature > 0, extracts a final answer via configurable extractor (regex, sentinel, JSON, freeform), and returns the plurality-majority answer. Ties broken by mean sequence log-probability. (Wang et al., 2022)
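The voting rule is a plurality count with the mean sequence log-probability as tie-breaker; a short illustrative version (not the Rust implementation):

```python
from collections import defaultdict

def majority_answer(samples):
    """samples: list of (extracted_answer, mean_logprob) pairs from N sampled chains."""
    groups = defaultdict(list)
    for answer, logprob in samples:
        groups[answer].append(logprob)
    # most votes wins; ties broken by the higher mean sequence log-probability
    return max(groups.items(),
               key=lambda kv: (len(kv[1]), sum(kv[1]) / len(kv[1])))[0]

print(majority_answer([("180", -1.2), ("180", -0.9), ("170", -0.5)]))  # -> "180"
```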
Iterative critique-revision loop. Each round: model critiques its own output, then produces a revised version. Stops when improvement (normalized Levenshtein edit distance) falls below threshold or round limit is reached. O(min(m,n)) 2-row DP. (Madaan et al., 2023)
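The stopping criterion relies on a two-row edit-distance DP; a sketch of that O(min(m, n))-space computation, with normalization by the longer string assumed:

```python
def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance between two revisions, normalized to [0, 1], using two rows of DP state."""
    if len(a) < len(b):
        a, b = b, a                               # keep the inner dimension small
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            curr[j] = min(prev[j] + 1,            # deletion
                          curr[j - 1] + 1,        # insertion
                          prev[j - 1] + (ca != cb))  # substitution
        prev = curr
    return prev[-1] / max(len(a), len(b), 1)
```

When this distance between consecutive revisions drops below the configured threshold, the refine loop stops early.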
Generates N candidate responses, scores each using rule-based heuristic, neural PRM, ensemble, or sequence log-prob strategy, then returns the highest-scoring candidate. Mirrors verify_and_rank() in the Rust core.
Separates compute-bound prefill and bandwidth-bound decode into dedicated, independently-scaled worker pools (Splitwise/DistServe). Round-robin, least-loaded, and locality-aware scheduling. optimal_worker_ratio() auto-sizes pools.
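One back-of-envelope way to size the two pools is to keep both phases equally saturated, so the prefill:decode worker ratio tracks the time a request spends in each phase. This is only an illustration, not the actual optimal_worker_ratio() logic:

```python
def worker_ratio(prefill_s_per_req: float, decode_s_per_req: float) -> float:
    """Rough pool sizing: prefill workers per decode worker, assuming equal saturation."""
    return prefill_s_per_req / decode_s_per_req

# e.g. 0.4 s of prefill and 1.6 s of decode per request -> 1 prefill worker per 4 decode workers
print(worker_ratio(0.4, 1.6))  # 0.25
```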
Bounded recursive self-calling for complex sub-problem decomposition. REPL sandbox with four step types: Assign, Compute, Verify, Recurse. Variable binding table via soft-attention key-value store. Complexity-classifier MLP with early exit. Modes: DISABLED, SHALLOW (depth=1), REASONING (depth=3), AGENTIC (depth=5).
Cross-attention draft-token scoring against embedded REPL execution traces. Per-token and per-step confidence scores via sigmoid projection. Strategies: SCORE_ONLY (always accept), GATE (reject below threshold), GATE_AND_REGEN (reject and regenerate up to max attempts).
Dedicated .cu kernels for Mamba-3 parallel scan, MoE sparse dispatch, dense verification cross-attention, RLM operations, and half-precision matrix multiplications — all with check_cuda_last_error() after every launch.
Run the most popular open-source model families out of the box — from 130M-parameter Mamba SSMs to 100B+ MoE hybrids. Load from HuggingFace, GGUF, or SafeTensors.
| Architecture | Models | Notes |
|---|---|---|
| LLaMA | LLaMA 3, Code Llama | |
| Mistral | Mistral 7B, Mixtral 8x7B, Devstral | Mixtral uses MoE FFN |
| Qwen | Qwen 3, Qwen 3.5 | |
| Phi | Phi-3, Phi-4 | |
| DeepSeek | R1, V4 | |
| Gemma | Gemma | |
| Mamba | Mamba-130M through Mamba-3B | Phase 1 — pure SSM |
| Jamba | Jamba-v0.1, custom hybrids | Phase 1 — Attention + Mamba + MoE |
Formats supported: HuggingFace transformers repos, GGUF quantized models (via llama-cpp-python), SafeTensors, and PyTorch .bin checkpoints with automatic detection.
Modular Rust workspace with clean separation of concerns. Each crate is independently testable; the workspace builds with cargo build --workspace.
Python SDK wraps everything via PyO3 bindings. The top-level swiftllm crate produces a cdylib wheel with abi3-py38 — one wheel covers Python 3.8 through 3.12. Release profile uses thin LTO, codegen-units = 1, and opt-level = 3 for maximum performance.
Portable abi3 wheels for Python 3.8–3.12. CPU is the default; CUDA 11.8+ is opt-in. Build on your dev laptop, deploy to Graviton, Raspberry Pi, Jetson, or Apple Silicon.
| Platform | x86_64 | aarch64 / arm64 |
|---|---|---|
| Linux | manylinux2014_x86_64 | manylinux2014_aarch64 |
| macOS | macosx_10_15_x86_64 | macosx_11_0_arm64 (M1+) |
--cpu CPU-only · --gpu force CUDA · --venv ~/sllm custom venv · --no-venv system install · --model-dir set cache · --airgap offline
airgap-bundle.sh creates portable archives with source, pip wheels, rustup-init, and optional models. --arch flag cross-compiles for different targets. SHA256 verification on all components.
update.sh --branch / --tag to switch releases and rebuild. uninstall.sh --purge removes everything or --keep-models to preserve downloads.
```bash
# Quick install (auto-detects GPU)
$ ./install.sh

# CPU-only for ARM deployment
$ ./install.sh --cpu

# Create air-gap bundle
$ ./airgap-bundle.sh --arch aarch64 \
    --model "Qwen/Qwen2.5-0.5B-Instruct-GGUF"

# Deploy on air-gapped host
$ ./install.sh --airgap

# Runtime offline mode
$ export SWIFTLLM_OFFLINE=1
$ swiftllm generate -m ./model.gguf -p "Hello"
```
From clone to production serving in four commands. Fine-tune, ingest datasets, or run advanced inference with just a few more.
Installer auto-detects GPU and CUDA toolkit, creates a Python venv, installs Rust if needed, builds the wheel with maturin, and installs [serve] extras for the API server.
Pull any supported model from HuggingFace Hub — standard weights, GGUF quantized variants, or SafeTensors checkpoints. Supports --revision pinning and HF transfer acceleration.
Batch generate, interactive swiftllm chat REPL, or pipe prompts from a file. Add --self-consistency 8 for majority voting, --best-of-n 4 for dense verification, or --rlm 3 for recursive reasoning.
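Putting those flags together, a few illustrative invocations (the model path and prompts are placeholders):

```bash
# Majority voting over 8 sampled chains
$ swiftllm generate -m ./model.gguf -p "What is 12 x 15?" --self-consistency 8

# Generate 4 candidates and return the best-scoring one
$ swiftllm generate -m ./model.gguf -p "Summarize this repo" --best-of-n 4

# Recursive reasoning with depth 3
$ swiftllm generate -m ./model.gguf -p "Plan a refactor" --rlm 3
```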
OpenAI-compatible HTTP server with API key auth (via SWIFTLLM_API_KEY), CORS, SSE streaming, /health, and Prometheus /metrics. Default bind is 127.0.0.1 for security.
Fine-tune with LoRA, run GRPO reinforcement learning, or ingest datasets from HuggingFace and local files.
LoRA adapter training with configurable rank, alpha, and dropout. Supports Muon, AdamW, and SGD optimizers with cosine/linear/constant LR scheduling.
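A hypothetical sketch of what a fine_tune() call might look like, pulling together the LoRA, optimizer, and scheduler options listed in this section; the exact argument names and import path are assumptions, so check the API docs:

```python
# Hypothetical sketch only: argument names are illustrative, not the real signature.
from swiftllm import fine_tune

fine_tune(
    model="meta-llama/Llama-3-8B",
    dataset="./data/",              # local directory or HuggingFace dataset id
    lora_rank=16,                   # illustrative LoRA hyperparameters
    lora_alpha=32,
    lora_dropout=0.05,
    optimizer="muon",               # "muon" | "adamw" | "sgd"
    lr_scheduler="cosine",          # "cosine" | "linear" | "constant"
)
```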
Convert directories of text, code, PDF, DOCX, CSV, HTML, JSON (and 40+ more formats) into JSONL training data — or pull directly from HuggingFace Hub with auto-detected schemas.
Group Relative Policy Optimization — RL fine-tuning without a critic model. Combine with CGAR curriculum scheduling, Process Reward Models, and LongR dense rewards.
Pull Alpaca, ShareGPT, OpenAI-messages, or any custom schema directly from the Hub. Streaming mode for large corpora like FineWeb and RedPajama.
SwiftLLM v2.0 ships hybrid Mamba-3 architectures, GRPO training, TurboQuant KV cache compression, and six test-time inference enhancements — all in one Rust-powered package.