- Installation — auto-detect GPU, install Rust, build from source.
- Quick Start — your first generation in under five minutes.
- API Server — OpenAI-compatible endpoints with auth and metrics.
- Training & Fine-Tuning — LoRA, QLoRA, GRPO RL, dataset ingestion, and full fine-tuning.
Installation
SwiftLLM provides a smart installer that auto-detects your system, installs dependencies, and builds the engine from source. As of v2.0 it ships portable abi3 wheels for Linux (x86_64, aarch64) and macOS (x86_64, arm64) — a single wheel covers Python 3.8 through 3.12.
Prerequisites
- Python 3.8+ (3.10+ recommended)
- Git and a C toolchain (cc/clang)
- NVIDIA GPU with CUDA 11.8+ — optional; pass `--cpu` for CPU-only builds
- Rust toolchain — installed automatically if missing
```bash
# Clone the repository
git clone https://github.com/infrabrew/swiftllm.git
cd swiftllm

# Run the installer (auto-detects CUDA vs CPU)
bash install.sh
```
Install Options
| Flag | Description |
|---|---|
| `--cpu` | Force CPU-only build, skip CUDA detection |
| `--gpu` | Force GPU/CUDA build, fail fast if CUDA is missing |
| `--venv DIR` | Custom virtual environment location |
| `--no-venv` | Install into the current Python environment instead of creating a venv |
| `--model-dir DIR` | Set the model storage directory |
| `--airgap` | Install from a pre-built offline bundle (no network required) |
PEP 668 / externally-managed environments: On Debian 12+ and recent macOS, the installer will refuse to pip-install into the system Python. Let install.sh create a venv (the default), or pass `--venv DIR`.
Platforms
SwiftLLM ships pre-built wheels for the matrix below. Other platforms fall back to a source build, which the installer handles automatically.
| OS | Architecture | Acceleration | Wheel |
|---|---|---|---|
| Linux | x86_64 | CUDA 11.8+ / CPU | ✓ pre-built |
| Linux | aarch64 | CPU (CUDA via source) | ✓ pre-built |
| macOS | x86_64 | CPU | ✓ pre-built |
| macOS | arm64 (Apple Silicon) | CPU + Metal (experimental) | ✓ pre-built |
| Windows | x86_64 | CUDA / CPU | WSL2 recommended |
GPU requirements
- NVIDIA: Compute Capability 7.0+ (Volta and later). Tested on A100, H100, L4, RTX 4090, RTX 3090.
- VRAM: 16 GB minimum for 7B models in fp16; 24 GB+ recommended for serving with KV-cache headroom.
- Drivers: NVIDIA driver 525+ for CUDA 12, 470+ for CUDA 11.8.
Apple Silicon Metal acceleration is in preview. CPU-only mode is the default on macOS — opt in with `SWIFTLLM_BACKEND=metal`.
Quick Start
Basic Inference
```python
from swiftllm import LLM, SamplingParams

llm = LLM("meta-llama/Llama-3-8B")
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["What is machine learning?"], params)
for output in outputs:
    print(output.text)
```
Self-Consistency Voting
```python
from swiftllm import LLM
from swiftllm.config import SelfConsistencyConfig

llm = LLM("meta-llama/Llama-3-8B")
results = llm.generate_with_self_consistency(
    "What is 12 x 15?",
    config=SelfConsistencyConfig(num_samples=8),
)
print(results[0].answer)  # majority-voted answer
```
TurboQuant KV Cache Compression
```python
from swiftllm import LLM, SamplingParams, EngineConfig, TurboQuantConfig

# Enable 3-5x KV cache compression
engine_cfg = EngineConfig(
    turbo_quant=TurboQuantConfig.quality_neutral(),  # 4-bit K, 3-bit V
)
llm = LLM("meta-llama/Llama-3-8B", engine_config=engine_cfg)

outputs = llm.generate(["Summarize this article..."], SamplingParams(max_tokens=512))
```
Start an API Server
```bash
# Start with API key authentication
export SWIFTLLM_API_KEY="your-secret-key"
swiftllm serve --model meta-llama/Llama-3-8B --port 8000
```
CLI — Advanced Generation
```bash
# Self-consistency with 8 chains
swiftllm generate -p "What is 12 x 15?" --self-consistency 8

# Best-of-N dense verification
swiftllm generate -p "Prove the Pythagorean theorem" --best-of-n 4

# Recursive Language Model (depth=3)
swiftllm generate -p "Solve step by step: 2^10 mod 7" --rlm 3

# Multi-round self-refinement
swiftllm generate -p "Write a haiku about Rust" --refinement-rounds 3
```
Air-Gapped Installation
SwiftLLM supports deployment on isolated networks with no internet access. Create a portable bundle on a connected machine, then install offline. The bundle includes source, Python wheels, rustup-init, and optionally pre-downloaded models with SHA256 checksum verification.
```bash
# On a connected machine — basic bundle
./airgap-bundle.sh

# Include a model in the bundle
./airgap-bundle.sh --model "Qwen/Qwen2.5-0.5B-Instruct-GGUF:qwen2.5-0.5b-instruct-q4_k_m.gguf"

# Cross-architecture bundle (x86_64 host → ARM64 target)
./airgap-bundle.sh --arch aarch64 -o swiftllm-bundle-arm64.tar.gz

# CPU-only wheels
./airgap-bundle.sh --cpu -o /mnt/usb/swiftllm-bundle.tar.gz

# macOS Apple Silicon
./airgap-bundle.sh --arch arm64 --platform macosx_11_0_arm64
```
```bash
# On the air-gapped target
tar xzf swiftllm-airgap-bundle.tar.gz
cd swiftllm-airgap-bundle/swiftllm
./install.sh --airgap

# Runtime offline mode (disable all HF downloads)
export SWIFTLLM_OFFLINE=1
swiftllm generate -m /path/to/local/model.gguf -p "Hello"
```
The --arch flag auto-maps to the correct pip platform tag and rustup target triple for
cross-architecture bundles. Supported targets: x86_64 Linux/macOS, aarch64 Linux, arm64 macOS (Apple Silicon).
Update
The update.sh lifecycle script pulls the latest source, rebuilds the wheel, and reinstalls
the package — preserving your virtual environment and downloaded models.
```bash
# Pull latest and rebuild
./update.sh

# Switch to a specific branch
./update.sh --branch main-hybrid-rd

# Switch to a specific tag
./update.sh --tag v2.0.5

# Clean build artifacts before rebuilding
./update.sh --clean

# Rebuild from current source (skip git pull)
./update.sh --no-pull

# Force CPU-only rebuild
./update.sh --cpu
```
Update Flags
| Flag | Description |
|---|---|
| `--branch NAME` | Switch to a specific git branch before rebuilding |
| `--tag TAG` | Check out a specific release tag |
| `--clean` | Remove `target/` build artifacts before rebuilding |
| `--no-pull` | Rebuild from current source without running git pull |
| `--cpu` / `--gpu` | Force a CPU-only or GPU/CUDA rebuild |
Uninstall
The uninstall.sh lifecycle script cleanly removes SwiftLLM, with options to preserve models
and virtual environments.
```bash
# Interactive uninstall (prompts before each step)
./uninstall.sh

# Uninstall but keep downloaded models
./uninstall.sh --keep-models

# Remove everything non-interactively
./uninstall.sh --purge --yes
```
Uninstall Flags
| Flag | Description |
|---|---|
| `--keep-models` | Preserve downloaded models in `~/.cache/swiftllm/models` |
| `--keep-venv` | Preserve the Python virtual environment |
| `--purge` | Remove everything: package, model cache, venv, and build artifacts |
| `--yes` | Non-interactive mode — skip all confirmation prompts |
Caution: --purge permanently deletes all downloaded models and cached data.
Use --keep-models if you want to reinstall later without re-downloading.
Inference
SwiftLLM's inference engine is built on PagedAttention with continuous batching, speculative decoding, and KV-cache reuse — designed to serve dozens of concurrent requests on a single GPU without head-of-line blocking. v2.0 adds six research-derived test-time inference enhancements and TurboQuant KV cache compression.
Sync, async, and streaming
- `LLM` — synchronous engine for single-shot generation. Best for scripts and notebooks.
- `AsyncLLM` — async engine with `async for` token streaming. Best for serving and concurrent workloads.
- OpenAI-compatible HTTP — `swiftllm serve` exposes `/v1/completions` and `/v1/chat/completions` with SSE streaming.
Sampling parameters
| Parameter | Description |
|---|---|
| `temperature` | 0 = greedy. Higher = more random. Default 1.0. |
| `top_p` | Nucleus sampling. Default 1.0 (off). |
| `top_k` | Top-K sampling. Default 0 (off). |
| `max_tokens` | Cap on generated tokens. Required. |
| `min_p` | Minimum probability threshold. Default 0 (off). |
| `stop` | List of strings that halt generation. |
| `presence_penalty` / `frequency_penalty` | Discourage repetition. Default 0. |
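A minimal sketch combining several of these parameters (the values are illustrative, not recommended defaults):

```python
from swiftllm import LLM, SamplingParams

llm = LLM("meta-llama/Llama-3-8B")

# Nucleus sampling with a stop sequence and a repetition penalty
params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=128,
    stop=["\n\n"],
    presence_penalty=0.5,
)
outputs = llm.generate(["List three benefits of continuous batching."], params)
print(outputs[0].text)
```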
Streaming example
```python
import asyncio

from swiftllm import AsyncLLM, SamplingParams

llm = AsyncLLM("meta-llama/Llama-3-8B")
params = SamplingParams(temperature=0.7, max_tokens=256)

async def main():
    # In a notebook you can use the async for loop directly at top level
    async for token in llm.stream("Explain PagedAttention in one paragraph.", params):
        print(token, end="", flush=True)

asyncio.run(main())
```
Continuous batching means a long-running request never blocks short ones — new requests slot into the next decoder step.
Test-Time Inference Enhancements (Phase 3)
Six research-derived methods for improving output quality at inference time. Each is available as a Python API method and a CLI flag.
| Method | Python API | CLI Flag | Description |
|---|---|---|---|
| Self-Consistency | `llm.generate_with_self_consistency()` | `--self-consistency N` | Majority voting over N independent chains (Wang et al., 2022). Configurable answer extractor (regex, sentinel, JSON, freeform). Ties broken by mean log-prob. |
| Self-Refinement | `llm.generate_with_refinement()` | `--refinement-rounds N` | Iterative critique-revision loop (Madaan et al., 2023). Stops when the normalized edit-distance improvement falls below a threshold. |
| Best-of-N | `llm.generate_best_of_n()` | `--best-of-n N` | Generates N candidates, scores them via a rule-based, neural PRM, ensemble, or log-prob strategy, and returns the highest-scoring one. |
| Disaggregated Serving | `DisaggregatedServingConfig` | — | Separates prefill and decode into independent worker pools (Splitwise/DistServe). Round-robin, least-loaded, or locality-aware scheduling. |
| RLM | `llm.generate_with_rlm()` | `--rlm DEPTH` | Recursive Language Model — bounded recursive self-calling with a REPL sandbox (Assign/Compute/Verify/Recurse steps), variable binding table, and complexity-classifier MLP. |
| Dense Verification | `llm.generate_with_dense_verification()` | `--dense-verification` | Cross-attention scoring of draft tokens against the REPL trace. Per-token and per-step confidence. Strategies: SCORE_ONLY, GATE, GATE_AND_REGEN. |
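Self-Refinement and Best-of-N are not covered in the Quick Start; the sketch below assumes they take a `config=` keyword (by analogy with the self-consistency example) and that `RefinementConfig` and `VerificationConfig` accept the fields listed under Phase 3 — Inference Configs:

```python
from swiftllm import LLM
from swiftllm.config import RefinementConfig, VerificationConfig

llm = LLM("meta-llama/Llama-3-8B")

# Multi-round self-refinement: critique, then revise, up to max_rounds
refined = llm.generate_with_refinement(
    "Write a haiku about Rust",
    config=RefinementConfig(max_rounds=3),
)

# Best-of-N: sample candidates, score them, return the highest-scoring one
best = llm.generate_best_of_n(
    "Prove the Pythagorean theorem",
    config=VerificationConfig(num_candidates=4),
)
print(refined[0].text)
print(best[0].text)
```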
TurboQuant KV Cache Compression
ICLR 2026 (Zandieh et al., arXiv 2504.19874) — online vector quantization for KV cache. Random rotation via fast Walsh-Hadamard transform with deterministic sign flips, then scalar quantization using precomputed Beta-distribution Max-Lloyd codebooks at 1-4 bits per channel.
| Preset | Key Bits | Value Bits | Compression | Use Case |
|---|---|---|---|---|
| `quality_neutral()` | 4 | 3 | ~4x | Minimal quality loss — recommended default |
| `aggressive()` | 3 | 2 | ~5x | Maximum memory savings — some quality tradeoff |
| Custom | 1-4 | 1-4 | Varies | Per-use-case tuning via `TurboQuantConfig` |
Two quantization variants: TurboQuantMse (minimizes MSE, best for general use) and
TurboQuantProd (unbiased inner-product estimation via JL sign sketch, best when attention score
preservation matters more than individual vector reconstruction).
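Beyond the presets, a custom setting can be sketched as follows, assuming `TurboQuantConfig` takes the `key_bits` and `value_bits` fields listed in the Phase 3 config table:

```python
from swiftllm import LLM, EngineConfig, TurboQuantConfig

# Custom bit widths between the quality_neutral() and aggressive() presets
engine_cfg = EngineConfig(
    turbo_quant=TurboQuantConfig(key_bits=4, value_bits=2),
)
llm = LLM("meta-llama/Llama-3-8B", engine_config=engine_cfg)
```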
Recursive Language Model (RLM)
Bounded recursive self-calling with a REPL sandbox, variable binding table, and complexity-classifier MLP.
Four operating modes: DISABLED, SHALLOW (depth=1), REASONING
(depth=3, default), AGENTIC (depth=5).
```python
from swiftllm import LLM, RlmConfig, RlmMode, SamplingParams

llm = LLM(model="path/to/model")
results = llm.generate_with_rlm(
    "Prove by induction that 1+2+…+n = n(n+1)/2.",
    config=RlmConfig(mode=RlmMode.REASONING, max_depth=3, enable_repl=True,
                     early_exit_threshold=0.92),
    base_params=SamplingParams(temperature=0.7, max_tokens=768),
)

result = results[0]
print(f"Depth used: {result.recursion_depth_used}")
print(f"REPL steps: {len(result.repl_trace)}")
```
Dense Verification Layer
Cross-attention scoring of draft tokens against REPL execution trace. Per-token and per-step confidence.
Strategies: SCORE_ONLY, GATE, GATE_AND_REGEN.
```python
from swiftllm import LLM, DenseVerificationConfig, VerificationStrategy

llm = LLM(model="path/to/model")
results = llm.generate_with_dense_verification(
    "Explain Gödel's incompleteness theorems.",
    config=DenseVerificationConfig(
        strategy=VerificationStrategy.GATE_AND_REGEN,
        min_confidence=0.80,
        max_regen_attempts=3),
)

result = results[0]
print(f"Confidence: {result.global_score:.1%}")
print(f"Accepted on attempt: {result.accepted_on_attempt}")
```
API Server
SwiftLLM includes an OpenAI-compatible HTTP server built with Axum, supporting streaming, API key
authentication, CORS, and security hardening. As of v2.0 the server binds to 127.0.0.1 by
default (localhost only) to prevent accidental exposure on open networks.
Endpoints
- `POST /v1/completions` — Text completion
- `POST /v1/chat/completions` — Chat completion
- `GET /v1/models` — List loaded models
- `GET /health` — Health check (unauthenticated)
- `GET /metrics` — Prometheus metrics (authenticated — requires the API key when `SWIFTLLM_API_KEY` is set)
Security Hardening
v2.0 ships with production-grade security defaults out of the box:
- Default bind `127.0.0.1` — the server listens on localhost only. Set `SWIFTLLM_HOST=0.0.0.0` to expose it externally.
- Constant-time API key auth — set `SWIFTLLM_API_KEY` to require `Authorization: Bearer <key>` on all endpoints except `/health`. Uses constant-time comparison to prevent timing attacks.
- CORS from config — allowed origins are configured via `ServerConfig.cors_origins` or `SWIFTLLM_CORS_ORIGINS`. Defaults to same-origin only.
- Authenticated `/metrics` — Prometheus metrics are gated behind the same API key, preventing info leakage.
- Rate limiting — configurable via `tower-http` middleware. Defaults to no limit; set `SWIFTLLM_RATE_LIMIT` for a requests-per-second cap.
Launch examples
```bash
# Basic — localhost only, no auth
swiftllm serve --model meta-llama/Llama-3-8B

# Production — external, with API key and CORS
export SWIFTLLM_API_KEY="sk-your-secret-key"
export SWIFTLLM_CORS_ORIGINS="https://app.example.com"
swiftllm serve --model meta-llama/Llama-3-8B --host 0.0.0.0 --port 8000

# Verify auth is required
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer sk-your-secret-key"
```
The server is a drop-in replacement for the OpenAI API. Point any OpenAI client (Python, Node, curl) at
http://localhost:8000/v1 and it just works — including SSE streaming for chat completions.
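For example, with the official `openai` Python package (a sketch — the model name must match what the server reports under `/v1/models`):

```python
from openai import OpenAI

# Standard OpenAI client pointed at the local SwiftLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-your-secret-key",  # matches SWIFTLLM_API_KEY
)

stream = client.chat.completions.create(
    model="meta-llama/Llama-3-8B",
    messages=[{"role": "user", "content": "Explain continuous batching in one sentence."}],
    stream=True,  # served over SSE
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```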
Training & Fine-Tuning
SwiftLLM supports full training, LoRA, QLoRA fine-tuning, and GRPO reinforcement learning with three
optimizer backends: Muon (Newton-Schulz orthogonalization), AdamW
(decoupled weight decay), and SGD (optional Nesterov momentum). The CLI exposes subcommands:
swiftllm train, swiftllm finetune, and swiftllm grpo.
Simulated backend: CUDA training kernels are not yet wired up. Every
training run prints a visible [SIMULATED] banner. Real gradient training ships in a later release.
Supported Methods
| Method | Description | Memory |
|---|---|---|
| LoRA | Low-Rank Adaptation — trains small adapter matrices | Low |
| QLoRA | 4-bit quantized base model + LoRA adapters (~65-70% less VRAM) | Very Low |
| Full | Full parameter fine-tuning | High |
LoRA Fine-Tuning
```python
from swiftllm import fine_tune, LoRAConfig, TrainingConfig

lora = LoRAConfig(r=16, alpha=32, dropout=0.05)
config = TrainingConfig(learning_rate=2e-4, num_epochs=3, per_device_batch_size=4)

fine_tune(model="meta-llama/Llama-3-8B",
          train_data="./my_data.jsonl",
          lora_config=lora,
          training_config=config,
          output_dir="./output")
```
```bash
# CLI — LoRA fine-tuning
swiftllm finetune -m meta-llama/Llama-3-8B \
  --train-data ./data/train.jsonl \
  --lora-r 16 --lora-alpha 32 --learning-rate 2e-4

# CLI — QLoRA (4-bit base + LoRA)
swiftllm train -m meta-llama/Llama-3-8B \
  --train-data ./data/train.jsonl --method qlora \
  --lora-r 16 --mixed-precision bf16
```
Auto-Ingest from Files or HuggingFace
Pass a directory, file list, or HuggingFace dataset directly to fine_tune() — SwiftLLM
automatically converts it to JSONL before training. No manual dataset preparation needed.
```python
# Fine-tune from a directory (auto-ingested to JSONL)
fine_tune(model="meta-llama/Llama-3-8B",
          train_data="./my_corpus/",
          output_dir="./output",
          lora_r=16)

# Fine-tune from a HuggingFace dataset
fine_tune(model="meta-llama/Llama-3-8B",
          hf_dataset="tatsu-lab/alpaca",
          dataset_format="sft_completion",
          output_dir="./output",
          lora_r=32)

# Combine local files + HuggingFace in one command
fine_tune(model="meta-llama/Llama-3-8B",
          train_data="./my_docs/",
          hf_dataset="tatsu-lab/alpaca",
          hf_max_samples=10_000,
          dataset_format="sft_completion",
          output_dir="./output",
          lora_r=16)
```
GRPO Reinforcement Learning
Group Relative Policy Optimization — RL fine-tuning without a critic model. Includes CGAR depth curriculum (1.71x speedup), Process Reward Models (5 aggregation strategies for step-level feedback), and LongR dense token-level rewards (+9% LongBench v2).
```python
from swiftllm import GrpoTrainer, TrainingConfig
from swiftllm.config import GrpoConfig, CgarConfig, PrmConfig

config = TrainingConfig(
    model="meta-llama/Llama-3-8B",
    train_data="./data/math_prompts.jsonl",
    output_dir="./grpo_output",
    num_layers=32,
    grpo=GrpoConfig(group_size=8, kl_coeff=0.04),
    cgar=CgarConfig(),
    prm=PrmConfig(aggregation="last_step"),
)
trainer = GrpoTrainer(config)
trainer.train()
```
```bash
# GRPO with full research stack: CGAR + PRM + LongR
swiftllm grpo -m meta-llama/Llama-3-8B \
  --train-data ./data/math_prompts.jsonl \
  --group-size 8 --enable-prm --long-reward-weight 0.1 \
  --num-layers 32
```
Dataset Ingestion
Convert directories of text, code, PDF, DOCX, CSV, HTML, and JSON files into JSONL training data in one
command. Supports HuggingFace Hub datasets via HuggingFaceSource. Four output schemas:
pretraining, sft_messages, sft_completion, code.
SHA-256 deduplication across all sources.
```bash
# Local files — code fine-tuning
swiftllm dataset --input ./src/ ./docs/ --output train.jsonl --format code

# HuggingFace dataset
swiftllm dataset --hf-dataset tatsu-lab/alpaca --output alpaca.jsonl --format sft_completion

# Mixed: local files + HuggingFace combined
swiftllm dataset --input ./my_docs/ --hf-dataset tatsu-lab/alpaca \
  --hf-max-samples 10000 --format sft_completion --output combined.jsonl

# Large corpus with streaming (avoids full download)
swiftllm dataset --hf-dataset HuggingFaceFW/fineweb --hf-subset sample-10BT \
  --hf-streaming --hf-max-samples 100000 --format pretraining --output fineweb.jsonl
```
Training Data Formats
Three accepted input formats for supervised fine-tuning:
| Format | Schema | Use Case |
|---|---|---|
| Messages (JSONL) | `{"messages": [{"role": "...", "content": "..."}]}` | Chat / instruction tuning (recommended) |
| Prompt-Completion (JSONL) | `{"prompt": "...", "completion": "..."}` | Classic SFT, GRPO (add `"answer"` for RL) |
| Plain Text / CSV | One example per line, or CSV with prompt/completion columns | Language modelling pretraining |
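For concreteness, a short sketch that writes one record in each JSONL schema (file names are illustrative):

```python
import json

# Messages schema — chat / instruction tuning
chat_record = {
    "messages": [
        {"role": "user", "content": "What is 12 x 15?"},
        {"role": "assistant", "content": "180"},
    ]
}
with open("chat_train.jsonl", "w") as f:
    f.write(json.dumps(chat_record) + "\n")

# Prompt-completion schema — classic SFT; "answer" enables GRPO reward checks
sft_record = {"prompt": "What is 12 x 15?", "completion": "180", "answer": "180"}
with open("sft_train.jsonl", "w") as f:
    f.write(json.dumps(sft_record) + "\n")
```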
Advanced Inference
SwiftLLM v2.0 includes six test-time inference enhancements accessible via Python API and CLI flags. These improve output quality by spending more compute at inference time — no training required.
- Self-Consistency: `llm.generate_with_self_consistency()` — majority voting over N independent chains (Wang et al., 2022)
- Self-Refinement: `llm.generate_with_refinement()` — iterative critique→revision cycles (Madaan et al., 2023)
- Best-of-N: `llm.generate_best_of_n()` — dense scoring and reranking of N candidates
- Disaggregated Serving: `DisaggregatedServingConfig` — separate prefill/decode worker pools (Splitwise/DistServe)
- RLM: `llm.generate_with_rlm()` — recursive self-calling with a REPL sandbox and variable binding
- Dense Verification: `llm.generate_with_dense_verification()` — cross-attention token/step confidence scoring
- TurboQuant: set `EngineConfig(turbo_quant=TurboQuantConfig.quality_neutral())` for 3-5x KV cache compression
CLI Commands
The swiftllm CLI provides subcommands covering inference, serving, training, dataset ingestion, and model
management.
| Command | Description |
|---|---|
| `serve` | OpenAI-compatible API server |
| `generate` | Batch text generation with `--self-consistency`, `--refinement-rounds`, `--best-of-n`, `--rlm`, and `--dense-verification` flags |
| `chat` | Interactive REPL |
| `download` | Pull models from the HF Hub |
| `benchmark` | Throughput & TTFT tests |
| `convert` | HF ↔ SafeTensors ↔ GGUF |
| `info` | Model and system info |
| `train` | Full / LoRA / QLoRA training |
| `finetune` | LoRA convenience command |
| `grpo` | GRPO RL training with CGAR/PRM/LongR |
| `dataset` | Multi-format ingestion to JSONL with HuggingFace support |
Global Flags
| Flag | Description |
|---|---|
| `--model` | HF repo id, local path, or GGUF file |
| `--dtype` | `auto`, `float16`, `bfloat16`, `float32`, `int8`, `int4` |
| `--quantization` | `awq`, `gptq`, `bnb4`, `turboquant`, or `none` |
| `--tensor-parallel` | Number of GPUs for tensor parallelism |
| `--max-model-len` | Override the model's context length |
Python API
Inference
| Export | Description |
|---|---|
| `LLM` | Synchronous inference engine |
| `AsyncLLM` | Async engine with streaming iterators |
| `SamplingParams` | temperature, top_p, top_k, max_tokens, stop strings |
| `RequestOutput` | Per-request result wrapper |
| `EngineConfig` | gpu_memory_utilization, dtype, quantization, max_model_len |
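A sketch tying these exports together, assuming `EngineConfig` accepts the listed fields as keyword arguments:

```python
from swiftllm import LLM, SamplingParams, EngineConfig

# Cap VRAM use and context length for a shared GPU
engine_cfg = EngineConfig(
    gpu_memory_utilization=0.85,
    dtype="bfloat16",
    max_model_len=4096,
)
llm = LLM("meta-llama/Llama-3-8B", engine_config=engine_cfg)

outputs = llm.generate(["Ping?"], SamplingParams(max_tokens=16))
print(outputs[0].text)
```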
Training
| Export | Description |
|---|---|
| `Trainer` | High-level training loop with callbacks, early stopping, checkpointing |
| `TrainingConfig` | learning_rate, num_epochs, per_device_batch_size, lr_scheduler, mixed_precision |
| `LoRAConfig` | r, alpha, dropout, target_modules, use_rslora |
| `fine_tune` | One-call convenience — accepts JSONL, directory, or HuggingFace dataset |
| `GrpoTrainer` | GRPO RL training with CGAR curriculum integration |
| `GrpoConfig` | group_size, kl_coeff, clip_eps, correctness/format/length weights |
| `CgarConfig` | Curriculum-Guided Adaptive Recursion — shallow_end, medium_end phases |
| `PrmConfig` | Process Reward Model — 5 aggregation strategies (Min/Mean/Product/LastStep/WeightedMean) |
| `LongRewardConfig` | LongR dense token-level rewards — weight, aggregation, normalise |
| `grpo_train` | One-call GRPO convenience function |
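A sketch of the `Trainer` path (as opposed to the `fine_tune` one-liner); the constructor signature is assumed by analogy with `GrpoTrainer`, and any fields beyond those listed above are illustrative:

```python
from swiftllm import Trainer, TrainingConfig

# End-to-end run; remember the current loop prints a [SIMULATED] banner
config = TrainingConfig(
    model="meta-llama/Llama-3-8B",
    train_data="./data/train.jsonl",
    output_dir="./output",
    learning_rate=2e-4,
    num_epochs=3,
    per_device_batch_size=4,
    lr_scheduler="cosine",
    mixed_precision="bf16",
)
trainer = Trainer(config)
trainer.train()
```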
Dataset Ingestion
| Export | Description |
|---|---|
| `DatasetIngester` | Full-control multi-format dataset ingestion API |
| `IngestionConfig` | input_paths, output_path, format, chunk_size, hf_sources |
| `IngestionResult` | Ingestion summary — chunks, files, HF rows, skipped files |
| `HuggingFaceSource` | Describes one HuggingFace Hub dataset with auto-detected schema |
| `DatasetFormat` | Output format enum — pretraining, sft_messages, sft_completion, code |
| `ingest_dataset` | One-liner for local files, HuggingFace, or combined ingestion |
| `prepare_dataset` | Convenience shortcut available as `swiftllm.prepare_dataset()` |
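A sketch of programmatic ingestion; the `IngestionConfig` field names follow the table above, while the `HuggingFaceSource` argument, the `run()` method, and the result attributes are illustrative assumptions:

```python
from swiftllm import DatasetIngester, IngestionConfig, HuggingFaceSource, DatasetFormat

config = IngestionConfig(
    input_paths=["./my_docs/"],
    hf_sources=[HuggingFaceSource("tatsu-lab/alpaca")],  # schema auto-detected
    output_path="combined.jsonl",
    format=DatasetFormat.SFT_COMPLETION,
)

result = DatasetIngester(config).run()  # assumed entry point
print(f"Wrote {result.chunks} chunks from {result.files} files")
```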
Phase 3 — Inference Configs
| Export | Description |
|---|---|
| `SelfConsistencyConfig` | num_samples, extractor strategy, temperature |
| `RefinementConfig` | max_rounds, stopping criterion, improvement metric |
| `VerificationConfig` | num_candidates, scoring strategy, threshold |
| `DisaggregatedServingConfig` | Separate prefill/decode worker pool config |
| `RlmConfig` | max_depth, mode (DISABLED/SHALLOW/REASONING/AGENTIC) |
| `DenseVerificationConfig` | min_confidence, strategy (SCORE_ONLY/GATE/GATE_AND_REGEN) |
| `TurboQuantConfig` | key_bits, value_bits, presets: quality_neutral(), aggressive() |
Hybrid Architecture (Phase 1)
| Export | Description |
|---|---|
| `HybridModelBuilder` | Fluent builder for hybrid Mamba-3 models |
| `build_mamba3_reasoning_model` | Mamba-3 + LatentMoE + RLM + Dense Verification preset |
| `HybridModelConfig` | Top-level architecture configuration |
| `MambaConfig` | Mamba-3 SSM layer configuration |
| `LatentMoeConfig` | Latent MoE FFN configuration (87.5% less inter-GPU traffic) |
| `estimate_parameters` | Compute the total parameter count for a HybridModelConfig |
| `parameter_summary` | Formatted breakdown of parameter counts by component |
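A sketch of working with the preset; `build_mamba3_reasoning_model()` is assumed to take no required arguments and to return a `HybridModelConfig`:

```python
from swiftllm import build_mamba3_reasoning_model, estimate_parameters, parameter_summary

# Preset: Mamba-3 + LatentMoE + RLM + Dense Verification
config = build_mamba3_reasoning_model()

print(f"Estimated parameters: {estimate_parameters(config):,}")
print(parameter_summary(config))  # per-component breakdown
```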
Configuration
Every SWIFTLLM_* environment variable maps to a field in EngineConfig or
ServerConfig. Set them in your shell profile, systemd unit, or Docker Compose file — they are
read at startup and override coded defaults. Explicit constructor arguments and CLI flags always take final
precedence.
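A sketch of the precedence rule, using `SWIFTLLM_GPU_MEMORY_UTILIZATION` from the table below (set in-process here purely for illustration; normally you would export it in your shell profile or unit file):

```python
import os

from swiftllm import LLM, EngineConfig

# Environment variable — picked up at engine startup as the default
os.environ["SWIFTLLM_GPU_MEMORY_UTILIZATION"] = "0.80"

# Explicit constructor argument — takes final precedence over the env var
llm = LLM(
    "meta-llama/Llama-3-8B",
    engine_config=EngineConfig(gpu_memory_utilization=0.95),
)
```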
GPU & Memory
| Variable | Default | Description |
|---|---|---|
| `SWIFTLLM_GPU_MEMORY_UTILIZATION` | 0.90 | Fraction of GPU VRAM for model weights + KV cache (0.0–1.0). Raise to ~0.95 on dedicated hosts. |
| `SWIFTLLM_GPU_OVERHEAD_MB` | 0 | VRAM (in MB) to reserve for the OS and other processes. |
| `SWIFTLLM_NUM_GPU_LAYERS` | all | Number of layers to offload to GPU. 0 = CPU-only, 999 = all. |
| `SWIFTLLM_SWAP_SPACE` | 4.0 | CPU swap space in GiB for KV cache offloading. |
| `SWIFTLLM_CPU_OFFLOAD_GB` | 0.0 | Model weight gigabytes to keep in CPU RAM instead of GPU. |
| `SWIFTLLM_KV_CACHE_DTYPE` | auto | Data type for the KV cache. fp8_e4m3/fp8_e5m2 halves memory. |
| `SWIFTLLM_BLOCK_SIZE` | 16 | Tokens per PagedAttention block. Allowed: 8, 16, 32. |
| `SWIFTLLM_FLASH_ATTENTION` | true | Enable FlashAttention kernels. |
| `SWIFTLLM_ENFORCE_EAGER` | false | Disable CUDA graph capture; use eager execution. |
| `CUDA_VISIBLE_DEVICES` | (all) | Restrict which GPUs are visible, e.g. 0,2. |
Tensor Parallelism & Multi-GPU
| Variable | Default | Description |
|---|---|---|
| `SWIFTLLM_TENSOR_PARALLEL_SIZE` | 1 | GPUs for tensor parallelism. Must evenly divide the number of attention heads. |
| `SWIFTLLM_PIPELINE_PARALLEL_SIZE` | 1 | Pipeline-parallel stages. |
| `NCCL_DEBUG` | — | NCCL logging level: INFO, WARN, TRACE. |
| `NCCL_P2P_DISABLE` | 0 | Set to 1 to disable GPU peer-to-peer on certain PCIe topologies. |
Scheduling & Batching
| Variable | Default | Description |
|---|---|---|
| `SWIFTLLM_MAX_NUM_SEQS` | 256 | Maximum concurrent sequences in a batch. |
| `SWIFTLLM_MAX_NUM_BATCHED_TOKENS` | 8192 | Maximum total tokens per forward pass. |
| `SWIFTLLM_MAX_PADDINGS` | 256 | Maximum padding tokens tolerated per batch. |
| `SWIFTLLM_SCHEDULER_POLICY` | fcfs | fcfs, sjf, or priority. |
| `SWIFTLLM_PREEMPTION_MODE` | swap | swap (KV cache to CPU) or recompute. |
| `SWIFTLLM_ENABLE_PREFIX_CACHING` | false | Reuse KV cache across requests sharing the same prefix. |
| `SWIFTLLM_ENABLE_CHUNKED_PREFILL` | false | Interleave prefill and decode; reduces time-to-first-token. |
| `SWIFTLLM_NUM_PARALLEL` | 1 | Parallel inference slots per model. |
| `SWIFTLLM_MAX_LOADED_MODELS` | 1 | Models held in GPU memory simultaneously. |
| `SWIFTLLM_KEEP_ALIVE` | 300 | Seconds a model stays loaded after the last request. |
Speculative Decoding
| Variable | Default | Description |
|---|---|---|
| `SWIFTLLM_SPECULATIVE_MODEL` | — | Draft model for speculative decoding. |
| `SWIFTLLM_NUM_SPECULATIVE_TOKENS` | 5 | Tokens to draft per step. |
| `SWIFTLLM_SPECULATIVE_MAX_MODEL_LEN` | — | Override the max sequence length for the draft model. |
Model & Weights
| Variable | Default | Description |
|---|---|---|
| `SWIFTLLM_MODEL_DIR` | ~/.cache/swiftllm/models | Default directory for downloaded models. |
| `SWIFTLLM_OFFLINE` | false | Set to 1 to disable all network downloads (air-gapped mode). |
| `SWIFTLLM_DTYPE` | auto | Weight data type: auto, float16, bfloat16, float32, int8, int4, fp8_e4m3, fp8_e5m2. |
| `SWIFTLLM_QUANTIZATION` | none | Quantization method: none, awq, gptq, squeezellm, gguf. |
| `SWIFTLLM_MAX_MODEL_LEN` | (model default) | Override the model's max sequence length. |
| `SWIFTLLM_TRUST_REMOTE_CODE` | false | Allow executing custom code from HuggingFace repos. |
| `SWIFTLLM_DEVICE` | auto | Device: auto, cuda, cpu, metal, rocm. |
| `SWIFTLLM_SEED` | 0 | Global random seed. |
| `HF_TOKEN` | — | HuggingFace API token for gated models. |
LoRA
| Variable | Default | Description |
|---|---|---|
| `SWIFTLLM_ENABLE_LORA` | false | Enable LoRA adapter support in the inference engine. |
| `SWIFTLLM_MAX_LORAS` | 1 | Maximum LoRA adapters loaded simultaneously. |
| `SWIFTLLM_MAX_LORA_RANK` | 16 | Maximum LoRA rank. |
Server & Networking
| Variable | Default | Description |
|---|---|---|
| `SWIFTLLM_HOST` | 127.0.0.1 | HTTP bind address. Localhost by default for security; set 0.0.0.0 to expose externally. |
| `SWIFTLLM_PORT` | 8000 | Listen port. |
| `SWIFTLLM_API_KEY` | — | Bearer-token API key (constant-time comparison). |
| `SWIFTLLM_CORS_ALLOW_ORIGINS` | * | Comma-separated allowed CORS origins. |
| `SWIFTLLM_SSL_CERTFILE` | — | Path to the TLS certificate for HTTPS. |
| `SWIFTLLM_SSL_KEYFILE` | — | Path to the TLS private key for HTTPS. |
| `SWIFTLLM_ROOT_PATH` | — | URL prefix for reverse-proxy deployments. |
| `SWIFTLLM_SERVED_MODEL_NAME` | — | Override the model name in API responses. |
| `SWIFTLLM_MAX_LOG_LEN` | — | Truncate request/response logs to this many characters. |
| `SWIFTLLM_RESPONSE_ROLE` | assistant | Default role name in chat completion responses. |
| `SWIFTLLM_METRICS_ENABLED` | true | Expose Prometheus /metrics (authenticated when an API key is set). |
Build & CUDA
| Variable | Default | Description |
|---|---|---|
| `CUDA_PATH` / `CUDA_HOME` | — | Path to the CUDA toolkit. |
| `CUDACXX` | — | Path to the nvcc binary. |
| `CMAKE_ARGS` | — | Extra CMake arguments for the llama-cpp-python build. |
Logging & Debug
| Variable | Default | Description |
|---|---|---|
| `RUST_LOG` | info | Rust log level: trace, debug, info, warn, error. |
| `SWIFTLLM_LOG_LEVEL` | — | Python log level: DEBUG, INFO, WARNING, ERROR. |
Supported Models
Ten model families supported out of the box, with auto-detection for HuggingFace, GGUF, SafeTensors, and PyTorch formats. Phase 1 research models (Mamba, Jamba) are available when building with hybrid architecture support.
| Model Family | Variants | Notes |
|---|---|---|
| LLaMA 3 | 8B, 70B, 405B | Meta's latest flagship. Full tensor-parallel support for 70B/405B. |
| Code Llama | 7B, 13B, 34B | Code-specialized. Python, Instruct, and base variants. |
| Mistral / Mixtral / Devstral | 7B, 8x7B, Devstral | Dense (7B), MoE (Mixtral 8x7B), and code-agent (Devstral) variants. |
| Qwen 3 / 3.5 | 0.6B – 235B | Alibaba multilingual. MoE and dense variants. |
| Phi-3 / Phi-4 | Mini, Small, Medium | Microsoft small-language models optimized for edge deployment. |
| Deepseek R1 / V4 | R1-Distill, V4-0324 | Reasoning-focused. R1 chain-of-thought and V4 MoE architectures. |
| Gemma | 2B, 7B | Google open models. Instruct and base variants. |
| Mamba | 130M – 2.8B | Phase 1 — SSM backbone. Requires hybrid architecture build. |
| Jamba | Hybrid configs | Phase 1 — Mamba + Attention + MoE hybrid (AI21 architecture). |
All models support HuggingFace (SafeTensors/PyTorch), GGUF, and raw SafeTensors formats. Format is auto-detected
by file extension. Use swiftllm convert to move between formats.
Get help
Three channels, tuned to different needs. Community-first, paid support available for production deployments.
- GitHub Discussions — ask questions, share benchmarks, and trade tips with the community.
- Issue Tracker — found a bug or want to request a feature? Open an issue with a minimal repro.
- Enterprise Support — SLAs, private deployments, and architecture reviews for production workloads.
Frequently asked questions
- GPU or CPU? Pass `--gpu` to force a CUDA build, or run on a machine with `nvidia-smi` detected. CPU wheels ship for x86_64 and aarch64 on both Linux and macOS, plus arm64 on Apple Silicon.
- Model formats: HuggingFace, GGUF, and SafeTensors are supported; use `swiftllm convert` to move between formats.
- Is training real yet? No — every run prints a `[SIMULATED]` banner. Real GPU gradient training is planned for a future release — follow the changelog for updates.
- Air-gapped installs: run `airgap-bundle.sh` on a connected machine to build a portable tarball with source, Python wheels, the Rust installer, and optional pre-downloaded models. SHA256 checksums verify integrity on the target.
- Benchmarking: run `swiftllm benchmark --concurrency 32 --num-requests 1000`. We consistently see competitive tokens/sec on LLaMA-3-8B with PagedAttention + continuous batching enabled. See the changelog for tested throughput per release.
- OpenAI compatibility: the server exposes `/v1/completions` and `/v1/chat/completions`, including streaming via SSE. Point any OpenAI client (Python, Node, curl) at `http://localhost:8000/v1` and it just works. API key auth, CORS, and Prometheus metrics are built in.
Troubleshooting
Installer fails with "externally-managed environment"
You're on Debian 12+, Ubuntu 23.04+, or recent macOS. Let the installer create a venv (the default), pass
--venv ~/.swiftllm, or activate your own environment before running.
nvidia-smi detected but CUDA build fails
Ensure nvcc is on your PATH (which nvcc). Driver version must be compatible with
CUDA 11.8+. On Ubuntu, the nvidia-cuda-toolkit package is often too old — install CUDA directly
from NVIDIA.
Training shows [SIMULATED]
This is expected behaviour — the training loop is a simulation stub. Full GPU training requires PyTorch and a CUDA environment. See the Training section.
Model downloads hang
Set HF_HUB_ENABLE_HF_TRANSFER=1 for the accelerated downloader, or pass
--revision main to pin the reference. For air-gapped installs, set
SWIFTLLM_OFFLINE=1 and pre-populate the cache.
Release history
Track every update, feature, and fix as SwiftLLM evolves.
Stable release — all beta features promoted to production. Hybrid Mamba-3, GRPO training, TurboQuant KV cache compression, and advanced test-time inference.
- Added: Hybrid Mamba-3 SSM + Transformer + MoE architecture (Phase 1) — selective SSM with MIMO multi-head scan, LatentMoE with dynamic-bias load balancing, Jamba-style hybrid blocks
- Added: GRPO / CGAR / PRM / LongReward training pipelines (Phase 2) — reinforcement learning fine-tuning without a critic model
- Added: Self-consistency, refinement, best-of-N, disaggregated serving, Recursive Language Model, and Dense Verification (Phase 3)
- Added: TurboQuant KV cache compression (ICLR 2026) — Walsh-Hadamard rotation + Max-Lloyd codebooks for 3-5x memory reduction
- Added: Lifecycle scripts (`install.sh`, `update.sh`, `uninstall.sh`) for full install management
- Added: Comprehensive static analysis sweep — zero Clippy warnings, logic bug fixes
- Added: 329 backend tests and 66 frontend tests — all passing
TurboQuant KV cache compression — online vector quantization from ICLR 2026.
- Added: Full TurboQuant implementation (Zandieh et al., arXiv 2504.19874) — random rotation via fast Walsh-Hadamard transform, precomputed Max-Lloyd codebooks for 1-4 bit scalar quantization
- Added: `TurboQuantMse` (MSE-optimal) and `TurboQuantProd` (unbiased inner products via JL sign sketch) variants
- Added: `TurboQuantConfig` in Rust and Python with `quality_neutral()` and `aggressive()` presets
- Added: `TurboQuantKvCache` — compressed KV cache layer with slot-based store/load/clear
- Added: 25 Rust + 16 Python tests covering codebook construction, rotation invertibility, roundtrip quality, memory stats
Code quality & zero-warning build, lifecycle scripts, comprehensive static analysis and bug fixes.
- Fixed: Resolved all 130+ compiler warnings across 4 workspace crates — dead code annotations, doc comments, unused imports
- Fixed: Logic bug in `CurriculumScheduler::ssm_lr_scale()` — identical if/else branches; replaced with correct linear inverse mapping
- Fixed: Incorrect `#[cfg(has_cuda)]` in PyO3 bindings — changed to proper `#[cfg(feature = "cuda")]`
- Added: `update.sh` and `uninstall.sh` lifecycle scripts with branch/tag switching and purge options
- Fixed: Python SDK imports cleanly without PyTorch — `torch_model` features guarded with `try/except ImportError`
- Added: 304 Rust tests + 50 Python tests — all passing with zero actionable warnings
Security audit, CUDA acceleration kernels, and PyTorch integration.
- Security (Critical): Added `Drop` for CUDA storage, `check_cuda_last_error()` after all kernel launches, constant-time API key comparison
- Security (High): CORS from config, authenticated `/metrics`, default bind `127.0.0.1`, input validation on legacy endpoints
- Added: Custom CUDA kernels: `mamba3_scan.cu`, `moe_dispatch.cu`, `dense_verif_attn.cu`, `rlm_ops.cu`, `linear_f16.cu`
- Added: `hybrid_model.py` and `torch_model.py` — PyTorch `nn.Module` bridge for GPU-executable hybrid models
- Fixed: Integer overflow in block sizing, TOCTOU race in scheduler, UTF-8 slicing panic in repr
Phase 1 hybrid architectures, Phase 2 training enhancements, Phase 3 inference and model-level reasoning.
- Added: Phase 1: `mamba.rs`, `moe.rs`, `jamba.rs` — Mamba SSM with MIMO scan, sparse MoE routing, Jamba hybrid blocks
- Added: Phase 2: `grpo.rs`, `curriculum.rs`, `process_reward.rs`, `long_reward.rs` — GRPO RL, CGAR curriculum, PRM step scoring, LongR dense rewards
- Added: Phase 3: Self-consistency voting, multi-round self-refinement, best-of-N verification, disaggregated prefill/decode serving
- Added: Phase 3 Model: `rlm.rs` (Recursive Language Model with REPL sandbox) and `dense_verification.rs` (cross-attention draft scoring)
- Added: 14 new Python config dataclasses, CLI flags for self-consistency/refinement/best-of-N/RLM/dense-verification, 4 new example scripts
HuggingFace dataset support and multi-format dataset ingestion pipeline.
- Added: `HuggingFaceSource` — auto-detects Alpaca, ShareGPT, OpenAI messages, prompt/completion, Q&A, and plain text schemas
- Added: Multi-format ingestion: `.txt`, `.md`, `.py`, `.rs`, `.pdf`, `.docx`, `.csv`, `.html`, `.jsonl` and 40+ more into JSONL
- Added: 4 output schemas: pretraining, sft_messages, sft_completion, code — with SHA-256 dedup and code-aware chunking
- Added: CLI `swiftllm dataset` subcommand with 14 `--hf-*` flags; streaming mode for large corpora
- Added: Auto-ingest in `Trainer` and `fine_tune()` — directories and HF datasets converted transparently
Training UX fixes — simulated-stub banner, train-data path validation, and full regression coverage.
- Fixed: `Trainer.train()` now prints a visible `[SIMULATED]` banner making it obvious the current loop is a stub (no weights, no gradients) rather than real training
- Fixed: `swiftllm train` and `swiftllm finetune` validate `--train-data` / `--eval-data` paths up front — missing, non-regular, or empty files now fail fast with clear errors
- Changed: Validation covers paths loaded from `--config` JSON so typos in saved configs surface immediately
- Added: Full regression matrix: install → download → generate (18.61 tok/s) → finetune (LoRA) → train on Ubuntu 24.04 + CUDA 13.0
CPU & ARM wheel support, installer portability, and critical Rust-side safety fixes.
- Added: CPU-only build is now the default — portable wheel with zero CUDA dependencies, buildable on Apple Silicon, Graviton, Raspberry Pi, Jetson, Ampere
- Added: Top-level `swiftllm` crate exposes `cpu` and `cuda` Cargo features; CUDA opt-in via `./install.sh --gpu`
- Added: `airgap-bundle.sh --arch` flag auto-maps to pip platform tags and rustup target triples for cross-architecture bundles
- Added: `[serve]` optional dependency (fastapi, uvicorn) wired into `install.sh` and the airgap bundle
- Fixed: PEP 668 handling on Ubuntu 23.04+ / Debian 12+ externally-managed Python
- Fixed: Replaced non-portable `grep -oP` with `sed` for CUDA detection (macOS/BSD compatible)
- Security: Fixed 14 `partial_cmp().unwrap()` calls — replaced with `unwrap_or(Ordering::Equal)` to prevent NaN panics
- Security: Added `checked_add` bounds validation in GGUF loader — malformed files can no longer cause slice panics
Air-gapped installation and comprehensive security hardening.
- Added: `airgap-bundle.sh` creates portable install archives (source, pip wheels, rustup-init, optional models)
- Added: `install.sh --airgap` flag for fully offline installation from a bundle
- Added: Runtime offline mode via `SWIFTLLM_OFFLINE=1` — disables all HF downloads
- Security (Critical): Fixed JSON injection in SSE streaming — replaced raw `format!()` with `serde_json::json!()`
- Security (Critical): Fixed shell injection in `airgap-bundle.sh` — model names now passed via `sys.argv`
- Security (Critical): Added SHA256 checksum verification for downloaded `rustup-init`
- Security: Path validation against directory traversal; symlink traversal protection in offline cache
- Fixed: Use-after-move in `engine.rs`; usize negation in `trainer.rs`; missing `mut` on `eval_data`
Training & fine-tuning infrastructure, engine optimization, scheduler improvements.
- Added: `swiftllm-training` Rust crate with LoRA / QLoRA / full fine-tuning
- Added: Muon optimizer with Newton-Schulz orthogonalization, plus AdamW and SGD with linear/cosine/constant LR schedulers
- Added: Dataset loading (JSONL/CSV/text) with instruction templates; rolling-window metrics; perplexity
- Added: Checkpoint save/load with `save_total_limit`; Python `Trainer` with callbacks and `EarlyStoppingConfig`
- Added: O(n) top-k / beam / logprobs selection via quickselect; `/metrics` JSON and Prometheus endpoints
- Fixed: Eliminated redundant read-then-write lock in sampling hot path; numerically stable log-softmax
- Fixed: O(n) victim selection for preemption scheduling; gradient clipping and LoRA scaling
Initial release — Rust core, PagedAttention, continuous batching, OpenAI API, Python SDK.
- Added: Complete Rust rewrite with 5 modular crates (core, models, cuda, server, training)
- Added: PagedAttention memory management: block allocator with copy-on-write
- Added: Continuous batching scheduler with preemption (swap / recompute)
- Added: Token sampling: greedy, temperature, top-k, top-p, min-p, repetition penalty
- Added: OpenAI-compatible HTTP API built with Axum; Python SDK (`LLM`, `AsyncLLM`, `SamplingParams`)
- Added: Speculative decoding with draft-model acceleration; multi-GPU tensor and pipeline parallelism