Documentation · v2.0

Everything you need to ship.

Install, configure, and run SwiftLLM for LLM inference, serving, and training — on Linux or macOS, x86_64 or ARM64.


Installation

SwiftLLM provides a smart installer that auto-detects your system, installs dependencies, and builds the engine from source. As of v2.0 it ships portable abi3 wheels for Linux (x86_64, aarch64) and macOS (x86_64, arm64) — a single wheel covers Python 3.8 through 3.12.

Prerequisites

  • Python 3.8+ (3.10+ recommended)
  • Git and a C toolchain (cc/clang)
  • NVIDIA GPU with CUDA 11.8+ — optional; pass --cpu for CPU-only builds
  • Rust toolchain — installed automatically if missing
Shell
# Clone the repository
git clone https://github.com/infrabrew/swiftllm.git
cd swiftllm

# Run the installer (auto-detects CUDA vs CPU)
bash install.sh

Install Options

Flag Description
--cpu Force CPU-only build, skip CUDA detection
--gpu Force GPU/CUDA build, fail fast if CUDA is missing
--venv DIR Custom virtual environment location
--no-venv Install into current Python environment instead of creating a venv
--model-dir DIR Set model storage directory
--airgap Install from a pre-built offline bundle (no network required)

PEP 668 / externally-managed environments: On Debian 12+ and recent macOS, the installer will refuse to pip-install into the system Python. Let install.sh create a venv, or pass --venv.

Platforms

SwiftLLM ships pre-built wheels for the matrix below. Other platforms fall back to a source build, which the installer handles automatically.

OS Architecture Acceleration Wheel
Linux x86_64 CUDA 11.8+ / CPU ✓ pre-built
Linux aarch64 CPU (CUDA via source) ✓ pre-built
macOS x86_64 CPU ✓ pre-built
macOS arm64 (Apple Silicon) CPU + Metal (experimental) ✓ pre-built
Windows x86_64 CUDA / CPU WSL2 recommended

GPU requirements

  • NVIDIA: Compute Capability 7.0+ (Volta and later). Tested on A100, H100, L4, RTX 4090, RTX 3090.
  • VRAM: 16 GB minimum for 7B models in fp16; 24 GB+ recommended for serving with KV-cache headroom.
  • Drivers: NVIDIA driver 525+ for CUDA 12, 470+ for CUDA 11.8.

Apple Silicon Metal acceleration is in preview. CPU-only mode is the default on macOS — opt in with SWIFTLLM_BACKEND=metal.

Quick Start

Basic Inference

Python
from swiftllm import LLM, SamplingParams

llm = LLM("meta-llama/Llama-3-8B")
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["What is machine learning?"], params)
for output in outputs:
    print(output.text)

Self-Consistency Voting

Python
from swiftllm import LLM
from swiftllm.config import SelfConsistencyConfig

llm = LLM("meta-llama/Llama-3-8B")
results = llm.generate_with_self_consistency(
    "What is 12 x 15?",
    config=SelfConsistencyConfig(num_samples=8),
)
print(results[0].answer)  # majority-voted answer

TurboQuant KV Cache Compression

Python
from swiftllm import LLM, SamplingParams, EngineConfig, TurboQuantConfig

# Enable 3-5x KV cache compression
engine_cfg = EngineConfig(
    turbo_quant=TurboQuantConfig.quality_neutral(),  # 4-bit K, 3-bit V
)
llm = LLM("meta-llama/Llama-3-8B", engine_config=engine_cfg)
outputs = llm.generate(["Summarize this article..."],
    SamplingParams(max_tokens=512))

Start an API Server

Shell
# Start with API key authentication
export SWIFTLLM_API_KEY="your-secret-key"
swiftllm serve --model meta-llama/Llama-3-8B --port 8000

CLI — Advanced Generation

Shell
# Self-consistency with 8 chains
swiftllm generate -p "What is 12 x 15?" --self-consistency 8

# Best-of-N dense verification
swiftllm generate -p "Prove the Pythagorean theorem" --best-of-n 4

# Recursive Language Model (depth=3)
swiftllm generate -p "Solve step by step: 2^10 mod 7" --rlm 3

# Multi-round self-refinement
swiftllm generate -p "Write a haiku about Rust" --refinement-rounds 3

Air-Gapped Installation

SwiftLLM supports deployment on isolated networks with no internet access. Create a portable bundle on a connected machine, then install offline. The bundle includes source, Python wheels, rustup-init, and optionally pre-downloaded models with SHA256 checksum verification.

Shell
# On a connected machine — basic bundle
./airgap-bundle.sh

# Include a model in the bundle
./airgap-bundle.sh --model "Qwen/Qwen2.5-0.5B-Instruct-GGUF:qwen2.5-0.5b-instruct-q4_k_m.gguf"

# Cross-architecture bundle (x86_64 host → ARM64 target)
./airgap-bundle.sh --arch aarch64 -o swiftllm-bundle-arm64.tar.gz

# CPU-only wheels
./airgap-bundle.sh --cpu -o /mnt/usb/swiftllm-bundle.tar.gz

# macOS Apple Silicon
./airgap-bundle.sh --arch arm64 --platform macosx_11_0_arm64
Shell
# On the air-gapped target
tar xzf swiftllm-airgap-bundle.tar.gz
cd swiftllm-airgap-bundle/swiftllm
./install.sh --airgap

# Runtime offline mode (disable all HF downloads)
export SWIFTLLM_OFFLINE=1
swiftllm generate -m /path/to/local/model.gguf -p "Hello"

The --arch flag auto-maps to the correct pip platform tag and rustup target triple for cross-architecture bundles. Supported targets: x86_64 Linux/macOS, aarch64 Linux, arm64 macOS (Apple Silicon).

Update

The update.sh lifecycle script pulls the latest source, rebuilds the wheel, and reinstalls the package — preserving your virtual environment and downloaded models.

bash
# Pull latest and rebuild
./update.sh

# Switch to a specific branch
./update.sh --branch main-hybrid-rd

# Switch to a specific tag
./update.sh --tag v2.0.5

# Clean build artifacts before rebuilding
./update.sh --clean

# Rebuild from current source (skip git pull)
./update.sh --no-pull

# Force CPU-only rebuild
./update.sh --cpu

Update Flags

Flag Description
--branch NAME Switch to a specific git branch before rebuilding
--tag TAG Check out a specific release tag
--clean Remove target/ build artifacts before rebuilding
--no-pull Rebuild from current source without running git pull
--cpu / --gpu Force CPU-only or GPU/CUDA rebuild

Uninstall

The uninstall.sh lifecycle script cleanly removes SwiftLLM, with options to preserve models and virtual environments.

bash
# Interactive uninstall (prompts before each step)
./uninstall.sh

# Uninstall but keep downloaded models
./uninstall.sh --keep-models

# Remove everything non-interactively
./uninstall.sh --purge --yes

Uninstall Flags

Flag Description
--keep-models Preserve downloaded models in ~/.cache/swiftllm/models
--keep-venv Preserve the Python virtual environment
--purge Remove everything: package, model cache, venv, and build artifacts
--yes Non-interactive mode — skip all confirmation prompts

Caution: --purge permanently deletes all downloaded models and cached data. Use --keep-models if you want to reinstall later without re-downloading.

Inference

SwiftLLM's inference engine is built on PagedAttention with continuous batching, speculative decoding, and KV-cache reuse — designed to serve dozens of concurrent requests on a single GPU without head-of-line blocking. v2.0 adds six research-derived test-time inference enhancements and TurboQuant KV cache compression.

Sync, async, and streaming

  • LLM — synchronous engine for single-shot generation. Best for scripts and notebooks.
  • AsyncLLM — async engine with async for token streaming. Best for serving and concurrent workloads.
  • OpenAI-compatible HTTP — swiftllm serve exposes /v1/completions and /v1/chat/completions with SSE streaming.

Sampling parameters

Parameter Description
temperature 0 = greedy. Higher = more random. Default 1.0.
top_p Nucleus sampling. Default 1.0 (off).
top_k Top-K sampling. Default 0 (off).
max_tokens Cap on generated tokens. Required.
min_p Minimum probability threshold. Default 0 (off).
stop List of strings to halt generation.
presence_penalty / frequency_penalty Discourage repetition. Default 0.
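
A quick illustration combining several of these parameters (the values are arbitrary; the keyword names come from the table above):

Python
from swiftllm import SamplingParams

params = SamplingParams(
    temperature=0.7,       # 0 would be greedy decoding
    top_p=0.9,             # nucleus sampling
    max_tokens=256,        # required cap on generated tokens
    stop=["\n\n"],         # halt on a blank line
    presence_penalty=0.5,  # discourage repetition
)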

Streaming example

Python
import asyncio

from swiftllm import AsyncLLM, SamplingParams

async def main():
    llm = AsyncLLM("meta-llama/Llama-3-8B")
    params = SamplingParams(temperature=0.7, max_tokens=256)

    # Tokens are yielded as soon as they are decoded
    async for token in llm.stream("Explain PagedAttention in one paragraph.", params):
        print(token, end="", flush=True)

asyncio.run(main())

Continuous batching means a long-running request never blocks short ones — new requests slot into the next decoder step.

Test-Time Inference Enhancements (Phase 3)

Six research-derived methods for improving output quality at inference time. Each is available as a Python API method and a CLI flag.

Method Python API CLI Flag Description
Self-Consistency llm.generate_with_self_consistency() --self-consistency N Majority voting over N independent chains (Wang et al., 2022). Configurable answer extractor (regex, sentinel, JSON, freeform). Ties broken by mean log-prob.
Self-Refinement llm.generate_with_refinement() --refinement-rounds N Iterative critique-revision loop (Madaan et al., 2023). Stops when normalized edit distance improvement falls below threshold.
Best-of-N llm.generate_best_of_n() --best-of-n N Generates N candidates, scores via rule-based, neural PRM, ensemble, or log-prob strategy, returns highest-scoring.
Disaggregated Serving DisaggregatedServingConfig Separates prefill and decode into independent worker pools (Splitwise/DistServe). Round-robin, least-loaded, or locality-aware scheduling.
RLM llm.generate_with_rlm() --rlm DEPTH Recursive Language Model — bounded recursive self-calling with REPL sandbox (Assign/Compute/Verify/Recurse steps), variable binding table, and complexity-classifier MLP.
Dense Verification llm.generate_with_dense_verification() --dense-verification Cross-attention scoring of draft tokens against REPL trace. Per-token and per-step confidence. Strategies: SCORE_ONLY, GATE, GATE_AND_REGEN.
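
All six methods follow the same call pattern; as one illustration, here is a minimal sketch of self-refinement from Python. The max_rounds field matches the RefinementConfig entry under Python API below; the swiftllm.config import path and the result field name are assumptions.

Python
from swiftllm import LLM
from swiftllm.config import RefinementConfig

llm = LLM("meta-llama/Llama-3-8B")

# Iterative critique/revision loop; stops when improvement falls below threshold
results = llm.generate_with_refinement(
    "Write a haiku about Rust",
    config=RefinementConfig(max_rounds=3),
)
print(results[0].text)  # result field name assumed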

TurboQuant KV Cache Compression

ICLR 2026 (Zandieh et al., arXiv 2504.19874) — online vector quantization for KV cache. Random rotation via fast Walsh-Hadamard transform with deterministic sign flips, then scalar quantization using precomputed Beta-distribution Max-Lloyd codebooks at 1-4 bits per channel.

Preset Key Bits Value Bits Compression Use Case
quality_neutral() 4 3 ~4x Minimal quality loss — recommended default
aggressive() 3 2 ~5x Maximum memory savings — some quality tradeoff
Custom 1-4 1-4 Varies Per-use-case tuning via TurboQuantConfig

Two quantization variants: TurboQuantMse (minimizes MSE, best for general use) and TurboQuantProd (unbiased inner-product estimation via JL sign sketch, best when attention score preservation matters more than individual vector reconstruction).
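
The Custom row above corresponds to constructing TurboQuantConfig directly; a minimal sketch, assuming the key_bits/value_bits keyword names listed under Python API below:

Python
from swiftllm import LLM, EngineConfig, TurboQuantConfig

# 3-bit keys / 2-bit values, roughly the aggressive() preset (~5x compression)
engine_cfg = EngineConfig(turbo_quant=TurboQuantConfig(key_bits=3, value_bits=2))
llm = LLM("meta-llama/Llama-3-8B", engine_config=engine_cfg)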

Recursive Language Model (RLM)

Bounded recursive self-calling with a REPL sandbox, variable binding table, and complexity-classifier MLP. Four operating modes: DISABLED, SHALLOW (depth=1), REASONING (depth=3, default), AGENTIC (depth=5).

Python
from swiftllm import LLM, RlmConfig, RlmMode, SamplingParams

llm = LLM(model="path/to/model")
results = llm.generate_with_rlm(
    "Prove by induction that 1+2+…+n = n(n+1)/2.",
    config=RlmConfig(mode=RlmMode.REASONING, max_depth=3,
                     enable_repl=True, early_exit_threshold=0.92),
    base_params=SamplingParams(temperature=0.7, max_tokens=768),
)
result = results[0]
print(f"Depth used: {result.recursion_depth_used}")
print(f"REPL steps: {len(result.repl_trace)}")

Dense Verification Layer

Cross-attention scoring of draft tokens against REPL execution trace. Per-token and per-step confidence. Strategies: SCORE_ONLY, GATE, GATE_AND_REGEN.

Python
from swiftllm import LLM, DenseVerificationConfig, VerificationStrategy

llm = LLM(model="path/to/model")
results = llm.generate_with_dense_verification(
    "Explain Gödel's incompleteness theorems.",
    config=DenseVerificationConfig(
        strategy=VerificationStrategy.GATE_AND_REGEN,
        min_confidence=0.80, max_regen_attempts=3),
)
result = results[0]
print(f"Confidence: {result.global_score:.1%}")
print(f"Accepted on attempt: {result.accepted_on_attempt}")

API Server

SwiftLLM includes an OpenAI-compatible HTTP server built with Axum, supporting streaming, API key authentication, CORS, and security hardening. As of v2.0 the server binds to 127.0.0.1 by default (localhost only) to prevent accidental exposure on open networks.

Endpoints

  • POST /v1/completions — Text completion
  • POST /v1/chat/completions — Chat completion
  • GET /v1/models — List loaded models
  • GET /health — Health check (unauthenticated)
  • GET /metrics — Prometheus metrics (authenticated — requires API key when SWIFTLLM_API_KEY is set)

Security Hardening

v2.0 ships with production-grade security defaults out of the box:

  • Default bind 127.0.0.1 — the server listens on localhost only. Set SWIFTLLM_HOST=0.0.0.0 to expose externally.
  • Constant-time API key auth — set SWIFTLLM_API_KEY to require Authorization: Bearer <key> on all endpoints except /health. Uses constant-time comparison to prevent timing attacks.
  • CORS from config — allowed origins are configured via ServerConfig.cors_origins or SWIFTLLM_CORS_ORIGINS. Defaults to same-origin only.
  • Authenticated /metrics — Prometheus metrics are gated behind the same API key, preventing info leakage.
  • Rate limiting — configurable via tower-http middleware. Defaults to no limit; set SWIFTLLM_RATE_LIMIT for requests/second cap.

Launch examples

bash
# Basic — localhost only, no auth
swiftllm serve --model meta-llama/Llama-3-8B

# Production — external, with API key and CORS
export SWIFTLLM_API_KEY="sk-your-secret-key"
export SWIFTLLM_CORS_ORIGINS="https://app.example.com"
swiftllm serve --model meta-llama/Llama-3-8B --host 0.0.0.0 --port 8000

# Verify auth is required
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer sk-your-secret-key"

The server is a drop-in replacement for the OpenAI API. Point any OpenAI client (Python, Node, curl) at http://localhost:8000/v1 and it just works — including SSE streaming for chat completions.
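
For example, with the official openai Python package (the model name and key below are placeholders):

Python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1",
                api_key="sk-your-secret-key")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3-8B",
    messages=[{"role": "user", "content": "What is machine learning?"}],
)
print(resp.choices[0].message.content)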

Training & Fine-Tuning

SwiftLLM supports full training, LoRA, QLoRA fine-tuning, and GRPO reinforcement learning with three optimizer backends: Muon (Newton-Schulz orthogonalization), AdamW (decoupled weight decay), and SGD (optional Nesterov momentum). The CLI exposes subcommands: swiftllm train, swiftllm finetune, and swiftllm grpo.

Simulated backend: CUDA training kernels are not yet wired up. Every training run prints a visible [SIMULATED] banner. Real gradient training ships in a later release.

Supported Methods

Method Description Memory
LoRA Low-Rank Adaptation — trains small adapter matrices Low
QLoRA 4-bit quantized base model + LoRA adapters (~65-70% less VRAM) Very Low
Full Full parameter fine-tuning High

LoRA Fine-Tuning

Python
from swiftllm import fine_tune, LoRAConfig, TrainingConfig

lora = LoRAConfig(r=16, alpha=32, dropout=0.05)
config = TrainingConfig(learning_rate=2e-4, num_epochs=3,
                        per_device_batch_size=4)

fine_tune(model="meta-llama/Llama-3-8B",
          train_data="./my_data.jsonl",
          lora_config=lora, training_config=config,
          output_dir="./output")
bash
# CLI — LoRA fine-tuning
swiftllm finetune -m meta-llama/Llama-3-8B \
  --train-data ./data/train.jsonl \
  --lora-r 16 --lora-alpha 32 --learning-rate 2e-4

# CLI — QLoRA (4-bit base + LoRA)
swiftllm train -m meta-llama/Llama-3-8B \
  --train-data ./data/train.jsonl --method qlora \
  --lora-r 16 --mixed-precision bf16

Auto-Ingest from Files or HuggingFace

Pass a directory, file list, or HuggingFace dataset directly to fine_tune() — SwiftLLM automatically converts it to JSONL before training. No manual dataset preparation needed.

Python
from swiftllm import fine_tune

# Fine-tune from a directory (auto-ingested to JSONL)
fine_tune(model="meta-llama/Llama-3-8B",
          train_data="./my_corpus/",
          output_dir="./output", lora_r=16)

# Fine-tune from a HuggingFace dataset
fine_tune(model="meta-llama/Llama-3-8B",
          hf_dataset="tatsu-lab/alpaca",
          dataset_format="sft_completion",
          output_dir="./output", lora_r=32)

# Combine local files + HuggingFace in one command
fine_tune(model="meta-llama/Llama-3-8B",
          train_data="./my_docs/",
          hf_dataset="tatsu-lab/alpaca",
          hf_max_samples=10_000,
          dataset_format="sft_completion",
          output_dir="./output", lora_r=16)

GRPO Reinforcement Learning

Group Relative Policy Optimization — RL fine-tuning without a critic model. Includes CGAR depth curriculum (1.71x speedup), Process Reward Models (5 aggregation strategies for step-level feedback), and LongR dense token-level rewards (+9% LongBench v2).

Python
from swiftllm import GrpoTrainer, TrainingConfig
from swiftllm.config import GrpoConfig, CgarConfig, PrmConfig

config = TrainingConfig(
    model="meta-llama/Llama-3-8B",
    train_data="./data/math_prompts.jsonl",
    output_dir="./grpo_output",
    num_layers=32,
    grpo=GrpoConfig(group_size=8, kl_coeff=0.04),
    cgar=CgarConfig(),
    prm=PrmConfig(aggregation="last_step"),
)
trainer = GrpoTrainer(config)
trainer.train()
bash
# GRPO with full research stack: CGAR + PRM + LongR
swiftllm grpo -m meta-llama/Llama-3-8B \
  --train-data ./data/math_prompts.jsonl \
  --group-size 8 --enable-prm --long-reward-weight 0.1 \
  --num-layers 32

Dataset Ingestion

Convert directories of text, code, PDF, DOCX, CSV, HTML, and JSON files into JSONL training data in one command. Supports HuggingFace Hub datasets via HuggingFaceSource. Four output schemas: pretraining, sft_messages, sft_completion, code. SHA-256 deduplication across all sources.

bash
# Local files — code fine-tuning
swiftllm dataset --input ./src/ ./docs/ --output train.jsonl --format code

# HuggingFace dataset
swiftllm dataset --hf-dataset tatsu-lab/alpaca --output alpaca.jsonl --format sft_completion

# Mixed: local files + HuggingFace combined
swiftllm dataset --input ./my_docs/ --hf-dataset tatsu-lab/alpaca \
  --hf-max-samples 10000 --format sft_completion --output combined.jsonl

# Large corpus with streaming (avoids full download)
swiftllm dataset --hf-dataset HuggingFaceFW/fineweb --hf-subset sample-10BT \
  --hf-streaming --hf-max-samples 100000 --format pretraining --output fineweb.jsonl
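
The same pipeline is callable from Python via ingest_dataset (see Dataset Ingestion under Python API). This is a hedged sketch: the keyword names are assumptions modelled on the CLI flags and the IngestionConfig fields, not a confirmed signature.

Python
from swiftllm import ingest_dataset

# Local directory + HuggingFace dataset combined into one JSONL file
result = ingest_dataset(
    input_paths=["./my_docs/"],       # assumed keyword
    hf_dataset="tatsu-lab/alpaca",    # assumed keyword
    hf_max_samples=10_000,            # assumed keyword
    format="sft_completion",
    output_path="combined.jsonl",     # assumed keyword
)
print(result)  # IngestionResult summary: chunks, files, HF rows, skipped files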

Training Data Formats

Three accepted input formats for supervised fine-tuning:

Format Schema Use Case
Messages (JSONL) {"messages": [{"role": "...", "content": "..."}]} Chat / instruction tuning (recommended)
Prompt-Completion (JSONL) {"prompt": "...", "completion": "..."} Classic SFT, GRPO (add "answer" for RL)
Plain Text / CSV One example per line or CSV with prompt/completion columns Language modelling pretraining
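
For instance, a single line in the recommended messages format can be written from Python like this (the content is illustrative):

Python
import json

record = {"messages": [
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."},
]}

# One JSON object per line in the training file
with open("train.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")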

Advanced Inference

SwiftLLM v2.0 includes six test-time inference enhancements accessible via Python API and CLI flags. These improve output quality by spending more compute at inference time — no training required.

  • Self-Consistency: llm.generate_with_self_consistency() — majority voting over N independent chains (Wang et al., 2022)
  • Self-Refinement: llm.generate_with_refinement() — iterative critique→revision cycles (Madaan et al., 2023)
  • Best-of-N: llm.generate_best_of_n() — dense scoring and reranking of N candidates (see the sketch after this list)
  • Disaggregated Serving: DisaggregatedServingConfig — separate prefill/decode worker pools (Splitwise/DistServe)
  • RLM: llm.generate_with_rlm() — recursive self-calling with REPL sandbox and variable binding
  • Dense Verification: llm.generate_with_dense_verification() — cross-attention token/step confidence scoring
  • TurboQuant: Set EngineConfig(turbo_quant=TurboQuantConfig.quality_neutral()) for 3-5x KV cache compression
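
A hedged sketch of the Best-of-N call referenced above. The method name and VerificationConfig's num_candidates field come from the tables on this page; the swiftllm.config import path and the result field name are assumptions.

Python
from swiftllm import LLM
from swiftllm.config import VerificationConfig

llm = LLM("meta-llama/Llama-3-8B")

# Generate 4 candidates, score them, and return the highest-scoring one
results = llm.generate_best_of_n(
    "Prove the Pythagorean theorem",
    config=VerificationConfig(num_candidates=4),
)
print(results[0].text)  # result field name assumed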

CLI Commands

The swiftllm CLI provides subcommands covering inference, serving, training, dataset ingestion, and model management.

Command Description
serve OpenAI-compatible API server
generate Batch text generation with --self-consistency, --refinement-rounds, --best-of-n, --rlm, --dense-verification flags
chat Interactive REPL
download Pull models from HF Hub
benchmark Throughput & TTFT tests
convert HF ↔ SafeTensors ↔ GGUF
info Model and system info
train Full / LoRA / QLoRA training
finetune LoRA convenience command
grpo GRPO RL training with CGAR/PRM/LongR
dataset Multi-format ingestion to JSONL with HuggingFace support

Global Flags

Flag Description
--model HF repo id, local path, or GGUF file
--dtype auto, float16, bfloat16, float32, int8, int4
--quantization awq, gptq, bnb4, turboquant, or none
--tensor-parallel Number of GPUs for tensor parallelism
--max-model-len Override the model's context length

Python API

Inference

Export Description
LLM Synchronous inference engine
AsyncLLM Async engine with streaming iterators
SamplingParams temperature, top_p, top_k, max_tokens, stop strings
RequestOutput Per-request result wrapper
EngineConfig gpu_memory_utilization, dtype, quantization, max_model_len

Training

Export Description
Trainer High-level training loop with callbacks, early stopping, checkpointing
TrainingConfig learning_rate, num_epochs, per_device_batch_size, lr_scheduler, mixed_precision
LoRAConfig r, alpha, dropout, target_modules, use_rslora
fine_tune One-call convenience — accepts JSONL, directory, or HuggingFace dataset
GrpoTrainer GRPO RL training with CGAR curriculum integration
GrpoConfig group_size, kl_coeff, clip_eps, correctness/format/length weights
CgarConfig Curriculum-Guided Adaptive Recursion — shallow_end, medium_end phases
PrmConfig Process Reward Model — 5 aggregation strategies (Min/Mean/Product/LastStep/WeightedMean)
LongRewardConfig LongR dense token-level rewards — weight, aggregation, normalise
grpo_train One-call GRPO convenience function

Dataset Ingestion

Export Description
DatasetIngester Full-control multi-format dataset ingestion API
IngestionConfig input_paths, output_path, format, chunk_size, hf_sources
IngestionResult Ingestion summary — chunks, files, HF rows, skipped files
HuggingFaceSource Describes one HuggingFace Hub dataset with auto-detected schema
DatasetFormat Output format enum — pretraining, sft_messages, sft_completion, code
ingest_dataset One-liner for local files, HuggingFace, or combined ingestion
prepare_dataset Convenience shortcut available as swiftllm.prepare_dataset()

Phase 3 — Inference Configs

Export Description
SelfConsistencyConfig num_samples, extractor strategy, temperature
RefinementConfig max_rounds, stopping criterion, improvement metric
VerificationConfig num_candidates, scoring strategy, threshold
DisaggregatedServingConfig Separate prefill/decode worker pool config
RlmConfig max_depth, mode (DISABLED/SHALLOW/REASONING/AGENTIC)
DenseVerificationConfig min_confidence, strategy (SCORE_ONLY/GATE/GATE_AND_REGEN)
TurboQuantConfig key_bits, value_bits, presets: quality_neutral(), aggressive()

Hybrid Architecture (Phase 1)

Export Description
HybridModelBuilder Fluent builder for hybrid Mamba-3 models
build_mamba3_reasoning_model Mamba-3 + LatentMoE + RLM + Dense Verification preset
HybridModelConfig Top-level architecture configuration
MambaConfig Mamba-3 SSM layer configuration
LatentMoeConfig Latent MoE FFN configuration (87.5% less inter-GPU traffic)
estimate_parameters Compute total parameter count for a HybridModelConfig
parameter_summary Formatted breakdown of parameter counts by component
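
A minimal sketch of the parameter-estimation helpers; that they import from the top-level package and that the preset builder takes no arguments are assumptions:

Python
from swiftllm import (build_mamba3_reasoning_model, estimate_parameters,
                      parameter_summary)

# Preset: Mamba-3 + LatentMoE + RLM + Dense Verification
cfg = build_mamba3_reasoning_model()   # returns a HybridModelConfig (assumed no-arg call)
print(estimate_parameters(cfg))        # total parameter count
print(parameter_summary(cfg))          # formatted breakdown by component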

Configuration

Every SWIFTLLM_* environment variable maps to a field in EngineConfig or ServerConfig. Set them in your shell profile, systemd unit, or Docker Compose file — they are read at startup and override coded defaults. Explicit constructor arguments and CLI flags always take final precedence.
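
For example, the precedence rules play out like this (a sketch, assuming the variable is read when the engine is constructed):

Python
import os
from swiftllm import LLM, EngineConfig

os.environ["SWIFTLLM_GPU_MEMORY_UTILIZATION"] = "0.95"  # environment default

# The explicit constructor argument takes final precedence over the variable
llm = LLM("meta-llama/Llama-3-8B",
          engine_config=EngineConfig(gpu_memory_utilization=0.85))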

GPU & Memory

Variable Default Description
SWIFTLLM_GPU_MEMORY_UTILIZATION 0.90 Fraction of GPU VRAM for model weights + KV cache (0.0–1.0). Raise to ~0.95 on dedicated hosts.
SWIFTLLM_GPU_OVERHEAD_MB 0 VRAM (in MB) to reserve for the OS and other processes.
SWIFTLLM_NUM_GPU_LAYERS all Number of layers to offload to GPU. 0 = CPU-only, 999 = all.
SWIFTLLM_SWAP_SPACE 4.0 CPU swap space in GiB for KV cache offloading.
SWIFTLLM_CPU_OFFLOAD_GB 0.0 Model weight gigabytes to keep on CPU RAM instead of GPU.
SWIFTLLM_KV_CACHE_DTYPE auto Data type for the KV cache. fp8_e4m3/fp8_e5m2 halves memory.
SWIFTLLM_BLOCK_SIZE 16 Tokens per PagedAttention block. Allowed: 8, 16, 32.
SWIFTLLM_FLASH_ATTENTION true Enable FlashAttention kernels.
SWIFTLLM_ENFORCE_EAGER false Disable CUDA graph capture; use eager execution.
CUDA_VISIBLE_DEVICES (all) Restrict which GPUs are visible, e.g. 0,2.

Tensor Parallelism & Multi-GPU

Variable Default Description
SWIFTLLM_TENSOR_PARALLEL_SIZE 1 GPUs for tensor parallelism. Must evenly divide attention heads.
SWIFTLLM_PIPELINE_PARALLEL_SIZE 1 Pipeline-parallel stages.
NCCL_DEBUG NCCL logging level: INFO, WARN, TRACE.
NCCL_P2P_DISABLE 0 Set to 1 to disable GPU peer-to-peer on certain PCIe topologies.

Scheduling & Batching

Variable Default Description
SWIFTLLM_MAX_NUM_SEQS 256 Maximum concurrent sequences in a batch.
SWIFTLLM_MAX_NUM_BATCHED_TOKENS 8192 Maximum total tokens per forward pass.
SWIFTLLM_MAX_PADDINGS 256 Maximum padding tokens tolerated per batch.
SWIFTLLM_SCHEDULER_POLICY fcfs fcfs, sjf, or priority.
SWIFTLLM_PREEMPTION_MODE swap swap (KV cache to CPU) or recompute.
SWIFTLLM_ENABLE_PREFIX_CACHING false Reuse KV cache across requests sharing the same prefix.
SWIFTLLM_ENABLE_CHUNKED_PREFILL false Interleave prefill and decode; reduces time-to-first-token.
SWIFTLLM_NUM_PARALLEL 1 Parallel inference slots per model.
SWIFTLLM_MAX_LOADED_MODELS 1 Models held in GPU memory simultaneously.
SWIFTLLM_KEEP_ALIVE 300 Seconds a model stays loaded after last request.

Speculative Decoding

Variable Default Description
SWIFTLLM_SPECULATIVE_MODEL Draft model for speculative decoding.
SWIFTLLM_NUM_SPECULATIVE_TOKENS 5 Tokens to draft per step.
SWIFTLLM_SPECULATIVE_MAX_MODEL_LEN Override max sequence length for the draft model.

Model & Weights

Variable Default Description
SWIFTLLM_MODEL_DIR ~/.cache/swiftllm/models Default directory for downloaded models.
SWIFTLLM_OFFLINE false Set to 1 to disable all network downloads (air-gapped mode).
SWIFTLLM_DTYPE auto Weight data type: auto, float16, bfloat16, float32, int8, int4, fp8_e4m3, fp8_e5m2.
SWIFTLLM_QUANTIZATION none Quantization method: none, awq, gptq, squeezellm, gguf.
SWIFTLLM_MAX_MODEL_LEN (model default) Override the model's max sequence length.
SWIFTLLM_TRUST_REMOTE_CODE false Allow executing custom code from HuggingFace repos.
SWIFTLLM_DEVICE auto Device: auto, cuda, cpu, metal, rocm.
SWIFTLLM_SEED 0 Global random seed.
HF_TOKEN HuggingFace API token for gated models.

LoRA

Variable Default Description
SWIFTLLM_ENABLE_LORA false Enable LoRA adapter support in the inference engine.
SWIFTLLM_MAX_LORAS 1 Maximum LoRA adapters loaded simultaneously.
SWIFTLLM_MAX_LORA_RANK 16 Maximum LoRA rank.

Server & Networking

Variable Default Description
SWIFTLLM_HOST 127.0.0.1 HTTP bind address. Localhost by default for security; set 0.0.0.0 to expose externally.
SWIFTLLM_PORT 8000 Listen port.
SWIFTLLM_API_KEY Bearer-token API key (constant-time comparison).
SWIFTLLM_CORS_ALLOW_ORIGINS * Comma-separated allowed CORS origins.
SWIFTLLM_SSL_CERTFILE Path to TLS certificate for HTTPS.
SWIFTLLM_SSL_KEYFILE Path to TLS private key for HTTPS.
SWIFTLLM_ROOT_PATH URL prefix for reverse-proxy deployments.
SWIFTLLM_SERVED_MODEL_NAME Override model name in API responses.
SWIFTLLM_MAX_LOG_LEN Truncate request/response logs to this many characters.
SWIFTLLM_RESPONSE_ROLE assistant Default role name in chat completion responses.
SWIFTLLM_METRICS_ENABLED true Expose Prometheus /metrics (authenticated when API key is set).

Build & CUDA

Variable Default Description
CUDA_PATH / CUDA_HOME Path to CUDA toolkit.
CUDACXX Path to nvcc binary.
CMAKE_ARGS Extra CMake arguments for llama-cpp-python build.

Logging & Debug

Variable Default Description
RUST_LOG info Rust log level: trace, debug, info, warn, error.
SWIFTLLM_LOG_LEVEL Python log level: DEBUG, INFO, WARNING, ERROR.

Supported Models

Ten model families supported out of the box, with auto-detection for HuggingFace, GGUF, SafeTensors, and PyTorch formats. Phase 1 research models (Mamba, Jamba) are available when building with hybrid architecture support.

Model Family Variants Notes
LLaMA 3 8B, 70B, 405B Meta's latest flagship. Full tensor-parallel support for 70B/405B.
Code Llama 7B, 13B, 34B Code-specialized. Python, Instruct, and base variants.
Mistral / Mixtral / Devstral 7B, 8x7B, Devstral Dense (7B), MoE (Mixtral 8x7B), and code-agent (Devstral) variants.
Qwen 3 / 3.5 0.6B – 235B Alibaba multilingual. MoE and dense variants.
Phi-3 / Phi-4 Mini, Small, Medium Microsoft small-language models optimized for edge deployment.
Deepseek R1 / V4 R1-Distill, V4-0324 Reasoning-focused. R1 chain-of-thought and V4 MoE architectures.
Gemma 2B, 7B Google open models. Instruct and base variants.
Mamba 130M – 2.8B Phase 1 — SSM backbone. Requires hybrid architecture build.
Jamba Hybrid configs Phase 1 — Mamba + Attention + MoE hybrid (AI21 architecture).

All models support HuggingFace (SafeTensors/PyTorch), GGUF, and raw SafeTensors formats. Format is auto-detected by file extension. Use swiftllm convert to move between formats.

Get help

Three channels, tuned to different needs. Community-first, with paid support available for production deployments.

Frequently asked questions

Do I need a GPU to run SwiftLLM?
No. As of v2.0, CPU is the default build target. The installer only pulls CUDA libraries if you pass --gpu or run on a machine where nvidia-smi is detected. CPU wheels ship for Linux (x86_64, aarch64) and macOS (x86_64, arm64 / Apple Silicon).
Which model formats are supported?
HuggingFace (SafeTensors and PyTorch), GGUF (via llama-cpp-python), and raw SafeTensors files. Format is auto-detected by file extension. Use swiftllm convert to move between formats.
Is training production-ready yet?
The full training pipeline — CLI, config parsing, checkpointing, dataset loading, GRPO RL, CGAR curriculum, PRM scoring, and LongR dense rewards — is wired up and usable. However, the actual CUDA gradient kernels are still stubbed; runs print a visible [SIMULATED] banner. Real GPU gradient training is planned for a future release — follow the changelog for updates.
Can I deploy on air-gapped networks?
Yes. Run airgap-bundle.sh on a connected machine to build a portable tarball with source, Python wheels, the Rust installer, and optional pre-downloaded models. SHA256 checksums verify integrity on the target.
How does throughput compare to vLLM?
Run the included benchmark suite on your own hardware — swiftllm benchmark --concurrency 32 --num-requests 1000. We consistently see competitive tokens/sec on LLaMA-3-8B with PagedAttention + continuous batching enabled. See the changelog for tested throughput per release.
Is the OpenAI API really a drop-in replacement?
Yes for /v1/completions and /v1/chat/completions including streaming via SSE. Point any OpenAI client (Python, Node, curl) at http://localhost:8000/v1 and it just works. API key auth, CORS, and Prometheus metrics are built in.

Troubleshooting

Installer fails with "externally-managed environment"

You're on Debian 12+, Ubuntu 23.04+, or recent macOS. Let the installer create a venv (the default), pass --venv ~/.swiftllm, or activate your own environment before running.

nvidia-smi detected but CUDA build fails

Ensure nvcc is on your PATH (which nvcc). Driver version must be compatible with CUDA 11.8+. On Ubuntu, the nvidia-cuda-toolkit package is often too old — install CUDA directly from NVIDIA.

Training shows [SIMULATED]

This is expected behaviour — the training loop is a simulation stub. Full GPU training requires PyTorch and a CUDA environment. See the Training section.

Model downloads hang

Set HF_HUB_ENABLE_HF_TRANSFER=1 for the accelerated downloader, or pass --revision main to pin the reference. For air-gapped installs, set SWIFTLLM_OFFLINE=1 and pre-populate the cache.

Release history

Track every update, feature, and fix as SwiftLLM evolves.

v2.0
Latest Stable 2026

Stable release — all beta features promoted to production. Hybrid Mamba-3, GRPO training, TurboQuant KV cache compression, and advanced test-time inference.

  • Added: Hybrid Mamba-3 SSM + Transformer + MoE architecture (Phase 1) — selective SSM with MIMO multi-head scan, LatentMoE with dynamic-bias load balancing, Jamba-style hybrid blocks
  • Added: GRPO / CGAR / PRM / LongReward training pipelines (Phase 2) — reinforcement learning fine-tuning without a critic model
  • Added: Self-consistency, refinement, best-of-N, disaggregated serving, Recursive Language Model, and Dense Verification (Phase 3)
  • Added: TurboQuant KV cache compression (ICLR 2026) — Walsh-Hadamard rotation + Max-Lloyd codebooks for 3-5x memory reduction
  • Added: Lifecycle scripts (install.sh, update.sh, uninstall.sh) for full install management
  • Added: Comprehensive static analysis sweep — zero Clippy warnings, logic bug fixes
  • Added: 329 backend tests and 66 frontend tests — all passing
v2.0.6-beta
Beta 2026

TurboQuant KV cache compression — online vector quantization from ICLR 2026.

  • Added: Full TurboQuant implementation (Zandieh et al., arXiv 2504.19874) — random rotation via fast Walsh-Hadamard transform, precomputed Max-Lloyd codebooks for 1-4 bit scalar quantization
  • Added: TurboQuantMse (MSE-optimal) and TurboQuantProd (unbiased inner products via JL sign sketch) variants
  • Added: TurboQuantConfig in Rust and Python with quality_neutral() and aggressive() presets
  • Added: TurboQuantKvCache — compressed KV cache layer with slot-based store/load/clear
  • Added: 25 Rust + 16 Python tests covering codebook construction, rotation invertibility, roundtrip quality, memory stats
v2.0.5-beta
Beta 2026

Code quality & zero-warning build, lifecycle scripts, comprehensive static analysis and bug fixes.

  • Fixed: Resolved all 130+ compiler warnings across 4 workspace crates — dead code annotations, doc comments, unused imports
  • Fixed: Logic bug in CurriculumScheduler::ssm_lr_scale() — identical if/else branches; replaced with correct linear inverse mapping
  • Fixed: Incorrect #[cfg(has_cuda)] in PyO3 bindings — changed to proper #[cfg(feature = "cuda")]
  • Added: update.sh and uninstall.sh lifecycle scripts with branch/tag switching and purge options
  • Fixed: Python SDK imports cleanly without PyTorch — torch_model features guarded with try/except ImportError
  • Added: 304 Rust tests + 50 Python tests — all passing with zero actionable warnings
v2.0.4-beta
Beta 2026

Security audit, CUDA acceleration kernels, and PyTorch integration.

  • Security (Critical): Added Drop for CUDA storage, check_cuda_last_error() after all kernel launches, constant-time API key comparison
  • Security (High): CORS from config, authenticated /metrics, default bind 127.0.0.1, input validation on legacy endpoints
  • Added: Custom CUDA kernels: mamba3_scan.cu, moe_dispatch.cu, dense_verif_attn.cu, rlm_ops.cu, linear_f16.cu
  • Added: hybrid_model.py and torch_model.py — PyTorch nn.Module bridge for GPU-executable hybrid models
  • Fixed: Integer overflow in block sizing, TOCTOU race in scheduler, UTF-8 slicing panic in repr
v2.0.3-beta
Beta 2026

Phase 1 hybrid architectures, Phase 2 training enhancements, Phase 3 inference and model-level reasoning.

  • Added: Phase 1: mamba.rs, moe.rs, jamba.rs — Mamba SSM with MIMO scan, sparse MoE routing, Jamba hybrid blocks
  • Added: Phase 2: grpo.rs, curriculum.rs, process_reward.rs, long_reward.rs — GRPO RL, CGAR curriculum, PRM step scoring, LongR dense rewards
  • Added: Phase 3: Self-consistency voting, multi-round self-refinement, best-of-N verification, disaggregated prefill/decode serving
  • Added: Phase 3 Model: rlm.rs (Recursive Language Model with REPL sandbox) and dense_verification.rs (cross-attention draft scoring)
  • Added: 14 new Python config dataclasses, CLI flags for self-consistency/refinement/best-of-N/RLM/dense-verification, 4 new example scripts
v2.0.2-beta
Beta 2026

HuggingFace dataset support and multi-format dataset ingestion pipeline.

  • Added: HuggingFaceSource — auto-detects Alpaca, ShareGPT, OpenAI messages, prompt/completion, Q&A, and plain text schemas
  • Added: Multi-format ingestion: .txt, .md, .py, .rs, .pdf, .docx, .csv, .html, .jsonl and 40+ more into JSONL
  • Added: 4 output schemas: pretraining, sft_messages, sft_completion, code — with SHA-256 dedup and code-aware chunking
  • Added: CLI swiftllm dataset subcommand with 14 --hf-* flags; streaming mode for large corpora
  • Added: Auto-ingest in Trainer and fine_tune() — directories and HF datasets converted transparently
v2.0.1-beta
Beta 2026

Training UX fixes — simulated-stub banner, train-data path validation, and full regression coverage.

  • Fixed: Trainer.train() now prints a visible [SIMULATED] banner making it obvious the current loop is a stub (no weights, no gradients) rather than real training
  • Fixed: swiftllm train and swiftllm finetune validate --train-data / --eval-data paths up front — missing, non-regular, or empty files now fail fast with clear errors
  • Changed: Validation covers paths loaded from --config JSON so typos in saved configs surface immediately
  • Added: Full regression matrix: install → download → generate (18.61 tok/s) → finetune (LoRA) → train on Ubuntu 24.04 + CUDA 13.0
v2.0.0.2-alpha
Alpha 2026

CPU & ARM wheel support, installer portability, and critical Rust-side safety fixes.

  • Added: CPU-only build is now the default — portable wheel with zero CUDA dependencies, buildable on Apple Silicon, Graviton, Raspberry Pi, Jetson, Ampere
  • Added: Top-level swiftllm crate exposes cpu and cuda Cargo features; CUDA opt-in via ./install.sh --gpu
  • Added: airgap-bundle.sh --arch flag auto-maps to pip platform tags and rustup target triples for cross-architecture bundles
  • Added: [serve] optional dependency (fastapi, uvicorn) wired into install.sh and the airgap bundle
  • Fixed: PEP 668 handling on Ubuntu 23.04+ / Debian 12+ externally-managed Python
  • Fixed: Replaced non-portable grep -oP with sed for CUDA detection (macOS/BSD compatible)
  • Security: Fixed 14 partial_cmp().unwrap() calls — replaced with unwrap_or(Ordering::Equal) to prevent NaN panics
  • Security: Added checked_add bounds validation in GGUF loader — malformed files can no longer cause slice panics
v2.0.0.1-alpha
Alpha 2026

Air-gapped installation and comprehensive security hardening.

  • Added: airgap-bundle.sh creates portable install archives (source, pip wheels, rustup-init, optional models)
  • Added: install.sh --airgap flag for fully offline installation from a bundle
  • Added: Runtime offline mode via SWIFTLLM_OFFLINE=1 — disables all HF downloads
  • Security (Critical): Fixed JSON injection in SSE streaming — replaced raw format!() with serde_json::json!()
  • Security (Critical): Fixed shell injection in airgap-bundle.sh — model names now passed via sys.argv
  • Security (Critical): Added SHA256 checksum verification for downloaded rustup-init
  • Security: Path validation against directory traversal; symlink traversal protection in offline cache
  • Fixed: Use-after-move in engine.rs; usize negation in trainer.rs; missing mut on eval_data
v2.0.0-alpha
Alpha 2026

Training & fine-tuning infrastructure, engine optimization, scheduler improvements.

  • Added: swiftllm-training Rust crate with LoRA / QLoRA / full fine-tuning
  • Added: Muon optimizer with Newton-Schulz orthogonalization, plus AdamW and SGD with linear/cosine/constant LR schedulers
  • Added: Dataset loading (JSONL/CSV/text) with instruction templates; rolling-window metrics; perplexity
  • Added: Checkpoint save/load with save_total_limit; Python Trainer with callbacks and EarlyStoppingConfig
  • Added: O(n) top-k / beam / logprobs selection via quickselect; /metrics JSON and Prometheus endpoints
  • Fixed: Eliminated redundant read-then-write lock in sampling hot path; numerically stable log-softmax
  • Fixed: O(n) victim selection for preemption scheduling; gradient clipping and LoRA scaling
v1.0.0
Initial 2026

Initial release — Rust core, PagedAttention, continuous batching, OpenAI API, Python SDK.

  • Added: Complete Rust rewrite with 5 modular crates (core, models, cuda, server, training)
  • Added: PagedAttention memory management: block allocator with copy-on-write
  • Added: Continuous batching scheduler with preemption (swap / recompute)
  • Added: Token sampling: greedy, temperature, top-k, top-p, min-p, repetition penalty
  • Added: OpenAI-compatible HTTP API built with Axum; Python SDK (LLM, AsyncLLM, SamplingParams)
  • Added: Speculative decoding with draft-model acceleration; multi-GPU tensor and pipeline parallelism