- Installation — auto-detect GPU, install Rust, build from source.
- Quick Start — your first generation in under five minutes.
- API Server — OpenAI-compatible endpoints with auth and metrics.
- Training & Fine-Tuning — LoRA, QLoRA, GRPO RL, dataset ingestion, and full fine-tuning.
Installation
SwiftLLM provides a smart installer that auto-detects your system, installs dependencies, and builds the engine from source. As of v2.0 it ships portable abi3 wheels for Linux (x86_64, aarch64) and macOS (x86_64, arm64) — a single wheel covers Python 3.8 through 3.12.
Prerequisites
- Python 3.8+ (3.10+ recommended)
- Git and a C toolchain (cc/clang)
- NVIDIA GPU with CUDA 11.8+ — optional; pass `--cpu` for CPU-only builds
- Rust toolchain — installed automatically if missing
```bash
# Clone the repository
git clone https://github.com/infrabrew/swiftllm.git
cd swiftllm

# Run the installer (auto-detects CUDA vs CPU)
bash install.sh
```
Install Options
| Flag | Description |
|---|---|
| `--cpu` | Force CPU-only build, skip CUDA detection |
| `--gpu` | Force GPU/CUDA build, fail fast if CUDA is missing |
| `--venv DIR` | Custom virtual environment location |
| `--no-venv` | Install into the current Python environment instead of creating a venv |
| `--model-dir DIR` | Set the model storage directory |
| `--airgap` | Install from a pre-built offline bundle (no network required) |
PEP 668 / externally-managed environments: On Debian 12+ and recent macOS, the installer will refuse to pip-install into the system Python. Let install.sh create a venv (the default), or pass `--venv DIR`.
Platforms
SwiftLLM ships pre-built wheels for the matrix below. Other platforms fall back to a source build, which the installer handles automatically.
| OS | Architecture | Acceleration | Wheel |
|---|---|---|---|
| Linux | x86_64 | CUDA 11.8+ / CPU | ✓ pre-built |
| Linux | aarch64 | CPU (CUDA via source) | ✓ pre-built |
| macOS | x86_64 | CPU | ✓ pre-built |
| macOS | arm64 (Apple Silicon) | CPU + Metal (experimental) | ✓ pre-built |
| Windows | x86_64 | CUDA / CPU | WSL2 recommended |
GPU requirements
- NVIDIA: Compute Capability 7.0+ (Volta and later). Tested on A100, H100, L4, RTX 4090, RTX 3090.
- VRAM: 16 GB minimum for 7B models in fp16; 24 GB+ recommended for serving with KV-cache headroom.
- Drivers: NVIDIA driver 525+ for CUDA 12, 470+ for CUDA 11.8.
Apple Silicon Metal acceleration is in preview. CPU-only mode is the default on macOS — opt in with `SWIFTLLM_BACKEND=metal`.
Quick Start
Basic Inference
```python
from swiftllm import LLM, SamplingParams

llm = LLM("meta-llama/Llama-3-8B")
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["What is machine learning?"], params)
for output in outputs:
    print(output.text)
```
Self-Consistency Voting
```python
from swiftllm import LLM
from swiftllm.config import SelfConsistencyConfig

llm = LLM("meta-llama/Llama-3-8B")
results = llm.generate_with_self_consistency(
    "What is 12 x 15?",
    config=SelfConsistencyConfig(num_samples=8),
)
print(results[0].answer)  # majority-voted answer
```
TurboQuant KV Cache Compression
```python
from swiftllm import LLM, SamplingParams, EngineConfig, TurboQuantConfig

# Enable 3-5x KV cache compression
engine_cfg = EngineConfig(
    turbo_quant=TurboQuantConfig.quality_neutral(),  # 4-bit K, 3-bit V
)
llm = LLM("meta-llama/Llama-3-8B", engine_config=engine_cfg)

outputs = llm.generate(["Summarize this article..."], SamplingParams(max_tokens=512))
```
Start an API Server
```bash
# Start with API key authentication
export SWIFTLLM_API_KEY="your-secret-key"
swiftllm serve --model meta-llama/Llama-3-8B --port 8000
```
CLI — Advanced Generation
```bash
# Self-consistency with 8 chains
swiftllm generate -p "What is 12 x 15?" --self-consistency 8

# Best-of-N dense verification
swiftllm generate -p "Prove the Pythagorean theorem" --best-of-n 4

# Recursive Language Model (depth=3)
swiftllm generate -p "Solve step by step: 2^10 mod 7" --rlm 3

# Multi-round self-refinement
swiftllm generate -p "Write a haiku about Rust" --refinement-rounds 3
```
Air-Gapped Installation
SwiftLLM supports deployment on isolated networks with no internet access. Create a portable bundle on a connected machine, then install offline. The bundle includes source, Python wheels, rustup-init, and optionally pre-downloaded models with SHA256 checksum verification.
```bash
# On a connected machine — basic bundle
./airgap-bundle.sh

# Include a model in the bundle
./airgap-bundle.sh --model "Qwen/Qwen2.5-0.5B-Instruct-GGUF:qwen2.5-0.5b-instruct-q4_k_m.gguf"

# Cross-architecture bundle (x86_64 host → ARM64 target)
./airgap-bundle.sh --arch aarch64 -o swiftllm-bundle-arm64.tar.gz

# CPU-only wheels
./airgap-bundle.sh --cpu -o /mnt/usb/swiftllm-bundle.tar.gz

# macOS Apple Silicon
./airgap-bundle.sh --arch arm64 --platform macosx_11_0_arm64
```
```bash
# On the air-gapped target
tar xzf swiftllm-airgap-bundle.tar.gz
cd swiftllm-airgap-bundle/swiftllm
./install.sh --airgap

# Runtime offline mode (disable all HF downloads)
export SWIFTLLM_OFFLINE=1
swiftllm generate -m /path/to/local/model.gguf -p "Hello"
```
The --arch flag auto-maps to the correct pip platform tag and rustup target triple for
cross-architecture bundles. Supported targets: x86_64 Linux/macOS, aarch64 Linux, arm64 macOS (Apple Silicon).
Update
The update.sh lifecycle script pulls the latest source, rebuilds the wheel, and reinstalls
the package — preserving your virtual environment and downloaded models.
```bash
# Pull latest and rebuild
./update.sh

# Switch to a specific branch
./update.sh --branch main-hybrid-rd

# Switch to a specific tag
./update.sh --tag v2.0.5

# Clean build artifacts before rebuilding
./update.sh --clean

# Rebuild from current source (skip git pull)
./update.sh --no-pull

# Force CPU-only rebuild
./update.sh --cpu
```
Update Flags
| Flag | Description |
|---|---|
| `--branch NAME` | Switch to a specific git branch before rebuilding |
| `--tag TAG` | Check out a specific release tag |
| `--clean` | Remove `target/` build artifacts before rebuilding |
| `--no-pull` | Rebuild from current source without running git pull |
| `--cpu` / `--gpu` | Force a CPU-only or GPU/CUDA rebuild |
Uninstall
The uninstall.sh lifecycle script cleanly removes SwiftLLM, with options to preserve models
and virtual environments.
```bash
# Interactive uninstall (prompts before each step)
./uninstall.sh

# Uninstall but keep downloaded models
./uninstall.sh --keep-models

# Remove everything non-interactively
./uninstall.sh --purge --yes
```
Uninstall Flags
| Flag | Description |
|---|---|
| `--keep-models` | Preserve downloaded models in `~/.cache/swiftllm/models` |
| `--keep-venv` | Preserve the Python virtual environment |
| `--purge` | Remove everything: package, model cache, venv, and build artifacts |
| `--yes` | Non-interactive mode — skip all confirmation prompts |
Caution: --purge permanently deletes all downloaded models and cached data.
Use --keep-models if you want to reinstall later without re-downloading.
Inference
SwiftLLM's inference engine is built on PagedAttention with continuous batching, speculative decoding, and KV-cache reuse — designed to serve dozens of concurrent requests on a single GPU without head-of-line blocking. v2.0 adds six research-derived test-time inference enhancements and TurboQuant KV cache compression.
Sync, async, and streaming
- `LLM` — synchronous engine for single-shot generation. Best for scripts and notebooks.
- `AsyncLLM` — async engine with `async for` token streaming. Best for serving and concurrent workloads.
- OpenAI-compatible HTTP — `swiftllm serve` exposes `/v1/completions` and `/v1/chat/completions` with SSE streaming.
Sampling parameters
| Parameter | Description |
|---|---|
| `temperature` | 0 = greedy. Higher = more random. Default 1.0. |
| `top_p` | Nucleus sampling. Default 1.0 (off). |
| `top_k` | Top-K sampling. Default 0 (off). |
| `max_tokens` | Cap on generated tokens. Required. |
| `min_p` | Minimum probability threshold. Default 0 (off). |
| `stop` | List of strings that halt generation. |
| `presence_penalty` / `frequency_penalty` | Discourage repetition. Default 0. |
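A minimal sketch combining several of these parameters (the values are illustrative, not recommended defaults):

```python
from swiftllm import LLM, SamplingParams

llm = LLM("meta-llama/Llama-3-8B")

# Nucleus sampling with a stop sequence and a repetition penalty
params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=128,
    stop=["\n\n"],
    presence_penalty=0.5,
)
outputs = llm.generate(["List three benefits of continuous batching."], params)
print(outputs[0].text)
```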
Streaming example
```python
import asyncio

from swiftllm import AsyncLLM, SamplingParams

llm = AsyncLLM("meta-llama/Llama-3-8B")
params = SamplingParams(temperature=0.7, max_tokens=256)

async def main():
    # In a notebook you can use the async for loop directly at top level
    async for token in llm.stream("Explain PagedAttention in one paragraph.", params):
        print(token, end="", flush=True)

asyncio.run(main())
```
Continuous batching means a long-running request never blocks short ones — new requests slot into the next decoder step.
Test-Time Inference Enhancements (Phase 3)
Six research-derived methods for improving output quality at inference time. Each is available as a Python API method and a CLI flag.
| Method | Python API | CLI Flag | Description |
|---|---|---|---|
| Self-Consistency | `llm.generate_with_self_consistency()` | `--self-consistency N` | Majority voting over N independent chains (Wang et al., 2022). Configurable answer extractor (regex, sentinel, JSON, freeform). Ties broken by mean log-prob. |
| Self-Refinement | `llm.generate_with_refinement()` | `--refinement-rounds N` | Iterative critique-revision loop (Madaan et al., 2023). Stops when the normalized edit-distance improvement falls below a threshold. |
| Best-of-N | `llm.generate_best_of_n()` | `--best-of-n N` | Generates N candidates, scores them via a rule-based, neural PRM, ensemble, or log-prob strategy, and returns the highest-scoring one. |
| Disaggregated Serving | `DisaggregatedServingConfig` | — | Separates prefill and decode into independent worker pools (Splitwise/DistServe). Round-robin, least-loaded, or locality-aware scheduling. |
| RLM | `llm.generate_with_rlm()` | `--rlm DEPTH` | Recursive Language Model — bounded recursive self-calling with a REPL sandbox (Assign/Compute/Verify/Recurse steps), variable binding table, and complexity-classifier MLP. |
| Dense Verification | `llm.generate_with_dense_verification()` | `--dense-verification` | Cross-attention scoring of draft tokens against the REPL trace. Per-token and per-step confidence. Strategies: SCORE_ONLY, GATE, GATE_AND_REGEN. |
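Self-Refinement and Best-of-N are not covered in the Quick Start; the sketch below assumes they take a `config=` keyword (by analogy with the self-consistency example) and that `RefinementConfig` and `VerificationConfig` accept the fields listed under Phase 3 — Inference Configs:

```python
from swiftllm import LLM
from swiftllm.config import RefinementConfig, VerificationConfig

llm = LLM("meta-llama/Llama-3-8B")

# Multi-round self-refinement: critique, then revise, up to max_rounds
refined = llm.generate_with_refinement(
    "Write a haiku about Rust",
    config=RefinementConfig(max_rounds=3),
)

# Best-of-N: sample candidates, score them, return the highest-scoring one
best = llm.generate_best_of_n(
    "Prove the Pythagorean theorem",
    config=VerificationConfig(num_candidates=4),
)
print(refined[0].text)
print(best[0].text)
```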
TurboQuant KV Cache Compression
ICLR 2026 (Zandieh et al., arXiv 2504.19874) — online vector quantization for KV cache. Random rotation via fast Walsh-Hadamard transform with deterministic sign flips, then scalar quantization using precomputed Beta-distribution Max-Lloyd codebooks at 1-4 bits per channel.
| Preset | Key Bits | Value Bits | Compression | Use Case |
|---|---|---|---|---|
| `quality_neutral()` | 4 | 3 | ~4x | Minimal quality loss — recommended default |
| `aggressive()` | 3 | 2 | ~5x | Maximum memory savings — some quality tradeoff |
| Custom | 1-4 | 1-4 | Varies | Per-use-case tuning via `TurboQuantConfig` |
Two quantization variants: TurboQuantMse (minimizes MSE, best for general use) and
TurboQuantProd (unbiased inner-product estimation via JL sign sketch, best when attention score
preservation matters more than individual vector reconstruction).
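Beyond the presets, a custom setting can be sketched as follows, assuming `TurboQuantConfig` takes the `key_bits` and `value_bits` fields listed in the Phase 3 config table:

```python
from swiftllm import LLM, EngineConfig, TurboQuantConfig

# Custom bit widths between the quality_neutral() and aggressive() presets
engine_cfg = EngineConfig(
    turbo_quant=TurboQuantConfig(key_bits=4, value_bits=2),
)
llm = LLM("meta-llama/Llama-3-8B", engine_config=engine_cfg)
```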
Recursive Language Model (RLM)
Bounded recursive self-calling with a REPL sandbox, variable binding table, and complexity-classifier MLP.
Four operating modes: DISABLED, SHALLOW (depth=1), REASONING
(depth=3, default), AGENTIC (depth=5).
```python
from swiftllm import LLM, RlmConfig, RlmMode, SamplingParams

llm = LLM(model="path/to/model")
results = llm.generate_with_rlm(
    "Prove by induction that 1+2+…+n = n(n+1)/2.",
    config=RlmConfig(mode=RlmMode.REASONING, max_depth=3, enable_repl=True,
                     early_exit_threshold=0.92),
    base_params=SamplingParams(temperature=0.7, max_tokens=768),
)

result = results[0]
print(f"Depth used: {result.recursion_depth_used}")
print(f"REPL steps: {len(result.repl_trace)}")
```
Dense Verification Layer
Cross-attention scoring of draft tokens against REPL execution trace. Per-token and per-step confidence.
Strategies: SCORE_ONLY, GATE, GATE_AND_REGEN.
```python
from swiftllm import LLM, DenseVerificationConfig, VerificationStrategy

llm = LLM(model="path/to/model")
results = llm.generate_with_dense_verification(
    "Explain Gödel's incompleteness theorems.",
    config=DenseVerificationConfig(
        strategy=VerificationStrategy.GATE_AND_REGEN,
        min_confidence=0.80,
        max_regen_attempts=3),
)

result = results[0]
print(f"Confidence: {result.global_score:.1%}")
print(f"Accepted on attempt: {result.accepted_on_attempt}")
```
API Server
SwiftLLM includes an OpenAI-compatible HTTP server built with Axum, supporting streaming, API key
authentication, CORS, and security hardening. As of v2.0 the server binds to 127.0.0.1 by
default (localhost only) to prevent accidental exposure on open networks.
Endpoints
- `POST /v1/completions` — Text completion
- `POST /v1/chat/completions` — Chat completion
- `GET /v1/models` — List loaded models
- `GET /health` — Health check (unauthenticated)
- `GET /metrics` — Prometheus metrics (authenticated — requires the API key when `SWIFTLLM_API_KEY` is set)
Security Hardening
v2.0 ships with production-grade security defaults out of the box:
- Default bind `127.0.0.1` — the server listens on localhost only. Set `SWIFTLLM_HOST=0.0.0.0` to expose it externally.
- Constant-time API key auth — set `SWIFTLLM_API_KEY` to require `Authorization: Bearer <key>` on all endpoints except `/health`. Uses constant-time comparison to prevent timing attacks.
- CORS from config — allowed origins are configured via `ServerConfig.cors_origins` or `SWIFTLLM_CORS_ORIGINS`. Defaults to same-origin only.
- Authenticated `/metrics` — Prometheus metrics are gated behind the same API key, preventing info leakage.
- Rate limiting — configurable via `tower-http` middleware. Defaults to no limit; set `SWIFTLLM_RATE_LIMIT` for a requests-per-second cap.
Launch examples
```bash
# Basic — localhost only, no auth
swiftllm serve --model meta-llama/Llama-3-8B

# Production — external, with API key and CORS
export SWIFTLLM_API_KEY="sk-your-secret-key"
export SWIFTLLM_CORS_ORIGINS="https://app.example.com"
swiftllm serve --model meta-llama/Llama-3-8B --host 0.0.0.0 --port 8000

# Verify auth is required
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer sk-your-secret-key"
```
The server is a drop-in replacement for the OpenAI API. Point any OpenAI client (Python, Node, curl) at
http://localhost:8000/v1 and it just works — including SSE streaming for chat completions.
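For example, with the official `openai` Python package (a sketch — the model name must match what the server reports under `/v1/models`):

```python
from openai import OpenAI

# Standard OpenAI client pointed at the local SwiftLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-your-secret-key",  # matches SWIFTLLM_API_KEY
)

stream = client.chat.completions.create(
    model="meta-llama/Llama-3-8B",
    messages=[{"role": "user", "content": "Explain continuous batching in one sentence."}],
    stream=True,  # served over SSE
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```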
Training & Fine-Tuning
SwiftLLM supports full training, LoRA, QLoRA fine-tuning, and GRPO reinforcement learning with three
optimizer backends: Muon (Newton-Schulz orthogonalization), AdamW
(decoupled weight decay), and SGD (optional Nesterov momentum). The CLI exposes subcommands:
swiftllm train, swiftllm finetune, and swiftllm grpo.
Simulated backend: CUDA training kernels are not yet wired up. Every
training run prints a visible [SIMULATED] banner. Real gradient training ships in a later release.
Supported Methods
| Method | Description | Memory |
|---|---|---|
| LoRA | Low-Rank Adaptation — trains small adapter matrices | Low |
| QLoRA | 4-bit quantized base model + LoRA adapters (~65-70% less VRAM) | Very Low |
| Full | Full parameter fine-tuning | High |
LoRA Fine-Tuning
```python
from swiftllm import fine_tune, LoRAConfig, TrainingConfig

lora = LoRAConfig(r=16, alpha=32, dropout=0.05)
config = TrainingConfig(learning_rate=2e-4, num_epochs=3, per_device_batch_size=4)

fine_tune(model="meta-llama/Llama-3-8B",
          train_data="./my_data.jsonl",
          lora_config=lora,
          training_config=config,
          output_dir="./output")
```
```bash
# CLI — LoRA fine-tuning
swiftllm finetune -m meta-llama/Llama-3-8B \
  --train-data ./data/train.jsonl \
  --lora-r 16 --lora-alpha 32 --learning-rate 2e-4

# CLI — QLoRA (4-bit base + LoRA)
swiftllm train -m meta-llama/Llama-3-8B \
  --train-data ./data/train.jsonl --method qlora \
  --lora-r 16 --mixed-precision bf16
```
Auto-Ingest from Files or HuggingFace
Pass a directory, file list, or HuggingFace dataset directly to fine_tune() — SwiftLLM
automatically converts it to JSONL before training. No manual dataset preparation needed.
```python
# Fine-tune from a directory (auto-ingested to JSONL)
fine_tune(model="meta-llama/Llama-3-8B",
          train_data="./my_corpus/",
          output_dir="./output",
          lora_r=16)

# Fine-tune from a HuggingFace dataset
fine_tune(model="meta-llama/Llama-3-8B",
          hf_dataset="tatsu-lab/alpaca",
          dataset_format="sft_completion",
          output_dir="./output",
          lora_r=32)

# Combine local files + HuggingFace in one command
fine_tune(model="meta-llama/Llama-3-8B",
          train_data="./my_docs/",
          hf_dataset="tatsu-lab/alpaca",
          hf_max_samples=10_000,
          dataset_format="sft_completion",
          output_dir="./output",
          lora_r=16)
```
GRPO Reinforcement Learning
Group Relative Policy Optimization — RL fine-tuning without a critic model. Includes CGAR depth curriculum (1.71x speedup), Process Reward Models (5 aggregation strategies for step-level feedback), and LongR dense token-level rewards (+9% LongBench v2).
```python
from swiftllm import GrpoTrainer, TrainingConfig
from swiftllm.config import GrpoConfig, CgarConfig, PrmConfig

config = TrainingConfig(
    model="meta-llama/Llama-3-8B",
    train_data="./data/math_prompts.jsonl",
    output_dir="./grpo_output",
    num_layers=32,
    grpo=GrpoConfig(group_size=8, kl_coeff=0.04),
    cgar=CgarConfig(),
    prm=PrmConfig(aggregation="last_step"),
)
trainer = GrpoTrainer(config)
trainer.train()
```
```bash
# GRPO with full research stack: CGAR + PRM + LongR
swiftllm grpo -m meta-llama/Llama-3-8B \
  --train-data ./data/math_prompts.jsonl \
  --group-size 8 --enable-prm --long-reward-weight 0.1 \
  --num-layers 32
```
Dataset Ingestion
Convert directories of text, code, PDF, DOCX, CSV, HTML, and JSON files into JSONL training data in one
command. Supports HuggingFace Hub datasets via HuggingFaceSource. Four output schemas:
pretraining, sft_messages, sft_completion, code.
SHA-256 deduplication across all sources.
```bash
# Local files — code fine-tuning
swiftllm dataset --input ./src/ ./docs/ --output train.jsonl --format code

# HuggingFace dataset
swiftllm dataset --hf-dataset tatsu-lab/alpaca --output alpaca.jsonl --format sft_completion

# Mixed: local files + HuggingFace combined
swiftllm dataset --input ./my_docs/ --hf-dataset tatsu-lab/alpaca \
  --hf-max-samples 10000 --format sft_completion --output combined.jsonl

# Large corpus with streaming (avoids full download)
swiftllm dataset --hf-dataset HuggingFaceFW/fineweb --hf-subset sample-10BT \
  --hf-streaming --hf-max-samples 100000 --format pretraining --output fineweb.jsonl
```
Training Data Formats
Three accepted input formats for supervised fine-tuning:
| Format | Schema | Use Case |
|---|---|---|
| Messages (JSONL) | `{"messages": [{"role": "...", "content": "..."}]}` | Chat / instruction tuning (recommended) |
| Prompt-Completion (JSONL) | `{"prompt": "...", "completion": "..."}` | Classic SFT, GRPO (add `"answer"` for RL) |
| Plain Text / CSV | One example per line, or CSV with prompt/completion columns | Language modelling pretraining |
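For concreteness, a short sketch that writes one record in each JSONL schema (file names are illustrative):

```python
import json

# Messages schema — chat / instruction tuning
chat_record = {
    "messages": [
        {"role": "user", "content": "What is 12 x 15?"},
        {"role": "assistant", "content": "180"},
    ]
}
with open("chat_train.jsonl", "w") as f:
    f.write(json.dumps(chat_record) + "\n")

# Prompt-completion schema — classic SFT; "answer" enables GRPO reward checks
sft_record = {"prompt": "What is 12 x 15?", "completion": "180", "answer": "180"}
with open("sft_train.jsonl", "w") as f:
    f.write(json.dumps(sft_record) + "\n")
```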
Advanced Inference
SwiftLLM v2.0 includes six test-time inference enhancements accessible via Python API and CLI flags. These improve output quality by spending more compute at inference time — no training required.
- Self-Consistency: `llm.generate_with_self_consistency()` — majority voting over N independent chains (Wang et al., 2022)
- Self-Refinement: `llm.generate_with_refinement()` — iterative critique→revision cycles (Madaan et al., 2023)
- Best-of-N: `llm.generate_best_of_n()` — dense scoring and reranking of N candidates
- Disaggregated Serving: `DisaggregatedServingConfig` — separate prefill/decode worker pools (Splitwise/DistServe)
- RLM: `llm.generate_with_rlm()` — recursive self-calling with a REPL sandbox and variable binding
- Dense Verification: `llm.generate_with_dense_verification()` — cross-attention token/step confidence scoring
- TurboQuant: set `EngineConfig(turbo_quant=TurboQuantConfig.quality_neutral())` for 3-5x KV cache compression
CLI Commands
The swiftllm CLI provides subcommands covering inference, serving, training, dataset ingestion, and model
management.
| Command | Description |
|---|---|
| `serve` | OpenAI-compatible API server |
| `generate` | Batch text generation with `--self-consistency`, `--refinement-rounds`, `--best-of-n`, `--rlm`, and `--dense-verification` flags |
| `chat` | Interactive REPL |
| `download` | Pull models from the HF Hub |
| `benchmark` | Throughput & TTFT tests |
| `convert` | HF ↔ SafeTensors ↔ GGUF |
| `info` | Model and system info |
| `train` | Full / LoRA / QLoRA training |
| `finetune` | LoRA convenience command |
| `grpo` | GRPO RL training with CGAR/PRM/LongR |
| `dataset` | Multi-format ingestion to JSONL with HuggingFace support |
Global Flags
| Flag | Description |
|---|---|
| `--model` | HF repo id, local path, or GGUF file |
| `--dtype` | `auto`, `float16`, `bfloat16`, `float32`, `int8`, `int4` |
| `--quantization` | `awq`, `gptq`, `bnb4`, `turboquant`, or `none` |
| `--tensor-parallel` | Number of GPUs for tensor parallelism |
| `--max-model-len` | Override the model's context length |
Python API
Inference
| Export | Description |
|---|---|
| `LLM` | Synchronous inference engine |
| `AsyncLLM` | Async engine with streaming iterators |
| `SamplingParams` | temperature, top_p, top_k, max_tokens, stop strings |
| `RequestOutput` | Per-request result wrapper |
| `EngineConfig` | gpu_memory_utilization, dtype, quantization, max_model_len |
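A sketch tying these exports together, assuming `EngineConfig` accepts the listed fields as keyword arguments:

```python
from swiftllm import LLM, SamplingParams, EngineConfig

# Cap VRAM use and context length for a shared GPU
engine_cfg = EngineConfig(
    gpu_memory_utilization=0.85,
    dtype="bfloat16",
    max_model_len=4096,
)
llm = LLM("meta-llama/Llama-3-8B", engine_config=engine_cfg)

outputs = llm.generate(["Ping?"], SamplingParams(max_tokens=16))
print(outputs[0].text)
```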
Training
| Export | Description |
|---|---|
| `Trainer` | High-level training loop with callbacks, early stopping, checkpointing |
| `TrainingConfig` | learning_rate, num_epochs, per_device_batch_size, lr_scheduler, mixed_precision |
| `LoRAConfig` | r, alpha, dropout, target_modules, use_rslora |
| `fine_tune` | One-call convenience — accepts JSONL, directory, or HuggingFace dataset |
| `GrpoTrainer` | GRPO RL training with CGAR curriculum integration |
| `GrpoConfig` | group_size, kl_coeff, clip_eps, correctness/format/length weights |
| `CgarConfig` | Curriculum-Guided Adaptive Recursion — shallow_end, medium_end phases |
| `PrmConfig` | Process Reward Model — 5 aggregation strategies (Min/Mean/Product/LastStep/WeightedMean) |
| `LongRewardConfig` | LongR dense token-level rewards — weight, aggregation, normalise |
| `grpo_train` | One-call GRPO convenience function |
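A sketch of the `Trainer` path (as opposed to the `fine_tune` one-liner); the constructor signature is assumed by analogy with `GrpoTrainer`, and any fields beyond those listed above are illustrative:

```python
from swiftllm import Trainer, TrainingConfig

# End-to-end run; remember the current loop prints a [SIMULATED] banner
config = TrainingConfig(
    model="meta-llama/Llama-3-8B",
    train_data="./data/train.jsonl",
    output_dir="./output",
    learning_rate=2e-4,
    num_epochs=3,
    per_device_batch_size=4,
    lr_scheduler="cosine",
    mixed_precision="bf16",
)
trainer = Trainer(config)
trainer.train()
```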
Dataset Ingestion
| Export | Description |
|---|---|
| `DatasetIngester` | Full-control multi-format dataset ingestion API |
| `IngestionConfig` | input_paths, output_path, format, chunk_size, hf_sources |
| `IngestionResult` | Ingestion summary — chunks, files, HF rows, skipped files |
| `HuggingFaceSource` | Describes one HuggingFace Hub dataset with auto-detected schema |
| `DatasetFormat` | Output format enum — pretraining, sft_messages, sft_completion, code |
| `ingest_dataset` | One-liner for local files, HuggingFace, or combined ingestion |
| `prepare_dataset` | Convenience shortcut available as `swiftllm.prepare_dataset()` |
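A sketch of programmatic ingestion; the `IngestionConfig` field names follow the table above, while the `HuggingFaceSource` argument, the `run()` method, and the result attributes are illustrative assumptions:

```python
from swiftllm import DatasetIngester, IngestionConfig, HuggingFaceSource, DatasetFormat

config = IngestionConfig(
    input_paths=["./my_docs/"],
    hf_sources=[HuggingFaceSource("tatsu-lab/alpaca")],  # schema auto-detected
    output_path="combined.jsonl",
    format=DatasetFormat.SFT_COMPLETION,
)

result = DatasetIngester(config).run()  # assumed entry point
print(f"Wrote {result.chunks} chunks from {result.files} files")
```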
Phase 3 — Inference Configs
| Export | Description |
|---|---|
| `SelfConsistencyConfig` | num_samples, extractor strategy, temperature |
| `RefinementConfig` | max_rounds, stopping criterion, improvement metric |
| `VerificationConfig` | num_candidates, scoring strategy, threshold |
| `DisaggregatedServingConfig` | Separate prefill/decode worker pool config |
| `RlmConfig` | max_depth, mode (DISABLED/SHALLOW/REASONING/AGENTIC) |
| `DenseVerificationConfig` | min_confidence, strategy (SCORE_ONLY/GATE/GATE_AND_REGEN) |
| `TurboQuantConfig` | key_bits, value_bits, presets: quality_neutral(), aggressive() |
Hybrid Architecture (Phase 1)
| Export | Description |
|---|---|
| `HybridModelBuilder` | Fluent builder for hybrid Mamba-3 models |
| `build_mamba3_reasoning_model` | Mamba-3 + LatentMoE + RLM + Dense Verification preset |
| `HybridModelConfig` | Top-level architecture configuration |
| `MambaConfig` | Mamba-3 SSM layer configuration |
| `LatentMoeConfig` | Latent MoE FFN configuration (87.5% less inter-GPU traffic) |
| `estimate_parameters` | Compute the total parameter count for a HybridModelConfig |
| `parameter_summary` | Formatted breakdown of parameter counts by component |
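A sketch of working with the preset; `build_mamba3_reasoning_model()` is assumed to take no required arguments and to return a `HybridModelConfig`:

```python
from swiftllm import build_mamba3_reasoning_model, estimate_parameters, parameter_summary

# Preset: Mamba-3 + LatentMoE + RLM + Dense Verification
config = build_mamba3_reasoning_model()

print(f"Estimated parameters: {estimate_parameters(config):,}")
print(parameter_summary(config))  # per-component breakdown
```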
Configuration
Every SWIFTLLM_* environment variable maps to a field in EngineConfig or
ServerConfig. Set them in your shell profile, systemd unit, or Docker Compose file — they are
read at startup and override coded defaults. Explicit constructor arguments and CLI flags always take final
precedence.
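A sketch of the precedence rule, using `SWIFTLLM_GPU_MEMORY_UTILIZATION` from the table below (set in-process here purely for illustration; normally you would export it in your shell profile or unit file):

```python
import os

from swiftllm import LLM, EngineConfig

# Environment variable — picked up at engine startup as the default
os.environ["SWIFTLLM_GPU_MEMORY_UTILIZATION"] = "0.80"

# Explicit constructor argument — takes final precedence over the env var
llm = LLM(
    "meta-llama/Llama-3-8B",
    engine_config=EngineConfig(gpu_memory_utilization=0.95),
)
```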
GPU & Memory
| Variable | Default | Description |
|---|---|---|
| `SWIFTLLM_GPU_MEMORY_UTILIZATION` | 0.90 | Fraction of GPU VRAM for model weights + KV cache (0.0–1.0). Raise to ~0.95 on dedicated hosts. |
| `SWIFTLLM_GPU_OVERHEAD_MB` | 0 | VRAM (in MB) to reserve for the OS and other processes. |
| `SWIFTLLM_NUM_GPU_LAYERS` | all | Number of layers to offload to GPU. 0 = CPU-only, 999 = all. |
| `SWIFTLLM_SWAP_SPACE` | 4.0 | CPU swap space in GiB for KV cache offloading. |
| `SWIFTLLM_CPU_OFFLOAD_GB` | 0.0 | Model weight gigabytes to keep in CPU RAM instead of GPU. |
| `SWIFTLLM_KV_CACHE_DTYPE` | auto | Data type for the KV cache. fp8_e4m3/fp8_e5m2 halves memory. |
| `SWIFTLLM_BLOCK_SIZE` | 16 | Tokens per PagedAttention block. Allowed: 8, 16, 32. |
| `SWIFTLLM_FLASH_ATTENTION` | true | Enable FlashAttention kernels. |
| `SWIFTLLM_ENFORCE_EAGER` | false | Disable CUDA graph capture; use eager execution. |
| `CUDA_VISIBLE_DEVICES` | (all) | Restrict which GPUs are visible, e.g. 0,2. |
Tensor Parallelism & Multi-GPU
| Variable | Default | Description |
|---|---|---|
| `SWIFTLLM_TENSOR_PARALLEL_SIZE` | 1 | GPUs for tensor parallelism. Must evenly divide the number of attention heads. |
| `SWIFTLLM_PIPELINE_PARALLEL_SIZE` | 1 | Pipeline-parallel stages. |
| `NCCL_DEBUG` | — | NCCL logging level: INFO, WARN, TRACE. |
| `NCCL_P2P_DISABLE` | 0 | Set to 1 to disable GPU peer-to-peer on certain PCIe topologies. |
Scheduling & Batching
| Variable | Default | Description |
|---|---|---|
| `SWIFTLLM_MAX_NUM_SEQS` | 256 | Maximum concurrent sequences in a batch. |
| `SWIFTLLM_MAX_NUM_BATCHED_TOKENS` | 8192 | Maximum total tokens per forward pass. |
| `SWIFTLLM_MAX_PADDINGS` | 256 | Maximum padding tokens tolerated per batch. |
| `SWIFTLLM_SCHEDULER_POLICY` | fcfs | fcfs, sjf, or priority. |
| `SWIFTLLM_PREEMPTION_MODE` | swap | swap (KV cache to CPU) or recompute. |
| `SWIFTLLM_ENABLE_PREFIX_CACHING` | false | Reuse KV cache across requests sharing the same prefix. |
| `SWIFTLLM_ENABLE_CHUNKED_PREFILL` | false | Interleave prefill and decode; reduces time-to-first-token. |
| `SWIFTLLM_NUM_PARALLEL` | 1 | Parallel inference slots per model. |
| `SWIFTLLM_MAX_LOADED_MODELS` | 1 | Models held in GPU memory simultaneously. |
| `SWIFTLLM_KEEP_ALIVE` | 300 | Seconds a model stays loaded after the last request. |
Speculative Decoding
| Variable | Default | Description |
|---|---|---|
| `SWIFTLLM_SPECULATIVE_MODEL` | — | Draft model for speculative decoding. |
| `SWIFTLLM_NUM_SPECULATIVE_TOKENS` | 5 | Tokens to draft per step. |
| `SWIFTLLM_SPECULATIVE_MAX_MODEL_LEN` | — | Override the max sequence length for the draft model. |
Model & Weights
| Variable | Default | Description |
|---|---|---|
| `SWIFTLLM_MODEL_DIR` | ~/.cache/swiftllm/models | Default directory for downloaded models. |
| `SWIFTLLM_OFFLINE` | false | Set to 1 to disable all network downloads (air-gapped mode). |
| `SWIFTLLM_DTYPE` | auto | Weight data type: auto, float16, bfloat16, float32, int8, int4, fp8_e4m3, fp8_e5m2. |
| `SWIFTLLM_QUANTIZATION` | none | Quantization method: none, awq, gptq, squeezellm, gguf. |
| `SWIFTLLM_MAX_MODEL_LEN` | (model default) | Override the model's max sequence length. |
| `SWIFTLLM_TRUST_REMOTE_CODE` | false | Allow executing custom code from HuggingFace repos. |
| `SWIFTLLM_DEVICE` | auto | Device: auto, cuda, cpu, metal, rocm. |
| `SWIFTLLM_SEED` | 0 | Global random seed. |
| `HF_TOKEN` | — | HuggingFace API token for gated models. |
LoRA
| Variable | Default | Description |
|---|---|---|
| `SWIFTLLM_ENABLE_LORA` | false | Enable LoRA adapter support in the inference engine. |
| `SWIFTLLM_MAX_LORAS` | 1 | Maximum LoRA adapters loaded simultaneously. |
| `SWIFTLLM_MAX_LORA_RANK` | 16 | Maximum LoRA rank. |
Server & Networking
| Variable | Default | Description |
|---|---|---|
| `SWIFTLLM_HOST` | 127.0.0.1 | HTTP bind address. Localhost by default for security; set 0.0.0.0 to expose externally. |
| `SWIFTLLM_PORT` | 8000 | Listen port. |
| `SWIFTLLM_API_KEY` | — | Bearer-token API key (constant-time comparison). |
| `SWIFTLLM_CORS_ALLOW_ORIGINS` | * | Comma-separated allowed CORS origins. |
| `SWIFTLLM_SSL_CERTFILE` | — | Path to the TLS certificate for HTTPS. |
| `SWIFTLLM_SSL_KEYFILE` | — | Path to the TLS private key for HTTPS. |
| `SWIFTLLM_ROOT_PATH` | — | URL prefix for reverse-proxy deployments. |
| `SWIFTLLM_SERVED_MODEL_NAME` | — | Override the model name in API responses. |
| `SWIFTLLM_MAX_LOG_LEN` | — | Truncate request/response logs to this many characters. |
| `SWIFTLLM_RESPONSE_ROLE` | assistant | Default role name in chat completion responses. |
| `SWIFTLLM_METRICS_ENABLED` | true | Expose Prometheus /metrics (authenticated when an API key is set). |
Build & CUDA
| Variable | Default | Description |
|---|---|---|
| `CUDA_PATH` / `CUDA_HOME` | — | Path to the CUDA toolkit. |
| `CUDACXX` | — | Path to the nvcc binary. |
| `CMAKE_ARGS` | — | Extra CMake arguments for the llama-cpp-python build. |
Logging & Debug
| Variable | Default | Description |
|---|---|---|
| `RUST_LOG` | info | Rust log level: trace, debug, info, warn, error. |
| `SWIFTLLM_LOG_LEVEL` | — | Python log level: DEBUG, INFO, WARNING, ERROR. |
Supported Models
Ten model families supported out of the box, with auto-detection for HuggingFace, GGUF, SafeTensors, and PyTorch formats. Phase 1 research models (Mamba, Jamba) are available when building with hybrid architecture support.
| Model Family | Variants | Notes |
|---|---|---|
| LLaMA 3 | 8B, 70B, 405B | Meta's latest flagship. Full tensor-parallel support for 70B/405B. |
| Code Llama | 7B, 13B, 34B | Code-specialized. Python, Instruct, and base variants. |
| Mistral / Mixtral / Devstral | 7B, 8x7B, Devstral | Dense (7B), MoE (Mixtral 8x7B), and code-agent (Devstral) variants. |
| Qwen 3 / 3.5 | 0.6B – 235B | Alibaba multilingual. MoE and dense variants. |
| Phi-3 / Phi-4 | Mini, Small, Medium | Microsoft small-language models optimized for edge deployment. |
| Deepseek R1 / V4 | R1-Distill, V4-0324 | Reasoning-focused. R1 chain-of-thought and V4 MoE architectures. |
| Gemma | 2B, 7B | Google open models. Instruct and base variants. |
| Mamba | 130M – 2.8B | Phase 1 — SSM backbone. Requires hybrid architecture build. |
| Jamba | Hybrid configs | Phase 1 — Mamba + Attention + MoE hybrid (AI21 architecture). |
All models support HuggingFace (SafeTensors/PyTorch), GGUF, and raw SafeTensors formats. Format is auto-detected
by file extension. Use swiftllm convert to move between formats.
Get help
Three channels, tuned to different needs. Community-first, paid support available for production deployments.
- GitHub Discussions — ask questions, share benchmarks, and trade tips with the community.
- Issue Tracker — found a bug or want to request a feature? Open an issue with a minimal repro.
- Enterprise Support — SLAs, private deployments, and architecture reviews for production workloads.
Frequently asked questions
- GPU or CPU? Pass `--gpu` to force a CUDA build, or run on a machine with `nvidia-smi` detected. CPU wheels ship for x86_64 and aarch64 on both Linux and macOS, plus arm64 on Apple Silicon.
- Model formats: HuggingFace, GGUF, and SafeTensors are supported; use `swiftllm convert` to move between formats.
- Is training real yet? No — every run prints a `[SIMULATED]` banner. Real GPU gradient training is planned for a future release — follow the changelog for updates.
- Air-gapped installs: run `airgap-bundle.sh` on a connected machine to build a portable tarball with source, Python wheels, the Rust installer, and optional pre-downloaded models. SHA256 checksums verify integrity on the target.
- Benchmarking: run `swiftllm benchmark --concurrency 32 --num-requests 1000`. We consistently see competitive tokens/sec on LLaMA-3-8B with PagedAttention + continuous batching enabled. See the changelog for tested throughput per release.
- OpenAI compatibility: the server exposes `/v1/completions` and `/v1/chat/completions`, including streaming via SSE. Point any OpenAI client (Python, Node, curl) at `http://localhost:8000/v1` and it just works. API key auth, CORS, and Prometheus metrics are built in.
Troubleshooting
Installer fails with "externally-managed environment"
You're on Debian 12+, Ubuntu 23.04+, or recent macOS. Let the installer create a venv (the default), pass
--venv ~/.swiftllm, or activate your own environment before running.
nvidia-smi detected but CUDA build fails
Ensure nvcc is on your PATH (which nvcc). Driver version must be compatible with
CUDA 11.8+. On Ubuntu, the nvidia-cuda-toolkit package is often too old — install CUDA directly
from NVIDIA.
Training shows [SIMULATED]
This is expected behaviour — the training loop is a simulation stub. Full GPU training requires PyTorch and a CUDA environment. See the Training section.
Model downloads hang
Set HF_HUB_ENABLE_HF_TRANSFER=1 for the accelerated downloader, or pass
--revision main to pin the reference. For air-gapped installs, set
SWIFTLLM_OFFLINE=1 and pre-populate the cache.
Release history
Track every update, feature, and fix as SwiftLLM evolves.
Stable release — all beta features promoted to production. Hybrid Mamba-3, GRPO training, TurboQuant KV cache compression, and advanced test-time inference.
- Added: Hybrid Mamba-3 SSM + Transformer + MoE architecture (Phase 1) — selective SSM with MIMO multi-head scan, LatentMoE with dynamic-bias load balancing, Jamba-style hybrid blocks
- Added: GRPO / CGAR / PRM / LongReward training pipelines (Phase 2) — reinforcement learning fine-tuning without a critic model
- Added: Self-consistency, refinement, best-of-N, disaggregated serving, Recursive Language Model, and Dense Verification (Phase 3)
- Added: TurboQuant KV cache compression (ICLR 2026) — Walsh-Hadamard rotation + Max-Lloyd codebooks for 3-5x memory reduction
- Added: Lifecycle scripts (`install.sh`, `update.sh`, `uninstall.sh`) for full install management
- Added: Comprehensive static analysis sweep — zero Clippy warnings, logic bug fixes
- Added: 329 backend tests and 66 frontend tests — all passing
TurboQuant KV cache compression — online vector quantization from ICLR 2026.
- Added: Full TurboQuant implementation (Zandieh et al., arXiv 2504.19874) — random rotation via fast Walsh-Hadamard transform, precomputed Max-Lloyd codebooks for 1-4 bit scalar quantization
- Added: `TurboQuantMse` (MSE-optimal) and `TurboQuantProd` (unbiased inner products via JL sign sketch) variants
- Added: `TurboQuantConfig` in Rust and Python with `quality_neutral()` and `aggressive()` presets
- Added: `TurboQuantKvCache` — compressed KV cache layer with slot-based store/load/clear
- Added: 25 Rust + 16 Python tests covering codebook construction, rotation invertibility, roundtrip quality, memory stats
Code quality & zero-warning build, lifecycle scripts, comprehensive static analysis and bug fixes.
- Fixed: Resolved all 130+ compiler warnings across 4 workspace crates — dead code annotations, doc comments, unused imports
- Fixed: Logic bug in `CurriculumScheduler::ssm_lr_scale()` — identical if/else branches; replaced with correct linear inverse mapping
- Fixed: Incorrect `#[cfg(has_cuda)]` in PyO3 bindings — changed to proper `#[cfg(feature = "cuda")]`
- Added: `update.sh` and `uninstall.sh` lifecycle scripts with branch/tag switching and purge options
- Fixed: Python SDK imports cleanly without PyTorch — `torch_model` features guarded with `try/except ImportError`
- Added: 304 Rust tests + 50 Python tests — all passing with zero actionable warnings
Security audit, CUDA acceleration kernels, and PyTorch integration.
- Security (Critical): Added `Drop` for CUDA storage, `check_cuda_last_error()` after all kernel launches, constant-time API key comparison
- Security (High): CORS from config, authenticated `/metrics`, default bind `127.0.0.1`, input validation on legacy endpoints
- Added: Custom CUDA kernels: `mamba3_scan.cu`, `moe_dispatch.cu`, `dense_verif_attn.cu`, `rlm_ops.cu`, `linear_f16.cu`
- Added: `hybrid_model.py` and `torch_model.py` — PyTorch `nn.Module` bridge for GPU-executable hybrid models
- Fixed: Integer overflow in block sizing, TOCTOU race in scheduler, UTF-8 slicing panic in repr
Phase 1 hybrid architectures, Phase 2 training enhancements, Phase 3 inference and model-level reasoning.
- Added: Phase 1: `mamba.rs`, `moe.rs`, `jamba.rs` — Mamba SSM with MIMO scan, sparse MoE routing, Jamba hybrid blocks
- Added: Phase 2: `grpo.rs`, `curriculum.rs`, `process_reward.rs`, `long_reward.rs` — GRPO RL, CGAR curriculum, PRM step scoring, LongR dense rewards
- Added: Phase 3: Self-consistency voting, multi-round self-refinement, best-of-N verification, disaggregated prefill/decode serving
- Added: Phase 3 Model: `rlm.rs` (Recursive Language Model with REPL sandbox) and `dense_verification.rs` (cross-attention draft scoring)
- Added: 14 new Python config dataclasses, CLI flags for self-consistency/refinement/best-of-N/RLM/dense-verification, 4 new example scripts
HuggingFace dataset support and multi-format dataset ingestion pipeline.
- Added: `HuggingFaceSource` — auto-detects Alpaca, ShareGPT, OpenAI messages, prompt/completion, Q&A, and plain text schemas
- Added: Multi-format ingestion: `.txt`, `.md`, `.py`, `.rs`, `.pdf`, `.docx`, `.csv`, `.html`, `.jsonl` and 40+ more into JSONL
- Added: 4 output schemas: pretraining, sft_messages, sft_completion, code — with SHA-256 dedup and code-aware chunking
- Added: CLI `swiftllm dataset` subcommand with 14 `--hf-*` flags; streaming mode for large corpora
- Added: Auto-ingest in `Trainer` and `fine_tune()` — directories and HF datasets converted transparently
Training UX fixes — simulated-stub banner, train-data path validation, and full regression coverage.
- Fixed: `Trainer.train()` now prints a visible `[SIMULATED]` banner making it obvious the current loop is a stub (no weights, no gradients) rather than real training
- Fixed: `swiftllm train` and `swiftllm finetune` validate `--train-data` / `--eval-data` paths up front — missing, non-regular, or empty files now fail fast with clear errors
- Changed: Validation covers paths loaded from `--config` JSON so typos in saved configs surface immediately
- Added: Full regression matrix: install → download → generate (18.61 tok/s) → finetune (LoRA) → train on Ubuntu 24.04 + CUDA 13.0
CPU & ARM wheel support, installer portability, and critical Rust-side safety fixes.
- Added: CPU-only build is now the default — portable wheel with zero CUDA dependencies, buildable on Apple Silicon, Graviton, Raspberry Pi, Jetson, Ampere
- Added: Top-level `swiftllm` crate exposes `cpu` and `cuda` Cargo features; CUDA opt-in via `./install.sh --gpu`
- Added: `airgap-bundle.sh --arch` flag auto-maps to pip platform tags and rustup target triples for cross-architecture bundles
- Added: `[serve]` optional dependency (fastapi, uvicorn) wired into `install.sh` and the airgap bundle
- Fixed: PEP 668 handling on Ubuntu 23.04+ / Debian 12+ externally-managed Python
- Fixed: Replaced non-portable `grep -oP` with `sed` for CUDA detection (macOS/BSD compatible)
- Security: Fixed 14 `partial_cmp().unwrap()` calls — replaced with `unwrap_or(Ordering::Equal)` to prevent NaN panics
- Security: Added `checked_add` bounds validation in GGUF loader — malformed files can no longer cause slice panics
Air-gapped installation and comprehensive security hardening.
- Added: `airgap-bundle.sh` creates portable install archives (source, pip wheels, rustup-init, optional models)
- Added: `install.sh --airgap` flag for fully offline installation from a bundle
- Added: Runtime offline mode via `SWIFTLLM_OFFLINE=1` — disables all HF downloads
- Security (Critical): Fixed JSON injection in SSE streaming — replaced raw `format!()` with `serde_json::json!()`
- Security (Critical): Fixed shell injection in `airgap-bundle.sh` — model names now passed via `sys.argv`
- Security (Critical): Added SHA256 checksum verification for downloaded `rustup-init`
- Security: Path validation against directory traversal; symlink traversal protection in offline cache
- Fixed: Use-after-move in `engine.rs`; usize negation in `trainer.rs`; missing `mut` on `eval_data`
Training & fine-tuning infrastructure, engine optimization, scheduler improvements.
- Added: `swiftllm-training` Rust crate with LoRA / QLoRA / full fine-tuning
- Added: Muon optimizer with Newton-Schulz orthogonalization, plus AdamW and SGD with linear/cosine/constant LR schedulers
- Added: Dataset loading (JSONL/CSV/text) with instruction templates; rolling-window metrics; perplexity
- Added: Checkpoint save/load with `save_total_limit`; Python `Trainer` with callbacks and `EarlyStoppingConfig`
- Added: O(n) top-k / beam / logprobs selection via quickselect; `/metrics` JSON and Prometheus endpoints
- Fixed: Eliminated redundant read-then-write lock in sampling hot path; numerically stable log-softmax
- Fixed: O(n) victim selection for preemption scheduling; gradient clipping and LoRA scaling
Initial release — Rust core, PagedAttention, continuous batching, OpenAI API, Python SDK.
- Added: Complete Rust rewrite with 5 modular crates (core, models, cuda, server, training)
- Added: PagedAttention memory management: block allocator with copy-on-write
- Added: Continuous batching scheduler with preemption (swap / recompute)
- Added: Token sampling: greedy, temperature, top-k, top-p, min-p, repetition penalty
- Added: OpenAI-compatible HTTP API built with Axum; Python SDK (`LLM`, `AsyncLLM`, `SamplingParams`)
- Added: Speculative decoding with draft-model acceleration; multi-GPU tensor and pipeline parallelism