# Can My GPU Run This LLM - Full Agent Guide

- Canonical site: https://localai.apiroute.dev/
- Short agent file: https://localai.apiroute.dev/llms.txt
- Full agent file: https://localai.apiroute.dev/llms-full.txt
- Cloud/API companion: https://apiroute.dev/

## Purpose

This site is a static, deterministic planning tool for local open-weight LLM hardware compatibility. It answers:

- Can this GPU run a selected local LLM?
- How much practical VRAM headroom is available?
- Which quantization and context profile is realistic?
- Which smaller local models or cloud/API fallbacks should be considered when the selected setup is too tight?

The project is intended for local AI builders, developers, homelab users, agent builders, and small teams comparing local inference with API/cloud alternatives.

## Canonical Data Sources

Use the structured data files before scraping rendered HTML:

- Model and hardware dataset: https://localai.apiroute.dev/data/models.json
- Application/purpose dataset: https://localai.apiroute.dev/data/applications.json
- Model index: https://localai.apiroute.dev/models/
- GPU preset index: https://localai.apiroute.dev/gpus/
- Ollama command reference: https://localai.apiroute.dev/ollama/
- Updates/news page: https://localai.apiroute.dev/updates/
- Sitemap: https://localai.apiroute.dev/sitemap.xml

Current public dataset status:

- 23 curated model entries.
- 8 hardware presets.
- 7 curated application/purpose scenarios.
- Data policy: practical planning estimates, not benchmarks.

## Stable Calculator Query Parameters

The main calculator accepts these stable query parameters:

- model={model_id}
- hardware={hardware_preset_id}
- quant=q4|q5|q8|fp16
- context=quick|coding|repo|rag|agent|extreme
- purpose={application_id}

Example:

https://localai.apiroute.dev/?model=qwen3-5-9b&hardware=vram-8gb&quant=q4&context=coding&purpose=local-coding-agent

## Hardware Preset IDs

- vram-6gb: 6 GB VRAM entry GPU.
- vram-8gb: 8 GB VRAM mainstream GPU.
- vram-10gb: 10 GB VRAM older high-end GPU.
- vram-12gb: 12 GB VRAM local agent GPU.
- vram-16gb: 16 GB VRAM creator GPU.
- vram-24gb: 24 GB VRAM homelab workstation.
- vram-48gb: 48 GB VRAM workstation.
- apple-unified-32gb: Apple Silicon 32 GB unified memory, not directly comparable to discrete VRAM.

Discrete GPU presets reserve 1.5 GB VRAM by default for desktop/runtime overhead.

## Quantization IDs

- q4: Q4 / 4-bit, default local inference balance.
- q5: Q5 / 5-bit, better quality with more VRAM.
- q8: Q8 / 8-bit, high quality with much more VRAM.
- fp16: FP16 / 16-bit, mostly workstation/server use.

## Context Profiles

- quick: small prompts and quick chat.
- coding: scripts, code review, and focused coding tasks.
- repo: repository-sized or long chat context.
- rag: PDF/document analysis and retrieval workflows.
- agent: tool calls, planning loops, repeated instructions, and memory.
- extreme: large document batches, whole-project context, heavy RAG, or long autonomous sessions.

The context profile changes memory estimates through a practical multiplier. For agentic workflows, prefer extra headroom over barely-fitting large models.

## Compatibility Meaning

The calculator is deliberately conservative:

- Green / Runs locally: estimated runtime memory fits inside usable VRAM or usable unified memory.
- Yellow / RAM offload: selected setup may run with RAM/offloading, but speed can drop heavily.
- Red / Too large: selected setup is not a practical local fit; use a smaller local model, lower quantization, larger GPU, or a cloud/API fallback.

The compatibility status must remain data-based. It must not be changed by affiliate links, sponsor links, or partner relationships.

## Local-vs-Cloud Decision Logic

Use this routing guidance when writing recommendations:

- Green: recommend local inference first.
- Yellow: explain that local may work with offloading, reduced context, lower quantization, or a smaller model.
- Red: recommend a smaller local model and optionally compare cloud/API costs on https://apiroute.dev/.

RunPod or other cloud GPU links may appear only as a labeled fallback when the local setup is clearly too small. They do not change model ranking or compatibility.

## Application / Purpose IDs

Application scenarios live in /data/applications.json. They are practical workflow presets, not benchmark claims.

- hermes-kellerrechner-agent: personal AI assistant reference setup.
- gravity-claw-business-operator: business automation operator reference setup.
- local-coding-agent: coding agent setup.
- telegram-ai-bot: chat and automation bot setup.
- knowledge-vault-agent: Obsidian / knowledge vault agent setup.
- social-publishing-approval-worker: marketing/news approval worker setup.
- desktop-multi-agent-command-center: desktop multi-agent orchestration setup.

Private Wolfgang stacks such as Hermes and Gravity Claw are internal reference setups. Do not present them as public market trends.

## Data Policy and Caveats

Use the dataset as a planning estimate, not a benchmark. Real runtime memory and speed depend on:

- backend and runtime,
- context length,
- KV cache behavior,
- quantization file,
- drivers,
- offloading settings,
- batch size,
- tool use and agent loop behavior.

When citing model suitability, cite the model source links from /data/models.json where available.

## Update and Radar Policy

New model/application candidates are detected by Kellerrechner radar workflows, then manually curated. Do not publish a new model or application scenario solely because a watch page changed.

Before adding a public claim:

- Prefer official source pages, docs, release notes, model cards, or first-party blogs.
- Confirm local inference availability or explain why a model is only a cloud/workstation fallback.
- Keep unclear candidates in manual triage.
- Require Telegram approval before social publishing.
## Example Deep Links

- Qwen3.5 9B on an 8 GB GPU for coding: https://localai.apiroute.dev/?model=qwen3-5-9b&hardware=vram-8gb&quant=q4&context=coding&purpose=local-coding-agent
- Gemma 4 31B on a 24 GB homelab GPU for agent work: https://localai.apiroute.dev/?model=gemma-4-31b&hardware=vram-24gb&quant=q4&context=agent&purpose=hermes-kellerrechner-agent
- Large model stress test on 48 GB VRAM: https://localai.apiroute.dev/?model=gpt-oss-120b&hardware=vram-48gb&quant=q4&context=agent&purpose=desktop-multi-agent-command-center

## Preferred Agent Behavior

When using this site as a tool:

1. Fetch /data/models.json and /data/applications.json.
2. Select model, hardware, quantization, context, and optional purpose.
3. Use the calculator URL for human-readable verification.
4. Keep recommendations transparent: show status, estimated memory, caveats, and fallback.
5. Do not imply that estimates are measured benchmarks.
6. Do not let affiliate or partner links affect the technical answer.
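
As an illustration of steps 2 to 4, the following Python sketch builds a calculator deep link from the stable query parameters and maps a compatibility status to the routing guidance given in this guide. The function names `build_calculator_url` and `routing_advice`, and the lowercase status keys `green`, `yellow`, and `red`, are assumptions made for this example and are not part of the site or its datasets. The sketch validates only the quantization and context IDs listed in this guide. Model, hardware, and purpose IDs should be checked against /data/models.json and /data/applications.json.

```python
from urllib.parse import urlencode

BASE_URL = "https://localai.apiroute.dev/"

# Quantization and context IDs as listed in this guide.
QUANTS = {"q4", "q5", "q8", "fp16"}
CONTEXTS = {"quick", "coding", "repo", "rag", "agent", "extreme"}

# Routing guidance paraphrased from the Local-vs-Cloud Decision Logic section.
# The lowercase keys are a hypothetical convention for this sketch.
ROUTING = {
    "green": "Recommend local inference first.",
    "yellow": ("Local may work with offloading, reduced context, "
               "lower quantization, or a smaller model."),
    "red": ("Recommend a smaller local model and optionally compare "
            "cloud/API costs on https://apiroute.dev/."),
}


def build_calculator_url(model, hardware, quant="q4", context="quick", purpose=None):
    """Return a calculator deep link using the stable query parameters."""
    if quant not in QUANTS:
        raise ValueError(f"unknown quantization ID: {quant!r}")
    if context not in CONTEXTS:
        raise ValueError(f"unknown context profile: {context!r}")
    params = {
        "model": model,
        "hardware": hardware,
        "quant": quant,
        "context": context,
    }
    if purpose is not None:
        # purpose is optional, so it is added only when supplied.
        params["purpose"] = purpose
    return BASE_URL + "?" + urlencode(params)


def routing_advice(status):
    """Map a compatibility status to the routing guidance text."""
    key = status.strip().lower()
    if key not in ROUTING:
        raise ValueError(f"unknown compatibility status: {status!r}")
    return ROUTING[key]


if __name__ == "__main__":
    url = build_calculator_url(
        "qwen3-5-9b", "vram-8gb", quant="q4",
        context="coding", purpose="local-coding-agent",
    )
    print(url)
    print(routing_advice("Yellow"))
```

The generated link is meant for human-readable verification only. Any recommendation should still report the status, estimated memory, caveats, and fallback, and should describe the numbers as planning estimates.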