Agent model guide

Pre-interpreted local LLM guidance for AI agents

Use this page when an agent needs model strengths, weak spots, VRAM/context estimates, routing notes, and disclosed fallback options without scraping the calculator UI.

Open agent-model-guide.json

Models covered: 34

Top agent score: 96

Q4 range: 3.2-466 GB

Ranking influence: false

Agent usage contract

Use technical fit first. Commercial options are allowed only as disclosed fallback suggestions and must not change model ranking.

Primary source: /data/agent-model-guide.json for model fit, memory, context, and routing.

Context profiles: Quick chat / small prompt, Coding assistant / scripts, Repo or long chat, PDF / document analysis, Agent with tools, Extreme long context.

Commercial fallback: /data/commercial-options.json only when local hardware is tight or cloud/API is appropriate.

Green

Prefer local inference when VRAM and context profile have clean headroom.

Yellow

Reduce context, use a lower-bit quantization, or expect offload slowdown.

Red

Choose a smaller local model, larger GPU, cloud GPU, or hosted API.
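The three fit levels can be sketched as a small classifier. This is a minimal sketch, not the calculator's actual logic: the page does not state numeric thresholds, so mapping green to the per-model "comfortable" VRAM figure and yellow to the "min" figure is an assumption, and `classify_fit` is a hypothetical name.

```python
def classify_fit(vram_gb: float, min_gb: float, comfortable_gb: float) -> str:
    """Return 'green', 'yellow', or 'red' for a model on a given GPU.

    Assumption: thresholds come from the per-model 'min' and 'comfortable'
    VRAM figures; the page does not define the cut-offs.
    """
    if vram_gb >= comfortable_gb:
        return "green"   # clean headroom: prefer local inference
    if vram_gb >= min_gb:
        return "yellow"  # reduce context, shrink the quantization, or accept offload
    return "red"         # smaller model, larger GPU, cloud GPU, or hosted API


# Qwen3 8B is listed with min 8 GB VRAM and comfortable 12 GB.
for vram in (12, 8, 6):
    print(vram, classify_fit(vram, min_gb=8, comfortable_gb=12))
```

An agent would typically compute this status first and only then consider any fallback suggestion.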

Highest agent-readiness

Best candidates when agent/tool workflow quality matters most

15 models

Qwen3-Coder-Next

score 96, Q4 52 GB, agent 75.4 GB, min 48 GB VRAM, comfortable 64 GB

Best for: chat, coding, agents/tool workflows. High-end local coding agents, repository-scale code edits, and tool-calling development workflows.

Weak for: vision/image understanding, reasoning. 52GB Q4 footprint makes it impractical for 24GB GPUs; reduce context if the runtime fails to start.

Commercial fallback IDs: runpod-cloud-gpu-fallback, apiroute-cloud-api-comparison

Qwen3-Next 80B-A3B Instruct

score 95, Q4 50 GB, agent 72.5 GB, min 48 GB VRAM, comfortable 64 GB

Best for: chat, coding, agents/tool workflows, reasoning

Weak for: vision/image understanding. 50GB Q4_K_M footprint is beyond practical single 24GB/32GB GPU use unless RAM offload is acceptable.

Commercial fallback IDs: runpod-cloud-gpu-fallback, apiroute-cloud-api-comparison

Qwen3.6 35B-A3B

score 95, Q4 24 GB, agent 34.8 GB, min 24 GB VRAM, comfortable 32 GB

Best for: chat, coding, agents/tool workflows, vision/image understanding

Weak for: The 24GB Ollama Q4 size leaves very little room on single 24GB GPUs once context and runtime overhead are included.

Commercial fallback IDs: apiroute-cloud-api-comparison

Qwen3-Coder 30B-A3B

score 94, Q4 19 GB, agent 27.6 GB, min 24 GB VRAM, comfortable 32 GB

Best for: chat, coding, agents/tool workflows, reasoning

Weak for: vision/image understanding. Long context leaves little headroom on single 24GB GPUs.

Commercial fallback IDs: apiroute-cloud-api-comparison

Qwen3.6 27B

score 94, Q4 17 GB, agent 24.6 GB, min 24 GB VRAM, comfortable 32 GB

Best for: chat, coding, agents/tool workflows, vision/image understanding

Weak for: Long multimodal context can still consume the headroom on a single 24GB GPU.

Commercial fallback IDs: apiroute-cloud-api-comparison

Devstral 2 123B

score 94, Q4 75 GB, agent 108.8 GB, min 80 GB VRAM, comfortable 96 GB

Best for: chat, coding, agents/tool workflows, reasoning

Weak for: vision/image understanding. 75GB Q4_K_M footprint is not realistic for single consumer GPUs; expect server GPU, multi-GPU, unified memory, or cloud fallback.

Commercial fallback IDs: runpod-cloud-gpu-fallback, apiroute-cloud-api-comparison

GLM-4.7-Flash

score 94, Q4 19 GB, agent 27.6 GB, min 24 GB VRAM, comfortable 32 GB

Best for: chat, coding, agents/tool workflows, reasoning

Weak for: vision/image understanding. Requires a recent/pre-release Ollama runtime according to the Ollama page; 19GB Q4 leaves limited headroom on single 24GB GPUs at long context.

Commercial fallback IDs: apiroute-cloud-api-comparison

Qwen3 30B-A3B Instruct 2507

score 93, Q4 19 GB, agent 27.6 GB, min 24 GB VRAM, comfortable 32 GB

Best for: chat, coding, agents/tool workflows, reasoning

Weak for: vision/image understanding. Native 262k context can exceed practical 24GB headroom; reduce context or use larger memory for long runs.

Commercial fallback IDs: apiroute-cloud-api-comparison

gpt-oss 120B

score 93, Q4 65 GB, agent 94.3 GB, min 80 GB VRAM, comfortable 96 GB

Best for: chat, coding, agents/tool workflows, reasoning

Weak for: vision/image understanding. Not realistic for consumer single-GPU setups below 80GB-class memory.

Commercial fallback IDs: runpod-cloud-gpu-fallback, apiroute-cloud-api-comparison

Qwen3-VL 30B-A3B Instruct

score 93, Q4 20 GB, agent 29 GB, min 24 GB VRAM, comfortable 32 GB

Best for: chat, agents/tool workflows, vision/image understanding, reasoning

Weak for: coding. Single 24GB GPUs have limited headroom for multiple images or very long context; use 32GB+ for comfort.

Commercial fallback IDs: apiroute-cloud-api-comparison

Qwen2.5 Coder 32B

score 92, Q4 21 GB, agent 30.4 GB, min 24 GB VRAM, comfortable 32 GB

Best for: chat, coding, agents/tool workflows, reasoning

Weak for: vision/image understanding. Little VRAM headroom on single 24GB GPUs with long context.

Commercial fallback IDs: apiroute-cloud-api-comparison

Qwen3.5 27B

score 92, Q4 17 GB, agent 24.6 GB, min 24 GB VRAM, comfortable 32 GB

Best for: chat, coding, agents/tool workflows, vision/image understanding

Weak for: Long multimodal context can exceed single 24GB headroom.

Commercial fallback IDs: apiroute-cloud-api-comparison

GLM-5.2

score 92, Q4 466 GB, agent 675.7 GB, min 512 GB VRAM, comfortable 768 GB

Best for: chat, coding, agents/tool workflows, reasoning

Weak for: vision/image understanding. No vision input and impractical for normal consumer GPUs; even 2-bit GGUF needs roughly 238-254 GB before runtime overhead.

Commercial fallback IDs: runpod-cloud-gpu-fallback, apiroute-cloud-api-comparison

Gemma 4 31B

score 91, Q4 20 GB, agent 29 GB, min 24 GB VRAM, comfortable 32 GB

Best for: chat, coding, agents/tool workflows, vision/image understanding

Weak for: Single 24GB GPUs have limited headroom for long context.

Commercial fallback IDs: apiroute-cloud-api-comparison

Devstral Small 2 24B

score 91, Q4 15 GB, agent 21.8 GB, min 16 GB VRAM, comfortable 24 GB

Best for: chat, coding, agents/tool workflows, vision/image understanding

Weak for: Large-context coding work is tight below 24GB VRAM.

Commercial fallback IDs: apiroute-cloud-api-comparison

Local starter agents

Practical for 8GB to 12GB local setups

14 models

Qwen2.5 Coder 14B

score 88, Q4 10.5 GB, agent 15.2 GB, min 12 GB VRAM, comfortable 16 GB

Best for: chat, coding, agents/tool workflows, reasoning

Weak for: vision/image understanding. Can be tight on 12GB GPUs at longer context.

Commercial fallback IDs: none

Qwen3.5 9B

score 88, Q4 6.6 GB, agent 9.57 GB, min 8 GB VRAM, comfortable 12 GB

Best for: chat, coding, agents/tool workflows, vision/image understanding

Weak for: Still a small model for large repo-scale coding tasks.

Commercial fallback IDs: none

Qwen3-VL 8B Instruct

score 88, Q4 6.5 GB, agent 9.42 GB, min 8 GB VRAM, comfortable 12 GB

Best for: chat, agents/tool workflows, vision/image understanding, reasoning

Weak for: coding. Vision workloads increase memory pressure with high-resolution images and long context; not a specialist coding model.

Commercial fallback IDs: none

Qwen3 4B Thinking 2507

score 87, Q4 3.2 GB, agent 4.64 GB, min 6 GB VRAM, comfortable 8 GB

Best for: chat, coding, agents/tool workflows, reasoning

Weak for: vision/image understanding. Thinking mode can be slower and the 256k context claim still needs practical VRAM headroom.

Commercial fallback IDs: none

Qwen3 8B

score 86, Q4 6 GB, agent 8.7 GB, min 8 GB VRAM, comfortable 12 GB

Best for: chat, coding, agents/tool workflows, reasoning

Weak for: vision/image understanding. Less capable than 14B/32B models for large tasks.

Commercial fallback IDs: none

Gemma 4 E4B

score 83, Q4 9.6 GB, agent 13.9 GB, min 12 GB VRAM, comfortable 16 GB

Best for: chat, coding, agents/tool workflows, vision/image understanding

Weak for: Smaller effective model; not ideal for deep repository-scale coding.

Commercial fallback IDs: none

Qwen2.5 Coder 7B

score 82, Q4 5.5 GB, agent 7.97 GB, min 8 GB VRAM, comfortable 12 GB

Best for: chat, coding, agents/tool workflows. Small local coding assistant and agent tool generation.

Weak for: vision/image understanding, reasoning. Larger refactors and complex multi-file reasoning.

Commercial fallback IDs: none

Phi-4 14B

score 82, Q4 10.5 GB, agent 15.2 GB, min 12 GB VRAM, comfortable 16 GB

Best for: chat, coding, agents/tool workflows, reasoning

Weak for: vision/image understanding. Smaller ecosystem than Llama/Qwen families.

Commercial fallback IDs: none

Llama 3.1 8B Instruct

score 78, Q4 6 GB, agent 8.7 GB, min 8 GB VRAM, comfortable 12 GB

Best for: chat, agents/tool workflows. Fast local chat, lightweight agents, low-cost local testing.

Weak for: coding, vision/image understanding, reasoning

Commercial fallback IDs: none

Gemma 3 12B

score 78, Q4 9 GB, agent 13.1 GB, min 12 GB VRAM, comfortable 16 GB

Best for: chat, agents/tool workflows, vision/image understanding, reasoning

Weak for: coding. Not primarily a coding model.

Commercial fallback IDs: none

DeepSeek R1 Distill Qwen 14B

score 76, Q4 10.5 GB, agent 15.2 GB, min 12 GB VRAM, comfortable 16 GB

Best for: chat, coding, reasoning. Local reasoning and debugging on 12GB/16GB GPUs.

Weak for: agents/tool workflows, vision/image understanding. Less ergonomic for fast Telegram-style assistant responses.

Commercial fallback IDs: none

Mistral 7B

score 74, Q4 5.5 GB, agent 7.97 GB, min 8 GB VRAM, comfortable 12 GB

Best for: chat, agents/tool workflows. Fast local chat and simple agent tasks.

Weak for: coding, vision/image understanding, reasoning

Commercial fallback IDs: none

DeepSeek-R1-0528-Qwen3-8B

score 72, Q4 6 GB, agent 8.7 GB, min 8 GB VRAM, comfortable 12 GB

Best for: chat, coding, reasoning. Updated local reasoning experiments, coding logic checks, and step-by-step technical analysis.

Weak for: agents/tool workflows, vision/image understanding. Verbose reasoning can slow simple agent workflows.

Commercial fallback IDs: none

Gemma 3 4B

score 70, Q4 3.5 GB, agent 5.08 GB, min 6 GB VRAM, comfortable 8 GB

Best for: chat, agents/tool workflows, vision/image understanding. Small multimodal local assistant and low-resource setups.

Weak for: coding, reasoning. Limited quality for coding and complex tasks.

Commercial fallback IDs: none

Workstation agent models

Useful for 16GB to 32GB local systems

4 models

gpt-oss 20B

score 89, Q4 14 GB, agent 20.3 GB, min 16 GB VRAM, comfortable 24 GB

Best for: chat, coding, agents/tool workflows, reasoning

Weak for: vision/image understanding. 12GB GPUs need offload or smaller fallback models.

Commercial fallback IDs: apiroute-cloud-api-comparison

Gemma 3 27B

score 84, Q4 18 GB, agent 26.1 GB, min 24 GB VRAM, comfortable 32 GB

Best for: chat, agents/tool workflows, vision/image understanding, reasoning

Weak for: coding. Less specialized for code than Qwen Coder.

Commercial fallback IDs: apiroute-cloud-api-comparison

Mixtral 8x7B

score 82, Q4 28 GB, agent 40.6 GB, min 32 GB VRAM, comfortable 48 GB

Best for: chat, coding, agents/tool workflows, reasoning

Weak for: vision/image understanding. Not practical for 24GB single-GPU setups without offload.

Commercial fallback IDs: runpod-cloud-gpu-fallback, apiroute-cloud-api-comparison

DeepSeek R1 Distill Qwen 32B

score 78, Q4 21 GB, agent 30.4 GB, min 24 GB VRAM, comfortable 32 GB

Best for: chat, coding, reasoning. Heavy local reasoning on 24GB GPUs.

Weak for: agents/tool workflows, vision/image understanding. Tight VRAM headroom and slower agent loops.

Commercial fallback IDs: apiroute-cloud-api-comparison

Large or fallback-first agents

Prefer larger local hardware, cloud GPU, or hosted API fallback

1 model

Llama 3.1 70B Instruct

score 88, Q4 44 GB, agent 63.8 GB, min 48 GB VRAM, comfortable 64 GB

Best for: chat, coding, agents/tool workflows, reasoning

Weak for: vision/image understanding. Too large for single 24GB consumer GPUs without heavy offload.

Commercial fallback IDs: runpod-cloud-gpu-fallback, apiroute-cloud-api-comparison

Context profiles

Agents should account for context before choosing hardware.

6 profiles

Quick chat / small prompt

Short Q&A, shell help, small config snippets, quick translation.

Memory multiplier: 0.9x

Coding assistant / scripts

Focused coding, small repo edits, review support, debugging one or two files.

Memory multiplier: 1x

Repo or long chat

Longer conversations, README plus source files, multi-step code reasoning.

Memory multiplier: 1.15x

PDF / document analysis

Document summaries, meeting notes, research pages, RAG-style retrieval prompts.

Memory multiplier: 1.35x

Agent with tools

Tool calls, planning loops, repeated instructions, memory, and workflow state.

Memory multiplier: 1.45x

Extreme long context

Large document batches, whole-project context, heavy RAG, or long autonomous sessions.

Memory multiplier: 1.7x
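The multipliers above can be applied as a simple scaling of a model's Q4 estimate. The agent figures on this page are consistent with Q4 x 1.45 (for example, 52 GB x 1.45 = 75.4 GB for Qwen3-Coder-Next); applying the other multipliers the same way is an assumption, and the function name and dictionary below are hypothetical.

```python
# Multipliers as listed in the context profiles above.
PROFILE_MULTIPLIERS = {
    "Quick chat / small prompt": 0.9,
    "Coding assistant / scripts": 1.0,
    "Repo or long chat": 1.15,
    "PDF / document analysis": 1.35,
    "Agent with tools": 1.45,
    "Extreme long context": 1.7,
}


def estimate_memory_gb(q4_gb: float, profile: str) -> float:
    """Scale a Q4 size estimate by the profile's memory multiplier.

    Assumption: every profile scales the Q4 estimate linearly, as the
    'Agent with tools' figures on this page appear to.
    """
    return q4_gb * PROFILE_MULTIPLIERS[profile]


# Qwen3 8B (Q4 6 GB) under each profile.
for name in PROFILE_MULTIPLIERS:
    print(f"{name}: {estimate_memory_gb(6, name):.2f} GB")
```

Comparing the result against available VRAM before picking hardware follows the guidance at the top of this section.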

Commercial options policy

Disclosed fallback options, never ranking input.

3 options

Commercial options are separate from technical compatibility and model ranking.

RunPod cloud GPU fallback

RunPod, referral_link_live, ranking influence: false

Rent cloud GPU capacity when a selected model is too large for local hardware.

The selected local setup is red / not a practical local fit.
A larger model needs temporary GPU capacity.
The user wants to test a model before buying hardware.

referral_credit

Cloud/API cost comparison

apiroute.dev, internal_companion_live, ranking influence: false

Compare API/cloud model costs after local hardware is tight or impractical.

The selected local setup is yellow or red.
The workload needs long context, hosted reliability, or a stronger model than local hardware can run.

owned_companion_project, no public link

Agent usage license

apiroute.dev / localai.apiroute.dev, concept, ranking influence: false

Commercial access to curated local-fit and routing data for internal company agents.

A team wants stable agent-readable data with higher limits, history, alerts, or support.

paid_product_concept, no public link
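The policy above can be sketched as a lookup that runs only after the technical fit is known. This is an illustrative sketch: the mapping of the fallback IDs `runpod-cloud-gpu-fallback` and `apiroute-cloud-api-comparison` to the RunPod and cost-comparison options is inferred from their names, the trigger statuses are a simplified reading of the conditions listed for each option, and `suggest_fallbacks` is a hypothetical helper.

```python
# Fit statuses at which each option may be suggested (simplified reading
# of the conditions listed above; an assumption, not published logic).
TRIGGER_STATUSES = {
    "runpod-cloud-gpu-fallback": {"red"},
    "apiroute-cloud-api-comparison": {"yellow", "red"},
}


def suggest_fallbacks(status: str, model_fallback_ids: list[str]) -> list[str]:
    """Return disclosed fallback IDs for a model at a given fit status.

    The function never takes or returns a score, so it cannot
    influence model ranking.
    """
    return [
        option_id
        for option_id in model_fallback_ids
        if status in TRIGGER_STATUSES.get(option_id, set())
    ]


ids = ["runpod-cloud-gpu-fallback", "apiroute-cloud-api-comparison"]
print(suggest_fallbacks("red", ids))
print(suggest_fallbacks("yellow", ids))
print(suggest_fallbacks("green", ids))
```

Keeping the suggestion step separate from scoring is one way to honor the "never ranking input" rule.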

Full agent model decision table

Model	Score	Best for	Weak for	Q4 estimate	Q4 agent profile	Local fit note	Commercial option IDs
Qwen3-Coder-Next	96	chat, coding, agents/tool workflows	vision/image understanding, reasoning, 52GB Q4 footprint makes it impractical for 24GB GPUs; reduce context if the runtime fails to start	52 GB	75.4 GB	Large local model. Prefer 48GB+ VRAM, multi-GPU, cloud GPU, or hosted API fallback.	runpod-cloud-gpu-fallback, apiroute-cloud-api-comparison
Qwen3-Next 80B-A3B Instruct	95	chat, coding, agents/tool workflows	vision/image understanding, 50GB Q4_K_M footprint is beyond practical single 24GB/32GB GPU use unless RAM offload is acceptable	50 GB	72.5 GB	Large local model. Prefer 48GB+ VRAM, multi-GPU, cloud GPU, or hosted API fallback.	runpod-cloud-gpu-fallback, apiroute-cloud-api-comparison
Qwen3.6 35B-A3B	95	chat, coding, agents/tool workflows	The 24GB Ollama Q4 size leaves very little room on single 24GB GPUs once context and runtime overhead are included	24 GB	34.8 GB	Workstation-local candidate. Prefer 32GB+ VRAM for agents or long context.	apiroute-cloud-api-comparison
Qwen3-Coder 30B-A3B	94	chat, coding, agents/tool workflows	vision/image understanding, Long context leaves little headroom on single 24GB GPUs	19 GB	27.6 GB	Workstation-local candidate. Prefer 32GB+ VRAM for agents or long context.	apiroute-cloud-api-comparison
Qwen3.6 27B	94	chat, coding, agents/tool workflows	Long multimodal context can still eat the headroom on a single 24GB GPU	17 GB	24.6 GB	Workstation-local candidate. Prefer 32GB+ VRAM for agents or long context.	apiroute-cloud-api-comparison
Devstral 2 123B	94	chat, coding, agents/tool workflows	vision/image understanding, 75GB Q4_K_M footprint is not realistic for single consumer GPUs; expect server GPU, multi-GPU, unified memory, or cloud fallback	75 GB	108.8 GB	Large local model. Prefer 48GB+ VRAM, multi-GPU, cloud GPU, or hosted API fallback.	runpod-cloud-gpu-fallback, apiroute-cloud-api-comparison
GLM-4.7-Flash	94	chat, coding, agents/tool workflows	vision/image understanding, Requires a recent/pre-release Ollama runtime according to the Ollama page; 19GB Q4 leaves limited headroom on single 24GB GPUs at long context	19 GB	27.6 GB	Workstation-local candidate. Prefer 32GB+ VRAM for agents or long context.	apiroute-cloud-api-comparison
Qwen3 30B-A3B Instruct 2507	93	chat, coding, agents/tool workflows	vision/image understanding, Native 262k context can exceed practical 24GB headroom; reduce context or use larger memory for long runs	19 GB	27.6 GB	Workstation-local candidate. Prefer 32GB+ VRAM for agents or long context.	apiroute-cloud-api-comparison
gpt-oss 120B	93	chat, coding, agents/tool workflows	vision/image understanding, Not realistic for consumer single-GPU setups below 80GB-class memory	65 GB	94.3 GB	Large local model. Prefer 48GB+ VRAM, multi-GPU, cloud GPU, or hosted API fallback.	runpod-cloud-gpu-fallback, apiroute-cloud-api-comparison
Qwen3-VL 30B-A3B Instruct	93	chat, agents/tool workflows, vision/image understanding	coding, Single 24GB GPUs have limited headroom for multiple images or very long context; use 32GB+ for comfort	20 GB	29 GB	Workstation-local candidate. Prefer 32GB+ VRAM for agents or long context.	apiroute-cloud-api-comparison
Qwen2.5 Coder 32B	92	chat, coding, agents/tool workflows	vision/image understanding, Little VRAM headroom on single 24GB GPUs with long context	21 GB	30.4 GB	Workstation-local candidate. Prefer 32GB+ VRAM for agents or long context.	apiroute-cloud-api-comparison
Qwen3.5 27B	92	chat, coding, agents/tool workflows	Long multimodal context can exceed single 24GB headroom	17 GB	24.6 GB	Workstation-local candidate. Prefer 32GB+ VRAM for agents or long context.	apiroute-cloud-api-comparison
GLM-5.2	92	chat, coding, agents/tool workflows	vision/image understanding, No vision input and impractical for normal consumer GPUs; even 2-bit GGUF needs roughly 238-254 GB before runtime overhead	466 GB	675.7 GB	Huge local model. Treat as a server, multi-GPU, very large unified-memory, or hosted API workload.	runpod-cloud-gpu-fallback, apiroute-cloud-api-comparison
Gemma 4 31B	91	chat, coding, agents/tool workflows	Single 24GB GPUs have limited headroom for long context	20 GB	29 GB	Workstation-local candidate. Prefer 32GB+ VRAM for agents or long context.	apiroute-cloud-api-comparison
Devstral Small 2 24B	91	chat, coding, agents/tool workflows	Large-context coding work is tight below 24GB VRAM	15 GB	21.8 GB	Workstation-local candidate. Prefer 24GB+ VRAM for agents or long context.	apiroute-cloud-api-comparison
gpt-oss 20B	89	chat, coding, agents/tool workflows	vision/image understanding, 12GB GPUs need offload or smaller fallback models	14 GB	20.3 GB	Workstation-local candidate. Prefer 24GB+ VRAM for agents or long context.	apiroute-cloud-api-comparison
Llama 3.1 70B Instruct	88	chat, coding, agents/tool workflows	vision/image understanding, Too large for single 24GB consumer GPUs without heavy offload	44 GB	63.8 GB	Large local model. Prefer 48GB+ VRAM, multi-GPU, cloud GPU, or hosted API fallback.	runpod-cloud-gpu-fallback, apiroute-cloud-api-comparison
Qwen2.5 Coder 14B	88	chat, coding, agents/tool workflows	vision/image understanding, Can be tight on 12GB GPUs at longer context	10.5 GB	15.2 GB	Practical 12GB local-agent candidate at Q4 with headroom checks.	none
Qwen3.5 9B	88	chat, coding, agents/tool workflows	Still a small model for large repo-scale coding tasks	6.6 GB	9.57 GB	Good small-local-model candidate for 8GB+ GPUs at Q4.	none
Qwen3-VL 8B Instruct	88	chat, agents/tool workflows, vision/image understanding	coding, Vision workloads increase memory pressure with high-resolution images and long context; not a specialist coding model	6.5 GB	9.42 GB	Good small-local-model candidate for 8GB+ GPUs at Q4.	none
Qwen3 4B Thinking 2507	87	chat, coding, agents/tool workflows	vision/image understanding, Thinking mode can be slower and the 256k context claim still needs practical VRAM headroom	3.2 GB	4.64 GB	Good small-local-model candidate for 8GB+ GPUs at Q4.	none
Qwen3 8B	86	chat, coding, agents/tool workflows	vision/image understanding, Less capable than 14B/32B models for large tasks	6 GB	8.7 GB	Good small-local-model candidate for 8GB+ GPUs at Q4.	none
Gemma 3 27B	84	chat, agents/tool workflows, vision/image understanding	coding, Less specialized for code than Qwen Coder	18 GB	26.1 GB	Workstation-local candidate. Prefer 32GB+ VRAM for agents or long context.	apiroute-cloud-api-comparison
Gemma 4 E4B	83	chat, coding, agents/tool workflows	Smaller effective model; not ideal for deep repository-scale coding	9.6 GB	13.9 GB	Practical 12GB local-agent candidate at Q4 with headroom checks.	none
Qwen2.5 Coder 7B	82	chat, coding, agents/tool workflows	vision/image understanding, reasoning, Larger refactors and complex multi-file reasoning	5.5 GB	7.97 GB	Good small-local-model candidate for 8GB+ GPUs at Q4.	none
Mixtral 8x7B	82	chat, coding, agents/tool workflows	vision/image understanding, Not practical for 24GB single-GPU setups without offload	28 GB	40.6 GB	Large local model. Prefer 48GB+ VRAM, multi-GPU, cloud GPU, or hosted API fallback.	runpod-cloud-gpu-fallback, apiroute-cloud-api-comparison
Phi-4 14B	82	chat, coding, agents/tool workflows	vision/image understanding, Smaller ecosystem than Llama/Qwen families	10.5 GB	15.2 GB	Practical 12GB local-agent candidate at Q4 with headroom checks.	none
Llama 3.1 8B Instruct	78	chat, agents/tool workflows, Fast local chat, lightweight agents, low-cost local testing	coding, vision/image understanding, reasoning	6 GB	8.7 GB	Good small-local-model candidate for 8GB+ GPUs at Q4.	none
DeepSeek R1 Distill Qwen 32B	78	chat, coding, reasoning	agents/tool workflows, vision/image understanding, Tight VRAM headroom and slower agent loops	21 GB	30.4 GB	Workstation-local candidate. Prefer 32GB+ VRAM for agents or long context.	apiroute-cloud-api-comparison
Gemma 3 12B	78	chat, agents/tool workflows, vision/image understanding	coding, Not primarily a coding model	9 GB	13.1 GB	Practical 12GB local-agent candidate at Q4 with headroom checks.	none
DeepSeek R1 Distill Qwen 14B	76	chat, coding, reasoning	agents/tool workflows, vision/image understanding, Less ergonomic for fast Telegram-style assistant responses	10.5 GB	15.2 GB	Practical 12GB local-agent candidate at Q4 with headroom checks.	none
Mistral 7B	74	chat, agents/tool workflows, Fast local chat and simple agent tasks	coding, vision/image understanding, reasoning	5.5 GB	7.97 GB	Good small-local-model candidate for 8GB+ GPUs at Q4.	none
DeepSeek-R1-0528-Qwen3-8B	72	chat, coding, reasoning	agents/tool workflows, vision/image understanding, Verbose reasoning can slow simple agent workflows	6 GB	8.7 GB	Good small-local-model candidate for 8GB+ GPUs at Q4.	none
Gemma 3 4B	70	chat, agents/tool workflows, vision/image understanding	coding, reasoning, Limited quality for coding and complex tasks	3.5 GB	5.08 GB	Good small-local-model candidate for 8GB+ GPUs at Q4.	none
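A minimal sketch of how an agent might query the table: pick the highest-scoring model whose Q4 agent-profile estimate fits the available VRAM, without looking at commercial option IDs. The rows are copied from the table above; treating "agent-profile estimate <= VRAM" as the fit test is an assumption, and `best_local_model` is a hypothetical name.

```python
from typing import NamedTuple, Optional


class Model(NamedTuple):
    name: str
    score: int
    q4_gb: float
    agent_gb: float  # "Q4 agent profile" column


# A few rows copied from the decision table.
MODELS = [
    Model("Qwen3-Coder-Next", 96, 52, 75.4),
    Model("Qwen3.6 27B", 94, 17, 24.6),
    Model("gpt-oss 20B", 89, 14, 20.3),
    Model("Qwen3.5 9B", 88, 6.6, 9.57),
    Model("Qwen3 8B", 86, 6, 8.7),
]


def best_local_model(models: list[Model], vram_gb: float) -> Optional[Model]:
    """Highest-scoring model whose agent-profile estimate fits in VRAM.

    Assumption: 'fits' means agent_gb <= vram_gb. Commercial option IDs
    are deliberately not an input, so they cannot change the ranking.
    """
    candidates = [m for m in models if m.agent_gb <= vram_gb]
    return max(candidates, key=lambda m: m.score, default=None)


for vram in (24, 12, 4):
    choice = best_local_model(MODELS, vram)
    print(vram, choice.name if choice else "no local fit")
```

When no model fits, the agent would fall through to the disclosed fallback options rather than re-rank the table.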