How to Set Up Local AI Coding (No Cloud Required)

Updated 2025-03-10 · 7 min read · 1318 words

Running AI coding tools locally means your code never leaves your machine. No API costs, no internet required, no privacy policy to worry about. This guide covers every step from hardware assessment to a working IDE setup.

Why Go Local?

The case for local AI is strongest in these scenarios:

  • Proprietary codebase: Client contracts, NDAs, or internal policy prohibit sending code to third-party APIs
  • High volume usage: If you're running agents that make hundreds of API calls per day, local inference can be cheaper than cloud after hardware amortization
  • Offline development: Planes, trains, remote sites, corporate networks with outbound restrictions
  • Latency sensitivity: Local inference on a good GPU can be faster than round-trip API calls for small models
  • Privacy by principle: Some developers simply prefer keeping their work on their own hardware

The tradeoff: local models are still behind frontier models (GPT-4o, Claude 3.5 Sonnet) on complex reasoning and very large codebase tasks. For most daily coding tasks — completions, simple refactors, test writing — the gap has narrowed enough to be practical.


Hardware Requirements

Your GPU is the primary constraint. Here's what you can run at each tier:

| GPU | VRAM | Models That Fit Comfortably | Approx. Speed (tok/s) |
|---|---|---|---|
| NVIDIA RTX 3060 | 12GB | DeepSeek Coder 6.7B Q8, Qwen2.5-Coder 7B Q8 | 30–50 |
| NVIDIA RTX 3080/4070 | 10–12GB | DeepSeek Coder V2 16B Q4, Llama 3.1 8B Q8 | 40–70 |
| NVIDIA RTX 4080 | 16GB | DeepSeek Coder V2 16B Q6, StarCoder2 15B Q8 | 50–90 |
| NVIDIA RTX 4090 | 24GB | DeepSeek Coder V2 16B Q8, Qwen2.5-Coder 32B Q4 | 60–110 |
| 2× A100 / H100 (80GB) | 160GB | DeepSeek V3 full, Qwen2.5-Coder 72B Q8 | 100–200 |
| Apple M3 Max | 96GB unified | Qwen2.5-Coder 72B Q4 | 30–60 |
| Apple M2/M3 Pro | 18–36GB unified | Qwen2.5-Coder 32B Q4, DeepSeek Coder V2 16B | 20–50 |
💡 Apple Silicon is Surprisingly Good

Apple Silicon uses unified memory — your 36GB M3 Pro can run a 32B parameter model at Q4 quantization with room left over for a long context window, something a discrete 24GB GPU struggles to fit. For macOS developers, Apple Silicon is the best local AI platform per dollar below the $3,000 tier.

CPU-only: Works with 7B models at 3–8 tokens/second. Usable for chat with patience; not practical for real-time completion.

Minimum practical setup: Any NVIDIA GPU with 8GB+ VRAM, or Apple Silicon with 16GB+ unified memory.


Model Benchmarks

Choose your model based on VRAM budget and task type:

| Model | Parameters | HumanEval | MBPP | VRAM (Q4) | VRAM (Q8) | Best For |
|---|---|---|---|---|---|---|
| Qwen2.5-Coder 32B | 32B | 90.2% | 90.1% | 20GB | 36GB | Best local quality |
| DeepSeek Coder V2 | 16B | 86.1% | 87.7% | 10GB | 18GB | Best quality/VRAM ratio |
| Qwen2.5-Coder 14B | 14B | 85.9% | 88.2% | 9GB | 16GB | 16GB GPU target |
| StarCoder2 15B | 15B | 80.4% | 76.6% | 10GB | 17GB | Code completion focus |
| Qwen2.5-Coder 7B | 7B | 88.4% | 83.5% | 5GB | 9GB | Low VRAM, surprising quality |
| DeepSeek Coder 6.7B | 6.7B | 78.6% | 75.1% | 4GB | 8GB | 8GB GPU users |
| CodeLlama 34B | 34B | 62.4% | 57.0% | 22GB | 40GB | Older but still capable |

Note: HumanEval and MBPP scores are for base/instruct variants. Actual completion quality in IDE use varies — these benchmarks measure standalone code generation. For comparison: GPT-4o scores ~90.2% on HumanEval, Claude 3.5 Sonnet ~92%.
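The VRAM columns follow directly from parameter count and quantization width. A rough rule of thumb (an approximation, not a vendor formula): weights take params × bits ÷ 8 bytes, plus roughly 15% overhead for the KV cache and runtime buffers.

```python
def estimate_vram_gb(params_billion: float, bits: int, overhead: float = 0.15) -> float:
    """Rough VRAM estimate: weight bytes plus ~15% for KV cache and buffers."""
    weight_gb = params_billion * bits / 8  # 1B params at 8 bits ~= 1 GB of weights
    return round(weight_gb * (1 + overhead), 1)

# Sanity-check against the table above:
print(estimate_vram_gb(32, 4))  # -> 18.4 (table says 20GB for Qwen2.5-Coder 32B Q4)
print(estimate_vram_gb(16, 8))  # -> 18.4 (table says 18GB for DeepSeek Coder V2 Q8)
print(estimate_vram_gb(7, 4))   # -> 4.0  (table says 5GB for Qwen2.5-Coder 7B Q4)
```

The table's numbers run slightly higher than this estimate because real quantization formats (e.g. Q4_K_M) store scaling metadata alongside the 4-bit weights.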


Step 1: Install Ollama

Ollama is the de facto standard for running local LLMs. It handles model downloads, quantization selection, and exposes a REST API that most AI coding tools can speak to.

# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download installer from ollama.com/download
# Then verify install:
ollama --version

Pull your first model:

# Best option for 8GB+ VRAM (~5GB download, Q4 quantized):
ollama pull qwen2.5-coder:7b

# Best option for 16GB+ VRAM (10GB download):
ollama pull deepseek-coder-v2:16b

# Best option for 24GB+ VRAM (20GB download):
ollama pull qwen2.5-coder:32b

# Verify it works:
ollama run qwen2.5-coder:7b "Write a Python function to flatten a nested list"

Ollama starts a server on http://localhost:11434 automatically when you run a model. It stays running in the background.
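Any script or tool can talk to that server over plain HTTP. A minimal sketch using only the standard library — the /api/generate endpoint and its model/prompt/stream fields come from Ollama's REST API; the helper names are my own:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> dict:
    # stream=False returns a single JSON object instead of a line-delimited stream
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    payload = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (requires the Ollama server running locally):
# print(generate("qwen2.5-coder:7b", "Write a Python function to flatten a nested list"))
```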

⚠️ Model Storage

Models are stored in ~/.ollama/models on macOS/Linux. A 16B Q4 model is ~10GB. Make sure your home directory has space. You can set OLLAMA_MODELS=/path/to/larger/drive to redirect storage.


Step 2: Set Up Continue in VS Code

Continue is the best open-source AI coding plugin, and it connects directly to Ollama.

# Install from VS Code Extensions panel:
# Search: "Continue - Codestral, Claude, and more"
# Or via CLI:
code --install-extension Continue.continue

Edit the Continue config file (~/.continue/config.json):

{
  "models": [
    {
      "title": "Qwen2.5 Coder 7B (Local)",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b",
      "contextLength": 32768
    },
    {
      "title": "DeepSeek Coder V2 16B (Local)",
      "provider": "ollama",
      "model": "deepseek-coder-v2:16b",
      "contextLength": 163840
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5 Coder 7B (Autocomplete)",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  },
  "contextProviders": [
    { "name": "code" },
    { "name": "docs" },
    { "name": "diff" },
    { "name": "terminal" },
    { "name": "open" }
  ]
}

Reload VS Code. You should see a Continue panel on the left with your local models available.

💡 Use Separate Models for Chat vs. Completion

Use a larger model (14B–32B) for chat — these queries are less latency-sensitive and quality matters more. Use a smaller model (7B) for real-time tab completion — sub-100ms latency matters here and 7B models are fast enough on modern GPUs.


Step 3: Set Up Aider for Local Agentic Coding

Aider is the best local-compatible coding agent. It runs in your terminal and connects to Ollama.

pip install aider-chat

# Start with DeepSeek Coder V2 (if you have it pulled)
aider --model ollama/deepseek-coder-v2:16b

# Or with Qwen 32B for best local quality:
aider --model ollama/qwen2.5-coder:32b

# Useful flags:
# --no-auto-commits  review changes before committing
# --watch-files      auto-add changed files
# --map-tokens 2048  how much of the codebase map to include
aider --model ollama/deepseek-coder-v2:16b \
  --no-auto-commits \
  --watch-files \
  --map-tokens 2048

Once in the Aider REPL:

> /add src/api/users.ts src/services/userService.ts
> Add input validation to the createUser endpoint using zod


Step 4: Configure JetBrains IDE (Optional)

If you use IntelliJ, PyCharm, or another JetBrains IDE:

  1. Install the Continue JetBrains plugin from the marketplace
  2. The configuration is shared with VS Code (~/.continue/config.json)
  3. Alternatively, install LLM Plugin for a lighter-weight option

Cost Analysis: Local vs. Cloud

Scenario: Developer making 500 AI queries/day (completions + chat + agent tasks)

Cloud (Copilot Pro + Claude API):
  Copilot Pro:        $10/mo
  Claude Sonnet API:  ~$25/mo at moderate agent use
  Total:              ~$35/mo = $420/year

Local (one-time hardware + electricity):
  RTX 4080 GPU:       $700 (amortized over 3 years = $233/year)
  Electricity (150W): ~$10/mo = $120/year
  Ollama + software:  $0
  Total year 1:       $820 (hardware + running costs)
  Total year 2+:      ~$120/year

Break-even on hardware cost alone: ~20 months for a heavy user paying $35/month in cloud costs, or ~35 months for a $20/month user. Counting the ~$10/month in electricity pushes the heavy-user break-even out to ~28 months.

The math favors cloud for most individual developers. Local makes sense if: (1) you have strong privacy requirements that cloud tools can't satisfy, (2) you already have suitable hardware, or (3) you're running a team with multiple developers.
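The break-even figures above reduce to hardware cost divided by avoided monthly cloud spend; passing electricity as a monthly running cost shows how much it shifts the numbers:

```python
def breakeven_months(hardware_cost: float, cloud_per_month: float,
                     electricity_per_month: float = 0.0) -> float:
    """Months until cumulative cloud spend exceeds hardware plus running costs."""
    net_saving = cloud_per_month - electricity_per_month
    if net_saving <= 0:
        raise ValueError("local running costs meet or exceed cloud spend")
    return hardware_cost / net_saving

print(round(breakeven_months(700, 35)))      # -> 20 months (hardware only)
print(round(breakeven_months(700, 20)))      # -> 35 months (hardware only)
print(round(breakeven_months(700, 35, 10)))  # -> 28 months with $10/mo electricity
```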

ℹ️ Hybrid Setup is the Pragmatic Choice

Many developers run local models for completion (low latency, no cost, fine for 7B quality) while keeping a Claude or GPT-4o API key for complex chat/agent tasks where frontier model quality matters. This costs $5–15/month in API fees vs. $35–50 for all-cloud.
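In Continue, a hybrid setup is just two entries in the same config file. A sketch of ~/.continue/config.json — the cloud entry is illustrative (substitute your own key; "anthropic" is one of Continue's supported providers):

```json
{
  "models": [
    {
      "title": "Qwen2.5 Coder 7B (Local)",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b"
    },
    {
      "title": "Claude 3.5 Sonnet (Cloud, complex tasks)",
      "provider": "anthropic",
      "model": "claude-3-5-sonnet-latest",
      "apiKey": "<YOUR_API_KEY>"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5 Coder 7B (Autocomplete)",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }
}
```

Tab completion stays local and free; only the chat requests you explicitly route to the cloud model cost anything.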

FAQ

Can I run AI coding tools without a GPU?

Yes, but expect 3–8 tokens per second on CPU vs. 30–100+ on a GPU. That speed is workable for chat (you can read the response as it streams) but not for real-time tab completion. For practical daily use, a GPU with 8GB+ VRAM is the minimum worth the setup effort. Apple Silicon (M1/M2/M3) is excellent — unified memory means even the base 8GB M2 can run 7B models quickly.
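To make those speeds concrete, here is the wait for a typical 400-token chat response at each rate (simple arithmetic, not a benchmark):

```python
def response_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to stream a full response at a given generation rate."""
    return round(tokens / tokens_per_second, 1)

print(response_seconds(400, 5))   # CPU at 5 tok/s  -> 80.0 seconds
print(response_seconds(400, 60))  # GPU at 60 tok/s -> 6.7 seconds
```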

Which local model is best for coding in 2025?

Qwen2.5-Coder 32B is the best local model if you have 24GB+ VRAM — it scores 90.2% on HumanEval, comparable to GPT-4o. For 12–16GB VRAM, DeepSeek Coder V2 16B is the best quality-to-VRAM ratio. For 8GB GPUs, Qwen2.5-Coder 7B punches well above its weight at 88.4% HumanEval. Avoid the older CodeLlama models — newer alternatives beat them at every size.

Is Ollama the only option for local models?

Ollama is the easiest. Alternatives: LM Studio (GUI-based, good for non-terminal users), llama.cpp directly (most control, most complex), vLLM (for serving to multiple users/tools simultaneously), Jan.ai (privacy-focused GUI). For IDE integration, Ollama's REST API is the most widely supported by coding plugins like Continue and Aider.

How do local models compare to GPT-4o for real coding work?

For routine tasks — boilerplate, test writing, simple refactors, code explanation — Qwen2.5-Coder 32B is very close to GPT-4o in practice. The gap shows up in: complex multi-file reasoning, understanding subtle business logic, debugging highly unusual errors, and tasks requiring broad world knowledge alongside coding. If 90% of your AI usage is routine coding tasks, local models are a practical choice.

What is the best setup for privacy-conscious teams?

For teams: deploy a vLLM or Ollama server on internal infrastructure (a single A100/H100 server can serve 10–20 concurrent developers). Each developer's IDE plugin (Continue, Copilot alternative) points to the internal endpoint. Code never leaves your network. Qwen2.5-Coder 72B on an A100 80GB provides near-frontier quality with full data control. This is the architecture financial services and healthcare teams typically adopt.