Local LLMs with Ollama — Open WebUI · Continue.dev integration
Run Llama / Qwen / DeepSeek locally on Mac and Windows. Cost, privacy, offline — an alternative to cloud LLMs.
Cloud LLMs (Claude, GPT, Gemini) are powerful but come with downsides: monthly bills, privacy (company code leaves your network), internet dependency. If your machine can run a big-enough model, local becomes a real alternative.
I think local LLMs are not so much a replacement for cloud LLMs as they are a different regime. Not because the quality ceiling is lower — though it often is — but rather because the constraints are completely different: no API bill, no data leaving your machine, and a latency that's bounded by your own hardware. Understanding what local is actually good at, versus where cloud wins, is the real skill this guide is trying to build.
This guide runs Llama 3.3 / Qwen 2.5 / DeepSeek-Coder etc. on macOS and Windows with Ollama, adds a ChatGPT-style UI via Open WebUI, and integrates with VS Code through Continue.dev.
Audience: developers who use an LLM daily and are evaluating local for cost / privacy reasons. Specs: M1 Pro or better or a GPU with 16GB+ VRAM.
TL;DR
brew install ollamaor download from ollama.comollama pull llama3.3:70b, or the lighterqwen2.5-coder:7bollama run qwen2.5-coder:7b→ chat- Want a UI? Open WebUI (Docker — under a minute)
- VS Code integration: install Continue.dev and register Ollama in
~/.continue/config.json
Prerequisites
- Mac: M1 Pro or newer + 16GB+ unified memory (32GB recommended)
- Windows / Linux: NVIDIA GPU with 8GB+ VRAM (16GB+ recommended), or a strong CPU + 32GB+ RAM
- (Optional) Docker — for Open WebUI
1. When local LLMs fit
Good fits
- Sensitive code — company policy bans cloud LLMs (healthcare, finance, defense)
- Heavy repetition — 1000+ requests/day stacks up on cloud pricing
- Offline — planes, unstable internet
- Experiments — model comparisons, fine-tuning
- Low-latency automation — sub-200ms response loops
Poor fits
- State-of-the-art quality — Claude Opus / GPT-5 level isn't local yet
- Low-spec machine — 8GB Mac M1, 4GB GPU — even 7B models drag
- Multimodal variety — images / audio options are thin locally
- Long context — Claude's 1M-token window is hard to match locally
Reality: a local 70B model ≈ GPT-4 quality on a subset of tasks. Coding assist, summaries, translation — fine. Complex reasoning — cloud still wins.
2. Install Ollama
2.1 macOS
brew install ollamaOr .dmg from ollama.com/download.
Run as a GUI app or as a background service:
# Background service (recommended)
brew services start ollama2.2 Windows
winget install Ollama.OllamaAfter installing, the ollama icon appears in the system tray. Background-runs automatically.
2.3 Linux
curl -fsSL https://ollama.com/install.sh | sh2.4 Verify
ollama --version # ollama version 0.x.x
curl http://localhost:11434/api/version
# {"version":"0.x.x"}3. First model — a light coding assistant
Start with a small model (faster download, faster responses):
3.1 Qwen 2.5 Coder 7B
ollama pull qwen2.5-coder:7b # 4.7GB
ollama run qwen2.5-coder:7bChat:
>>> Write a Python function to merge two sorted lists.
/bye to exit.
3.2 Bigger models (if you have the specs)
| Model | Size | Recommended RAM | Use |
|---|---|---|---|
qwen2.5-coder:7b | 4.7GB | 16GB | Quick coding assist |
qwen2.5-coder:32b | 19GB | 48GB | Strong coding |
llama3.3:70b | 40GB | 64GB | General / strong |
deepseek-coder-v2:16b | 9GB | 24GB | Code-specialized |
gemma2:9b | 5.5GB | 16GB | Well-rounded |
phi3.5:3.8b | 2.2GB | 8GB | Lightest (works on slow GPUs) |
ollama pull llama3.3:70b
ollama run llama3.3:70bDownload time = model size ÷ bandwidth. On 1Gbps, 70B is ~6 minutes.
3.3 Manage models
ollama list # installed models
ollama rm qwen2.5-coder:7b # remove
ollama show llama3.3:70b # metadata4. Open WebUI — ChatGPT-style UI
Use a browser UI instead of the CLI. One-minute Docker setup.
4.1 Install
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:mainDocker setup: /mac/docker-setup or /windows/docker-wsl2.
4.2 First use
Browser → http://localhost:3000 → create the first account (local only).
Ollama is detected automatically. Pick a model from the top-left dropdown and chat.
4.3 Strengths
- Saved conversation history
- Markdown rendering + code syntax highlighting
- Talk to multiple models side-by-side (comparison)
- File upload + RAG (optional)
5. VS Code integration — Continue.dev
5.1 Install the extension
VS Code Extensions:
- Continue (
Continue.continue)
Or code --install-extension Continue.continue.
5.2 Register Ollama models
~/.continue/config.json (create if missing):
{
"models": [
{
"title": "Qwen Coder 7B",
"provider": "ollama",
"model": "qwen2.5-coder:7b"
},
{
"title": "Llama 3.3 70B",
"provider": "ollama",
"model": "llama3.3:70b"
}
],
"tabAutocompleteModel": {
"title": "Qwen Coder 7B",
"provider": "ollama",
"model": "qwen2.5-coder:7b"
}
}5.3 Use it
In VS Code:
Cmd + I(Mac) /Ctrl + I(Win) — inline chatCmd + L/Ctrl + L— chat sidebar- Tab — autocomplete (uses
tabAutocompleteModel)
Continue.dev is an open-source alternative to Cursor / Copilot. Besides local models, you can register Claude / GPT keys too.
6. Performance tuning
6.1 GPU first (NVIDIA / Apple Silicon)
Ollama auto-uses the GPU. Confirm:
ollama ps
# NAME ID SIZE PROCESSOR
# qwen2.5-coder:7b ... 5GB 100% GPUNo GPU shown? Then:
- macOS: Apple Silicon is automatic (Intel Mac is CPU only)
- NVIDIA: check the CUDA driver (
nvidia-smi)
6.2 Adjust model size
Each model has quantization variants (perf ↔ quality):
ollama pull qwen2.5-coder:7b-instruct-q4_K_M # default (4-bit, recommended)
ollama pull qwen2.5-coder:7b-instruct-q8_0 # 8-bit (more accurate, bigger)
ollama pull qwen2.5-coder:7b-instruct-fp16 # full precision (most accurate, largest)4-bit is the standard — barely-noticeable quality loss at half the memory.
6.3 Adjust context length
Default context is 4K. To extend:
ollama run qwen2.5-coder:7b
>>> /set parameter num_ctx 16384Or a Modelfile:
FROM qwen2.5-coder:7b
PARAMETER num_ctx 16384
Longer = more memory + slower responses.
7. Cost / power comparison
7.1 Cloud vs local (rough monthly)
| Usage | Claude Pro | Local (Ollama, incl. electricity) |
|---|---|---|
| Light (~100 prompts/mo) | $20/mo | M2 Pro electricity ~$3/mo |
| Medium (~1000 prompts/mo) | $100/mo (API) | ~$5/mo |
| Heavy (~10000 prompts/mo) | $500+/mo (API) | ~$15/mo |
But there's the up-front machine cost (M2 Pro 32GB ~$2,500). Break-even is 1–2 years even for heavy users.
7.2 Power (rough)
- M2 Pro at full load: ~30W → 24h = 0.72 kWh/day → 22 kWh/mo × $0.15 = $3.3
- NVIDIA 4090 at full load: ~450W → 24h at full load is unrealistic, but heavy real use ~$20-40/mo
Local only draws power while you use it. Idle = 0.
8. Verify
# 1. Ollama working
ollama --version
curl http://localhost:11434/api/version
# 2. Model installed
ollama list
# 3. Response test
echo "Write hello world in Rust" | ollama run qwen2.5-coder:7b
# 4. Open WebUI (if running)
curl http://localhost:3000
# 5. Continue.dev
code . # In VS Code: Cmd+L → pick model → chat9. Troubleshooting
ollama: command not found
- macOS: brew prefix on PATH (
/opt/homebrew/bin) - Windows: restart PowerShell
- Linux:
~/.ollama/binor system PATH
Responses are very slow
ollama psto confirm GPU usage. If CPU is at 100% the GPU isn't recognized- macOS Intel: no GPU acceleration — Apple Silicon recommended
- NVIDIA: CUDA driver + restart Ollama
- Not enough RAM: model size ÷ 8 ≤ available RAM, otherwise it crawls
"Out of memory" error
- Model is larger than RAM / VRAM
- Try a smaller model (
:3binstead of:7b) or a quantization (q4_K_M) - Close other apps and retry
Continue.dev doesn't see Ollama models
- Validate
~/.continue/config.json(jq . config.json) - Restart VS Code
- Confirm Ollama is running (
curl localhost:11434/api/version)
Want to block non-local access to Open WebUI
- Default is localhost-only — already blocked externally
- If exposing publicly, set
WEBUI_AUTH=True(required signup gate)
Model download interrupted (network blip)
- Re-run
ollama pull— it resumes - Check
~/.ollama/models/; if a file is corrupted, delete and retry
10. Recommended starting points
If you're just beginning:
- Mac M2 Pro 16GB:
qwen2.5-coder:7b+ Continue.dev - Mac M2 Pro 32GB:
qwen2.5-coder:32b+ Continue.dev + Open WebUI - Windows + RTX 4090:
llama3.3:70b+ Open WebUI - CPU-only 16GB:
phi3.5:3.8b(slow but functional)
After 3 months, compare your satisfaction to the cloud LLM you'd otherwise pay for, then decide.
11. What's next
- Claude Code setup — /ai-agents/claude-code — cloud counterpart
- Cursor setup — /ai-agents/cursor-setup — Cursor can also point at Ollama
- Multi-tool workflow — /ai-agents/multi-tool-workflow
- MCP servers — /ai-agents/mcp-servers — Ollama + MCP is possible
References
Changelog
- 2026-05-16: First draft. Ollama install + model comparison + Open WebUI + Continue.dev integration + perf tuning + cost comparison + six troubleshooting cases.
Keep reading
- OpenAI Codex CLI setup — comparison with Claude Code and Cursor
OpenAI's official coding-agent CLI. Install, auth, autonomy modes, and how it compares to Claude Code — all in 30 minutes.
- GitHub Copilot setup — splitting work between Copilot, Claude Code, and Cursor
30-minute setup for Copilot in VS Code and JetBrains. Plus a decision table for which jobs go to Copilot vs Claude Code/Cursor.
- Running Claude Code, Cursor, and Copilot together
Practical workflow for using all three AI tools at once without conflicts — task-to-tool mapping, keybinding clashes, context isolation.