Local LLMs with Ollama — Open WebUI · Continue.dev integration

Run Llama / Qwen / DeepSeek locally on Mac and Windows. Cost, privacy, offline — an alternative to cloud LLMs.

Cloud LLMs (Claude, GPT, Gemini) are powerful but come with downsides: monthly bills, privacy (company code leaves your network), internet dependency. If your machine can run a big-enough model, local becomes a real alternative.

Local LLMs are less a replacement for cloud LLMs than a different regime. The quality ceiling is often lower, but the bigger difference is the constraints: no API bill, no data leaving your machine, and latency bounded by your own hardware. This guide aims to show what local is actually good at and where cloud still wins.

This guide runs Llama 3.3 / Qwen 2.5 / DeepSeek-Coder etc. on macOS and Windows with Ollama, adds a ChatGPT-style UI via Open WebUI, and integrates with VS Code through Continue.dev.

Audience: developers who use an LLM daily and are evaluating local for cost / privacy reasons. Specs: an M1 Pro or better Mac, or a GPU with 16GB+ VRAM (8GB is enough for the smallest models).

TL;DR

  1. brew install ollama or download from ollama.com
  2. ollama pull llama3.3:70b, or the lighter qwen2.5-coder:7b
  3. ollama run qwen2.5-coder:7b → chat
  4. Want a UI? Open WebUI (Docker — under a minute)
  5. VS Code integration: install Continue.dev and register Ollama in ~/.continue/config.json

Prerequisites

  • Mac: M1 Pro or newer + 16GB+ unified memory (32GB recommended)
  • Windows / Linux: NVIDIA GPU with 8GB+ VRAM (16GB+ recommended), or a strong CPU + 32GB+ RAM
  • (Optional) Docker — for Open WebUI

1. When local LLMs fit

Good fits

  • Sensitive code — company policy bans cloud LLMs (healthcare, finance, defense)
  • Heavy repetition — 1000+ requests/day stacks up on cloud pricing
  • Offline — planes, unstable internet
  • Experiments — model comparisons, fine-tuning
  • Low-latency automation — sub-200ms response loops

Poor fits

  • State-of-the-art quality — Claude Opus / GPT-5 level isn't local yet
  • Low-spec machine — 8GB Mac M1, 4GB GPU — even 7B models drag
  • Multimodal variety — images / audio options are thin locally
  • Long context — Claude's 1M-token window is hard to match locally

Reality: a local 70B model ≈ GPT-4 quality on a subset of tasks. Coding assist, summaries, translation — fine. Complex reasoning — cloud still wins.


2. Install Ollama

2.1 macOS

brew install ollama

Or download the .dmg from ollama.com/download.

Run as a GUI app or as a background service:

# Background service (recommended)
brew services start ollama

2.2 Windows

winget install Ollama.Ollama

Or download the installer from ollama.com/download.

After installing, the Ollama icon appears in the system tray and the server runs in the background automatically.

2.3 Linux

curl -fsSL https://ollama.com/install.sh | sh

2.4 Verify

ollama --version            # ollama version 0.x.x
curl http://localhost:11434/api/version
# {"version":"0.x.x"}
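
The same check can be scripted, which is handy at the top of a setup script. A minimal Python sketch, assuming the default port 11434 shown above; parse_version and ollama_version are illustrative helper names, not part of any Ollama client library:

```python
import json
import urllib.request


def parse_version(body: bytes) -> str:
    """Extract the version string from an /api/version response body."""
    return json.loads(body)["version"]


def ollama_version(base_url: str = "http://localhost:11434", timeout: float = 2.0):
    """Return the server version, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return parse_version(resp.read())
    except OSError:  # URLError and socket timeouts are both OSError subclasses
        return None


if __name__ == "__main__":
    version = ollama_version()
    print(f"Ollama {version} is running" if version else "Ollama is not reachable")
```

Returning None instead of raising keeps the caller's logic to a single if-check.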

3. First model — a light coding assistant

Start with a small model (faster download, faster responses):

3.1 Qwen 2.5 Coder 7B

ollama pull qwen2.5-coder:7b      # 4.7GB
ollama run qwen2.5-coder:7b

Chat:

>>> Write a Python function to merge two sorted lists.

/bye to exit.
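
The interactive prompt talks to the same local HTTP server that the verify step queried, so scripts can send prompts too. A hedged sketch using only the standard library and the /api/generate endpoint with streaming turned off; build_generate_request and generate are illustrative names:

```python
import json
import urllib.request


def build_generate_request(model: str, prompt: str,
                           base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a non-streaming POST to Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the prompt and return the model's full reply as one string."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]


# Usage (needs a running Ollama server with the model already pulled):
# print(generate("qwen2.5-coder:7b", "Write a Python function to merge two sorted lists."))
```

With "stream" set to False the server answers with a single JSON object, which is simpler to handle than the default line-by-line stream.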

3.2 Bigger models (if you have the specs)

Model                    Size    Recommended RAM   Use
qwen2.5-coder:7b         4.7GB   16GB              Quick coding assist
qwen2.5-coder:32b        19GB    48GB              Strong coding
llama3.3:70b             40GB    64GB              General / strong
deepseek-coder-v2:16b    9GB     24GB              Code-specialized
gemma2:9b                5.5GB   16GB              Well-rounded
phi3.5:3.8b              2.2GB   8GB               Lightest (works on slow GPUs)

ollama pull llama3.3:70b
ollama run llama3.3:70b

Download time = model size ÷ bandwidth. On 1Gbps, 70B is ~6 minutes.
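
The arithmetic is easy to get wrong by a factor of eight, because model sizes are in gigabytes and link speeds in gigabits. A small sketch of the calculation, assuming the link runs at full line rate (real downloads are usually slower); download_minutes is an illustrative helper:

```python
def download_minutes(size_gb: float, link_gbps: float) -> float:
    """Minutes to download size_gb gigabytes over a link_gbps gigabit/s link.

    Assumes the link runs at full line rate the whole time.
    """
    size_gigabits = size_gb * 8          # bytes -> bits
    seconds = size_gigabits / link_gbps
    return seconds / 60


for name, size_gb in [("qwen2.5-coder:7b", 4.7), ("llama3.3:70b", 40)]:
    print(f"{name}: {download_minutes(size_gb, 1.0):.1f} min at 1 Gbps, "
          f"{download_minutes(size_gb, 0.1):.1f} min at 100 Mbps")
```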

3.3 Manage models

ollama list                          # installed models
ollama rm qwen2.5-coder:7b           # remove
ollama show llama3.3:70b             # metadata
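
The installed-model list is also available as JSON from the /api/tags endpoint, which is useful when a script needs to know how much disk the models take. A sketch that parses such a response; the sample body is hypothetical and trimmed to the two fields used, and summarize_tags is an illustrative name:

```python
import json


def summarize_tags(body: str) -> list[tuple[str, float]]:
    """Turn an /api/tags response into (model name, size in GB) pairs, largest first."""
    models = json.loads(body)["models"]
    pairs = [(m["name"], round(m["size"] / 1e9, 1)) for m in models]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical response body, trimmed to the two fields used here.
sample = ('{"models": [{"name": "qwen2.5-coder:7b", "size": 4700000000}, '
          '{"name": "llama3.3:70b", "size": 40000000000}]}')
for name, size_gb in summarize_tags(sample):
    print(f"{name:<20} {size_gb:>6.1f} GB")
```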

4. Open WebUI — ChatGPT-style UI

Use a browser UI instead of the CLI. One-minute Docker setup.

4.1 Install

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Docker setup: /mac/docker-setup or /windows/docker-wsl2.

4.2 First use

Browser → http://localhost:3000 → create the first account (local only).

Ollama is detected automatically. Pick a model from the top-left dropdown and chat.

4.3 Strengths

  • Saved conversation history
  • Markdown rendering + code syntax highlighting
  • Talk to multiple models side-by-side (comparison)
  • File upload + RAG (optional)

5. VS Code integration — Continue.dev

5.1 Install the extension

VS Code Extensions:

  • Continue (Continue.continue)

Or code --install-extension Continue.continue.

5.2 Register Ollama models

~/.continue/config.json (create if missing):

{
  "models": [
    {
      "title": "Qwen Coder 7B",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b"
    },
    {
      "title": "Llama 3.3 70B",
      "provider": "ollama",
      "model": "llama3.3:70b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen Coder 7B",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }
}
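
A stray comma or a missing key is the usual reason a hand-edited config fails to load. The sketch below checks a config for the three keys used in the example above; treating them as required is an assumption made for this check, not Continue's official schema, and config_problems is an illustrative name:

```python
import json

REQUIRED_KEYS = ("title", "provider", "model")


def config_problems(config: dict) -> list[str]:
    """Return human-readable problems found in a Continue config dict."""
    problems = []
    if not config.get("models"):
        problems.append("no entries under 'models'")
    entries = [(f"models[{i}]", m) for i, m in enumerate(config.get("models", []))]
    if "tabAutocompleteModel" in config:
        entries.append(("tabAutocompleteModel", config["tabAutocompleteModel"]))
    for label, entry in entries:
        for key in REQUIRED_KEYS:
            if not entry.get(key):
                problems.append(f"{label} is missing '{key}'")
    return problems


# json.loads itself rejects syntax errors such as trailing commas.
config = json.loads("""
{
  "models": [
    {"title": "Qwen Coder 7B", "provider": "ollama", "model": "qwen2.5-coder:7b"},
    {"title": "Llama 3.3 70B", "provider": "ollama"}
  ],
  "tabAutocompleteModel": {"title": "Qwen Coder 7B", "provider": "ollama", "model": "qwen2.5-coder:7b"}
}
""")
print(config_problems(config))  # ["models[1] is missing 'model'"]
```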

5.3 Use it

In VS Code:

  • Cmd + I (Mac) / Ctrl + I (Win) — inline chat
  • Cmd + L / Ctrl + L — chat sidebar
  • Tab — autocomplete (uses tabAutocompleteModel)

Continue.dev is an open-source alternative to Cursor / Copilot. Besides local models, you can register Claude / GPT keys too.


6. Performance tuning

6.1 GPU first (NVIDIA / Apple Silicon)

Ollama auto-uses the GPU. Confirm:

ollama ps
# NAME                  ID    SIZE  PROCESSOR
# qwen2.5-coder:7b     ...   5GB   100% GPU

No GPU shown? Then:

  • macOS: Apple Silicon is automatic (Intel Mac is CPU only)
  • NVIDIA: check the CUDA driver (nvidia-smi)
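
The GPU share can also be read programmatically. The sketch below assumes the size and size_vram fields documented for Ollama's /api/ps endpoint; verify them against your installed version. The sample body is hypothetical and gpu_share is an illustrative name:

```python
import json


def gpu_share(entry: dict) -> float:
    """Fraction of a loaded model that sits in GPU memory (0.0 to 1.0)."""
    total = entry.get("size", 0)
    return entry.get("size_vram", 0) / total if total else 0.0


# Hypothetical /api/ps body: one model fully on the GPU, one split with the CPU.
sample = json.loads("""
{"models": [
  {"name": "qwen2.5-coder:7b", "size": 5000000000, "size_vram": 5000000000},
  {"name": "llama3.3:70b", "size": 44000000000, "size_vram": 22000000000}
]}
""")
for entry in sample["models"]:
    print(f"{entry['name']}: {gpu_share(entry):.0%} GPU")
```

A share below 100% means part of the model runs on the CPU, which is usually the cause of slow output.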

6.2 Adjust model size

Each model has quantization variants (perf ↔ quality):

ollama pull qwen2.5-coder:7b-instruct-q4_K_M      # default (4-bit, recommended)
ollama pull qwen2.5-coder:7b-instruct-q8_0         # 8-bit (more accurate, bigger)
ollama pull qwen2.5-coder:7b-instruct-fp16         # unquantized 16-bit (most accurate, largest)

4-bit is the standard: barely noticeable quality loss at a bit over half the memory of 8-bit and roughly a third of fp16.
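
A rough way to compare the variants is parameters × bits per weight ÷ 8, which gives the size of the weights alone. The sketch below is an idealized estimate, not what Ollama reports; weights_gb is an illustrative helper:

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Idealized size of the weights alone: parameters x bits / 8 bytes."""
    return params_billion * bits_per_weight / 8


for label, bits in [("4-bit", 4), ("8-bit", 8), ("fp16", 16)]:
    print(f"7B at {label}: ~{weights_gb(7, bits):.1f} GB")
```

Real files come out somewhat larger than this (the default 7B pull earlier in this guide is listed at 4.7GB), and the runtime needs extra memory for context on top.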

6.3 Adjust context length

Default context is 4K. To extend:

ollama run qwen2.5-coder:7b
>>> /set parameter num_ctx 16384

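Or bake it into a Modelfile:

FROM qwen2.5-coder:7b
PARAMETER num_ctx 16384

Then build and run it under a new name (qwen-coder-16k is just an example name):

ollama create qwen-coder-16k -f Modelfile
ollama run qwen-coder-16k

A longer context uses more memory and slows responses.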
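
The memory cost comes from the key/value cache, which grows linearly with context length. A sketch of the standard estimate for a transformer; the layer and head counts below are illustrative assumptions, so check the model card for real values, and kv_cache_bytes is an illustrative helper:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Memory for cached keys and values: 2 tensors per layer, one row per token."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value


# Illustrative architecture numbers (assumed, not read from any model file).
shape = dict(layers=28, kv_heads=4, head_dim=128)
for ctx in (4096, 16384, 65536):
    gib = kv_cache_bytes(context_tokens=ctx, **shape) / 2**30
    print(f"num_ctx={ctx:>6}: ~{gib:.2f} GiB of KV cache")
```

Quadrupling num_ctx quadruples this figure, on top of the memory the weights already take.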


7. Cost / power comparison

7.1 Cloud vs local (rough monthly)

Usage                        Cloud (Claude Pro / API)   Local (Ollama, incl. electricity)
Light (~100 prompts/mo)      $20/mo (Claude Pro)        ~$3/mo (M2 Pro electricity)
Medium (~1000 prompts/mo)    $100/mo (API)              ~$5/mo
Heavy (~10000 prompts/mo)    $500+/mo (API)             ~$15/mo

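But there's the up-front machine cost (M2 Pro 32GB ~$2,500). At these figures, break-even is about 5 months for heavy users and a little over 2 years for medium users. Light users effectively never recoup the hardware on cost alone.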
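
Break-even is the hardware price divided by the monthly saving. A short sketch using the rough figures from the table above; break_even_months is an illustrative helper and the inputs are this guide's estimates, not measured costs:

```python
def break_even_months(hardware_cost: float, cloud_monthly: float, local_monthly: float) -> float:
    """Months until cumulative cloud spend exceeds hardware plus local running cost."""
    monthly_saving = cloud_monthly - local_monthly
    if monthly_saving <= 0:
        return float("inf")  # local never pays for itself
    return hardware_cost / monthly_saving


# (cloud $/mo, local $/mo) from the table above; $2,500 is the quoted machine price.
tiers = {"light": (20, 3), "medium": (100, 5), "heavy": (500, 15)}
for tier, (cloud, local) in tiers.items():
    print(f"{tier}: {break_even_months(2500, cloud, local):.0f} months")
```

The result is very sensitive to usage: the same machine pays for itself in months at heavy use and not for over a decade at light use.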

7.2 Power (rough)

  • M2 Pro at full load: ~30W → 24h = 0.72 kWh/day → 22 kWh/mo × $0.15 = $3.3
  • NVIDIA 4090 at full load: ~450W → 24h at full load is unrealistic, but heavy real use ~$20-40/mo

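Local inference draws significant power only while a model is generating. Ollama unloads an idle model after about 5 minutes by default, so idle cost stays close to the machine's normal baseline.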
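
The power figures above follow from watts × hours × price per kWh. A sketch of that arithmetic, using the $0.15/kWh rate from the M2 Pro line; the 8 hours a day for the 4090 is an assumed usage pattern for illustration, and monthly_power_cost is an illustrative helper:

```python
def monthly_power_cost(watts: float, hours_per_day: float,
                       usd_per_kwh: float = 0.15, days: int = 30) -> float:
    """Electricity cost in USD for a device drawing `watts` for `hours_per_day`."""
    kwh = watts / 1000 * hours_per_day * days
    return kwh * usd_per_kwh


print(f"M2 Pro, 30W around the clock: ${monthly_power_cost(30, 24):.2f}/mo")
print(f"RTX 4090, 450W for 8h a day:  ${monthly_power_cost(450, 8):.2f}/mo")
```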


8. Verify

# 1. Ollama working
ollama --version
curl http://localhost:11434/api/version
 
# 2. Model installed
ollama list
 
# 3. Response test
echo "Write hello world in Rust" | ollama run qwen2.5-coder:7b
 
# 4. Open WebUI (if running)
curl http://localhost:3000
 
# 5. Continue.dev
code .   # In VS Code: Cmd+L → pick model → chat

9. Troubleshooting

ollama: command not found

  • macOS: brew prefix on PATH (/opt/homebrew/bin)
  • Windows: restart PowerShell
  • Linux: the install script puts the binary in a system directory such as /usr/local/bin; make sure that directory is on PATH

Responses are very slow

  • Run ollama ps to confirm GPU usage. If the PROCESSOR column shows CPU, the GPU isn't being used
  • macOS Intel: no GPU acceleration — Apple Silicon recommended
  • NVIDIA: CUDA driver + restart Ollama
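  • Not enough memory: a model needs roughly parameters × bits per weight ÷ 8 bytes (7B at 4-bit ≈ 3.5GB) plus room for context. If that exceeds available RAM / VRAM, it crawls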

"Out of memory" error

  • Model is larger than RAM / VRAM
  • Try a smaller model (:3b instead of :7b) or, if you pulled a q8_0 or fp16 variant, go back to the 4-bit default (q4_K_M)
  • Close other apps and retry

Continue.dev doesn't see Ollama models

  • Validate the JSON syntax (jq . ~/.continue/config.json)
  • Restart VS Code
  • Confirm Ollama is running (curl localhost:11434/api/version)

Want to block non-local access to Open WebUI

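  • The docker run command in section 4.1 publishes port 3000 on all network interfaces, so other machines on your network can reach it unless a firewall blocks them. To restrict it to this machine, publish with -p 127.0.0.1:3000:8080 instead
  • If you do expose it, keep WEBUI_AUTH=True (the default) so every user has to sign in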

Model download interrupted (network blip)

  • Re-run ollama pull — it resumes
  • Check ~/.ollama/models/; if a file is corrupted, delete and retry

10. Recommended starting points

If you're just beginning:

  1. Mac M2 Pro 16GB: qwen2.5-coder:7b + Continue.dev
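  2. Mac M2 Pro 32GB: qwen2.5-coder:32b + Continue.dev + Open WebUI (a tight fit at 19GB, so keep the context short)
  3. Windows + RTX 4090 (24GB VRAM): qwen2.5-coder:32b + Open WebUI. llama3.3:70b at 40GB does not fit in VRAM and spills to CPU / system RAM, so it runs but slowly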
  4. CPU-only 16GB: phi3.5:3.8b (slow but functional)

After 3 months, compare your satisfaction to the cloud LLM you'd otherwise pay for, then decide.


Changelog

  • 2026-05-16: First draft. Ollama install + model comparison + Open WebUI + Continue.dev integration + perf tuning + cost comparison + six troubleshooting cases.
