Running Local LLMs With Ollama: Developer Setup Guide
Running LLMs locally isn't just a novelty anymore — it's a practical tool for everyday development. Whether you need code completion without sending proprietary code to an API, want to prototype AI features without burning through credits, or just need a model available offline on a flight, Ollama makes the whole process dead simple.
TL;DR: Ollama lets you run models like Llama 3.1, Codestral, Qwen 2.5, and Gemma 2 locally with a single command. It exposes an OpenAI-compatible API on localhost:11434, works on macOS/Linux/Windows, and requires as little as 8GB of RAM for smaller models. This guide covers installation, model selection, API integration, and real-world dev workflows.
Why Run LLMs Locally?
Local LLMs solve three problems that cloud APIs can't: privacy, cost, and availability. When you're working with proprietary codebases, client data, or regulated industries, sending every prompt to OpenAI or Anthropic might not be an option. Running locally means your data never leaves your machine.
Cost matters too. If you're making hundreds of API calls during development — testing prompts, iterating on chains, running evals — those tokens add up fast. A local model costs exactly $0 per token after the initial download. And availability is straightforward: no rate limits, no outages, no latency spikes. Your model runs when your machine runs.
The tradeoff is capability. Local models (7B-70B parameters) won't match Claude or GPT-4o on complex reasoning tasks. But for code completion, summarization, structured output generation, and simple chat, they're more than good enough.
What Hardware Do You Actually Need?
Ollama runs models using your system's RAM (or VRAM if you have a GPU). Here's a realistic breakdown of what you need:
- 7B-8B models (Llama 3.1 8B, Qwen 2.5 7B): 8GB RAM minimum, 16GB recommended. These run comfortably on any modern MacBook.
- 13B-14B models (Qwen 2.5 14B): 16GB RAM minimum. Expect ~10-20 tokens/sec on an M1/M2 Mac.
- 22B-70B models (Codestral 22B, Llama 3.1 70B): 32-64GB RAM recommended. These need serious hardware but deliver noticeably better results.
Apple Silicon Macs are genuinely excellent for local inference. The unified memory architecture means your 32GB M2 Max can run a 34B model at reasonable speeds without a discrete GPU. On Linux, an NVIDIA GPU with 12GB+ VRAM will give you faster inference via CUDA.
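The RAM figures above can be sanity-checked with a back-of-the-envelope calculation. This is only a sketch: the 4.5 bits per weight (a typical 4-bit quantization) and the 1.2x overhead factor for context and runtime buffers are assumed rule-of-thumb values, not numbers published by Ollama.

```python
def estimate_model_memory_gb(params_billion: float,
                             bits_per_weight: float = 4.5,
                             overhead: float = 1.2) -> float:
    """Rough memory footprint in GB for a quantized model.

    bits_per_weight and overhead are assumed rule-of-thumb defaults.
    """
    # (params_billion * 1e9 weights) * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb * overhead


if __name__ == "__main__":
    for size in (8, 14, 22, 70):
        print(f"{size}B: ~{estimate_model_memory_gb(size):.1f} GB")
```

Leave headroom beyond the estimate for the operating system and your editor; a model that only just fits tends to push the machine into swap.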
Installing Ollama
Ollama is available on macOS, Linux, and Windows. Installation takes under a minute.
macOS (Homebrew):
brew install ollama
macOS (direct download):
curl -fsSL https://ollama.com/install.sh | sh
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download the installer from ollama.com and run it. WSL2 is also supported if you prefer that route.
After installation, start the Ollama server manually if it isn't already running:
ollama serve
On macOS, the desktop app starts the server automatically. On Linux, it runs as a systemd service. Verify it's running:
curl http://localhost:11434
# Should return: Ollama is running
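The same check is handy inside scripts, for example before kicking off a batch of prompts. A minimal sketch using only the Python standard library; the function name is hypothetical, and it treats any HTTP 200 from the root URL as "running".

```python
import urllib.request


def is_ollama_running(base_url: str = "http://localhost:11434",
                      timeout: float = 2.0) -> bool:
    """Return True if the server at base_url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and urllib.error.URLError
        # (URLError is a subclass of OSError).
        return False


if __name__ == "__main__":
    print("Ollama is running" if is_ollama_running() else "Ollama is not reachable")
```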
Pulling and Running Your First Model
Downloading a model is one command:
ollama pull llama3.1
This grabs the 8B variant by default (~4.7GB download). To chat with it immediately:
ollama run llama3.1
You're now running a local LLM. Type a prompt, get a response. No API keys, no accounts, no billing.
Which Model Should You Use?
Model selection depends on your use case. Here are my go-to picks as of mid-2026:
For general coding tasks:
ollama pull qwen2.5-coder:7b
Qwen 2.5 Coder is the best code-focused model in the 7B class. It handles TypeScript, Python, Go, and Rust well, and is fast enough for real-time use.
For code completion and inline suggestions:
ollama pull codestral:22b
Mistral's Codestral at 22B parameters hits a sweet spot — significantly better than 7B models, but still runnable on 16GB machines with quantization.
For general-purpose chat and reasoning:
ollama pull llama3.1:70b
If you have the RAM for it, the 70B Llama 3.1 is remarkably capable. It's the closest you'll get to cloud-tier quality locally.
List your downloaded models anytime:
ollama list
Using the Ollama API
This is where Ollama shines for developers. It exposes a REST API on localhost:11434: native endpoints under /api, plus OpenAI-compatible endpoints under /v1 that follow the chat completions format. This means most tools and libraries that work with OpenAI also work with Ollama once you change the base URL.
Basic API Call
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Write a TypeScript function that debounces any async function",
  "stream": false
}'
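The request above sets "stream" to false to get a single JSON object back. With streaming left on, the native endpoint replies with newline-delimited JSON, one object per line, each carrying a "response" fragment until an object with "done": true arrives. A small sketch of how those lines could be stitched together; the helper name is hypothetical.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> str:
    """Join the "response" fragments from a streamed /api/generate reply."""
    parts = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # final object; nothing after it belongs to this reply
    return "".join(parts)


if __name__ == "__main__":
    # Sample lines shaped like a streamed reply (illustrative data only).
    sample = [
        b'{"response": "Hel", "done": false}\n',
        b'{"response": "lo", "done": false}\n',
        b'{"response": "", "done": true}\n',
    ]
    print(collect_stream(sample))
```

An HTTP response object from urllib.request.urlopen can be passed in directly, since iterating over it yields one line of bytes at a time.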
Chat Completions (OpenAI-Compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder:7b",
    "messages": [
      {"role": "system", "content": "You are a senior TypeScript developer."},
      {"role": "user", "content": "Write a retry wrapper with exponential backoff."}
    ]
  }'
Using with Node.js
const response = await fetch("http://localhost:11434/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "qwen2.5-coder:7b",
    messages: [
      { role: "user", content: "Explain this error: Cannot read properties of undefined" }
    ],
    temperature: 0.2,
  }),
});
// Surface HTTP errors (for example, a model that has not been pulled yet).
if (!response.ok) {
  throw new Error(`Ollama request failed: ${response.status} ${await response.text()}`);
}
const data = await response.json();
console.log(data.choices[0].message.content);
Using with Python
import requests

response = requests.post(
    "http://localhost:11434/v1/chat/completions",
    json={
        "model": "llama3.1",
        "messages": [
            {"role": "user", "content": "Generate a SQL query to find duplicate emails in a users table"}
        ],
        "temperature": 0.1,
    },
    timeout=120,  # local generation can be slow; don't hang forever
)
response.raise_for_status()  # fail early on HTTP errors
print(response.json()["choices"][0]["message"]["content"])
Because the API is OpenAI-compatible, you can also use the official OpenAI SDK by just pointing it at localhost:
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama", // required by the SDK but not used
});
const completion = await client.chat.completions.create({
  model: "qwen2.5-coder:7b",
  messages: [{ role: "user", content: "Refactor this function to use async/await" }],
});
console.log(completion.choices[0].message.content);
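To make the "just change the base URL" point concrete without any SDK, here is a sketch that builds the same chat completions request with Python's standard library. The function names are hypothetical; the only thing that differs between a local and a hosted OpenAI-compatible endpoint is the base_url and api_key passed in.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, messages: list,
                       api_key: str = "ollama",
                       temperature: float = 0.2) -> urllib.request.Request:
    """Build a POST to {base_url}/chat/completions in the OpenAI chat format."""
    body = json.dumps({
        "model": model,
        "messages": messages,
        "temperature": temperature,
    }).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # A placeholder key is sent for local use, mirroring the SDK example.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def chat(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the first choice's message text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]


if __name__ == "__main__":
    req = build_chat_request(
        "http://localhost:11434/v1",
        "qwen2.5-coder:7b",
        [{"role": "user", "content": "Refactor this function to use async/await"}],
    )
    print(chat(req))
```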
Integrating Ollama Into Your Dev Workflow
VS Code Integration
Several VS Code extensions support Ollama as a backend. The Continue extension is the most mature — it provides tab completion, inline chat, and context-aware code generation all powered by your local model. Install it, set the provider to Ollama, pick your model, and you've got a local Copilot alternative.
Running Ollama in Docker
If you want to keep your dev environment clean — or you're running Ollama on a shared dev server — Docker is the way to go. If you're already using Docker Compose for local development, Ollama slots right in:
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
volumes:
  ollama_data:
Drop the deploy.resources block if you don't have an NVIDIA GPU. Start it with docker compose up -d and the API is available at the same localhost:11434 endpoint.
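A container can take a moment to start accepting requests after docker compose up -d returns, so scripts that run straight afterwards may want to wait for the API. A minimal retry-loop sketch; the function name and defaults are hypothetical, and the probe is passed in so any readiness check can be used.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool],
                     attempts: int = 30,
                     delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no point sleeping after the final failed attempt
    return False


if __name__ == "__main__":
    import urllib.request

    def api_is_up() -> bool:
        try:
            with urllib.request.urlopen("http://localhost:11434", timeout=2) as resp:
                return resp.status == 200
        except OSError:
            return False

    print("ready" if wait_until_ready(api_is_up, attempts=10) else "gave up")
```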
Using Local Models with AI Coding Tools
If you're already using AI code assistants, many of them support local model backends. This is particularly useful when you want AI assistance but can't send code to external APIs — common in enterprise environments or when working on client projects through agencies like Adaptels.
Creating Custom Models with Modelfiles
Ollama's Modelfile lets you create custom model configurations — think of it as a Dockerfile for LLMs. This is useful for locking in system prompts, adjusting parameters, or building task-specific models.
FROM qwen2.5-coder:7b
PARAMETER temperature 0.1
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
SYSTEM """
You are a senior full-stack developer. When asked to write code:
- Use TypeScript by default
- Include error handling
- Add brief comments for non-obvious logic
- Prefer functional patterns over classes
"""
Build and use it:
ollama create my-coder -f Modelfile
ollama run my-coder
Now every prompt starts with your preferred context baked in. I keep a few of these for different tasks — one for code review, one for writing tests, one for documentation.
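If you keep several task-specific models, generating the Modelfiles from one template keeps the parameters consistent. A sketch under stated assumptions: the preset names and prompts below are made-up examples, and the helper only emits the FROM, PARAMETER, and SYSTEM instructions shown above.

```python
def render_modelfile(base_model: str, parameters: dict, system_prompt: str) -> str:
    """Render Modelfile text with FROM, PARAMETER, and SYSTEM instructions."""
    if '"""' in system_prompt:
        raise ValueError('system prompt must not contain """')
    lines = [f"FROM {base_model}"]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    lines.append(f'SYSTEM """\n{system_prompt.strip()}\n"""')
    return "\n".join(lines) + "\n"


# Hypothetical task presets; the names and prompts are examples only.
PRESETS = {
    "my-reviewer": "You are a careful code reviewer. Point out bugs before style issues.",
    "my-tester": "You write focused unit tests. Cover edge cases and error paths.",
}

if __name__ == "__main__":
    for name, prompt in PRESETS.items():
        text = render_modelfile("qwen2.5-coder:7b",
                                {"temperature": 0.1, "num_ctx": 8192}, prompt)
        print(f"# Modelfile for {name}\n{text}")
```

Each rendered file can then be saved and built with ollama create, passing the file via -f as in the commands above.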
Common Pitfalls and Fixes
Model runs slowly or system becomes unresponsive: You're likely running a model that's too large for your RAM. The system starts swapping to disk and everything grinds to a halt. Drop down to a smaller model or a smaller quantization. Quantized variants are published as tags (for example, a q4_0 tag); the exact tag names are listed on each model's page at ollama.com.
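Context window too small: Ollama's default context window is small (2048 tokens in older releases, 4096 in more recent ones, so check your version), and prompts that exceed it are truncated. For code tasks, you almost always want more. With the native API, set it per request: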
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "...",
  "options": { "num_ctx": 8192 }
}'
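Or bake it into a Modelfile with PARAMETER num_ctx 8192. The Modelfile route also covers requests sent through the OpenAI-compatible /v1 endpoints, which have no per-request context option.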
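Picking a num_ctx value by hand gets tedious when prompt sizes vary. One possible approach is to size the window from the prompt itself. This is a rough sketch: the 4-characters-per-token ratio is an assumed average rather than a real tokenizer count, and the function names and limits are hypothetical.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token count; chars_per_token is an assumed average."""
    return math.ceil(len(text) / chars_per_token)


def pick_num_ctx(prompt: str, reserve_for_output: int = 1024,
                 minimum: int = 2048, maximum: int = 32768) -> int:
    """Double from minimum until prompt plus output fits, capped at maximum."""
    needed = estimate_tokens(prompt) + reserve_for_output
    size = minimum
    while size < needed and size < maximum:
        size *= 2
    return min(size, maximum)


if __name__ == "__main__":
    prompt = "Explain this file:\n" + "x = 1\n" * 3000
    # The result would go into the request body as {"options": {"num_ctx": ...}}.
    print({"num_ctx": pick_num_ctx(prompt)})
```

A larger context window uses more memory, so growing it only when the prompt needs it keeps smaller requests fast.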
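Port conflicts: If something else is on port 11434, set a custom port. Binding to 127.0.0.1 keeps the API local, whereas 0.0.0.0 exposes it to every machine on your network. Set the same OLLAMA_HOST value when running ollama CLI commands so they reach the new port:
OLLAMA_HOST=127.0.0.1:11435 ollama serve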
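Once the server listens on a non-default port, your own scripts need the matching address. A small sketch of deriving a client base URL from an OLLAMA_HOST-style value; the helper is hypothetical and deliberately simple, handling only the host:port and :port forms (no scheme parsing beyond http) and falling back to port 11434.

```python
import os
from urllib.parse import urlsplit


def base_url_from_host(host: str = "", default_port: int = 11434) -> str:
    """Turn an OLLAMA_HOST-style value such as "127.0.0.1:11435" into a base URL."""
    host = host.strip() or "127.0.0.1"
    if "://" not in host:
        host = "http://" + host
    parts = urlsplit(host)
    hostname = parts.hostname or "127.0.0.1"
    if hostname == "0.0.0.0":
        hostname = "127.0.0.1"  # a bind-all address is not a destination
    if ":" in hostname:
        hostname = f"[{hostname}]"  # IPv6 literals need brackets in URLs
    port = parts.port or default_port
    return f"{parts.scheme}://{hostname}:{port}"


if __name__ == "__main__":
    print(base_url_from_host(os.environ.get("OLLAMA_HOST", "")))
```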
Models eating disk space: Models are stored in ~/.ollama/models by default (under /usr/share/ollama/.ollama/models when Ollama runs as the Linux systemd service). A 70B model can take 40GB+. Clean up models you're not using:
ollama rm llama3.1:70b
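To decide what to remove, it helps to rank installed models by size. The native GET /api/tags endpoint lists local models with a "name" and a "size" in bytes; the sketch below sorts such a payload, using a hypothetical helper name and made-up sample data.

```python
def largest_models(tags_payload: dict, limit: int = 5) -> list:
    """Return (name, size_in_GB) pairs from an /api/tags payload, largest first."""
    models = tags_payload.get("models", [])
    ranked = sorted(models, key=lambda m: m.get("size", 0), reverse=True)
    return [(m["name"], round(m.get("size", 0) / 1e9, 1)) for m in ranked[:limit]]


if __name__ == "__main__":
    # Illustrative payload; real data comes from GET http://localhost:11434/api/tags
    sample = {
        "models": [
            {"name": "llama3.1:latest", "size": 4_700_000_000},
            {"name": "llama3.1:70b", "size": 40_000_000_000},
        ]
    }
    for name, gb in largest_models(sample):
        print(f"{gb:>6.1f} GB  {name}")
```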
When to Use Local vs Cloud LLMs
Local models aren't a replacement for cloud APIs — they're complementary. Here's how I split the work:
Use local models for: code completion, boilerplate generation, commit message drafting, simple refactoring suggestions, structured output parsing, and any task involving sensitive code.
Use cloud models for: complex multi-step reasoning, large codebase analysis, nuanced code review, and tasks where quality matters more than speed or cost. Tools like Claude Code or Cursor are still significantly better for complex dev tasks.
The practical approach is to start with a local model, and only reach for a cloud API when the local model's output isn't good enough. You'll be surprised how often the local model handles it just fine.
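The local-first approach can be written down as a small routing function. This is a sketch, not a prescribed design: ask_local, ask_cloud, and good_enough are hypothetical callables you would supply (for instance, a length or format check on the reply).

```python
from typing import Callable, Tuple


def answer_local_first(prompt: str,
                       ask_local: Callable[[str], str],
                       ask_cloud: Callable[[str], str],
                       good_enough: Callable[[str], bool]) -> Tuple[str, str]:
    """Try the local model; use the cloud model only if the reply fails the check."""
    try:
        reply = ask_local(prompt)
        if good_enough(reply):
            return ("local", reply)
    except OSError:
        pass  # e.g. the local server is not running
    return ("cloud", ask_cloud(prompt))


if __name__ == "__main__":
    # Stand-in callables so the example runs without any model.
    source, text = answer_local_first(
        "Draft a commit message for a typo fix",
        ask_local=lambda p: "Fix typo in README",
        ask_cloud=lambda p: "(cloud reply)",
        good_enough=lambda reply: len(reply.strip()) > 0,
    )
    print(source, "->", text)
```

For prompts that contain sensitive code, skip the fallback entirely so nothing leaves the machine.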
Wrapping Up
Ollama has made local LLM inference genuinely practical for developers. The setup takes five minutes, the API is OpenAI-compatible so existing tooling just works, and Apple Silicon performance is good enough for real-time use. Start with qwen2.5-coder:7b if you have 16GB RAM, or llama3.1 for general tasks, and build from there.
The local LLM ecosystem is moving fast — new models drop monthly and each generation brings meaningful quality improvements. Once you've got Ollama running, staying current is just ollama pull model-name away.