Local LLM Setup Guide

Run AI models locally for privacy and cost savings using Ollama.

Why Use Local LLMs?

Advantages

  • Privacy: Code never leaves your machine
  • Cost: Free after initial download
  • No API Keys: No account setup needed
  • Offline: Works without internet
  • Unlimited: No rate limits

Trade-offs

  • Performance: Slower than cloud APIs
  • Quality: May not match GPT-4/Claude level
  • Hardware: Requires sufficient RAM

Prerequisites

Hardware Requirements

Model Size     RAM Required   Recommended
3B params      6 GB           Older MacBooks, 8 GB RAM systems
7B params      10 GB          Modern laptops, 16 GB RAM recommended
14B+ params    20+ GB         Desktop computers, M1/M2/M3 Macs
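
If you're not sure how much RAM a machine has, these commands report it (macOS and Linux shown):

bash
# macOS: total physical memory in bytes
sysctl -n hw.memsize

# Linux: human-readable memory summary
free -h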

Supported Operating Systems

  • macOS: 11+ (Big Sur or later)
  • Linux: All modern distributions
  • Windows: WSL2 or native (experimental)

Installing Ollama

macOS

bash
# Download and install
curl -fsSL https://ollama.ai/install.sh | sh

Or download from ollama.ai
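
After installing, a quick check that the CLI is available (this assumes the installer put ollama on your PATH, which is the default):

bash
# Print the installed Ollama version
ollama --version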

Linux

bash
curl -fsSL https://ollama.ai/install.sh | sh

Windows

  1. Download Ollama for Windows from ollama.ai
  2. Run the installer
  3. Start Ollama from Start Menu

Recommended Models

Qwen2.5-Coder

Best for: Coding, development, technical tasks

bash
# Download 7B version (10GB RAM)
ollama pull qwen2.5-coder:7b

# Download 3B version (6GB RAM)
ollama pull qwen2.5-coder:3b
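
Once a model is pulled, you can sanity-check it straight from the terminal before wiring it into IfAI (the prompt here is just an example):

bash
# One-off prompt; the model is loaded on first use, so the first reply is slower
ollama run qwen2.5-coder:7b "Write a Python function that checks whether a number is prime."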

Llama 3.2

Best for: General purpose, chat

bash
# 3B version
ollama pull llama3.2:3b

# Latest Llama
ollama pull llama3.2

DeepSeek-Coder

Best for: Code generation, debugging

bash
ollama pull deepseek-coder:6.7b

Configuring IfAI

Step 1: Start Ollama

Make sure Ollama is running:

bash
# Check Ollama status
ollama list

# Should show available models
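
On most systems the installer sets Ollama up as a background service. If ollama list fails with a connection error, you can start the server manually and confirm it answers on the default port (11434):

bash
# Start the Ollama server in the foreground (keep this terminal open)
ollama serve

# In another terminal, list installed models via the HTTP API
curl http://localhost:11434/api/tags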

Step 2: Configure in IfAI

  1. Open IfAI Settings (Cmd+,)
  2. Go to AI Provider > Ollama
  3. Select model from dropdown:
    • qwen2.5-coder:7b
    • qwen2.5-coder:3b
    • llama3.2
    • Or any model you've downloaded

Step 3: Test Connection

  1. Open AI Chat (Cmd+K)
  2. Send a message:
    Hello! Can you hear me?
  3. You should receive a response (if nothing comes back, see the check below)
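
If the chat stays silent, you can confirm the Ollama server itself responds outside IfAI. This is a minimal check against the default local endpoint (swap in any model you have pulled):

bash
# Send one chat message to the local Ollama API (non-streaming)
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{ "role": "user", "content": "Hello! Can you hear me?" }],
  "stream": false
}'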

Performance Optimization

Model Selection

For coding tasks (best first):

qwen2.5-coder:7b > deepseek-coder:6.7b > llama3.2

For general chat (best first):

llama3.2 > qwen2.5-coder:7b

For low-memory systems (best first):

qwen2.5-coder:3b > phi3

Reducing Memory Usage

  1. Use smaller models: 3B instead of 7B
  2. Reduce context: Limit to 2048 tokens
  3. Adjust num_ctx: In settings, reduce the context window (see the sketch after this list)
  4. Quantization: Use Q4 quantized models
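
One way to make the smaller context window the default, rather than setting it per request, is to derive a variant with a Modelfile. The name qwen2.5-coder-2k below is just an example:

bash
# Write a Modelfile that caps the context window at 2048 tokens
cat > Modelfile <<'EOF'
FROM qwen2.5-coder:7b
PARAMETER num_ctx 2048
EOF

# Build the derived model, then select it in IfAI like any other
ollama create qwen2.5-coder-2k -f Modelfile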

Improving Speed

  1. Use GPU acceleration: Ollama auto-detects a supported GPU (see the check after this list)
  2. Reduce context size: Fewer tokens = faster
  3. Close other apps: Free up RAM
  4. Use smaller model: 3B vs 7B
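
To confirm the GPU is actually being used, ollama ps reports the CPU/GPU split for any model currently loaded (send at least one prompt first so a model is in memory):

bash
# The PROCESSOR column shows how the model is split between CPU and GPU
ollama ps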

Advanced Configuration

Custom Model Parameters

In IfAI Settings > Ollama > Advanced:

json
{
  "num_ctx": 2048,
  "temperature": 0.7,
  "top_p": 0.9,
  "num_predict": 512
}
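
If you want to check what these parameters do independently of IfAI, the same fields can be passed straight to Ollama through the options object of its generate API (the model name here is just an example):

bash
# Same parameters, sent directly to the local Ollama API
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:7b",
  "prompt": "Write a Python function that reverses a string.",
  "stream": false,
  "options": { "num_ctx": 2048, "temperature": 0.7, "top_p": 0.9, "num_predict": 512 }
}'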

Multiple Models

Switch between models in Settings:

  • Use qwen2.5-coder for coding
  • Use llama3.2 for explanations and general chat
  • Use deepseek-coder for debugging
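
If you plan to switch regularly, pull all three ahead of time; they only take disk space until a model is loaded:

bash
ollama pull qwen2.5-coder:7b
ollama pull llama3.2
ollama pull deepseek-coder:6.7b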

Troubleshooting

"Ollama not detected"

Solutions:

  1. Verify Ollama is running: ollama list
  2. Restart IfAI
  3. Check Ollama is in PATH
  4. Set the Ollama URL manually in settings: http://localhost:11434 (see the check below)
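
Before pointing IfAI at a manual URL, confirm the server is reachable there (default port shown):

bash
# A healthy server replies with: Ollama is running
curl http://localhost:11434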

"Out of memory"

Solutions:

  1. Use smaller model (3B instead of 7B)
  2. Close other applications
  3. Reduce context window
  4. Restart Ollama

Very slow responses

Solutions:

  1. Check if GPU is being used
  2. Use smaller model
  3. Reduce context size
  4. Try different quantization

Model download fails

Solutions:

  1. Check internet connection
  2. Try again later (server may be busy)
  3. Retry the pull directly from the terminal:
    bash
    ollama pull qwen2.5-coder:7b

Hybrid Mode

Configure IfAI to use both local and cloud:

Smart routing:

  • Local: Simple tasks, quick questions
  • Cloud: Complex tasks, large codebases

Setup:

  1. Settings > AI Provider > Hybrid
  2. Set local as primary
  3. Configure cloud as fallback

Next Steps
