Using Ollama With Roo Code

Roo Code supports running models locally using Ollama. This provides privacy, offline access, and potentially lower costs, but requires more setup and a powerful computer.

Website: https://ollama.com/


Setting up Ollama

  1. Download and Install Ollama: Download the Ollama installer for your operating system from the Ollama website and follow the installation instructions. Make sure the Ollama server is running:

    ollama serve
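
    If you want to confirm the server is reachable before continuing, you can query Ollama's local API (it listens on port 11434 by default); the exact response shape may vary by Ollama version:

    # Should return a JSON list of locally installed models (empty on a fresh install)
    curl http://localhost:11434/api/tags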
  2. Download a Model: Ollama supports many different models. You can find a list of available models on the Ollama website. Some recommended models for coding tasks include:

    • codellama:7b-code (good starting point, smaller)
    • codellama:13b-code (better quality, larger)
    • codellama:34b-code (even better quality, very large)
    • qwen2.5-coder:32b
    • mistral:7b-instruct (good general-purpose model)
    • deepseek-coder:6.7b-base (good for coding tasks)
    • llama3:8b-instruct-q5_1 (good for general tasks)

    To download a model, open your terminal and run:

    ollama pull <model_name>

    For example:

    ollama pull qwen2.5-coder:32b
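
    After the pull completes, you can confirm the model is available locally (the exact columns in the output may differ by Ollama version):

    ollama list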
  3. Configure the Model: Set your model's context window in Ollama and save the result as a new model.

    Default Context Behavior

    Roo Code defers to the Modelfile's num_ctx setting. When you use a model with Ollama, Roo Code reads the model's configured context window and uses it automatically. You don't need to configure context size in Roo Code settings - it respects what's defined in your Ollama model.

    Option A: Interactive Configuration

    Load the model (we will use qwen2.5-coder:32b as an example):

    ollama run qwen2.5-coder:32b

    Change the context size parameter:

    /set parameter num_ctx 32768

    Save the model with a new name:

    /save your_model_name

    Option B: Using a Modelfile (Recommended)

    Create a Modelfile with your desired configuration:

    # Example Modelfile for reduced context
    FROM qwen2.5-coder:32b

    # Set context window to 32K tokens (reduced from default)
    PARAMETER num_ctx 32768

    # Optional: Adjust temperature for more consistent output
    PARAMETER temperature 0.7

    # Optional: Set repeat penalty
    PARAMETER repeat_penalty 1.1

    Then create your custom model:

    ollama create qwen-32k -f Modelfile
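
    You can verify that the custom model exists and that the num_ctx parameter was saved as expected; the qwen-32k name follows the example above:

    # List local models; qwen-32k should appear alongside the base model
    ollama list

    # Print the saved Modelfile, including the PARAMETER num_ctx line Roo Code will pick up
    ollama show qwen-32k --modelfile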

    Override Context Window

    If you need to override the model's default context window:

    • Permanently: Save a new model version with your desired num_ctx using either method above
    • Roo Code behavior: Roo automatically uses whatever num_ctx is configured in your Ollama model
    • Memory considerations: Reducing num_ctx helps prevent out-of-memory errors on limited hardware
  4. Configure Roo Code:

    • Open the Roo Code sidebar.
    • Click the settings gear icon.
    • Select "ollama" as the API Provider.
    • Enter the model tag or saved name from the previous step (e.g., your_model_name).
    • (Optional) Configure the base URL if you're running Ollama on a different machine. The default is http://localhost:11434.
    • (Optional) Enter an API Key if your Ollama server requires authentication.
    • (Advanced) Roo uses Ollama's native API by default for the "ollama" provider. An OpenAI-compatible /v1 handler also exists but isn't required for typical setups.
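
    If Roo Code can't reach a remote Ollama server, it helps to verify connectivity from the machine running Roo Code first. The address below is a placeholder for your server, and note that Ollama binds only to localhost unless you start it with OLLAMA_HOST pointing at a reachable interface (adjust to your environment):

    # From the machine running Roo Code; replace 192.168.1.50 with your server's address
    curl http://192.168.1.50:11434/api/tags

    # On the Ollama server, to listen on all interfaces instead of only localhost
    OLLAMA_HOST=0.0.0.0:11434 ollama serve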

Tips and Notes

  • Resource Requirements: Running large language models locally can be resource-intensive. Make sure your computer meets the minimum requirements for the model you choose.
  • Model Selection: Experiment with different models to find the one that best suits your needs.
  • Offline Use: Once you've downloaded a model, you can use Roo Code offline with that model.
  • Token Tracking: Roo Code tracks token usage for models run via Ollama, helping you monitor consumption.
  • Ollama Documentation: Refer to the Ollama documentation for more information on installing, configuring, and using Ollama.

Troubleshooting

Out of Memory (OOM) on First Request

Symptoms

  • First request from Roo fails with an out-of-memory error
  • GPU/CPU memory usage spikes when the model first loads
  • Works after you manually start the model in Ollama

Cause

If no model instance is running, Ollama spins one up on demand. During that cold start it may allocate a larger context window than expected. The larger context window increases memory usage and can exceed available VRAM or RAM. This is an Ollama startup behavior, not a Roo Code bug.

Fixes

  1. Preload the model

    ollama run <model-name>

    Keep it running, then issue the request from Roo (an API-based way to preload is sketched after this list).

  2. Pin the context window (num_ctx)

    • Option A (interactive session, then save):
      # inside `ollama run <base-model>`
      /set parameter num_ctx 32768
      /save <your_model_name>
    • Option B (Modelfile, recommended for reproducibility):
      FROM <base-model>
      PARAMETER num_ctx 32768
      # Adjust based on your available memory:
      # 16384 for ~8GB VRAM
      # 32768 for ~16GB VRAM
      # 65536 for ~24GB+ VRAM
      Then create the model:
      ollama create <your_model_name> -f Modelfile
  3. Ensure the model's context window is pinned

    Save your Ollama model with an appropriate num_ctx (via /set + /save, or preferably a Modelfile). Roo Code automatically detects and uses the model's configured num_ctx - there is no manual context size setting in Roo Code for the Ollama provider.

  4. Use smaller variants

    If GPU memory is limited, use a smaller quant (e.g., q4 instead of q5) or a smaller parameter size (e.g., 7B/13B instead of 32B).

  5. Restart after an OOM

    ollama ps
    ollama stop <model-name>
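
As an alternative to the interactive preload in fix 1, you can load a model through Ollama's HTTP API: a generate request with no prompt loads the model into memory, and the keep_alive field controls how long it stays resident (-1 keeps it loaded until the server stops). This is a sketch; the model name follows the earlier qwen-32k example, and exact behavior may vary by Ollama version:

    # Load the model without generating anything and keep it resident
    curl http://localhost:11434/api/generate -d '{
      "model": "qwen-32k",
      "keep_alive": -1
    }'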

Quick checklist

  • Model is running before Roo request
  • num_ctx pinned (Modelfile or /set + /save)
  • Model saved with appropriate num_ctx (Roo uses this automatically)
  • Model fits available VRAM/RAM
  • No leftover Ollama processes