ollama | James Cherti

Running Large Language Models on your machine can enhance your projects, but the setup is often complex. Ollama simplifies this by packaging everything needed to run a Large Language Model. Here’s a concise guide on using Ollama to run LLMs locally.

Requirements

CPU: A modern x86-64 CPU with AVX2 support is recommended. AVX512 can further improve inference performance for some workloads. (If your CPU does not support AVX, see Ollama Issue #2187: Support GPU runners on CPUs without AVX.)
RAM: A minimum of 16GB is recommended for a decent experience when running models with 7 billion parameters.
Disk Space: A practical minimum of 40GB of disk space is advisable.
GPU: While a GPU is not mandatory, it is recommended for enhanced performance in model inference. Refer to the list of GPUs that are compatible with Ollama. Approximate VRAM requirements for quantized models are roughly 4-6 GB for 7B models, 8-12 GB for 13B models, and substantially more for larger models such as 30B or 70B variants. Actual requirements vary depending on quantization level, context window size, and how much of the model is offloaded to the GPU.
NVIDIA GPUs: Ollama relies on CUDA for NVIDIA GPU acceleration (You can find the instructions to install CUDA on Debian here). CPU-only inference is also supported, and macOS systems use Apple’s Metal backend for GPU acceleration.

Step 1: Install Ollama

Download and install Ollama for Linux using:

curl -fsSL https://ollama.com/install.sh | shCode language: plaintext (plaintext)

Step 2: Download a Large Language Model

Download a specific large language model using the Ollama command:

ollama pull gemma2:2bCode language: plaintext (plaintext)

The command above downloads the Gemma2 model by Google DeepMind. You can find other models by visiting the Ollama Library.

(At the time of writing, downloading gemma2:2b retrieves a quantized variant such as gemma2:2b-instruct-q4_0, indicating that it retrieves a quantized version of the 2-billion-parameter model specifically optimized for instruction-following tasks like chatbots. This quantization process reduces the model’s numerical precision from floating-point representations such as FP16 or BF16 into lower-bit integer formats such as 4-bit quantization q4_0. This reduces memory usage and improves inference speed, at the cost of a modest reduction in model quality.)

Step 3: Chat with the model

Run the large language model:

ollama run gemma2:2bCode language: plaintext (plaintext)

This launches an interactive REPL where you can interact with the model.

Step 4: Install Open WebUI (web interface)

Open WebUI offers a user-friendly interface for interacting with Large Language Models downloaded via Ollama. It enables users to run and customize models without requiring extensive programming knowledge.

It can be installed using pip within a Python virtual environment:

mkdir -p ~/.python-venv/open-webui
python -m venv ~/.python-venv/open-webui
source ~/.python-venv/open-webui/bin/activate
pip install open-webuiCode language: plaintext (plaintext)

NOTE: Alternatively, Open WebUI can also be installed using Docker.

Finally, execute the following command to start the Open WebUI server:

~/.python-venv/open-webui/bin/open-webui serveCode language: plaintext (plaintext)

You will also have to execute Ollama as a server simultaneously with Open WebUI:

ollama serve

NOTE: On Linux installations, Ollama is commonly started manually using ollama serve. On macOS and Windows desktop installations, the Ollama background service is typically started automatically.

After initiating both the Open WebUI and Ollama server processes, the web application binds to localhost. Open your web browser and navigate to http://localhost:8080 to access the interface. During your initial visit, the system will prompt you to register an administrator account. Once authenticated, you can select your downloaded models, such as gemma2:2b, from the workspace dropdown menu and initiate a chat session directly within your browser.

When starting a new conversation in Open WebUI, a “Model not selected” message may appear, preventing input. This behavior is expected because Open WebUI acts as a central hub for potentially dozens of different Large Language Models and requires an explicit model selection before starting a conversation. To resolve this, locate the dropdown menu at the top center of the chat interface, often labeled “Select a model.”

Frequently Asked Questions

Troubleshooting: Open WebUI: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN

While running Open WebUI or pulling certain models, you might see the following warning in your terminal:

Warning: You are sending unauthenticated requests to the HF Hub.
Please set a HF_TOKEN to enable higher rate limits and faster downloads.Code language: plaintext (plaintext)

This warning appears because the system is attempting to interact with the Hugging Face API anonymously, which falls under a lower rate limit tier. To fix this, you need to provide a Hugging Face access token via an environment variable.

Here is how to generate and apply the token:

Go to https://huggingface.co and log into your account.
Navigate to your account settings and click on “Access Tokens”.
Click “Create new token”, give it a descriptive name, select the “Read” role, and generate it.
Copy the newly created token string.
Export this token as an environment variable in your terminal session before starting your server.

You can apply the token directly in the terminal before running the web interface:

export HF_TOKEN='your_huggingface_token_here'
~/.python-venv/open-webui/bin/open-webui serveCode language: Bash (bash)

To make this change permanent, append the export HF_TOKEN='your_huggingface_token_here' line to your shell configuration file, such as ~/.bashrc, and reload your shell.

Troubleshooting: Open WebUI High GPU and/or CPU Usage and Follow-Up Auto-Generation

If you notice that your GPU and/or CPU is constantly pinned at 100%, even after the model has finished answering your prompt, it may be caused by Open WebUI running background tasks.

By default, Open WebUI often tries to anticipate your next move by automatically generating follow-up questions or summarizing your chat for the sidebar title. While helpful, these auto-generation features can place a significant, continuous strain on your system resources, especially if you are running models strictly on your CPU.

To alleviate this, you can:

Disable Background Tasks: Go to Settings -> General and toggle off Title Auto-Generation, Follow-Up Auto-Generation, Chat Tags Auto-Generation.
Force Single-Tasking: Prevent Ollama from splitting its focus by setting export OLLAMA_NUM_PARALLEL=1 as an environment variable. This ensures requests queue up rather than processing simultaneously and slowing down.
Limit Memory Usage: If you have low VRAM, set export OLLAMA_MAX_LOADED_MODELS=1. This prevents Ollama from trying to hold multiple models in your GPU memory at the same time.
Check for Ghost Requests: Run export OLLAMA_DEBUG=1 before starting ollama serve. Watch the terminal logs to see if Open WebUI is sending unexpected requests (like embeddings or pings) that are keeping the LLM active.
Manual Model Unloading: If GPU memory remains allocated after a chat completes, set export OLLAMA_KEEP_ALIVE=0 to force the model to unload immediately after each response. Note that this increases latency because the model must reload for every request.

The model cuts off mid-sentence or mid-thought

Increasing the token limit allows the model the logical breathing room required to complete its internal reasoning process and deliver a final, coherent answer.

When working with reasoning models, the model essentially writes two responses: a hidden thought chain and the visible answer. If the token limit is too low, the model consumes its entire budget on the thinking phase, leaving no room for the actual solution.

Option 1: Increase the Max Tokens (Permanent Fix)

In Open WebUI, click the Control Sliders icon at the top right of your chat window.
Click on Settings.
Go to Advanced Parameters.
Look for max_tokens.
It is likely set to a default like 1024 or 2048. Increase this number significantly, such as to 4096 or 8192.
Click Save and try your prompt again.

Option 2: The Continue Response Button (Quick Fix)
If a model ever cuts off mid-sentence or mid-thought in Open WebUI, you don’t necessarily have to start over. You can usually just click the Continue Response button at the bottom of the chat, and the model will pick up typing exactly where it left off, finishing its thought process and moving on to your answer.

Pro-Tip for Reasoning Models: If you are asking a complex coding or configuration question, reasoning models will go down massive rabbit holes. If you don’t want to wait 2 minutes for an answer, you can add “Keep your reasoning brief” to the end of your prompt!

Conclusion

With Ollama, you can quickly run Large Language Models (LLMs) locally and integrate them into your projects. Additionally, Open WebUI provides a user-friendly interface for interacting with these models, making it easier to customize and deploy them without extensive programming knowledge.

James Cherti

Software Development and Linux Infrastructure

Tag Archives: ollama

Running Large Language Models locally with Ollama (compatible with Linux, macOS, and Windows)