ollama run llama3 downloads and runs Llama 3, automatically selecting a quantization appropriate for your hardware. It manages model downloads, provides an API server, and handles hardware detection.

Ollama maintains a registry of models (similar to Docker Hub) where popular models are available in pre-configured quantizations. Running ollama pull mistral downloads Mistral-7B at a reasonable quantization for your system. The tool detects your hardware (CPU, Apple Silicon, NVIDIA GPU) and configures inference accordingly, and it exposes an HTTP API on localhost:11434 that's compatible with many AI tools and frameworks.
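The API can be exercised with nothing beyond the standard library. A minimal sketch, assuming a model has already been pulled (the endpoint and payload shape follow Ollama's /api/generate API; the model name and prompt are just examples):

```python
import json
import urllib.request

def build_generate_request(model, prompt, stream=False):
    """Build the JSON body Ollama's /api/generate endpoint expects."""
    payload = {"model": model, "prompt": prompt, "stream": stream}
    return json.dumps(payload).encode("utf-8")

body = build_generate_request("llama3", "Why is the sky blue?")

# Actually sending the request requires a running Ollama server:
# req = urllib.request.Request(
#     "http://localhost:11434/api/generate",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
```

With stream set to true (the default on the server side), the endpoint returns a sequence of JSON objects, one token chunk per line; setting it to false returns a single response object.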
Ollama's "Modelfile" concept lets you customize models by specifying a base model, system prompt, temperature, and other parameters — like a Dockerfile for AI models. You can create custom variants: ollama create my-assistant -f Modelfile. This makes it easy to experiment with different system prompts and parameters without touching model weights.
Ollama is typically one layer in a local AI stack: Ollama for model serving, Open WebUI for a chat interface, and various tools that connect via the API (Continue for IDE integration, LangChain for application frameworks). This stack gives you a fully private, cost-free AI setup that runs entirely on your hardware. For privacy-sensitive applications and development work, it's increasingly competitive with cloud APIs.
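Much of that interoperability comes from Ollama also exposing an OpenAI-compatible endpoint at /v1/chat/completions, which is what many of these tools speak. A minimal sketch of the chat-style request body, assuming a locally pulled llama3 model (the message content is an example):

```python
import json
import urllib.request

def build_chat_request(model, messages):
    """Build an OpenAI-style chat body for Ollama's /v1/chat/completions."""
    return json.dumps({"model": model, "messages": messages}).encode("utf-8")

body = build_chat_request(
    "llama3",
    [{"role": "user", "content": "Summarize this file in one sentence."}],
)

# With the server running, tools simply point their OpenAI base URL at
# http://localhost:11434/v1 -- the request below is what they send:
# req = urllib.request.Request(
#     "http://localhost:11434/v1/chat/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
```

Because the request and response shapes match the OpenAI chat format, switching an existing tool from a cloud API to local Ollama is often just a base-URL change.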