Introduction
Lately I’ve been experimenting with different engines and different models in my home lab. I’ve used llama.cpp, ollama, and lm studio to host models at various times. My reason for experimenting with different engines is due to the heterogenous hardware.
For AI workloads I have the following machines available:
| Hostname | AI Hardware Description |
|---|---|
| cranberry | 2x Nvidia P40s with 24gb of VRAM each |
| banana | 1x Nvidia 4070TI with 16gb of VRAM |
| 1x AMD 7600m with 8gb of VRAM in the same | |
| 1x AMD 890m with 96gb of unified RAM | |
| clementine | 1x Mac Mini M4 Pro with 24gb of unified RAM |
| star | 1x Mac Studio M3 Ultra with 96gb of unified RAM |
CUDA works well with ollama and llama.cpp on the Nvidia GPUs. MLX through LM Studio performs the best on the Mac Mini and Mac Studio, but relies on LM Studio which is closed source. ROCm and Vulkan perform decently on the AMD iGPU and dGPUs.
Since I’ve been incorporating AI workloads into more of my daily life, changing engines and models involves updating multiple client configurations. I’ve reached the point where this is tedious and I want a more streamlined approach for incorporating new models and hardware without changing client configurations. When a client is configured with an older model, interactions with the engines either fail because the model no longer exists on the host, or the engine requires a model swap which can cause thrashing. LiteLLM resolves this problem by providing a unified model router.
Using LiteLLM
What is LiteLLM
LiteLLM is a high-performance middleware that allows you to call 100+ LLMs using the OpenAI format. In a heterogeneous home lab like mine, it acts as a translation layer and load balancer.
It primarily operates in two modes:
- Python SDK: Programmatically call models within your own scripts.
- Model Proxy Server: An OpenAI-compatible server that sits between your clients (like Home Assistant or OpenWeb UI) and your various backends (Ollama, vLLM, etc.).
High Availability with Docker Swarm
For maximum reliability, I deploy the LiteLLM Proxy in Docker Swarm mode. This ensures that the proxy is always available, even if a specific node goes down. By running multiple replicas of the proxy, we can perform rolling updates without interrupting the AI services that power my home. This hosting strategy follows the same principles of high availability and automation described in my CD workflow post.
Security and Routing with Traefik
To handle external access and security, I use Traefik as a reverse proxy. Traefik automatically manages Let’s Encrypt certificates, providing full TLS encryption for all AI traffic. This is crucial when automated AI agents or mobile devices need to reach the router securely from outside the local network.
Using LiteLLM as a Router
My current workloads involve OpenClaw running Agents for various activities, using OpenCode and Cline for coding projects, using Home Assistant with local voice assistants, and using OpenWeb UI as a general chat interface.
Based on those workloads, I setup the following endpoints:
| LiteLLM Model Group Alias | Hosts | Model Group | Application Use cases |
|---|---|---|---|
| reasoning_fast | clementine, banana | nemotron-3-nano:4b | Home Assistant voice assistant |
| reasoning_slow | cranberry, star | nemotron-cascade-2:30b | OpenClaw |
| tools_fast | clementine, banana | nemotron-3-nano:4b | OpenClaw |
| tools_slow | cranberry, star | nemotron-cascade-2:30b | OpenClaw |
| vision | banana | qwen3.5:9b | Frigate, Home Assistant |
| embedding | banana, clementine | embeddinggemma | OpenClaw |
| coding | cranberry, star | nemotron-cascade-2:30b | OpenCode, Client |
From a client perspective, these are all accessed through litellm with the respective endpoints. In the future, upgrading a model is a single configuration change in one place.
Visualizing the Router
1. Fast Reasoning with Fallback
This diagram shows how reasoning_fast (optimized for speed on Apple Silicon) automatically fails over to the more capable reasoning_slow (running on high-VRAM Nvidia GPUs) if the primary host is offline.
graph LR subgraph Clients HA[Home Assistant] end subgraph "LiteLLM Router" RF[Alias: reasoning_fast] end subgraph Backend C[nemotron-3-nano:4b] S[nemotron-cascade-2:30b] end HA --> RF RF -- Primary (Order 1) --> C RF -- Fallback (Order 2) --> S style C fill:#d4f1f9,stroke:#333 style S fill:#f9d4d4,stroke:#333
2. Unified Vision Endpoint
This diagram illustrates how multiple distinct applications (Frigate for security and Home Assistant for general tasks) all point to a single vision alias, which LiteLLM then routes to the specific hardware capable of vision inference.
graph LR subgraph Clients H[Home Assistant] F[Frigate] end subgraph "LiteLLM Router" V[Alias: vision] end subgraph Backend Q[banana: qwen3.5:9b] end H --> V F --> V V --> Q
Another benefit of LiteLLM is the ability to setup fallback models. For example, if clementine is down, reasoning_fast can fallback to star.
Creating the Models and Model Group Aliases
LiteLLM distinguishes between the Model Name (what the client sees) and the LiteLLM Model Name (the specific model/engine on the backend).
- Models: These are the specific instances of a model running on a specific host (e.g., Nemotron on
star). - Model Group Aliases: These are functional groupings (e.g.,
coding) that map to one or more models.
Setting up Fallbacks
Fallbacks ensure that if your primary “fast” model is unavailable or overloaded, the request is automatically routed to a “slow” but more capable model.
For my lab, I’ve configured the following chain:
reasoning_fast→reasoning_slowtools_fast→tools_slowcoding→reasoning_slow
Setting up a preference in a Model Group
Where I have model overlap between a GPU instance and an Apple Silicon instance, I prefer the Apple Silicon hardware for energy efficiency. By setting the order property in the configuration, I can prioritize clementine (M4 Pro) over banana (4070TI) for light workloads.
Setting up an Automatic Complexity Router
For clients that only allow one provider, there is a tradeoff between speed and accuracy. Home Assistant is a great example: turning off a light is a “Simple” task, while asking for a summary of a day’s events is “Complex.”
LiteLLM provides complexity routing, which scores the request and routes it based on tiered complexity:
| Complexity Tier | Current Model Mapping |
|---|---|
| Simple (< 100 characters) | nemotron-3-nano:4b |
| Medium ( >= 100 characters) | nemotron-3-nano:4b |
| Complex ( >= 500 characters) | nemotron-3-super:120b_q5 |
| Reasoning | nemotron-3-super:120b_q5 |
Example LiteLLM Configuration File
The following configuration demonstrates how these concepts come together. To see the line numbers referenced in the descriptions below, ensure you are viewing this in a Quartz-compatible environment.
1. Global & LiteLLM Settings
Lines 1–6 define the global behavior of the proxy, including background health checks every 15 minutes (900 seconds) to ensure the router doesn’t send traffic to a dead host.
litellm_settings:
check_provider_endpoint: true
default_fallbacks: ["nemotron-cascade-2:30b"]
enable_background_health_checks: true
health_check_interval: 9002. Router & Alias Setup
Lines 8–24 configure the Router Strategy. Here we use simple-shuffle with weighted order. The model_group_alias section (Lines 11–19) maps our functional use cases (like coding) to specific model groups. The fallbacks section (Lines 20–24) defines the safety net for each alias.
router_settings:
routing_strategy: simple-shuffle
model_group_alias:
"reasoning_fast": "nemotron-3-nano:4b"
"reasoning_slow": "nemotron-cascade-2:30b"
"tools_fast": "nemotron-3-nano:4b"
"tools_slow": "nemotron-cascade-2:30b"
"vision": "qwen3.5:9b"
"embedding": "embeddinggemma"
"coding": "nemotron-cascade-2:30b"
"default": "nemotron-cascade-2:30b"
fallbacks:
- "reasoning_fast": ["reasoning_slow", "coding"]
- "reasoning_slow": ["coding"]
- "tools_fast": ["tools_slow", "coding"]
- "tools_slow": ["coding"]
- "coding": ["reasoning_slow", "tools_slow"]3. The Model List & Hardware Mapping
Lines 29–85 contain the heart of the hardware mapping.
- Prioritization: Notice the
orderkey (e.g., Lines 34 and 41). Models withorder: 1are tried beforeorder: 2, allowing us to prioritize energy-efficient Macs over power-hungry GPUs. - Complexity Routing: Lines 75–85 define the
smart-router. This virtual model uses thecomplexity_routerto decide whether to send a request to thefastorslowtiers based on the input text.
model_list:
- model_name: "nemotron-3-nano:4b"
litellm_params:
model: ollama_chat/nemotron-3-nano:4b
api_base: http://clementine:11434
order: 1
model_info:
mode: completion
- model_name: "nemotron-3-nano:4b"
litellm_params:
model: ollama_chat/nemotron-3-nano:4b
api_base: http://banana:11434
order: 2
model_info:
mode: completion
- model_name: "nemotron-cascade-2:30b"
litellm_params:
model: ollama_chat/nemotron-cascade-2:30b
api_base: https://ollama
order: 2
model_info:
mode: completion
- model_name: "nemotron-cascade-2:30b"
litellm_params:
model: ollama_chat/nemotron-cascade-2:30b
api_base: http://star:11434
order: 1
model_info:
mode: completion
- model_name: "qwen3.5:9b"
litellm_params:
model: ollama_chat/qwen3.5:9b
api_base: http://banana:11434
model_info:
mode: completion
- model_name: "embeddinggemma"
litellm_params:
model: ollama_chat/embeddinggemma
api_base: http://banana:11434
order: 2
model_info:
mode: embedding
- model_name: "embeddinggemma"
litellm_params:
model: ollama_chat/embeddinggemma
api_base: http://clementine:11434
order: 1
model_info:
mode: embedding
- model_name: smart-router
litellm_params:
model: auto_router/complexity_router
complexity_router_config:
tiers:
SIMPLE: reasoning_fast
MEDIUM: reasoning_fast
COMPLEX: reasoning_slow
REASONING: reasoning_slow
complexity_router_default_model: reasoning_fastConclusion
Setting up an LLM router like LiteLLM has fundamentally changed how I interact with my home lab. By abstracting the hardware and specific models behind a functional alias, I’ve created a future-proof interface for my AI workloads. Whether I’m deploying new AI agents for content review or upgrading to the latest open-source model, the complexity is now managed in a single configuration file rather than across dozens of individual clients.
The goal of a home lab is often experimentation, and an LLM router is the key to making that experimentation seamless.
Update: After benchmarking models across all four hosts, the alias-to-model assignments above were revised. See Local LLM Benchmarks: April 2026 for the methodology and final configuration.