Introduction

Lately I’ve been experimenting with different engines and different models in my home lab. I’ve used llama.cpp, ollama, and lm studio to host models at various times. My reason for experimenting with different engines is due to the heterogenous hardware.

For AI workloads I have the following machines available:

Hostname	AI Hardware Description
cranberry	2x Nvidia P40s with 24gb of VRAM each
banana	1x Nvidia 4070TI with 16gb of VRAM
	1x AMD 7600m with 8gb of VRAM in the same
	1x AMD 890m with 96gb of unified RAM
clementine	1x Mac Mini M4 Pro with 24gb of unified RAM
star	1x Mac Studio M3 Ultra with 96gb of unified RAM

CUDA works well with ollama and llama.cpp on the Nvidia GPUs. MLX through LM Studio performs the best on the Mac Mini and Mac Studio, but relies on LM Studio which is closed source. ROCm and Vulkan perform decently on the AMD iGPU and dGPUs.

Since I’ve been incorporating AI workloads into more of my daily life, changing engines and models involves updating multiple client configurations. I’ve reached the point where this is tedious and I want a more streamlined approach for incorporating new models and hardware without changing client configurations. When a client is configured with an older model, interactions with the engines either fail because the model no longer exists on the host, or the engine requires a model swap which can cause thrashing. LiteLLM resolves this problem by providing a unified model router.

Using LiteLLM

What is LiteLLM

LiteLLM is a high-performance middleware that allows you to call 100+ LLMs using the OpenAI format. In a heterogeneous home lab like mine, it acts as a translation layer and load balancer.

It primarily operates in two modes:

Python SDK: Programmatically call models within your own scripts.
Model Proxy Server: An OpenAI-compatible server that sits between your clients (like Home Assistant or OpenWeb UI) and your various backends (Ollama, vLLM, etc.).

High Availability with Docker Swarm

For maximum reliability, I deploy the LiteLLM Proxy in Docker Swarm mode. This ensures that the proxy is always available, even if a specific node goes down. By running multiple replicas of the proxy, we can perform rolling updates without interrupting the AI services that power my home. This hosting strategy follows the same principles of high availability and automation described in my CD workflow post.

Security and Routing with Traefik

To handle external access and security, I use Traefik as a reverse proxy. Traefik automatically manages Let’s Encrypt certificates, providing full TLS encryption for all AI traffic. This is crucial when automated AI agents or mobile devices need to reach the router securely from outside the local network.

Using LiteLLM as a Router

My current workloads involve OpenClaw running Agents for various activities, using OpenCode and Cline for coding projects, using Home Assistant with local voice assistants, and using OpenWeb UI as a general chat interface.

Based on those workloads, I setup the following endpoints:

LiteLLM Model Group Alias	Hosts	Model Group	Application Use cases
reasoning_fast	clementine, banana	nemotron-3-nano:4b	Home Assistant voice assistant
reasoning_slow	cranberry, star	nemotron-cascade-2:30b	OpenClaw
tools_fast	clementine, banana	nemotron-3-nano:4b	OpenClaw
tools_slow	cranberry, star	nemotron-cascade-2:30b	OpenClaw
vision	banana	qwen3.5:9b	Frigate, Home Assistant
embedding	banana, clementine	embeddinggemma	OpenClaw
coding	cranberry, star	nemotron-cascade-2:30b	OpenCode, Client

From a client perspective, these are all accessed through litellm with the respective endpoints. In the future, upgrading a model is a single configuration change in one place.

Visualizing the Router

1. Fast Reasoning with Fallback

This diagram shows how reasoning_fast (optimized for speed on Apple Silicon) automatically fails over to the more capable reasoning_slow (running on high-VRAM Nvidia GPUs) if the primary host is offline.

graph LR
    subgraph Clients
        HA[Home Assistant]
    end

    subgraph "LiteLLM Router"
        RF[Alias: reasoning_fast]
    end

    subgraph Backend
        C[nemotron-3-nano:4b]
        S[nemotron-cascade-2:30b]
    end

    HA --> RF
    RF -- Primary (Order 1) --> C
    RF -- Fallback (Order 2) --> S
    
    style C fill:#d4f1f9,stroke:#333
    style S fill:#f9d4d4,stroke:#333

2. Unified Vision Endpoint

This diagram illustrates how multiple distinct applications (Frigate for security and Home Assistant for general tasks) all point to a single vision alias, which LiteLLM then routes to the specific hardware capable of vision inference.

graph LR
    subgraph Clients
        H[Home Assistant]
        F[Frigate]
    end

    subgraph "LiteLLM Router"
        V[Alias: vision]
    end

    subgraph Backend
        Q[banana: qwen3.5:9b]
    end

    H --> V
    F --> V
    V --> Q

Another benefit of LiteLLM is the ability to setup fallback models. For example, if clementine is down, reasoning_fast can fallback to star.

Creating the Models and Model Group Aliases

LiteLLM distinguishes between the Model Name (what the client sees) and the LiteLLM Model Name (the specific model/engine on the backend).

Models: These are the specific instances of a model running on a specific host (e.g., Nemotron on star).
Model Group Aliases: These are functional groupings (e.g., coding) that map to one or more models.

Setting up Fallbacks

Fallbacks ensure that if your primary “fast” model is unavailable or overloaded, the request is automatically routed to a “slow” but more capable model.

For my lab, I’ve configured the following chain:

reasoning_fast → reasoning_slow
tools_fast → tools_slow
coding → reasoning_slow

Setting up a preference in a Model Group

Where I have model overlap between a GPU instance and an Apple Silicon instance, I prefer the Apple Silicon hardware for energy efficiency. By setting the order property in the configuration, I can prioritize clementine (M4 Pro) over banana (4070TI) for light workloads.

Setting up an Automatic Complexity Router

For clients that only allow one provider, there is a tradeoff between speed and accuracy. Home Assistant is a great example: turning off a light is a “Simple” task, while asking for a summary of a day’s events is “Complex.”

LiteLLM provides complexity routing, which scores the request and routes it based on tiered complexity:

Complexity Tier	Current Model Mapping
Simple (< 100 characters)	nemotron-3-nano:4b
Medium ( >= 100 characters)	nemotron-3-nano:4b
Complex ( >= 500 characters)	nemotron-3-super:120b_q5
Reasoning	nemotron-3-super:120b_q5

Example LiteLLM Configuration File

The following configuration demonstrates how these concepts come together. To see the line numbers referenced in the descriptions below, ensure you are viewing this in a Quartz-compatible environment.

1. Global & LiteLLM Settings

Lines 1–6 define the global behavior of the proxy, including background health checks every 15 minutes (900 seconds) to ensure the router doesn’t send traffic to a dead host.

litellm_settings:
  check_provider_endpoint: true
  default_fallbacks: ["nemotron-cascade-2:30b"] 
  enable_background_health_checks: true
  health_check_interval: 900

2. Router & Alias Setup

Lines 8–24 configure the Router Strategy. Here we use simple-shuffle with weighted order. The model_group_alias section (Lines 11–19) maps our functional use cases (like coding) to specific model groups. The fallbacks section (Lines 20–24) defines the safety net for each alias.

router_settings:
  routing_strategy: simple-shuffle
  model_group_alias:
    "reasoning_fast": "nemotron-3-nano:4b"
    "reasoning_slow": "nemotron-cascade-2:30b"
    "tools_fast": "nemotron-3-nano:4b"
    "tools_slow": "nemotron-cascade-2:30b"
    "vision": "qwen3.5:9b"
    "embedding": "embeddinggemma"
    "coding": "nemotron-cascade-2:30b"
    "default": "nemotron-cascade-2:30b"
  fallbacks:
    - "reasoning_fast": ["reasoning_slow", "coding"]
    - "reasoning_slow": ["coding"]
    - "tools_fast": ["tools_slow", "coding"]
    - "tools_slow": ["coding"]
    - "coding": ["reasoning_slow", "tools_slow"]

3. The Model List & Hardware Mapping

Lines 29–85 contain the heart of the hardware mapping.

Prioritization: Notice the order key (e.g., Lines 34 and 41). Models with order: 1 are tried before order: 2, allowing us to prioritize energy-efficient Macs over power-hungry GPUs.
Complexity Routing: Lines 75–85 define the smart-router. This virtual model uses the complexity_router to decide whether to send a request to the fast or slow tiers based on the input text.

model_list:
  - model_name: "nemotron-3-nano:4b"
    litellm_params:
      model: ollama_chat/nemotron-3-nano:4b
      api_base: http://clementine:11434
      order: 1
    model_info:
      mode: completion
  - model_name: "nemotron-3-nano:4b"
    litellm_params:
      model: ollama_chat/nemotron-3-nano:4b
      api_base: http://banana:11434
      order: 2
    model_info:
      mode: completion
  - model_name: "nemotron-cascade-2:30b"
    litellm_params:
      model: ollama_chat/nemotron-cascade-2:30b
      api_base: https://ollama
      order: 2
    model_info:
      mode: completion
  - model_name: "nemotron-cascade-2:30b"
    litellm_params:
      model: ollama_chat/nemotron-cascade-2:30b
      api_base: http://star:11434
      order: 1
    model_info:
      mode: completion
  - model_name: "qwen3.5:9b"
    litellm_params:
      model: ollama_chat/qwen3.5:9b
      api_base: http://banana:11434
    model_info:
      mode: completion
  - model_name: "embeddinggemma"
    litellm_params:
      model: ollama_chat/embeddinggemma
      api_base: http://banana:11434
      order: 2
    model_info:
      mode: embedding
  - model_name: "embeddinggemma"
    litellm_params:
      model: ollama_chat/embeddinggemma
      api_base: http://clementine:11434
      order: 1
    model_info:
      mode: embedding
  - model_name: smart-router
    litellm_params:
      model: auto_router/complexity_router
      complexity_router_config:
        tiers:
          SIMPLE: reasoning_fast
          MEDIUM: reasoning_fast
          COMPLEX: reasoning_slow
          REASONING: reasoning_slow
      complexity_router_default_model: reasoning_fast

Conclusion

Setting up an LLM router like LiteLLM has fundamentally changed how I interact with my home lab. By abstracting the hardware and specific models behind a functional alias, I’ve created a future-proof interface for my AI workloads. Whether I’m deploying new AI agents for content review or upgrading to the latest open-source model, the complexity is now managed in a single configuration file rather than across dozens of individual clients.

The goal of a home lab is often experimentation, and an LLM router is the key to making that experimentation seamless.

Update: After benchmarking models across all four hosts, the alias-to-model assignments above were revised. See Local LLM Benchmarks: April 2026 for the methodology and final configuration.

Hacks with Robots

Explorer

Backlinks

You need an LLM Router

Introduction

Using LiteLLM

What is LiteLLM

High Availability with Docker Swarm

Security and Routing with Traefik

Using LiteLLM as a Router

Visualizing the Router

1. Fast Reasoning with Fallback

2. Unified Vision Endpoint

Creating the Models and Model Group Aliases

Setting up Fallbacks

Setting up a preference in a Model Group

Setting up an Automatic Complexity Router

Example LiteLLM Configuration File

1. Global & LiteLLM Settings

2. Router & Alias Setup

3. The Model List & Hardware Mapping

Conclusion

Graph View

Table of Contents

HwRHacks with Robots

Explorer

Backlinks

You need an LLM Router

Introduction

Using LiteLLM

What is LiteLLM

High Availability with Docker Swarm

Security and Routing with Traefik

Using LiteLLM as a Router

Visualizing the Router

1. Fast Reasoning with Fallback

2. Unified Vision Endpoint

Creating the Models and Model Group Aliases

Setting up Fallbacks

Setting up a preference in a Model Group

Setting up an Automatic Complexity Router

Example LiteLLM Configuration File

1. Global & LiteLLM Settings

2. Router & Alias Setup

3. The Model List & Hardware Mapping

Conclusion

Graph View

Table of Contents

Hacks with Robots