How to Add GPU Inference to the Cluster¶
The OptiPlex 3080 Micro has no PCIe slot — it cannot take a discrete GPU. This doc covers the two paths to add GPU inference: external GPU nodes and eGPU via Thunderbolt.
Option A — Dedicated GPU Nodes (recommended)¶
Add one or more GPU-capable machines to the existing Talos cluster as dedicated inference workers. These run alongside the OptiPlex nodes — the cluster scheduler directs inference workloads to GPU nodes via node selectors.
Note: Talos Linux does not support NVIDIA drivers natively (it is an immutable, API-managed OS with no package manager). GPU nodes must run a standard Linux distribution (e.g., Ubuntu 22.04) and join the cluster as regular Kubernetes workers. The Talos control plane accepts any kubelet that presents a valid bootstrap token.
Hardware to consider¶
| Option | GPU | VRAM | Form factor | Est. cost | Notes |
|---|---|---|---|---|---|
| Used workstation (Dell T3660, HP Z4) | RTX 3090 | 24 GB | Tower | $800–1,200 | Best value for VRAM |
| Mini PC (ASUS NUC 14 Pro+) | Arc A770M (integrated) | 16 GB | Mini | $700–900 | Compact, lower power |
| NVIDIA Jetson AGX Orin | Ampere 2048-core | 32 GB unified | Edge | $2,000 | ARM, good for edge inference |
| Used cloud-decom server | A100 40GB / A10G | 40 GB | 1–2U rack | $3,000–8,000 | Fastest option |
Recommendation for Phase 2: One used workstation with an RTX 3090 (24GB VRAM) gives you ~90–112 tok/sec on Llama 3.1 8B Q4 (fully in VRAM). For 70B Q4 (~40 GB) you'd need CPU offloading, dropping to ~2–5 tok/sec — the 70B doesn't fit in 24GB VRAM. Budget ~$1,000–1,200 all-in.
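A rough sizing check behind those numbers (treating Q4_K_M as roughly 0.57 bytes per weight, an approximation):

# 70B Q4: 70e9 weights × ~0.57 B ≈ 40 GB of quantized weights + KV cache → far over 24 GB, so layers spill to system RAM
#  8B Q4:  8e9 weights × ~0.57 B ≈ 4.6 GB of quantized weights + KV cache → fits in 24 GB with headroom for context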
Add a GPU node to the cluster¶
- Install Ubuntu 22.04 + NVIDIA drivers on the GPU machine:

# Install NVIDIA drivers
sudo apt install nvidia-driver-535 -y
sudo reboot

# Verify
nvidia-smi

- Join the GPU node to the Kubernetes cluster:
Since the cluster runs Talos Linux, GPU nodes cannot use Talos machine configs (no NVIDIA driver support). Instead, install a standard kubelet on the Ubuntu GPU node and join it as a worker.
Generate a bootstrap token on the launcher box:
# Create a bootstrap token valid for 24h (assumes kubeadm is installed on the
# launcher box; it creates the token through the API via the admin@iva kubeconfig)
kubectl config use-context admin@iva
kubeadm token create --ttl 24h --kubeconfig "$HOME/.kube/config"
On the GPU node, install kubeadm/kubelet and join:
# Install kubelet and kubeadm (match the cluster's Kubernetes version)
sudo mkdir -p -m 755 /etc/apt/keyrings
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.35/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.35/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt update && sudo apt install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl
# Join the cluster (use the token from above; one way to derive the CA cert hash is sketched after these steps)
sudo kubeadm join 192.168.10.32:6443 \
--token <BOOTSTRAP_TOKEN> \
--discovery-token-ca-cert-hash sha256:<CA_CERT_HASH>
- Install the NVIDIA device plugin (the node also needs the NVIDIA container toolkit wired into containerd; see the sketch after these steps):

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm install nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set failOnInitError=false
- Label the node:

kubectl label node <gpu-node-name> accelerator=nvidia-gpu
kubectl label node <gpu-node-name> node-role=inference
- Verify GPU is visible to Kubernetes:

kubectl describe node <gpu-node-name> | grep nvidia
# Should show: nvidia.com/gpu: 1
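For the <CA_CERT_HASH> value in the join command, kubeadm expects a SHA-256 hash of the cluster CA's public key. A minimal sketch of one way to derive it, assuming the CA certificate is embedded in the admin@iva kubeconfig:

# Extract the cluster CA from the kubeconfig and hash its public key in kubeadm's expected format
kubectl --context admin@iva config view --raw --minify \
  -o jsonpath='{.clusters[0].cluster.certificate-authority-data}' \
  | base64 -d \
  | openssl x509 -pubkey -noout \
  | openssl pkey -pubin -outform der \
  | openssl dgst -sha256 -hex | sed 's/^.* //'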
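The device plugin only exposes nvidia.com/gpu if containers on the node can reach the driver, which on a kubeadm-built Ubuntu worker means installing the NVIDIA container toolkit and pointing containerd at it. A sketch of the usual setup (repo URL and flags follow NVIDIA's published install steps and may change):

# Add NVIDIA's container toolkit repo and install the toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit

# Make the NVIDIA runtime containerd's default and restart containerd
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd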
Deploy an inference workload¶
Use ollama or vllm as the inference server. Example with ollama:
apiVersion: apps/v1
kind: Deployment
metadata:
name: ollama
namespace: inference
spec:
replicas: 1
selector:
matchLabels:
app: ollama
template:
metadata:
labels:
app: ollama
spec:
nodeSelector:
accelerator: nvidia-gpu # pin to GPU node
containers:
- name: ollama
image: ollama/ollama:latest
resources:
limits:
nvidia.com/gpu: 1 # request the GPU
requests:
memory: "16Gi"
cpu: "4"
volumeMounts:
- name: ollama-models
mountPath: /root/.ollama
volumes:
- name: ollama-models
hostPath:
path: /data/ollama-models # local NVMe on GPU node
type: DirectoryOrCreate
---
apiVersion: v1
kind: Service
metadata:
name: ollama
namespace: inference
spec:
selector:
app: ollama
ports:
- port: 11434
targetPort: 11434
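Assuming the two manifests above are saved to a file (the name ollama.yaml below is just an example), create the namespace and apply them:

kubectl create namespace inference
kubectl apply -f ollama.yaml
# Wait for the pod to be scheduled onto the GPU node and become ready
kubectl -n inference rollout status deploy/ollama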
Pull a model and test:
kubectl exec -it deploy/ollama -n inference -- ollama pull llama3.1:8b
kubectl exec -it deploy/ollama -n inference -- ollama run llama3.1:8b "Hello"
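Other workloads reach the model through the Service on port 11434. A quick in-cluster smoke test with a throwaway curl pod (the pod name and image here are just examples):

# List pulled models via Ollama's HTTP API, routed through the ClusterIP Service
kubectl run curl-test -n inference --rm -it --restart=Never \
  --image=curlimages/curl --command -- \
  curl -s http://ollama:11434/api/tags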
Option B — eGPU via Thunderbolt (limited)¶
The OptiPlex 3080 Micro has a Thunderbolt 3 port. An eGPU enclosure (Razer Core X, Sonnet Breakaway Box) can attach a full-size GPU.
Caveats:

- Thunderbolt 3 = PCIe x4 bandwidth (~4 GB/s) vs native PCIe x16 (~32 GB/s) — but LLM generation happens inside the GPU, so the actual penalty is only ~5–15% for generation (model loading and prompt processing are slower; rough numbers below)
- NVIDIA drivers on Linux with eGPU require manual setup (no plug-and-play)
- Not hot-swappable — node must be rebooted to attach/detach
- One eGPU per node maximum
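To put the bandwidth caveat in rough numbers (assuming ~3 GB/s of usable Thunderbolt 3 throughput):

# Loading a ~5 GB 8B Q4 model: 5 GB ÷ ~3 GB/s ≈ 2 s over TB3, vs well under 1 s over PCIe x16
# Generation afterwards only moves prompts, tokens, and small activations over the link, hence the modest penalty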
When to use: If you already have a GPU and an enclosure, this is a zero-hardware-cost path to test GPU inference on the existing nodes. Not recommended for production.
# Check Thunderbolt controller
lspci | grep -i thunderbolt
# Authorize the eGPU device (if security level requires it)
echo 1 > /sys/bus/thunderbolt/devices/0-1/authorized
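After authorizing, confirm the enclosure's GPU actually enumerates on the PCIe bus before installing drivers:

# The eGPU should appear as a VGA or 3D controller once the Thunderbolt device is authorized
lspci | grep -iE 'vga|3d controller'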
GPU Inference Performance (reference)¶
| GPU | VRAM | Model | Quantization | Tok/sec |
|---|---|---|---|---|
| RTX 3090 | 24 GB | Llama 3.1 8B | Q4_K_M | 90–112 |
| RTX 3090 | 24 GB | Llama 3.1 8B | Q8 | 40–50 |
| RTX 3090 | 24 GB | Llama 3.1 70B | Q4_K_M (offload) | 2–5 |
| RTX 3090 | 24 GB | Mistral 7B | Q4_K_M | 90–110 |
| RTX 4090 | 24 GB | Llama 3.1 8B | Q4_K_M | 95–126 |
| RTX 4090 | 24 GB | Llama 3.1 8B | Q8 | 80–87 |
| RTX 4090 | 24 GB | Gemma 3 27B | Q4 | 45–55 |
| A100 40GB | 40 GB | Llama 3.1 8B | FP16 | 55–80 |
| A100 40GB | 40 GB | Llama 3.1 70B | Q4_K_M | 20–25 |
Compare: CPU-only on i5-10500T → Llama 3.1 8B Q4 = 3–6 tok/sec. A single RTX 3090 is ~20–30× faster than one OptiPlex node for inference on the same model.
Recommended Phase 2 Plan¶
- Procure one used workstation with RTX 3090 (~$1,000–1,200)
- Join it to the existing Kubernetes cluster as a labeled inference node
- Install NVIDIA device plugin (30 min)
- Deploy ollama with GPU resource request
- Move LLM inference workloads from the 3 CPU inference nodes → GPU node
- Repurpose the 3 freed CPU nodes as general workers
Result: ~90–112 tok/sec inference on 8B Q4 (vs 9–18 tok/sec across 3 CPU nodes), 3 more general workers, same cluster management overhead.
See Compute Capacity for current CPU-only inference estimates.