Hardware and GPU Configuration

OptimaNode can run AI models using a GPU for accelerated inference, or on CPU only if no GPU is available. GPU inference is significantly faster and is recommended for any production deployment.

GPU layers (offloading)

When a GPU is available, OptimaNode offloads model layers to it for accelerated computation. By default, all layers are offloaded to the GPU. This gives the best performance but requires enough VRAM to hold the entire model.

If the model is larger than the available VRAM, the Node will attempt to load it anyway — layers that do not fit in VRAM fall back to system RAM and are computed on the CPU. This partial offload is slower than full GPU inference but faster than CPU-only.
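The partial-offload behaviour can be estimated with a back-of-the-envelope calculation. The sketch below is illustrative only (it is not OptimaNode's actual placement algorithm) and assumes all layers are roughly the same size; in practice layer sizes vary and some VRAM is reserved for the context window:

```python
def layers_on_gpu(model_gb: float, n_layers: int, vram_gb: float) -> int:
    """Estimate how many of a model's layers fit in the available VRAM.

    Rough guide only: assumes uniformly sized layers and ignores
    VRAM reserved for the context window and runtime overhead.
    """
    per_layer_gb = model_gb / n_layers
    return min(n_layers, int(vram_gb // per_layer_gb))

# A ~4 GB model with 32 layers on an 8 GB GPU: everything fits.
print(layers_on_gpu(4.0, 32, 8.0))   # 32
# The same model on a 3 GB GPU: only some layers are offloaded;
# the rest fall back to system RAM and the CPU.
print(layers_on_gpu(4.0, 32, 3.0))   # 24
```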

If the machine has no GPU at all, inference runs entirely on the CPU using system RAM.

How much VRAM do I need?

VRAM requirements depend on the model's size and quantisation. The table below gives approximate figures for common configurations:

| Model size | Q4_K_M quantisation | Q8_0 quantisation |
|---|---|---|
| 3B parameters | ~2 GB | ~3 GB |
| 7B parameters | ~4 GB | ~8 GB |
| 13B parameters | ~8 GB | ~14 GB |
| 32B parameters | ~20 GB | ~34 GB |
| 70B parameters | ~40 GB | ~70 GB |

These are approximate figures. Actual usage also depends on the context window size — a larger context window uses more VRAM.
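The figures in the table follow directly from parameter count and bits per weight. A minimal sketch, assuming typical effective bit widths for these quantisation formats (roughly 4.8 bits for Q4_K_M and 8.5 for Q8_0 — approximations, not values published by OptimaNode):

```python
# Approximate effective bits per weight for common GGUF-style quantisations.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q8_0": 8.5}

def model_size_gb(n_params_billion: float, quant: str) -> float:
    """Rough model footprint in GB, excluding context-window overhead."""
    bits = BITS_PER_WEIGHT[quant]
    return n_params_billion * 1e9 * bits / 8 / 1e9

print(round(model_size_gb(7, "Q4_K_M"), 1))   # 4.2 — close to the ~4 GB in the table
print(round(model_size_gb(7, "Q8_0"), 1))     # 7.4 — close to the ~8 GB in the table
```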

OptimaNode sets the context window automatically based on available hardware at startup. You can adjust it in the executor parameter editor, but increasing it beyond what the hardware can hold will cause the model to fail to start or degrade performance significantly.
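The context window's extra VRAM cost comes mostly from the KV cache, which grows linearly with context length. This sketch uses the standard transformer KV-cache formula with hypothetical example dimensions for a 7B-class model (32 layers, 8 KV heads, head dimension 128) — these are illustrative assumptions, not OptimaNode defaults:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_tokens: int, bytes_per_elt: int = 2) -> float:
    """Approximate KV-cache size in GB.

    2 (K and V) x layers x KV heads x head dim x context length
    x element size (2 bytes for fp16).
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elt / 1e9

# Doubling the context window doubles the KV-cache footprint.
print(round(kv_cache_gb(32, 8, 128, 8192), 2))    # 1.07
print(round(kv_cache_gb(32, 8, 128, 16384), 2))   # 2.15
```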

GPU backends

On Windows and Linux (x64), two GPU compute backends are available. The backend is selected per-executor in the parameter editor:

| Backend | Description |
|---|---|
| Vulkan (default) | Works across NVIDIA, AMD, and Intel GPUs. No additional drivers or runtimes required beyond standard GPU drivers. |
| CUDA 13 | NVIDIA GPUs only. Requires the CUDA 13 runtime to be installed. Generally faster than Vulkan on supported hardware. |

On macOS (Apple Silicon), inference uses Apple's Metal framework automatically. No backend selection is needed.

On ARM64 (Windows or Linux), Vulkan is used.

When to use CUDA 13

Switch to the CUDA 13 backend if:

- You have an NVIDIA GPU
- You have installed the CUDA 13 runtime
- You want maximum throughput on that hardware

The Vulkan backend is a safe default and will work on all supported GPU hardware without any additional setup.

Multi-GPU tensor split

If the Node machine has more than one GPU, you can distribute the model's layers across them using the Tensor Split parameter. This allows you to run models that are too large for a single GPU's VRAM.

The tensor split is expressed as a comma-separated list of proportional weights. For example:

| Value | Effect |
|---|---|
| 3,2 | GPU 0 gets 60% of layers, GPU 1 gets 40% |
| 1,1 | Layers split evenly between two GPUs |
| 2,1,1 | GPU 0 gets 50%, GPU 1 and GPU 2 get 25% each |

Set Tensor Split in the executor parameter editor, then restart the executor to apply.
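The proportional weights normalise to per-GPU fractions, which is how the percentages in the table above are derived. A small sketch of that arithmetic (illustrative, not OptimaNode's parser):

```python
def split_fractions(tensor_split: str) -> list[float]:
    """Convert a comma-separated Tensor Split value into per-GPU fractions.

    Each GPU receives its weight divided by the sum of all weights.
    """
    weights = [float(w) for w in tensor_split.split(",")]
    total = sum(weights)
    return [w / total for w in weights]

print(split_fractions("3,2"))      # [0.6, 0.4]
print(split_fractions("2,1,1"))    # [0.5, 0.25, 0.25]
```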

CPU-only inference

If no GPU is present, or if you explicitly want CPU inference, OptimaNode falls back to using system RAM. CPU inference is much slower than GPU inference — expect responses to take several seconds per token rather than milliseconds.

For CPU inference, ensure the machine has enough system RAM to hold the model entirely in memory. The approximate figures in the VRAM table above apply equally to RAM for CPU-only configurations.
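A quick way to sanity-check this before loading a model is to compare its size against total physical RAM. This sketch uses `os.sysconf`, which works on Linux and macOS (not Windows); the 2 GB headroom figure is an assumption for the OS and other processes, not an OptimaNode requirement:

```python
import os

def total_ram_gb() -> float:
    """Total physical RAM in GB (Linux/macOS via sysconf)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9

def fits_in_ram(model_gb: float, headroom_gb: float = 2.0) -> bool:
    """True if the model plus some headroom fits in physical RAM."""
    return model_gb + headroom_gb <= total_ram_gb()

print(f"Total RAM: {total_ram_gb():.1f} GB")
print(f"A ~4 GB 7B Q4_K_M model fits: {fits_in_ram(4.0)}")
```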

Monitoring hardware usage

Live hardware metrics — CPU usage, RAM, GPU utilisation, and VRAM consumption — are displayed in the Monitoring panel within each node's detail view in the Gateway.

*[Screenshot: Node monitoring panel showing live CPU, RAM, GPU, and VRAM usage graphs]*