This tutorial will guide you through setting up GPU acceleration for LocalAI. GPU acceleration can significantly speed up model inference, especially for larger models.

Prerequisites

  • A compatible GPU (NVIDIA, AMD, Intel, or Apple Silicon)
  • LocalAI installed
  • Basic understanding of your system’s GPU setup

Check Your GPU

First, verify you have a compatible GPU:

NVIDIA

  nvidia-smi
  

You should see your GPU information. Ensure you have CUDA 11.7 or 12.0+ installed.
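
To see how much VRAM you have to work with (useful later when choosing gpu_layers), you can query it directly:

  # Show GPU name, total memory, and driver version
  nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv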

AMD

  rocminfo
  

Intel

  intel_gpu_top  # if available
  

Apple Silicon (macOS)

Apple Silicon (M1/M2/M3) GPUs are automatically detected. No additional setup needed!

Installation Methods

Method 1: Standard GPU Images

NVIDIA CUDA

  # CUDA 12.0
  docker run -p 8080:8080 --gpus all --name local-ai \
    -ti localai/localai:latest-gpu-nvidia-cuda-12

  # CUDA 11.7
  docker run -p 8080:8080 --gpus all --name local-ai \
    -ti localai/localai:latest-gpu-nvidia-cuda-11

Prerequisite: the NVIDIA Container Toolkit must be installed so that Docker containers can access the GPU.
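
As a rough sketch, on Ubuntu/Debian the toolkit can be installed and registered with Docker as follows (this assumes the NVIDIA apt repository is already configured; see NVIDIA’s documentation for the repository setup):

  # Install the toolkit, wire it into Docker, and restart the daemon
  sudo apt-get install -y nvidia-container-toolkit
  sudo nvidia-ctk runtime configure --runtime=docker
  sudo systemctl restart docker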

AMD ROCm

  docker run -p 8080:8080 \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add=video \
    --name local-ai \
    -ti localai/localai:latest-gpu-hipblas

Intel GPU

  docker run -p 8080:8080 --name local-ai \
    -ti localai/localai:latest-gpu-intel

Apple Silicon

GPU acceleration works automatically when running on macOS with Apple Silicon. Use the standard CPU image; Metal acceleration is built in.

Method 2: AIO Images with GPU

AIO images are also available with GPU support:

  # NVIDIA CUDA 12
  docker run -p 8080:8080 --gpus all --name local-ai \
    -ti localai/localai:latest-aio-gpu-nvidia-cuda-12

  # AMD
  docker run -p 8080:8080 \
    --device=/dev/kfd --device=/dev/dri --group-add=video \
    --name local-ai \
    -ti localai/localai:latest-aio-gpu-hipblas

Method 3: Build from Source

For building with GPU support from source, see the Build Guide.

Configuring Models for GPU

Automatic Detection

LocalAI automatically detects GPU capabilities and downloads the appropriate backend when you install models from the gallery.
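
For example, assuming LocalAI is listening on port 8080, gallery models can be listed and installed through the API (the model id below is a placeholder):

  # List models available in the gallery
  curl http://localhost:8080/models/available

  # Install one of them (replace the id with a real gallery entry)
  curl http://localhost:8080/models/apply \
    -H "Content-Type: application/json" \
    -d '{"id": "model-id-from-gallery"}'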

Manual Configuration

In your model YAML configuration, specify GPU layers:

  name: my-model
  parameters:
    model: model.gguf
  backend: llama-cpp
  # Offload layers to GPU (adjust based on your GPU memory)
  f16: true
  gpu_layers: 35  # Number of layers to offload to GPU

GPU Layers Guidelines:

  • Small GPU (4-6GB): 20-30 layers
  • Medium GPU (8-12GB): 30-40 layers
  • Large GPU (16GB+): 40+ layers or set to model’s total layer count

Finding the Right Number of Layers

  1. Start with a conservative number (e.g., 20)
  2. Monitor GPU memory usage with nvidia-smi (NVIDIA) or rocm-smi (AMD), as shown below
  3. Gradually increase until you reach GPU memory limits
  4. For maximum performance, offload all layers if you have enough VRAM
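
For example, on NVIDIA you can watch VRAM headroom once per second while raising gpu_layers between runs (the --query-gpu fields are standard nvidia-smi options):

  # Refresh used/total VRAM every second while the model serves requests
  nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1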

Verifying GPU Usage

Check if GPU is Being Used

NVIDIA

  # Watch GPU usage in real-time
  watch -n 1 nvidia-smi

You should see:

  • GPU utilization > 0%
  • Memory usage increasing
  • Processes running on GPU

AMD

  rocm-smi
  

Check Logs

Enable debug mode to see GPU information in logs:

  DEBUG=true local-ai
  

Look for messages indicating GPU initialization and layer offloading.
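
When running in Docker, the same flag can be passed as an environment variable, for example:

  docker run -p 8080:8080 --gpus all -e DEBUG=true \
    --name local-ai -ti localai/localai:latest-gpu-nvidia-cuda-12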

Performance Tips

1. Optimize GPU Layers

  • Offload as many layers as your GPU memory allows
  • Balance between GPU and CPU layers for best performance
  • Use f16: true for better GPU performance

2. Batch Processing

GPUs excel at batch processing. Process multiple requests together when possible; a concurrent-request sketch follows below.
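
As a minimal sketch, assuming the OpenAI-compatible /v1/chat/completions endpoint on port 8080 and a model named my-model (a placeholder), several requests can be fired concurrently from the shell:

  # Send three chat requests in parallel and wait for all of them
  for i in 1 2 3; do
    curl -s http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello"}]}' &
  done
  wait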

3. Model Quantization

Even with a GPU, quantized models (e.g., Q4_K_M) often provide the best speed/quality balance.

4. Context Size

Larger context sizes use more GPU memory. Adjust based on your GPU:

  context_size: 4096  # Adjust based on GPU memory
  

Troubleshooting

GPU Not Detected

  1. Check drivers: Ensure GPU drivers are installed
  2. Check Docker: Verify Docker has GPU access
      docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
      
  3. Check logs: Enable debug mode and check for GPU-related errors

Out of GPU Memory

  • Reduce gpu_layers in the model configuration (see the example below)
  • Use a smaller model or lower quantization
  • Reduce context_size
  • Close other GPU-using applications
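
A trimmed-down configuration along these lines (values purely illustrative) reduces VRAM pressure:

  name: my-model
  parameters:
    model: model.gguf
  backend: llama-cpp
  f16: true
  gpu_layers: 20      # fewer layers offloaded to the GPU
  context_size: 2048  # smaller context window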

Slow Performance

  • Ensure you’re using the correct GPU image
  • Check that layers are actually offloaded (check logs)
  • Verify GPU drivers are up to date
  • Consider using a more powerful GPU or reducing model size

CUDA Errors

  • Ensure the image’s CUDA version matches your installed CUDA (11.7 vs 12.0+)
  • Check CUDA compatibility with your GPU
  • Try rebuilding the backends with REBUILD=true, as in the example below
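
One way to do this with Docker is to set REBUILD as an environment variable at container start (note that the rebuild happens on startup and takes a while):

  # Rebuild backends from source inside the container on startup
  docker run -p 8080:8080 --gpus all -e REBUILD=true \
    --name local-ai -ti localai/localai:latest-gpu-nvidia-cuda-12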

Platform-Specific Notes

NVIDIA Jetson (L4T)

Use the L4T-specific images:

  docker run -p 8080:8080 --runtime nvidia --gpus all \
    --name local-ai \
    -ti localai/localai:latest-nvidia-l4t-arm64

Apple Silicon

  • Metal acceleration is automatic
  • No special Docker flags needed
  • Use the standard CPU images; Metal support is built in
  • For best performance, build from source on macOS (a sketch follows below)
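
A rough sketch of such a build (the BUILD_TYPE=metal flag is an assumption here; consult the Build Guide for the authoritative flags):

  # Clone and build LocalAI with Metal support (BUILD_TYPE value assumed)
  git clone https://github.com/mudler/LocalAI
  cd LocalAI
  make BUILD_TYPE=metal build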
