Deploying to Production
Best practices for running LocalAI in production environments
This tutorial covers best practices for deploying LocalAI in production environments, including security, performance, monitoring, and reliability considerations.
Prerequisites
- LocalAI installed and tested
- Understanding of your deployment environment
- Basic knowledge of Docker, Kubernetes, or your chosen deployment method
Security Considerations
1. API Key Protection
Always use API keys in production:
# Set API key
API_KEY=your-secure-random-key local-ai
# Or multiple keys
API_KEY=key1,key2,key3 local-ai
Best Practices:
- Use strong, randomly generated keys (see the sketch after this list)
- Store keys securely (environment variables, secrets management)
- Rotate keys regularly
- Use different keys for different services/clients
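For the key generation and storage points above, a minimal shell sketch; the file path is illustrative and should be replaced by whatever secrets mechanism you actually use:

# Generate a 256-bit random key
API_KEY=$(openssl rand -hex 32)

# Store it in a root-only readable env file (illustrative path)
echo "API_KEY=${API_KEY}" | sudo tee /etc/localai/localai.env > /dev/null
sudo chmod 600 /etc/localai/localai.env

# Start LocalAI with the generated key
API_KEY="${API_KEY}" local-ai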
2. Network Security
Never expose LocalAI directly to the internet without protection:
- Use a reverse proxy (nginx, Traefik, Caddy)
- Enable HTTPS/TLS
- Use firewall rules to restrict access (see the firewall sketch after the nginx example)
- Consider VPN or private network access only
Example nginx configuration:
server {
    listen 443 ssl;
    server_name localai.example.com;

    ssl_certificate /path/to/cert.pem;
    ssl_certificate_key /path/to/key.pem;

    location / {
        proxy_pass http://localhost:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
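For the firewall point above, a minimal sketch using ufw, assuming a Debian/Ubuntu host where the reverse proxy and LocalAI run on the same machine, so only HTTPS needs to be reachable from outside:

# Allow HTTPS to the reverse proxy
sudo ufw allow 443/tcp

# Refuse direct connections to the LocalAI port
sudo ufw deny 8080/tcp

sudo ufw enable
sudo ufw status verbose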
3. Resource Limits
Set appropriate resource limits to prevent resource exhaustion:
# Docker Compose example
services:
  localai:
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 16G
        reservations:
          cpus: '2'
          memory: 8G
Deployment Methods
Docker Compose (Recommended for Small-Medium Deployments)
version: '3.8'

services:
  localai:
    image: localai/localai:latest
    ports:
      - "8080:8080"
    environment:
      - API_KEY=${API_KEY}
      - DEBUG=false
      - MODELS_PATH=/models
    volumes:
      - ./models:/models
      - ./config:/config
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
      interval: 30s
      timeout: 10s
      retries: 3
    deploy:
      resources:
        limits:
          memory: 16G
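To verify the stack after bringing it up (assuming the file above is saved as docker-compose.yaml in the current directory):

docker compose up -d
docker compose ps                      # the localai service should become healthy
curl http://localhost:8080/readyz      # readiness endpoint, see the health check section below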
Kubernetes
See the Kubernetes Deployment Guide for detailed instructions.
Key considerations:
- Use ConfigMaps for configuration
- Use Secrets for API keys (a minimal sketch follows this list)
- Set resource requests and limits
- Configure readiness and liveness probes
- Use PersistentVolumes for model storage
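As a minimal sketch of the Secrets point, assuming a Secret named localai-api-key that the LocalAI container reads as API_KEY (names are illustrative; adapt them to the manifests from the Kubernetes guide):

apiVersion: v1
kind: Secret
metadata:
  name: localai-api-key
type: Opaque
stringData:
  api-key: your-secure-random-key

Then reference it in the container spec of the Deployment:

env:
  - name: API_KEY
    valueFrom:
      secretKeyRef:
        name: localai-api-key
        key: api-key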
Systemd Service (Linux)
Create a systemd service file:
[Unit]
Description=LocalAI Service
After=network.target
[Service]
Type=simple
User=localai
Environment="API_KEY=your-key"
Environment="MODELS_PATH=/var/lib/localai/models"
ExecStart=/usr/local/bin/local-ai
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
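To install and start the unit (assuming it is saved as /etc/systemd/system/localai.service):

sudo systemctl daemon-reload
sudo systemctl enable --now localai
sudo systemctl status localai
journalctl -u localai -f    # follow the service logs

Rather than placing the key inline with Environment=, consider EnvironmentFile= pointing at a root-only readable file, which keeps the secret out of the unit file.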
Performance Optimization
1. Model Selection
- Use quantized models (e.g. Q4_K_M) for production
- Choose models appropriate for your hardware
- Consider model size vs. quality trade-offs
2. Resource Allocation
# Model configuration
name: production-model
parameters:
  model: model.gguf
context_size: 2048   # Adjust based on needs
threads: 4           # Match CPU cores
gpu_layers: 35       # If using GPU
3. Caching
Enable prompt caching for repeated queries:
prompt_cache_path: "cache"
prompt_cache_all: true
4. Connection Pooling
If using a reverse proxy, configure connection pooling:
upstream localai {
    least_conn;
    server localhost:8080 max_fails=3 fail_timeout=30s;
    keepalive 32;
}
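Note that upstream keepalive only applies when proxied requests use HTTP/1.1 with the Connection header cleared, so the location block should target the upstream roughly like this (a sketch to combine with the server block shown earlier):

location / {
    proxy_pass http://localai;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
}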
Monitoring and Logging
1. Health Checks
LocalAI provides health check endpoints:
# Readiness check
curl http://localhost:8080/readyz
# Health check
curl http://localhost:8080/healthz
2. Logging
Configure appropriate log levels:
# Production: minimal logging
DEBUG=false local-ai
# Development: detailed logging
DEBUG=true local-ai
3. Metrics
Monitor key metrics:
- Request rate
- Response times (see the curl timing sketch after this list)
- Error rates
- Resource usage (CPU, memory, GPU)
- Model loading times
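If a metrics stack is not in place yet, response times can be spot-checked from the shell using curl's timing variables; the model name below is illustrative:

curl -s -o /dev/null \
  -w "status: %{http_code}  total: %{time_total}s\n" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "production-model", "messages": [{"role": "user", "content": "ping"}]}' \
  http://localhost:8080/v1/chat/completions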
4. Alerting
Set up alerts for:
- Service downtime
- High error rates
- Resource exhaustion
- Slow response times
High Availability
1. Multiple Instances
Run multiple LocalAI instances behind a load balancer:
# Docker Compose with multiple instances
services:
  localai1:
    image: localai/localai:latest
    # ... configuration
  localai2:
    image: localai/localai:latest
    # ... configuration
  nginx:
    image: nginx:alpine
    # Load balance between localai1 and localai2
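A minimal sketch of the nginx configuration the nginx service could mount (for example ./nginx.conf mounted to /etc/nginx/conf.d/default.conf; the upstream name and mount path are assumptions). Service names resolve on the Compose network, so the instances can be addressed directly:

upstream localai_backend {
    least_conn;
    server localai1:8080;
    server localai2:8080;
}

server {
    listen 80;
    location / {
        proxy_pass http://localai_backend;
        proxy_set_header Host $host;
    }
}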
2. Model Replication
Ensure models are available on all instances:
- Shared storage (NFS, S3, etc.)
- Model synchronization (see the rsync sketch after this list)
- Consistent model versions
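Where shared storage is not available, a simple option is to push the model directory from a primary host with rsync (the hostname and paths are illustrative):

# Push models from the primary host to a second instance
rsync -av --delete ./models/ localai-node2:/var/lib/localai/models/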
3. Graceful Shutdown
LocalAI supports graceful shutdown. Ensure your deployment method handles SIGTERM properly.
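With Docker Compose, the window between SIGTERM and SIGKILL can be widened with stop_grace_period (the default is 10s); for the systemd unit above, TimeoutStopSec= serves the same purpose. A Compose sketch:

services:
  localai:
    stop_grace_period: 60s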
Backup and Recovery
1. Model Backups
Regularly backup your models and configurations:
# Backup models
tar -czf models-backup-$(date +%Y%m%d).tar.gz models/
# Backup configurations
tar -czf config-backup-$(date +%Y%m%d).tar.gz config/
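To run these backups on a schedule, a crontab entry along these lines works (paths and times are illustrative; note that % must be escaped in crontab):

# Weekly backups, Sundays at 03:00 (add with `crontab -e`)
0 3 * * 0 tar -czf /backups/models-backup-$(date +\%Y\%m\%d).tar.gz -C /opt/localai models
0 3 * * 0 tar -czf /backups/config-backup-$(date +\%Y\%m\%d).tar.gz -C /opt/localai config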
2. Configuration Management
Version control your configurations:
- Use Git for YAML configurations
- Document model versions
- Track configuration changes
3. Disaster Recovery
Plan for:
- Model storage recovery
- Configuration restoration
- Service restoration procedures
Scaling Considerations
Horizontal Scaling
- Run multiple instances
- Use load balancing
- Consider stateless design (shared model storage)
Vertical Scaling
- Increase resources (CPU, RAM, GPU)
- Use more powerful hardware
- Optimize model configurations
Maintenance
1. Updates
- Test updates in staging first
- Plan maintenance windows
- Have rollback procedures ready
2. Model Updates
- Test new models before production
- Keep model versions documented
- Have rollback capability
3. Monitoring
Regularly review:
- Performance metrics
- Error logs
- Resource usage trends
- User feedback
Production Checklist
Before going live, ensure:
- API keys configured and secured
- HTTPS/TLS enabled
- Firewall rules configured
- Resource limits set
- Health checks configured
- Monitoring in place
- Logging configured
- Backups scheduled
- Documentation updated
- Team trained on operations
- Incident response plan ready
What’s Next?
- Kubernetes Deployment - Deploy on Kubernetes
- Performance Tuning - Optimize performance
- Security Best Practices - Security guidelines
- Troubleshooting Guide - Production issues