LocalAI provides a REST API that is compatible with OpenAI’s API specification. This document provides a complete reference for all available endpoints.

Base URL

All API requests should be made to:

  http://localhost:8080/v1
  

For production deployments, replace localhost:8080 with your server’s address.

Authentication

If API keys are configured (via the API_KEY environment variable), include the key in the Authorization header:

  Authorization: Bearer your-api-key
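
For example, with Python's requests library (the key value here is a placeholder for whatever you configured):

  import requests

  # Pass the configured API key in the Authorization header.
  response = requests.post(
      "http://localhost:8080/v1/chat/completions",
      headers={"Authorization": "Bearer your-api-key"},
      json={"model": "gpt-4", "messages": [{"role": "user", "content": "Hello!"}]},
  )
  print(response.json())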
  

Endpoints

Chat Completions

Create a model response for the given chat conversation.

Endpoint: POST /v1/chat/completions

Request Body:

  {
    "model": "gpt-4",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7,
    "max_tokens": 100,
    "top_p": 1.0,
    "top_k": 40,
    "stream": false
  }
  

Parameters:

Parameter     Type     Description                   Default
model         string   The model to use              Required
messages      array    Array of message objects      Required
temperature   number   Sampling temperature (0-2)    0.7
max_tokens    integer  Maximum tokens to generate    Model default
top_p         number   Nucleus sampling parameter    1.0
top_k         integer  Top-k sampling parameter      40
stream        boolean  Stream responses              false
tools         array    Available tools/functions     -
tool_choice   string   Tool selection mode           "auto"

Response:

  {
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "created": 1677652288,
    "choices": [{
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"
    }],
    "usage": {
      "prompt_tokens": 9,
      "completion_tokens": 12,
      "total_tokens": 21
    }
  }
  

Example:

  curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "gpt-4",
      "messages": [{"role": "user", "content": "Hello!"}]
    }'
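
Because the API follows OpenAI's specification, the official openai Python client should also work when pointed at LocalAI. A minimal sketch (assumes the openai package is installed; api_key can be any placeholder unless keys are configured):

  from openai import OpenAI

  # Point the client at the LocalAI server instead of api.openai.com.
  client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

  completion = client.chat.completions.create(
      model="gpt-4",
      messages=[{"role": "user", "content": "Hello!"}],
  )
  print(completion.choices[0].message.content)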
  

Completions

Create a completion for the provided prompt.

Endpoint: POST /v1/completions

Request Body:

  {
    "model": "gpt-4",
    "prompt": "The capital of France is",
    "temperature": 0.7,
    "max_tokens": 10
  }
  

Parameters:

Parameter    Type     Description
model        string   The model to use
prompt       string   The prompt to complete
temperature  number   Sampling temperature
max_tokens   integer  Maximum tokens to generate
top_p        number   Nucleus sampling
top_k        integer  Top-k sampling
stream       boolean  Stream responses

Example:

  curl http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "gpt-4",
      "prompt": "The capital of France is",
      "max_tokens": 10
    }'
  

Edits

Create an edited version of the input.

Endpoint: POST /v1/edits

Request Body:

  {
    "model": "gpt-4",
    "instruction": "Make it more formal",
    "input": "Hey, how are you?",
    "temperature": 0.7
  }
  

Example:

  curl http://localhost:8080/v1/edits \
    -H "Content-Type: application/json" \
    -d '{
      "model": "gpt-4",
      "instruction": "Make it more formal",
      "input": "Hey, how are you?"
    }'
  

Embeddings

Get a vector representation of input text.

Endpoint: POST /v1/embeddings

Request Body:

  {
    "model": "text-embedding-ada-002",
    "input": "The food was delicious"
  }
  

Response:

  {
    "object": "list",
    "data": [{
      "object": "embedding",
      "embedding": [0.1, 0.2, 0.3, ...],
      "index": 0
    }],
    "usage": {
      "prompt_tokens": 4,
      "total_tokens": 4
    }
  }
  

Example:

  curl http://localhost:8080/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{
      "model": "text-embedding-ada-002",
      "input": "The food was delicious"
    }'
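
The returned vectors can be compared directly, for example with cosine similarity. A small sketch using requests (the embed helper is illustrative, not part of the API):

  import math
  import requests

  def embed(text):
      # Fetch the embedding vector for a single input string.
      resp = requests.post(
          "http://localhost:8080/v1/embeddings",
          json={"model": "text-embedding-ada-002", "input": text},
      )
      resp.raise_for_status()
      return resp.json()["data"][0]["embedding"]

  a = embed("The food was delicious")
  b = embed("The meal tasted great")

  # Cosine similarity: dot(a, b) / (|a| * |b|)
  dot = sum(x * y for x, y in zip(a, b))
  norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
  print(dot / norms)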
  

Audio Transcription

Transcribe audio into the input language.

Endpoint: POST /v1/audio/transcriptions

Request: multipart/form-data

Form Fields:

Field            Type    Description
file             file    Audio file to transcribe
model            string  Model to use (e.g., "whisper-1")
language         string  Language code (optional)
prompt           string  Optional text prompt
response_format  string  Response format (json, text, etc.)

Example:

  curl http://localhost:8080/v1/audio/transcriptions \
    -H "Authorization: Bearer not-needed" \
    -F file="@audio.mp3" \
    -F model="whisper-1"
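
The same request in Python, letting requests build the multipart form (audio.mp3 is a placeholder path):

  import requests

  with open("audio.mp3", "rb") as f:
      resp = requests.post(
          "http://localhost:8080/v1/audio/transcriptions",
          files={"file": f},            # file form field
          data={"model": "whisper-1"},  # plain form fields
      )
  print(resp.json())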
  

Audio Speech (Text-to-Speech)

Generate audio from text.

Endpoint: POST /v1/audio/speech

Request Body:

  {
    "model": "tts-1",
    "input": "Hello, this is a test",
    "voice": "alloy",
    "response_format": "mp3"
  }
  

Parameters:

Parameter        Type    Description
model            string  TTS model to use
input            string  Text to convert to speech
voice            string  Voice to use (alloy, echo, fable, etc.)
response_format  string  Audio format (mp3, opus, etc.)

Example:

  curl http://localhost:8080/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
      "model": "tts-1",
      "input": "Hello, this is a test",
      "voice": "alloy"
    }' \
    --output speech.mp3
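
Equivalently in Python, writing the raw audio bytes from the response body to disk:

  import requests

  resp = requests.post(
      "http://localhost:8080/v1/audio/speech",
      json={"model": "tts-1", "input": "Hello, this is a test", "voice": "alloy"},
  )
  resp.raise_for_status()

  # The body is the encoded audio itself (mp3 here), not JSON.
  with open("speech.mp3", "wb") as f:
      f.write(resp.content)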
  

Image Generation

Generate images from text prompts.

Endpoint: POST /v1/images/generations

Request Body:

  {
    "prompt": "A cute baby sea otter",
    "n": 1,
    "size": "256x256",
    "response_format": "url"
  }
  

Parameters:

Parameter        Type     Description
prompt           string   Text description of the image
n                integer  Number of images to generate
size             string   Image size (256x256, 512x512, etc.)
response_format  string   Response format (url, b64_json)

Example:

  curl http://localhost:8080/v1/images/generations \
    -H "Content-Type: application/json" \
    -d '{
      "prompt": "A cute baby sea otter",
      "size": "256x256"
    }'
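
With "response_format": "b64_json" the image is returned base64-encoded inside the JSON response rather than as a URL. A sketch that decodes and saves it (the output file name is illustrative):

  import base64
  import requests

  resp = requests.post(
      "http://localhost:8080/v1/images/generations",
      json={
          "prompt": "A cute baby sea otter",
          "size": "256x256",
          "response_format": "b64_json",
      },
  )
  resp.raise_for_status()

  # Decode the base64 payload and write it out as an image file.
  image_bytes = base64.b64decode(resp.json()["data"][0]["b64_json"])
  with open("otter.png", "wb") as f:
      f.write(image_bytes)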
  

List Models

List all available models.

Endpoint: GET /v1/models

Query Parameters:

Parameter          Type     Description
filter             string   Filter models by name
excludeConfigured  boolean  Exclude configured models

Response:

  {
    "object": "list",
    "data": [
      {
        "id": "gpt-4",
        "object": "model"
      },
      {
        "id": "gpt-4-vision-preview",
        "object": "model"
      }
    ]
  }
  

Example:

  curl http://localhost:8080/v1/models
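
The query parameters from the table above can be passed as ordinary URL parameters; a short Python sketch (the filter value is illustrative):

  import requests

  # List only models whose names match the filter.
  resp = requests.get(
      "http://localhost:8080/v1/models",
      params={"filter": "gpt"},
  )
  print([m["id"] for m in resp.json()["data"]])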
  

Streaming Responses

Many endpoints support streaming. Set "stream": true in the request:

  curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "gpt-4",
      "messages": [{"role": "user", "content": "Hello!"}],
      "stream": true
    }'
  

Streamed responses are sent as Server-Sent Events (SSE):

  data: {"id":"chatcmpl-123","object":"chat.completion.chunk",...}

  data: {"id":"chatcmpl-123","object":"chat.completion.chunk",...}

  data: [DONE]
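
In Python the stream can be consumed incrementally, for example with requests. Each data: line carries one JSON chunk in the OpenAI chunk format shown above; the sketch below prints the content deltas as they arrive:

  import json
  import requests

  resp = requests.post(
      "http://localhost:8080/v1/chat/completions",
      json={
          "model": "gpt-4",
          "messages": [{"role": "user", "content": "Hello!"}],
          "stream": True,
      },
      stream=True,  # let requests yield the body as it arrives
  )

  for line in resp.iter_lines():
      if not line.startswith(b"data: "):
          continue  # skip keep-alives and blank lines
      payload = line[len(b"data: "):]
      if payload == b"[DONE]":  # sentinel marking the end of the stream
          break
      chunk = json.loads(payload)
      # Each chunk carries an incremental piece of the assistant message.
      print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)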
  

Error Handling

Error Response Format

  {
    "error": {
      "message": "Error description",
      "type": "invalid_request_error",
      "code": 400
    }
  }
  

Common Error Codes

Code  Description
400   Bad Request - Invalid parameters
401   Unauthorized - Missing or invalid API key
404   Not Found - Model or endpoint not found
429   Too Many Requests - Rate limit exceeded
500   Internal Server Error - Server error
503   Service Unavailable - Model not loaded

Example Error Handling

  import requests

  try:
      response = requests.post(
          "http://localhost:8080/v1/chat/completions",
          json={
              "model": "gpt-4",
              "messages": [{"role": "user", "content": "Hello!"}],
          },
          timeout=30,
      )
      response.raise_for_status()
      data = response.json()
  except requests.exceptions.HTTPError as e:
      if e.response.status_code == 404:
          print("Model not found")
      elif e.response.status_code == 503:
          print("Model not loaded")
      else:
          print(f"Error: {e}")
  

Rate Limiting

LocalAI doesn’t enforce rate limiting by default. For production deployments, implement rate limiting at the reverse proxy or application level.

Best Practices

  1. Use appropriate timeouts: Set reasonable timeouts for requests
  2. Handle errors gracefully: Implement retry logic with exponential backoff (see the sketch after this list)
  3. Monitor token usage: Track usage fields in responses
  4. Use streaming for long responses: Enable streaming for better user experience
  5. Cache embeddings: Cache embedding results when possible
  6. Batch requests: Process multiple items together when possible
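
A minimal retry sketch with exponential backoff for point 2 (attempt counts and delays are illustrative):

  import time
  import requests

  def post_with_retries(url, payload, attempts=5):
      # Retry transient failures, doubling the delay after each attempt.
      for attempt in range(attempts):
          try:
              resp = requests.post(url, json=payload, timeout=30)
              resp.raise_for_status()
              return resp.json()
          except requests.exceptions.RequestException:
              if attempt == attempts - 1:
                  raise  # give up after the final attempt
              time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...

In practice you may want to retry only on transient statuses such as 429 and 503 rather than on every error.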
