
Caching

Cache LLM Responses

note

For OpenAI/Anthropic Prompt Caching, go here

LiteLLM supports:

  • In Memory Cache
  • Redis Cache
  • Qdrant Semantic Cache
  • Redis Semantic Cache
  • s3 Bucket Cache

Quick Start - Redis, s3 Cache, Semantic Cache

Caching can be enabled by adding the cache key in the config.yaml

Step 1: Add cache to the config.yaml

model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
  - model_name: text-embedding-ada-002
    litellm_params:
      model: text-embedding-ada-002

litellm_settings:
  set_verbose: True
  cache: True # set cache responses to True, litellm defaults to using a redis cache

[OPTIONAL] Step 1.5: Add redis namespaces, default ttl

Namespace

If you want to create some folder for your keys, you can set a namespace, like this:

litellm_settings:
  cache: true
  cache_params: # set cache params for redis
    type: redis
    namespace: "litellm.caching.caching"

and keys will be stored like:

litellm.caching.caching:<hash>
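
To inspect which keys ended up under that namespace, you can scan Redis for the prefix. A quick illustration using redis-cli (not required by LiteLLM itself):

redis-cli --scan --pattern "litellm.caching.caching:*"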

Redis Cluster

model_list:
  - model_name: "*"
    litellm_params:
      model: "*"

litellm_settings:
  cache: True
  cache_params:
    type: redis
    redis_startup_nodes: [{"host": "127.0.0.1", "port": "7001"}]

Redis Sentinel

model_list:
  - model_name: "*"
    litellm_params:
      model: "*"

litellm_settings:
  cache: true
  cache_params:
    type: "redis"
    service_name: "mymaster"
    sentinel_nodes: [["localhost", 26379]]

TTL

litellm_settings:
  cache: true
  cache_params: # set cache params for redis
    type: redis
    ttl: 600 # will be cached on redis for 600s
    # default_in_memory_ttl: Optional[float], default is None. time in seconds.
    # default_in_redis_ttl: Optional[float], default is None. time in seconds.

SSL

Just set REDIS_SSL="True" in your .env, and LiteLLM will pick it up.

REDIS_SSL="True"

For quick testing, you can also use REDIS_URL, e.g.:

REDIS_URL="rediss://.."

However, we don't recommend using REDIS_URL in production. We've noticed a performance difference between using it and setting redis_host, port, etc. individually.

Step 2: Add Redis Credentials to .env

Set either REDIS_URL or REDIS_HOST in your OS environment to enable caching.

REDIS_URL = ""        # REDIS_URL='redis://username:password@hostname:port/database'
## OR ##
REDIS_HOST = "" # REDIS_HOST='redis-18841.c274.us-east-1-3.ec2.cloud.redislabs.com'
REDIS_PORT = "" # REDIS_PORT='18841'
REDIS_PASSWORD = "" # REDIS_PASSWORD='liteLlmIsAmazing'

Additional kwargs

You can pass any additional redis.Redis argument by storing the variable + value in your OS environment, like this:

REDIS_<redis-kwarg-name> = ""
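
For example, socket_timeout is a standard redis.Redis argument; assuming you want to pass it through, the entry could look like this (value chosen purely for illustration):

REDIS_SOCKET_TIMEOUT = "5" # forwarded to redis.Redis(socket_timeout=...)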

See how it's read from the environment

Step 3: Run proxy with config

$ litellm --config /path/to/config.yaml

Using Caching - /chat/completions

Send the same request twice:

curl http://0.0.0.0:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "write a poem about litellm!"}],
"temperature": 0.7
}'

curl http://0.0.0.0:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "write a poem about litellm!"}],
"temperature": 0.7
}'

Set cache for proxy, but not on the actual LLM API call

Use this if you just want to enable features like rate limiting and load balancing across multiple instances.

Set supported_call_types: [] to disable caching on the actual API call.

litellm_settings:
  cache: True
  cache_params:
    type: redis
    supported_call_types: []

Debugging Caching - /cache/ping

LiteLLM Proxy exposes a /cache/ping endpoint to test if the cache is working as expected.

Usage

curl --location 'http://0.0.0.0:4000/cache/ping'  -H "Authorization: Bearer sk-1234"

Expected Response - when cache healthy

{
  "status": "healthy",
  "cache_type": "redis",
  "ping_response": true,
  "set_cache_response": "success",
  "litellm_cache_params": {
    "supported_call_types": "['completion', 'acompletion', 'embedding', 'aembedding', 'atranscription', 'transcription']",
    "type": "redis",
    "namespace": "None"
  },
  "redis_cache_params": {
    "redis_client": "Redis<ConnectionPool<Connection<host=redis-16337.c322.us-east-1-2.ec2.cloud.redislabs.com,port=16337,db=0>>>",
    "redis_kwargs": "{'url': 'redis://:******@redis-16337.c322.us-east-1-2.ec2.cloud.redislabs.com:16337'}",
    "async_redis_conn_pool": "BlockingConnectionPool<Connection<host=redis-16337.c322.us-east-1-2.ec2.cloud.redislabs.com,port=16337,db=0>>",
    "redis_version": "7.2.0"
  }
}

Advanced

Control Call Types Caching is on for - (/chat/completions, /embeddings, etc.)

By default, caching is on for all call types. You can control which call types caching is on for by setting supported_call_types in cache_params.

Caching will only be on for the call types specified in supported_call_types.

litellm_settings:
  cache: True
  cache_params:
    type: redis
    supported_call_types: ["acompletion", "atext_completion", "aembedding", "atranscription"]
    # /chat/completions, /completions, /embeddings, /audio/transcriptions

Set Cache Params on config.yaml

model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
  - model_name: text-embedding-ada-002
    litellm_params:
      model: text-embedding-ada-002

litellm_settings:
  set_verbose: True
  cache: True # set cache responses to True, litellm defaults to using a redis cache
  cache_params: # cache_params are optional
    type: "redis" # The type of cache to initialize. Can be "local" or "redis". Defaults to "local".
    host: "localhost" # The host address for the Redis cache. Required if type is "redis".
    port: 6379 # The port number for the Redis cache. Required if type is "redis".
    password: "your_password" # The password for the Redis cache. Required if type is "redis".

    # Optional configurations
    supported_call_types: ["acompletion", "atext_completion", "aembedding", "atranscription"]
    # /chat/completions, /completions, /embeddings, /audio/transcriptions

Turn on / off caching per request.

The proxy supports 4 cache controls:

  • ttl: Optional(int) - Will cache the response for the user-defined amount of time (in seconds).
  • s-maxage: Optional(int) - Will only accept cached responses that are within the user-defined range (in seconds).
  • no-cache: Optional(bool) - Will not return a cached response, but instead call the actual endpoint.
  • no-store: Optional(bool) - Will not cache the response.

Let us know if you need more

Turn off caching

Set no-cache=True; this will not return a cached response.

import os
from openai import OpenAI

client = OpenAI(
    # This is the default and can be omitted
    api_key=os.environ.get("OPENAI_API_KEY"),
    base_url="http://0.0.0.0:4000"
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Say this is a test",
        }
    ],
    model="gpt-3.5-turbo",
    extra_body={ # OpenAI python accepts extra args in extra_body
        "cache": {
            "no-cache": True # will not return a cached response
        }
    }
)

Turn on caching

By default, the cache is always on.

import os
from openai import OpenAI

client = OpenAI(
    # This is the default and can be omitted
    api_key=os.environ.get("OPENAI_API_KEY"),
    base_url="http://0.0.0.0:4000"
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Say this is a test",
        }
    ],
    model="gpt-3.5-turbo"
)

Set ttl

Set ttl=600; this will cache the response for 10 minutes (600 seconds).

import os
from openai import OpenAI

client = OpenAI(
    # This is the default and can be omitted
    api_key=os.environ.get("OPENAI_API_KEY"),
    base_url="http://0.0.0.0:4000"
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Say this is a test",
        }
    ],
    model="gpt-3.5-turbo",
    extra_body={ # OpenAI python accepts extra args in extra_body
        "cache": {
            "ttl": 600 # caches response for 10 minutes
        }
    }
)

Set s-maxage

Set s-maxage=600; this will only accept responses cached within the last 10 minutes.

import os
from openai import OpenAI

client = OpenAI(
    # This is the default and can be omitted
    api_key=os.environ.get("OPENAI_API_KEY"),
    base_url="http://0.0.0.0:4000"
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Say this is a test",
        }
    ],
    model="gpt-3.5-turbo",
    extra_body={ # OpenAI python accepts extra args in extra_body
        "cache": {
            "s-maxage": 600 # only get responses cached within last 10 minutes
        }
    }
)
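
Set no-store

Set no-store to skip writing the response to the cache. A sketch using curl: the cache field is sent directly in the JSON request body, which is how the extra_body examples above reach the proxy as well (assumed to be handled the same way here):

curl http://0.0.0.0:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Say this is a test"}],
    "cache": {"no-store": true}
  }'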

Turn on / off caching per Key.

  1. Add cache params when creating a key
curl -X POST 'http://0.0.0.0:4000/key/generate' \
  -H 'Authorization: Bearer sk-1234' \
  -H 'Content-Type: application/json' \
  -d '{
    "user_id": "222",
    "metadata": {
      "cache": {
        "no-cache": true
      }
    }
  }'
  2. Test it!
curl -X POST 'http://localhost:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <YOUR_NEW_KEY>' \
-d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "bom dia"}]}'

Deleting Cache Keys - /cache/delete

To delete a cache key, send a request to /cache/delete with the keys you want to delete.

Example

curl -X POST "http://0.0.0.0:4000/cache/delete" \
-H "Authorization: Bearer sk-1234" \
-d '{"keys": ["586bf3f3c1bf5aecb55bd9996494d3bbc69eb58397163add6d49537762a7548d", "key2"]}'
# {"status":"success"}

Viewing Cache Keys from responses

You can view the cache_key in the response headers; on cache hits, the cache key is returned in the x-litellm-cache-key response header.

curl -i --location 'http://0.0.0.0:4000/chat/completions' \
  --header 'Authorization: Bearer sk-1234' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "gpt-3.5-turbo",
    "user": "ishan",
    "messages": [
      {
        "role": "user",
        "content": "what is litellm"
      }
    ]
  }'

Response from litellm proxy

date: Thu, 04 Apr 2024 17:37:21 GMT
content-type: application/json
x-litellm-cache-key: 586bf3f3c1bf5aecb55bd9996494d3bbc69eb58397163add6d49537762a7548d

{
  "id": "chatcmpl-9ALJTzsBlXR9zTxPvzfFFtFbFtG6T",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "I'm sorr..",
        "role": "assistant"
      }
    }
  ],
  "created": 1712252235
}

Set Caching Default Off - Opt in only

  1. Set mode: default_off for caching
model_list:
  - model_name: fake-openai-endpoint
    litellm_params:
      model: openai/fake
      api_key: fake-key
      api_base: https://exampleopenaiendpoint-production.up.railway.app/

# default off mode
litellm_settings:
  set_verbose: True
  cache: True
  cache_params:
    mode: default_off # 👈 Key change: cache is default_off
  2. Opting in to cache when cache is default off
import os
from openai import OpenAI

client = OpenAI(api_key=<litellm-api-key>, base_url="http://0.0.0.0:4000")

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Say this is a test",
        }
    ],
    model="gpt-3.5-turbo",
    extra_body={ # OpenAI python accepts extra args in extra_body
        "cache": {"use-cache": True}
    }
)

Turn on batch_redis_requests

What does it do? When a request is made:

  • Check if a key starting with litellm:<hashed_api_key>:<call_type>: exists in-memory; if not, get the last 100 cached requests for this key and store them

  • New requests are stored with this litellm:.. as the namespace

Why? It reduces the number of Redis GET requests. This improved latency by 46% in prod load tests.

Usage

litellm_settings:
  cache: true
  cache_params:
    type: redis
    ... # remaining redis args (host, port, etc.)
  callbacks: ["batch_redis_requests"] # 👈 KEY CHANGE!

SEE CODE

Supported cache_params on proxy config.yaml

cache_params:
  # ttl
  ttl: Optional[float]
  default_in_memory_ttl: Optional[float]
  default_in_redis_ttl: Optional[float]

  # Type of cache (options: "local", "redis", "s3")
  type: s3

  # List of litellm call types to cache for
  # Options: "completion", "acompletion", "embedding", "aembedding"
  supported_call_types: ["acompletion", "atext_completion", "aembedding", "atranscription"]
  # /chat/completions, /completions, /embeddings, /audio/transcriptions

  # Redis cache parameters
  host: localhost # Redis server hostname or IP address
  port: "6379" # Redis server port (as a string)
  password: secret_password # Redis server password
  namespace: Optional[str] = None

  # S3 cache parameters
  s3_bucket_name: your_s3_bucket_name # Name of the S3 bucket
  s3_region_name: us-west-2 # AWS region of the S3 bucket
  s3_api_version: 2006-03-01 # AWS S3 API version
  s3_use_ssl: true # Use SSL for S3 connections (options: true, false)
  s3_verify: true # SSL certificate verification for S3 connections (options: true, false)
  s3_endpoint_url: https://s3.amazonaws.com # S3 endpoint URL
  s3_aws_access_key_id: your_access_key # AWS Access Key ID for S3
  s3_aws_secret_access_key: your_secret_key # AWS Secret Access Key for S3
  s3_aws_session_token: your_session_token # AWS Session Token for temporary credentials

Advanced - user api key cache ttl

Configure how long the in-memory cache stores the key object (prevents db requests)

general_settings:
  user_api_key_cache_ttl: <your-number> # time in seconds

By default this value is set to 60s.