Getting Started

Get Semcache up and running as an HTTP proxy in a few minutes.

Quick Start

Pull and run the Semcache Docker image:

docker run -p 8080:8080 semcache/semcache:latest

Semcache will start on http://localhost:8080 and is ready to proxy LLM requests.
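
To confirm the container is serving traffic before wiring in an SDK, you can request the built-in admin dashboard described later in this guide; a successful response means the proxy is up. This is only a quick sanity check:

curl -i http://localhost:8080/admin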

Setting up proxy client

Semcache acts as a drop-in replacement for LLM APIs. Point your existing SDK to Semcache instead of the provider's endpoint:

from openai import OpenAI
import os

# Point to Semcache instead of OpenAI directly
client = OpenAI(
    base_url="http://localhost:8080",    # Semcache endpoint
    api_key=os.getenv("OPENAI_API_KEY")  # Your OpenAI API key
)

# First request - cache miss, forwards to OpenAI
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)
print(f"Response: {response.choices[0].message.content}")

This request will:

  1. Go to Semcache first
  2. Since it's not cached, Semcache forwards it to the upstream provider
  3. The provider responds with the answer
  4. Semcache caches the response and returns it to you
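
If you would rather not change how the client is constructed, recent versions of the OpenAI Python SDK also read their base URL from the OPENAI_BASE_URL environment variable (this is SDK behavior, not something Semcache requires), so an existing application can be routed through Semcache with configuration alone. A minimal sketch:

import os

# Point the SDK at Semcache before the client is constructed
os.environ["OPENAI_BASE_URL"] = "http://localhost:8080"

from openai import OpenAI

client = OpenAI()  # picks up OPENAI_BASE_URL and OPENAI_API_KEY from the environment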

Testing Semantic Similarity

Now try a semantically similar but differently worded question:

# Second request - semantically similar, should be a cache hit
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Tell me France's capital city"}]
)
print(f"Response: {response.choices[0].message.content}")

Even though the wording is different, Semcache recognizes the semantic similarity and returns the cached response instantly - no API call to the upstream provider!
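
A rough way to see the effect is to time two reworded requests back to back; exact numbers depend on your network and the upstream model, so treat this sketch as illustrative only:

import time

def timed_ask(question):
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    elapsed = time.perf_counter() - start
    print(f"{elapsed:.2f}s  {response.choices[0].message.content[:60]}")

timed_ask("What is the capital of France?")    # likely a miss on a fresh cache
timed_ask("Name the capital city of France")   # semantically similar, likely a hit

The second call should come back noticeably faster because no upstream API call is made.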

Checking Cache Status

You can verify cache hits by checking the response headers. On a cache hit, the X-Cache-Status header is set to hit:

# Use with_raw_response to access headers
response = client.chat.completions.with_raw_response.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the capital of France?"}]
)

# Check if it was a cache hit
cache_status = response.headers.get("X-Cache-Status")
print(f"Cache status: {cache_status}") # Should show "hit"

# Access the actual response content
completion = response.parse()
print(f"Response: {completion.choices[0].message.content}")

Setting up a cache-aside instance

Semcache can also be used as a standalone semantic key-value store, where your application stores and retrieves entries explicitly. Install the Python client:

pip install semcache

Then store and look up entries by semantic similarity:

from semcache import Semcache

# Initialize the client
client = Semcache(base_url="http://localhost:8080")

# Store a key-data pair
client.put("What is the capital of France?", "Paris")

# Retrieve data by semantic similarity
response = client.get("Tell me France's capital city.")
print(response) # "Paris"

Monitor Your Cache

Visit the built-in admin dashboard at http://localhost:8080/admin to monitor:

  • Cache hit rates - See how effectively your cache is working
  • Memory usage - Track resource consumption
  • Number of entries - Monitor cache size and eviction

The process is identical across all providers - Semcache automatically detects the provider based on the endpoint path and forwards requests appropriately.
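
For example, if Anthropic is among the providers your deployment supports, the Anthropic Python SDK can be pointed at Semcache in exactly the same way (the model name below is illustrative; use whatever model you normally call):

import os
from anthropic import Anthropic

# Same pattern: swap the base URL, keep your provider API key
client = Anthropic(
    base_url="http://localhost:8080",
    api_key=os.getenv("ANTHROPIC_API_KEY"),
)

message = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=256,
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(message.content[0].text)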

Next Steps

  • LLM Providers & Tools - Configure additional providers like DeepSeek, Mistral, and custom LLMs
  • Configuration - Adjust similarity thresholds and cache behavior
  • Monitoring - Set up production monitoring with Prometheus and Grafana