Embedding Model Configuration

Understanding and configuring the embedding model that powers Semcache's semantic similarity matching.

Current Embedding Model

Semcache uses FastEmbed with the AllMiniLML6V2 model for generating embeddings.

Model Specifications

- Model: sentence-transformers/all-MiniLM-L6-v2
- Dimensions: 384
- Max Sequence Length: 256 tokens
- Model Size: ~23MB
- Language: Primarily English
- Performance: ~50ms per embedding on CPU
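
For reference, loading this model with the fastembed crate looks roughly like the sketch below. The exact InitOptions API differs between fastembed versions, so treat this as an illustrative outline rather than Semcache's actual startup code:

use fastembed::{EmbeddingModel, InitOptions, TextEmbedding};

fn main() -> anyhow::Result<()> {
    // Load all-MiniLM-L6-v2 once at startup (weights are ~23MB)
    let model = TextEmbedding::try_new(InitOptions::new(EmbeddingModel::AllMiniLML6V2))?;

    // Embed a single prompt; fastembed returns one Vec<f32> per input
    let embeddings = model.embed(vec!["What is AI?"], None)?;
    assert_eq!(embeddings[0].len(), 384); // 384 dimensions
    Ok(())
}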

Why This Model?

Advantages:

  • Fast inference: Optimized for speed
  • Good quality: Strong semantic understanding for its size
  • Lightweight: Small memory footprint
  • Proven: Widely used in production systems

Trade-offs:

  • English-focused: Limited multilingual support
  • Shorter context: 256 token limit
  • CPU-only: No GPU acceleration currently

Embedding Process

1. Text Extraction

// Extract user message content
let user_message = request.messages
    .iter()
    .find(|m| m.role == "user")
    .map(|m| m.content.as_str())
    .unwrap_or_default();

2. Embedding Generation

// Generate 384-dimensional vector
let embedding: Vec<f32> = embedding_service
    .generate_embedding(user_message)
    .await?;

3. Vector Normalization

// Normalize for cosine similarity
let normalized = normalize_vector(&embedding);

// Search cache with FAISS using a 0.9 similarity threshold
let similar_entries = faiss_index
    .search(&normalized, 0.9)
    .await?;
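
The normalize_vector helper referenced above isn't defined in this document; a minimal L2-normalization sketch might look like:

// L2-normalize a vector so its magnitude is 1.0.
// After this, cosine similarity reduces to a plain dot product.
fn normalize_vector(v: &[f32]) -> Vec<f32> {
    let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return v.to_vec(); // avoid division by zero for an all-zero vector
    }
    v.iter().map(|x| x / norm).collect()
}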

Performance Characteristics

Embedding Generation Time

Input Length    Generation Time    Use Case
10 words        ~20ms              Short queries
50 words        ~35ms              Typical prompts
100 words       ~50ms              Long prompts
200+ words      ~70ms              Very long prompts
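
These figures are indicative and depend on hardware. To measure generation time yourself, a rough timing sketch against the model handle from the loading example above:

use std::time::Instant;

// `model` is the TextEmbedding handle from the earlier loading sketch
let prompt = "What is AI?";
let start = Instant::now();
let _ = model.embed(vec![prompt], None)?;
println!("embedding took {:?}", start.elapsed());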

Memory Usage

Per embedding: 384 floats × 4 bytes = 1,536 bytes (~1.5KB)
Per cached entry: ~1.5KB + response size

Example:
- 1,000 cached entries: ~1.5MB for embeddings
- Plus response storage: variable (typically 10-100KB each)
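
That arithmetic can be wrapped in a small helper for capacity planning; avg_response_bytes is a hypothetical figure you would estimate from your own workload:

// Rough memory estimate for a cache of `entries` items.
// 384 dims × 4 bytes per f32 = 1,536 bytes of embedding per entry.
fn estimated_cache_bytes(entries: usize, avg_response_bytes: usize) -> usize {
    const EMBEDDING_BYTES: usize = 384 * 4;
    entries * (EMBEDDING_BYTES + avg_response_bytes)
}

// 1,000 entries with ~20KB responses: ~1.5MB of embeddings + ~20MB of responses
let total = estimated_cache_bytes(1_000, 20 * 1024);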

Model Comparison

Available Models (Future Support)

Model                    Dimensions   Speed    Quality   Languages   Size
all-MiniLM-L6-v2         384          Fast     Good      EN          23MB
all-mpnet-base-v2        768          Medium   Better    EN          120MB
multilingual-e5-small    384          Fast     Good      Multi       118MB
text-embedding-ada-002   1536         API      Best      Multi       API

Choosing the Right Model

For speed (current):

  • all-MiniLM-L6-v2 ✅
  • Best for high-throughput applications

For quality (future):

  • all-mpnet-base-v2
  • Better semantic understanding, 2x slower

For multilingual (future):

  • multilingual-e5-small
  • Support for 100+ languages

For maximum quality (future):

  • OpenAI API (text-embedding-ada-002)
  • Requires a network API call per embedding, adding latency and per-request cost

Similarity Calculation

Cosine Similarity

similarity = (a · b) / (||a|| × ||b||)

Where:
- a, b are normalized embedding vectors
- · is the dot product
- ||·|| is the vector magnitude (1.0 for normalized vectors)

Because Semcache normalizes every embedding, the denominator is always 1 and the similarity reduces to the dot product a · b.
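
A minimal sketch of that check in Rust, assuming pre-normalized vectors and a hypothetical query_embedding/cached_embedding pair:

// Cosine similarity for pre-normalized vectors: with magnitudes of 1.0,
// the formula collapses to the dot product a · b.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

// A cache hit occurs when similarity meets the configured threshold
let is_hit = cosine_similarity(&query_embedding, &cached_embedding) >= 0.9;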

Text Preprocessing

Current Processing

  • No preprocessing: Raw user message used directly
  • Token limits: Truncated at model's max length (256 tokens)
  • Encoding: UTF-8 text passed to FastEmbed

Future Preprocessing Options

# Planned preprocessing configuration
embedding:
  preprocessing:
    lowercase: true
    remove_punctuation: false
    remove_stopwords: false
    max_length: 256
    truncation: "end" # start, end, middle

Language Support

Current Support

  • Primary: English
  • Limited: Other Latin-script languages
  • Poor: Non-Latin scripts (Arabic, Chinese, etc.)

Example Similarity Scores

"What is AI?" vs "What is artificial intelligence?"
→ Similarity: ~0.92 ✅

"Qu'est-ce que l'IA?" vs "What is AI?"
→ Similarity: ~0.75 (French, may miss cache)

"什么是人工智能?" vs "What is AI?"
→ Similarity: ~0.45 (Chinese, will miss cache)

Future Multilingual Support

embedding:
  model: "multilingual-e5-small"
  languages: ["en", "es", "fr", "de", "zh"]

Custom Embedding Models

Future Support for Custom Models

Local models:

embedding:
  type: "local"
  model_path: "/path/to/custom/model"
  device: "cpu" # or "cuda"

API-based models:

embedding:
  type: "api"
  provider: "openai"
  model: "text-embedding-ada-002"
  api_key: "${OPENAI_API_KEY}"

Hugging Face models:

embedding:
  type: "huggingface"
  model: "sentence-transformers/paraphrase-MiniLM-L6-v2"
  revision: "main"

Troubleshooting

Common Issues

Slow embedding generation:

  • Monitor CPU usage during embedding generation
  • Consider a model with fewer parameters
  • Check for memory swapping

Poor similarity matching:

  • Verify language compatibility
  • Check similarity threshold setting
  • Test with known similar prompts

Memory issues:

  • Each embedding uses ~1.5KB of RAM
  • Large caches require significant RAM
  • Monitor total memory usage

Debugging Tools

Admin dashboard:

http://localhost:8080/admin

Manual similarity testing:

# Test similarity between prompts (future CLI tool)
semcache similarity "What is AI?" "What is artificial intelligence?"
# Output: 0.92
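
Until that CLI exists, you can approximate the same check directly against the model. The sketch below reuses the normalize_vector and cosine_similarity helpers defined earlier; the fastembed API details are assumptions:

use fastembed::{EmbeddingModel, InitOptions, TextEmbedding};

fn main() -> anyhow::Result<()> {
    let model = TextEmbedding::try_new(InitOptions::new(EmbeddingModel::AllMiniLML6V2))?;
    let embeddings = model.embed(
        vec!["What is AI?", "What is artificial intelligence?"],
        None,
    )?;

    // normalize_vector and cosine_similarity are the sketches shown earlier
    let a = normalize_vector(&embeddings[0]);
    let b = normalize_vector(&embeddings[1]);
    println!("similarity: {:.2}", cosine_similarity(&a, &b));
    Ok(())
}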

Performance Optimization

CPU Optimization

  • FastEmbed runs on the optimized ONNX Runtime
  • Benefits from modern CPU features (AVX, etc.)
  • Multi-core systems can handle concurrent requests

Memory Optimization

  • Embeddings cached in memory
  • Model loaded once at startup (see the sketch after this list)
  • Automatic cleanup of old embeddings
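
Loading the model exactly once is a standard pattern; one way to express it with std::sync::OnceLock (a sketch, not Semcache's actual internals):

use std::sync::OnceLock;
use fastembed::{EmbeddingModel, InitOptions, TextEmbedding};

static MODEL: OnceLock<TextEmbedding> = OnceLock::new();

// Returns the shared model, initializing it on first use only
fn model() -> &'static TextEmbedding {
    MODEL.get_or_init(|| {
        TextEmbedding::try_new(InitOptions::new(EmbeddingModel::AllMiniLML6V2))
            .expect("failed to load embedding model")
    })
}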

Future GPU Support

embedding:
  device: "cuda" # Use GPU acceleration
  batch_size: 32 # Process multiple embeddings together

Model Updates

Current Limitations

  • Model fixed at compile time
  • No runtime model switching
  • Updates require application restart

Future Flexibility

  • Hot-swappable models
  • A/B testing different models
  • Gradual migration between models