Semantic Caching Explained: 60% API Call Reduction

Quick Answer

Semantic caching uses vector embeddings to match a new query against earlier, similar queries and returns the cached response instead of calling the API. In production, this cut our API calls by close to 60% (58% cache hit rate by month 6). Key components: a vector DB, an embedding model, and a similarity threshold (0.95 or higher).

The Problem: Repeated API Calls

In production AI systems, 30-50% of queries are often semantically equivalent to a query that has already been answered:

“Reset my password” = “I forgot my password” = “Can’t access account”
“What’s the price?” = “How much does it cost?” = “Price check”

Each group gets the same response, yet without a cache every variant hits the API.

How Semantic Caching Works

User Query -> Embed Query -> Search Vector DB -> Similar Found?
                                                    |
                                    Yes ------------+------------ No
                                    |                              |
                              Return Cached                   Call AI API
                              Response                        Store Response

Step 1: Embed the Query

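from sentence_transformers import SentenceTransformer

# Load once at startup; all-MiniLM-L6-v2 produces 384-dimensional embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')

def embed_query(text: str) -> list[float]:
    # encode() returns a NumPy array; convert it to a plain list for storage
    return model.encode(text).tolist()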
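
The threshold used later is applied to a similarity score between two embeddings. The sketch below assumes that score is cosine similarity, which is the usual choice for sentence embeddings but is not stated explicitly in this article. The three-dimensional vectors are hand-made stand-ins for real model output.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)

# Toy vectors for illustration only; real all-MiniLM-L6-v2 embeddings
# have 384 dimensions.
query_vec = [0.9, 0.1, 0.0]
near_vec = [0.8, 0.2, 0.0]   # points in almost the same direction
far_vec = [0.0, 0.1, 0.9]    # points in a very different direction

print(cosine_similarity(query_vec, near_vec))
print(cosine_similarity(query_vec, far_vec))
```

With a 0.95 threshold, `near_vec` would count as a match for `query_vec` and `far_vec` would not.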

Step 2: Store Responses

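import hashlib
from datetime import datetime, timezone

def hash_query(query: str) -> str:
    # Stable ID: Python's built-in hash() differs between processes
    return hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()

async def store_response(query: str, response: str, cache):
    # `cache` is any async vector-store client that exposes insert()
    embedding = embed_query(query)
    await cache.insert(
        id=hash_query(query),
        vector=embedding,
        response=response,
        metadata={"created": datetime.now(timezone.utc).isoformat()}
    )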
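
The `cache` object above is not tied to a specific product. For local testing, a minimal in-memory stand-in can look like the sketch below. `InMemoryCache` and `_cosine` are hypothetical names; the `insert` and `search` signatures simply mirror how the snippets in this article call them, and the linear scan is only suitable for small test sets.

```python
import asyncio
import math

def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class InMemoryCache:
    """Hypothetical stand-in for a vector DB client (linear scan, no persistence)."""

    def __init__(self):
        self._rows = {}  # id -> {"vector", "response", "metadata"}

    async def insert(self, id: str, vector: list[float], response: str, metadata: dict):
        # Re-inserting the same id overwrites the previous entry
        self._rows[id] = {"vector": vector, "response": response, "metadata": metadata}

    async def search(self, vector: list[float], top_k: int = 1, threshold: float = 0.0):
        scored = []
        for row in self._rows.values():
            score = _cosine(vector, row["vector"])
            if score >= threshold:
                scored.append({"score": score, "response": row["response"],
                               "metadata": row["metadata"]})
        # Best match first, truncated to top_k
        scored.sort(key=lambda r: r["score"], reverse=True)
        return scored[:top_k]

async def demo():
    cache = InMemoryCache()
    await cache.insert(id="q1", vector=[1.0, 0.0],
                       response="Use the reset link.", metadata={})
    hit = await cache.search(vector=[0.99, 0.05], top_k=1, threshold=0.95)
    miss = await cache.search(vector=[0.0, 1.0], top_k=1, threshold=0.95)
    return hit, miss

hit, miss = asyncio.run(demo())
```

A real vector database replaces the linear scan with an approximate nearest-neighbour index, but the calling code can stay the same.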

Step 3: Check Cache First

async def get_cached_or_compute(query: str, compute_fn, cache, threshold=0.95):
    """Return (response, was_cache_hit) for a query."""
    embedding = embed_query(query)

    # Ask the vector store for the single nearest stored query
    results = await cache.search(
        vector=embedding,
        top_k=1,
        threshold=threshold
    )

    # Re-check the score in case the client ignores the threshold argument
    if results and results[0]['score'] >= threshold:
        return results[0]['response'], True  # Cache hit

    # Miss - compute and store
    response = await compute_fn(query)
    await store_response(query, response, cache)
    return response, False
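
The second element of the returned tuple makes it easy to measure hit rate. A minimal sketch, assuming you record that flag for every request; `CacheStats` is a hypothetical helper, and the list of flags below is simulated rather than real traffic.

```python
from dataclasses import dataclass

@dataclass
class CacheStats:
    """Running count of cache hits and misses."""
    hits: int = 0
    misses: int = 0

    def record(self, was_hit: bool) -> None:
        if was_hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

stats = CacheStats()
# Simulated was_hit flags, in the shape get_cached_or_compute returns them
for was_hit in [False, True, True, False, True]:
    stats.record(was_hit)

print(f"hit rate: {stats.hit_rate:.0%}")
```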

Production Results

After implementing semantic caching:

Month	Hit Rate	API Calls Saved	Monthly Savings
1	38%	114,000	$1,520
2	45%	135,000	$1,800
3	52%	156,000	$2,080
6	58%	174,000	$2,320

By month 6: 58% cache hit rate, about $2,320/month in savings
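
The table is internally consistent with a fixed volume of about 300,000 calls per month and a cost of roughly $0.0133 per call. Both figures are derived here from the table (114,000 / 0.38 and $1,520 / 114,000); the article does not state them directly. The sketch below reproduces the rows under that assumption and ignores the cost of embeddings and the vector DB itself.

```python
def monthly_savings(total_calls: int, hit_rate: float, cost_per_call: float) -> tuple[int, float]:
    """Return (calls_saved, dollars_saved) for one month."""
    calls_saved = round(total_calls * hit_rate)
    return calls_saved, calls_saved * cost_per_call

TOTAL_CALLS = 300_000            # derived from the table: 114,000 / 0.38
COST_PER_CALL = 1520 / 114_000   # derived from the table: about $0.0133

for month, rate in [(1, 0.38), (2, 0.45), (3, 0.52), (6, 0.58)]:
    saved, dollars = monthly_savings(TOTAL_CALLS, rate, COST_PER_CALL)
    print(f"month {month}: {saved:,} calls saved, ${dollars:,.0f}")
```

Swapping in your own call volume and per-call cost gives a quick estimate of what a given hit rate is worth.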

Choosing Similarity Threshold

Threshold	Hit Rate	Quality Risk
0.99	Very low	None
0.95	Medium	Low
0.90	High	Medium
0.85	Very high	High

Recommendation: Start at 0.95, adjust based on quality feedback.
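
One way to turn quality feedback into a threshold choice is to hand-label a sample of query pairs and measure, for each candidate threshold, how many would hit the cache and how many of those hits would be wrong. The sketch below assumes such a labelled sample exists; the data in `SAMPLE` is invented for illustration and `evaluate_threshold` is a hypothetical helper.

```python
def evaluate_threshold(pairs: list[tuple[float, bool]], threshold: float) -> tuple[float, float]:
    """Return (hit_rate, wrong_hit_rate) for labelled (score, same_intent) pairs."""
    if not pairs:
        return 0.0, 0.0
    # Labels of every pair that would be served from the cache
    hits = [same for score, same in pairs if score >= threshold]
    hit_rate = len(hits) / len(pairs)
    wrong = sum(1 for same in hits if not same)
    wrong_hit_rate = wrong / len(hits) if hits else 0.0
    return hit_rate, wrong_hit_rate

# Hypothetical hand-labelled sample: (similarity score, same intent?)
SAMPLE = [
    (0.99, True), (0.97, True), (0.96, True), (0.93, True),
    (0.92, False), (0.91, True), (0.88, False), (0.86, True),
    (0.80, False), (0.70, False),
]

for t in (0.99, 0.95, 0.90, 0.85):
    hit_rate, wrong = evaluate_threshold(SAMPLE, t)
    print(f"threshold {t:.2f}: hit rate {hit_rate:.0%}, wrong hits {wrong:.0%}")
```

On this invented sample, lowering the threshold raises the hit rate and the share of wrong hits together, which is the trade-off the table describes.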

FAQ

What vector database should I use?

For most cases: Pinecone (managed, easy setup) or pgvector (if you already use PostgreSQL). For self-hosted: Weaviate or Qdrant.

Does semantic caching work for all queries?

No. It works best for repetitive domains (FAQ, support, classification). Creative tasks, unique queries, and rapidly changing content benefit less.

How do I handle cache invalidation?

Set a TTL (time-to-live) based on how fresh your content must be. FAQ answers can stay cached for months; news queries may need a TTL of hours or days.
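
A TTL check can be as simple as comparing the stored creation time with the current time. The sketch below assumes each cache entry stores `created` as an ISO-8601 UTC timestamp in its metadata; `is_expired` and the `TTL_BY_CATEGORY` values are hypothetical and only illustrate the idea. Many vector databases and key-value stores also offer their own expiry mechanisms.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical per-category TTLs; tune these to your own content
TTL_BY_CATEGORY = {
    "faq": timedelta(days=90),
    "news": timedelta(hours=6),
}

def is_expired(created_iso: str, ttl: timedelta, now: Optional[datetime] = None) -> bool:
    """True if an entry created at `created_iso` is older than `ttl`."""
    created = datetime.fromisoformat(created_iso)
    current = now or datetime.now(timezone.utc)
    return current - created > ttl

# Example: an entry created nine days before the reference time
created = datetime(2024, 1, 1, tzinfo=timezone.utc).isoformat()
reference = datetime(2024, 1, 10, tzinfo=timezone.utc)

print(is_expired(created, TTL_BY_CATEGORY["faq"], now=reference))
print(is_expired(created, TTL_BY_CATEGORY["news"], now=reference))
```

An expired entry can be treated as a cache miss: recompute the response and overwrite the stored row.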

Conclusion

Semantic caching delivered a nearly 60% reduction in API calls (58% cache hit rate by month 6) with minimal implementation complexity. The key is choosing the right similarity threshold and embedding model for your use case.

:::tip Continue Reading:

For cost optimization strategies, see Cut AI API Costs 60%
For prompt compression, read AI Prompt Compression Techniques
For token calculation, see AI Token Calculation Guide
For infrastructure cost comparison, see the GPU Rental Index for provider pricing :::

