
Semantic Caching Explained: How We Reduced API Calls by 60%

Learn how semantic caching cuts AI API calls by roughly 60% by using vector embeddings to match semantically similar queries and return cached responses.

PromptCost Engineering Team

Lead AI infrastructure engineers who have collectively spent over $500k on API bills across 12 production deployments.

Quick Answer

Semantic caching uses vector embeddings to match semantically similar queries and returns a cached response instead of calling the API. In production it cut our API calls by nearly 60% (a 58% cache hit rate by month 6). Key components: a vector database, an embedding model, and a similarity threshold (0.95 or higher).


The Problem: Repeated API Calls

In production AI systems, 30-50% of queries are semantically equivalent to a query that has already been answered, just worded differently:

  • “Reset my password” = “I forgot my password” = “Can’t access account”
  • “What’s the price?” = “How much does it cost?” = “Price check”

These all get the same response but hit the API every time.


How Semantic Caching Works

User Query -> Embed Query -> Search Vector DB -> Similar Found?
                                                    |
                                    Yes ------------+------------ No
                                    |                              |
                              Return Cached                   Call AI API
                              Response                        Store Response

Step 1: Embed the Query

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def embed_query(text: str) -> list[float]:
    return model.encode(text).tolist()
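
The score that later steps compare against the threshold is typically a cosine similarity between two embeddings. The following is a minimal sketch in plain Python, assuming cosine similarity is the metric in use (most vector databases compute it internally; `cosine_similarity` is a hypothetical helper name, not part of any library used above):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)
```

With real embeddings you would call it as `cosine_similarity(embed_query(q1), embed_query(q2))` and compare the result with your threshold.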

Step 2: Store Responses

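import hashlib
from datetime import datetime, timezone

async def store_response(query: str, response: str, cache):
    # `cache` is any async vector-store client exposing insert(); adapt to your DB.
    embedding = embed_query(query)
    await cache.insert(
        # SHA-256 of the exact query text gives a stable, collision-resistant ID
        id=hashlib.sha256(query.encode("utf-8")).hexdigest(),
        vector=embedding,
        response=response,
        # Creation time (UTC, ISO 8601) is kept so entries can be expired later
        metadata={"created": datetime.now(timezone.utc).isoformat()},
    )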

Step 3: Check Cache First

async def get_cached_or_compute(query: str, compute_fn, cache, threshold=0.95):
    embedding = embed_query(query)

    results = await cache.search(
        vector=embedding,
        top_k=1,
        threshold=threshold
    )

    if results and results[0]['score'] >= threshold:
        return results[0]['response'], True  # Cache hit

    # Miss - compute and store
    response = await compute_fn(query)
    await store_response(query, response, cache)
    return response, False
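
The `cache` object in the steps above is not defined in this article; in production it would wrap a vector database. For local experiments, a toy in-memory stand-in with the same `insert` and `search` shape might look like the sketch below. The class name `InMemoryCache`, the linear scan, and the use of cosine similarity are all assumptions made for illustration.

```python
import asyncio
import math

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class InMemoryCache:
    """Toy stand-in for a vector DB: linear scan with cosine similarity."""

    def __init__(self):
        self._rows = {}  # id -> (vector, response, metadata)

    async def insert(self, id, vector, response, metadata=None):
        self._rows[id] = (list(vector), response, metadata or {})

    async def search(self, vector, top_k=1, threshold=0.0):
        hits = []
        for row_id, (vec, response, metadata) in self._rows.items():
            score = _cosine(vector, vec)
            if score >= threshold:  # drop anything below the cutoff
                hits.append({"id": row_id, "score": score,
                             "response": response, "metadata": metadata})
        hits.sort(key=lambda h: h["score"], reverse=True)  # best match first
        return hits[:top_k]

async def demo():
    cache = InMemoryCache()
    await cache.insert(id="q1", vector=[1.0, 0.0], response="Use the reset link.")
    near = await cache.search(vector=[0.99, 0.05], top_k=1, threshold=0.95)
    far = await cache.search(vector=[0.0, 1.0], top_k=1, threshold=0.95)
    return near, far

if __name__ == "__main__":
    near, far = asyncio.run(demo())
    print(near, far)
```

A linear scan is only reasonable for a few thousand entries; beyond that, an approximate nearest-neighbor index in a real vector database is the usual choice.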

Production Results

After implementing semantic caching:

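| Month | Hit Rate | API Calls Saved | Monthly Savings |
|-------|----------|-----------------|-----------------|
| 1     | 38%      | 114,000         | $1,520          |
| 2     | 45%      | 135,000         | $1,800          |
| 3     | 52%      | 156,000         | $2,080          |
| 6     | 58%      | 174,000         | $2,320          |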

By month 6: a 58% cache hit rate, saving about $2,320 per month.
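
The figures above are consistent with a steady volume of about 300,000 queries per month and an average cost of roughly $0.0133 per API call. Both numbers are inferred here by dividing the reported columns (114,000 / 0.38 and $1,520 / 114,000); they are not stated by the authors. Under those assumptions, the arithmetic can be sketched as:

```python
MONTHLY_QUERIES = 300_000         # assumption: inferred as 114,000 / 0.38
COST_PER_CALL = 1_520 / 114_000   # assumption: inferred, about $0.0133 per call

def estimate_savings(hit_rate: float) -> tuple[int, float]:
    """Return (API calls avoided, dollars saved) for one month."""
    calls_saved = round(MONTHLY_QUERIES * hit_rate)
    return calls_saved, round(calls_saved * COST_PER_CALL, 2)

for month, hit_rate in [(1, 0.38), (2, 0.45), (3, 0.52), (6, 0.58)]:
    calls, dollars = estimate_savings(hit_rate)
    print(f"Month {month}: {calls:,} calls saved, ${dollars:,.2f}")
```

Swapping in your own query volume and per-call cost gives a quick estimate of what a given hit rate would be worth.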


Choosing Similarity Threshold

| Threshold | Hit Rate  | Quality Risk |
|-----------|-----------|--------------|
| 0.99      | Very low  | None         |
| 0.95      | Medium    | Low          |
| 0.90      | High      | Medium       |
| 0.85      | Very high | High         |

Recommendation: Start at 0.95, adjust based on quality feedback.
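
One way to adjust the threshold from quality feedback is to replay a small hand-labeled set of query pairs and see how each candidate threshold trades hit rate against wrong answers. The sketch below assumes you already have similarity scores for such pairs; the scores shown are made-up illustrative values, and `sweep_thresholds` is a hypothetical helper.

```python
def sweep_thresholds(labeled_pairs, thresholds):
    """labeled_pairs: non-empty list of (similarity_score, same_intent) tuples.

    For each threshold, report the share of pairs that would be served
    from cache (hit rate) and the share of those hits that are wrong.
    """
    report = []
    for t in thresholds:
        hits = [same for score, same in labeled_pairs if score >= t]
        hit_rate = len(hits) / len(labeled_pairs)
        wrong_rate = hits.count(False) / len(hits) if hits else 0.0
        report.append((t, hit_rate, wrong_rate))
    return report

# Illustrative, made-up scores: (similarity, do the two queries share an intent?)
pairs = [(0.99, True), (0.97, True), (0.96, True), (0.93, True),
         (0.92, False), (0.91, True), (0.88, False), (0.86, False)]

for t, hit_rate, wrong_rate in sweep_thresholds(pairs, [0.99, 0.95, 0.90, 0.85]):
    print(f"threshold={t:.2f}  hit_rate={hit_rate:.0%}  wrong_hits={wrong_rate:.0%}")
```

On this toy data, lowering the threshold raises the hit rate but starts admitting pairs with different intents, which mirrors the pattern in the table above.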


FAQ

What vector database should I use?

For most cases: Pinecone (managed, easy setup) or pgvector (if you already use PostgreSQL). For self-hosted: Weaviate or Qdrant.

Does semantic caching work for all queries?

No. It works best for repetitive domains (FAQ, support, classification). Creative tasks, unique queries, and rapidly changing content benefit less.

How do I handle cache invalidation?

Set TTL (time-to-live) based on your content freshness needs. FAQ content might last months. News queries might need hours or days.
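
A TTL check can be as simple as comparing the stored creation time with the current time before serving a hit. The sketch below assumes each entry's `created` metadata is a timezone-aware ISO 8601 string; `is_fresh`, `TTL_BY_CATEGORY`, and the specific durations are hypothetical and only loosely follow the guidance above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def is_fresh(created_iso: str, ttl: timedelta,
             now: Optional[datetime] = None) -> bool:
    """True if an entry created at `created_iso` is still within its TTL."""
    created = datetime.fromisoformat(created_iso)
    now = now or datetime.now(timezone.utc)
    return now - created <= ttl

# Hypothetical per-category TTLs: long for stable FAQ content, short for news.
TTL_BY_CATEGORY = {
    "faq": timedelta(days=90),
    "news": timedelta(hours=12),
}
```

A cache hit whose entry fails `is_fresh` would be treated as a miss: recompute the response and overwrite the stale entry. Many vector databases also support filtering on metadata at query time, which avoids fetching expired entries at all.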


Conclusion

Semantic caching cut our API calls by nearly 60% (a 58% cache hit rate by month 6) with minimal implementation complexity. The key is choosing the right similarity threshold and embedding model for your use case.

Frequently Asked Questions

What is semantic caching?

Semantic caching stores AI responses indexed by the vector embeddings of the queries that produced them. When a new query arrives, we embed it and search for a similar cached query (one scoring above a similarity threshold such as 0.95). If one is found, we return its cached response instead of calling the API.

How much can semantic caching save?

In production, our cache hit rate climbed from 38% in month 1 to 58% by month 6, reducing API calls by the same proportion. For FAQ-heavy applications, hit rates can exceed 70%. Savings depend on query diversity.

What is a good similarity threshold?

A similarity threshold of 0.95 works well for most cases. A higher threshold (0.97+) means stricter matching and a lower hit rate. A lower threshold (0.90) means more false positives and lower-quality responses.

How do I implement semantic caching?

Use a vector database (Pinecone, Weaviate, or pgvector), embed each query with a sentence-transformers model, and store the embedding alongside the response. For each new query, embed it, search for the most similar stored query, and return the cached response if the similarity exceeds the threshold.