RAG System Integration Guide
This guide shows you how to instrument Retrieval-Augmented Generation (RAG) systems with Lumina to get full observability across your entire pipeline.
What is RAG?
RAG (Retrieval-Augmented Generation) systems combine:
- Retrieval: Searching a knowledge base (vector database) for relevant context
- Generation: Using an LLM to generate a response based on the retrieved context
Why Instrument RAG with Lumina?
RAG pipelines have unique observability challenges:
- Two cost components: Vector search + LLM generation
- Quality issues: Wrong context retrieved OR LLM ignores context
- Performance bottlenecks: Slow retrieval OR slow generation
- Context bloat: Too much context = higher costs + lower quality
Lumina helps you:
- ✅ Track costs for both retrieval and generation
- ✅ Detect when context quality degrades
- ✅ Alert when context length grows unexpectedly
- ✅ Compare RAG pipeline versions (prompt changes, retrieval strategies)
Basic RAG Instrumentation
Setup
# Install Lumina SDK
npm install @lumina/sdk
# Install your RAG dependencies
npm install @pinecone-database/pinecone @anthropic-ai/sdk
Simple RAG Example
import { initLumina } from '@lumina/sdk';
import { Pinecone } from '@pinecone-database/pinecone';
import Anthropic from '@anthropic-ai/sdk';
const lumina = initLumina({
api_key: process.env.LUMINA_API_KEY,
environment: 'production',
});
const pinecone = new Pinecone();
const anthropic = new Anthropic();
export async function answerQuestion(question: string) {
// Lumina creates ONE parent trace for the entire RAG flow
return await lumina.trace(
async () => {
// Step 1: Retrieval (child span)
const context = await lumina.trace(
async () => {
const embedding = await getEmbedding(question);
const results = await pinecone.index('knowledge-base').query({
vector: embedding,
topK: 5,
includeMetadata: true,
});
return results.matches.map((m) => m.metadata?.text).filter(Boolean).join('\n\n');
},
{
name: 'rag-retrieval',
attributes: {
'rag.query': question,
'rag.num_results': 5,
'rag.vector_db': 'pinecone',
'rag.index': 'knowledge-base',
},
}
);
// Step 2: Generation (child span)
const response = await lumina.trace(
async () => {
return await anthropic.messages.create({
model: 'claude-3-5-sonnet-20241022',
max_tokens: 1024,
messages: [
{
role: 'user',
content: `Use the following context to answer the question.
Context:
${context}
Question: ${question}
Answer:`,
},
],
});
},
{
name: 'rag-generation',
attributes: {
'rag.context_length': context.length,
'rag.context_tokens': Math.ceil(context.length / 4), // rough estimate
},
}
);
// content is a union of block types; guard for the text variant
const block = response.content[0];
return block.type === 'text' ? block.text : '';
},
{
name: 'rag-pipeline',
attributes: {
'rag.type': 'simple',
'rag.version': '1.0',
},
}
);
}
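The example above calls a `getEmbedding` helper that isn't defined here. A minimal sketch, assuming you embed queries with OpenAI's embeddings API (the model is illustrative; any embedding model whose dimension matches your index works):

import OpenAI from 'openai';

const openai = new OpenAI();

// Hypothetical helper assumed by the example: embed the query text so it
// can be matched against the vectors stored in the index
async function getEmbedding(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  });
  return response.data[0].embedding;
}

Whichever model you pick, use the same one for indexing and querying, or the vectors won't be comparable.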
What You’ll See in Lumina
📊 RAG Pipeline (total: 1.2s, cost: $0.016)
├─ 🔍 rag-retrieval (850ms, $0.001)
│ ├─ Query: "What is the capital of France?"
│ ├─ Results: 5 documents
│ └─ Vector DB: Pinecone
└─ ✍️ rag-generation (350ms, $0.015)
├─ Model: claude-3-5-sonnet
├─ Context: 1,245 tokens
└─ Response: 87 tokens
Advanced RAG Patterns
Multi-Hop RAG
When your RAG system performs multiple retrieval rounds, trace each hop as its own child span:
export async function multiHopRAG(question: string) {
return await lumina.trace(
async () => {
// Hop 1: Initial retrieval
const initialContext = await lumina.trace(
async () => {
// ... retrieval logic
},
{
name: 'rag-retrieval-hop-1',
attributes: { 'rag.hop': 1 },
}
);
// Refine the query using the context from hop 1
const refinedQuery = await lumina.trace(
async () => {
// Use LLM to refine query based on initial context
return await refineQuery(question, initialContext);
},
{
name: 'rag-query-refinement',
}
);
// Hop 2: Second retrieval with the refined query
const refinedContext = await lumina.trace(
async () => {
// ... retrieval logic with refinedQuery
},
{
name: 'rag-retrieval-hop-2',
attributes: { 'rag.hop': 2 },
}
);
// Final generation
const response = await lumina.trace(
async () => {
// ... generation logic
},
{ name: 'rag-generation' }
);
return response;
},
{ name: 'rag-multi-hop-pipeline' }
);
}
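The `refineQuery` helper is left abstract above. One possible implementation, assuming you reuse the Anthropic client from the basic example (the model and prompt are illustrative):

import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

// Hypothetical query-refinement step: ask a small, fast model to rewrite
// the question into a more specific search query using the first hop's context
async function refineQuery(question: string, context: string): Promise<string> {
  const response = await anthropic.messages.create({
    model: 'claude-3-5-haiku-20241022',
    max_tokens: 256,
    messages: [
      {
        role: 'user',
        content: `Context so far:\n${context}\n\nRewrite this question as a more specific search query:\n${question}\n\nReturn only the rewritten query.`,
      },
    ],
  });
  const block = response.content[0];
  return block.type === 'text' ? block.text.trim() : question;
}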
RAG with Reranking
export async function ragWithReranking(question: string) {
return await lumina.trace(async () => {
// Initial retrieval (broad)
const candidates = await lumina.trace(async () => {
return await vectorSearch(question, { topK: 20 });
}, {
name: 'rag-retrieval-candidates',
attributes: { 'rag.candidates': 20 }
});
// Reranking
const reranked = await lumina.trace(async () => {
return await rerank(question, candidates, { topK: 5 });
}, {
name: 'rag-reranking',
attributes: {
'rag.reranker': 'cohere',
'rag.final_count': 5
}
});
// Generation with top results
const response = await lumina.trace(async () => {
// ... generation logic
}, {
name: 'rag-generation',
attributes: { 'rag.used_reranking': true }
});
return response;
}, { name: 'rag-reranking-pipeline' });
}
RAG Cost Optimization
Track Context Size
const { context, stats } = await lumina.trace(
  async () => {
    const results = await vectorSearch(question);
    const contextText = results.join('\n\n');
    // Values computed inside the traced function aren't in scope for the
    // `attributes` option, so return them alongside the context
    return {
      context: contextText,
      stats: {
        'rag.context_chars': contextText.length,
        'rag.context_tokens': estimateTokens(contextText),
        'rag.documents_retrieved': results.length,
      },
    };
  },
  { name: 'rag-retrieval' }
);

// Pass `stats` as attributes on the generation span so the sizes are recorded
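The `estimateTokens` helper is assumed above. A common rough heuristic, matching the ~4 characters per token approximation used in the basic example:

// Rough estimate only (~4 chars/token for English text); use your model
// provider's tokenizer when you need exact counts
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}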
Alert on Context Bloat
In your Lumina dashboard, set up alerts:
- Context length > 3000 tokens (warning)
- Context length > 5000 tokens (critical)
- Cost per request > $0.05 (warning)
Compare RAG Strategies
Use Lumina’s replay feature to test:
// Strategy A: Retrieve 10 documents
const strategyA = { topK: 10, rerank: false };
// Strategy B: Retrieve 20, rerank to 5
const strategyB = { topK: 20, rerank: true, finalK: 5 };
// Capture 100 production queries
// Run replay with strategyA vs strategyB
// Compare:
// - Average cost
// - Response quality (semantic similarity)
// - Latency
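Before reaching for replay, you can also run a quick offline comparison. A minimal sketch, where `answerWith` is a hypothetical function that runs your pipeline with a given strategy:

type Strategy = { topK: number; rerank: boolean; finalK?: number };

// Run each strategy over a set of captured queries and report average
// latency; record cost and a quality score per run the same way to
// compare all three dimensions
async function compareStrategies(
  queries: string[],
  strategies: Record<string, Strategy>,
  answerWith: (strategy: Strategy, question: string) => Promise<string>
): Promise<void> {
  for (const [name, strategy] of Object.entries(strategies)) {
    const start = Date.now();
    for (const question of queries) {
      await answerWith(strategy, question);
    }
    const avgMs = (Date.now() - start) / queries.length;
    console.log(`${name}: ${avgMs.toFixed(0)}ms/query on average`);
  }
}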
RAG Quality Monitoring
Track Retrieval Quality
const { context, avgRelevance, minRelevance } = await lumina.trace(
  async () => {
    const results = await vectorSearch(question);
    // Track retrieval relevance scores; compute them here, where
    // `results` is in scope
    const scores = results.map((r) => r.score);
    const avgScore = scores.reduce((sum, s) => sum + s, 0) / scores.length;
    return {
      context: results.map((r) => r.text).join('\n'),
      avgRelevance: avgScore,
      minRelevance: Math.min(...scores),
    };
  },
  { name: 'rag-retrieval' }
);

// Attach the scores as attributes on a downstream span, e.g.
// 'rag.avg_relevance_score': avgRelevance,
// 'rag.min_relevance_score': minRelevance,
Detect Context Utilization
Lumina’s semantic scorer can detect when the LLM ignores context:
// Lumina automatically tracks:
// - Did the response use the retrieved context?
// - Semantic similarity to expected output
// - Hallucination detection
// You'll get alerts like:
// "🚨 RAG quality drop: Response ignoring context (score: 0.58)"
Vector Database Integration Examples
Pinecone
import { Pinecone } from '@pinecone-database/pinecone';
const pinecone = new Pinecone();
const index = pinecone.index('knowledge-base');
const context = await lumina.trace(
async () => {
const embedding = await getEmbedding(query);
const results = await index.query({
vector: embedding,
topK: 5,
includeMetadata: true,
});
return results.matches;
},
{
name: 'rag-retrieval-pinecone',
attributes: {
'rag.vector_db': 'pinecone',
'rag.index': 'knowledge-base',
'rag.dimension': 1536,
},
}
);
Weaviate
import weaviate from 'weaviate-ts-client';
const client = weaviate.client({
scheme: 'http',
host: 'localhost:8080',
});
const context = await lumina.trace(
async () => {
const results = await client.graphql
.get()
.withClassName('Document')
.withNearText({ concepts: [query] })
.withLimit(5)
.withFields('content _additional { distance }')
.do();
return results.data.Get.Document;
},
{
name: 'rag-retrieval-weaviate',
attributes: {
'rag.vector_db': 'weaviate',
'rag.class': 'Document',
},
}
);
Chroma
import { ChromaClient } from 'chromadb';
const client = new ChromaClient();
const collection = await client.getCollection({ name: 'documents' });
const context = await lumina.trace(
async () => {
const results = await collection.query({
queryTexts: [query],
nResults: 5,
});
return results.documents[0];
},
{
name: 'rag-retrieval-chroma',
attributes: {
'rag.vector_db': 'chroma',
'rag.collection': 'documents',
},
}
);
Supabase Vector (pgvector)
import { createClient } from '@supabase/supabase-js';
const supabase = createClient(url, key);
const context = await lumina.trace(
async () => {
const embedding = await getEmbedding(query);
const { data } = await supabase.rpc('match_documents', {
query_embedding: embedding,
match_count: 5,
});
return data;
},
{
name: 'rag-retrieval-supabase',
attributes: {
'rag.vector_db': 'supabase-pgvector',
},
}
);
RAG Dashboard Metrics
Once instrumented, Lumina tracks:
Cost Metrics
- Total RAG cost = Retrieval cost + Generation cost
- Cost breakdown: What % is retrieval vs generation?
- Cost per document: Are you retrieving too many?
Performance Metrics
- Retrieval latency: How long does vector search take?
- Generation latency: How long does the LLM take?
- Total latency: End-to-end RAG pipeline time
Quality Metrics
- Retrieval relevance: Are the right documents being retrieved?
- Context utilization: Is the LLM using the context?
- Response quality: Semantic similarity to expected output
Alerts You’ll Get
- 🚨 Cost spike: “RAG cost +45% (context grew from 1K → 3K tokens)”
- ⚠️ Quality drop: “Response ignoring context (semantic score: 0.62)”
- 🔔 Latency spike: “Retrieval taking 2.5s (baseline: 800ms)”
Best Practices
1. Always Use Parent-Child Spans
// ✅ Good: Clear hierarchy
lumina.trace(
async () => {
await lumina.trace(retrieval, { name: 'retrieval' });
await lumina.trace(generation, { name: 'generation' });
},
{ name: 'rag-pipeline' }
);
// ❌ Bad: Flat traces
lumina.trace(retrieval);
lumina.trace(generation);
2. Add Meaningful Attributes
// ✅ Good: Rich context
{
'rag.query': question,
'rag.num_results': 5,
'rag.context_tokens': 1245,
'rag.vector_db': 'pinecone',
}
// ❌ Bad: No context
{ name: 'retrieval' }
3. Track Retrieval Quality
// ✅ Good: Track relevance scores
attributes: {
'rag.avg_relevance': 0.87,
'rag.min_relevance': 0.72,
}
// ❌ Bad: No quality metrics
4. Version Your RAG Pipeline
// ✅ Good: Track versions
attributes: {
'rag.version': '2.1',
'rag.strategy': 'reranking',
}
// Use Lumina's replay to compare v2.0 vs v2.1
Troubleshooting
High Costs?
Check Lumina dashboard:
- Is context length growing? (alert: context > 3K tokens)
- Are you retrieving too many documents? (reduce topK)
- Are you using expensive models? (try Claude Haiku for simple queries)
Low Quality?
Check Lumina dashboard:
- Retrieval relevance scores low? (improve embeddings/indexing)
- Response ignoring context? (improve prompt engineering)
- Hallucinations? (reduce temperature, improve context)
Slow Performance?
Check Lumina dashboard:
- Retrieval slow? (add vector DB indexes, reduce topK)
- Generation slow? (reduce context length, use faster model)
- Both slow? (consider caching common queries; see the sketch below)
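For the caching suggestion, a minimal in-memory version might look like this. It assumes the `answerQuestion` function from the basic example; production systems usually want an LRU with a TTL, or an external store like Redis:

// Naive in-memory cache keyed on the normalized question
const answerCache = new Map<string, string>();

export async function answerQuestionCached(question: string): Promise<string> {
  const key = question.trim().toLowerCase();
  const cached = answerCache.get(key);
  if (cached !== undefined) return cached;
  const answer = await answerQuestion(question);
  answerCache.set(key, answer);
  return answer;
}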
Next Steps
- ✅ Instrument your RAG pipeline with Lumina
- 📊 Set up alerts for cost spikes and quality drops
- 🔄 Use replay to test RAG improvements
- 📈 Track metrics over time to optimize
Need help? Check out: