Retrieving Context for LLMs
After finding relevant passages with search, use the retrieval endpoints to get document chunks and assemble context for your LLM prompts.
Retrieving chunks requires an API key with read permission. See Authentication for details.
Two Retrieval Methods
1. Get Chunks for a Document
GET /api/v1/chunks/:document_id returns all chunks for a specific document.
curl https://api.inherent.systems/api/v1/chunks/doc_abc123 \
-H "X-API-Key: $INHERENT_API_KEY"
{
  "document_id": "doc_abc123",
  "chunks": [
    {
      "chunk_id": "chk_x1y2z3",
      "content": "Revenue grew 23% year-over-year in Q1 2026...",
      "position": 0,
      "token_count": 128
    },
    {
      "chunk_id": "chk_a4b5c6",
      "content": "The enterprise segment accounted for 60% of total revenue...",
      "position": 1,
      "token_count": 95
    }
  ]
}
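As a sketch, this response can be reassembled client-side. The snippet below assumes the `requests` library and the response shape shown above; the helper names are illustrative, not part of the API:

```python
import os
import requests

BASE_URL = "https://api.inherent.systems/api/v1"
headers = {"X-API-Key": os.environ.get("INHERENT_API_KEY", "")}

def join_chunks(chunks: list[dict]) -> str:
    """Join chunk contents in position order."""
    ordered = sorted(chunks, key=lambda c: c["position"])
    return "\n\n".join(c["content"] for c in ordered)

def get_document_text(document_id: str) -> str:
    """Fetch a document's chunks and reassemble its text."""
    resp = requests.get(f"{BASE_URL}/chunks/{document_id}", headers=headers)
    resp.raise_for_status()
    return join_chunks(resp.json()["chunks"])
```

Sorting by `position` defensively guarantees the reassembled text follows the original document order even if the API ever returns chunks unordered.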
2. Get Full Document Context
GET /api/v1/chunks/:document_id/context returns document metadata, all chunks, and a concatenated full_text field — ready to drop into a prompt.
curl https://api.inherent.systems/api/v1/chunks/doc_abc123/context \
-H "X-API-Key: $INHERENT_API_KEY"
{
  "document_id": "doc_abc123",
  "document_name": "q1-2026-revenue-report.pdf",
  "metadata": {"department": "finance", "quarter": "Q1-2026"},
  "chunks": [
    {
      "chunk_id": "chk_x1y2z3",
      "content": "Revenue grew 23% year-over-year in Q1 2026...",
      "position": 0,
      "token_count": 128
    }
  ],
  "full_text": "Revenue grew 23% year-over-year in Q1 2026... The enterprise segment accounted for 60% of total revenue...",
  "total_tokens": 1847
}
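The full_text field can be dropped straight into a prompt. A minimal sketch (the prompt wording and helper name are illustrative, not part of the API):

```python
def format_prompt(document_name: str, full_text: str, question: str) -> str:
    """Wrap a document's full text in a simple RAG prompt."""
    return (
        f"Context from {document_name}:\n"
        f"{full_text}\n\n"
        f"Question: {question}"
    )
```

With the `/context` response parsed into `ctx`, this would be called as `format_prompt(ctx["document_name"], ctx["full_text"], "What was Q1 revenue?")`.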
Building LLM Prompts
The typical RAG flow is: search for relevant chunks, retrieve context, build a prompt, and send it to your LLM.
OpenAI
import os
import requests
from openai import OpenAI

INHERENT_API_KEY = os.environ["INHERENT_API_KEY"]
BASE_URL = "https://api.inherent.systems/api/v1"
headers = {"X-API-Key": INHERENT_API_KEY, "Content-Type": "application/json"}

# 1. Search for relevant chunks
search_resp = requests.post(
    f"{BASE_URL}/search",
    headers=headers,
    json={"query": "What was Q1 revenue?", "limit": 5, "min_score": 0.3},
)
results = search_resp.json()["results"]

# 2. Build context from search results
context = "\n\n---\n\n".join(
    f"[Source: {r['document_name']}]\n{r['content']}" for r in results
)

# 3. Send to LLM
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "Answer the user's question using only the provided context. "
                "Cite the source document name for each claim."
            ),
        },
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: What was the revenue in Q1?",
        },
    ],
)
print(response.choices[0].message.content)
Anthropic Claude
import os
import requests
import anthropic

INHERENT_API_KEY = os.environ["INHERENT_API_KEY"]
BASE_URL = "https://api.inherent.systems/api/v1"
headers = {"X-API-Key": INHERENT_API_KEY, "Content-Type": "application/json"}

# 1. Search for relevant chunks
search_resp = requests.post(
    f"{BASE_URL}/search",
    headers=headers,
    json={"query": "What was Q1 revenue?", "limit": 5, "min_score": 0.3},
)
results = search_resp.json()["results"]

# 2. Build context from search results
context = "\n\n---\n\n".join(
    f"[Source: {r['document_name']}]\n{r['content']}" for r in results
)

# 3. Send to LLM
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=(
        "Answer the user's question using only the provided context. "
        "Cite the source document name for each claim."
    ),
    messages=[
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: What was the revenue in Q1?",
        }
    ],
)
print(response.content[0].text)
Context Window Management
LLMs have finite context windows. Use the token_count on each chunk and total_tokens on the context endpoint to stay within limits.
MAX_CONTEXT_TOKENS = 8000  # budget for context within your model's limit

# Search broadly, then keep only as many chunks as fit the budget
search_results = requests.post(
    f"{BASE_URL}/search",
    headers=headers,
    json={"query": user_question, "limit": 20, "min_score": 0.3},
).json()["results"]

selected = []
tokens_used = 0
for result in search_results:
    # Estimate tokens from content length (roughly 1 token per 4 chars)
    estimated_tokens = len(result["content"]) // 4
    if tokens_used + estimated_tokens > MAX_CONTEXT_TOKENS:
        break
    selected.append(result)
    tokens_used += estimated_tokens

# Build prompt from selected chunks only
context = "\n\n---\n\n".join(
    f"[Source: {r['document_name']}]\n{r['content']}" for r in selected
)
When using the /context endpoint, check total_tokens before including the full document. If it exceeds your budget, fall back to individual chunks from search results and select the top-scoring ones that fit.
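That fallback can be sketched as follows, assuming search results arrive sorted by score (highest first); the budget value and helper names are illustrative:

```python
import os
import requests

BASE_URL = "https://api.inherent.systems/api/v1"
headers = {"X-API-Key": os.environ.get("INHERENT_API_KEY", "")}
MAX_CONTEXT_TOKENS = 8000

def select_within_budget(results: list[dict], budget: int) -> list[dict]:
    """Greedily keep results until the token budget would be exceeded."""
    selected, used = [], 0
    for r in results:
        est = len(r["content"]) // 4  # rough estimate: ~1 token per 4 chars
        if used + est > budget:
            break
        selected.append(r)
        used += est
    return selected

def build_context(document_id: str, search_results: list[dict]) -> str:
    """Prefer the full document when it fits; otherwise fall back to top chunks."""
    resp = requests.get(
        f"{BASE_URL}/chunks/{document_id}/context", headers=headers
    )
    resp.raise_for_status()
    ctx = resp.json()
    if ctx["total_tokens"] <= MAX_CONTEXT_TOKENS:
        return ctx["full_text"]
    kept = select_within_budget(search_results, MAX_CONTEXT_TOKENS)
    return "\n\n---\n\n".join(c["content"] for c in kept)
```

The greedy loop stops at the first chunk that would overflow the budget, so a score-sorted input list always yields the highest-scoring chunks that fit.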
Best Practices
- Preserve chunk ordering. Chunks are returned with a position field indicating their order in the original document. Maintaining this order produces more coherent context for the LLM.
- Budget your context window. Reserve tokens for the system prompt, the user question, and the model's response. A common split: 60% context, 10% system/question, 30% response.
- Use search scores to prioritize. When you have more relevant chunks than context budget, include the highest-scoring chunks first.
- Cite sources. Include document names in your prompt context so the LLM can reference them in its answer. This improves traceability.
- Retrieve full context sparingly. The /context endpoint is useful when you need the entire document, but for most RAG use cases, the top search results provide sufficient and more focused context.
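The 60/10/30 split above can be derived from a model's context window size; a small sketch (the helper name is illustrative):

```python
def budget_split(window_tokens: int) -> dict[str, int]:
    """Split a context window: 60% retrieved context, 10% system prompt
    and question, 30% reserved for the model's response."""
    return {
        "context": window_tokens * 60 // 100,
        "system_and_question": window_tokens * 10 // 100,
        "response": window_tokens * 30 // 100,
    }
```

For a 128k-token model this reserves roughly 76,800 tokens for retrieved context, which would replace the fixed MAX_CONTEXT_TOKENS constant used earlier.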