Retrieving Context for LLMs
After finding relevant passages with search, use the retrieval endpoints to get document chunks and assemble context for your LLM prompts.
Retrieving chunks requires an API key with read permission. See Authentication for details.
Two Retrieval Methods
1. Get Chunks for a Document
GET /api/v1/chunks/:document_id returns all chunks for a specific document.
curl https://api.inherent.systems/api/v1/chunks/doc_abc123 \
-H "X-API-Key: $INHERENT_API_KEY"
{
  "document_id": "doc_abc123",
  "chunks": [
    {
      "chunk_id": "chk_x1y2z3",
      "content": "Revenue grew 23% year-over-year in Q1 2026...",
      "position": 0,
      "token_count": 128
    },
    {
      "chunk_id": "chk_a4b5c6",
      "content": "The enterprise segment accounted for 60% of total revenue...",
      "position": 1,
      "token_count": 95
    }
  ]
}
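As a sketch, this response can be reassembled client-side. The snippet below assumes the `requests` library and the response shape shown above; the helper names are illustrative, not part of the API:

```python
import os
import requests

BASE_URL = "https://api.inherent.systems/api/v1"
headers = {"X-API-Key": os.environ.get("INHERENT_API_KEY", "")}

def join_chunks(chunks: list[dict]) -> str:
    """Join chunk contents in position order."""
    ordered = sorted(chunks, key=lambda c: c["position"])
    return "\n\n".join(c["content"] for c in ordered)

def get_document_text(document_id: str) -> str:
    """Fetch a document's chunks and reassemble its text."""
    resp = requests.get(f"{BASE_URL}/chunks/{document_id}", headers=headers)
    resp.raise_for_status()
    return join_chunks(resp.json()["chunks"])
```

Sorting by `position` defensively guarantees the reassembled text follows the original document order even if the API ever returns chunks unordered.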
2. Get Full Document Context
GET /api/v1/chunks/:document_id/context returns document metadata, all chunks, and a concatenated full_text field — ready to drop into a prompt.
curl https://api.inherent.systems/api/v1/chunks/doc_abc123/context \
-H "X-API-Key: $INHERENT_API_KEY"
{
  "document_id": "doc_abc123",
  "document_name": "q1-2026-revenue-report.pdf",
  "metadata": {"department": "finance", "quarter": "Q1-2026"},
  "chunks": [
    {
      "chunk_id": "chk_x1y2z3",
      "content": "Revenue grew 23% year-over-year in Q1 2026...",
      "position": 0,
      "token_count": 128
    }
  ],
  "full_text": "Revenue grew 23% year-over-year in Q1 2026... The enterprise segment accounted for 60% of total revenue...",
  "total_tokens": 1847
}
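The full_text field can be dropped straight into a prompt. A minimal sketch (the prompt wording and helper name are illustrative, not part of the API):

```python
def format_prompt(document_name: str, full_text: str, question: str) -> str:
    """Wrap a document's full text in a simple RAG prompt."""
    return (
        f"Context from {document_name}:\n"
        f"{full_text}\n\n"
        f"Question: {question}"
    )
```

With the `/context` response parsed into `ctx`, this would be called as `format_prompt(ctx["document_name"], ctx["full_text"], "What was Q1 revenue?")`.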
Building LLM Prompts
The typical RAG flow is: search for relevant chunks, retrieve context, build a prompt, and send it to your LLM.
OpenAI
import os
import requests
from openai import OpenAI

INHERENT_API_KEY = os.environ["INHERENT_API_KEY"]
BASE_URL = "https://api.inherent.systems/api/v1"
headers = {"X-API-Key": INHERENT_API_KEY, "Content-Type": "application/json"}

# 1. Search for relevant chunks
search_resp = requests.post(
    f"{BASE_URL}/search",
    headers=headers,
    json={"query": "What was Q1 revenue?", "limit": 5, "min_score": 0.3},
)
results = search_resp.json()["results"]

# 2. Build context from search results
context = "\n\n---\n\n".join(
    f"[Source: {r['document_name']}]\n{r['content']}" for r in results
)

# 3. Send to LLM
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "Answer the user's question using only the provided context. "
                "Cite the source document name for each claim."
            ),
        },
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: What was the revenue in Q1?",
        },
    ],
)
print(response.choices[0].message.content)
Anthropic Claude
import os
import requests
import anthropic

INHERENT_API_KEY = os.environ["INHERENT_API_KEY"]
BASE_URL = "https://api.inherent.systems/api/v1"
headers = {"X-API-Key": INHERENT_API_KEY, "Content-Type": "application/json"}

# 1. Search for relevant chunks
search_resp = requests.post(
    f"{BASE_URL}/search",
    headers=headers,
    json={"query": "What was Q1 revenue?", "limit": 5, "min_score": 0.3},
)
results = search_resp.json()["results"]

# 2. Build context from search results
context = "\n\n---\n\n".join(
    f"[Source: {r['document_name']}]\n{r['content']}" for r in results
)

# 3. Send to LLM
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=(
        "Answer the user's question using only the provided context. "
        "Cite the source document name for each claim."
    ),
    messages=[
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: What was the revenue in Q1?",
        }
    ],
)
print(response.content[0].text)
Context Window Management
LLMs have finite context windows. Use the token_count on each chunk and total_tokens on the context endpoint to stay within limits.
MAX_CONTEXT_TOKENS = 8000  # budget for context within your model's limit

# Search broadly, then keep only as many chunks as fit the budget
search_results = requests.post(
    f"{BASE_URL}/search",
    headers=headers,
    json={"query": user_question, "limit": 20, "min_score": 0.3},
).json()["results"]

selected = []
tokens_used = 0
for result in search_results:
    # Estimate tokens from content length (roughly 1 token per 4 chars)
    estimated_tokens = len(result["content"]) // 4
    if tokens_used + estimated_tokens > MAX_CONTEXT_TOKENS:
        break
    selected.append(result)
    tokens_used += estimated_tokens

# Build prompt from selected chunks only
context = "\n\n---\n\n".join(
    f"[Source: {r['document_name']}]\n{r['content']}" for r in selected
)
When using the /context endpoint, check total_tokens before including the full document. If it exceeds your budget, fall back to individual chunks from search results and select the top-scoring ones that fit.
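That fallback can be sketched as follows, assuming search results arrive sorted by score (highest first); the budget value and helper names are illustrative:

```python
import os
import requests

BASE_URL = "https://api.inherent.systems/api/v1"
headers = {"X-API-Key": os.environ.get("INHERENT_API_KEY", "")}
MAX_CONTEXT_TOKENS = 8000

def select_within_budget(results: list[dict], budget: int) -> list[dict]:
    """Greedily keep results until the token budget would be exceeded."""
    selected, used = [], 0
    for r in results:
        est = len(r["content"]) // 4  # rough estimate: ~1 token per 4 chars
        if used + est > budget:
            break
        selected.append(r)
        used += est
    return selected

def build_context(document_id: str, search_results: list[dict]) -> str:
    """Prefer the full document when it fits; otherwise fall back to top chunks."""
    resp = requests.get(
        f"{BASE_URL}/chunks/{document_id}/context", headers=headers
    )
    resp.raise_for_status()
    ctx = resp.json()
    if ctx["total_tokens"] <= MAX_CONTEXT_TOKENS:
        return ctx["full_text"]
    kept = select_within_budget(search_results, MAX_CONTEXT_TOKENS)
    return "\n\n---\n\n".join(c["content"] for c in kept)
```

The greedy loop stops at the first chunk that would overflow the budget, so a score-sorted input list always yields the highest-scoring chunks that fit.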
Best Practices
- Preserve chunk ordering. Chunks are returned with a position field indicating their order in the original document. Maintaining this order produces more coherent context for the LLM.
- Budget your context window. Reserve tokens for the system prompt, the user question, and the model's response. A common split: 60% context, 10% system/question, 30% response.
- Use search scores to prioritize. When you have more relevant chunks than context budget, include the highest-scoring chunks first.
- Cite sources. Include document names in your prompt context so the LLM can reference them in its answer. This improves traceability.
- Retrieve full context sparingly. The /context endpoint is useful when you need the entire document, but for most RAG use cases, the top search results provide sufficient and more focused context.
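The 60/10/30 split above can be derived from a model's context window size; a small sketch (the helper name is illustrative):

```python
def budget_split(window_tokens: int) -> dict[str, int]:
    """Split a context window: 60% retrieved context, 10% system prompt
    and question, 30% reserved for the model's response."""
    return {
        "context": window_tokens * 60 // 100,
        "system_and_question": window_tokens * 10 // 100,
        "response": window_tokens * 30 // 100,
    }
```

For a 128k-token model this reserves roughly 76,800 tokens for retrieved context, which would replace the fixed MAX_CONTEXT_TOKENS constant used earlier.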