Searching Your Knowledge Base
Inherent provides three search modes across every document in your workspace: semantic (vector similarity), hybrid (BM25 + vector fusion), and keyword (pure BM25). Pick the mode that matches your query shape.
Searching requires an API key with search permission. See Authentication for details.
How Search Works
- Your query is routed to the selected mode (semantic by default).
- Semantic: the query is converted into a vector embedding and matched against chunk embeddings via Weaviate nearVector.
- Hybrid: BM25 term-frequency scoring is fused with vector similarity, weighted by alpha (default 0.7, vector-heavy).
- Keyword: pure BM25; no embeddings needed.
- Results are scored, ranked, and returned with the matching chunk content.
In practice, a natural-language query like "revenue growth in Q1" can match passages about "first quarter earnings increase" (semantic or hybrid mode), while an exact token like ErrRateLimitExceeded matches its literal occurrences (keyword or hybrid mode).
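To build intuition for alpha, here is a toy sketch of weighted score fusion. It assumes each score set is min-max normalized and then combined as a weighted sum, similar in spirit to Weaviate's relative score fusion. The server's exact fusion algorithm and normalization may differ, and min_max_normalize and fuse are hypothetical names used only for illustration.

```python
# Toy illustration of alpha-weighted score fusion; not the server's implementation.


def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale raw scores to [0, 1]. If all scores are equal, each maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {chunk: 1.0 for chunk in scores}
    return {chunk: (s - lo) / (hi - lo) for chunk, s in scores.items()}


def fuse(vector: dict[str, float], bm25: dict[str, float], alpha: float = 0.7) -> list[tuple[str, float]]:
    """Rank chunks by alpha * vector + (1 - alpha) * bm25 after normalization.

    A chunk missing from one result set contributes 0.0 for that side.
    """
    v, k = min_max_normalize(vector), min_max_normalize(bm25)
    fused = {c: alpha * v.get(c, 0.0) + (1 - alpha) * k.get(c, 0.0) for c in set(v) | set(k)}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    vector_scores = {"chk_a": 0.9, "chk_b": 0.5, "chk_c": 0.1}  # similarity scores
    bm25_scores = {"chk_b": 12.0, "chk_c": 3.0}  # raw BM25, unbounded scale
    for alpha in (1.0, 0.7, 0.0):
        print(alpha, fuse(vector_scores, bm25_scores, alpha))
```

Normalizing first matters because raw BM25 scores are unbounded while vector similarities are not; without it, alpha would not mean a consistent balance between the two signals.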
Search Request
Send a POST request to /v1/search with a JSON body.
```shell
curl -X POST https://api.inherent.sh/v1/search \
  -H "X-API-Key: $INHERENT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What was the revenue growth in Q1?",
    "limit": 5,
    "min_score": 0.3
  }'
```
```python
import os

import requests

response = requests.post(
    "https://api.inherent.sh/v1/search",
    headers={
        "X-API-Key": os.environ["INHERENT_API_KEY"],
        "Content-Type": "application/json",
    },
    json={
        "query": "What was the revenue growth in Q1?",
        "limit": 5,
        "min_score": 0.3,
    },
    timeout=30,  # seconds; avoids hanging forever on a stalled connection
)
response.raise_for_status()  # surface 4xx/5xx errors instead of a KeyError below

results = response.json()["results"]
for r in results:
    print(f"[{r['score']:.2f}] {r['document_name']}: {r['content'][:100]}")
```
```javascript
// Top-level await requires an ES module (or wrap this in an async function).
const response = await fetch("https://api.inherent.sh/v1/search", {
  method: "POST",
  headers: {
    "X-API-Key": process.env.INHERENT_API_KEY,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    query: "What was the revenue growth in Q1?",
    limit: 5,
    min_score: 0.3,
  }),
});
if (!response.ok) {
  throw new Error(`Search failed: ${response.status} ${response.statusText}`);
}
const { results } = await response.json();
results.forEach((r) => {
  console.log(`[${r.score.toFixed(2)}] ${r.document_name}: ${r.content.slice(0, 100)}`);
});
```
Request Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
query | string | Yes | — | The search query. Maximum 1,000 characters. |
limit | integer | No | 10 | Number of results to return. Range: 1–100. |
min_score | float | No | 0.0 | Minimum relevance score. Range: 0.0–1.0. Results below this threshold are excluded. |
document_ids | string[] | No | — | Restrict search to specific documents by ID. |
search_mode | string | No | "semantic" | One of "semantic", "hybrid", or "keyword". |
alpha | float | No | 0.7 | Hybrid fusion weight from 0.0 (pure keyword) to 1.0 (vector-heavy); higher values weight vector similarity more. Ignored unless search_mode="hybrid". |
include_context | boolean | No | false | If true, includes neighbouring chunks in context_before / context_after. |
context_window | integer | No | 2 | Number of neighbouring chunks returned on each side (before and after) of a match. Range: 0–5. Ignored when include_context=false. |
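If you build request bodies in code, checking them against the table before sending can catch mistakes early. The sketch below is a hypothetical client-side helper, not part of an Inherent SDK. It assumes an empty query is invalid and that alpha is bounded to 0.0–1.0, which the table implies but does not state outright.

```python
# Hypothetical client-side helper; ranges mirror the Request Parameters table.
VALID_MODES = {"semantic", "hybrid", "keyword"}

# One predicate per documented parameter (alpha's 0.0-1.0 bound is assumed).
CHECKS = {
    "limit": lambda v: isinstance(v, int) and 1 <= v <= 100,
    "min_score": lambda v: isinstance(v, (int, float)) and 0.0 <= v <= 1.0,
    "document_ids": lambda v: isinstance(v, list) and all(isinstance(d, str) for d in v),
    "search_mode": lambda v: v in VALID_MODES,
    "alpha": lambda v: isinstance(v, (int, float)) and 0.0 <= v <= 1.0,
    "include_context": lambda v: isinstance(v, bool),
    "context_window": lambda v: isinstance(v, int) and 0 <= v <= 5,
}


def build_search_payload(query: str, **options) -> dict:
    """Validate options and return a JSON-ready body for POST /v1/search.

    Only the options you pass are included, so server defaults apply to the rest.
    """
    if not query or len(query) > 1000:
        raise ValueError("query must be between 1 and 1,000 characters")
    payload = {"query": query}
    for name, value in options.items():
        check = CHECKS.get(name)
        if check is None:
            raise ValueError(f"unknown parameter: {name}")
        if not check(value):
            raise ValueError(f"invalid value for {name}: {value!r}")
        payload[name] = value
    return payload
```

The result can be passed straight to the json= argument of requests.post, for example build_search_payload("ink_", search_mode="keyword", limit=10). Omitting unset options, rather than filling in defaults locally, keeps the server as the single source of truth for default values.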
Choosing a Search Mode
The search_mode field controls which retrieval engine the server uses. The response always echoes back search_mode so you can confirm routing.
| Mode | Best for |
|---|---|
semantic (default) | Natural-language questions, paraphrased queries, conceptual lookups |
hybrid | Queries that mix prose and literals -- e.g. "rate limit error 429" |
keyword | Exact-match queries -- error codes, identifiers, code snippets |
```shell
# Semantic is the default -- no need to specify search_mode
curl -X POST https://api.inherent.sh/v1/search \
  -H "X-API-Key: $INHERENT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What was the revenue growth in Q1?",
    "limit": 5
  }'
```
```shell
# Hybrid: BM25 fused with vector similarity
# alpha=0.7 (default) = vector-weighted; alpha=0.0 = pure keyword
curl -X POST https://api.inherent.sh/v1/search \
  -H "X-API-Key: $INHERENT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "ErrRateLimitExceeded connection reset",
    "search_mode": "hybrid",
    "alpha": 0.7,
    "limit": 5
  }'
```
```shell
# Pure BM25 -- great for identifiers and error codes
curl -X POST https://api.inherent.sh/v1/search \
  -H "X-API-Key: $INHERENT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "ink_",
    "search_mode": "keyword",
    "limit": 10
  }'
```
Semantic and hybrid modes require chunk embeddings to be present in Weaviate. Keyword mode needs no embeddings, so it delivers full retrieval quality in every environment. Semantic and hybrid quality will improve once ingestion-side chunk embeddings land (tracked as ENG-S083).
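If you route queries programmatically, a rough heuristic based on the mode table can pick a mode from the query's shape. This is a hypothetical sketch: choose_search_mode and its identifier pattern are illustrations, not part of the API, and you should tune them against your own traffic.

```python
import re

# Hypothetical pattern for "literal-looking" tokens: only identifier characters,
# plus at least one underscore or digit, or an inner lower-to-upper case change.
IDENTIFIER = re.compile(r"^(?=.*[_\d]|.*[a-z][A-Z])[A-Za-z0-9_.:-]+$")


def choose_search_mode(query: str) -> str:
    """Return "keyword", "hybrid", or "semantic" based on how literal the query looks."""
    tokens = query.split()
    literals = [t for t in tokens if IDENTIFIER.match(t)]
    if tokens and len(literals) == len(tokens):
        return "keyword"  # every token looks like an identifier or code
    if literals:
        return "hybrid"  # prose mixed with literals
    return "semantic"  # plain natural language
```

With this pattern, "ErrRateLimitExceeded" and "ink_" route to keyword, "rate limit error 429" routes to hybrid, and "What was the revenue growth in Q1?" stays semantic. Echoed search_mode values in responses let you confirm what the server actually used.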
Context-aware Results
Set include_context: true to receive the chunks immediately before and after each match inside the same document. This gives your LLM coherent surrounding context without a second round-trip.
```shell
curl -X POST https://api.inherent.sh/v1/search \
  -H "X-API-Key: $INHERENT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What was the revenue growth in Q1?",
    "limit": 5,
    "include_context": true,
    "context_window": 2
  }'
```
Each result in the response includes context_before and context_after arrays (up to context_window chunks on each side). The top-level total_tokens field sums token_count across every returned chunk, so you can plan LLM prompt budgets without re-tokenising client-side.
```json
{
  "query": "What was the revenue growth in Q1?",
  "total_results": 1,
  "processing_time_ms": 52.1,
  "search_mode": "semantic",
  "total_tokens": 980,
  "results": [
    {
      "chunk_id": "chk_x1y2z3",
      "document_id": "doc_abc123",
      "document_name": "q1-2026-revenue-report.pdf",
      "content": "Revenue grew 23% year-over-year in Q1 2026...",
      "score": 0.87,
      "metadata": { "quarter": "Q1-2026" },
      "context_before": [
        {
          "chunk_id": "chk_x1y2z1",
          "chunk_index": 11,
          "content": "The board approved the Q1 forecast in January...",
          "token_count": 195
        }
      ],
      "context_after": [
        {
          "chunk_id": "chk_x1y2z4",
          "chunk_index": 13,
          "content": "Enterprise segment drove the majority of the increase...",
          "token_count": 210
        }
      ]
    }
  ]
}
```
Key rules:

- With context_window: 0, the arrays are present but empty ([]).
- With include_context: false (the default), context_before and context_after are null and no extra DB query runs.
- Arrays are clamped at document edges: the first chunk gets an empty context_before array and the last chunk gets an empty context_after array.
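For RAG prompts you typically want the matched chunk and its neighbours as one passage in reading order. A minimal sketch, assuming each context chunk carries the chunk_index shown in the example response. assemble_passage and fits_budget are hypothetical helpers, and sorting by chunk_index is a defensive choice because the ordering of the arrays is not specified above.

```python
def assemble_passage(result: dict) -> str:
    """Join context_before, the matched chunk, and context_after in reading order.

    context_before/context_after are None when include_context was false and
    [] at document edges, so both are treated as "no neighbours".
    """
    before = result.get("context_before") or []
    after = result.get("context_after") or []
    parts = [c["content"] for c in sorted(before, key=lambda c: c["chunk_index"])]
    parts.append(result["content"])
    parts.extend(c["content"] for c in sorted(after, key=lambda c: c["chunk_index"]))
    return "\n\n".join(parts)


def fits_budget(response: dict, max_tokens: int) -> bool:
    """Check the server-computed total_tokens instead of re-tokenising locally."""
    return response["total_tokens"] <= max_tokens
```

If fits_budget returns False, you can retry with a smaller context_window or limit rather than truncating passages mid-sentence.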
Response
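The top level of the response includes search_mode (echoed from the request) and total_tokens (the sum of token counts across all returned chunks, useful for LLM prompt budgeting). In this example include_context is false, so context_before and context_after are null: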
```json
{
  "query": "What was the revenue growth in Q1?",
  "total_results": 2,
  "processing_time_ms": 48.3,
  "search_mode": "semantic",
  "total_tokens": 720,
  "results": [
    {
      "chunk_id": "chk_x1y2z3",
      "document_id": "doc_abc123",
      "document_name": "q1-2026-revenue-report.pdf",
      "content": "Revenue grew 23% year-over-year in Q1 2026, driven primarily by expansion in the enterprise segment...",
      "score": 0.87,
      "metadata": {
        "department": "finance",
        "quarter": "Q1-2026"
      },
      "context_before": null,
      "context_after": null
    },
    {
      "chunk_id": "chk_a4b5c6",
      "document_id": "doc_def456",
      "document_name": "board-deck-march-2026.pdf",
      "content": "First quarter results exceeded projections by 8%, with revenue reaching $4.2M...",
      "score": 0.72,
      "metadata": {},
      "context_before": null,
      "context_after": null
    }
  ]
}
```
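If you prefer typed objects over raw dictionaries, the response can be mapped onto dataclasses. The field names below are taken from the example responses above; the class and function names are hypothetical, and any extra fields the server returns are simply ignored.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ContextChunk:
    chunk_id: str
    chunk_index: int
    content: str
    token_count: int


@dataclass
class SearchResult:
    chunk_id: str
    document_id: str
    document_name: str
    content: str
    score: float
    metadata: dict = field(default_factory=dict)
    # None means include_context was false; [] means no neighbours on that side.
    context_before: Optional[list[ContextChunk]] = None
    context_after: Optional[list[ContextChunk]] = None


def _parse_chunks(raw: Optional[list]) -> Optional[list[ContextChunk]]:
    if raw is None:
        return None
    return [ContextChunk(c["chunk_id"], c["chunk_index"], c["content"], c["token_count"]) for c in raw]


def parse_results(data: dict) -> list[SearchResult]:
    """Convert the decoded JSON body (e.g. response.json()) into typed results."""
    return [
        SearchResult(
            chunk_id=r["chunk_id"],
            document_id=r["document_id"],
            document_name=r["document_name"],
            content=r["content"],
            score=r["score"],
            metadata=r.get("metadata") or {},
            context_before=_parse_chunks(r.get("context_before")),
            context_after=_parse_chunks(r.get("context_after")),
        )
        for r in data["results"]
    ]
```

Keeping None distinct from [] preserves the documented difference between "context was not requested" and "this chunk sits at a document edge".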
Tuning Search Results
Precision vs. Recall
- Higher min_score (e.g., 0.7): Fewer results, each more likely to be highly relevant. Good for RAG pipelines where you need precise context.
- Lower min_score (e.g., 0.1) or omitted: More results, including tangentially related passages. Good for exploratory search.
- Smaller limit: Returns only the top matches. Useful when feeding results into an LLM with a limited context window.
- Larger limit: Casts a wider net. Useful when you need comprehensive coverage of a topic.
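To calibrate min_score and limit without repeated API calls, one option is to fetch once with a low min_score and a large limit, then replay stricter settings locally. apply_thresholds is a hypothetical helper. It assumes a score exactly equal to min_score is kept (the parameter table only says results below the threshold are excluded) and that results are ranked by score alone.

```python
def apply_thresholds(results: list[dict], min_score: float = 0.0, limit: int = 10) -> list[dict]:
    """Mimic min_score and limit client-side on an already-fetched result list.

    Results scoring below min_score are dropped; the rest are ranked by score
    (highest first) and truncated to limit. The input list is not modified.
    """
    kept = [r for r in results if r["score"] >= min_score]
    kept.sort(key=lambda r: r["score"], reverse=True)
    return kept[:limit]


if __name__ == "__main__":
    fetched = [{"score": 0.87}, {"score": 0.72}, {"score": 0.25}]  # sample scores
    for threshold in (0.1, 0.3, 0.7):
        print(threshold, len(apply_thresholds(fetched, min_score=threshold)))
```

Once a threshold looks right on a representative sample of queries, set it as min_score in the request so the server stops returning results you would discard anyway.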
Filtering by Document
Use document_ids to restrict search to specific documents. This is useful when you know which documents are relevant and want to avoid noise from unrelated content.
```shell
curl -X POST https://api.inherent.sh/v1/search \
  -H "X-API-Key: $INHERENT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "quarterly revenue",
    "document_ids": ["doc_abc123", "doc_def456"],
    "limit": 10
  }'
```
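If you don't know the relevant IDs up front, one approach is to run an unfiltered search first, see which documents the top hits come from, and pass those IDs in document_ids on follow-up queries. best_score_per_document is a hypothetical helper that operates on the results array from a search response.

```python
def best_score_per_document(results: list[dict]) -> list[tuple[str, str, float]]:
    """Return (document_id, document_name, best_score), highest best_score first."""
    best: dict[str, tuple[str, float]] = {}
    for r in results:
        doc_id = r["document_id"]
        # Keep only the highest-scoring chunk seen so far for each document.
        if doc_id not in best or r["score"] > best[doc_id][1]:
            best[doc_id] = (r["document_name"], r["score"])
    ranked = [(doc_id, name, score) for doc_id, (name, score) in best.items()]
    return sorted(ranked, key=lambda t: t[2], reverse=True)
```

For example, [doc_id for doc_id, _, score in best_score_per_document(results) if score >= 0.5] yields candidate IDs for document_ids; the 0.5 cut-off is an arbitrary starting point.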
Best Practices
- Write natural-language queries for semantic and hybrid modes. "What was the revenue in Q1?" tends to work better than "revenue Q1" because the embedding model captures more meaning from complete sentences.
- Use keyword or hybrid mode for exact tokens. Error codes, SDK function names, and identifiers are better served by BM25 (keyword) or fusion (hybrid) than by pure vector similarity.
- Be specific. "How did the enterprise segment perform in Q1 2026?" returns more targeted results than "enterprise performance."
- Start with a moderate min_score. A threshold of 0.3 is a reasonable starting point; adjust up or down based on result quality.
- Use include_context for RAG. Passing neighbouring chunks to your LLM provides coherent passage context and can reduce hallucination. Use total_tokens to stay within the model's context window.
- Handle empty results gracefully. If results is empty, no chunks matched above your min_score threshold. Try lowering the threshold, rephrasing the query, or switching modes.
- Combine search with context retrieval. Use search to find the most relevant chunks, then use the context retrieval endpoints to get the full document context for your LLM.
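The empty-results advice can be automated as a fallback ladder. This is a hypothetical sketch: search_with_fallback and its steps (relaxing min_score to at most 0.1, then switching to hybrid) are illustrative choices, and search is any callable you supply that sends a request body to POST /v1/search and returns the decoded JSON.

```python
from typing import Callable


def search_with_fallback(search: Callable[[dict], dict], payload: dict) -> tuple[dict, dict]:
    """Try progressively looser requests until one returns results.

    Returns (request_body_used, response). If every step comes back empty,
    the last body and its (empty) response are returned.
    """
    relaxed_score = min(payload.get("min_score", 0.0), 0.1)  # never tighten the threshold
    ladder = [
        payload,  # 1. the original request
        {**payload, "min_score": relaxed_score},  # 2. relax the threshold
        {**payload, "min_score": relaxed_score, "search_mode": "hybrid"},  # 3. mix in BM25
    ]
    response: dict = {"results": []}
    body = payload
    for body in ladder:
        response = search(body)
        if response.get("results"):
            break
    return body, response
```

In production, search can be a thin wrapper around the requests.post call from the Search Request example. Returning the body that succeeded lets you log which step was needed, which is a useful signal for tuning your default min_score and mode.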