MervCodes

Tech Reviews From A Programmer

Building AI-Powered Search for Your App

Search is one of those features users never praise when it works and never forgive when it breaks. For years, "search" meant keyword matching: the user typed a word, and your database looked for rows containing that exact word. It worked, sort of, until someone searched for "cheap flights" and got nothing because your listings all said "affordable airfare." AI-powered search closes that gap. It understands meaning, not just characters. This guide walks through what that actually means and how to build it without drowning in machine learning theory.

What Makes Search "AI-Powered"

Traditional search is lexical. It indexes tokens and matches them. AI-powered search is semantic — it converts text into vectors (long lists of numbers called embeddings) that capture meaning. Two phrases that mean similar things land close together in this numeric space, even if they share no words.

Here's the mental model: imagine every sentence in your app plotted as a point in a giant multidimensional map. "How do I reset my password?" and "I forgot my login credentials" sit right next to each other on that map, while "chocolate cake recipe" sits far away. Semantic search finds the nearest points to the user's query rather than the points that share the same letters.
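To make the map concrete, here is a minimal sketch of "nearest point" lookup using cosine similarity. The three-dimensional vectors are made up for illustration (real embedding models emit hundreds or thousands of dimensions), and the helper names are hypothetical.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-written toy "embeddings", NOT output from a real model.
points = {
    "How do I reset my password?": [0.9, 0.1, 0.0],
    "I forgot my login credentials": [0.8, 0.2, 0.1],
    "chocolate cake recipe": [0.0, 0.1, 0.9],
}

def nearest(query_vec: list[float], points: dict[str, list[float]]) -> str:
    """Return the stored text whose vector is most similar to the query vector."""
    return max(points, key=lambda text: cosine_similarity(query_vec, points[text]))
```

With these toy numbers, the two password sentences score far closer to each other than either does to the cake recipe, which is the whole idea behind semantic search.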

This unlocks three things keyword search can't do well:

  • Synonym handling without maintaining synonym lists
  • Conceptual matching, so "car" surfaces results about "vehicles" and "automobiles"
  • Natural-language queries, where users type full questions instead of guessing keywords

The Core Architecture

Most AI search systems follow the same pipeline. Understand these four stages and you understand the whole thing.

1. Chunking. You can't embed a 40-page document as one vector — you'd lose all detail. Split content into chunks (a paragraph, a few sentences, or a fixed token window). Chunk size is a real tradeoff: smaller chunks give precise matches but lose context; larger chunks keep context but blur relevance.

2. Embedding. Run each chunk through an embedding model to get its vector. Do this once, at indexing time, and store the result. You'll use the same model to embed user queries at search time — mixing models produces garbage because their vector spaces don't align.

3. Storage and indexing. Store vectors in a vector database (Pinecone, Weaviate, Qdrant, pgvector for Postgres users) that supports fast nearest-neighbor lookup. These use approximate nearest neighbor (ANN) algorithms so search stays fast even across millions of vectors.

4. Retrieval. Embed the incoming query, find the closest stored vectors, and return the associated content. Optionally re-rank the top results with a more expensive, more accurate model.
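The chunking stage is the one most often hand-rolled, so here is one possible sketch: a fixed word window with overlap. Splitting on whitespace is a simplifying assumption (production systems often count model tokens or split on paragraphs), and the function name and default sizes are illustrative, not recommendations.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some words twice.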

A Minimal Implementation

Here's the shape of it in Python, using an embedding API and pgvector:

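import anthropic  # for downstream generation
from openai import OpenAI  # embeddings example
import psycopg  # psycopg 3: its connections expose execute() directly

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return resp.data[0].embedding

def to_pgvector(vector: list[float]) -> str:
    # pgvector accepts the text form "[0.1,0.2,...]" when cast with ::vector
    return "[" + ",".join(str(x) for x in vector) + "]"

# Indexing: store each chunk's vector
def index_chunk(conn, doc_id, chunk):
    vector = embed(chunk)
    conn.execute(
        "INSERT INTO chunks (doc_id, content, embedding) VALUES (%s, %s, %s::vector)",
        (doc_id, chunk, to_pgvector(vector)),
    )
    conn.commit()  # psycopg does not autocommit by default

# Searching: find the nearest chunks
def search(conn, query, limit=5):
    qvec = embed(query)
    return conn.execute(
        "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
        (to_pgvector(qvec), limit),
    ).fetchall()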

The <=> operator is pgvector's cosine-distance operator: smaller values mean closer vectors, so ordering by it puts the best matches first. That single line is doing the "find nearest points on the map" work. Everything else is plumbing.
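The queries above assume a chunks table already exists. One possible schema is sketched below; the column types, the index name, and the choice of an HNSW index (available in pgvector 0.5.0 and later) are assumptions rather than something this guide prescribes. The 1536 dimension matches the default output size of text-embedding-3-small, and conn is assumed to be any connection object exposing execute() and commit(), such as a psycopg 3 connection.

```python
EMBEDDING_DIM = 1536  # default output size of text-embedding-3-small

# Hypothetical schema; adjust column types to your own data model.
SCHEMA_STATEMENTS = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    f"""CREATE TABLE IF NOT EXISTS chunks (
        id bigserial PRIMARY KEY,
        doc_id text NOT NULL,
        content text NOT NULL,
        embedding vector({EMBEDDING_DIM})
    )""",
    # vector_cosine_ops matches the <=> (cosine distance) operator
    "CREATE INDEX IF NOT EXISTS chunks_embedding_idx "
    "ON chunks USING hnsw (embedding vector_cosine_ops)",
]

def create_schema(conn) -> None:
    """Run each DDL statement in order, then commit."""
    for statement in SCHEMA_STATEMENTS:
        conn.execute(statement)
    conn.commit()
```

The index operator class has to match the distance operator used in the query, otherwise Postgres falls back to a sequential scan.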

Hybrid Search: The Pragmatic Default

Pure semantic search has a weakness: it's fuzzy about exact matches. If a user searches for a specific SKU, error code, or product name, embeddings may return conceptually similar items instead of the exact one. Keyword search nails those cases.

The professional answer is hybrid search: run both a keyword (BM25/full-text) query and a semantic query, then merge the rankings. A common merge strategy is Reciprocal Rank Fusion, which combines the two ranked lists without needing the scores to be on the same scale. Most production search — the kind that actually satisfies users — is hybrid, not purely semantic. Start hybrid and you'll skip a painful round of "why can't it find the exact part number" bug reports.
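Reciprocal Rank Fusion is small enough to sketch in a few lines. Each document earns 1 / (k + rank) from every list it appears in, and the sums decide the final order. The constant k = 60 is the value commonly used for RRF; the function name and the document IDs in the usage note are illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

For example, fusing a keyword list ["sku-123", "a", "b"] with a semantic list ["c", "sku-123", "a"] puts "sku-123" first, because it ranks highly in both lists even though neither score scale was consulted.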

Adding Generation (RAG)

Once you can retrieve relevant chunks, you're one step from a conversational answer. Retrieval-Augmented Generation (RAG) feeds the retrieved chunks to a language model like Claude and asks it to answer using only that context. Retrieval grounds the model in your actual data, which dramatically reduces hallucination.

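# Assumes `conn` and `query` are in scope and search() is defined as above.
# Join the retrieved rows into a single context string.
context = "\n\n".join(row[0] for row in search(conn, query))

msg = anthropic.Anthropic().messages.create(
    model="claude-opus-4-8",  # substitute a model ID available to your account
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"Answer using only this context:\n{context}\n\nQuestion: {query}",
    }],
)
answer = msg.content[0].text  # text of the first content block in the reply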

The key discipline: instruct the model to say "I don't know" when the context doesn't contain the answer, and cite which chunks it used. An answer with a source is trustworthy; a confident answer from nowhere is a liability.
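One way to apply that discipline is to number the chunks in the prompt so the model has something to cite. The sketch below is only a prompt builder; the wording of the instructions and the function name are assumptions to adapt, not a tested recipe.

```python
def build_grounded_prompt(query: str, chunks: list[str]) -> str:
    """Build a prompt that numbers each chunk and asks for citations or a refusal."""
    numbered = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the numbered context below.\n"
        "Cite the chunks you used by number, like [1].\n"
        'If the context does not contain the answer, reply exactly: "I don\'t know."\n\n'
        f"Context:\n{numbered}\n\nQuestion: {query}"
    )
```

Because the chunk numbers map back to rows you retrieved, the citations in the answer can be turned into links to the source content.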

Practical Advice From the Trenches

Measure relevance before you optimize. Build a small evaluation set of real queries with known-correct results. Without it, every tuning change is a guess. Track metrics like recall@k (did the right answer appear in the top k?).
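A minimal sketch of that metric, using the definition given above (a query counts as a hit if any known-correct result appears in its top k). The data shapes and names are assumptions: results maps each query to the ranked IDs your search returned, and expected maps each query to the set of IDs a human judged correct.

```python
def recall_at_k(
    results: dict[str, list[str]],
    expected: dict[str, set[str]],
    k: int = 5,
) -> float:
    """Fraction of evaluation queries with at least one correct ID in the top k results."""
    if not expected:
        return 0.0
    hits = sum(
        1
        for query, relevant in expected.items()
        if relevant & set(results.get(query, [])[:k])
    )
    return hits / len(expected)
```

Run it before and after each tuning change; if the number does not move, the change probably did not help.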

Cache aggressively. Embedding the same query repeatedly wastes money and latency. Cache query embeddings and popular results.
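For query embeddings, an in-process cache can be as simple as the sketch below. It wraps whatever embedding function you already have (passed in as embed_fn, a hypothetical parameter) and normalizes case and whitespace so trivially different queries share an entry. Lowercasing is an assumption that may not suit case-sensitive content such as part numbers, and a multi-process deployment would need a shared cache instead.

```python
from functools import lru_cache

def make_cached_embedder(embed_fn, maxsize: int = 10_000):
    """Wrap embed_fn so repeated queries are embedded only once per process."""

    @lru_cache(maxsize=maxsize)
    def cached(normalized_query: str) -> tuple[float, ...]:
        # Store a tuple so callers cannot mutate the cached value.
        return tuple(embed_fn(normalized_query))

    def embed_query(query: str) -> list[float]:
        normalized = " ".join(query.lower().split())
        return list(cached(normalized))

    return embed_query
```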

Watch your chunking. More search failures trace back to bad chunking than to the model. If answers feel truncated or context-free, revisit chunk size and add overlap between chunks.

Keep the index fresh. When source content changes, re-embed it. Stale vectors silently serve outdated answers. Build re-indexing into your content update flow from day one.
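One cheap way to detect stale vectors is to store a hash of each document's text alongside its embeddings and compare on every content update. The sketch below assumes you can produce two dictionaries: the current text per document ID and the hash recorded at the last indexing run. Both structures and the function names are hypothetical.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_to_reindex(
    current: dict[str, str],
    indexed_hashes: dict[str, str],
) -> tuple[list[str], list[str]]:
    """Return (IDs that are new or changed, IDs that were deleted from the source)."""
    changed = [
        doc_id
        for doc_id, text in current.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
    deleted = [doc_id for doc_id in indexed_hashes if doc_id not in current]
    return changed, deleted
```

Only the changed documents need re-chunking and re-embedding, and the deleted ones should have their chunks removed so they stop appearing in results.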

Mind the cost curve. Embedding is cheap; generation is not. Retrieve with cheap embeddings, and only invoke the expensive LLM when you actually need a synthesized answer.

Don't over-engineer the start. For under ~100k documents, pgvector on your existing Postgres is plenty. Reach for a dedicated vector database when scale, filtering complexity, or latency demands it — not before.

Common Pitfalls

The biggest one is treating AI search as a drop-in replacement rather than a complement. Users still expect filters, sorting, and exact matches to work. Semantic search should augment those, not replace them. The second is ignoring latency: embedding, retrieval, re-ranking, and generation each add delay, from milliseconds for a vector lookup to potentially seconds for generation, and the delays compound. Set a latency budget and profile against it.
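Profiling against a budget does not need special tooling to get started. The sketch below times named stages with a context manager and reports whether the total exceeds the budget; the class name and the stage names in the usage note are illustrative.

```python
import time
from contextlib import contextmanager

class LatencyBudget:
    """Accumulate per-stage wall-clock time and compare the total to a budget."""

    def __init__(self, budget_ms: float):
        self.budget_ms = budget_ms
        self.stages: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.stages[name] = self.stages.get(name, 0.0) + elapsed_ms

    @property
    def total_ms(self) -> float:
        return sum(self.stages.values())

    def over_budget(self) -> bool:
        return self.total_ms > self.budget_ms
```

Wrap each step in its own block, for example `with budget.stage("embed"):` and `with budget.stage("retrieve"):`, then log budget.stages for any request where over_budget() returns True to see which stage is responsible.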

FAQ

Do I need to train my own model? Almost never. Off-the-shelf embedding models and hosted LLMs cover the vast majority of use cases. Fine-tuning is a last resort for highly specialized domains, and even then, better chunking and hybrid search usually deliver more improvement for less effort.

How much does this cost to run? Embeddings are inexpensive, often a fraction of a cent per thousand tokens. You pay for document embeddings once at indexing time, plus a small per-query embedding cost at search time that caching can reduce. The larger cost is generation if you add RAG. Cache results and only generate answers when needed to keep bills predictable.

Vector database or just Postgres? Start with pgvector if you already use Postgres and have fewer than a few hundred thousand documents. It keeps your stack simple. Move to a dedicated vector database when you need advanced metadata filtering, horizontal scale, or sub-50ms retrieval across millions of vectors.

How do I stop the AI from making things up? Ground it with retrieval, instruct it to answer only from provided context, require citations, and give it explicit permission to say "I don't know." Test with adversarial queries whose answers aren't in your data and confirm it declines rather than inventing.

What's the single most impactful thing to get right? Chunking and your evaluation set. Good chunks give the model something coherent to match and cite, and a real eval set tells you whether any change actually helped. Nail those two and everything else is tuning.

Wrapping Up

AI-powered search isn't magic, and it isn't research-grade machine learning either. It's a well-understood pipeline: chunk, embed, store, retrieve — with hybrid ranking for robustness and optional generation for conversational answers. Start small with a hybrid approach on your existing database, measure relevance with real queries, and layer in complexity only when your metrics justify it. Your users won't thank you for it, but they'll stop searching for "affordable airfare" and getting nothing — and that silence is the highest praise search ever gets.
