SS.001 Semantic Search

Engineering notes

How semantic search actually works.


One Vercel deploy, no external vector database. Every search runs as a Node.js serverless function. Below is the complete pipeline from raw text to ranked, highlighted catalog cards, along with the engineering decisions and the known gaps.

Pipeline diagram

  ┌───────────────────────────────────────────────────────────────────┐
  │  POST /api/index                                                   │
  │                                                                   │
  │  { docs: [ { title, content, url? } … ] }   ← JSON body          │
  │          │                                                        │
  │          ▼                                                        │
  │  [01] Zod validation                                              │
  │       • max 50 docs  • max 200,000 total chars                    │
  │       • title max 500 chars, content max 100,000 chars            │
  │          │                                                        │
  │          ▼                                                        │
  │  [02] Rate limit — Upstash sliding window 100 req/day/IP          │
  │          │                                                        │
  │          ▼                                                        │
  │  [03] Chunk text — ~500 tok / 50 tok overlap                      │
  │       sentence-boundary preference, word-boundary fallback        │
  │          │                                                        │
  │          ▼                                                        │
  │  [04] Batch embed — Vertex text-embedding-004 (768-dim)           │
  │       up to 10 chunks / 8,000 est-tokens per batch                │
  │       exponential backoff on 429 / UNAVAILABLE                    │
  │          │                                                        │
  │          ▼                                                        │
  │  [05] Store — in-memory module-level array of StoredChunk         │
  │       { id, docIndex, chunkIndex, title, url, text, vector }      │
  └───────────────────────────────────────────────────────────────────┘

  ┌───────────────────────────────────────────────────────────────────┐
  │  POST /api/search                                                  │
  │                                                                   │
  │  { query: string 2–300 chars }              ← JSON body          │
  │          │                                                        │
  │          ▼                                                        │
  │  [01] Zod validation + rate limit (shared window)                 │
  │          │                                                        │
  │          ▼                                                        │
  │  [02] Embed query — Vertex text-embedding-004                     │
  │          │                                                        │
  │          ▼                                                        │
  │  [03] Cosine similarity over all stored chunk vectors             │
  │       score = (A·B) / (|A||B|)                                    │
  │          │                                                        │
  │          ▼                                                        │
  │  [04] Deduplicate — one result per source document (best chunk)   │
  │       Return top-5 by score                                       │
  │          │                                                        │
  │          ▼                                                        │
  │  [05] Highlight — HTML-escape first, then insert <mark> tags      │
  │       ESCAPE → MARK order prevents XSS from doc content           │
  │          │                                                        │
  │          ▼                                                        │
  │  { results: [ { title, url?, snippet (HTML), score } ] }         │
  └───────────────────────────────────────────────────────────────────┘

The pipeline, step by step.

  1. Zod input validation

    Every API request is parsed through a strict zod schema before any processing begins. For /api/index: maximum 50 documents, 200,000 total characters, 100,000 chars per document, 500-char title cap. For /api/search: query must be 2–300 characters. Malformed or oversized requests are rejected immediately with a typed error code — no stack traces reach the client.
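
The caps above can be sketched as a plain function. The app itself uses a zod schema; this hand-written check deliberately avoids the dependency so it runs on its own, and it only illustrates the limits. All identifiers and error codes are hypothetical, and counting only content toward the 200,000-character budget is an assumption.

```typescript
// Caps quoted in the text. Identifiers and error codes are illustrative only.
const MAX_DOCS = 50;
const MAX_TOTAL_CHARS = 200_000;
const MAX_CONTENT_CHARS = 100_000;
const MAX_TITLE_CHARS = 500;

interface Doc {
  title: string;
  content: string;
  url?: string;
}

type IndexValidation =
  | { ok: true; docs: Doc[] }
  | { ok: false; code: string };

function validateIndexBody(body: unknown): IndexValidation {
  const docs = (body as { docs?: unknown } | null | undefined)?.docs;
  if (!Array.isArray(docs)) return { ok: false, code: "INVALID_BODY" };
  if (docs.length > MAX_DOCS) return { ok: false, code: "TOO_MANY_DOCS" };

  const clean: Doc[] = [];
  let totalChars = 0;
  for (const raw of docs) {
    if (typeof raw !== "object" || raw === null) return { ok: false, code: "INVALID_DOC" };
    const { title, content, url } = raw as Record<string, unknown>;
    if (typeof title !== "string" || typeof content !== "string") return { ok: false, code: "INVALID_DOC" };
    if (url !== undefined && typeof url !== "string") return { ok: false, code: "INVALID_DOC" };
    if (title.length > MAX_TITLE_CHARS) return { ok: false, code: "TITLE_TOO_LONG" };
    if (content.length > MAX_CONTENT_CHARS) return { ok: false, code: "CONTENT_TOO_LONG" };
    // Assumption: the total budget counts content characters only.
    totalChars += content.length;
    if (totalChars > MAX_TOTAL_CHARS) return { ok: false, code: "TOTAL_TOO_LARGE" };
    clean.push(url === undefined ? { title, content } : { title, content, url });
  }
  return { ok: true, docs: clean };
}

// Search queries must be 2 to 300 characters.
function validateQuery(query: unknown): boolean {
  return typeof query === "string" && query.length >= 2 && query.length <= 300;
}
```

Returning a small result object with a code, rather than throwing, is one way to keep stack traces out of the response body.
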
  2. Rate limiting

    Upstash Redis sliding-window limiter: 100 requests per IP per 24 hours, key prefix rl:search. When Upstash is not configured (development mode), the limiter degrades to a no-op, so the app never hard-crashes on missing env vars. The client IP is read from the first entry of x-forwarded-for, falling back to x-real-ip.
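
A minimal sketch of the two dependency-free pieces: picking the client IP and falling back to a no-op limiter. The `Limiter` interface, `pickLimiter`, and the "unknown" fallback key are hypothetical names for illustration; the real limiter would be the Upstash one.

```typescript
type HeaderBag = Record<string, string | undefined>;

// First entry of x-forwarded-for, else x-real-ip.
// The "unknown" fallback key is an assumption, not from the app.
function clientIp(headers: HeaderBag): string {
  const forwarded = headers["x-forwarded-for"];
  if (forwarded) {
    const first = forwarded.split(",")[0]?.trim();
    if (first) return first;
  }
  const real = headers["x-real-ip"]?.trim();
  return real || "unknown";
}

// Hypothetical minimal limiter shape: resolves to whether the request may proceed.
interface Limiter {
  limit(key: string): Promise<{ success: boolean }>;
}

// Used when no limiter is configured: every request is allowed.
const noopLimiter: Limiter = {
  limit: async () => ({ success: true }),
};

function pickLimiter(configured: Limiter | undefined): Limiter {
  return configured ?? noopLimiter;
}
```

A handler would then call `pickLimiter(maybeLimiter).limit(clientIp(headers))` and reject the request when `success` is false.
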
  3. Text chunking

    Long documents are split into ~500-token chunks (≈2000 chars) with a 50-token overlap (≈200 chars). Chunking prefers sentence boundaries (". "), falling back to word boundaries, keeping semantically coherent windows. Overlap ensures that a concept split across a boundary still appears in full in at least one chunk.
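
One way the chunker could look, working in characters (2000 per chunk, 200 overlap) as a stand-in for tokens. `chunkText` and its parameters are hypothetical names, and in this sketch the overlap begins at a raw character offset rather than a word boundary.

```typescript
function chunkText(text: string, maxChars = 2000, overlapChars = 200): string[] {
  if (overlapChars >= maxChars) {
    throw new RangeError("overlapChars must be smaller than maxChars");
  }
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + maxChars, text.length);
    if (end < text.length) {
      const window = text.slice(start, end);
      const sentence = window.lastIndexOf(". ");
      const word = window.lastIndexOf(" ");
      // A cut must land past the overlap, otherwise the next start would not advance.
      const minCut = overlapChars + 1;
      if (sentence + 1 >= minCut) {
        end = start + sentence + 1; // keep the period in this chunk
      } else if (word >= minCut) {
        end = start + word; // word-boundary fallback
      }
      // Otherwise: hard cut at maxChars.
    }
    const piece = text.slice(start, end).trim();
    if (piece) chunks.push(piece);
    if (end >= text.length) break;
    start = end - overlapChars; // step back so neighbouring chunks share text
  }
  return chunks;
}
```

Text shorter than `maxChars` comes back as a single chunk, and an empty string yields no chunks.
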
  4. Batched embedding via Vertex AI

    Chunks are embedded in batches of up to 10 items / 8,000 estimated tokens using Google's text-embedding-004 model (768-dimensional output). Batching is necessary because Vertex has per-request token limits. The estimator is deliberately pessimistic (chars/2) to handle Vietnamese, CJK, and dense technical text without overrunning the quota. Transient errors (429, RESOURCE_EXHAUSTED, UNAVAILABLE) trigger exponential backoff up to 5 retries.
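
The batching and retry rules can be sketched without touching the Vertex SDK. The function names, the 500 ms base delay, and the 8 s cap are assumptions; only the 10-item and 8,000-token limits, the chars/2 estimator, the transient error names, and the 5-retry ceiling come from the text.

```typescript
const MAX_ITEMS = 10;
const MAX_EST_TOKENS = 8000;

// Pessimistic estimate: one token per two characters.
const estimateTokens = (text: string): number => Math.ceil(text.length / 2);

function makeBatches(chunks: string[]): string[][] {
  const batches: string[][] = [];
  let current: string[] = [];
  let tokens = 0;
  for (const chunk of chunks) {
    const cost = estimateTokens(chunk);
    const full = current.length >= MAX_ITEMS || tokens + cost > MAX_EST_TOKENS;
    if (current.length > 0 && full) {
      batches.push(current);
      current = [];
      tokens = 0;
    }
    current.push(chunk);
    tokens += cost;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

// Delay doubles per attempt and is capped (base and cap are assumed values).
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

function isTransient(message: string): boolean {
  return /429|RESOURCE_EXHAUSTED|UNAVAILABLE/.test(message);
}

// Retries fn on transient errors, up to maxRetries times after the first attempt.
async function withRetry<T>(fn: () => Promise<T>, maxRetries = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      if (attempt >= maxRetries || !isTransient(message)) throw err;
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
}
```

With 2000-character chunks the estimator charges 1,000 tokens each, so the token budget (8 chunks) binds before the 10-item limit does.
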
  5. In-memory vector store

    Vectors are stored in a module-level array in the serverless function. Each StoredChunk carries its original text, title, optional URL, and the full 768-dimensional float vector. The store is ephemeral: it resets on cold start, instance recycling, or redeployment. The production upgrade path is Supabase pgvector: ranking becomes an ORDER BY on the <=> operator (cosine distance, which is 1 minus cosine similarity), and with a vector index the top-k is a single approximate-nearest-neighbor (ANN) scan.
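
The shape of the store, as a sketch. Field names follow the diagram; the field types, the `replaceStore` and `getStore` helpers, and the choice to replace rather than append on re-index are assumptions.

```typescript
interface StoredChunk {
  id: string; // assumed type; the diagram only names the field
  docIndex: number;
  chunkIndex: number;
  title: string;
  url?: string;
  text: string;
  vector: number[]; // 768 floats from the embedding model
}

// Module-level state: survives only as long as the warm function instance.
let store: StoredChunk[] = [];

// Assumption: each index call replaces the previous contents.
function replaceStore(chunks: StoredChunk[]): void {
  store = chunks;
}

function getStore(): readonly StoredChunk[] {
  return store;
}
```

Because the array lives at module scope, nothing needs to be torn down: a cold start simply begins with an empty store again.
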
  6. Cosine similarity ranking

    The query is embedded with the same model. Cosine similarity is computed against every stored chunk vector: score = (A·B) / (|A|·|B|). Chunks are ranked descending; one result per source document is returned (best-scoring chunk wins), preventing a single long document from dominating the top-5 results.
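
The ranking step is small enough to show in full. This is a sketch under assumed names (`Chunk`, `Hit`, `rankChunks`); it does a brute-force scan, keeps the best chunk per `docIndex`, and returns the top 5.

```typescript
interface Chunk {
  docIndex: number;
  title: string;
  text: string;
  vector: number[];
}

interface Hit {
  docIndex: number;
  title: string;
  text: string;
  score: number;
}

// score = (A·B) / (|A|·|B|); returns 0 if either vector has zero length.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new RangeError("vector dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

function rankChunks(query: number[], chunks: Chunk[], topK = 5): Hit[] {
  const bestPerDoc = new Map<number, Hit>();
  for (const c of chunks) {
    const score = cosine(query, c.vector);
    const prev = bestPerDoc.get(c.docIndex);
    if (!prev || score > prev.score) {
      bestPerDoc.set(c.docIndex, { docIndex: c.docIndex, title: c.title, text: c.text, score });
    }
  }
  return Array.from(bestPerDoc.values())
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

Deduplicating before slicing matters: slicing first could fill all five slots with chunks from one document and then collapse them to a single result.
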
  7. XSS-safe snippet extraction

    This is the most security-critical step. Document content is untrusted — it may contain <script>, <img onerror=…>, or other injection payloads. The pipeline always HTML-escapes the raw text window before inserting any <mark> tags. Escape-then-mark means the <mark> tags are constructed by the application, never from document content. A Vitest spec verifies this invariant.
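
A sketch of how the invariant can be kept, not the app's own code. This variation finds query terms in the raw window and escapes every segment as it is emitted, so document text is always escaped before it sits next to a <mark> tag, and a term such as "amp" cannot match inside an entity like &amp;. The function names and the whitespace-based term splitting are assumptions.

```typescript
function escapeHtml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Escape regex metacharacters so query terms are matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function highlight(rawWindow: string, query: string): string {
  const terms = query
    .split(/\s+/)
    .filter((t) => t.length > 0)
    .map(escapeRegExp);
  if (terms.length === 0) return escapeHtml(rawWindow);

  const pattern = new RegExp(terms.join("|"), "gi");
  let out = "";
  let last = 0;
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(rawWindow)) !== null) {
    // Text before the match and the match itself are both escaped;
    // only the <mark> tags are written by the application.
    out += escapeHtml(rawWindow.slice(last, m.index));
    out += "<mark>" + escapeHtml(m[0]) + "</mark>";
    last = m.index + m[0].length;
  }
  return out + escapeHtml(rawWindow.slice(last));
}
```

A test in the spirit of the Vitest spec mentioned above would feed in a window containing <script> and assert that the only raw tags in the output are <mark> and </mark>.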

Security stance

What is defended. What is not.

Defended

  • Input size caps (docs, chars, query length) — enforced by zod
  • Rate limiting by IP via Upstash sliding window
  • XSS from doc content — HTML-escape before mark insertion, Vitest spec covers it
  • No stack traces to the client — typed error codes only
  • Secrets server-only — GCP credentials never reach the browser or git
  • No LLM output in snippet paths — only embeddings, no generative model

Known gaps

  • In-memory store is ephemeral and not multi-tenant isolated (documented above)
  • No authentication — anyone with the URL can index and search
  • No CSRF token on the index endpoint (SameSite cookies would mitigate this if cookie-based auth is added)
  • Embedding cost not capped per user — only rate-limited by request count

Next step

Want this for your product?

This demo uses an in-memory vector store for zero-dependency simplicity. The production version swaps in Supabase pgvector and adds per-user partitioning, session-keyed rate limits, and an authenticated API. If you have a semantic-search or RAG problem (internal docs, customer knowledge bases, code search), email me with the scale and I'll reply within 24 hours.
