Engineering notes

How semantic search actually works.

One Vercel deploy, no external vector database. Every search runs as a Node.js serverless function. Below is the complete pipeline from raw text to ranked, highlighted catalog cards, along with the engineering decisions behind it and the known gaps.

Pipeline diagram

  ┌───────────────────────────────────────────────────────────────────┐
  │  POST /api/index                                                   │
  │                                                                   │
  │  { docs: [ { title, content, url? } … ] }   ← JSON body          │
  │          │                                                        │
  │          ▼                                                        │
  │  [01] Zod validation                                              │
  │       • max 50 docs  • max 200,000 total chars                    │
  │       • title max 500 chars, content max 100,000 chars            │
  │          │                                                        │
  │          ▼                                                        │
  │  [02] Rate limit — Upstash sliding window 100 req/day/IP          │
  │          │                                                        │
  │          ▼                                                        │
  │  [03] Chunk text — ~500 tok / 50 tok overlap                      │
  │       sentence-boundary preference, word-boundary fallback        │
  │          │                                                        │
  │          ▼                                                        │
  │  [04] Batch embed — Vertex text-embedding-004 (768-dim)           │
  │       up to 10 chunks / 8 000 est-tokens per batch                │
  │       exponential backoff on 429 / UNAVAILABLE                    │
  │          │                                                        │
  │          ▼                                                        │
  │  [05] Store — in-memory module-level array of StoredChunk         │
  │       { id, docIndex, chunkIndex, title, url, text, vector }      │
  └───────────────────────────────────────────────────────────────────┘

  ┌───────────────────────────────────────────────────────────────────┐
  │  POST /api/search                                                  │
  │                                                                   │
  │  { query: string 2–300 chars }              ← JSON body          │
  │          │                                                        │
  │          ▼                                                        │
  │  [01] Zod validation + rate limit (shared window)                 │
  │          │                                                        │
  │          ▼                                                        │
  │  [02] Embed query — Vertex text-embedding-004                     │
  │          │                                                        │
  │          ▼                                                        │
  │  [03] Cosine similarity over all stored chunk vectors             │
  │       score = (A·B) / (|A||B|)                                    │
  │          │                                                        │
  │          ▼                                                        │
  │  [04] Deduplicate — one result per source document (best chunk)   │
  │       Return top-5 by score                                       │
  │          │                                                        │
  │          ▼                                                        │
  │  [05] Highlight — HTML-escape first, then insert <mark> tags      │
  │       ESCAPE → MARK order prevents XSS from doc content           │
  │          │                                                        │
  │          ▼                                                        │
  │  { results: [ { title, url?, snippet (HTML), score } ] }         │
  └───────────────────────────────────────────────────────────────────┘

The pipeline, step by step.

01
Zod input validation
Every API request is parsed through a strict Zod schema before any processing begins. For /api/index: at most 50 documents, 200,000 total characters, 100,000 characters per document, and a 500-character title cap. For /api/search: the query must be 2–300 characters. Malformed or oversized requests are rejected immediately with a typed error code; no stack traces reach the client.
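The limits above can be sketched without any dependency. This is a plain-TypeScript stand-in for illustration, not the Zod schema itself: the function name, the error-code strings, and the choice to count only content (not titles) toward the 200,000-character total are all assumptions.

```typescript
// Dependency-free stand-in for the /api/index limits. Names and error codes are hypothetical.
const MAX_DOCS = 50;
const MAX_TOTAL_CHARS = 200_000;
const MAX_TITLE_CHARS = 500;
const MAX_CONTENT_CHARS = 100_000;

interface DocInput {
  title: string;
  content: string;
  url?: string;
}

type ValidationResult =
  | { ok: true; docs: DocInput[] }
  | { ok: false; code: string }; // a typed code, never a stack trace

function fail(code: string): ValidationResult {
  return { ok: false, code };
}

function validateIndexBody(body: unknown): ValidationResult {
  if (typeof body !== "object" || body === null) return fail("INVALID_BODY");
  const docs = (body as { docs?: unknown }).docs;
  if (!Array.isArray(docs)) return fail("INVALID_DOCS");
  if (docs.length > MAX_DOCS) return fail("TOO_MANY_DOCS");

  const out: DocInput[] = [];
  let totalChars = 0; // assumption: only content counts toward the total cap
  for (const raw of docs) {
    if (typeof raw !== "object" || raw === null) return fail("INVALID_DOC");
    const doc = raw as Record<string, unknown>;
    const title = doc.title;
    const content = doc.content;
    const url = doc.url;
    if (typeof title !== "string" || typeof content !== "string") return fail("INVALID_DOC");
    if (url !== undefined && typeof url !== "string") return fail("INVALID_DOC");
    if (title.length > MAX_TITLE_CHARS) return fail("TITLE_TOO_LONG");
    if (content.length > MAX_CONTENT_CHARS) return fail("CONTENT_TOO_LONG");
    totalChars += content.length;
    if (totalChars > MAX_TOTAL_CHARS) return fail("TOTAL_TOO_LARGE");
    out.push(typeof url === "string" ? { title, content, url } : { title, content });
  }
  return { ok: true, docs: out };
}
```

The cross-document total is the one check that cannot live on a single field, which is why it runs as a counter inside the loop.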
02
Rate limiting
An Upstash Redis sliding-window limiter allows 100 requests per IP per 24 hours under the key prefix rl:search; the same window is shared by /api/index and /api/search. When Upstash is not configured (development mode), the limiter degrades to a no-op, so the app never hard-crashes on missing env vars. The client IP is read from the first hop of x-forwarded-for, falling back to x-real-ip.
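A minimal sketch of the IP lookup described above. The function names and the "unknown" fallback are assumptions, and header names are assumed to arrive lower-cased, as Node.js delivers them.

```typescript
// Hypothetical helper: resolve the client IP that keys the rate limiter.
type HeaderValue = string | string[] | undefined;

function firstValue(value: HeaderValue): string | undefined {
  return Array.isArray(value) ? value[0] : value;
}

function clientIp(headers: Record<string, HeaderValue>): string {
  // x-forwarded-for may be "client, proxy1, proxy2"; only the first hop is used.
  const forwarded = firstValue(headers["x-forwarded-for"]);
  if (forwarded) {
    const firstHop = forwarded.split(",")[0]?.trim();
    if (firstHop) return firstHop;
  }
  const realIp = firstValue(headers["x-real-ip"])?.trim();
  if (realIp) return realIp;
  return "unknown"; // assumption: requests without either header share one bucket
}
```

For example, `clientIp({ "x-forwarded-for": "203.0.113.7, 10.0.0.1" })` yields `"203.0.113.7"`.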
03
Text chunking
Long documents are split into chunks of about 500 tokens (≈2,000 chars) with a 50-token overlap (≈200 chars). The chunker prefers to break at sentence boundaries (". ") and falls back to word boundaries, which keeps each window semantically coherent. The overlap means that a short passage straddling a boundary still appears in full in at least one chunk.
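One way to implement this chunking, as a sketch rather than the project's actual code. The function name and the rule that a boundary is only accepted in the second half of the window (to avoid tiny chunks) are assumptions; sizes are in characters, using the ≈4 chars per token figure above.

```typescript
const CHUNK_CHARS = 2000;  // ≈500 tokens
const OVERLAP_CHARS = 200; // ≈50 tokens

// Hypothetical chunker: sentence boundary first, word boundary second, hard cut last.
function chunkText(text: string, size = CHUNK_CHARS, overlap = OVERLAP_CHARS): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + size, text.length);
    if (end < text.length) {
      const window = text.slice(start, end);
      const sentence = window.lastIndexOf(". ");
      const word = window.lastIndexOf(" ");
      if (sentence > size / 2) {
        end = start + sentence + 1; // keep the period, drop the space
      } else if (word > size / 2) {
        end = start + word;
      }
      // Otherwise fall through to a hard cut at `size` characters.
    }
    const chunk = text.slice(start, end).trim();
    if (chunk.length > 0) chunks.push(chunk);
    if (end >= text.length) break;
    // Step back by the overlap, but always move forward by at least one character.
    start = Math.max(end - overlap, start + 1);
  }
  return chunks;
}
```

In this sketch the overlap is applied by character offset, so a chunk can begin mid-word; snapping the start to a word boundary would be a natural refinement.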
04
Batched embedding via Vertex AI
Chunks are embedded in batches of up to 10 items or 8,000 estimated tokens, whichever limit is reached first, using Google's text-embedding-004 model (768-dimensional output). Batching is necessary because Vertex enforces per-request token limits. The token estimator is deliberately pessimistic (chars / 2) so that Vietnamese, CJK, and dense technical text do not overrun those limits. Transient errors (429, RESOURCE_EXHAUSTED, UNAVAILABLE) trigger exponential backoff with up to 5 retries.
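The batching and retry logic might look like the following sketch. The function names, the 500 ms base delay, and the caller-supplied `isTransient` predicate are assumptions; the actual Vertex call is left to the caller so that nothing about its client API is implied here.

```typescript
const MAX_ITEMS_PER_BATCH = 10;
const MAX_EST_TOKENS_PER_BATCH = 8000;

// Pessimistic estimate: two characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 2);
}

// Greedy packing: start a new batch when either cap would be exceeded.
function makeBatches(chunks: string[]): string[][] {
  const batches: string[][] = [];
  let current: string[] = [];
  let tokens = 0;
  for (const chunk of chunks) {
    const cost = estimateTokens(chunk);
    const full =
      current.length >= MAX_ITEMS_PER_BATCH || tokens + cost > MAX_EST_TOKENS_PER_BATCH;
    if (current.length > 0 && full) {
      batches.push(current);
      current = [];
      tokens = 0;
    }
    current.push(chunk); // an oversized chunk still goes out alone in its own batch
    tokens += cost;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

// Retry `call` on transient errors, doubling the delay each time (hypothetical base: 500 ms).
async function withBackoff<T>(
  call: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  maxRetries = 5,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (attempt >= maxRetries || !isTransient(err)) throw err;
      const delayMs = baseDelayMs * 2 ** attempt;
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

With ≈2,000-character chunks the estimator charges 1,000 tokens each, so the token cap (8 chunks) binds before the item cap (10 chunks).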
05
In-memory vector store
Vectors are stored in a module-level array in the serverless function. Each StoredChunk carries its original text, title, optional URL, and the full 768-dimensional float vector. The store is ephemeral: it resets on cold start, instance recycling, or redeployment. The production upgrade path is Supabase pgvector: cosine similarity maps onto pgvector's <=> cosine-distance operator (similarity = 1 − distance), and top-k becomes a single indexed ANN scan.
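A sketch of what such a module-level store could look like. The StoredChunk fields come from the pipeline diagram; the function names, the append-only behaviour, and the dimension check are assumptions, not a description of src/lib/store.ts.

```typescript
interface StoredChunk {
  id: string;
  docIndex: number;
  chunkIndex: number;
  title: string;
  url?: string;
  text: string;
  vector: number[];
}

const EMBEDDING_DIM = 768;

// Module-level state: lives only as long as this serverless instance does.
const store: StoredChunk[] = [];

function addChunks(chunks: StoredChunk[]): void {
  for (const chunk of chunks) {
    // Assumption: reject vectors of the wrong size early rather than at query time.
    if (chunk.vector.length !== EMBEDDING_DIM) {
      throw new Error(`expected ${EMBEDDING_DIM} dimensions, got ${chunk.vector.length}`);
    }
    store.push(chunk);
  }
}

function storedChunks(): readonly StoredChunk[] {
  return store;
}

function clearStore(): void {
  store.length = 0;
}
```

Because the array is plain module state, two concurrent Vercel instances each hold their own copy, which is exactly the ephemerality described above.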
06
Cosine similarity ranking
The query is embedded with the same model. Cosine similarity is computed against every stored chunk vector: score = (A·B) / (|A|·|B|). Chunks are ranked descending; one result per source document is returned (best-scoring chunk wins), preventing a single long document from dominating the top-5 results.
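The ranking step can be sketched as below. The interface and function names are hypothetical, and returning 0 for zero-length or mismatched vectors is an assumed convention (throwing would be equally defensible).

```typescript
interface RankableChunk {
  docIndex: number;
  title: string;
  text: string;
  vector: number[];
}

interface ScoredChunk {
  chunk: RankableChunk;
  score: number;
}

// score = (A·B) / (|A|·|B|)
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) return 0;
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Keep the best-scoring chunk per document, then return the top k documents.
function topResults(query: number[], chunks: RankableChunk[], k = 5): ScoredChunk[] {
  const bestPerDoc = new Map<number, ScoredChunk>();
  for (const chunk of chunks) {
    const score = cosineSimilarity(query, chunk.vector);
    const best = bestPerDoc.get(chunk.docIndex);
    if (best === undefined || score > best.score) {
      bestPerDoc.set(chunk.docIndex, { chunk, score });
    }
  }
  return Array.from(bestPerDoc.values())
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Deduplicating before the top-k cut matters: slicing first and deduplicating afterwards could return fewer than five documents when one long document holds several of the highest-scoring chunks.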
07
XSS-safe snippet extraction
This is the most security-critical step. Document content is untrusted — it may contain <script>, <img onerror=…>, or other injection payloads. The pipeline always HTML-escapes the raw text window before inserting any <mark> tags. Escape-then-mark means the <mark> tags are constructed by the application, never from document content. A Vitest spec verifies this invariant.
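A minimal sketch of the escape-then-mark order, not the contents of src/lib/highlight.ts. The function names are hypothetical, and the query is assumed to be split on whitespace into terms. One subtlety: once the text is escaped it contains entities such as &amp;, so the matcher consumes entities first to stop a query term like "amp" from matching inside one.

```typescript
function escapeHtml(raw: string): string {
  return raw
    .replace(/&/g, "&amp;") // must run first so later entities are not double-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Step 1: escape the untrusted text. Step 2: wrap matches in application-built <mark> tags.
function highlight(rawText: string, query: string): string {
  const escaped = escapeHtml(rawText);
  const terms = query
    .split(/\s+/)
    .filter((term) => term.length > 0)
    .map((term) => escapeRegExp(escapeHtml(term))); // match against the escaped form
  if (terms.length === 0) return escaped;

  // The entity alternative is tried first at every position and passed through untouched.
  const pattern = new RegExp(`(&(?:amp|lt|gt|quot|#39);)|(${terms.join("|")})`, "gi");
  return escaped.replace(pattern, (match: string, entity?: string) =>
    entity !== undefined ? match : `<mark>${match}</mark>`,
  );
}
```

For instance, `highlight("<script>alert(1)</script>", "script")` produces `&lt;<mark>script</mark>&gt;alert(1)&lt;/<mark>script</mark>&gt;`: the only real tags in the output are the ones the function itself wrote. A known limitation of this sketch is that a query term consisting solely of an escaped character, such as "&", is never highlighted.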

Vector store: ephemeral by design

The in-memory store intentionally has no external dependency for this demo — one pnpm dev is all you need to run it locally. The trade-off: state is lost on every cold start and is not shared across Vercel instances. For production use, swap src/lib/store.ts for a Supabase pgvector implementation (see comments in that file).

Multi-tenant isolation gap

Known gap: the in-memory store has no user isolation. All documents indexed on the same serverless instance are queryable by all users on that instance. For sensitive or multi-tenant use, partition by session/user ID in pgvector, or use a per-user ephemeral store keyed in Redis.
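The partitioning idea can be illustrated with a map keyed by session ID. This is only a sketch of the keying scheme with hypothetical names; it still lives in one instance's memory, so it does not replace the pgvector or Redis options above.

```typescript
interface SessionChunk {
  title: string;
  text: string;
  vector: number[];
}

// One bucket per session; a search only ever sees its own bucket.
const chunksBySession = new Map<string, SessionChunk[]>();

function addForSession(sessionId: string, chunks: SessionChunk[]): void {
  const existing = chunksBySession.get(sessionId) ?? [];
  chunksBySession.set(sessionId, existing.concat(chunks));
}

function chunksForSession(sessionId: string): readonly SessionChunk[] {
  return chunksBySession.get(sessionId) ?? [];
}
```

The same shape carries over to pgvector as a session or user column included in every query's WHERE clause.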

Embedding model

Google Vertex AI text-embedding-004 supports up to 2048 input tokens and produces 768-dimensional output vectors. It is suitable for semantic retrieval and outperforms older 512-dim models on most English and multilingual benchmarks. Authentication uses Application Default Credentials (ADC) for local dev and a service-account JSON injected as an env var on Vercel; the credentials are never committed to source control.

XSS safety proof

The src/lib/highlight.ts module follows a strict contract: (1) HTML-escape all text that comes from document content; (2) only then insert <mark> tags around matched query terms. The Vitest suite (15 tests) covers script injection, img onerror, javascript: hrefs, escaping of &, >, and <, case-insensitive match wrapping, ellipsis on long text, and edge cases for cosineSimilarity().

Security stance

What is defended. What is not.

Defended

Input size caps (docs, chars, query length), enforced by Zod
Rate limiting by IP via Upstash sliding window
XSS from doc content — HTML-escape before mark insertion, Vitest spec covers it
No stack traces to the client — typed error codes only
Secrets server-only — GCP credentials never reach the browser or git
No LLM output in snippet paths — only embeddings, no generative model

Known gaps

In-memory store is ephemeral and not multi-tenant isolated (documented above)
No authentication — anyone with the URL can index and search
No CSRF token on the index endpoint (mitigated by SameSite cookies if auth is added)
Embedding cost not capped per user — only rate-limited by request count

Next step

Want this for your product?

This demo uses an in-memory vector store for zero-dependency simplicity. The production version swaps in Supabase pgvector, adds per-user partitioning, session-keyed rate limits, and an authenticated API. If you have a semantic-search or RAG problem — internal docs, customer knowledge bases, code search — email me with the scale and I'll reply within 24 hours.
