Engineering notes
How semantic search actually works.
One Vercel deploy, no external vector database. Every search runs as a Node.js serverless function. Below: the complete pipeline from raw text to ranked, highlighted catalog cards, plus the engineering decisions and the known gaps, stated openly.
Pipeline diagram
┌───────────────────────────────────────────────────────────────────┐
│  POST /api/index                                                  │
│                                                                   │
│  { docs: [ { title, content, url? } … ] }   ← JSON body           │
│        │                                                          │
│        ▼                                                          │
│  [01] Zod validation                                              │
│       • max 50 docs   • max 200,000 total chars                   │
│       • title max 500 chars, content max 100,000 chars            │
│        │                                                          │
│        ▼                                                          │
│  [02] Rate limit: Upstash sliding window, 100 req/day/IP          │
│        │                                                          │
│        ▼                                                          │
│  [03] Chunk text: ~500 tok / 50 tok overlap                       │
│       sentence-boundary preference, word-boundary fallback        │
│        │                                                          │
│        ▼                                                          │
│  [04] Batch embed: Vertex text-embedding-004 (768-dim)            │
│       up to 10 chunks / 8,000 est-tokens per batch                │
│       exponential backoff on 429 / UNAVAILABLE                    │
│        │                                                          │
│        ▼                                                          │
│  [05] Store: in-memory module-level array of StoredChunk          │
│       { id, docIndex, chunkIndex, title, url, text, vector }      │
└───────────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────────┐
│  POST /api/search                                                 │
│                                                                   │
│  { query: string 2–300 chars }              ← JSON body           │
│        │                                                          │
│        ▼                                                          │
│  [01] Zod validation + rate limit (shared window)                 │
│        │                                                          │
│        ▼                                                          │
│  [02] Embed query: Vertex text-embedding-004                      │
│        │                                                          │
│        ▼                                                          │
│  [03] Cosine similarity over all stored chunk vectors             │
│       score = (A·B) / (|A||B|)                                    │
│        │                                                          │
│        ▼                                                          │
│  [04] Deduplicate: one result per source document (best chunk)    │
│       Return top-5 by score                                       │
│        │                                                          │
│        ▼                                                          │
│  [05] Highlight: HTML-escape first, then insert <mark> tags       │
│       ESCAPE → MARK order prevents XSS from doc content           │
│        │                                                          │
│        ▼                                                          │
│  { results: [ { title, url?, snippet (HTML), score } ] }          │
└───────────────────────────────────────────────────────────────────┘
The pipeline, step by step.
- 01
Zod input validation
Every API request is parsed through a strict zod schema before any processing begins. For /api/index: at most 50 documents, 200,000 total characters, 100,000 characters of content per document, and a 500-character title cap. For /api/search: the query must be 2–300 characters. Malformed or oversized requests are rejected immediately with a typed error code; no stack traces reach the client.
- 02
Rate limiting
Upstash Redis sliding-window limiter: 100 requests per IP per 24 hours, prefix rl:search. It degrades gracefully to a no-op when Upstash is not configured (development mode), so the app never hard-crashes on missing env vars. The IP is read from x-forwarded-for (first hop only) or x-real-ip.
- 03
Text chunking
Long documents are split into ~500-token chunks (≈2000 chars) with a 50-token overlap (≈200 chars). Chunking prefers sentence boundaries (". ") and falls back to word boundaries, which keeps each window semantically coherent. The overlap means that a sentence or short passage straddling a chunk boundary still appears whole in at least one chunk, as long as it is no longer than the overlap itself.
- 04
Batched embedding via Vertex AI
Chunks are embedded in batches of up to 10 items / 8,000 estimated tokens using Google's text-embedding-004 model (768-dimensional output). Batching is necessary because Vertex has per-request token limits. The estimator is deliberately pessimistic (chars/2) to handle Vietnamese, CJK, and dense technical text without overrunning the quota. Transient errors (429, RESOURCE_EXHAUSTED, UNAVAILABLE) trigger exponential backoff with up to 5 retries.
- 05
In-memory vector store
Vectors are stored in a module-level array in the serverless function. Each StoredChunk carries its original text, title, optional URL, and the full 768-dimensional float vector. The store is ephemeral: it resets on cold start, instance recycling, or redeployment. The production upgrade path is Supabase pgvector: the cosine comparison becomes a <=> operator call (pgvector's cosine distance, i.e. 1 minus the similarity), and the top-k becomes a single indexed ANN scan.
- 06
Cosine similarity ranking
The query is embedded with the same model. Cosine similarity is computed against every stored chunk vector: score = (A·B) / (|A|·|B|). Chunks are ranked in descending order; one result per source document is returned (the best-scoring chunk wins), which prevents a single long document from dominating the top-5 results.
- 07
XSS-safe snippet extraction
This is the most security-critical step. Document content is untrusted: it may contain <script>, <img onerror=…>, or other injection payloads. The pipeline always HTML-escapes the raw text window before inserting any <mark> tags. Escape-then-mark means the <mark> tags are constructed by the application, never from document content. A Vitest spec verifies this invariant.
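For step 02, here is a minimal sketch of the two pieces involved: picking the client IP from the proxy headers, and a sliding-window check. The header names, the 100-per-day limit, and the rl:search prefix come from the notes above; `clientIp`, `SlidingLogLimiter`, and the `"unknown"` fallback are hypothetical names chosen for illustration. The limiter is a plain in-memory sliding log standing in for Upstash Redis, so it shows the idea rather than Upstash's actual algorithm.

```typescript
// Hypothetical helpers; only the header names and the 100/day limit come from the notes.
type HeaderBag = Record<string, string | undefined>;

// First hop of x-forwarded-for, else x-real-ip, else an assumed "unknown" bucket.
function clientIp(headers: HeaderBag): string {
  const forwarded = headers["x-forwarded-for"];
  if (forwarded) {
    // Header looks like "client, proxy1, proxy2": keep only the first entry.
    const first = forwarded.split(",")[0]?.trim();
    if (first) return first;
  }
  const realIp = headers["x-real-ip"]?.trim();
  return realIp ? realIp : "unknown";
}

// In-memory sliding log: stores the timestamps of accepted requests per key.
class SlidingLogLimiter {
  private readonly hits = new Map<string, number[]>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  allow(key: string, now: number): boolean {
    const cutoff = now - this.windowMs;
    // Drop timestamps that have slid out of the window.
    const recent = (this.hits.get(key) ?? []).filter((t) => t > cutoff);
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(now);
    this.hits.set(key, recent);
    return allowed;
  }
}

// 100 requests per IP per 24 hours, keyed with the rl:search prefix.
const limiter = new SlidingLogLimiter(100, 24 * 60 * 60 * 1000);
const ip = clientIp({ "x-forwarded-for": "203.0.113.7, 10.0.0.1" });
const allowed = limiter.allow(`rl:search:${ip}`, Date.now());
```

Passing `now` in explicitly keeps the limiter deterministic in tests; a Redis-backed limiter would rely on server time instead.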
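The chunking in step 03 can be sketched roughly as follows. The 2000-character size and 200-character overlap are the approximations given above; `chunkText` is a hypothetical name, and the rule "only accept a boundary in the second half of the window" is an assumption added here to avoid tiny chunks, not something the notes specify.

```typescript
const CHUNK_CHARS = 2000;  // ≈500 tokens, per the notes
const OVERLAP_CHARS = 200; // ≈50 tokens, per the notes

// Hypothetical chunker: sentence boundary first, word boundary as fallback.
function chunkText(
  text: string,
  size: number = CHUNK_CHARS,
  overlap: number = OVERLAP_CHARS,
): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + size, text.length);
    if (end < text.length) {
      const win = text.slice(start, end);
      const sentence = win.lastIndexOf(". ");
      const word = win.lastIndexOf(" ");
      // Assumed heuristic: ignore boundaries in the first half of the window.
      if (sentence > size / 2) {
        end = start + sentence + 1; // keep the period in this chunk
      } else if (word > size / 2) {
        end = start + word;
      }
    }
    const piece = text.slice(start, end).trim();
    if (piece.length > 0) chunks.push(piece);
    if (end >= text.length) break;
    // Step back by the overlap, but always make forward progress.
    start = Math.max(end - overlap, start + 1);
  }
  return chunks;
}

const sample = Array.from({ length: 300 }, (_, i) => `Sentence number ${i}.`).join(" ");
const sampleChunks = chunkText(sample);
```

With these defaults each chunk ends on a period when one is available, and the last 200 characters of a chunk reappear at the start of the next one.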
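The batching rule in step 04 (at most 10 items and 8,000 estimated tokens per request, with tokens estimated as chars/2) might look like the sketch below. `estimateTokens` and `toBatches` are hypothetical names, and giving an oversized single chunk its own batch is an assumption.

```typescript
const MAX_BATCH_ITEMS = 10;
const MAX_BATCH_TOKENS = 8000;

// Pessimistic estimate from the notes: one token per two characters.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 2);
}

// Greedy packing: start a new batch when either cap would be exceeded.
function toBatches(chunks: string[]): string[][] {
  const batches: string[][] = [];
  let current: string[] = [];
  let tokens = 0;
  for (const chunk of chunks) {
    const cost = estimateTokens(chunk);
    const full =
      current.length >= MAX_BATCH_ITEMS || tokens + cost > MAX_BATCH_TOKENS;
    if (current.length > 0 && full) {
      batches.push(current);
      current = [];
      tokens = 0;
    }
    // Assumption: a chunk larger than the token cap still goes out, alone.
    current.push(chunk);
    tokens += cost;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

const batchSizes = toBatches(Array.from({ length: 9 }, () => "x".repeat(2000)))
  .map((b) => b.length);
```

One consequence of the chars/2 estimate: a full 2000-character chunk counts as 1,000 tokens, so with full-size chunks the token cap (8 per batch) binds before the 10-item cap does.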
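The retry behavior in step 04 can be sketched as a small wrapper. The retry count (5) and the transient error markers come from the notes; the 500 ms base delay, the doubling schedule, matching on the error message, and the names `isTransient`, `backoffDelayMs`, and `withBackoff` are all assumptions made for this illustration.

```typescript
const MAX_RETRIES = 5;     // from the notes
const BASE_DELAY_MS = 500; // assumed; the notes do not give a base delay

// Assumption: transient failures are recognizable from the error message.
function isTransient(err: unknown): boolean {
  const msg = err instanceof Error ? err.message : String(err);
  return /429|RESOURCE_EXHAUSTED|UNAVAILABLE/.test(msg);
}

// Delay doubles with each attempt: 500, 1000, 2000, ...
function backoffDelayMs(attempt: number, base: number = BASE_DELAY_MS): number {
  return base * Math.pow(2, attempt);
}

function defaultSleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => {
    setTimeout(resolve, ms);
  });
}

// Runs fn, retrying transient failures up to MAX_RETRIES times.
async function withBackoff<T>(
  fn: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up on non-transient errors or once the retry budget is spent.
      if (attempt >= MAX_RETRIES || !isTransient(err)) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

The `sleep` parameter exists so tests can inject a no-op delay; a production version would usually add jitter to the delay as well.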
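Steps 05 and 06 together amount to a brute-force scan. The sketch below uses the StoredChunk fields listed in the diagram; the field types (for example `id` as a string), the function names `cosineSimilarity` and `search`, and the zero-vector guard are assumptions.

```typescript
// Field names from the pipeline diagram; the types are assumed.
interface StoredChunk {
  id: string;
  docIndex: number;
  chunkIndex: number;
  title: string;
  url?: string;
  text: string;
  vector: number[];
}

// score = (A·B) / (|A|·|B|); returns 0 for a zero vector (assumed guard).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Keep the best-scoring chunk per document, then return the top K documents.
function search(queryVector: number[], store: StoredChunk[], topK: number = 5) {
  const bestPerDoc = new Map<number, { chunk: StoredChunk; score: number }>();
  for (const chunk of store) {
    const score = cosineSimilarity(queryVector, chunk.vector);
    const best = bestPerDoc.get(chunk.docIndex);
    if (best === undefined || score > best.score) {
      bestPerDoc.set(chunk.docIndex, { chunk, score });
    }
  }
  return Array.from(bestPerDoc.values())
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

Deduplicating before the top-K cut matters: slicing first and deduplicating afterwards could return fewer than five documents when one document owns several of the highest-scoring chunks.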
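The invariant in step 07 can be illustrated with a short sketch. The notes describe escaping the whole window and then inserting <mark> tags; the variant below instead locates matches in the raw text and escapes every segment as it is emitted. It keeps the same invariant (all document text passes through the escaper, and <mark> tags come only from the application) while avoiding matches inside entities such as `&amp;`. The names `escapeHtml`, `escapeRegExp`, and `highlight`, and the idea of highlighting a list of query terms, are assumptions.

```typescript
function escapeHtml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Make a user-supplied term safe to embed in a RegExp.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical highlighter: terms are assumed to come from the search query.
function highlight(rawWindow: string, terms: string[]): string {
  const patterns = terms
    .map((t) => t.trim())
    .filter((t) => t.length > 0)
    .map(escapeRegExp);
  if (patterns.length === 0) return escapeHtml(rawWindow);

  const re = new RegExp(patterns.join("|"), "gi");
  let out = "";
  let last = 0;
  let m: RegExpExecArray | null;
  while ((m = re.exec(rawWindow)) !== null) {
    // Untrusted text is always escaped; only the tags are literal HTML.
    out += escapeHtml(rawWindow.slice(last, m.index));
    out += `<mark>${escapeHtml(m[0])}</mark>`;
    last = m.index + m[0].length;
  }
  return out + escapeHtml(rawWindow.slice(last));
}

const snippet = highlight("<script>alert(1)</script> vector search", ["vector"]);
```

A spec for this invariant would feed payloads like the one above through `highlight` and assert that the only unescaped tags in the output are `<mark>` and `</mark>`.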
Security stance
What is defended. What is not.
Defended
- Input size caps (docs, chars, query length) — enforced by zod
- Rate limiting by IP via Upstash sliding window
- XSS from doc content — HTML-escape before mark insertion, Vitest spec covers it
- No stack traces to the client — typed error codes only
- Secrets server-only — GCP credentials never reach the browser or git
- No LLM output in snippet paths — only embeddings, no generative model
Known gaps
- In-memory store is ephemeral and not multi-tenant isolated (documented above)
- No authentication — anyone with the URL can index and search
- No CSRF token on the index endpoint (mitigated by SameSite cookies if auth is added)
- Embedding cost not capped per user — only rate-limited by request count
Next step
Want this for your product?
This demo uses an in-memory vector store for zero-dependency simplicity. The production version swaps in Supabase pgvector and adds per-user partitioning, session-keyed rate limits, and an authenticated API. If you have a semantic-search or RAG problem (internal docs, customer knowledge bases, code search), email me with the scale and I'll reply within 24 hours.