The first version of the Passport Inquiry Chatbot was demoable and dangerous. Ask it a question it knew, and it answered well. Ask it something just outside the official documents, and it answered just as confidently, and just as wrongly. For a government service, a fluent wrong answer is worse than no answer. Here is how I made it trustworthy.

## Hallucination is a retrieval problem first

It is tempting to treat hallucination as a model flaw to be prompted away. In practice, most production hallucination is the model improvising because the right context never reached it. Fix retrieval and a large share of the problem disappears before the prompt even runs.

## Chunking that respects meaning

I stopped splitting documents on a fixed character count and started splitting on structure: sections, headings, and Q&A pairs from the official passport material. Each chunk carries metadata (source document, section) so an answer can cite where it came from. Chunks that respect meaning retrieve far more precisely than chunks that respect byte offsets.

## Retrieve, then constrain

The pipeline is deliberately boring: embed the question, pull the top-k chunks, and hand only those to Gemini with a prompt that forbids going beyond them.

```python
prompt = (
    "Answer ONLY from the context below. "
    "If the answer is not in the context, say you don't have that information "
    "and point the user to the official passport office.\n\n"
    f"CONTEXT:\n{retrieved}\n\nQUESTION: {question}"
)
```

The instruction to admit ignorance is doing heavy lifting. A bot that says "I don't have that information" is infinitely more useful to a citizen than one that invents a fee schedule.

## Evaluation: the part everyone skips

"It looks good" is not a metric.
I built a small set of question/expected-answer pairs drawn from real inquiries and scored every change against it: retrieval hit rate (did the right chunk come back?), groundedness (is every claim supported by a retrieved chunk?), and refusal accuracy (does it decline when it should?). When a prompt tweak improved one number and quietly hurt another, the eval set caught it.

## What carried over to this portfolio

The Digital Twin you can chat with on this site runs the same architecture: retrieve from a known corpus, constrain the model to it, fall back gracefully when there's no match. The corpus is just my own projects and résumé instead of passport regulations. Same discipline, different documents.
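
The retrieve, constrain, fall back loop can be sketched in a few lines. This is a minimal illustration, not the code behind either bot: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the `min_score` threshold and its value are assumptions, and `FALLBACK`, `retrieve`, and `build_prompt` are hypothetical names. The model call itself is left out, since the point is what happens before it.

```python
import math
import re
from collections import Counter
from typing import Optional

# Hypothetical fallback wording; the real message is not part of this post.
FALLBACK = "I don't have that information. Please check with the official source."


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, corpus: list, k: int = 2, min_score: float = 0.2) -> list:
    """Return up to k chunks whose similarity to the question clears min_score."""
    q = embed(question)
    scored = [(cosine(q, embed(chunk["text"])), chunk) for chunk in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored[:k] if score >= min_score]


def build_prompt(question: str, corpus: list) -> Optional[str]:
    """Build the constrained prompt, or return None when nothing relevant came back."""
    chunks = retrieve(question, corpus)
    if not chunks:
        return None  # caller replies with FALLBACK and never calls the model
    # Each chunk keeps its source and section so the answer can cite them.
    context = "\n\n".join(
        f"[{c['source']} / {c['section']}]\n{c['text']}" for c in chunks
    )
    return (
        "Answer ONLY from the context below. "
        "If the answer is not in the context, say you don't have that information.\n\n"
        f"CONTEXT:\n{context}\n\nQUESTION: {question}"
    )
```

When `build_prompt` returns `None`, the caller sends `FALLBACK` straight to the user. Handling the no-match case in code, before the model runs, means the refusal does not depend on the model obeying the prompt.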