LLM Applications: Beyond the Thin Wrapper

The gap between "ChatGPT wrapper" and a real LLM product isn't the model. Everyone has the same models. The difference is everything around the call.

Context is the moat

In Officer Saathi, we used embeddings for semantic matching between professionals. The model was off-the-shelf (text-embedding-3-small); the value was in what we embedded — carefully structured profiles, weighted interests, and domain-specific signals nobody else had.

Your proprietary data, shaped well, beats a bigger model with generic context. Spend your effort on the retrieval and structuring pipeline, not on prompt poetry.
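To make "shaped well" concrete, here is a minimal sketch of turning a structured profile into text before embedding it. The `Profile` fields, the repeat-by-weight trick, and `toy_embed` are all hypothetical and not taken from Officer Saathi. `toy_embed` is a bag-of-words stand-in so the example runs offline; in practice you would swap in a call to your embedding model.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Profile:
    # Hypothetical schema; the real profile fields are not described in the post.
    name: str
    role: str
    interests: Dict[str, int] = field(default_factory=dict)  # interest -> weight
    signals: List[str] = field(default_factory=list)         # domain-specific signals


def profile_to_text(p: Profile) -> str:
    """Render a profile as labelled lines for the embedder.

    Each interest is repeated `weight` times. This is a crude, illustrative
    way to make heavier interests count for more in the embedded text.
    """
    lines = [f"role: {p.role}"]
    for interest, weight in sorted(p.interests.items()):
        lines.append("interest: " + " ".join([interest] * weight))
    for signal in p.signals:
        lines.append(f"signal: {signal}")
    return "\n".join(lines)


def toy_embed(text: str) -> Dict[str, float]:
    """Stand-in embedder (assumption): sparse word counts, labels included."""
    return dict(Counter(text.lower().replace(":", " ").split()))


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_score(
    a: Profile,
    b: Profile,
    embed: Callable[[str], Dict[str, float]] = toy_embed,
) -> float:
    """Similarity of two profiles after structuring and embedding them."""
    return cosine(embed(profile_to_text(a)), embed(profile_to_text(b)))
```

The point of the sketch is that the effort goes into `profile_to_text`, not the embedder: changing which fields are rendered and how they are weighted changes the matches, even with the same off-the-shelf model.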

Cheap models for narrow jobs

Not every task needs the flagship model. We used a small model for generating match explanations: a narrow, low-risk task with a tight prompt. Flagship models go where reasoning genuinely matters. For the calls it covers, this single decision often cuts inference costs by around 10x.
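One way to keep that decision explicit is a small routing table, sketched below. The model names, task names, and prices are placeholders (assumptions), chosen only so the arithmetic is easy to follow; substitute your provider's real identifiers and rates.

```python
# Hypothetical model identifiers and per-1K-token prices (assumptions).
SMALL_MODEL = "small-model"
FLAGSHIP_MODEL = "flagship-model"
PRICE_PER_1K_TOKENS = {SMALL_MODEL: 0.001, FLAGSHIP_MODEL: 0.010}

# Tasks judged narrow and low-risk enough for the small model.
NARROW_TASKS = {"match_explanation"}


def pick_model(task: str) -> str:
    """Route narrow tasks to the small model; default to the flagship."""
    return SMALL_MODEL if task in NARROW_TASKS else FLAGSHIP_MODEL


def estimated_cost(task: str, tokens: int) -> float:
    """Estimated cost of one call, using the placeholder price table."""
    return PRICE_PER_1K_TOKENS[pick_model(task)] * tokens / 1000
```

Defaulting to the flagship means a new, unclassified task never lands on the cheap model by accident; it has to be added to `NARROW_TASKS` deliberately.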

Evals before features

Before adding a new LLM feature, write down 20 real inputs and what a good output looks like for each. Run them on every prompt or model change. It's unglamorous, and it's the only way to know whether you improved things or just changed them.

Without evals, every prompt tweak is a coin flip you can't see landing.
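A minimal sketch of such a harness, assuming "what a good output looks like" can be roughly approximated by phrases that must or must not appear. That is a crude proxy, and `EvalCase`, `passes`, and `run_evals` are hypothetical names. The `generate` argument stands in for whatever function wraps your prompt and model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EvalCase:
    input_text: str
    must_include: List[str] = field(default_factory=list)  # phrases a good output contains
    must_exclude: List[str] = field(default_factory=list)  # phrases a good output avoids


def passes(case: EvalCase, output: str) -> bool:
    """Case-insensitive check of one output against one case."""
    low = output.lower()
    has_all = all(phrase.lower() in low for phrase in case.must_include)
    has_none = not any(phrase.lower() in low for phrase in case.must_exclude)
    return has_all and has_none


def run_evals(cases: List[EvalCase], generate: Callable[[str], str]) -> float:
    """Return the fraction of cases that pass for a given prompt/model wrapper."""
    if not cases:
        return 0.0
    results = [passes(case, generate(case.input_text)) for case in cases]
    return sum(results) / len(results)
```

Running `run_evals` with the old and the new `generate` on the same cases gives two numbers to compare, which is the difference between "improved" and "changed".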

Design for being wrong

LLMs are probabilistic; your UX has to absorb that:

Show why a suggestion was made (we displayed what two matched members had in common)
Make the AI's output a draft, not an action — humans confirm anything consequential
Give an escape hatch: regenerate, edit, or skip

Products fail when they present model output with database-level confidence. Users forgive a suggestion that's off; they don't forgive a system that's confidently wrong.
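The three rules above can be sketched as a small data model in which model output starts as a draft and nothing consequential runs until a human confirms it. `Suggestion`, `Status`, and `apply_suggestion` are hypothetical names used for illustration, not a description of any particular product's code.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Status(Enum):
    DRAFT = "draft"
    CONFIRMED = "confirmed"
    SKIPPED = "skipped"


@dataclass
class Suggestion:
    text: str
    reasons: List[str] = field(default_factory=list)  # shown to the user as the "why"
    status: Status = Status.DRAFT                     # model output always starts as a draft

    def edit(self, new_text: str) -> None:
        """User edits the draft; any earlier confirmation is reset."""
        self.text = new_text
        self.status = Status.DRAFT

    def confirm(self) -> None:
        """Explicit human approval."""
        self.status = Status.CONFIRMED

    def skip(self) -> None:
        """Escape hatch: the user declines the suggestion."""
        self.status = Status.SKIPPED


def apply_suggestion(suggestion: Suggestion) -> str:
    """Perform the consequential action, but only for confirmed suggestions."""
    if suggestion.status is not Status.CONFIRMED:
        raise PermissionError("suggestion must be confirmed by a human first")
    return f"sent: {suggestion.text}"
```

Regenerating fits the same model: a fresh `Suggestion` is created and, like every other model output, it starts in `DRAFT`.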

The pattern across all of this: treat the model as one unreliable-but-brilliant component in a system you control, not as the system itself.