The Netorigo Storefront runs two AI-driven product recommenders: one at the bottom of every PDP ("You may also like") and one inside the chat panel ("Find me a product for X"). They do completely different jobs, and early on we conflated them. Below is when each wins, and the hybrid that delivered the headline number.
The two surfaces
PDP "You may also like": early-stage, statistical
The similar-product list at the bottom of a PDP is NOT LLM-generated. A precomputed collaborative-filtering matrix (product_affinity) feeds it, refreshed nightly. Latency: 12 ms p99 (the rows arrive in the same JOIN as the product query). Token cost: zero. No hallucination risk.
Upside: fast, cheap, deterministic. Downside: it only works when data exists. A product that has been in the catalog for less than 30 days has a cold-start problem: the list is empty or directionless.
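A minimal sketch of the single-JOIN lookup and the cold-start check described above, using an in-memory SQLite database. Only the product_affinity name and the 30-day threshold come from the text; the column names (related_id, score, listed_on), the products table, and both function names are assumptions made for illustration.

```python
import sqlite3
from datetime import date, timedelta

COLD_START_DAYS = 30  # from the text: younger products lack affinity data


def demo_conn():
    """Build a tiny in-memory catalog (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE products (id INTEGER PRIMARY KEY, listed_on TEXT);
        CREATE TABLE product_affinity (
            product_id INTEGER, related_id INTEGER, score REAL
        );
        INSERT INTO products VALUES
            (1, '2024-01-10'), (2, '2024-02-01'),
            (3, '2024-03-15'), (4, '2025-05-20');
        INSERT INTO product_affinity VALUES (1, 2, 0.4), (1, 3, 0.9);
        """
    )
    return conn


def similar_products(conn, product_id, today, limit=5):
    """Return (is_cold_start, related_ids) from one product + affinity JOIN."""
    rows = conn.execute(
        """
        SELECT p.listed_on, a.related_id
        FROM products AS p
        LEFT JOIN product_affinity AS a ON a.product_id = p.id
        WHERE p.id = ?
        ORDER BY a.score DESC
        LIMIT ?
        """,
        (product_id, limit),
    ).fetchall()
    if not rows:
        raise KeyError(product_id)
    listed_on = date.fromisoformat(rows[0][0])
    is_cold_start = (today - listed_on) < timedelta(days=COLD_START_DAYS)
    # A LEFT JOIN with no affinity rows yields one row with related_id NULL.
    related = [r[1] for r in rows if r[1] is not None]
    return is_cold_start, related
```

The cold-start flag is returned alongside the list so the caller can decide what to show when the list is empty; what that fallback should be is a product decision this sketch does not make.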
Chat-driven "Find me a product": final-stage, LLM-driven
When the visitor types "I need a laptop for 3D modelling under 800k HUF" into the chat panel, the engine invokes the catalog.searchProducts MCP tool, gets back a 50-item shortlist, and an LLM ranks it against the user intent. Latency: 2-4 seconds. Token cost: 1,100-1,800 tokens per session (GPT-4.1-mini, roughly $0.0015-$0.0025 per session).
Upside: meaningful answers to meaningful questions. Downside: slower, costlier, sometimes too confident about a wrong recommendation.
When each wins
- PDP infinite scroll: early-stage every time. Nobody waits 3 seconds for a similar-product list.
- "What is this good for?": final-stage every time. Statistics do not know a hair clipper is also good for beard care.
- Browse-style "show me some options": hybrid (below).
The hybrid recipe
The chat panel's first move is always statistical. The catalog.searchProducts MCP tool returns the top 5 affinity-matched products, and the LLM presents them with a ~200-token wrap. We only invoke the expensive LLM ranking pass when the user rejects the first list ("I don't like any of those", "something else") or asks a follow-up question. That keeps 71% of sessions on the cheap path, and average token usage dropped from 1,800 to 540 per session (-70%).
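The routing rule above can be sketched as a small function: the first turn always takes the statistical path, and the LLM ranking pass runs only on a rejection or follow-up. Everything named here (Reply, is_rejection, recommend, and the two callables) is hypothetical, and the rejection check is a deliberately naive placeholder rather than how the engine actually classifies messages.

```python
from dataclasses import dataclass
from typing import Callable, List

# The two rejection phrases quoted in the text; a real classifier would be richer.
REJECTION_PHRASES = ("i don't like any of those", "something else")


@dataclass
class Reply:
    path: str               # "statistical" or "llm"
    product_ids: List[str]


def is_rejection(message: str) -> bool:
    """Naive placeholder: known phrases, or a question treated as a follow-up."""
    text = message.lower().strip()
    return any(p in text for p in REJECTION_PHRASES) or text.endswith("?")


def recommend(
    turn_index: int,
    message: str,
    top_affinity: Callable[[int], List[str]],
    llm_rank: Callable[[str], List[str]],
) -> Reply:
    """First turn is always statistical; escalate only on rejection/follow-up."""
    if turn_index == 0 or not is_rejection(message):
        return Reply("statistical", top_affinity(5))
    return Reply("llm", llm_rank(message))
```

Passing the two data sources in as callables keeps the routing decision testable without a catalog or a model behind it.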
The measurable number: +7.8% AOV
On a partner's technical-equipment vertical (B2B workshop machines, average basket around 380k HUF), we ran a 90-day A/B test. Control: statistical only. Treatment: hybrid (statistical plus chat-driven LLM ranking). AOV rose 7.8% in the treatment arm (n=4,827). Conversion rates did not move significantly, meaning we did not get MORE buyers; the buyers bought MORE.
The hypothesis it confirmed: the chat-driven recommender suggested complementary items (mounts, cables, consumables) the buyer would not have considered. Classic basket-completion upsell, except that it comes from semantic AI matching rather than a hand-curated coupon.
The token-cost math
4,827 sessions * 540 average tokens * $0.0015 / 1k tokens = $3.91 total token cost. The AOV uplift: +29,800 HUF / order * 4,827 * 0.078 = ~11 million HUF additional revenue. Assuming roughly 350 HUF per USD, that is about $32,000, so the ROI on token cost is roughly 8,000x. That is not a typo.
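The token-cost side of the math is easy to check by hand. The sketch below reproduces it from the figures quoted in this post (4,827 sessions, 540 and 1,800 average tokens, $0.0015 per 1k tokens); the function and variable names are ours for illustration.

```python
SESSIONS = 4_827
AVG_TOKENS_HYBRID = 540
AVG_TOKENS_LLM_ONLY = 1_800
USD_PER_1K_TOKENS = 0.0015


def token_cost_usd(sessions: int, avg_tokens: float, usd_per_1k: float) -> float:
    """Total token spend: sessions * tokens per session * price per 1k tokens."""
    return sessions * avg_tokens / 1000 * usd_per_1k


hybrid_cost = token_cost_usd(SESSIONS, AVG_TOKENS_HYBRID, USD_PER_1K_TOKENS)
llm_only_cost = token_cost_usd(SESSIONS, AVG_TOKENS_LLM_ONLY, USD_PER_1K_TOKENS)
token_reduction = 1 - AVG_TOKENS_HYBRID / AVG_TOKENS_LLM_ONLY

print(f"hybrid token cost:   ${hybrid_cost:.2f}")      # $3.91
print(f"LLM-only token cost: ${llm_only_cost:.2f}")    # $13.03
print(f"token reduction:     {token_reduction:.0%}")   # 70%
```

Either way the token bill for the whole test is a few dollars, which is why the hybrid is justified by latency and session experience at least as much as by cost.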
Model swappability architecture
The chat-driven recommender does not hardcode the model. The engine.providers.<tenantId> config table holds the priority order. The default is openai/gpt-4.1-mini, with anthropic/claude-haiku-4 as fallback and openai/gpt-4o-mini as last resort. The circuit breaker tracks each provider independently: 5 errors in a 60-second window and the provider drops out for 5 minutes. The switch is transparent to the chat panel, which only sees a session.kind discriminator.
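A minimal sketch of a per-provider circuit breaker with the thresholds quoted above (5 errors, 60-second window, 5-minute dropout) and a priority-ordered pick. The class and function names are hypothetical and the engine's real implementation may differ; the clock is injectable so the behaviour can be tested without waiting.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, List, Optional

ERROR_THRESHOLD = 5      # errors that trip the breaker
WINDOW_SECONDS = 60      # sliding window for counting errors
COOLDOWN_SECONDS = 300   # provider stays out for 5 minutes


class ProviderBreaker:
    """Tracks errors for one provider, independently of the others."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._errors: Deque[float] = deque()
        self._open_until = float("-inf")

    def available(self) -> bool:
        return self._clock() >= self._open_until

    def record_error(self) -> None:
        now = self._clock()
        self._errors.append(now)
        # Drop errors that have fallen out of the sliding window.
        while self._errors and now - self._errors[0] > WINDOW_SECONDS:
            self._errors.popleft()
        if len(self._errors) >= ERROR_THRESHOLD:
            self._open_until = now + COOLDOWN_SECONDS
            self._errors.clear()


def pick_provider(
    priority: List[str], breakers: Dict[str, ProviderBreaker]
) -> Optional[str]:
    """Return the first provider in priority order whose breaker is closed."""
    for name in priority:
        if breakers[name].available():
            return name
    return None  # every provider is currently tripped
```

The caller would call record_error on the chosen provider's breaker whenever a request fails and then call pick_provider again, which is what makes the switch invisible to the chat panel.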
Why not just go fully statistical
The partner data team initially asked us to keep the recommender fully static (off the product_affinity matrix). Argument: cheap, predictable, no hallucination. Counter-argument: long-tail questions (per-buyer unique intents) cannot be captured by a static rule. Who tabulates "a setup for greenhouse cooling at 35-degree southern exposure" in a static rule? Nobody. The LLM earns its keep precisely on long-tail intents, and it makes the recommender feel human, like a knowledgeable shop assistant.