Kshetrajna Raghavan, an applied machine learning engineer at Shopify, presented at a Bay Area DSPy meetup last week on a migration the company ran on its merchant data extraction pipeline. The system processes unstructured storefront data (product listings, images, descriptions, tax-relevant categorization, fraud signals) and feeds it into Shopify's downstream automation. The original implementation was a single-prompt setup running on OpenAI's GPT-5. The new one is a multi-agent architecture running on self-hosted Qwen 3, with prompts optimized programmatically via DSPy. The numbers Raghavan presented were a 75x reduction in per-unit LLM cost and roughly a 2x improvement in output quality compared to the GPT-5 single-prompt baseline. Analytics India Magazine's coverage ran a "68% cheaper" headline that does not match the meetup figure; the 75x number is the one from the talk itself.
The cost reduction is real but worth decomposing because two changes are mixed together. One change is the model swap: GPT-5 API calls are expensive, and self-hosting an open-weights Qwen 3 deployment removes both per-token API pricing and the vendor markup baked into commercial inference. That alone gets you a large multiple in cost. The other change is the architecture swap: moving from a single 5K-token prompt to a multi-agent pipeline with specialized workflows (Raghavan called out fraud detection and tax coding as separate agents) and using DSPy to compile and optimize prompts rather than hand-tuning them. The architecture change improves both quality and per-task cost because each agent gets a focused, smaller prompt rather than one giant one that pays for context every call. Saying "Qwen 3 is 75x cheaper than GPT-5" elides this; the real claim is "self-hosted Qwen 3 plus DSPy plus multi-agent decomposition is 75x cheaper than single-prompt GPT-5 on this specific workload."
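The two multiples compose because per-task cost is roughly per-token price times tokens processed per task, and each change attacks one factor. Purely for illustration, and not a breakdown Raghavan gave: a 15x cheaper per-token price combined with a 5x cut in tokens per task would compound to 75x. Below is a minimal sketch of what the decomposed architecture looks like in DSPy; the endpoint, signatures, and field names are hypothetical stand-ins, not Shopify's actual pipeline.

```python
import dspy

# Hypothetical self-hosted endpoint (e.g. vLLM serving Qwen 3 behind an
# OpenAI-compatible API); the model name, URL, and key are illustrative.
lm = dspy.LM("openai/qwen3-32b", api_base="http://localhost:8000/v1", api_key="local")
dspy.configure(lm=lm)

class FraudSignals(dspy.Signature):
    """Flag fraud-relevant signals in a raw product listing."""
    listing: str = dspy.InputField()
    fraud_flags: list[str] = dspy.OutputField()

class TaxCategory(dspy.Signature):
    """Assign a tax-relevant category to a product listing."""
    listing: str = dspy.InputField()
    tax_category: str = dspy.OutputField()

class ExtractionPipeline(dspy.Module):
    """Each agent carries its own short, focused prompt instead of one
    5K-token prompt that pays for the full context on every call."""

    def __init__(self):
        super().__init__()
        self.fraud = dspy.ChainOfThought(FraudSignals)
        self.tax = dspy.Predict(TaxCategory)

    def forward(self, listing: str) -> dspy.Prediction:
        return dspy.Prediction(
            fraud_flags=self.fraud(listing=listing).fraud_flags,
            tax_category=self.tax(listing=listing).tax_category,
        )
```

The design point is that each sub-module is a separately optimizable unit, which is what makes the per-agent optimization loops discussed next possible.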
For builders looking at the same migration, the lessons that generalize are concrete. Self-hosting an open-weights model at the 32B-parameter scale is now a practical option for high-volume bulk extraction workloads where API spend dominates the budget; Shopify's pipeline is exactly that shape. DSPy as the prompt-optimization framework is doing real work here; the meetup framing was that hand-engineered prompts on a smaller model would not have closed the quality gap, and that programmatic prompt compilation was what made the smaller model competitive. Multi-agent decomposition trades a single complex prompt for several simpler ones with their own optimization loops, which the Analytics India Magazine coverage notes is also computationally cheaper because each inference is shorter. The combination is the point: anyone trying just the model swap without the framework and architectural changes will not see a 75x improvement.
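For what those optimization loops concretely look like, here is a minimal DSPy compile step over the hypothetical pipeline sketched above. The metric, training examples, and choice of BootstrapFewShot are all placeholders; the talk did not specify which optimizer or metric Shopify used.

```python
import dspy

# ExtractionPipeline is the hypothetical module from the previous sketch.

def extraction_metric(example, pred, trace=None):
    # Placeholder metric: exact match on the tax category. Shopify's
    # actual quality metric was not disclosed.
    return float(pred.tax_category == example.tax_category)

trainset = [
    dspy.Example(listing="Organic cotton t-shirt, crew neck, size M",
                 tax_category="apparel").with_inputs("listing"),
    # ... more labeled examples
]

# The optimizer searches for few-shot demos that maximize the metric for
# each sub-module, replacing hand-tuned prompt text with a compile step.
optimizer = dspy.BootstrapFewShot(metric=extraction_metric, max_bootstrapped_demos=4)
compiled = optimizer.compile(ExtractionPipeline(), trainset=trainset)
compiled.save("extraction_pipeline.json")
```

Swapping in an instruction-search optimizer such as dspy.MIPROv2 follows the same compile pattern; either way, prompt quality becomes a compilation artifact rather than hand-maintained text.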
The honest caveats are also worth naming. There is no published paper. Hardware specifications for the self-hosted deployment are not disclosed, which matters because the per-unit cost number depends entirely on utilization rates. The 2x quality claim is against a single-prompt GPT-5 baseline that Shopify itself acknowledges was not tuned with the same care as the new pipeline, so the comparison is between an under-invested old system and an over-invested new one; the migration almost certainly looks better than it would against a fairer baseline. None of this makes the result wrong, but it does mean the right interpretation is "Shopify's specific workload, with their specific volume, on their specific hardware, with their team's specific DSPy expertise, runs 75x cheaper after this rework." Whether your workload generalizes that well is the question every team considering the same playbook needs to answer for itself.
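To put a toy number on the utilization caveat: self-hosted per-unit cost is fixed hardware spend divided by tasks actually served, so the same cluster swings from bargain to money pit with load. The figures below are invented for illustration; Shopify disclosed neither hardware nor throughput.

```python
# Toy amortization with invented numbers. Self-hosted cost per task is
# hourly hardware spend divided by tasks actually served in that hour.
GPU_COST_PER_HOUR = 2.50        # hypothetical $/hr for one inference GPU
PEAK_TASKS_PER_HOUR = 10_000    # hypothetical sustained throughput

for utilization in (1.0, 0.5, 0.1):
    tasks_served = PEAK_TASKS_PER_HOUR * utilization
    cost_per_task = GPU_COST_PER_HOUR / tasks_served
    print(f"{utilization:>5.0%} utilization -> ${cost_per_task:.6f} per task")
```

At 10% utilization the toy cost per task is 10x the full-load figure, which is one concrete reason the 75x headline travels poorly without the hardware details.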
