Story

Tell HN: Google increased existing finetuned model latency by 5x

deaux Tuesday, November 25, 2025

Since 5 days ago, the latency of our Finetuned 2.5 Flash models has suddenly jumped by 5x. For those less familiar, such finetuned models are often used to get close to the performance of a big model at one specific task with much less latency and cost. This means they're usually used for realtime, production use cases that see a lot of use and where you want to respond to the user quickly. Otherwise, finetuning generally isn't worth it. Many spend a few thousand dollars (at a minimum) on finetuning a model for one such task.

Five days ago, Google released Nano Banana Pro (Gemini 3.0 Image Preview) to the world. And since five days ago, the latency of our existing finetuned models has suddenly quintupled. We've talked with other startups who also make use of finetuned 2.5 Flash models, and they're seeing the exact same, even those in different regions. Obviously this has a big impact on all of our products.

From Google's side, nothing but silence, and this is talking about paid support. The reply to the initial support ticket is a request for basic information that has already been provided in that ticket or is trivially obvious. Since then, it's been more than 48 hours of nothingness.

Of course the timing could be a pure coincidence - though we've never seen any such latency instability before - but we can all see what's most likely here; Nano Banana Pro and Gemini 3 Preview consuming a huge amount of compute, and they're simply sacrificing finetuned model output for those. It's impossible to take them seriously for business use after this, who knows what they'll do next time. For all their faults, OpenAI have been a bastion of stability, despite being the most B2C-focused of all the frontier model providers. Google with Vertex claims to be all about enterprise and then breaks product of their business customers to get consumers their Ghibli images 1% faster. They've surely gotten plenty of tickets about this, and given Google's engineering, they must have automated monitoring that catches such a huge latency increase immediately. Temporary outages are understandable and happen everywhere, see AWS and Cloudflare recently, but 5+ days - if they even fix it - of 5x latency is effectively a 5+ day outage of a service.

I'm posting this mostly as a warning to other startups here to not rely on Google Vertex for user-facing model needs going forward.

2 0
Read on Hacker News