## TL;DR
All three major providers offer a "flex" tier - same models, same API, 50% off. Add `service_tier: "flex"` to your request and accept slightly higher latency. Good for anything that isn't user-facing chat. You'll need longer timeouts and retry logic.
Most LLM API calls don't need instant responses. If you're running evals, pipelines, or bulk processing - you're paying for priority you don't need.
A pipeline doing 50,000 calls to Gemini 3 Flash (2k input, 1k output tokens each) costs ~$100 at standard rates; flex cuts that to ~$50. On Gemini 3.1 Pro, ~$400 becomes ~$200. Prompt caching stacks on top.
## Calculate your savings
Plug in your numbers - a sketch of the math follows. Caching compounds with the flex discount, so account for how much of each call is a reused system prompt.
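A back-of-envelope version in Python. The prices are placeholders picked to land on the Flash example above - swap in your model's real per-million-token rates:

```python
def flex_cost(calls, in_tokens, out_tokens,
              price_in, price_out,      # $ per million tokens, standard tier
              cached_fraction=0.0,      # share of input tokens served from cache
              cache_discount=0.75,      # cached input billed at 25% of list price
              flex_discount=0.5):
    """Estimated bill: the flex discount applies on top of any cache discount."""
    in_cost = calls * in_tokens / 1e6 * price_in
    # Cached tokens pay only (1 - cache_discount) of the input price.
    in_cost *= (1 - cached_fraction) + cached_fraction * (1 - cache_discount)
    out_cost = calls * out_tokens / 1e6 * price_out
    return (in_cost + out_cost) * (1 - flex_discount)

# The example above: 50k calls, 2k in / 1k out. Placeholder rates of
# $0.50/M input and $1.00/M output give ~$100 standard, ~$50 on flex.
print(flex_cost(50_000, 2_000, 1_000, price_in=0.50, price_out=1.00))  # 50.0
```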
## What flex inference is
Standard API calls get priority GPU access. Flex tells the provider "I can wait" - half price in exchange.
It's still synchronous. You get a real-time response, can stream tokens, can chain calls. The only difference is lower queue priority. Google calls it "sheddable compute" - when standard traffic spikes, flex gets bumped. When it's quiet, you barely notice.
All three providers use the same parameter: `service_tier` set to `"flex"`.
## How to enable it
Gemini (April 2026, all 3.x and 2.5 models):
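A minimal sketch with the google-genai Python SDK. The `service_tier` field placement and the `gemini-3-flash` model id are assumptions based on the description above - check the SDK reference:

```python
from google import genai
from google.genai import types

# Generous timeout up front - flex requests can queue for minutes (value in ms).
client = genai.Client(http_options=types.HttpOptions(timeout=600_000))

response = client.models.generate_content(
    model="gemini-3-flash",  # assumed model id
    contents="Deduplicate these customer records: ...",
    config=types.GenerateContentConfig(
        service_tier="flex",  # assumed field name, per this post
    ),
)
print(response.text)
```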
OpenAI (April 2025, gpt-5.5, gpt-5.4 variants, o3, o4-mini):
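With the OpenAI Python SDK it's one extra argument - o4-mini here, as one of the flex-eligible models:

```python
from openai import OpenAI

client = OpenAI(timeout=900.0)  # raise the default timeout; flex can queue

response = client.chat.completions.create(
    model="o4-mini",
    service_tier="flex",  # 50% off, lower queue priority
    messages=[{"role": "user", "content": "Label the sentiment of: ..."}],
)
print(response.choices[0].message.content)
```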
AWS Bedrock (November 2025, Nova, DeepSeek, Qwen3 - no Claude, no Llama):
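On Bedrock the tier rides along on the Converse API request. Bedrock's JSON is camelCase, so the `serviceTier` spelling below is an assumption from the shared `service_tier` convention - verify against the Converse API reference:

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="amazon.nova-pro-v1:0",  # Nova qualifies; Claude and Llama don't
    messages=[{"role": "user", "content": [{"text": "Extract entities from: ..."}]}],
    serviceTier="flex",  # assumed field name, per this post
)
print(response["output"]["message"]["content"][0]["text"])
```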
If you're using Claude on Bedrock and want a discount, flex won't help - you'd need the fully async Batch API.
## What to watch out for
Timeouts. Defaults are too short. Flex requests can queue for 2-3 minutes during busy periods. Set yours to at least 10 minutes (600_000 ms for Gemini).
Retry with fallback. Peak hours mean 429s (OpenAI) or 503s (Gemini) when capacity runs out. You need backoff + a standard-tier fallback:
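A sketch against the OpenAI SDK - the same shape works with any client that surfaces the capacity error:

```python
import time
from openai import OpenAI, RateLimitError

client = OpenAI(timeout=900.0)

def flex_with_fallback(messages, model="o4-mini", max_retries=4):
    """Exponential backoff on flex, then fall back to the standard tier."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=model, messages=messages, service_tier="flex"
            )
        except RateLimitError:        # 429: flex capacity exhausted right now
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s
    # Still no flex capacity - pay full price rather than fail the run.
    return client.chat.completions.create(model=model, messages=messages)
```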
Time of day. Early mornings and weekends, flex is near-instant. Tuesday afternoon US time, you'll feel it. Schedule heavy runs accordingly.
No auto-fallback on Gemini. Unlike Priority tier (which degrades to Standard gracefully), flex just fails with 503 when full. Build the fallback yourself.
Caching stacks. Flex discount compounds with prompt caching. Same system prompt across many calls = 50% off the already-discounted cached price.
## Flex vs batch
Both are 50% off. Different trade-offs.
| | Flex | Batch |
|---|---|---|
| API | Same endpoint, one parameter | Separate API, JSONL upload |
| Response | Synchronous, seconds to minutes | Async, up to 24 hours |
| Streaming | Yes | No |
| Chaining | Yes | Not practical |
| Code change | One line | Rewrite |
Batch is for 10,000 independent prompts you can leave overnight. Flex is for everything where you want the discount but need synchronous calls - especially pipelines where each step depends on the previous.
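For example, a dependent two-step pipeline - the kind batch can't express - stays trivial on flex, sketched here against the OpenAI tier:

```python
from openai import OpenAI

client = OpenAI(timeout=900.0)

def step(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="o4-mini",
        service_tier="flex",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Step 2 consumes step 1's output - only possible because flex is synchronous.
summary = step("Summarize this document: ...")
tags = step(f"Generate topic tags for this summary: {summary}")
```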
## Why 50% off is possible
GPU clusters are sized for peak. Off-peak, a lot of that capacity sits idle - powered on, cooled, depreciating, producing nothing.
Same as airlines. The plane flies whether every seat is full or not. Selling an empty seat at half price beats leaving it empty. For GPU providers, flex requests on idle hardware are revenue they'd otherwise never see.
The "sheddable" part is key. No capacity guarantees for flex. If standard spikes, flex gets preempted. No risk of cannibalization - genuinely different needs.
## Where it came from
AWS Spot Instances (2009) pioneered selling spare compute at a discount. GCP and Azure followed.
For LLM inference specifically:
- SpotServe (2023, ASPLOS 2024) - first serving system for preemptible GPUs. 54% cost savings.
- SkyServe (2024, Berkeley) - multi-cloud spot serving.
- SageServe (2025, Microsoft) - formalized the Interactive/Non-Interactive/Opportunistic model that maps to Standard/Batch/Flex.
Commercial timeline:
- April 2024 - OpenAI Batch API. First 50% discount on LLM inference.
- October 2024 - Anthropic Message Batches.
- April 2025 - OpenAI Flex Processing. First synchronous discount tier.
- November 2025 - Bedrock Flex and Priority tiers.
- April 2026 - Google Flex Inference for Gemini.
OpenAI launched Flex the day after Google dropped Gemini 2.5 Flash. DeepSeek's cheaper inference earlier that year forced everyone to compete on price.
## When to use it
If a human is staring at a screen waiting - use standard. Everything else - flex.
That means: evals, data enrichment, content pipelines, background agents, cron jobs, bulk processing. Where it doesn't work: user-facing chat, real-time features, or Claude/Llama on Bedrock (no flex support - use the Batch API instead).