Tools & Tech · May 12, 2026 · 4 min read

Flex Inference: 50% Off LLM Calls on Gemini, OpenAI, and Bedrock

Every major AI provider now offers half-price inference if you can tolerate extra latency - usually a few seconds, sometimes a few minutes at peak. One parameter change. Same API. Here's how it works and why.

LLMs · inference · cost-optimization · Gemini · OpenAI · AWS · tool

TL;DR

All three major providers offer a "flex" tier - same models, same API, 50% off. Add service_tier: "flex" to your request, accept slightly higher latency. Good for anything that isn't user-facing chat. You'll need longer timeouts and retry logic.

Most LLM API calls don't need instant responses. If you're running evals, pipelines, or bulk processing - you're paying for priority you don't need.

A pipeline doing 50,000 calls to Gemini 3 Flash (2k input, 1k output tokens each) costs ~$200 on standard. With flex, $100. On Gemini 3.1 Pro, $800 becomes $400. Prompt caching stacks on top.

Calculate your savings

The figures below use the 50,000-call pipeline from above (2k input, 1k output tokens per call) with 0% cached input. Prompt caching compounds with the flex discount - if you reuse the same system prompt across calls, the savings grow further.

Model | Standard | Flex | Saved
Gemini 3.1 Flash-Lite | $100.00 | $50.00 | $50.00
Gemini 3 Flash | $200.00 | $100.00 | $100.00
Gemini 3.1 Pro | $800.00 | $400.00 | $400.00
Gemini 2.5 Flash | $155.00 | $77.50 | $77.50
Gemini 2.5 Pro | $625.00 | $312.50 | $312.50
OpenAI gpt-5.4-mini | $300.00 | $150.00 | $150.00
OpenAI gpt-5.4 | $1,000.00 | $500.00 | $500.00
OpenAI gpt-5.5 | $2,000.00 | $1,000.00 | $1,000.00
OpenAI o3 | $3,000.00 | $1,500.00 | $1,500.00
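
If you'd rather script the estimate, the arithmetic is one function. A minimal sketch - the per-million-token prices below are placeholders picked to reproduce the ~$200 Gemini 3 Flash example, not published rates, and the cache discount is an assumption:

def pipeline_cost(
    calls: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,       # $ per 1M input tokens (placeholder)
    output_price_per_m: float,      # $ per 1M output tokens (placeholder)
    cached_fraction: float = 0.0,   # share of input tokens served from the prompt cache
    cached_discount: float = 0.75,  # assumption: cached input billed at 25% of full price
    flex: bool = True,
) -> float:
    """Rough cost estimate: token volume x price, with cache and flex discounts."""
    full_input = input_tokens * (1 - cached_fraction)
    cached_input = input_tokens * cached_fraction * (1 - cached_discount)
    input_cost = calls * (full_input + cached_input) / 1e6 * input_price_per_m
    output_cost = calls * output_tokens / 1e6 * output_price_per_m
    total = input_cost + output_cost
    return total * 0.5 if flex else total  # flex: 50% off the whole bill

# The 50,000-call example from above with made-up prices: 100.0 on flex, 200.0 on standard.
print(pipeline_cost(50_000, 2_000, 1_000, input_price_per_m=1.0, output_price_per_m=2.0))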

What flex inference is

Standard API calls get priority GPU access. Flex tells the provider "I can wait" - half price in exchange.

It's still synchronous. You get a real-time response, can stream tokens, can chain calls. The only difference is lower queue priority. Google calls it "sheddable compute" - when standard traffic spikes, flex gets bumped. When it's quiet, you barely notice.

All three providers expose the same knob: a service tier parameter set to "flex" - service_tier on Gemini and OpenAI, serviceTier on Bedrock.

How to enable it

Gemini (April 2026, all 3.x and 2.5 models):

from google import genai
 
client = genai.Client()
 
response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents="Summarize this document...",
    config={
        "service_tier": "flex",
        "http_options": {"timeout": 600_000},  # 10 min, in ms
    },
)

OpenAI (April 2025, gpt-5.5, gpt-5.4 variants, o3, o4-mini):

from openai import OpenAI
 
client = OpenAI()
 
response = client.chat.completions.create(
    model="o3",
    messages=[{"role": "user", "content": "Explain quantum computing."}],
    service_tier="flex",
)

AWS Bedrock (November 2025, Nova, DeepSeek, Qwen3 - no Claude, no Llama):

import boto3
 
client = boto3.client("bedrock-runtime")
 
response = client.converse(
    modelId="amazon.nova-pro-v1:0",
    messages=[{"role": "user", "content": [{"text": "Summarize this..."}]}],
    serviceTier={"type": "flex"},
)

If you're using Claude on Bedrock and want a discount, flex won't help - you'd need the fully async Batch API.
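
For completeness, the batch path on Bedrock is a separate job-based workflow (create_model_invocation_job on the control-plane bedrock client) rather than a parameter on converse. A rough sketch - the bucket URIs, role ARN, and model ID are placeholders:

import boto3

bedrock = boto3.client("bedrock")  # control-plane client, not bedrock-runtime

# Input is a JSONL file in S3; each line holds one model invocation payload.
job = bedrock.create_model_invocation_job(
    jobName="claude-bulk-summaries",                              # placeholder
    modelId="anthropic.claude-sonnet-4-20250514-v1:0",           # placeholder model ID
    roleArn="arn:aws:iam::123456789012:role/BedrockBatchRole",   # placeholder
    inputDataConfig={"s3InputDataConfig": {"s3Uri": "s3://my-bucket/input/"}},
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://my-bucket/output/"}},
)
# Poll with bedrock.get_model_invocation_job(jobIdentifier=job["jobArn"]).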

What to watch out for

Timeouts. Defaults are too short. Flex requests can queue for 2-3 minutes during busy periods. Set to at least 10 minutes (600_000 ms for Gemini).
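
The Gemini snippet above already does this through http_options. For the OpenAI and Bedrock SDKs, the rough equivalent is a client-level timeout - the 900-second value here is illustrative, not an official recommendation:

import boto3
from botocore.config import Config
from openai import OpenAI

# OpenAI: set a generous per-request timeout (seconds) so queued flex calls aren't cut off.
openai_client = OpenAI(timeout=900.0)

# Bedrock: raise botocore's read timeout; disable its built-in retries if you
# handle backoff and fallback yourself.
bedrock_client = boto3.client(
    "bedrock-runtime",
    config=Config(read_timeout=900, connect_timeout=10, retries={"max_attempts": 0}),
)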

Retry with fallback. Peak hours mean 429s (OpenAI) or 503s (Gemini) when capacity runs out. You need backoff + a standard-tier fallback:

import time
from google.genai import errors
 
def flex_with_fallback(client, prompt, max_retries=3):
    """Try flex first; back off on capacity errors; fall back to standard tier."""
    for attempt in range(max_retries):
        try:
            return client.models.generate_content(
                model="gemini-3-flash-preview",
                contents=prompt,
                config={"service_tier": "flex"},
            )
        except errors.APIError as e:
            # Only retry capacity errors (429 rate limit / 503 overloaded).
            if getattr(e, "code", None) not in (429, 503):
                raise
            if attempt < max_retries - 1:
                time.sleep(5 * (2 ** attempt))  # exponential backoff
            else:
                # Flex capacity exhausted - pay the standard rate rather than fail.
                return client.models.generate_content(
                    model="gemini-3-flash-preview",
                    contents=prompt,
                )

Time of day. Early mornings and weekends, flex is near-instant. Tuesday afternoon US time, you'll feel it. Schedule heavy runs accordingly.

No auto-fallback on Gemini. Unlike Priority tier (which degrades to Standard gracefully), flex just fails with 503 when full. Build the fallback yourself.

Caching stacks. Flex discount compounds with prompt caching. Same system prompt across many calls = 50% off the already-discounted cached price.

Flex vs batch

Both are 50% off. Different trade-offs.

 | Flex | Batch
API | Same endpoint, one parameter | Separate API, JSONL upload
Response | Synchronous, seconds to minutes | Async, up to 24 hours
Streaming | Yes | No
Chaining | Yes | Not practical
Code change | One line | Rewrite

Batch is for 10,000 independent prompts you can leave overnight. Flex is for everything where you want the discount but need synchronous calls - especially pipelines where each step depends on the previous.
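
For contrast, here's roughly what the batch path looks like on OpenAI: build a JSONL file of independent requests, upload it, and create a job with a 24-hour completion window. A minimal sketch - the model name is taken from the table above, so swap in whatever you actually run:

import json
from openai import OpenAI

client = OpenAI()

# Each JSONL line is one independent request with its own custom_id.
with open("requests.jsonl", "w") as f:
    for i, prompt in enumerate(["Summarize doc 1...", "Summarize doc 2..."]):
        f.write(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "gpt-5.4-mini",
                     "messages": [{"role": "user", "content": prompt}]},
        }) + "\n")

batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
# Poll later with client.batches.retrieve(batch.id), then download the output file.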

Why 50% off is possible

GPU clusters are sized for peak. Off-peak, a lot of that capacity sits idle - powered on, cooled, depreciating, producing nothing.

Same as airlines. The plane flies whether every seat is full or not. Selling an empty seat at half price beats leaving it empty. For GPU providers, flex requests on idle hardware are revenue they'd otherwise never see.

The "sheddable" part is key. No capacity guarantees for flex. If standard spikes, flex gets preempted. No risk of cannibalization - genuinely different needs.

Where it came from

AWS Spot Instances (2009) pioneered selling spare compute at a discount. GCP and Azure followed.

For LLM inference specifically:

  • SpotServe (2023, ASPLOS 2024) - first serving system for preemptible GPUs. 54% cost savings.
  • SkyServe (2024, Berkeley) - multi-cloud spot serving.
  • SageServe (2025, Microsoft) - formalized the Interactive/Non-Interactive/Opportunistic model that maps to Standard/Batch/Flex.

Commercial timeline:

  • April 2024 - OpenAI Batch API. First 50% discount on LLM inference.
  • October 2024 - Anthropic Message Batches.
  • April 2025 - OpenAI Flex Processing. First synchronous discount tier.
  • November 2025 - Bedrock Flex and Priority tiers.
  • April 2026 - Google Flex Inference for Gemini.

OpenAI launched Flex the day after Google dropped Gemini 2.5 Flash. DeepSeek's cheaper inference earlier that year forced everyone to compete on price.

When to use it

If a human is staring at a screen waiting - use standard. Everything else - flex.

That means: evals, data enrichment, content pipelines, background agents, cron jobs, bulk processing. Where it doesn't work: user-facing chat, real-time features, or Claude/Llama on Bedrock (no flex support - use Batch API instead).
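
One practical pattern is to make the tier a per-call switch, so the same pipeline code serves both paths. A sketch using the Gemini client from earlier - adapt the model name and the default to taste:

from google import genai

client = genai.Client()

def generate(prompt: str, user_facing: bool = False):
    """Standard tier when a human is waiting; flex for everything else."""
    config = None if user_facing else {"service_tier": "flex"}
    return client.models.generate_content(
        model="gemini-3-flash-preview",
        contents=prompt,
        config=config,
    )

# Background eval run: generate(prompt). Chat endpoint: generate(msg, user_facing=True).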
