
Envoy AI Gateway v0.5.x

Multi-gateway configuration, prompt caching cost savings, fine-grained MCP authorization, OpenAI Responses API, and Google Search grounding for Gemini

v0.5.0

January 23, 2026
GatewayConfig CRD · OpenAI Responses API · AWS Bedrock Caching · CEL Authorization · MCP Stdio Servers · Google Search Grounding · Body Mutation
Envoy AI Gateway v0.5.0 makes multi-gateway deployments easier with the new GatewayConfig CRD, cuts costs with prompt caching for AWS Bedrock and GCP Claude, and unlocks fine-grained access control with CEL-based MCP authorization. Developers gain OpenAI Responses API support, Google Search grounding for Gemini, and the ability to mutate request bodies per-route. Under the hood, the switch to sonic JSON processing reduces latency across all requests.

✨ New Features

Gateway Configuration

New GatewayConfig CRD

Introduces a new GatewayConfig custom resource for gateway-scoped configuration. Reference it from a Gateway via the aigateway.envoyproxy.io/gateway-config annotation to configure the external processor container with environment variables, resource requirements, and other Kubernetes container settings. Multiple Gateways can share the same GatewayConfig for consistent configuration.

Configurable endpoint prefixes

Added prefix field to VersionedAPISchema for configuring API endpoint prefixes. Useful for routing to backends with OpenAI-compatible APIs that use non-standard prefixes, such as Gemini's /v1beta/openai or Cohere's /compatibility/v1.

OpenAI API Support

OpenAI Responses API /v1/responses

Full support for OpenAI's new Responses API endpoint with streaming and non-streaming modes, function calling, MCP tools support, reasoning, multi-turn conversations, and native multimodal capabilities. Includes complete token usage tracking and OpenInference tracing.
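For example, a minimal non-streaming request to the gateway's /v1/responses endpoint might look like the following (the model name is a placeholder for whatever your route exposes):

POST /v1/responses
{
  "model": "gpt-4o-mini",
  "input": "Summarize the last deployment in one sentence."
}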

Provider Caching Enhancements

Prompt caching for AWS Bedrock Claude

Reduce costs and latency by reusing cached system prompts with Bedrock Anthropic models. The gateway handles cache point markers automatically and tracks both cache creation and cache hit tokens separately for accurate billing visibility.
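If you want that billing visibility in your own metrics, you can attach the new cache cost types (listed under API Updates below) to a route. A minimal sketch, assuming the existing llmRequestCosts layout on AIGatewayRoute and hypothetical metadata key names:

llmRequestCosts:
  - metadataKey: cache_creation_input_tokens
    type: CacheCreationInputToken
  - metadataKey: cached_input_tokens
    type: CachedInputToken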

Prompt caching for GCP Vertex AI Claude

The same cost-saving prompt caching now works with Claude models on GCP Vertex AI. Cache your system prompts and few-shot examples to cut input token costs on repeated requests.

MCP Gateway Enhancements

Fine-grained authorization with CEL, JWT claims, and external auth

Comprehensive authorization system for MCP routes. Write expressive CEL rules using request attributes (HTTP method, headers, JWT claims, tool names and call arguments), enforce access based on JWT claim values from your identity provider, or delegate to external gRPC/HTTP authorization services. Control which users can access which tools with precision.
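As an illustration, an authorization block on an MCPRoute security policy could look roughly like the sketch below. Only defaultAction, rules, and cel are taken from the API Updates section; the rule layout and the CEL attribute names are illustrative assumptions, not the exact schema:

authorization:
  defaultAction: Deny
  rules:
    - action: Allow
      # Hypothetical CEL expression: allow the platform team to call read-only tools.
      cel: 'jwt.claims["group"] == "platform-team" && tool.name.startsWith("read_")'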

Real-time tool list synchronization

MCP clients automatically stay in sync when you update MCPRoutes. The gateway sends notifications/tools/list_changed notifications, so clients refresh their available tools without manual intervention or reconnection.

Stdio server proxy in standalone mode

Run command-line MCP tools (like npx-based servers) without code changes. The aigw CLI proxies stdio-based MCP servers over HTTP, letting you integrate CLI tools directly into your AI workflows.
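As a purely hypothetical sketch (the key names are assumptions, not the documented aigw schema), a stdio server entry following the common command/args convention might look like:

# Hypothetical fragment; key names are assumptions, not the exact aigw schema.
mcpServers:
  local-tools:
    command: npx
    args: ["-y", "your-mcp-server-package"]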

Improved OAuth metadata discovery

OAuth well-known endpoints now serve at the MCPRoute path prefix, ensuring MCP clients correctly discover authorization configuration even when multiple routes have different security policies.

Inference Extension

Security policies for inference pools

Apply BackendSecurityPolicy to InferencePool resources, ensuring consistent authentication across all dynamically-selected inference endpoints in your pool.
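A hedged sketch, assuming an API-key credential (adapt the type and secret reference to your provider); the targetRefs group and kind follow the API Updates section below:

apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: BackendSecurityPolicy
metadata:
  name: inference-pool-auth
spec:
  targetRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: my-inference-pool
  type: APIKey
  apiKey:
    secretRef:
      name: my-inference-api-key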

Gemini Provider Enhancements

Ground responses with live web search

Give Gemini models access to real-time information via the google_search tool type. Filter results by domain, set blocking confidence thresholds, and restrict time ranges to ensure responses use current, relevant sources.
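For instance, a chat completions request that lets Gemini ground its answer with live search might declare the tool like this (the model name is a placeholder; the domain, confidence, and time-range filters mentioned above are configured with additional fields not shown here):

{
  "model": "gemini-2.5-pro",
  "messages": [
    {"role": "user", "content": "What changed in the latest Kubernetes release?"}
  ],
  "tools": [
    {"type": "google_search"}
  ]
}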

Consistent thinking configuration across providers

Use the same thinking configuration for both Anthropic and Gemini models. Write provider-agnostic code that works with extended thinking features regardless of which model backend processes the request.
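For example, the Anthropic-style thinking block below can be sent unchanged whether the route targets an Anthropic or a Gemini backend (the model name is a placeholder; treat the field shape as a sketch rather than the full schema):

{
  "model": "your-model",
  "messages": [{"role": "user", "content": "Plan a zero-downtime database migration."}],
  "thinking": {"type": "enabled", "budget_tokens": 2048}
}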

Gemini 3 reasoning and image quality controls

Fine-tune Gemini 3 behavior with thinking_level (control reasoning depth) and media_resolution (balance image quality vs. speed). These settings gracefully degrade on older Gemini versions.
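A hedged request sketch using both knobs; the model name is a placeholder and the values shown are illustrative, not a verified list:

{
  "model": "your-gemini-3-model",
  "messages": [{"role": "user", "content": "Describe this architecture diagram."}],
  "thinking_level": "high",
  "media_resolution": "low"
}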

Visibility into model reasoning

When thinking is enabled, the gateway extracts and surfaces thought summaries from Gemini responses, helping you understand how the model reached its conclusions.

Enterprise web search integration

Connect Gemini to your organization's search infrastructure with the enterprise_search tool type. Configure custom data sources for grounding responses in enterprise-specific knowledge.

Traffic Management

Route-level body mutation

Inject or remove JSON fields in request bodies per-backend without application changes. Use bodyMutation to add service_tier, set custom parameters, or strip internal fields before requests reach different providers. Route-level settings override backend defaults for flexible multi-backend configurations.
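A sketch of a route rule backend reference using bodyMutation; the set and remove shapes follow the API Updates section below, the sub-field names are a best guess, and the surrounding AIGatewayRoute structure is abbreviated:

backendRefs:
  - name: openai-backend
    bodyMutation:
      set:
        - field: service_tier
          value: "flex"
      remove:
        - internal_request_id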

AWS Bedrock service tier control

Choose among the standard, flex, priority, and reserved service tiers for Bedrock requests. Set service_tier to prioritize latency-sensitive workloads or optimize costs for batch processing, with automatic fallback handling.
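Hedged example of a client request opting into the flex tier (the model name is a placeholder for your Bedrock Claude model):

{
  "model": "your-bedrock-claude-model",
  "messages": [{"role": "user", "content": "Classify this support ticket."}],
  "service_tier": "flex"
}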

Observability Enhancements

Per-provider cost attribution

Track spending across AI providers with the new gen_ai.provider.name metric attribute. Filter dashboards and alerts by provider to identify cost drivers and optimize your multi-provider strategy.

Full tracing for Anthropic Messages API

Debug Claude model requests end-to-end with OpenInference-compliant tracing for the native /messages endpoint. View prompts, responses, and timing in Arize Phoenix or any OpenTelemetry-compatible observability platform.

Cohere Rerank visibility

Track reranking operations in your traces. Full OpenTelemetry support for Cohere's v2 rerank endpoint captures query, documents, and relevance scores for debugging RAG pipelines.

Performance and Operations

Faster request processing with sonic JSON

Reduced gateway latency by migrating to bytedance/sonic for JSON encoding and decoding. Expect measurable latency improvements and lower CPU usage, especially for large payloads and high request volumes.

Faster cross-namespace reference validation

Reduced controller reconciliation time with optimized ReferenceGrant indexing. Cross-namespace backend references now validate faster, improving startup time and configuration change responsiveness.

Improved MCP proxy throughput

MCP proxy now reuses HTTP connections across requests, eliminating per-request connection overhead. This significantly improves throughput when proxying to multiple MCP backend servers. View the MCP performance blog post for details →

🔗 API Updates

  • New GatewayConfig CRD: New custom resource for gateway-level configuration. Spec includes extProc.kubernetes for container settings (resources, env vars, image overrides). Reference from Gateway via aigateway.envoyproxy.io/gateway-config annotation.
  • VersionedAPISchema.prefix: New prefix field replaces overloading version for endpoint path customization. Example: prefix: "/v1beta/openai" for Gemini's OpenAI-compatible API.
  • AIGatewayRouteRuleBackendRef.bodyMutation: New field with set (list of field/value pairs) and remove (list of field names) for request body manipulation.
  • LLMRequestCostType.CacheCreationInputToken: New cost type for tracking tokens written to cache, separate from CachedInputToken (tokens read from cache).
  • MCPRouteSecurityPolicy authorization fields: New authorization block with defaultAction (Allow/Deny), rules array supporting cel expressions, JWT scopes/claims, and tools targeting. New extAuth field for external authorization delegation.
  • BackendSecurityPolicy.targetRefs expansion: Now accepts InferencePool (group: inference.networking.x-k8s.io) in addition to AIServiceBackend.

Deprecations

  • AIGatewayFilterConfigExternalProcessor.resources: The resources field in AIGatewayFilterConfigExternalProcessor is deprecated. Use GatewayConfig for gateway-scoped resource configuration instead. This field will be removed in v0.6.
  • version field as prefix for OpenAI schema: Using the version field as a prefix for the OpenAI schema is deprecated. Use the new prefix field instead. The legacy behavior will be removed in v0.6.

🐛 Bug Fixes

  • AWS Bedrock Claude streaming reliability: Streaming responses from Bedrock Claude models now complete correctly. Previously, some streamed responses could be truncated or malformed.
  • Gemini streaming token counts: Token usage in Gemini streaming responses now matches OpenAI format, fixing integrations that parse usage from stream chunks.
  • Multi-chunk Gemini tool calls: Tool calls that span multiple streaming chunks now have correct indices, preventing function calling errors with Gemini.
  • GCP Claude reasoning content: Reasoning/thinking content now correctly passes through for Claude models on GCP Vertex AI in both requests and responses.
  • Zero-weight backend references: Backend references with zero weight no longer cause routing errors, allowing gradual traffic shifts via weight changes.
  • Umbrella chart image pull secrets: Helm deployments within umbrella charts now correctly inherit global.imagePullSecrets when chart-level secrets aren't set.
  • GCP global region backends: Vertex AI backends configured with global region now work correctly instead of failing during setup.
  • Accurate per-token latency metrics: Fixed integer truncation in time_per_output_token calculation that caused incorrect latency measurements for fast responses.
  • Anthropic token counting: Improved accuracy of input and output token counts for Anthropic models, ensuring billing metrics match provider reports.

📖 Upgrade Guidance

Migrating to GatewayConfig

If you're using AIGatewayFilterConfigExternalProcessor.resources for container resource configuration, migrate to the new GatewayConfig CRD:

  1. Create a GatewayConfig resource with your desired configuration:
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: GatewayConfig
metadata:
  name: my-gateway-config
  namespace: default
spec:
  extProc:
    kubernetes:
      resources:
        requests:
          cpu: "100m"
          memory: "128Mi"
        limits:
          cpu: "500m"
          memory: "512Mi"
      env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://otel-collector:4317"
  2. Reference the GatewayConfig from your Gateway:
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: ai-gateway
  annotations:
    aigateway.envoyproxy.io/gateway-config: my-gateway-config

Migrating Endpoint Prefix Configuration

If you're using the version field as a prefix for OpenAI-compatible backends, migrate to the new prefix field:

Before:

schema:
  name: OpenAI
  version: "/v1beta/openai" # Deprecated usage

After:

schema:
  name: OpenAI
  prefix: "/v1beta/openai"

📦 Dependencies Versions

Go 1.25.6

Updated to Go 1.25.6 for improved performance and security fixes.

Envoy Gateway v1.6

Built on Envoy Gateway v1.6 for enhanced data plane capabilities and stability improvements.

Envoy v1.36

Leveraging Envoy Proxy v1.36.4 with the latest networking and security features.

Gateway API v1.4.0

Support for Gateway API v1.4.0 specifications.

Gateway API Inference Extension v1.0.2

Integration with Gateway API Inference Extension v1.0.2 for stable intelligent endpoint selection.

⏩ Patch Releases

🙏 Acknowledgements

We extend our gratitude to all contributors who made this release possible. Special thanks to:

  • The growing community of adopters including Bloomberg, LY Corporation, Alan by Comma Soft, and NRP for their valuable feedback and production insights
  • Everyone who reported bugs, submitted PRs, and participated in design discussions
  • The Envoy Gateway team for their continued collaboration

🔮 What's Next

We're already working on exciting features for future releases:

  • Additional provider integrations - AWS Bedrock InvokeModel API support for Claude and GPT models, Gemini embeddings, and Azure/AKS workload identity
  • Batch inference APIs - Support for batch processing to improve throughput for high-volume workloads
  • Advanced caching strategies - Prompt cache key and retention controls for OpenAI chat completions
  • Upstream provider quota policies - New API for managing upstream provider quotas
  • Sensitive data redaction - Request and response body redaction for protecting sensitive information