Envoy AI Gateway v0.5.x
v0.5.0
✨ New Features
Gateway Configuration
GatewayConfig CRD: Introduces a new GatewayConfig custom resource for gateway-scoped configuration. Reference it from a Gateway via the aigateway.envoyproxy.io/gateway-config annotation to configure the external processor container with environment variables, resource requirements, and other Kubernetes container settings. Multiple Gateways can share the same GatewayConfig for consistent configuration.
Added prefix field to VersionedAPISchema for configuring API endpoint prefixes. Useful for routing to backends with OpenAI-compatible APIs that use non-standard prefixes, such as Gemini's /v1beta/openai or Cohere's /compatibility/v1.
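For example, an AIServiceBackend pointing at Gemini's OpenAI-compatible surface could set the prefix as in the sketch below (resource names are illustrative; consult the AIServiceBackend reference for the full spec):

apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIServiceBackend
metadata:
  name: gemini-openai
spec:
  schema:
    name: OpenAI
    # New in v0.5.0: path prefix used by the backend's OpenAI-compatible API.
    prefix: "/v1beta/openai"
  backendRef:
    name: gemini-backend
    kind: Backend
    group: gateway.envoyproxy.io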
OpenAI API Support
/v1/responses: Full support for OpenAI's new Responses API endpoint with streaming and non-streaming modes, function calling, MCP tools support, reasoning, multi-turn conversations, and native multimodal capabilities. Includes complete token usage tracking and OpenInference tracing.
Provider Caching Enhancements
Reduce costs and latency by reusing cached system prompts with Bedrock Anthropic models. The gateway handles cache point markers automatically and tracks both cache creation and cache hit tokens separately for accurate billing visibility.
Same cost-saving prompt caching now works with Claude models on GCP Vertex AI. Cache your system prompts and few-shot examples to cut input token costs on repeated requests.
MCP Gateway Enhancements
Comprehensive authorization system for MCP routes. Write expressive CEL rules using request attributes (HTTP method, headers, JWT claims, tool names and call arguments), enforce access based on JWT claim values from your identity provider, or delegate to external gRPC/HTTP authorization services. Control which users can access which tools with precision.
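As a rough sketch of what such a policy fragment can look like, using the defaultAction/rules/cel fields listed under API Updates below (the exact rule schema and the CEL attribute names are illustrative, not definitive):

# Fragment of an MCPRoute security policy's authorization block.
authorization:
  defaultAction: Deny        # deny any request that no rule explicitly allows
  rules:
    # Allow POSTs from callers whose JWT carries the expected group claim;
    # attribute names in the CEL expression are hypothetical.
    - cel: 'request.method == "POST" && "platform-team" in jwt.claims.groups'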
MCP clients automatically stay in sync when you update MCPRoutes. The gateway sends notifications/tools/list_changed notifications, so clients refresh their available tools without manual intervention or reconnection.
Run command-line MCP tools (like npx-based servers) without code changes. The aigw CLI proxies stdio-based MCP servers over HTTP, letting you integrate CLI tools directly into your AI workflows.
OAuth well-known endpoints now serve at the MCPRoute path prefix, ensuring MCP clients correctly discover authorization configuration even when multiple routes have different security policies.
Inference Extension
Apply BackendSecurityPolicy to InferencePool resources, ensuring consistent authentication across all dynamically-selected inference endpoints in your pool.
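A minimal sketch of a policy targeting an InferencePool (the API-key wiring follows the existing BackendSecurityPolicy pattern and is shown for illustration only):

apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: BackendSecurityPolicy
metadata:
  name: inference-pool-auth
spec:
  # New in v0.5.0: targetRefs may reference an InferencePool.
  targetRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: my-inference-pool
  type: APIKey
  apiKey:
    secretRef:
      name: provider-api-key   # Secret holding the provider API key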
Gemini Provider Enhancements
Give Gemini models access to real-time information via the google_search tool type. Filter results by domain, set blocking confidence thresholds, and restrict time ranges to ensure responses use current, relevant sources.
Use the same thinking configuration for both Anthropic and Gemini models. Write provider-agnostic code that works with extended thinking features regardless of which model backend processes the request.
Fine-tune Gemini 3 behavior with thinking_level (control reasoning depth) and media_resolution (balance image quality vs. speed). These settings gracefully degrade on older Gemini versions.
When thinking is enabled, the gateway extracts and surfaces thought summaries from Gemini responses, helping you understand how the model reached its conclusions.
Connect Gemini to your organization's search infrastructure with the enterprise_search tool type. Configure custom data sources for grounding responses in enterprise-specific knowledge.
Traffic Management
Inject or remove JSON fields in request bodies per-backend without application changes. Use bodyMutation to add service_tier, set custom parameters, or strip internal fields before requests reach different providers. Route-level settings override backend defaults for flexible multi-backend configurations.
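A hedged sketch of a per-backend bodyMutation, using the set/remove shape described under API Updates below (the key names inside each set entry are assumptions):

# Fragment of an AIGatewayRoute rule.
backendRefs:
  - name: openai-backend
    bodyMutation:
      set:
        # Inject a field before the request reaches this backend.
        - field: service_tier
          value: "priority"
      remove:
        # Strip an internal field; the name here is hypothetical.
        - internal_request_id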
Choose between standard, flex, priority, and reserved for Bedrock requests. Set service_tier to prioritize latency-sensitive workloads or optimize costs for batch processing, with automatic fallback handling.
Observability Enhancements
Track spending across AI providers with the new gen_ai.provider.name metric attribute. Filter dashboards and alerts by provider to identify cost drivers and optimize your multi-provider strategy.
Debug Claude model requests end-to-end with OpenInference-compliant tracing for the native /messages endpoint. View prompts, responses, and timing in Arize Phoenix or any OpenTelemetry-compatible observability platform.
Track reranking operations in your traces. Full OpenTelemetry support for Cohere's v2 rerank endpoint captures query, documents, and relevance scores for debugging RAG pipelines.
Performance and Operations
Reduced gateway latency by migrating to bytedance/sonic for JSON encoding and decoding. Expect measurable latency improvements and lower CPU usage, especially for large payloads and high request volumes.
Reduced controller reconciliation time with optimized ReferenceGrant indexing. Cross-namespace backend references now validate faster, improving startup time and configuration change responsiveness.
MCP proxy now reuses HTTP connections across requests, eliminating per-request connection overhead. This significantly improves throughput when proxying to multiple MCP backend servers. View the MCP performance blog post for details →
🔗 API Updates
New
- GatewayConfig CRD: New custom resource for gateway-level configuration. Spec includes extProc.kubernetes for container settings (resources, env vars, image overrides). Reference from a Gateway via the aigateway.envoyproxy.io/gateway-config annotation.
- VersionedAPISchema.prefix: New prefix field replaces overloading version for endpoint path customization. Example: prefix: "/v1beta/openai" for Gemini's OpenAI-compatible API.
- AIGatewayRouteRuleBackendRef.bodyMutation: New field with set (a list of field/value pairs) and remove (a list of field names) for request body manipulation.
- LLMRequestCostType.CacheCreationInputToken: New cost type for tracking tokens written to cache, separate from CachedInputToken (tokens read from cache).
- MCPRouteSecurityPolicy authorization fields: New authorization block with defaultAction (Allow/Deny), a rules array supporting cel expressions, JWT scopes/claims, and tools targeting. New extAuth field for external authorization delegation.
- BackendSecurityPolicy.targetRefs expansion: Now accepts InferencePool (group: inference.networking.x-k8s.io) in addition to AIServiceBackend.
Deprecations
- AIGatewayFilterConfigExternalProcessor.resources: The resources field in AIGatewayFilterConfigExternalProcessor is deprecated. Use GatewayConfig for gateway-scoped resource configuration instead. This field will be removed in v0.6.
- version field as prefix for OpenAI schema: Using the version field as a prefix for the OpenAI schema is deprecated. Use the new prefix field instead. The legacy behavior will be removed in v0.6.
🐛 Bug Fixes
- AWS Bedrock Claude streaming reliability: Streaming responses from Bedrock Claude models now complete correctly. Previously, some streamed responses could be truncated or malformed.
- Gemini streaming token counts: Token usage in Gemini streaming responses now matches OpenAI format, fixing integrations that parse usage from stream chunks.
- Multi-chunk Gemini tool calls: Tool calls that span multiple streaming chunks now have correct indices, preventing function calling errors with Gemini.
- GCP Claude reasoning content: Reasoning/thinking content now correctly passes through for Claude models on GCP Vertex AI in both requests and responses.
- Zero-weight backend references: Backend references with zero weight no longer cause routing errors, allowing gradual traffic shifts via weight changes.
- Umbrella chart image pull secrets: Helm deployments within umbrella charts now correctly inherit global.imagePullSecrets when chart-level secrets aren't set.
- GCP global region backends: Vertex AI backends configured with the global region now work correctly instead of failing during setup.
- Accurate per-token latency metrics: Fixed integer truncation in the time_per_output_token calculation that caused incorrect latency measurements for fast responses.
- Anthropic token counting: Improved accuracy of input and output token counts for Anthropic models, ensuring billing metrics match provider reports.
📖 Upgrade Guidance
Migrating to GatewayConfig
If you're using AIGatewayFilterConfigExternalProcessor.resources for container resource configuration, migrate to the new GatewayConfig CRD:
- Create a GatewayConfig resource with your desired configuration:
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: GatewayConfig
metadata:
  name: my-gateway-config
  namespace: default
spec:
  extProc:
    kubernetes:
      resources:
        requests:
          cpu: "100m"
          memory: "128Mi"
        limits:
          cpu: "500m"
          memory: "512Mi"
      env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://otel-collector:4317"
- Reference the GatewayConfig from your Gateway:
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: ai-gateway
  annotations:
    aigateway.envoyproxy.io/gateway-config: my-gateway-config
Migrating Endpoint Prefix Configuration
If you're using the version field as a prefix for OpenAI-compatible backends, migrate to the new prefix field:
Before:
schema:
  name: OpenAI
  version: "/v1beta/openai"  # Deprecated usage
After:
schema:
  name: OpenAI
  prefix: "/v1beta/openai"
📦 Dependencies Versions
Updated to Go 1.25.6 for improved performance and security fixes.
Built on Envoy Gateway v1.6 for enhanced data plane capabilities and stability improvements.
Leveraging Envoy Proxy v1.36.4 with the latest networking and security features.
Support for Gateway API v1.4.0 specifications.
Integration with Gateway API Inference Extension v1.0.2 for stable intelligent endpoint selection.
⏩ Patch Releases
🙏 Acknowledgements
We extend our gratitude to all contributors who made this release possible. Special thanks to:
- The growing community of adopters including Bloomberg, LY Corporation, Alan by Comma Soft, and NRP for their valuable feedback and production insights
- Everyone who reported bugs, submitted PRs, and participated in design discussions
- The Envoy Gateway team for their continued collaboration
🔮 What's Next
We're already working on exciting features for future releases:
- Additional provider integrations - AWS Bedrock InvokeModel API support for Claude and GPT models, Gemini embeddings, and Azure/AKS workload identity
- Batch inference APIs - Support for batch processing to improve throughput for high-volume workloads
- Advanced caching strategies - Prompt cache key and retention controls for OpenAI chat completions
- Upstream provider quota policies - New API for managing upstream provider quotas
- Sensitive data redaction - Request and response body redaction for protecting sensitive information