HTTPRoute + InferencePool Guide

This guide shows how to use InferencePool with the standard Gateway API HTTPRoute for intelligent inference routing. This approach provides basic load balancing and endpoint selection capabilities for inference workloads.

Prerequisites

Before starting, ensure you have:

  1. Kubernetes cluster with Gateway API support
  2. Envoy Gateway installed and configured
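
To confirm these prerequisites, you can check that the Gateway API CRDs exist and that the Envoy Gateway controller is running (this assumes the default envoy-gateway-system namespace used later in this guide):

kubectl get crd gateways.gateway.networking.k8s.io httproutes.gateway.networking.k8s.io
kubectl get pods -n envoy-gateway-system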

Step 1: Install Gateway API Inference Extension

Install the Gateway API Inference Extension CRDs and controller:

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.5.1/manifests.yaml
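
To confirm the CRDs were installed, you can list them; the CRD names below are inferred from the inference.networking.x-k8s.io API group used later in this guide:

kubectl get crd inferencepools.inference.networking.x-k8s.io inferencemodels.inference.networking.x-k8s.io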

After installing the InferencePool CRDs, enable InferencePool support in Envoy Gateway, then restart the deployment and wait for it to become ready:

kubectl apply -f https://raw.githubusercontent.com/envoyproxy/ai-gateway/main/examples/inference-pool/config.yaml

kubectl rollout restart -n envoy-gateway-system deployment/envoy-gateway

kubectl wait --timeout=2m -n envoy-gateway-system deployment/envoy-gateway --for=condition=Available

Step 2: Deploy Inference Backend

Deploy a sample inference backend whose replicas will serve as your inference endpoints:

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v0.5.1/config/manifests/vllm/sim-deployment.yaml

This creates a simulated vLLM deployment with multiple replicas that can handle inference requests.

Note: This deployment creates the model server pods that back the vllm-llama3-8b-instruct InferencePool referenced in the HTTPRoute configuration below; the InferencePool itself is created in Step 4.
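
To confirm the backend is up before continuing, you can wait for the Deployment to become available. The Deployment name below is an assumption (matching the InferencePool name); adjust it if the upstream manifest uses a different name:

kubectl wait --timeout=2m deployment/vllm-llama3-8b-instruct --for=condition=Available
kubectl get pods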

Step 3: Create InferenceModel

Create an InferenceModel resource to define the model configuration:

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v0.5.1/config/manifests/inferencemodel.yaml
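
For reference, an InferenceModel maps a client-facing model name to an InferencePool. The sketch below shows the general shape of such a resource; it is illustrative only (field names follow the v1alpha2 API, and the manifest applied above is authoritative):

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: inferencemodel-sample   # illustrative name
  namespace: default
spec:
  modelName: meta-llama/Llama-3.1-8B-Instruct   # model name clients send in requests
  criticality: Critical                         # scheduling priority hint
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: vllm-llama3-8b-instruct               # pool created in the next step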

Step 4: Create InferencePool Resources

Deploy the InferencePool and related resources:

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v0.5.1/config/manifests/inferencepool-resources.yaml

This creates:

  • InferencePool resource defining the endpoint selection criteria
  • Endpoint Picker Provider (EPP) deployment for intelligent routing
  • Associated services and configurations
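
You can verify that the pool and the Endpoint Picker are running. The EPP Deployment name below is an assumption (upstream manifests typically suffix the pool name with -epp); list the deployments if yours differs:

kubectl get inferencepools
kubectl get deployment vllm-llama3-8b-instruct-epp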

Step 5: Configure Gateway and HTTPRoute

Create a Gateway and HTTPRoute that uses the InferencePool:

cat <<EOF | kubectl apply -f -
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-pool-with-httproute
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-pool-with-httproute
  namespace: default
spec:
  gatewayClassName: inference-pool-with-httproute
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-pool-with-httproute
  namespace: default
spec:
  parentRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: inference-pool-with-httproute
      namespace: default
  rules:
    - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: vllm-llama3-8b-instruct
          namespace: default
          weight: 1
      matches:
        - path:
            type: PathPrefix
            value: /
      timeouts:
        request: 60s
EOF
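
Before testing, it helps to wait until the Gateway has been programmed and to check that the HTTPRoute was accepted; Programmed is a standard Gateway API condition:

kubectl wait --timeout=2m gateway/inference-pool-with-httproute --for=condition=Programmed
kubectl get httproute inference-pool-with-httproute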

Step 6: Test the Configuration

Once deployed, you can test the inference routing:

# Get the Gateway external IP
GATEWAY_IP=$(kubectl get gateway inference-pool-with-httproute -o jsonpath='{.status.addresses[0].value}')
# Send a test inference request
curl -X POST "http://${GATEWAY_IP}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Say this is a test"
      }
    ],
    "model": "meta-llama/Llama-3.1-8B-Instruct"
  }'
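
If your cluster has no external load balancer (for example, a local kind cluster), GATEWAY_IP may be empty. In that case you can port-forward the Envoy service that Envoy Gateway creates for the Gateway; the label selector below follows Envoy Gateway's owning-gateway labels and is an assumption worth verifying against your installation:

# Find the generated Envoy service for this Gateway (label names assumed from Envoy Gateway conventions)
ENVOY_SERVICE=$(kubectl get svc -n envoy-gateway-system \
  --selector=gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=inference-pool-with-httproute \
  -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward -n envoy-gateway-system service/${ENVOY_SERVICE} 8080:80 &

# Send the same request through the local port
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say this is a test"}],"model":"meta-llama/Llama-3.1-8B-Instruct"}'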

How It Works

Request Processing Flow

  1. Client Request: Client sends inference request to the Gateway
  2. Route Matching: HTTPRoute matches the request based on path prefix
  3. InferencePool Resolution: Envoy Gateway resolves the InferencePool backend reference
  4. Endpoint Selection: Endpoint Picker Provider (EPP) selects the optimal endpoint
  5. Request Forwarding: Request is forwarded to the selected inference backend
  6. Response Return: Response is returned to the client
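
To observe endpoint selection in practice, you can tail the EPP logs while sending requests. The Deployment name below is an assumption based on the pool name; adjust it to whatever Step 4 created:

kubectl logs deployment/vllm-llama3-8b-instruct-epp --follow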

Next Steps