HTTPRoute + InferencePool Guide
This guide shows how to use InferencePool with the standard Gateway API HTTPRoute for intelligent inference routing. This approach provides core load-balancing and endpoint-selection capabilities for inference workloads without requiring AI-specific route types.
Prerequisites
Before starting, ensure you have:
- Kubernetes cluster with Gateway API support
- Envoy Gateway installed and configured
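You can sanity-check both prerequisites before starting. The CRD names below are the standard Gateway API ones, and envoy-gateway-system is assumed to be the default install namespace:
# Confirm the Gateway API CRDs are installed
kubectl get crd gatewayclasses.gateway.networking.k8s.io gateways.gateway.networking.k8s.io httproutes.gateway.networking.k8s.io
# Confirm Envoy Gateway is running (assumes the default install namespace)
kubectl get pods -n envoy-gateway-system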
Step 1: Install Gateway API Inference Extension
Install the Gateway API Inference Extension CRDs and controller:
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.5.1/manifests.yaml
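To confirm the extension CRDs were registered, you can list them directly; the inference.networking.x-k8s.io group matches the backendRef group used later in this guide:
# Verify the Inference Extension CRDs exist
kubectl get crd inferencepools.inference.networking.x-k8s.io inferencemodels.inference.networking.x-k8s.io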
After installing the InferencePool CRD, enable InferencePool support in Envoy Gateway, restart the deployment, and wait for it to become available:
kubectl apply -f https://raw.githubusercontent.com/envoyproxy/ai-gateway/main/examples/inference-pool/config.yaml
kubectl rollout restart -n envoy-gateway-system deployment/envoy-gateway
kubectl wait --timeout=2m -n envoy-gateway-system deployment/envoy-gateway --for=condition=Available
Step 2: Deploy Inference Backend
Deploy a sample inference backend that will serve as your inference endpoints:
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v0.5.1/config/manifests/vllm/sim-deployment.yaml
This creates a simulated vLLM deployment with multiple replicas that can handle inference requests.
Note: This creates the vllm-llama3-8b-instruct backend deployment; its pods are selected by the InferencePool and related resources (created in Step 4) that are referenced in the HTTPRoute configuration below.
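Before moving on, you can wait for the backend pods to come up. The deployment name below assumes the vllm-llama3-8b-instruct name from the note above:
# Wait for the simulated vLLM backend to become available
kubectl wait --timeout=5m deployment/vllm-llama3-8b-instruct --for=condition=Available
kubectl get pods -o wide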
Step 3: Create InferenceModel
Create an InferenceModel resource to define the model configuration:
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v0.5.1/config/manifests/inferencemodel.yaml
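For orientation, an InferenceModel ties a client-facing model name to an InferencePool. The manifest below is a rough sketch of what the applied resource contains, not a copy of the release file; the apiVersion, resource name, and field values are assumptions based on the rest of this guide:
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: llama3-8b-instruct                      # illustrative name
spec:
  modelName: meta-llama/Llama-3.1-8B-Instruct   # model name clients send in requests
  criticality: Critical                         # scheduling-priority hint for the EPP
  poolRef:
    name: vllm-llama3-8b-instruct               # InferencePool created in Step 4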
Step 4: Create InferencePool Resources
Deploy the InferencePool and related resources:
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v0.5.1/config/manifests/inferencepool-resources.yaml
This creates:
- InferencePool resource defining the endpoint selection criteria
- Endpoint Picker Provider (EPP) deployment for intelligent routing
- Associated services and configurations
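For reference, an InferencePool selects backend pods by label and delegates endpoint choice to the EPP via an extension reference. The sketch below only illustrates the shape of the resource; the port, selector labels, and EPP service name are assumptions, so consult the applied manifest for actual values:
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-8b-instruct
spec:
  targetPortNumber: 8000               # port the vLLM pods listen on (assumed)
  selector:
    app: vllm-llama3-8b-instruct       # label on the backend pods (assumed)
  extensionRef:
    name: vllm-llama3-8b-instruct-epp  # EPP service that picks endpoints (assumed)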
Step 5: Configure Gateway and HTTPRoute
Create a Gateway and HTTPRoute that uses the InferencePool:
cat <<EOF | kubectl apply -f -
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-pool-with-httproute
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-pool-with-httproute
  namespace: default
spec:
  gatewayClassName: inference-pool-with-httproute
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-pool-with-httproute
  namespace: default
spec:
  parentRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: inference-pool-with-httproute
      namespace: default
  rules:
    - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: vllm-llama3-8b-instruct
          namespace: default
          weight: 1
      matches:
        - path:
            type: PathPrefix
            value: /
      timeouts:
        request: 60s
EOF
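Before testing, you can confirm that the Gateway was programmed and inspect the route's status. Programmed and Accepted are standard Gateway API status conditions:
# Wait for the Gateway to be programmed
kubectl wait --timeout=2m gateway/inference-pool-with-httproute --for=condition=Programmed
# Inspect the HTTPRoute status for the Accepted condition
kubectl get httproute inference-pool-with-httproute -o yaml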
Step 6: Test the Configuration
Once deployed, you can test the inference routing:
# Get the Gateway external IP
GATEWAY_IP=$(kubectl get gateway inference-pool-with-httproute -o jsonpath='{.status.addresses[0].value}')
# Send a test inference request
curl -X POST "http://${GATEWAY_IP}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Say this is a test"
      }
    ],
    "model": "meta-llama/Llama-3.1-8B-Instruct"
  }'
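If your cluster does not assign an external address to the Gateway (common in local clusters), you can port-forward the Envoy service that Envoy Gateway generates for it instead. The label selector below uses Envoy Gateway's owning-gateway label; treat the exact service as an assumption and verify it in your cluster:
# Find the Envoy service created for this Gateway
ENVOY_SERVICE=$(kubectl get svc -n envoy-gateway-system \
  -l gateway.envoyproxy.io/owning-gateway-name=inference-pool-with-httproute \
  -o jsonpath='{.items[0].metadata.name}')
# Forward a local port, then target http://localhost:8080 in the curl command above
kubectl port-forward -n envoy-gateway-system svc/${ENVOY_SERVICE} 8080:80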
How It Works
Request Processing Flow
1. Client Request: The client sends an inference request to the Gateway
2. Route Matching: The HTTPRoute matches the request based on the path prefix
3. InferencePool Resolution: Envoy Gateway resolves the InferencePool backend reference
4. Endpoint Selection: The Endpoint Picker Provider (EPP) selects the optimal endpoint
5. Request Forwarding: The request is forwarded to the selected inference backend
6. Response Return: The response is returned to the client
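To watch the endpoint selection happen, you can tail the EPP logs while sending test requests. The deployment name below is an assumption based on the Step 4 manifests; check your cluster for the actual EPP deployment name:
# Follow the Endpoint Picker logs while sending test requests
kubectl logs -f deployment/vllm-llama3-8b-instruct-epp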
Next Steps
- Explore AIGatewayRoute + InferencePool for advanced AI-specific features
- Review monitoring and observability best practices for inference workloads
- Learn more about the Gateway API Inference Extension for advanced endpoint picker configurations