
AI Gateway Glossary

This glossary provides definitions for key terms and concepts used in AI Gateway and GenAI traffic handling.

Quick Reference

| Term | Quick Definition |
| --- | --- |
| GenAI Gateway | Gateway for managing AI model traffic |
| Foundation Model | Base pre-trained AI model |
| Token | Basic unit of text in LLM processing |
| Token Usage | Monitoring and limiting model resource consumption |
| Model Routing | Directing requests to appropriate models |
| Prompt | Input text guiding AI model response |
| Temperature | Control for model output randomness |

Categories

AI/ML Fundamentals

Context Window

The maximum amount of text (in tokens) that a model can process in a single request.

Related: Token
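
As a simple illustration, a gateway or client can check whether a prompt plus the reserved output budget fits the window; the 8,192-token window and the counts below are arbitrary example values, not any specific model's limits:

```python
def fits_context(prompt_tokens: int, max_output_tokens: int,
                 context_window: int = 8192) -> bool:
    """True if the prompt plus the reserved output budget fits the window.
    The 8192 default is illustrative, not a specific model's limit."""
    return prompt_tokens + max_output_tokens <= context_window

print(fits_context(prompt_tokens=7000, max_output_tokens=1024))  # True (8024 <= 8192)
print(fits_context(prompt_tokens=7500, max_output_tokens=1024))  # False (8524 > 8192)
```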

Prompt

The input text that guides the AI model's response, including instructions, context, and specific queries.

Temperature

A parameter that controls the randomness/creativity of model outputs, typically ranging from 0 (deterministic) to 1 (more creative).
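
A minimal sketch of how temperature rescales a model's next-token distribution (illustrative only; real inference stacks usually special-case temperature 0 as greedy argmax decoding):

```python
import math

def sample_distribution(logits, temperature=1.0):
    """Temperature-scaled softmax: lower temperature sharpens the
    distribution (more deterministic), higher flattens it (more varied)."""
    scaled = [l / max(temperature, 1e-6) for l in logits]
    peak = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(sample_distribution(logits, temperature=0.2))  # near one-hot
print(sample_distribution(logits, temperature=1.0))  # softer spread
```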

Token

The basic unit of text processing in LLMs, typically representing a whole word, a subword fragment, or an individual character.

Related: Context Window · Token Cost
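
As an illustration, token counts can be computed client-side with a tokenizer library; this sketch assumes the tiktoken package is installed (pip install tiktoken) and that the cl100k_base encoding approximates the target model's tokenizer:

```python
import tiktoken

# Encodings vary by model family; cl100k_base is used here as an example.
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Gateways meter usage in tokens, not characters.")
print(len(tokens), tokens[:5])  # token count, plus the first few token IDs
```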

Token Cost

The financial or resource cost associated with token usage in model requests.

Related: Token · Rate of LLM Token Consumption
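
For example, a per-request cost is typically derived from the input and output token counts and per-1K-token prices; the prices below are placeholders, not any provider's actual rates:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Cost of one request, with separate input and output pricing."""
    return ((input_tokens / 1000) * price_in_per_1k
            + (output_tokens / 1000) * price_out_per_1k)

# e.g. 1,200 input + 300 output tokens at hypothetical $0.01 / $0.03 per 1K:
print(f"${request_cost(1200, 300, 0.01, 0.03):.4f}")  # $0.0210
```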

Content & Safety

Content Filtering

A mechanism to screen and moderate AI-generated content to ensure compliance with ethical standards, company policies, or regulatory requirements.
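
A deliberately naive sketch of the idea; the blocked-term list and passes_filter helper are hypothetical, and production gateways typically delegate to a moderation model or policy engine rather than substring matching:

```python
BLOCKED_TERMS = {"credit card number", "ssn"}  # hypothetical policy list

def passes_filter(text: str) -> bool:
    """Reject text containing any blocked term (case-insensitive)."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)

print(passes_filter("Here is a summary of the report."))  # True
print(passes_filter("My SSN is ..."))                     # False
```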

Gateway Components

Gateway API Inference Extension

A Kubernetes SIG Network extension for Gateway API that provides specialized routing and load balancing capabilities for AI/ML workloads, handling traffic management at the level of inference instances.

Related: Inference Instance

GenAI Gateway

A specialized gateway solution designed to manage, monitor, and route traffic to Generative AI models. It provides capabilities such as load balancing, authorization, token usage monitoring, and integration with multiple model providers.

Related: Token · Model Provider

Hybrid GenAI Gateway

A GenAI Gateway configuration that supports both local inference instances and external cloud-based AI models, providing flexibility in deployment and cost management.

Related: GenAI Gateway · Inference Instance · Model Provider

Inference Infrastructure

Inference Instance

An individual compute resource or container used to run a machine learning model for generating AI outputs (inference).

Inference Service

A service that provides model inference capabilities, including model loading, input processing, inference execution, and output formatting.

Related: Inference Instance

Model Endpoint

The API endpoint through which a specific AI model is accessed, whether hosted by a cloud provider, an open-source solution, or a private deployment.

Model Provider

Services providing AI model capabilities through APIs, which can be either first-party providers who develop their own models (like OpenAI, Anthropic) or third-party providers who host other companies' models (like AWS Bedrock, Azure OpenAI Service).

Model Types & Management

Fine-Tuned Model

A version of a base Generative AI model that has been customized for specific tasks or domains using additional training data.

Related: Foundation Model

Foundation Model

Large-scale, pre-trained AI models designed to handle a broad range of tasks. They are trained on extensive datasets and can be fine-tuned or adapted to specific use cases.

Related: Fine-Tuned Model

Model Routing

A feature in GenAI Gateways that dynamically routes requests to specific models or model versions based on client configuration, use case requirements, or service level agreements.

Related: GenAI Gateway
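
A minimal sketch of route selection; the route table, tier names, and model identifiers here are all hypothetical rather than any gateway's actual configuration:

```python
ROUTES = {
    ("premium", "chat"): "large-chat-model-v2",
    ("standard", "chat"): "small-chat-model-v1",
    ("standard", "summarize"): "summarization-model-v1",
}

def route(client_tier: str, use_case: str) -> str:
    """Pick a model for the request, falling back to a default."""
    return ROUTES.get((client_tier, use_case), "small-chat-model-v1")

print(route("premium", "chat"))    # large-chat-model-v2
print(route("standard", "embed"))  # falls back to the default model
```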

Usage & Analytics

GenAI Usage Analytics

The collection and analysis of data regarding how users interact with AI models via the GenAI Gateway, including token usage, request patterns, and latency metrics.

Related: GenAI Gateway · Token

GenAI Usage Monitoring

The tracking of resource consumption across different types of models, including token-based monitoring for LLMs, image resolution and compute resources for LVMs, and combined metrics for multimodal models.

Related: Token

LLM Token Usage Limiting

A mechanism to monitor and control the number of tokens processed by an LLM, including input, output, and total token limits.

Related: Token · GenAI Gateway
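
A minimal in-memory sketch of the idea, assuming a fixed budget window and a hypothetical TokenLimiter class; real gateways persist counters in shared storage and typically enforce separate input, output, and total limits:

```python
from collections import defaultdict

class TokenLimiter:
    def __init__(self, max_tokens_per_window: int):
        self.max_tokens = max_tokens_per_window
        self.used = defaultdict(int)  # tokens consumed per client this window

    def allow(self, client_id: str, requested_tokens: int) -> bool:
        """Admit the request only if the client's budget covers it."""
        if self.used[client_id] + requested_tokens > self.max_tokens:
            return False
        self.used[client_id] += requested_tokens
        return True

limiter = TokenLimiter(max_tokens_per_window=10_000)
print(limiter.allow("client-a", 8_000))  # True
print(limiter.allow("client-a", 4_000))  # False: budget exhausted
```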

Rate of LLM Token Consumption

The speed at which an AI model consumes tokens during processing, typically measured in tokens per second. This metric is crucial for cost estimation and performance optimization.

Related: Token
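
A small sketch of how this metric can be derived from request timestamps; the helper and the simulated timing are illustrative:

```python
import time

def tokens_per_second(token_count: int, started_at: float, finished_at: float) -> float:
    """Throughput of one request, guarded against zero-length intervals."""
    return token_count / max(finished_at - started_at, 1e-9)

start = time.monotonic()
# ... the model call would happen here ...
end = start + 2.5                           # pretend the call took 2.5 s
print(tokens_per_second(500, start, end))   # 200.0 tokens/s
```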

Note: This glossary is continuously evolving as the field of GenAI traffic handling develops. If you'd like to contribute or suggest changes, please visit our GitHub repository.
