AI Glossary
This glossary provides 58 essential AI terms curated for software engineers, developers, and project managers working with modern AI systems. Each term includes its domain context, definition, historical origin, and practical usage notes.
Foundation Model Concepts
These terms describe how large language models work under the hood — the building blocks that determine capabilities, costs, and constraints of every AI-powered feature you ship.
1. Token
ML/AI, Software Development — ~2018 (GPT-1/BERT era)
The smallest unit a language model processes — typically a word fragment of roughly 3–4 English characters. All input and output is tokenized, including text, images, and audio. Token counts directly determine API cost and context limits.
In practice: When you send a prompt to any LLM API, it is split into tokens before processing. A 1,000-word document is roughly 750 tokens. API pricing is almost always per-token, so understanding tokenization is essential for cost estimation. Use provider tokenizer tools (OpenAI's Tiktoken, Anthropic's token counter, Google's count_tokens()) to validate prompt sizes before production calls.
2. Context Window
ML/AI, Software Development — ~2020 (GPT-3)
The total number of tokens a model can process in a single request — its "working memory." Encompasses the system prompt, conversation history, user input, retrieved context, and the generated response.
In practice: Context windows range from 8K tokens (older models) to 1M+ tokens (Gemini 2.5 Pro, Claude with extended context). A larger window lets you include more documents or conversation history, but increases cost. Practitioners must plan prompt architecture to leave room for output tokens. The Forge program identifies "context engineering" as a critical skill — deciding what goes into the window and what stays out.
3. Embedding
ML/AI, Software Engineering — ~2013 (Word2Vec), mainstream ~2019
A numerical vector representation of text (or other data) that captures semantic meaning. Texts with similar meanings produce vectors that are close together in high-dimensional space, enabling similarity search, clustering, and classification.
In practice: Embeddings power search, recommendation engines, and RAG pipelines. You generate embeddings using a dedicated model (OpenAI's text-embedding-3, Google's gemini-embedding-001, or Voyage AI for Anthropic users), then store them in a vector database. At query time, you embed the query and retrieve the most similar stored vectors using cosine similarity or dot product.
4. Temperature
ML/AI, Software Development — ~2019 (GPT-2 era)
A generation parameter (typically 0.0–2.0) controlling randomness in token selection. Lower values produce deterministic, focused output; higher values increase creativity and diversity.
In practice: Set temperature near 0 for data extraction, classification, and code generation where consistency matters. Use higher values (0.7–1.0) for creative writing or brainstorming. Google's Gemini 3 models are optimized at temperature 1.0 and may degrade at lower settings — always check provider-specific guidance.
5. Top-p (Nucleus Sampling)
ML/AI Research, Software Development — ~2019 (Holtzman et al.)
A generation parameter that limits token selection to the smallest set of candidates whose cumulative probability exceeds a threshold p. With top-p of 0.95, the model considers tokens representing 95% of the probability mass.
In practice: Top-p and temperature both control output diversity but work differently. Most providers recommend adjusting one or the other, not both simultaneously. Top-p is favored when you want to dynamically adjust the candidate pool size based on model confidence per token.
6. Large Language Model (LLM)
ML/AI, All Tech Domains — ~2020 (GPT-3 launch popularized the term)
An AI model with billions of parameters trained on vast text corpora using self-supervised learning. LLMs can generate human-like text, answer questions, write code, analyze documents, and perform reasoning across domains.
In practice: LLMs are the foundation of modern AI assistants (Claude, GPT, Gemini). They are pretrained on broad data and then fine-tuned with RLHF or similar techniques for helpfulness and safety. Software teams interact with LLMs through APIs, IDE integrations (Copilot, Cursor), and agentic frameworks. Understanding that LLMs are probabilistic text generators — not databases or search engines — is key to using them effectively.
7. Foundation Model
ML/AI Research, Product Management — ~2021 (Stanford coined the term)
A large-scale model trained on broad data that can be adapted to many downstream tasks through fine-tuning, prompting, or RAG — as opposed to task-specific models trained for a single purpose.
In practice: Foundation models include GPT-4, Claude, Gemini, and Llama. The "foundation" metaphor reflects how one base model supports many applications. Product managers evaluate foundation models on capability benchmarks, cost, latency, and licensing when selecting which to build on.
8. Multimodal Model
ML/AI, Product Management — ~2023 (GPT-4V, Gemini 1.0)
A model capable of processing and generating content across multiple modalities — text, images, audio, video, and code — within a single architecture.
In practice: Modern models like GPT-4o, Claude (with vision), and Gemini 3 natively handle mixed inputs. For example, you can upload a screenshot of a UI mockup and ask the model to generate the corresponding code. Images consume tokens too — roughly 258–1,120 tokens per image depending on resolution and provider.
9. Hallucination
ML/AI, All Tech Domains — ~2020 (became widely discussed with GPT-3)
When a model generates plausible-sounding but factually incorrect, fabricated, or unsupported information. Occurs especially when the model lacks relevant training data or encounters ambiguous input.
In practice: Hallucinations are the single biggest reliability risk in production AI systems. Mitigate them by: grounding responses with RAG or search tools, instructing the model to say "I don't know," providing reference documents and requiring citations, and using evaluation frameworks to measure factual accuracy. Google reports that grounding with Search reduces hallucinations by ~40%.
10. Latency / TTFT
Software Engineering, ML/AI — ~2023 (became key metric with streaming APIs)
Latency is the total time from prompt submission to complete response delivery. TTFT (Time to First Token) specifically measures how quickly the first token of output appears — critical for interactive user experiences.
In practice: In production applications, TTFT directly affects perceived responsiveness. Streaming APIs send tokens as they are generated rather than waiting for the complete response, dramatically improving user experience. Extended thinking / reasoning features increase TTFT because the model "thinks" before responding.
Prompting Techniques
Prompting is how practitioners communicate intent to AI models. These techniques range from basic instruction design to advanced multi-step reasoning strategies.
11. Prompt Engineering
Software Development, Product Management — ~2020 (GPT-3 era)
The practice of designing, structuring, and iteratively refining input prompts to optimize model outputs for accuracy, format, tone, and task completion.
In practice: Prompt engineering is the most accessible way to improve AI output quality without any model training. Core strategies include: writing clear and specific instructions, providing reference context, breaking complex tasks into subtasks, and systematic testing. Each provider publishes prompt engineering guides with model-specific advice — these differ meaningfully between GPT, Claude, and Gemini.
12. System Prompt / System Instructions
Software Development, ML/AI — ~2022 (ChatGPT launch)
A special message provided separately from user input that establishes the model's persona, behavioral rules, constraints, and output format expectations for an entire conversation.
In practice: System prompts are the primary mechanism for product-level AI behavior configuration. A well-crafted system prompt defines: who the AI is, what it should and should not do, how it should format responses, and what domain knowledge to prioritize. In the Forge program, this maps to "spec-driven development" where machine-readable specifications guide AI behavior.
13. Zero-Shot Prompting
ML/AI Research, Software Development — ~2019 (GPT-2/3 research)
Providing a model with only instructions and no examples, relying entirely on the model's training to produce the desired output.
In practice: Zero-shot works well for straightforward, well-defined tasks where the model has strong baseline capability (summarization, translation, simple Q&A). For complex or domain-specific tasks, zero-shot often produces inconsistent results and should be upgraded to few-shot or chain-of-thought approaches.
14. Few-Shot Prompting
ML/AI Research, Software Development — ~2020 (GPT-3's breakthrough capability)
Providing a model with a small number (typically 3–5) of input-output example pairs in the prompt to demonstrate the desired pattern, format, or reasoning approach.
In practice: Few-shot prompting is one of the most reliable techniques for improving output quality without fine-tuning. Include diverse, representative examples that cover edge cases. Anthropic recommends wrapping examples in XML tags for clarity. The technique is especially effective for classification, formatting, style matching, and structured data extraction.
15. Chain-of-Thought (CoT) Prompting
ML/AI Research, Software Development — ~2022 (Wei et al., Google Brain)
A technique that instructs the model to break down its reasoning into explicit intermediate steps before producing a final answer, improving accuracy on complex logical, mathematical, and analytical tasks.
In practice: CoT dramatically improves performance on multi-step reasoning tasks. In practice, add "Think step by step" or more specific reasoning instructions to your prompt. Modern reasoning models (o1, o3, Gemini 3 with thinking, Claude with extended thinking) build CoT natively into their architecture, generating internal "thinking tokens" before the visible response.
16. Prompt Chaining
Software Development, ML/AI — ~2022
Breaking a complex task into a sequence of simpler subtasks where each step's output becomes input for the next, allowing focused processing and easier debugging at each stage.
In practice: Prompt chaining is essential for production AI workflows. The most common pattern is generate → review → refine. For example: (1) generate code from a specification, (2) have the model review the code for bugs, (3) have the model fix identified issues. Each step gets the model's full attention and produces a traceable, debuggable artifact.
17. Prompt Caching
Software Development, Cost Optimization — ~2024 (Anthropic, OpenAI, Google all launched caching)
A platform feature that stores and reuses previously processed prompt content across API calls, reducing both cost (up to 90%) and latency (up to 80%) for repeated long-context requests.
In practice: Prompt caching is critical for cost optimization when building applications that repeatedly query the same large documents or system prompts. Anthropic charges 25% more for cache writes but only 10% of base price for cache hits. OpenAI automatically caches prompts longer than 1,024 tokens. Google offers both implicit (automatic) and explicit caching with a 90% discount on cached tokens.
18. Structured Output
Software Development, ML/AI — ~2024 (became widely available)
A feature that guarantees model responses conform to a developer-defined JSON Schema, eliminating format errors, missing fields, and hallucinated values in structured data.
In practice: Structured outputs are essential for production systems that parse model responses programmatically. Enable with strict: true in tool definitions (Anthropic/OpenAI) or response_json_schema (Google). OpenAI's implementation uses constrained decoding with context-free grammars to achieve 100% schema compliance. Always validate semantics in application code — structural correctness does not guarantee factual correctness.
RAG, Retrieval, and Grounding
Retrieval-Augmented Generation connects models to external knowledge, dramatically improving factual accuracy and enabling domain-specific applications.
19. Retrieval-Augmented Generation (RAG)
ML/AI, Software Engineering — ~2020 (Lewis et al., Meta AI)
A technique that retrieves relevant information from external knowledge sources at query time and injects it into the model's context, improving factual accuracy without retraining the model.
In practice: RAG is the most popular architecture for building knowledge-grounded AI applications. The standard pipeline is: chunk documents → generate embeddings → store in a vector database → at query time, embed the query → retrieve similar chunks → pass as context to the model. RAG reduces hallucinations, provides citations, and keeps information current without fine-tuning.
20. Vector Database
Software Engineering, ML/AI — ~2021 (Pinecone, Weaviate, Milvus gained traction)
A specialized database optimized for storing, indexing, and querying high-dimensional embedding vectors using similarity search (cosine similarity, dot product, or Euclidean distance).
In practice: Vector databases are the retrieval backbone of RAG systems. Popular options include Pinecone, Weaviate, ChromaDB, FAISS, and managed offerings from cloud providers (Google AlloyDB, OpenAI's hosted vector stores). They support metadata filtering, enabling hybrid search that combines semantic similarity with traditional attribute-based filtering.
21. Chunking
Software Engineering, ML/AI — ~2022 (as RAG became mainstream)
The process of splitting large documents into smaller, semantically meaningful segments before embedding and storing them for retrieval. Chunk size and overlap strategy directly impact retrieval quality.
In practice: Chunking strategy is a critical design decision in RAG systems. Too-large chunks dilute relevance; too-small chunks lose context. Common approaches include fixed-size (e.g., 512 tokens with 50-token overlap), recursive text splitting (by paragraph, then sentence), and semantic chunking (using embeddings to find natural boundaries). Experiment to find the best strategy for your content.
22. Grounding
ML/AI, Software Engineering — ~2023 (Google popularized the term)
Connecting model responses to verifiable external sources (search results, documents, databases) so outputs are anchored in factual evidence rather than relying solely on training data.
In practice: Google's Gemini API offers built-in grounding with Google Search — the model automatically determines when to search, retrieves results, and generates cited responses. This reduces hallucinations by ~40%. More broadly, grounding is any technique that provides factual context to constrain model outputs: RAG, tool use to query databases, or requiring citations from provided documents.
23. Context Engineering
Software Engineering, ML/AI — ~2024–2025 (emerged as a practice term)
The discipline of strategically designing what information enters a model's context window and how it is organized — encompassing prompt structure, retrieved context selection, memory management, and progressive disclosure of information.
In practice: The Forge program identifies context engineering as "probably more important than anything else." As context windows grow to 1M+ tokens, the challenge shifts from fitting information in to curating the right information. Key techniques include progressive disclosure (revealing context as needed), compaction (summarizing long conversations), and intelligent retrieval that surfaces only the most relevant chunks.
Agents and Agentic Workflows
Agentic AI represents the shift from single-turn Q&A to autonomous, multi-step systems that use tools, make decisions, and complete complex tasks.
24. AI Agent
Software Engineering, ML/AI — ~2023 (AutoGPT, LangChain agents)
An autonomous system that uses an LLM to interpret goals, plan actions, invoke tools, and iterate toward task completion across multiple steps — making decisions rather than just generating text.
In practice: AI agents are the most consequential development in applied AI. They combine an LLM "brain" with tools (APIs, code execution, file access) and a control loop. The agent decides what tool to call, interprets results, and determines next steps. The Forge program is built around this concept — engineers orchestrate AI agents that write 99%+ of the code.
25. Tool Use / Function Calling
Software Development, ML/AI — ~2023 (OpenAI function calling, Anthropic tool use)
The capability for an LLM to generate structured requests to invoke external tools or APIs. The model determines when a tool is needed, produces properly formatted arguments, and the application code executes the function and returns results.
In practice: Tool use is what transforms a chatbot into a capable agent. You define tool schemas (name, description, parameters), and the model outputs tool call requests when appropriate. Best practices: follow the principle of least privilege, always validate generated arguments, write clear tool descriptions, and support parallel tool calls for efficiency.
26. Model Context Protocol (MCP)
Software Engineering, ML/AI — ~2024 (Anthropic introduced MCP)
An open protocol that standardizes how applications provide context and tools to LLMs — described by Anthropic as "a USB-C port for AI applications." Enables interoperable tool and data source connections across different AI platforms.
In practice: MCP creates a standard interface between AI models and external systems, replacing ad-hoc integrations. An MCP server exposes tools and resources; any MCP-compatible client (Claude Code, IDEs, custom apps) can connect to it. The Forge program includes a dedicated quest on MCP server setup and protocol understanding. This is becoming an industry standard for AI tool integration.
27. Subagent / Multi-Agent Orchestration
Software Engineering, ML/AI — ~2024 (Claude Code, OpenAI Agents SDK, Google ADK)
A pattern where a primary agent delegates specialized tasks to subordinate agents, each running in its own context window with custom instructions, specific tool access, and independent permissions.
In practice: Multi-agent systems solve the context-pollution problem — instead of one overloaded agent, specialized subagents handle distinct tasks (code writing, testing, documentation). The Forge program dedicates an entire quest realm to agent spawning, parallel execution, and subagent delegation. OpenAI's Agents SDK implements "handoffs" for transferring control between agents; Google's ADK supports hierarchical agent topologies.
28. Agentic Workflow
Software Development, Project Management — ~2023–2024
A multi-step, autonomous process where AI models use tools, make decisions, evaluate results, and take corrective actions across multiple turns to complete complex tasks without constant human intervention.
In practice: Agentic workflows are the production pattern for AI in software engineering. A typical agentic coding workflow: (1) read the specification, (2) plan implementation, (3) write code, (4) run tests, (5) fix failures, (6) submit for review. Three critical elements for agent prompts: persistence (keep trying on failure), tool-calling guidance (when and how to use tools), and planning (think before acting).
29. Computer Use
Software Engineering, ML/AI — ~2024 (Anthropic launched computer use beta)
The capability for an AI model to interact with computer environments through screenshots, mouse clicks, and keyboard input — enabling autonomous desktop interaction and UI testing.
In practice: Computer use extends agents beyond APIs to graphical interfaces. The model sees screenshots, identifies UI elements, and issues click/type commands. Current applications include automated testing, form filling, and legacy system interaction. The technology is still maturing — accuracy on complex UI navigation is limited, and it should be used for background tasks rather than real-time operations.
Fine-Tuning and Model Customization
These techniques let organizations adapt foundation models to specific domains, styles, or tasks when prompting alone is insufficient.
30. Fine-Tuning
ML/AI, Software Engineering — ~2018 (BERT fine-tuning), mainstream ~2023
The process of further training a pretrained foundation model on custom examples (input-output pairs) to adapt it to specific domains, tasks, writing styles, or organizational knowledge.
In practice: Fine-tuning creates specialized models that outperform prompting for specific tasks. OpenAI recommends minimum 50–100 examples to start. The decision framework is: try prompt engineering first → then RAG → then fine-tuning → then custom training. Fine-tuned models need safety evaluation before deployment. Common use cases include: domain-specific terminology, consistent output formatting, and capturing organizational style.
31. Supervised Fine-Tuning (SFT)
ML/AI Research — ~2018 (BERT), widespread ~2023
The most common fine-tuning method where you provide labeled example pairs of inputs and known correct outputs. The model learns to reproduce the style, format, and content patterns demonstrated in training examples.
In practice: SFT is the default approach for most fine-tuning projects. Training data is formatted as JSONL files with message arrays including system, user, and assistant turns. Key practices: include the same system prompt in every training example, split data into training and test sets, and track loss metrics to detect overfitting. Vertex AI uses LoRA (a PEFT method) for efficient fine-tuning.
32. LoRA / PEFT
ML/AI Research — ~2021 (Hu et al. introduced LoRA)
Parameter-Efficient Fine-Tuning (PEFT) methods that freeze original model weights and only update a small set of newly added parameters. LoRA (Low-Rank Adaptation) is the most popular PEFT technique, injecting small trainable matrices into transformer layers.
In practice: LoRA makes fine-tuning accessible by reducing compute, memory, and storage requirements dramatically. Instead of updating billions of parameters, LoRA adds adapter weights measured in megabytes rather than gigabytes. Google's Vertex AI uses LoRA for all Gemini fine-tuning. This enables organizations to create multiple task-specific adapters from a single base model.
33. RLHF (Reinforcement Learning from Human Feedback)
ML/AI Research — ~2020 (InstructGPT, 2022 widely known via ChatGPT)
A training technique where human evaluators rank model outputs by quality, and the model is trained via reinforcement learning to prefer higher-ranked responses. This aligns model behavior with human preferences for helpfulness, accuracy, and safety.
In practice: RLHF is how foundation models become useful assistants. The process: (1) collect human preference data by having annotators compare model outputs, (2) train a reward model from these preferences, (3) use the reward model to fine-tune the LLM via reinforcement learning. Anthropic extends this with Constitutional AI, where AI-generated feedback supplements human feedback.
34. Distillation
ML/AI Research, Software Engineering — ~2015 (Hinton et al.), applied to LLMs ~2023
Using outputs from a larger, more capable "teacher" model to create training data for fine-tuning a smaller, cheaper "student" model, transferring task-specific knowledge at lower inference cost.
In practice: Distillation is a key cost optimization strategy. Workflow: tune your prompt on a frontier model (e.g., GPT-4.1 or Claude Opus) → capture high-quality outputs → use those outputs as training data for a smaller model (e.g., GPT-4.1-mini or Claude Haiku). This achieves similar task-specific performance at dramatically lower inference cost — critical for high-volume production applications.
AI Safety and Responsible AI
Safety concepts are essential for any team deploying AI in production, covering everything from content filtering to alignment research.
35. Constitutional AI (CAI)
ML/AI Research, AI Safety — ~2022 (Anthropic's December 2022 paper)
Anthropic's approach to AI alignment that provides a model with a set of explicit principles (a "constitution") against which it evaluates and revises its own outputs, reducing reliance on human feedback for safety training.
In practice: CAI works in two phases: (1) supervised learning where the model critiques and revises its own responses based on constitutional principles, and (2) RLAIF (RL from AI Feedback) where the model's own judgments replace human annotators. The constitution draws from sources like the Universal Declaration of Human Rights. This approach is more scalable and transparent than pure RLHF.
36. Prompt Injection
Software Engineering, AI Safety — ~2022 (widely recognized with ChatGPT)
A security vulnerability where malicious user input manipulates a model into ignoring its instructions, revealing system prompts, or performing unauthorized actions — analogous to SQL injection in traditional applications.
In practice: Prompt injection is the most critical security concern in production AI applications. Attack types include direct injection (explicit override commands) and indirect injection (malicious instructions hidden in retrieved content). Mitigation strategies: use narrow-scoped tools with least privilege, implement input/output guardrails, use Google's Model Armor or similar runtime defense, test extensively with adversarial inputs, and never trust model output for security-critical decisions.
37. Guardrails
Software Engineering, AI Safety — ~2023 (NeMo Guardrails, Agents SDK)
Configurable safety checks that validate AI inputs and outputs against defined policies before processing or returning results. Include input guardrails (validating user input) and output guardrails (checking model responses).
In practice: Guardrails are the production safety layer between users and models. Implement as: content moderation (checking for harmful content), schema validation (ensuring structured outputs), business logic checks (verifying responses match domain rules), and PII detection (preventing data leakage). OpenAI provides a free Moderation API; Google offers Model Armor; and all major agent frameworks support custom guardrail functions.
38. AI Alignment
ML/AI Research, AI Safety — ~2016 (research community), mainstream ~2022
The challenge of ensuring AI systems reliably pursue intended goals and behave according to human values, even in novel situations not explicitly covered during training.
In practice: Alignment is the overarching goal that RLHF, Constitutional AI, and safety training attempt to achieve. Anthropic frames alignment through the HHH (Helpful, Honest, Harmless) framework. For practitioners, alignment manifests as: does the model follow system prompt instructions faithfully? Does it refuse harmful requests? Does it acknowledge uncertainty rather than hallucinating? Understanding alignment helps practitioners write better system prompts and design more robust applications.
39. Content Moderation / Safety Filters
Software Development, AI Safety — ~2022 (integrated into commercial APIs)
Automated systems that classify text and images against harm categories (harassment, hate speech, violence, sexual content, dangerous activities) and optionally block content exceeding defined thresholds.
In practice: Every production AI application needs content moderation. OpenAI provides a free Moderation API using GPT-4o. Google offers configurable safety settings with threshold levels (BLOCK_LOW, BLOCK_MEDIUM, BLOCK_HIGH, OFF). Anthropic builds safety into Claude's training. Best practice: layer defenses — use both model-level safety training and application-level moderation checks.
40. Responsible AI
All Tech Domains, Project Management — ~2019 (corporate AI ethics frameworks)
The practice of developing, deploying, and governing AI systems ethically — encompassing fairness, transparency, accountability, privacy, safety, and societal impact considerations throughout the AI lifecycle.
In practice: Responsible AI is increasingly a business and regulatory requirement, not just an ethical aspiration. The EU AI Act (2024) mandates risk assessments and transparency for high-risk AI systems. For software teams, responsible AI means: documenting model capabilities and limitations, testing for bias across demographic groups, implementing human oversight for high-stakes decisions, maintaining audit trails of AI-generated content, and establishing clear governance policies.
Evaluation and Testing
Traditional software testing doesn't fully apply to probabilistic AI systems. These terms cover the emerging discipline of AI evaluation.
41. Evals (Evaluations)
ML/AI, Software Engineering — ~2023 (OpenAI Evals framework)
Structured, repeatable tests that measure AI model or application performance against defined success criteria, benchmarks, and thresholds. The AI equivalent of a test suite.
In practice: Evals are the foundation of reliable AI development. The principle: "you can't improve what you don't measure." Build evals before iterating on prompts. Key eval dimensions: accuracy (is it correct?), format compliance (does it match the schema?), latency (is it fast enough?), cost (is it affordable?), and safety (does it refuse harmful requests?). Run evals on every prompt change, model upgrade, or system modification.
42. LLM-as-Judge / Model Grading
ML/AI, Software Engineering — ~2023
Using a capable LLM to evaluate the outputs of another model against defined criteria — automating evaluation that would otherwise require human reviewers. Types include string-match graders, code-based validators, and model-based (LLM judge) assessments.
In practice: LLM-as-judge enables scalable evaluation of subjective qualities (helpfulness, coherence, completeness) that cannot be measured with simple string matching. Best practices: use the most capable available model as the judge (e.g., GPT-4.1, Claude Opus), add chain-of-thought reasoning before scoring, control for response-length bias, and calibrate automated scores against human annotations periodically.
43. Benchmark
ML/AI Research — ~2018 (GLUE/SuperGLUE), proliferated ~2023
A standardized test dataset and evaluation protocol used to compare model performance across providers and versions. Examples include MMLU (knowledge), HumanEval (code), SWE-bench (software engineering), and GPQA (graduate-level reasoning).
In practice: Benchmarks help teams make model selection decisions but have important limitations: models may be optimized for specific benchmarks (overfitting), benchmarks may not reflect your specific use case, and scores can be gamed. Use benchmarks as a starting point, then run your own task-specific evals. SWE-bench is particularly relevant for software engineering teams evaluating coding assistants.
44. Red Teaming
AI Safety, Software Engineering — ~2022 (adapted from cybersecurity to AI)
Systematic adversarial testing where human testers attempt to elicit harmful, incorrect, or unexpected model behaviors through creative prompting, edge cases, and attack scenarios.
In practice: Red teaming is essential before any production AI deployment. Test for: prompt injection vulnerabilities, bias in outputs across demographic groups, harmful content generation, factual accuracy on domain-specific questions, and behavior under adversarial inputs. Document findings, fix vulnerabilities, and retest. Both OpenAI and Anthropic conduct extensive red teaming before model releases and recommend it for applications.
AI in the Software Development Lifecycle
These terms describe how AI integrates into day-to-day software engineering workflows, from code generation to deployment.
45. AI Pair Programming
Software Development — ~2021 (GitHub Copilot launch)
Using an AI assistant as a real-time collaborative coding partner that suggests code completions, generates functions from natural language descriptions, explains existing code, and assists with debugging.
In practice: AI pair programming is the entry point for most developers adopting AI. Tools range from inline autocomplete (Copilot) to agentic coding assistants (Claude Code, Cursor, Windsurf). The Forge program's fluency framework captures the progression: L1 (AI Helper — using autocomplete) → L2 (AI Pair — conversational coding) → L3 (AI Orchestrator — delegating entire features to AI agents).
46. AI Code Review
Software Development — ~2023
Using AI models to automatically review code changes for bugs, security vulnerabilities, style violations, performance issues, and adherence to best practices, supplementing human code review.
In practice: AI code review integrates into pull request workflows. The model analyzes diffs, identifies potential issues, and suggests improvements. Benefits: faster review cycles, consistent enforcement of coding standards, and catching issues human reviewers might miss. Limitations: AI cannot fully assess business logic correctness or architectural implications. Use as a complement to, not replacement for, human review.
47. Spec-Driven Development (SDD)
Software Development, Project Management — ~2024–2025 (Liatrio methodology)
A methodology where machine-readable specifications serve as the primary input for AI-generated code, following a four-step workflow: Specification → Task Breakdown → Implementation → Validation.
In practice: SDD is the Forge program's core methodology for AI-native engineering. Engineers write detailed Markdown specifications that AI agents consume directly. The spec defines acceptance criteria, constraints, and expected behaviors. The AI agent then decomposes the spec into tasks, implements each task, and validates against the criteria. This creates traceable, auditable development artifacts and ensures AI output matches requirements.
48. AI-Generated Code Validation
Software Development, QA — ~2023
The systematic practice of verifying that AI-generated code meets functional requirements, passes tests, adheres to security standards, and integrates correctly — recognizing that AI output is probabilistic and requires verification.
In practice: Never trust AI-generated code without validation. Best practices: require automated test suites that pass before merging, run static analysis and security scanning, review generated code for logic errors the tests may miss, and maintain "proof artifacts" (test results, screenshots) as evidence of correctness. The Forge program makes validation one of its four core workflow pillars.
MLOps and Deployment
These terms cover the operational side of running AI systems in production, from infrastructure to monitoring.
49. MLOps
Software Engineering, ML/AI — ~2019 (term coined), mature ~2022
The set of practices that combines machine learning, DevOps, and data engineering to reliably and efficiently deploy, monitor, and maintain ML models in production. Encompasses CI/CD for models, monitoring, versioning, and governance.
In practice: MLOps applies DevOps principles to ML systems. Key components: model versioning and registry (tracking which model version is in production), automated training and evaluation pipelines, deployment automation (blue-green, canary for model releases), monitoring for data drift and performance degradation, and reproducibility of training runs. Tools include Vertex AI Pipelines, MLflow, Kubeflow, and cloud-native MLOps services.
50. Model Monitoring / Drift Detection
Software Engineering, ML/AI — ~2020
Tracking deployed model performance over time to detect when output quality degrades due to changes in input data distribution (data drift), changes in the relationship between inputs and outputs (concept drift), or model staleness.
In practice: Production AI systems degrade silently. Input data patterns change, user behavior evolves, and the world moves past the model's training data cutoff. Model monitoring tracks metrics like accuracy, latency, error rates, and token usage. Set alerts for significant changes. Vertex AI Model Monitoring and similar services automate drift detection. For LLM applications, monitoring should include hallucination rate, user satisfaction, and tool call success rates.
51. AIOps
Software Engineering, DevOps — ~2017 (Gartner coined the term)
The application of AI and machine learning to IT operations — automating incident detection, root cause analysis, capacity planning, and remediation through intelligent analysis of operational data.
In practice: AIOps extends beyond AI-powered development into AI-powered operations. Examples: AI-generated infrastructure-as-code, self-healing scripts that detect and fix common failures, intelligent alerting that reduces noise, and ChatOps integrations where operations teams interact with AI assistants for troubleshooting. The Forge program includes a week dedicated to AI-accelerated DevOps covering CI/CD automation, GitOps, and AIOps practices.
Cost and Performance Optimization
Controlling the cost and latency of AI systems is critical for sustainable production deployment. These terms cover the key optimization levers.
52. Batch Processing / Batch API
Software Engineering, Cost Optimization — ~2024 (all major providers launched batch APIs)
Asynchronous processing of large volumes of non-time-sensitive AI requests at significantly reduced pricing (typically 50% discount). Requests are queued and processed within hours rather than seconds.
In practice: Batch processing is the single biggest cost lever for high-volume, non-real-time workloads. Use cases: bulk document processing, generating training data, running evaluations, content classification at scale, and embedding generation. OpenAI's Batch API processes within 24 hours; Anthropic's Message Batches API completes in 5 minutes to 1 hour. Combine with prompt caching for maximum savings.
53. Extended Thinking / Reasoning Models
ML/AI, Software Development — ~2024 (OpenAI o1, Claude extended thinking)
Models or modes that generate internal "thinking tokens" — intermediate reasoning steps — before producing the visible response, significantly improving performance on complex multi-step problems at the cost of higher latency and token usage.
In practice: Reasoning models (OpenAI o1/o3/o4-mini, Claude with extended thinking, Gemini with thinking) trade speed and cost for accuracy. They are most effective for complex coding, mathematics, scientific analysis, and multi-step planning. Key decision: use standard models for simple tasks (chat, summarization) and reasoning models for complex tasks. Control cost via "reasoning effort" parameters (low/medium/high) that dynamically adjust thinking depth.
54. Token Optimization
Software Engineering, Cost Optimization — ~2023
Strategies for reducing token consumption while maintaining output quality — including concise prompting, prompt caching, context compaction, model selection (using smaller models for simpler tasks), and output length control.
In practice: Token usage directly drives cost. Key optimization strategies: (1) use prompt caching for repeated context, (2) select the smallest capable model for each task (route simple queries to cheaper models), (3) compaction to summarize long conversations instead of passing full history, (4) write concise prompts without sacrificing clarity, and (5) set appropriate max_tokens limits to prevent runaway generation.
55. Model Routing / Model Selection
Software Engineering, Product Management — ~2024
The practice of dynamically routing requests to different models based on task complexity, cost constraints, and latency requirements — using frontier models for complex tasks and smaller/cheaper models for simpler ones.
In practice: Model routing is an emerging best practice for production AI systems. Instead of sending every request to the most expensive model, analyze request complexity and route accordingly. For example: use GPT-4.1-mini or Claude Haiku for simple classification and summarization, and GPT-4.1 or Claude Sonnet for complex reasoning and code generation. This can reduce costs by 60-80% while maintaining quality where it matters. Anthropic's effort parameter and Google's thinking levels provide similar within-model optimization.
Context Management
These terms describe the emerging discipline of managing what enters and exits an LLM's context window — a critical skill as agentic workflows push models to their limits.
56. Context Rot
ML/AI Research, Software Engineering — ~2024–2025 (Chroma Research coined the term)
The progressive degradation of model output quality as the context window fills with accumulated tokens. Unlike a hard cutoff, context rot is a gradient — performance declines gradually as context grows, with models increasingly ignoring or misweighting information.
In practice: Research consistently demonstrates this effect: Liu et al. (2024, TACL) showed in "Lost in the Middle" that models ignore information placed in the middle of long contexts, with >20% QA accuracy drops. Du et al. (2025) measured 13.9–85% performance loss from context length alone, with degradation beginning as early as ~7K tokens. Chroma Research named and characterized the gradient nature of this decay. Anthropic's own documentation acknowledges that "performance degrades as context fills." For practitioners, this means proactively managing context is not optional — it is essential for maintaining output quality. See Context Management Best Practices for mitigation strategies.
57. Compaction
Software Engineering, ML/AI — ~2024 (Claude Code introduced
/compact)
The practice of summarizing an ongoing conversation to reduce token count while preserving key state, decisions, and context — effectively "compressing" the conversation to reclaim context space.
In practice: Compaction is the primary defense against context rot. In Claude Code, the /compact command triggers a summarization of the current conversation. You can provide focus instructions (e.g., /compact focus on the database migration) to guide what the summary preserves. Compaction works best alongside persistent state mechanisms like CLAUDE.md files, which survive compaction and ensure critical project context is always available.
58. Context Budget
Software Engineering, ML/AI — ~2024–2025 (practitioner term)
The deliberate practice of strategically managing what information enters and exits a model's context window — treating context capacity as a finite resource that must be allocated, monitored, and reclaimed.
In practice: Context budgeting connects directly to context engineering but focuses on the operational discipline rather than the design discipline. Key practices include: monitoring context utilization via status line indicators, compacting proactively before quality degrades, using subagents to isolate investigation work in separate context windows, and designing workflows with natural "save points" where context can be safely cleared. The Spec-Driven Development workflow exemplifies good context budgeting — each step serializes state to external artifacts, creating natural checkpoints for context management.