AI Evaluations
AI Evaluation Methodology Study Guide
Introduction to AI Evaluation
Evaluation has become the biggest hurdle in bringing AI applications into the real world. The more widely AI is used, the more opportunity there is for catastrophic failure. Without proper quality control of AI outputs, the risks of AI may outweigh its benefits for many applications.
Key Challenges:
- Evaluation can take up the majority of development effort
- Many teams settle for inadequate methods like “eyeballing results”
- Systematic evaluation is needed to make results reliable
Why Foundation Models Are Hard to Evaluate
Core Challenges
- Intelligence Paradox: The more intelligent AI models become, the harder they are to evaluate
  - Easy to spot gibberish; harder to validate a coherent but wrong summary
  - Evaluation becomes time-consuming for sophisticated tasks
- Open-ended Nature: Undermines traditional ground-truth evaluation
  - Multiple correct responses are possible for any input
  - Impossible to curate comprehensive lists of correct outputs
- Black Box Problem: Most foundation models lack transparency
  - No access to architecture, training data, or training process details
  - Can only evaluate by observing outputs
- Benchmark Saturation: Public benchmarks become obsolete quickly
  - GLUE (2018) → SuperGLUE (2019)
  - MMLU (2020) → MMLU-Pro (2024)
- Expanded Scope: General-purpose models require broader evaluation
  - Must assess performance on known tasks
  - Must discover new capabilities
  - May extend beyond human capabilities
Language Modeling Metrics
Core Concepts
Entropy: Measures information content per token
- Higher entropy = more information per token
- More predictable languages have lower entropy
Cross Entropy: Measures how difficult it is for a model to predict the next token
- Formula: H(P, Q) = H(P) + D_KL(P ‖ Q), where P is the true data distribution and Q is the model's learned distribution
Perplexity: Exponential of cross entropy
- Measures uncertainty when predicting next token
- Lower perplexity = better prediction ability
- Common values: as low as 3 for good models
Bits-per-Character (BPC) and Bits-per-Byte (BPB): Standardized measures
- BPC normalizes cross entropy per character, so models with different tokenizers can be compared (see the sketch below)
- BPB normalizes per byte, additionally removing the dependence on character encoding
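To make these definitions concrete, here is a minimal sketch, assuming made-up per-token probabilities, of how cross entropy, perplexity, and bits-per-character relate; none of the numbers come from a real model.

```python
import math

# Hypothetical probabilities a model assigns to each token of a reference text.
token_probs = [0.25, 0.6, 0.1, 0.4, 0.8]
text = "hello world"  # the text those tokens encode (illustrative only)

# Cross entropy: average negative log-likelihood per token (in nats here).
cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is the exponential of cross entropy; lower means better prediction.
perplexity = math.exp(cross_entropy)

# Bits-per-character: total bits needed to encode the text, divided by character
# count, which removes the dependence on the tokenizer's vocabulary.
total_bits = -sum(math.log2(p) for p in token_probs)
bpc = total_bits / len(text)

print(f"cross entropy: {cross_entropy:.3f} nats, perplexity: {perplexity:.2f}, BPC: {bpc:.3f}")
```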
Practical Applications
- Proxy for model capabilities
- Data contamination detection
- Training data deduplication
- Abnormal text detection
- Compression efficiency measurement
Important Notes
- Perplexity may not reflect post-training performance (SFT, RLHF)
- Post-training typically increases perplexity while improving task performance
- Quantization can affect perplexity unexpectedly
Exact Evaluation Methods
Functional Correctness
Definition: Evaluating whether system performs intended functionality
Applications:
- Code generation (execution accuracy)
- Text-to-SQL generation
- Game bots with measurable scores
- Tasks with clear success criteria
Metrics:
- Pass@k: Probability that at least one of k sampled attempts solves the problem (estimator sketched after this list)
- Execution accuracy for code
- Success rates for task completion
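For pass@k, the standard unbiased estimator introduced with the HumanEval benchmark can be computed as below; n is the number of samples generated per problem and c is how many of them pass the tests.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k given n samples, of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 of 10 generated solutions pass the unit tests.
print(pass_at_k(n=10, c=3, k=1))  # 0.30
print(pass_at_k(n=10, c=3, k=5))  # chance that at least one of 5 drawn samples is correct
```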
Similarity Measurements Against Reference Data
Exact Match
- Binary scoring (match or no match)
- Works for short, exact responses
- Variations give credit when the reference answer is contained within the response
- Limited applicability for complex tasks
Lexical Similarity
Methods:
- Token overlap counting
- Fuzzy matching (edit distance)
- N-gram similarity
Common Metrics: BLEU, ROUGE, METEOR++, TER, CIDEr
Limitations:
- Requires comprehensive reference sets
- Higher similarity ≠ better responses
- Sensitive to reference data quality
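As a rough sketch of lexical similarity, the snippet below implements token-overlap F1 and an edit-distance ratio using only the standard library; in practice you would usually reach for established BLEU/ROUGE implementations rather than these toy versions.

```python
from difflib import SequenceMatcher

def token_f1(candidate: str, reference: str) -> float:
    """F1 over the set of tokens shared by candidate and reference."""
    cand, ref = set(candidate.lower().split()), set(reference.lower().split())
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def fuzzy_ratio(candidate: str, reference: str) -> float:
    """Edit-distance-based similarity in [0, 1] via difflib."""
    return SequenceMatcher(None, candidate, reference).ratio()

reference = "The cat sat on the mat"
candidate = "A cat is sitting on the mat"
print(token_f1(candidate, reference), fuzzy_ratio(candidate, reference))
```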
Semantic Similarity
Process:
- Transform text to embeddings
- Compute similarity (e.g., cosine similarity)
- Score based on semantic closeness
Metrics: BERTScore, MoverScore
Advantages:
- Captures meaning over appearance
- Less dependent on exact phrasing
Limitations:
- Quality depends on embedding algorithm
- Requires computational resources
Introduction to Embeddings
Definition: Numerical representations capturing data meaning
Key Properties:
- Vector format (typically 100-10,000 dimensions)
- Similar content → closer embeddings
- Can be cross-modal (text, images, audio)
Popular Models:
- BERT, CLIP, Sentence Transformers
- Specialized vs. general-purpose embeddings
- Joint embedding spaces for multiple modalities
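A minimal sketch of semantic-similarity scoring over embeddings, assuming the sentence-transformers package is available; the model name is just a commonly used example, not a recommendation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumes the package is installed

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

def semantic_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the two texts' embeddings."""
    emb_a, emb_b = model.encode([text_a, text_b])
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

print(semantic_similarity(
    "The company reported higher quarterly revenue.",
    "Quarterly revenue went up for the firm.",
))
```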
AI as a Judge
Overview
Definition: Using AI models to evaluate other AI models’ outputs
Advantages:
- Fast and relatively cheap
- Works without reference data
- Flexible evaluation criteria
- Can provide explanations
- Strong correlation with human evaluators (up to 85% agreement)
Implementation Approaches
- Quality Assessment: Evaluate response quality independently
- Reference Comparison: Compare against ground truth
- Pairwise Comparison: Compare two responses directly
Prompting Best Practices
Essential Components:
- Clear task description
- Detailed evaluation criteria
- Scoring system specification
- Examples with justifications
Scoring Systems:
- Classification (good/bad, relevant/irrelevant)
- Discrete numerical (1-5 scale)
- Continuous numerical (0-1 range)
Tips:
- Language models work better with text than numbers
- Discrete scoring outperforms continuous
- Narrower ranges (1-5) work better than wider ranges
- Include examples for better performance
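A minimal sketch of a judge prompt that follows these practices (clear task, explicit criteria, a discrete 1-5 scale, and one scored example); call_llm is a placeholder for whatever model client you use, not a real API.

```python
JUDGE_PROMPT = """You are evaluating the quality of an answer to a user question.

Criteria: the answer must be factually consistent with the provided context and
directly address the question.

Score on a 1-5 scale:
1 = irrelevant or contradicts the context
3 = partially correct or incomplete
5 = fully correct, grounded in the context, and complete

Example:
Question: When was the store founded?
Context: The store opened in 1998.
Answer: It was founded in 2001.
Score: 1 (the answer contradicts the context)

Now evaluate:
Question: {question}
Context: {context}
Answer: {answer}

Respond with one line in the form "Score: <1-5>", then a one-sentence justification."""

def judge(question: str, context: str, answer: str, call_llm) -> str:
    # call_llm is a placeholder: any function that takes a prompt string and returns the model's reply.
    return call_llm(JUDGE_PROMPT.format(question=question, context=context, answer=answer))
```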
Limitations and Biases
Inconsistency Issues
- Probabilistic nature leads to varying outputs
- Same judge, same input can produce different scores
- Reproducibility challenges
Criteria Ambiguity
- No standardized metrics across tools
- Different tools use different scoring systems
- Scores not comparable between judges
- Evolution over time complicates monitoring
Cost and Latency Concerns
- Doubles API calls if same model generates and evaluates
- Multiple criteria multiply costs
- Production guardrails add latency
- Mitigation: weaker judge models, spot-checking
Common Biases
- Self-bias: Models favor their own responses
- Position bias: Favor the first option in pairwise comparisons (a simple mitigation is sketched below)
- Verbosity bias: Prefer longer responses regardless of quality
- Recency bias: Human evaluators tend to prefer the last option (the opposite of AI judges' position bias)
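One simple way to blunt position bias is to judge each pair twice with the order swapped and only accept a verdict when both runs agree; a sketch is below, with judge_pair standing in for whatever pairwise judge you use (assumed to return "first" or "second").

```python
def debiased_comparison(prompt: str, response_a: str, response_b: str, judge_pair) -> str:
    """Run the pairwise judge in both orders; return 'A', 'B', or 'tie' if the orders disagree."""
    verdict_1 = judge_pair(prompt, response_a, response_b)  # A shown first
    verdict_2 = judge_pair(prompt, response_b, response_a)  # B shown first
    a_wins_first_run = verdict_1 == "first"
    a_wins_second_run = verdict_2 == "second"
    if a_wins_first_run and a_wins_second_run:
        return "A"
    if not a_wins_first_run and not a_wins_second_run:
        return "B"
    return "tie"  # the preference flipped with position, so don't trust either verdict
```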
Judge Model Selection
Stronger Models as Judges:
- Better judgment quality
- Can guide weaker model improvement
- Challenges: cost, latency, no judge for strongest model
Self-Evaluation:
- Useful for sanity checks
- Can prompt self-improvement
- Subject to self-bias
Weaker Models as Judges:
- Judgment may be easier than generation
- Cost-effective option
- Specialized judges can outperform general ones
Specialized Judge Types:
- Reward Models: Score (prompt, response) pairs
- Reference-based Judges: Compare against ground truth
- Preference Models: Predict user preferences between options
Ranking Models with Comparative Evaluation
Approaches to Model Ranking
Pointwise Evaluation: Evaluate each model independently, rank by scores
Comparative Evaluation: Direct model comparisons, compute ranking from results
Why Comparative Evaluation?
Advantages:
- Easier to compare than assign absolute scores
- Captures subjective quality preferences
- Harder to game than benchmark training
- Never becomes saturated
- Provides discriminating signals
Applications:
- LMSYS Chatbot Arena leaderboard
- Production model selection
- A/B testing alternative
Implementation Process
- Match Generation: Select model pairs for comparison
- Evaluation: Human or AI judges pick winners
- Rating Algorithm: Compute rankings from match results
- Popular Algorithms: Elo, Bradley-Terry, TrueSkill
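As a minimal illustration of a rating algorithm, here is the classic Elo update applied to one head-to-head comparison; the K-factor and starting rating are conventional defaults, not values prescribed by any particular leaderboard (Bradley-Terry and TrueSkill are the more principled alternatives listed above).

```python
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Update two models' Elo ratings after a single comparison."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

# Example: both models start at 1000 and model A wins the first match.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(ratings["model_a"], ratings["model_b"], a_won=True)
print(ratings)  # model_a gains ~16 points, model_b loses ~16
```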
Challenges and Limitations
Scalability Issues
- Quadratic growth in comparisons (n models yield n(n−1)/2 pairs)
- Data-intensive requirements
- New model evaluation affects existing rankings
- Private model evaluation difficulties
Quality Control Problems
- Crowdsourcing lacks standardization
- Volunteer evaluators may lack expertise
- Simple prompts dominate (“hello” accounts for 0.55% of data)
- Malicious voting can pollute rankings
Interpretation Limitations
- Relative vs. absolute performance confusion
- Win rate doesn’t predict actual performance improvement
- Cost-benefit analysis requires absolute metrics
- Multiple scenarios possible for same ranking
Future Considerations
Promising Aspects:
- Scales with model capability improvements
- Captures human preference directly
- Resistant to gaming
- Complements other evaluation methods
Improvement Areas:
- Better matching algorithms
- Quality control mechanisms
- Standardization efforts
- Integration with absolute performance metrics
Summary and Best Practices
Key Takeaways
- Mixed-Methods Approach: Use a combination of exact, subjective, and comparative evaluation
- Context Awareness: Evaluation must consider the whole system, not just individual components
- Systematic Methods: Move beyond ad-hoc approaches like eyeballing results
- Understand Limitations: Each method has specific biases and constraints
- Iterative Improvement: Evaluation methods should evolve with applications
Evaluation Strategy Framework
- Identify Failure Points: Design evaluation around likely system failures
- Choose Appropriate Methods: Match evaluation approach to task characteristics
- Implement Multiple Measures: Combine exact and subjective evaluation
- Monitor and Adapt: Track evaluation reliability over time
- Cost-Benefit Balance: Optimize between evaluation thoroughness and resource constraints
Future Directions
- Standardization of AI judge criteria
- Development of specialized evaluation models
- Better integration of evaluation into development workflows
- Improved handling of evolving model capabilities
- Enhanced quality control for comparative evaluation
This comprehensive approach to AI evaluation provides the foundation for building reliable, systematic evaluation pipelines that can handle the complexity and open-ended nature of modern foundation models.