AI Engineering Architecture and User Feedback: Study Guide
Overview
This chapter explores how to combine various AI engineering techniques into production-ready applications. It covers system architecture design and user feedback collection for continuous improvement through data flywheels.
AI Engineering Architecture
Core Principle
Start simple and gradually add complexity. The basic flow: Query → Model → Response, then progressively enhance based on needs.
Step 1: Enhance Context
Purpose: Give models necessary information to produce quality outputs
Components:
- Retrieval mechanisms: Text, image, tabular data retrieval
- Tool integration: Web search, APIs, external data sources
- Context construction: Feature engineering for foundation models
Implementation considerations:
- Provider limitations (document types, upload limits)
- Retrieval algorithms and configurations
- Chunk sizes and retrieval quality
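A minimal sketch of context construction under these constraints; `retrieve` is a stand-in for whatever retriever (vector, keyword, or hybrid) the application uses, and the prompt template and character budget are illustrative, not prescribed by the chapter:

```python
from typing import Callable

def build_context(query: str,
                  retrieve: Callable[[str, int], list[str]],
                  top_k: int = 3,
                  max_chars: int = 4000) -> str:
    """Assemble a prompt by prepending retrieved chunks to the user query."""
    chunks = retrieve(query, top_k)  # vector, keyword, or hybrid retrieval
    context = ""
    for chunk in chunks:
        if len(context) + len(chunk) > max_chars:  # respect the model's context limit
            break
        context += chunk + "\n---\n"
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )
```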
Step 2: Implement Guardrails
Purpose: Mitigate risks and protect users/systems
Input Guardrails
Protect against:
- Private information leaks to external APIs
- Malicious prompt injection attacks
- System compromise through bad prompts
Sensitive data detection:
- Personal information (IDs, phone numbers, bank accounts)
- Human faces
- Intellectual property keywords
- Company privileged information
Mitigation strategies:
- Block entire queries
- Mask sensitive data with placeholders
- Keep a PII reverse dictionary to unmask placeholders in the response (see the sketch below)
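A toy sketch of the mask-and-unmask flow; the regex patterns are illustrative only (production systems use dedicated PII detectors), and the placeholder format is an assumption:

```python
import re

# Illustrative patterns only; production systems use dedicated PII detectors.
PII_PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def mask_pii(text: str) -> tuple[str, dict[str, str]]:
    """Replace detected PII with placeholders; return a reverse dictionary for unmasking."""
    reverse: dict[str, str] = {}
    for label, pattern in PII_PATTERNS.items():
        for i, match in enumerate(set(pattern.findall(text))):
            placeholder = f"[{label}_{i}]"
            reverse[placeholder] = match
            text = text.replace(match, placeholder)
    return text, reverse

def unmask_pii(text: str, reverse: dict[str, str]) -> str:
    """Restore the original values in the model's response."""
    for placeholder, original in reverse.items():
        text = text.replace(placeholder, original)
    return text
```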
Output Guardrails
Quality failures:
- Malformed responses (e.g., invalid JSON)
- Factual inconsistencies/hallucinations
- Generally poor outputs
Security failures:
- Toxic content (e.g., racist, sexually explicit, or illegal material)
- Private information exposure
- Outputs that trigger remote code execution
- Responses that pose brand risk
Handling strategies:
- Simple retry logic for probabilistic failures (sketched below)
- Parallel calls (rather than sequential retries) to reduce latency, at extra cost
- Human fallback for complex cases
- Sentiment-based escalation (e.g., route visibly frustrated users to human agents)
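A minimal sketch of the retry-then-escalate pattern, assuming a hypothetical `call_model` function and a JSON-format output guardrail:

```python
import json

def generate_with_retries(call_model, prompt: str, max_retries: int = 3):
    """Retry on malformed output; escalate to a human when retries are exhausted."""
    for _ in range(max_retries):
        raw = call_model(prompt)    # any LLM call; assumed to return a string
        try:
            return json.loads(raw)  # output guardrail: response must be valid JSON
        except json.JSONDecodeError:
            continue                # probabilistic failure: simply try again
    return {"escalate": "human"}    # human fallback for persistent failures
```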
Trade-offs:
- Reliability vs. latency: each added check delays the response
- Streaming compatibility: output guardrails are harder to apply when tokens are shown as they are generated
- Self-hosted vs. third-party APIs: self-hosting avoids sending data to external providers but shifts guardrail implementation onto you
Step 3: Add Model Router and Gateway
Router Functions
Intent classification:
- Route queries to specialized models
- Cost optimization through model selection
- Out-of-scope conversation prevention
- Ambiguous query clarification
Implementation:
- Smaller models (GPT-2, BERT, Llama 7B)
- Fast and cheap operation
- Context adjustment for varying model limits
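A sketch of intent-based routing; `classify_intent` stands in for a small, cheap classifier (e.g., a fine-tuned BERT-class model), and the route table is hypothetical:

```python
# Hypothetical route table: intent label -> model to call (None = out of scope).
ROUTES = {
    "billing":   "small-specialized-model",
    "technical": "large-general-model",
    "off_topic": None,
}

def route(query: str, classify_intent) -> str | None:
    """Pick a model based on predicted intent."""
    intent = classify_intent(query)
    model = ROUTES.get(intent)
    if model is None:
        return None  # caller responds with a canned out-of-scope message instead
    return model
```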
Gateway Benefits
- Unified interface: Single point of access for multiple model APIs
- Access control: Centralized security and cost management
- Fallback policies: Handle rate limits and API failures
- Additional features: Load balancing, logging, analytics
Popular solutions: Portkey, MLflow AI Gateway, Wealthsimple's llm-gateway, TrueFoundry, Kong, Cloudflare
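At its core, a gateway is a fallback-aware wrapper around several provider calls. A minimal sketch (the provider callables and error handling are assumptions; real gateways add auth, logging, and load balancing):

```python
def gateway_call(prompt: str, providers: list) -> str:
    """Try each provider in order, falling back on rate limits or API failures."""
    last_error = None
    for call in providers:        # each element is a callable wrapping one provider's API
        try:
            return call(prompt)
        except Exception as err:  # e.g., rate limit, timeout, server error
            last_error = err      # a real gateway would also log and emit metrics here
    raise RuntimeError("all providers failed") from last_error
```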
Step 4: Reduce Latency with Caches
Exact Caching
- Use case: Identical query matching
- Implementation: In-memory storage, databases (PostgreSQL, Redis)
- Eviction policies: LRU (least recently used), LFU (least frequently used), FIFO (first in, first out)
- Considerations: User-specific vs generic queries, time-sensitivity
Warning: Improper caching can cause data leaks between users
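A sketch of an exact cache that scopes keys by user ID to avoid the leak described above; the in-memory `OrderedDict` with LRU eviction stands in for Redis or PostgreSQL:

```python
import hashlib
from collections import OrderedDict

class ExactCache:
    """In-memory exact cache with LRU eviction; stands in for Redis or PostgreSQL."""

    def __init__(self, max_size: int = 1024):
        self.store: OrderedDict[str, str] = OrderedDict()
        self.max_size = max_size

    def _key(self, query: str, user_id: str | None) -> str:
        # Scoping user-specific queries by user ID keeps answers from
        # leaking across users; generic queries can share a global scope.
        scope = user_id or "global"
        return hashlib.sha256(f"{scope}:{query}".encode()).hexdigest()

    def get(self, query: str, user_id: str | None = None) -> str | None:
        key = self._key(query, user_id)
        if key in self.store:
            self.store.move_to_end(key)  # mark as most recently used
            return self.store[key]
        return None

    def put(self, query: str, response: str, user_id: str | None = None) -> None:
        key = self._key(query, user_id)
        self.store[key] = response
        self.store.move_to_end(key)
        if len(self.store) > self.max_size:
            self.store.popitem(last=False)  # evict the least recently used entry
```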
Semantic Caching
- Mechanism: Vector similarity matching
- Process: Query embedding → vector search → similarity threshold
- Requirements: Quality embeddings, functional vector search, reliable similarity metrics
- Trade-offs: Higher hit rates vs accuracy risks and computational overhead
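A sketch of semantic lookup; `embed` stands in for any embedding model, the linear scan stands in for a vector database, and the 0.92 threshold is an arbitrary illustration of the hit-rate vs. accuracy trade-off:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_lookup(query, embed, cache, threshold=0.92):
    """Return a cached response if a stored query is similar enough, else None.

    `cache` is a list of (embedding, response) pairs; a vector database
    would replace this linear scan in production.
    """
    q_vec = embed(query)
    best_score, best_response = 0.0, None
    for vec, response in cache:
        score = cosine(q_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    # The threshold trades hit rate against the risk of serving a wrong answer.
    return best_response if best_score >= threshold else None
```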
Step 5: Add Agent Patterns
- Complex workflows: Loops, parallel execution, conditional branching
- Write actions: Email composition, order placement, bank transfers
- Risk considerations: Write actions significantly increase system exposure
Monitoring and Observability
Key Performance Indicators
- MTTD: Mean Time to Detection
- MTTR: Mean Time to Response
- CFR: Change Failure Rate
Metrics Categories
Format Failures
- JSON validation errors
- Expected key presence
- Output structure compliance
Quality Metrics
- Factual consistency
- Conciseness, creativity, positivity
- AI judge evaluations
Safety Metrics
- Toxicity detection
- PII exposure
- Guardrail trigger rates
- Refusal rate monitoring
User Behavior Metrics
- Generation stop frequency
- Conversation turn averages
- Token usage patterns (input/output)
- Response diversity measures
Performance Metrics
- TTFT: Time to First Token
- TPOT: Time Per Output Token
- Total latency: Complete response time
- TPS: Tokens Per Second
- Cost tracking: API usage and rate limits
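A sketch of measuring TTFT, TPOT, and TPS around any token stream; the metrics follow the definitions above, but the instrumentation itself is illustrative:

```python
import time

def measure_stream(stream):
    """Compute latency metrics from any iterator that yields tokens."""
    start = time.perf_counter()
    first = None
    tokens = 0
    for _ in stream:
        if first is None:
            first = time.perf_counter()  # TTFT: time until the first token arrives
        tokens += 1
    end = time.perf_counter()
    if first is None:
        raise ValueError("stream produced no tokens")
    return {
        "ttft_s": first - start,
        "tpot_s": (end - first) / max(tokens - 1, 1),  # average gap between later tokens
        "total_s": end - start,
        "tps": tokens / (end - start),
    }
```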
Logs and Traces
Comprehensive logging:
- All configurations and settings
- Complete query transformation pipeline
- Intermediate outputs and tool calls
- Component start/end times and crashes
Trace requirements:
- Step-by-step query transformation
- Failure point identification
- Time and cost attribution per step
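A minimal tracing sketch using a context manager; real systems typically use a tracing library (e.g., OpenTelemetry), so the record format here is an assumption:

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)

@contextmanager
def trace_step(name: str, trace: list):
    """Record duration and failure status for one pipeline step."""
    start = time.perf_counter()
    status = "error"  # assume failure until the step completes
    try:
        yield
        status = "ok"
    finally:
        seconds = time.perf_counter() - start
        trace.append({"step": name, "status": status, "seconds": seconds})
        logging.info("step=%s status=%s seconds=%.3f", name, status, seconds)

# Usage: wrap each step so failure points and per-step time are attributable.
trace: list = []
with trace_step("retrieval", trace):
    pass  # call the retriever here
```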
Drift Detection
Monitor changes in:
- System prompts and templates
- User behavior patterns
- Model API versions
- Performance metrics over time
AI Pipeline Orchestration
Two-Step Process
1. Component definition: Models, databases, tools, and evaluation/monitoring systems
2. Chaining: Function composition and data-flow specification
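A toy illustration of the two-step process as plain function composition; the shared-dict state convention is an assumption, not any particular framework's API:

```python
def chain(*steps):
    """Compose pipeline components; each step takes and returns a state dict."""
    def pipeline(state: dict) -> dict:
        for step in steps:
            state = step(state)  # data-format compatibility: every step shares one dict
        return state
    return pipeline

# Step 1: component definition (stubs standing in for real retrievers/models).
def retrieve(state: dict) -> dict:
    return {**state, "context": f"docs relevant to {state['query']!r}"}

def generate(state: dict) -> dict:
    return {**state, "answer": f"answer grounded in {state['context']!r}"}

# Step 2: chaining.
pipeline = chain(retrieve, generate)
print(pipeline({"query": "What is TTFT?"})["answer"])
```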
Design Considerations
- Parallel processing: Minimize latency through concurrent operations
- Data format compatibility: Ensure step-to-step data flow
- Error handling: Component failure and data mismatch management
Popular Tools
LangChain, LlamaIndex, Flowise, Langflow, Haystack
Evaluation Criteria
- Component support: Current and future model/framework compatibility
- Advanced features: Branching, parallel processing, error handling
- User experience: API intuitiveness, documentation, community support
- Performance: No hidden API calls or added latency
User Feedback Systems
Strategic Importance
- Evaluation: Application performance monitoring
- Development: Future model training data
- Personalization: User-specific adaptations
- Competitive advantage: Proprietary data collection
Feedback Types
Explicit Feedback
- Thumbs up/down, star ratings
- Yes/no problem resolution
- Direct user responses to feedback requests
Implicit Feedback
- Inferred from user actions rather than direct ratings
- Application-specific behavioral patterns
- Conversational interfaces surface especially rich implicit signals
Natural Language Feedback Signals
Early Termination
- Response generation stops
- App exits during interaction
- Voice assistant interruption
- User abandonment
Error Correction
- “No, …” or “I meant, …” responses
- Query rephrasing attempts
- Specific correction instructions
- Confirmation requests (“Are you sure?”)
User Edits
- Direct response modifications
- Strong preference signal generation
- Preference data creation (original vs edited)
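A sketch of turning a user edit into a preference example; the `chosen`/`rejected` field names follow common preference-tuning formats (e.g., DPO-style pairs) and are illustrative:

```python
def preference_pair(prompt: str, model_response: str, user_edit: str) -> dict:
    """Treat the user's edited version as preferred over the model's original."""
    return {
        "prompt": prompt,
        "chosen": user_edit,         # the edit shows what the user actually wanted
        "rejected": model_response,  # the original becomes the dispreferred output
    }
```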
Complaints
Complaint categories (from the FITS dataset) include:
- Accuracy issues
- Relevance problems
- Toxicity concerns
- Length complaints
- Detail deficiencies
- General quality issues
Sentiment Analysis
- Expressions of frustration or disappointment
- Conversation sentiment tracking
- Voice volume analysis (call centers)
- Model refusal rate monitoring
Conversational Action Feedback
Regeneration Signals
- Signals dissatisfaction with the current response
- Desire to compare multiple options
- Checking the model's consistency
- Under usage-based billing, regeneration signals willingness to pay for another attempt
Conversation Organization
- Delete: Strong negative signal
- Rename: Good conversation, poor auto-title
- Share: Context-dependent positive/negative
- Bookmark: Positive retention signal
Conversation Metrics
- Length: Application-dependent interpretation
- Diversity: Token/topic variety measurement
- Turn patterns: Efficiency vs engagement indicators
Feedback Collection Design
Timing Strategies
- Initial calibration: Optional preference setting
- Failure moments: Error reporting and recovery options
- Low confidence: Uncertainty-driven feedback requests
- Success moments: Optional positive feedback collection
Collection Best Practices
- Seamless integration: Workflow non-disruption
- Easy provision: Minimal user effort
- Ignorable design: Optional participation
- Clear incentives: Explained usage and benefits
Design Examples
- Midjourney: Action-based implicit feedback
- GitHub Copilot: Suggestion acceptance (Tab) vs. continuing to type
- Comparative evaluation: Side-by-side response selection
Feedback Limitations
Common Biases
- Leniency bias: Overly positive ratings to avoid conflict
- Random feedback: Unmotivated user responses
- Position bias: First option preference
- Length bias: Longer response preference
- Recency bias: Last-seen option preference
Degenerate Feedback Loops
Mechanism: A model's predictions influence user feedback, which in turn trains the next model iteration.
Examples:
- Video recommendation popularity reinforcement
- Content type amplification (cat photos example)
- Sycophancy development in models
Mitigation: Understand feedback limitations and potential biases before implementation
Key Takeaways
- Progressive complexity: Start simple, add components as needed
- System thinking: Problems often require multi-component solutions
- Observability first: Design for failure detection and debugging
- User feedback value: Critical for competitive advantage and improvement
- Bias awareness: Understand and design around feedback limitations
- Product-engineering convergence: AI engineering increasingly involves product considerations
- Safety considerations: Each component addition increases potential failure modes