LLM Evaluation in Practice: Error Analysis and Reliable Agent Testing
Evaluating and debugging LLMs, eval-driven development, AI reliability: all sound straightforward until you actually try to do them in production.

In this AI Tech Experts Webinar, Maciej Kurzawa, Machine Learning Engineer, walks through a practical approach to evaluating LLM agents based on real development workflows.

How to evaluate LLM agents in practice: manual error analysis, binary scoring, and avoiding common evaluation anti-patterns.

The talk covers:
- why treating agents like standard models or tests fails
- a 4-step error analysis framework based on manual trace review
- why evaluation requires looking at full conversations and tool usage
- trade-offs of manual evaluation and how to make it manageable
- why binary scoring outperforms continuous scales in practice
- limitations of LLM-as-a-judge and early automation
- common anti-patterns: generic metrics, outsourcing too early, over-automation

The core idea: reliable LLM systems are built by analyzing real failures, not by optimizing synthetic metrics.

If you have questions for Maciej, feel free to comment below.

Check out our website: https://deepsense.ai/
LinkedIn: https://www.linkedin.com/showcase/applied-ai-insider

00:00 LLM and agent evaluation problem overview
01:28 Why standard evaluation approaches fail for agents
03:36 Error analysis framework for LLM systems
05:03 Trade-offs of manual evaluation and time investment
06:45 Improving evaluation: scoring, tools and process
10:34 Common evaluation mistakes and best practices

#LLMEvaluation #AIEngineering #AIAgents #ProductionAI