
LLM Evaluation in Practice: Error Analysis and Reliable Agent Testing

134 views
Apr 16, 2026
15:17

Evaluating and debugging LLMs, eval-driven development, AI reliability: all sound straightforward until you actually try to do them in production.

In this AI Tech Experts Webinar, Maciej Kurzawa, Machine Learning Engineer, walks through a practical approach to evaluating LLM agents based on real development workflows.

How to evaluate LLM agents in practice: manual error analysis, binary scoring, and avoiding common evaluation anti-patterns.

👉 The talk covers:
– why treating agents like standard models or tests fails
– a 4-step error analysis framework based on manual trace review
– why evaluation requires looking at full conversations and tool usage
– trade-offs of manual evaluation and how to make it manageable
– why binary scoring outperforms continuous scales in practice
– limitations of LLM-as-a-judge and early automation
– common anti-patterns: generic metrics, outsourcing too early, over-automation

👉 The core idea: reliable LLM systems are built by analyzing real failures, not by optimizing synthetic metrics.

If you have questions for Maciej, feel free to comment below. 💭

🔗 Check out our website: https://deepsense.ai/
🔗 LinkedIn: https://www.linkedin.com/showcase/applied-ai-insider

00:00 LLM and agent evaluation problem overview
01:28 Why standard evaluation approaches fail for agents
03:36 Error analysis framework for LLM systems
05:03 Trade-offs of manual evaluation and time investment
06:45 Improving evaluation: scoring, tools and process
10:34 Common evaluation mistakes and best practices

#LLMEvaluation #AIEngineering #AIAgents #ProductionAI

