Benchmark embedding models #6 - How to statistically evaluate embedding models with Python and ranx
In this video, we'll dive deep into how to properly benchmark and evaluate embedding models for your RAG (Retrieval-Augmented Generation) or search applications. We'll move beyond looking at a single score and explore why statistical testing is crucial. You'll learn how a model that looks better might just be "lucky," and how to use statistical tests to determine whether the performance difference is real and statistically significant.

First, I'll manually walk you through the three most important metrics used in retrieval evaluation: Mean Reciprocal Rank (MRR), Recall@K, and Normalized Discounted Cumulative Gain (NDCG@K). I'll explain the math and intuition behind each one, showing you how to calculate them from scratch in Python.

Then, I'll introduce you to ranx, a powerful Python library that automates this entire process. We'll refactor our manual code to use ranx to create "Qrels" (Query Relevance Judgments) and "Runs" (model scores). You'll see how to run a complete benchmark, including statistical tests like Fisher's Randomization Test and the Paired t-test, with just a single line of code.

Finally, we'll analyze the ranx report, which gives us a detailed metrics table and a win-tie-loss comparison. This will help us decide which embedding model performs best on our specific dataset, backed by statistically significant evidence.

GitHub Repository: https://github.com/ImadSaddik/Benchmark_Embedding_Models
Ranx documentation: https://amenra.github.io/ranx/

Timestamps:
(00:00) Introduction
(00:22) Two ways to compare: Manually vs. ranx library
(01:21) Manual benchmark results table
(02:05) Why we need statistical testing
(04:31) Explaining Evaluation Metrics
(05:14) Metric 1: Mean Reciprocal Rank (MRR)
(08:31) Metric 2: Recall@K
(10:12) Metric 3: Normalized Discounted Cumulative Gain (NDCG@K)
(12:45) Introduction to Statistical Tests
(16:05) The Null Hypothesis and p-value explained
(17:46) Statistical tests available in ranx
(19:27) Example: Paired t-test
(20:35) Example: Randomization (Fisher's) Test
(25:03) Code Walkthrough: Manual
(38:56) Code Walkthrough: Ranx Benchmark
(53:53) Conclusion
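
For readers who want a feel for the manual part before watching, here is a minimal sketch of the three metrics and a paired randomization test in plain Python. It is not the code from the video or the repository, and it does not use the ranx API. All function names, the example document IDs, and the defaults (`n_permutations`, `seed`) are assumptions made for illustration. NDCG here uses linear gain; some variants use an exponential gain instead.

```python
import math
import random


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevance, k):
    """DCG of the top k results divided by the DCG of an ideal ordering.

    `relevance` maps doc_id -> graded relevance (linear gain, an assumption).
    """
    def dcg(gains):
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def randomization_test(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Two-sided paired randomization test on per-query scores.

    Under the null hypothesis the two models are interchangeable, so the
    sign of each per-query difference can be flipped at random. The p-value
    is the share of random flips whose mean difference is at least as
    extreme as the observed one.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(n_permutations):
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted / len(diffs)) >= observed:
            extreme += 1
    return extreme / n_permutations


# Hypothetical query: two relevant documents, one model's ranking.
relevance = {"d1": 1, "d4": 1}
ranked = ["d3", "d1", "d2", "d4"]

rr = reciprocal_rank(ranked, set(relevance))   # first hit is at rank 2
recall = recall_at_k(ranked, set(relevance), k=3)
ndcg = ndcg_at_k(ranked, relevance, k=3)
```

MRR is the mean of `reciprocal_rank` over all queries. To compare two models, compute one score per query for each model and pass the two lists to `randomization_test`; a small p-value suggests the gap is unlikely to come from luck alone.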