
Benchmark embedding models #2 - Extracting text from PDF documents

522 views
Oct 14, 2025
43:11

In this video, I'll show you how to extract text from complex PDF documents, including scanned files, charts, and tables, by comparing traditional Python libraries with modern multimodal AI models.

We'll start by demonstrating the limitations of popular Python libraries like PyMuPDF, which struggle to preserve document structure, handle scanned text, or interpret complex layouts correctly. You'll see side-by-side comparisons where traditional methods produce jumbled and incomplete text.

Then, we'll explore the power of Vision Language (VL) models. I'll explain the key differences and trade-offs between using free Python libraries and VL models. By the end of this tutorial, you'll have a clear understanding of which tools are best suited for different PDF extraction tasks and how to implement them to build high-quality datasets.

Here is the link to the GitHub repository: https://github.com/ImadSaddik/Benchmark_Embedding_Models

Useful links:
- OpenVLM Leaderboard: https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
- Intelligent Document Processing Leaderboard: https://idp-leaderboard.org/
- llama.cpp: https://github.com/ggerganov/llama.cpp
- Granite docling 258M:
  https://huggingface.co/ibm-granite/granite-docling-258M
  https://huggingface.co/ggml-org/granite-docling-258M-GGUF
- Gemma3 12B: https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-gguf
- Google AI studio: https://aistudio.google.com/api-keys
- Gemini API documentation:
  https://ai.google.dev/gemini-api/docs/rate-limits
  https://ai.google.dev/gemini-api/docs/pricing
  https://ai.google.dev/gemini-api/docs/files

Don't forget to like, subscribe, and leave a comment if you have any questions or feedback!
⭐️ Contents ⭐️
(00:00) Introduction
(02:03) Using Vision Language (VL) Models for Extraction
(03:09) Python Libraries vs VL Models: The Trade-offs
(04:47) Side-by-Side Test #1
(08:18) Side-by-Side Test #2
(10:39) Benefits of Using VL Models
(12:30) Coding time
(12:38) Extracting Text with PyMuPDF (Normal PDF)
(15:26) Attempting Extraction with PyMuPDF (Scanned PDF)
(16:15) Extracting Text with Proprietary VL Models (Gemini 2.5 Pro)
(26:50) Using Granite docling 258M with Transformers (Slow)
(32:59) Using llama.cpp for local inference (Fast)
(40:33) Using Gemma3-12B with llama.cpp
(42:13) Final Comparison: Gemini 2.5 Pro vs Granite docling 258M vs Gemma3 12B
(42:54) Conclusion
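
For readers who want a feel for the PyMuPDF chapters (12:38 and 15:26) before watching, here is a minimal sketch. It is not the code from the video or the repository. The helper names `extract_page_texts` and `pages_needing_vl_model` and the 20-character threshold are assumptions made for illustration. The idea is that a scanned page with no text layer typically yields empty or near-empty text from PyMuPDF, so such pages can be flagged as candidates for a VL model.

```python
def extract_page_texts(pdf_path):
    """Return the plain text of each page using PyMuPDF (pip install pymupdf)."""
    import fitz  # PyMuPDF; imported lazily so the helper below works without it

    with fitz.open(pdf_path) as doc:
        # get_text() reads the embedded text layer; it does not run OCR.
        return [page.get_text() for page in doc]


def pages_needing_vl_model(page_texts, min_chars=20):
    """Return indexes of pages whose extracted text is (almost) empty.

    min_chars is an illustrative threshold, not a value from the video.
    """
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) < min_chars
    ]


if __name__ == "__main__":
    # Stand-in strings instead of a real PDF, so the sketch runs anywhere.
    sample = ["A normal page with a proper text layer.", "", "  \n "]
    print(pages_needing_vl_model(sample))
```

With a real file, you would pass the result of `extract_page_texts("your_file.pdf")` to `pages_needing_vl_model`. Note that a non-empty result from PyMuPDF does not guarantee good output: as the side-by-side tests show, text from tables and complex layouts can still come out jumbled.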

