Benchmark embedding models #3 - Divide text into coherent chunks

143 views
Oct 25, 2025
22:02

In this video, I'll show you how to divide a large text document into smaller, semantically coherent chunks using Large Language Models (LLMs). This is the third video in my series on benchmarking embedding models.

We'll start by discussing why simple programmatic chunking (e.g., by character count or paragraphs) often fails: it can cut a passage in the middle of an idea and break its contextual meaning. The goal is to create high-quality, focused chunks, a crucial step in building effective Retrieval-Augmented Generation (RAG) systems.

First, I'll walk you through a practical example using the Gemini API. You'll learn how to write a detailed system prompt that instructs the model to identify topics and split the text accordingly. We'll also use Pydantic to define a strict output schema, forcing the LLM to return a clean, parsable JSON array of text chunks.

Next, we'll replicate the process with an open-source model, Gemma-3-4B, running entirely on a local machine with llama.cpp. I'll show you how to download the model from Hugging Face, start the local server, and adapt the code to send API requests to it. We'll also look at some limitations of running smaller models locally, especially the context window size and hardware constraints when dealing with very large documents.

Here is the link to the GitHub repository: https://github.com/ImadSaddik/Benchmark_Embedding_Models

Don't forget to like, subscribe, and leave a comment if you have any questions or feedback!
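The Pydantic step described above can be sketched roughly as follows. This is a minimal illustration assuming Pydantic v2; the `Chunk` model with a single `text` field and the `parse_chunks` helper are hypothetical names, not necessarily what the repository uses. The JSON Schema generated from the model is the kind of thing an API with structured-output support could be given, and the same model validates the raw JSON the LLM returns, whether it came from Gemini or a local llama.cpp server.

```python
from pydantic import BaseModel, TypeAdapter, ValidationError


class Chunk(BaseModel):
    """One semantically coherent piece of the source text (hypothetical schema)."""

    text: str


# Adapter for the top-level type the LLM should return: a JSON array of chunks.
CHUNK_LIST = TypeAdapter(list[Chunk])

# JSON Schema derived from the Pydantic model; a structured-output API
# could be given this to constrain the shape of the model's response.
CHUNK_LIST_SCHEMA = CHUNK_LIST.json_schema()


def parse_chunks(raw_json: str) -> list[str]:
    """Validate the LLM's raw JSON output and return the non-empty chunk texts.

    Raises pydantic.ValidationError if the output is not a JSON array of
    objects that each have a string "text" field.
    """
    chunks = CHUNK_LIST.validate_json(raw_json)
    # Strip surrounding whitespace and drop chunks that are empty after stripping.
    return [chunk.text.strip() for chunk in chunks if chunk.text.strip()]


if __name__ == "__main__":
    response = '[{"text": "What RAG is."}, {"text": "Why chunk size matters."}]'
    print(parse_chunks(response))

    try:
        # A single object instead of an array should be rejected.
        parse_chunks('{"text": "not an array"}')
    except ValidationError as err:
        print("Rejected malformed output:", err.error_count(), "error(s)")
```

Validating with one model on both backends keeps the downstream code identical. If a smaller local model such as Gemma-3-4B returns malformed JSON, the `ValidationError` gives a single, clear place to retry the request or log the failure.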
⭐️ Contents ⭐️
(00:00) Introduction to text chunking
(01:02) Why programmatic chunking isn't a good idea
(02:08) Different methods for dividing text into chunks
(06:00) Method 1: Using the Gemini API
(08:48) Using a Pydantic object to define the output structure
(09:35) Generating chunks with the Gemini API
(12:16) Inspecting the results and saving the chunks
(15:01) Method 2: Using a local LLM (Gemma-3-4B)
(15:09) How to download the Gemma model from Hugging Face
(16:52) Running the model locally using the llama-server
(18:19) Generating chunks with the local Gemma API
(19:36) Final thoughts and comparison