Welcome back! In our journey to build the Docs-Reader, we've already learned how to turn text into embeddings and store them in a vector database for searching.

Now, imagine you have a really long document, like a detailed technical manual or a full book. If you tried to convert the entire document into a single embedding, that one embedding would have to capture the meaning of everything in the document at once, which makes it imprecise, especially for detailed searches. On top of that, most embedding models have a limit on how much text they can process in one pass.

What if you're only interested in a specific small detail buried deep within that long document? Searching the embedding of the entire document isn't efficient. You'd much rather search just the relevant parts.

What Problem Does Text Splitting Solve?

The core problem is that large documents are too big and complex to process effectively as a single unit, especially for creating precise embeddings and performing targeted searches.

Think of it like having a giant textbook and trying to find a specific sentence. You wouldn't re-read the entire book every time. You'd likely flip through chapters, then sections, then paragraphs until you find what you need.

We need a way for our system to do something similar: break the large document into smaller, more manageable pieces.

Enter Text Splitting (Chunking)

Text Splitting, often called Chunking, is the process of breaking down a long document into smaller segments or "chunks."
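
As a minimal sketch (the function name and the 500-character chunk size are illustrative choices, not from any particular library), a splitter can be as simple as slicing the text into fixed-size pieces:

```python
def split_text(text: str, chunk_size: int = 500) -> list[str]:
    """Break a long string into chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

document = "A very long technical manual... " * 100  # stand-in for a real file
chunks = split_text(document)
print(f"Split document into {len(chunks)} chunks of up to 500 characters each")
```

Real splitters are usually smarter than this, breaking on paragraph or sentence boundaries so a chunk doesn't cut a thought in half, but the idea is the same.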

Instead of processing a whole document, we take each chunk and:

  1. Create an embedding for it.
  2. Store this chunk (and its embedding) in the vector database.

When you ask a question, we convert it into an embedding and search the database for the most similar chunk embeddings. This way, we retrieve only the small pieces of the document that are most relevant to your specific question, not the whole thing.
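
To make that flow concrete, here's a hedged sketch of the embed-store-search loop. The `embed` function below is a toy stand-in (a normalized character-frequency vector) so the example runs on its own; in the real pipeline it would call your embedding model, and the sample chunks would come from the splitter above:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy embedding: a normalized character-frequency vector."""
    vec = np.zeros(128)
    for ch in text:
        vec[ord(ch) % 128] += 1
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Sample chunks; in practice these come from splitting a real document.
chunks = [
    "Hold the power button for ten seconds to reset the device.",
    "The battery lasts roughly ten hours on a full charge.",
    "Firmware updates are installed automatically overnight.",
]

# 1. Create an embedding for each chunk and store chunk + embedding together
#    (our "vector database" here is just an in-memory list).
store = [(chunk, embed(chunk)) for chunk in chunks]

# 2. Embed the question and rank chunks by cosine similarity
#    (a plain dot product, since our vectors are normalized).
def search(question: str, top_k: int = 2) -> list[str]:
    q = embed(question)
    ranked = sorted(store, key=lambda pair: float(q @ pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

print(search("How do I reset the device?"))
```

Notice that the question is never compared against the whole document, only against the individual chunks, which is exactly what makes the retrieval targeted.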

Why is Chunking Important?

As we've seen, chunking matters for a few reasons: an embedding of a small chunk captures its meaning more precisely than one embedding of an entire document, chunks fit within the text limits of embedding models, and searching over chunks lets us retrieve only the parts of a document relevant to a question.

Key Concepts in Chunking