Welcome to the Docs-Reader project tutorial! In this project, you're building a tool that can understand your documents and answer questions about them. To do this effectively, we need a smart way to store and search through the information in your documents. That's where a Vector Database comes in.
Think about a regular database, like the kind you might use to store customer names and addresses. You search it using exact matches or keywords. But what if you want to find documents that are about the same topic, even if they don't use the exact same words? This is a common challenge when dealing with text.
For example, if your document talks about "automobiles" and you search for "cars", a regular keyword search might miss it. But you know "automobiles" and "cars" mean roughly the same thing. How can a computer understand this?
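To make the mismatch concrete, here is a minimal sketch of exact-word matching in Python. The `keyword_search` function and the sample documents are hypothetical, invented only for this illustration; they are not part of the Docs-Reader code.

```python
def keyword_search(query, documents):
    """Return the documents that contain the query as a whole word (case-insensitive)."""
    query = query.lower()
    return [doc for doc in documents if query in doc.lower().split()]


# Hypothetical sample documents, made up for this illustration.
documents = [
    "Automobiles need regular maintenance",
    "Bananas are rich in potassium",
]

# The word "cars" never appears, so the relevant document is missed.
print(keyword_search("cars", documents))

# Only the exact word finds it.
print(keyword_search("automobiles", documents))
```

The first search comes back empty even though the first document is clearly about cars, because the function only compares words and knows nothing about what they mean.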
The core problem is searching based on meaning, not just exact words. A standard database is great for structured data and keyword searches. But for finding relevant information in unstructured text, where different words can have similar meanings, we need a different approach.
Imagine you have a giant library (your documents) and you're looking for information about "space travel". Some books might use the word "astronauts," others "cosmonauts," "rockets," "spacecraft," or "orbital mechanics." A simple keyword search for "space travel" might miss many relevant books. You want a system that understands that all these terms are related to the concept of "space travel" and can find all relevant books.
This is the use case we want to solve: finding document parts that are semantically related to your question.
A Vector Database is a special type of database designed specifically for this kind of "meaning-based" search. Instead of relying only on the raw text of your data, it stores a numerical representation of that data called an embedding: a list of numbers arranged so that pieces of text with similar meanings end up with similar lists.
A Vector Database stores these embeddings and is highly optimized for one main task: quickly finding embeddings that are most similar to a given query embedding.
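The idea of "finding the most similar embeddings" can be sketched in plain Python. This is only an illustration under stated assumptions: the three-number embeddings below are hand-made toy values (real embeddings come from a model and have hundreds of dimensions), the names `cosine_similarity` and `most_similar` are hypothetical helpers, and cosine similarity is just one common similarity measure. The sketch compares the query against every stored vector, whereas a real vector database typically uses specialized index structures to avoid doing that.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_embedding, embeddings, k=1):
    """Return the k stored keys whose embeddings are most similar to the query."""
    ranked = sorted(
        embeddings,
        key=lambda name: cosine_similarity(query_embedding, embeddings[name]),
        reverse=True,
    )
    return ranked[:k]


# Hand-made toy embeddings (assumption: invented values, not model output).
stored = {
    "automobiles": [0.9, 0.1, 0.0],
    "rockets": [0.1, 0.9, 0.2],
    "bananas": [0.0, 0.1, 0.9],
}

# Pretend this is the embedding of the query "cars".
query = [0.8, 0.2, 0.0]

print(most_similar(query, stored))  # the "automobiles" vector points the same way
```

Even though the query never mentions "automobiles", its vector points in nearly the same direction as the stored "automobiles" vector, so that entry ranks first.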
Chroma DB is the specific Vector Database we use in the Docs-Reader project. Its job is to:
The process involves two main phases: