Why does the LLM not use my documents?
Tip: This page is a deep dive on how or why your LLM is not using your documents when asking questions.
Luckily, it's simple to solve for, actually! These ideas extend beyond just AnythingLLM.
We get this question many times a week, where someone is confused, or even upset the LLM does not appear to "just know everything" about the documents that are embedded into a workspace.
So to understand why this occurs we first need to clear up some confusion on how RAG (retrieval augmented generation) works inside of AnythingLLM.
This will not be deeply technical, but once you read this you will be an expert on how traditional RAG works.
LLMs are not omnipotent
Unfortunately, LLMs are not yet sentient and so it is vastly unrealistic with even the most powerful models for the model you are using to just "know what you mean".
That being said there are a ton of factors and moving parts that can impact the output and salience of an LLM and even to complicate things further, each factor can impact your output depending on what your specific use case is!
LLMs do not introspect
In AnythingLLM, we do not read your entire filesystem and then report that to the LLM, as it would waste tokens 99% of the time.
Instead, your query is processed against your vector database of document text and we get back 4-6 text chunks from the documents that are deemed "relevant" to your prompt.
For example, let's say you have a workspace of hundreds of recipes, don't ask "Get me the title of the 3 high-calorie meals". This LLM will outright refuse this! but why?
When you use RAG for document chatbots your entire document text cannot possibly fit in most LLM context windows. Splitting the document into chunks of text and then saving those chunks in a vector database makes it easier to "augment" an LLM's base knowledge with snippets of relevant information based on your query.
Your entire document set is not "embedded" into the model. It has no idea what is in each document nor where those documents even are.
If this is what you want, you are thinking of agents, which are coming to AnythingLLM soon.
So how does AnythingLLM work?
Let's think of AnythingLLM as a framework or pipeline.
-
A workspace is created. The LLM can only "see" documents embedded in this workspace. If a document is not embedded, there is no way the LLM can see or access that document's content.
-
You upload a document, this makes it possible to "Move into a workspace" or "embed" the document. Uploading takes your document and turns it into text - that's it.
-
You "Move document to workspace". This takes the text from step 2 and chunks it into more digestable sections. Those chunks are then sent to your embedder model and turned into a list of numbers, called a vector.
-
This string of numbers is saved to your vector database and is fundamentally how RAG works. There is no guarantee that relevant text stays together during this step! This is an area of active research.
-
You type a question into the chatbox and press send.
-
Your question is then embedded just like your document text was.
-
The vector database then calculates the "nearest" chunk-vector. AnythingLLM filters any "low-score" text chunks (you can modify this). Each vector has the original text it was derived from attached to it.
IMPORTANT!
This is not a purely semantic process so the vector database would not "know what you mean".
It's a mathematical process using the "Cosine Distance" formula.
However, here is where the embedder model used and other AnythingLLM settings can make the most difference. Read more in the next section.
-
Whatever chunks deemed valid are then passed to the LLM as the original text. Those texts are then appended to the LLM as its "System message". This context is inserted below your system prompt for that workspace.
-
The LLM uses the system prompt + context, your query, and history to answer the question as best as it can.
Done.
How can I make retrieval better then?
AnythingLLM exposes many many options to tune your workspace to better fit with your selection of LLM, embedder, and vector database.
The workspace options are the easiest to mess with and you should start there first. AnythingLLM makes some default assumptions in each workspace. These work for some but certainly not all use cases.
You can find these settings by hovering over a workspace and clicking the "Gear" icon.
Chat Settings > Prompt
This is the system prompt for your workspace. This is the "rule set" your workspace will follow and how it will ultimately respond to queries. Here you can define it to respond in a certain programming language, maybe a specific language, or anything else. Just define it here.
Chat Settings > LLM Temperature
This determines how "inventive" the LLM is with responses. This varies from model to model. Know that the higher the number the more "random" a response will be. Lower is more short, terse, and more "factual".
Vector Database Settings > Search Preference
By default, AnythingLLM will search for the most relevant chunks of text. For the majority of use cases this is the best option since it is very simple to run and very fast to calculate.
However, if you are getting bad results, you may want to try "Accuracy Optimized" instead. This will search more chunks of text and then re-rank them to the top chunks that are most relevant to your query. This process is slightly slower but will yield better results in almost all cases. This option is only available if you are using LanceDB as your vector database.
The reason this method is opt-in is because it is computationally more expensive and on slower machines it may take more time that the you are willing to wait. Like the embedder model, this model will download once on it's first use. This is a workspace specific setting so you can experiment with it in different workspaces.
From our testing, the reranking process will add about 100-500ms to the response time depending on your computer or instance performance.
Vector Database Settings > Max Context Snippets
This is a very critical item during the "retrieval" part of RAG. This determines "How many relevant snippets of text do I want to send to the LLM". Intuitively you may think "Well, I want all of them", but that is not possible since there is an upper limit to how many tokens each model can process. This window, called the context window, is shared with the system prompt, context, query, and history.
AnythingLLM will trim data from the context if you are going to overflow the model - which will crash it. So it's best to keep this value anywhere from 4-6 for the majority of models. If using a large-context model like Claude-3, you can go higher but beware that too much "noise" in the context may mislead the LLM in response generation.
Vector Database Settings > Document similarity threshold
This setting is likely the cause of the issue you are having! This property will filter out low-scoring vector chunks that are likely irrelevant to your query. Since this is based on mathematical values and not based on the true semantic similarity it is possible the text chunk that contains your answer was filtered out.
If you are getting hallucinations or bad LLM responses, you should set this to No Restriction. By default the minimum score is 20%, which works for some but this calculated values depends on several factors:
- Embedding model used (dimensions and ability to vectorize your specific text)
- Example: An embedder used to vectorize English text may not do well on Mandarin text.
- The default embedder is https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 (opens in a new tab)
- The density of vectors in your specific workspace.
- More vectors = more possible noise, and matches that are actually irrelevant.
- Your query: This is what the matching vector is based on. Vague queries get vague results.
Document Pinning
As a last resort, if the above settings do not seem to change anything for you - then document pinning may be a good solution.
Document Pinning is where we do a full-text insertion of the document into the context window. If the context window permits this volume of text, you will get full-text comprehension and far better answers at the expense of speed and cost.
Document Pinning should be reserved for documents that can either fully fit in the context window or are extremely critical for the use-case of that workspace.
You can only pin a document that has already been embedded. Clicking the pushpin icon will toggle this setting for the document.