
Retriever Optimization Strategies for Successful RAG

As interest in Retrieval-Augmented Generation (RAG) grows, Allganize’s RAG team shares optimization strategies, including hybrid search and case-specific techniques, to improve retriever performance. Their RAG leaderboard helps users evaluate solutions, ensuring accurate retrievals and reducing hallucinations in LLMs.

As interest in RAG (Retrieval-Augmented Generation) grows, driven by its ability to reduce hallucinations in LLMs, so does the question of which RAG solution can most effectively retrieve candidate answers and generate accurate responses.

Allganize has launched a RAG leaderboard to support the evaluation of RAG performance, helping users find the solutions best suited to their needs.

The RAG leaderboard was developed by Allganize’s RAG team, a group dedicated to researching and providing high-performance RAG solutions to clients. Lee Jeong-Hoon, the head of the RAG team, shares insights on retriever optimization strategies, which play a crucial role in determining the overall performance of RAG solutions.

In RAG (Retrieval-Augmented Generation), the most significant factor affecting the performance of responses is the retrieval process. The accuracy and relevance of the retrieved information directly influence the quality of the generated answers.

While developing Alli, we found that providing accurate context in the prompt leads to precise answers. Because LLMs acquire strong in-context learning abilities during training, they are already quite effective at generating responses grounded in the given context.

The common AI principle, "Garbage in, garbage out," applies to RAG as well—if the context provided is accurate, the generated answer will be accurate; if not, the answer may be misleading or incorrect.

When accurate context related to a question is provided, a precise answer is generated. However, if unrelated context is included, the answer may contain hallucinations.

So, how can we accurately retrieve the relevant context from a large amount of data?

For instance, finding the relevant part of a 10-page document is relatively easy. But when dealing with 50 documents, each ranging from 50 to 300 pages, it becomes much more challenging. If the documents are from the same domain, with similar content across pages, identifying the correct context becomes even harder.

Let me explain step-by-step how Allganize's Alli constructs its retrieval process to address this challenge.

Chunking Text into Appropriate Sizes

One of the first challenges in constructing retrieval for RAG is determining the size of the chunks. Since it’s difficult to process the entire text of a document at once, we break it into smaller units, called "chunks."

For example, imagine splitting the context into 300-token chunks. While this method divides the text into manageable pieces, there is a risk of losing surrounding information, which can lead to less accurate answers. Ideally, retrieval performance would be perfect, capturing all the relevant context and generating precise responses, but realistically, achieving 100% retrieval accuracy is difficult.
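To make the mechanics concrete, here is a minimal fixed-size chunking sketch. It is illustrative only; the tiktoken tokenizer and the 300-token size are assumptions for the example, and, as explained below, Alli itself chunks by page rather than by a fixed token count.

```python
# Minimal fixed-size chunking sketch (illustrative only, not Alli's implementation).
# Assumes the open-source tiktoken tokenizer; any tokenizer with encode/decode works.
import tiktoken

def chunk_by_tokens(text: str, chunk_size: int = 300) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[i : i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]

# Example usage: a long document becomes a list of ~300-token chunks,
# each of which loses the surrounding context held by its neighbors.
chunks = chunk_by_tokens(open("report.txt", encoding="utf-8").read())
print(len(chunks), "chunks")
```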

For instance, suppose you divide a page into 200-token chunks and receive a question like “Summarize long-term market interest rates by month,” but the relevant paragraph contains 800 tokens. Answering correctly requires referencing the entire 800-token paragraph, which cannot fit within a single chunk. In such cases, the missing context degrades the accuracy of the generated response.

Given situations like the example above, Alli structures its chunks by document page. This approach handles a broader range of questions by providing larger portions of context, though larger chunks also mean fewer context pieces fit within the prompt’s limits.

While there is no single correct method, internal testing showed that approaching the text at the page level yielded better performance.

So, how do we identify the relevant chunks from the many available ones? The key lies in efficient retrieval methods that can accurately select the most relevant chunks for the question.

Keyword Search for Exact Word Matches

Allganize was heavily invested in retrieval testing long before the advent of ChatGPT. In fact, before generative models improved dramatically, Alli used an extractive model to answer user questions. The extractive model, built on BERT, identifies the start and end positions of the correct answer within a chunk. While this approach lacked the flexibility of generative models in producing diverse responses, prior to 2023 it outperformed them in accuracy.
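Purely to illustrate the extractive approach, the sketch below uses a Hugging Face question-answering pipeline; the model name is an assumption for the example, not the model Alli actually used.

```python
# Illustrative extractive QA sketch (not Alli's actual model).
# A BERT-style QA model predicts the start and end positions of the answer span
# inside a given chunk, rather than generating free-form text.
from transformers import pipeline

# Model choice is an assumption for the example.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

chunk = (
    "Long-term market interest rates rose steadily through the first quarter, "
    "with the 10-year yield ending March at 3.9%."
)
result = qa(question="What was the 10-year yield at the end of March?", context=chunk)
print(result["answer"], result["start"], result["end"])  # answer span + character offsets
```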

Similarly, Alli’s extractive model needed to retrieve the chunk relevant to each question. For this, keyword search was employed: using ElasticSearch and the BM25 algorithm, chunks were ranked higher when they contained more exact query-word matches relative to their length.

In short, the more exact word matches a chunk contains in proportion to its length, the higher it ranks. This improved retrieval accuracy based on keyword relevance.
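Allganize’s production setup uses ElasticSearch; the same idea can be shown with a minimal sketch built on the open-source rank_bm25 package, with made-up chunks and a made-up query.

```python
# Minimal BM25 keyword-search sketch (illustrative; production uses ElasticSearch).
from rank_bm25 import BM25Okapi

chunks = [
    "Long-term market interest rates rose in March.",
    "The company released a new Macintosh model.",
    "Quarterly revenue grew by 12 percent.",
]
# Naive whitespace tokenization; a real system would use a proper analyzer.
tokenized = [c.lower().split() for c in chunks]
bm25 = BM25Okapi(tokenized)

query = "market interest rates by month".lower().split()
scores = bm25.get_scores(query)            # one relevance score per chunk
best = max(range(len(chunks)), key=lambda i: scores[i])
print(scores, "->", chunks[best])          # the first chunk ranks highest
```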

The strength of Keyword Search lies in its simplicity and ease of implementation. Being intuitive means that performance debugging is straightforward—you can fine-tune the search by assigning weights to certain keywords or making other detailed adjustments.

However, the downside is that keyword search cannot capture synonyms. For instance, if a query includes "Macintosh" but the document mentions "Apple computer," no matching keyword is found, and even the relevant documents receive a score of zero.

So, was Alli's extractive model completely unable to handle synonyms? Not entirely. Alli employed a user dictionary to address synonym detection. However, this approach had its limitations, as it’s not feasible to register all possible synonyms in the dictionary. Therefore, while it improved some aspects, this method was not comprehensive in handling synonym variations.
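As a rough sketch of the idea, a registered synonym map can expand the query before keyword search so that “Macintosh” also matches documents that only mention “Apple computer”; the mapping and helper below are hypothetical, not Alli’s actual dictionary mechanism.

```python
# Rough sketch of query expansion with a user dictionary (mapping is hypothetical).
USER_DICTIONARY = {
    "macintosh": ["apple", "computer"],  # hand-registered synonyms
}

def expand_query(tokens: list[str]) -> list[str]:
    """Append registered synonyms so keyword search can match related wording."""
    expanded = list(tokens)
    for tok in tokens:
        expanded.extend(USER_DICTIONARY.get(tok, []))
    return expanded

print(expand_query("macintosh release date".split()))
# ['macintosh', 'release', 'date', 'apple', 'computer']
```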

Semantic Search for Handling Synonyms

As mentioned earlier, Keyword Search has limitations when it comes to handling synonyms. Semantic Search addresses this issue by focusing on the meaning of words rather than just exact matches. Here’s how semantic search works:

Instead of matching specific words, semantic search uses machine learning models, like BERT-based embeddings, to convert both the query and the documents into vector representations that capture their meaning. The system then calculates the cosine similarity between these vectors to determine how closely the meanings align. This allows the search to recognize that terms like "Macintosh" and "Apple computer" are related, even though the exact words don't match.

By focusing on the semantic relationship between words, semantic search can effectively handle synonyms and improve the relevance of the retrieved chunks, making it a powerful alternative to keyword-based search.
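As an illustration of this idea, the sketch below embeds the query and the chunks with a sentence-embedding model and ranks chunks by cosine similarity; the specific model is an assumption for the example, not the embedding model Alli uses.

```python
# Minimal semantic-search sketch (illustrative; the embedding model is an assumption).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "The Apple computer lineup was refreshed this year.",
    "Long-term market interest rates rose in March.",
]
query = "When was the Macintosh updated?"

chunk_vecs = model.encode(chunks, convert_to_tensor=True)
query_vec = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query vector and each chunk vector.
sims = util.cos_sim(query_vec, chunk_vecs)[0]
print(sims)  # the "Apple computer" chunk should score higher despite sharing no keyword
```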

The team initially considered adopting Semantic Search for Alli’s extractive model, but due to the limitations of embedding models at the time, such as token-count restrictions and performance issues, it was not adopted.

The approach relied on a Transformer encoder-based model to convert text into dense vectors for semantic search. However, these models had a maximum input of 512 tokens, which proved insufficient: Alli processes chunks at the page level and needs to handle over 1,000 tokens per chunk.

Models that use sparse attention mechanisms, like Longformer and Bigbird, which reduce computational load for longer token sequences, were also evaluated. Unfortunately, these models did not perform well in terms of accuracy or efficiency, and thus, were not adopted for Alli’s retrieval system.

Keyword + Semantic Search: Hybrid Search

So, is Alli still unable to handle synonyms effectively? Not anymore. Currently, Alli's generative model uses a Hybrid Search, combining both Keyword Search and Semantic Search.

However, internal tests showed that using OpenAI Embedding alone didn’t yield the best results. To improve performance, Alli combined several scores:

  1. BM25 score from Keyword Search,
  2. Cosine similarity between the query and the document's title (converted into dense vectors),
  3. Cosine similarity between the query and the document's page (also converted into vectors).

These three scores were normalized and combined to calculate a final score. The top N chunks with the highest scores were then input into the prompt for generating responses.

This method improves retrieval accuracy by leveraging both exact keyword matches and semantic understanding.
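As a rough sketch of how such a combination might look (the min-max normalization and the equal default weights below are assumptions, not Allganize’s actual formula):

```python
# Sketch of hybrid scoring: normalize BM25, title-vector, and page-vector scores,
# then combine them with weights. Normalization and weights are assumptions.
import numpy as np

def minmax(x: np.ndarray) -> np.ndarray:
    rng = x.max() - x.min()
    return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

def hybrid_scores(bm25, title_sim, page_sim, weights=(1.0, 1.0, 1.0)) -> np.ndarray:
    """Combine the three per-chunk signals into one final score per chunk."""
    parts = [minmax(np.asarray(s, dtype=float)) for s in (bm25, title_sim, page_sim)]
    w_bm25, w_title, w_page = weights
    return w_bm25 * parts[0] + w_title * parts[1] + w_page * parts[2]

# Made-up per-chunk scores from the three signals.
scores = hybrid_scores(bm25=[2.1, 0.0, 5.3],
                       title_sim=[0.42, 0.88, 0.31],
                       page_sim=[0.55, 0.71, 0.64])
top_n = np.argsort(scores)[::-1][:2]  # indices of the top-N chunks for the prompt
print(scores, top_n)
```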

Retriever Optimization Strategy

What if a project contains documents with meaningless titles like "ver1.pdf," "ver1_1.pdf," or "final_version.ppt"? Or, what if a user asks a question in Korean, but the uploaded document is in English?

To handle these situations, Allganize’s Alli applies simple retriever optimization techniques:

  1. For meaningless titles: In cases where the document titles provide no relevant information, the weight of the Title Vector can be set to 0, preventing it from affecting the final score.
  2. For language mismatches (e.g., a Korean question and English documents): Here, keyword search becomes ineffective. Instead, increasing the weight of the Page Vector and reducing the weight of the BM25 score can improve performance. Both adjustments are sketched in the code below.
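Continuing the hypothetical hybrid_scores() sketch from the previous section, these optimizations amount to re-weighting the three signals; the specific weight values below are illustrative, not Allganize’s.

```python
# Re-weighting the signals from the hybrid_scores() sketch above (values illustrative).
bm25 = [2.1, 0.0, 5.3]          # keyword scores per chunk
title_sim = [0.42, 0.88, 0.31]  # query-title cosine similarities
page_sim = [0.55, 0.71, 0.64]   # query-page cosine similarities

# 1. Meaningless titles ("ver1.pdf", "final_version.ppt"): zero out the title vector.
print(hybrid_scores(bm25, title_sim, page_sim, weights=(1.0, 0.0, 1.0)))

# 2. Korean question over English documents: keyword matching is ineffective, so
#    down-weight BM25 and lean more heavily on the page vector.
print(hybrid_scores(bm25, title_sim, page_sim, weights=(0.2, 0.5, 1.5)))
```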

Once the user submits a query, Allganize parses the documents’ paragraphs, tables, and charts, then applies the optimized retrieval methods described above to feed the relevant context into Allganize’s financial LLM, Alfi, ensuring an accurate response for the user.
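As a generic sketch of this final step (the prompt template and the generate() call below are placeholders, not Allganize’s or Alfi’s actual interface), the top-ranked chunks can be packed into a grounded prompt for any LLM:

```python
# Generic sketch of the generation step (template and generate() are placeholders,
# not Allganize's or Alfi's actual interface).
def build_prompt(question: str, top_chunks: list[str]) -> str:
    """Pack the top-N retrieved chunks and the question into a grounded prompt."""
    context = "\n\n".join(f"[Context {i + 1}]\n{c}" for i, c in enumerate(top_chunks))
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_prompt("Summarize long-term market interest rates by month.",
                      ["...page 12 text...", "...page 13 text..."])
# answer = some_llm.generate(prompt)  # placeholder call, not a real API
```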

For enterprises to safely use LLMs without worrying about hallucinations, RAG optimization strategies are essential.

If you’re interested in applying rapidly evolving RAG solutions effectively, Allganize is here to help. Feel free to request a demo.