10/22/2024

Strategies for Addressing Hallucinations in Generative AI - Highlight Feature Implementation

To reduce hallucinations in generative AI, a growing number of services now cite the sources behind their answers. But when a document is long and complicated, a source citation alone makes it hard to locate the actual evidence. Allganize's Alli goes a step further and highlights the supporting passages directly. Lee Jung-hoon, the leader of Allganize's RAG team, explains why the highlighting feature is necessary and how it was implemented.

Many of you have likely tried various LLM-based services such as ChatGPT, Claude, and Gemini, using them to summarize documents, find answers within documents, or ask questions about the content.

Recently, to reduce hallucinations in generative AI models, more services have started providing source citations along with their answers. At Allganize, not only does Alli provide the sources for answers, but it also highlights the specific sections of the document that support the answer.

Lee Jung-hoon, the head of Allganize’s RAG team, will explain in detail why this highlighting feature is necessary and how it is implemented.

1. Why Highlighting is Necessary

One of the major issues with generative models is hallucination, where the model provides inaccurate or fabricated information. Because of this, users often cannot fully trust the responses generated by these models. While some have attempted to address the hallucination issue by rethinking the model training process, no perfect solution has been found yet.

To improve trustworthiness, many generative AI products now present the evidence along with the responses. By showing the sources that support the answer, these models can enhance the reliability of their outputs.

The image below shows an example of a response from Perplexity, where the inclusion of sources alongside the answer increases its credibility.

Allganize's Alli also provides sources for its answers in a preview format. As seen in the image below, references to the supporting documents are marked as [1], [2], [3]. When you click on [2], for example, it shows a preview of the page that contains the evidence for the given answer. This feature allows users to quickly verify the accuracy of the response by reviewing the relevant document.

However, when a page contains a large amount of text, it can take users a considerable amount of time to read and understand the content. This is especially true in complex and unfamiliar domains like finance and law, where understanding the material can be even more challenging. For instance, a customer unfamiliar with the financial domain may find it difficult to quickly read and comprehend a page like the one shown below.

2. What is Highlighting?

To solve the issue of lengthy and complex pages, highlighting the specific portions of the document that serve as the basis for the answer can significantly speed up the review of the content. Allganize's Alli highlights the relevant sections of the preview with a yellow background. The highlighting model developed by Allganize supports multiple languages, including English, Spanish, Korean, and Japanese. When evaluated on test data, the model achieved an accuracy of 82%.

For instance, when asked, "What qualifications are required to register as a financial product sales agent or intermediary?" Alli generated the following response.

As you can see in the image below, the relevant sections that support the answer are highlighted, making it easier to identify the key information quickly.

3. Implementation Method

3-1. Data
The data used for implementing the highlight feature is structured as follows:

The input text consists of a set containing Context, Query, and Answer.

The label is composed of 0s and 1s. If the answer can be derived from the corresponding paragraph, it is labeled as 1; if not, it is labeled as 0. The data used combines open MRC (Machine Reading Comprehension) data from AI-Hub with data that we created in-house.
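As a rough illustration, a single sample in this format might look like the following. The field names here are our own, not Allganize's actual schema.

```python
# Hypothetical training samples in the (Context, Query, Answer) + label format.
samples = [
    {
        "context": "Article 12 (Registration) A sales agent must complete ...",
        "query": "What qualifications are required to register as a sales agent?",
        "answer": "Applicants must complete the designated training course.",
        "label": 1,  # this paragraph supports the answer
    },
    {
        "context": "Article 13 (Fees) The registration fee shall be ...",
        "query": "What qualifications are required to register as a sales agent?",
        "answer": "Applicants must complete the designated training course.",
        "label": 0,  # this paragraph does not support the answer
    },
]
```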

3-2. Query and Document Embedding

In the first method, we simply embedded both the Query and the Context using the Universal Sentence Encoder (USE) and selected the context with the highest similarity score as the basis for the answer. Each context is a chunk of tokens produced by splitting the text on a page into fixed lengths. The advantage of this method is that it uses a Bi-Encoder approach, which results in lower latency: the contexts are pre-embedded and stored in the database, so when a query arrives, only the query needs to be embedded.
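Below is a minimal sketch of this first method, assuming the public multilingual USE model on TensorFlow Hub and cosine similarity as the scoring function; it is an illustration, not Allganize's production code.

```python
import numpy as np
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 -- registers ops needed by the multilingual USE model

# Load the multilingual Universal Sentence Encoder from TF-Hub.
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

# Contexts (page text split into fixed-length chunks) are embedded once,
# offline, and stored; only the query is embedded at request time.
contexts = ["chunk 1 ...", "chunk 2 ...", "chunk 3 ..."]
context_vecs = np.asarray(use(contexts))  # shape: (n_chunks, 512)

def best_chunk(query: str) -> int:
    """Return the index of the chunk most similar to the query."""
    q = np.asarray(use([query]))[0]
    sims = context_vecs @ q / (
        np.linalg.norm(context_vecs, axis=1) * np.linalg.norm(q)
    )
    return int(np.argmax(sims))
```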

In the second method, instead of using just the Query, we combined Query + Answer to create the embedding and then compared it with the context. Since the information in the query alone can be limited, our hypothesis was that adding the answer would increase the likelihood of finding the correct evidence.
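Continuing the sketch above, the only change in the second method is what gets embedded at query time. Concatenating the query and answer with a space is our assumption; the text only says the two were combined.

```python
def best_chunk_with_answer(query: str, answer: str) -> int:
    """Embed Query + Answer together, then compare against the chunks."""
    qa = np.asarray(use([query + " " + answer]))[0]  # concatenation is assumed
    sims = context_vecs @ qa / (
        np.linalg.norm(context_vecs, axis=1) * np.linalg.norm(qa)
    )
    return int(np.argmax(sims))
```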

After conducting experiments, we confirmed that the method that included the answer performed better. The experiment was conducted using four types of test data, and the results are presented as F1 Scores.

3-3. Embedding

Based on the previous experiments, we compared performance across different embedding models. While USE offers fast processing and no practical token limit, its embedding quality is not optimal. We therefore tested how much performance could improve with the higher-performing BGE M3 embedding model. BGE M3 supports up to 8,192 input tokens, reducing the burden of context length. It has also demonstrated the best performance on Allganize's internal Retrieval Leaderboard. (The top-scoring model there was BAAI/bge-multilingual-gemma2, but it is an sLLM-based embedder, making it impractical for real-world use.)
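A sketch of the same selection step with BGE M3, using the FlagEmbedding library, is shown below. Dense vectors from this model come back normalized, so a plain dot product serves as cosine similarity; the query-plus-answer input carries over from the previous method.

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

# BGE M3 accepts up to 8,192 input tokens, so chunks rarely need truncation.
context_vecs = model.encode(contexts, max_length=8192)["dense_vecs"]

def best_chunk_bge(query: str, answer: str) -> int:
    qa_vec = model.encode([query + " " + answer])["dense_vecs"][0]
    sims = context_vecs @ qa_vec  # dense vectors are normalized
    return int(sims.argmax())
```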

As shown in the table below, the performance improved significantly when using the BGE M3 model.

3-4. Overlap

In the previous experiments, the context was split into sections based on a specific length. However, this approach can lead to information loss within the context. To address this, we evaluated performance by overlapping the context segments.
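A simple token-level version of overlapped chunking might look like the following; whitespace splitting stands in for a real tokenizer, and the chunk size and overlap are illustrative rather than the values used in our experiments.

```python
def split_with_overlap(text: str, size: int = 256, overlap: int = 64) -> list[str]:
    """Split text into chunks of `size` tokens, each sharing `overlap`
    tokens with its predecessor, so evidence that straddles a chunk
    boundary is not cut in half."""
    tokens = text.split()
    step = size - overlap
    return [
        " ".join(tokens[i:i + size])
        for i in range(0, max(len(tokens) - overlap, 1), step)
    ]
```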

When evaluated with USE, we found that performance improved when using overlapping context compared to the non-overlapping method.

3-5. BERT

In the previous experiments, which used a Bi-Encoder approach, performance was not as high as expected. We therefore shifted to a Cross-Encoder approach, trading some latency for improved performance. In a Cross-Encoder, the Context and Query are fed into the model together, so inference must be run over the candidate contexts every time a query is submitted.

To reduce the processing time, we applied this method only to the pages filtered through the retriever rather than to the entire context. Although this eased the latency burden, it was still time-consuming, so we experimented with the relatively lightweight BERT model, as sketched below. BERT's maximum input length of 512 tokens can be restrictive, but since the contexts and answers are generally short, most fit within this limit.
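Here is a minimal Cross-Encoder sketch with Hugging Face Transformers. The checkpoint name is a stand-in for Allganize's fine-tuned classifier, and pairing "query + answer" with each candidate chunk follows the two-segment setup described above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NAME = "bert-base-multilingual-cased"  # placeholder; a fine-tuned checkpoint is assumed
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForSequenceClassification.from_pretrained(NAME, num_labels=2)
model.eval()

def evidence_score(query: str, answer: str, chunk: str) -> float:
    """Probability that `chunk` supports the answer (label 1)."""
    inputs = tokenizer(
        query + " " + answer,  # segment A: query and answer together
        chunk,                 # segment B: one candidate chunk
        truncation=True,
        max_length=512,        # BERT's input limit
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()
```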

From this point on, we used a different set of evaluation data, so the only dataset comparable with the previous methods is QASPER. Even with this constraint, we observed that performance improved significantly compared to the other methods.

3-6. Multilingual

Allganize operates not only in the U.S. but also in Korea and Japan, making it essential to develop a highlight model that supports multiple languages. When we initially evaluated the model using actual customer data, the performance was low, as shown below. To improve this, we generated additional data and translated it into Japanese and Korean. After retraining the model with this multilingual data, we observed a significant improvement in performance, as reflected in the results below.

4. Future Work

At Allganize, we aim to push the highlight model's performance beyond 90%. We are currently experimenting with sLLM models in place of BERT, as well as evaluating the model with OpenAI's GPT-4o.

Allganize takes a careful approach to UI/UX and product development to maximize customer convenience. While the core of our RAG solution is to provide accurate answers, we acknowledge that performance cannot always be 100%. Therefore, we’ve implemented methods to compensate for these shortcomings from a service perspective.

If you are looking to apply a highly usable RAG solution, Allganize is here to help. If you would like a demo of our RAG solution, please contact us.