Attempts to streamline Cross-Encoder for real-time generative AI services and optimization through engineering.

2/3/2025

This blog explores optimizing Cross-Encoders for real-time generative AI services. It covers challenges such as high computational cost, slow response times, and efficiency issues. Solutions include lightweight techniques (pruning, distillation, and quantization) combined with engineering optimizations, improving response speeds by up to 30x while reducing costs and enhancing scalability.

1. Introduction: Challenges of Cross-Encoder in Real-Time Search Systems

1.1 Background

Reranking is a key process in search systems that determines the quality of search results, significantly impacting the end-user experience. Among reranking methods, the Cross-Encoder excels in modeling interactions between queries and documents, offering superior accuracy compared to traditional search models like BM25 and Bi-Encoder.

Cross-Encoders achieve state-of-the-art (SOTA) performance in various applications such as search engines, recommendation systems, and Retrieval-Augmented Generation (RAG), greatly enhancing information retrieval and recommendation quality. Their importance is particularly pronounced in tasks requiring complex queries or a deep understanding of context.

1.2 Necessity

However, the structural characteristics of Cross-Encoders pose significant challenges in real-time search systems: because every query must be encoded jointly with each candidate document, computational cost, response time, and memory usage all grow quickly as the number of candidates increases.

To address these issues, model optimization through lightweight techniques is essential. Representative approaches include pruning, distillation, and quantization, with quantization standing out as a powerful tool for significantly improving speed and efficiency. These optimization techniques reduce the burden of the reranking stage and greatly enhance the scalability and efficiency of real-time search systems.

Furthermore, applying deep learning models to real-world services requires consideration of various engineering factors, such as hardware selection, serving platform optimization, and task-specific adjustments. When combining quantization with diverse engineering optimizations, it is possible to achieve speed improvements of up to 10 times or more.

2. Challenges of Applying Cross-Encoder to Real-World Services

2.1 How Cross-Encoder Works
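
A Cross-Encoder concatenates the query and a candidate document into a single input and scores their relevance jointly, which is what gives it both its accuracy and its cost. The sketch below illustrates this scoring loop; the checkpoint name is a public example for illustration, not necessarily the model used in the service.

```python
# Minimal sketch of Cross-Encoder scoring (illustrative public checkpoint).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

query = "how does quantization speed up inference?"
documents = [
    "Quantization converts FP32 weights to lower-precision formats such as FP16 or INT8.",
    "BM25 is a classical lexical ranking function based on term frequencies.",
]

# The query is paired with every candidate document; each pair is encoded jointly.
features = tokenizer([query] * len(documents), documents,
                     padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    scores = model(**features).logits.squeeze(-1)  # one relevance score per pair

# Higher score means more relevant; candidates are reranked by score.
ranking = scores.argsort(descending=True).tolist()
print(scores.tolist(), ranking)
```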

2.2 Structural Characteristics and Challenges of Cross-Encoder

2.3 Solutions to Cross-Encoder Challenges: Lightweight Techniques

3. Pruning: Reducing Weights and Model Structure

3.1 Concept and Principles

Pruning is a technique that reduces computational load and model size by removing less important components from the model.

In Transformer models, pruning can be applied to specific layers, attention heads, or feed-forward networks (FFN).
The goal of pruning is to maximize computational efficiency while minimizing performance loss.
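
As a rough sketch of the idea (not a description of any specific production setup), the snippet below applies magnitude-based pruning to the linear layers of a Transformer reranker and removes a few attention heads; which components count as "less important" would in practice be decided by an importance analysis. The checkpoint is a public example.

```python
# Illustrative sketch: pruning a Transformer reranker with PyTorch utilities.
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "cross-encoder/ms-marco-MiniLM-L-6-v2"  # public example checkpoint
)

# Unstructured pruning: zero out the 30% smallest-magnitude weights of each linear layer.
# Note: zeroed weights only pay off on hardware/kernels that exploit sparsity.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the pruning mask into the weights

# Structured pruning: remove attention heads entirely (layer index -> head indices),
# which shrinks the actual computation rather than just sparsifying it.
model.prune_heads({0: [0, 1], 1: [2]})
```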

3.2 Main Types

3.3 Advantages

3.4 Limitations

3.5 Limitations in Service Application

While pruning can be effective under specific conditions, hardware constraints and overhead issues must be carefully considered when applying it to real-world services.

4. Distillation: Model Compression Through Knowledge Transfer

4.1 Concept and Principles

Distillation is a technique that transfers the knowledge of a large model (Teacher) to a smaller model (Student), reducing the model size and significantly decreasing computational load while maintaining accuracy.

The Student model learns from the outputs (soft labels) or intermediate representations generated by the Teacher model, enabling the development of a lightweight model with performance similar to the Teacher.

The reason for using soft labels instead of hard labels (actual data labels) is that soft labels leverage the output information of the Teacher model, allowing for more diverse and enriched learning.

Consider an image that contains both a "bird" and a "branch." With traditional hard labels, any prediction other than "bird" is treated as entirely incorrect, whereas soft labels produce outputs such as 70% bird and 20% branch, allowing the model to learn more diverse information from the same image.
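
A minimal sketch of this soft-label objective, assuming a standard temperature-scaled setup: the student is trained to match the teacher's softened output distribution with a KL term while still seeing the hard label. The class layout and hyperparameters are illustrative.

```python
# Illustrative sketch of knowledge distillation with soft labels (temperature-scaled KL loss).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend the hard-label loss with a soft-label loss derived from the teacher."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)  # teacher distribution, softened by T
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        soft_targets,
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients stay comparable across temperatures
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy teacher output corresponding to roughly 70% "bird", 20% "branch", 10% other.
teacher_logits = torch.log(torch.tensor([[0.70, 0.20, 0.10]]))
student_logits = torch.randn(1, 3, requires_grad=True)
labels = torch.tensor([0])  # hard label: "bird"

loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```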

4.2 Main Types

4.3 Advantages

4.4 Limitations

4.5 Limitations in Service Application

While distillation is a powerful model compression technique, additional considerations are needed to address training time and domain diversity challenges.

5. Quantization: Model Compression Through Precision Reduction

5.1 Concept and Principles

Quantization is a technique that improves computational efficiency and reduces memory usage by converting the weights and activation values of deep learning models into lower precision formats.

By converting numbers represented in FP32 to FP16 or INT formats, it is possible to reduce both computational load and memory consumption.
In large language models, quantization helps alleviate computational bottlenecks and meets the demands of real-time search systems.
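
As one illustration of the INT path (distinct from the FP16 conversion discussed in 5.4 below), PyTorch's post-training dynamic quantization stores linear-layer weights in INT8 and quantizes activations on the fly. The sketch below uses a public checkpoint and is not a description of the production pipeline.

```python
# Illustrative sketch: post-training dynamic quantization of linear layers to INT8 (CPU inference).
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "cross-encoder/ms-marco-MiniLM-L-6-v2"  # public example checkpoint
)

# Weights of nn.Linear modules are stored in INT8; activations are quantized at runtime.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# The quantized model is a drop-in replacement for CPU inference; each linear weight now
# takes 1 byte instead of 4, cutting memory for those layers by roughly 4x.
```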

5.2 Main Types

5.3 Advantages

5.4 Use Cases of Quantization in Allganize Services

Allganize has significantly improved inference speed by applying Post-Training Quantization (PTQ) to convert FP32 weights and activation values to FP16 in Cross-Encoders.
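
A minimal sketch of that FP32-to-FP16 conversion in PyTorch, assuming GPU inference; the checkpoint is a public example, and in a production setting the cast would typically be applied when exporting the model to the serving engine.

```python
# Illustrative sketch: casting a Cross-Encoder to FP16 for GPU inference (PTQ, no retraining).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # public example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model = model.half().to("cuda").eval()  # FP32 -> FP16: half the memory per weight

features = tokenizer(
    "what is dynamic padding?",
    "Dynamic padding pads each batch only to its longest sequence.",
    truncation=True, return_tensors="pt",
).to("cuda")  # requires a CUDA device

with torch.no_grad():
    score = model(**features).logits.squeeze(-1)
print(score.item())  # relevance score for the query-document pair
```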

5.5 Why Quantization Works Well in Deep Learning

The effectiveness of quantization stems from the fundamental design and learning characteristics of deep learning models, explained as follows:

Quantization leverages these characteristics to maximize model compression and efficiency in real-time services.

6. Engineering Factors for Providing Real-Time Services

To successfully deploy Cross-Encoder-based reranking in real-time search systems, model compression techniques alone are not sufficient.
Various engineering factors such as model serving, hardware optimization, and data transfer efficiency must also be considered.

6.1 Hardware Selection: L4 GPU

6.2 Serving Platform: Triton Server
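
As a client-side illustration of what serving on Triton looks like, the sketch below sends a tokenized query-document pair to a Triton endpoint over HTTP. The model name and tensor names ("cross_encoder", "input_ids", "attention_mask", "logits") are assumptions for illustration; in practice they are defined by the model's Triton configuration, not by this post.

```python
# Hypothetical client-side sketch: querying a Cross-Encoder deployed on Triton Inference Server.
import numpy as np
import tritonclient.http as httpclient
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")
enc = tokenizer("what is reranking?",
                "Reranking reorders retrieved candidates by relevance.",
                return_tensors="np")

client = httpclient.InferenceServerClient(url="localhost:8000")

inputs = []
for name in ("input_ids", "attention_mask"):  # tensor names are assumptions
    tensor = httpclient.InferInput(name, list(enc[name].shape), "INT64")
    tensor.set_data_from_numpy(enc[name].astype(np.int64))
    inputs.append(tensor)

result = client.infer(model_name="cross_encoder", inputs=inputs)  # model name is an assumption
print(result.as_numpy("logits"))  # relevance score(s) for the query-document pair
```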

6.3 Perceived Speed Improvement Through Dynamic Padding
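
Dynamic padding pads each batch only to the length of its longest sequence rather than to a fixed maximum, so short query-document pairs do not pay for computation over padding tokens; this is what drives the perceived latency gains mentioned in the conclusion. A minimal sketch of the difference (tokenizer and texts are illustrative):

```python
# Illustrative sketch: fixed-length padding vs. dynamic (per-batch) padding.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")

queries = ["what is reranking?", "solar panel efficiency"]
documents = ["Reranking reorders retrieved candidates by relevance.",
             "Panel efficiency depends on cell material and temperature."]

# Fixed padding: every batch is padded to the model's maximum length (e.g., 512 tokens).
fixed = tokenizer(queries, documents, padding="max_length", max_length=512,
                  truncation=True, return_tensors="pt")

# Dynamic padding: each batch is padded only to its own longest sequence.
dynamic = tokenizer(queries, documents, padding="longest",
                    truncation=True, return_tensors="pt")

print(fixed["input_ids"].shape, dynamic["input_ids"].shape)
# e.g., torch.Size([2, 512]) vs. a much smaller second dimension: far fewer tokens to process
```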

6.4 Quantization

6.5 Optimization Results

Optimization Results Summary

7. Conclusion

Cross-Encoder-based reranking provides high accuracy in search systems and has established itself as a powerful tool for modeling complex contextual interactions. However, these models can become major bottlenecks in real-time search systems due to high computational costs and memory usage. To address these challenges, lightweight techniques such as pruning, distillation, and quantization, along with various engineering optimizations, have been introduced, each with its own advantages and limitations.

Allganize has significantly improved the performance of real-time search systems by implementing not only model compression techniques but also engineering optimizations such as hardware optimization (L4 GPU), serving platform optimization (Triton Server), and Dynamic Padding.

Through these optimizations, the overall response speed has improved by more than 10 times, with perceived speed improvements of up to 20–30 times thanks to dynamic padding.

Allganize is continuously developing new technologies to provide reliable and fast services. Moving forward, the company plans to research and adopt the latest technologies, such as INT quantization and embedding model optimization. In particular, since support for quantization in embedding models is currently limited, Allganize aims to independently develop lightweight technologies based on TensorRT and CUDA to deliver higher efficiency at lower costs.

These innovations are expected to strengthen the competitiveness of real-time search systems and serve as a foundation for enhancing the user experience.