2/3/2025

DeepDive into DeepSeek-V3: Evaluating the Future of AI Agents with Allganize’s LLM Platform

We evaluate DeepSeek-V3 using Allganize's LLM evaluation platform, highlighting its high performance and cost-efficiency. We also discuss key innovations in DeepSeek-V3, its agent capabilities, and how Allganize's platform streamlines LLM assessments, paving the way for smarter AI-driven decision-making in 2025.

As the landscape of AI agents evolves rapidly, DeepSeek-V3 has emerged as a standout model, praised for delivering exceptional performance at an impressively low cost. In this blog, we dive deep into our comprehensive evaluation of DeepSeek-V3, leveraging Allganize’s newly launched LLM evaluation platform. Beyond showcasing the results, we’ll also provide an inside look at the platform itself—designed to help businesses identify, analyze, and implement the most effective large language models (LLMs) tailored to their unique needs. At Allganize, our mission is to empower clients with the best AI solutions, and our LLM evaluation platform is a key part of that journey.

Agent Benchmark Evaluation Results for DeepSeek-V3

Recently, DeepSeek-V3, an open-source LLM with 671 billion total parameters (roughly 37 billion activated per token), was released. Like ChatGPT and Gemini, it excels at understanding and generating text, but it has several key differences. Its primary goal is to maintain high performance while significantly reducing training costs.

Key Innovations in DeepSeek-V3

  1. MLA (Multi-head Latent Attention):
    An attention mechanism that compresses keys and values into a compact latent representation, so the model can process text during training and inference with far less memory and compute than standard multi-head attention, whose key-value cache grows quickly with context length.
  2. MoE (Mixture-of-Experts) Architecture:
    This structure improves model quality by routing each input to the most suitable expert sub-networks, increasing capacity without proportionally increasing the compute per token (a minimal routing sketch follows this list). To address load imbalance among experts, the model applies an auxiliary-loss-free load balancing method that keeps the experts evenly utilized.
  3. Additional Techniques:
    • MTP (Multi-Token Prediction): Enhances training efficiency.
    • FP8 (8-bit Floating Point): Reduces computational costs.
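
To make the MoE idea concrete, here is a minimal top-k routing sketch in Python/NumPy. It is purely illustrative, not DeepSeek-V3's implementation: the expert count, hidden size, and linear "experts" are assumptions chosen for brevity, and it omits the auxiliary-loss-free load balancing described above.

```python
# Toy top-k Mixture-of-Experts routing: a router scores experts per token,
# the top-k experts are applied, and their outputs are mixed by gate weights.
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 8   # hypothetical expert count, for illustration only
TOP_K = 2         # experts activated per token
D_MODEL = 16      # hidden size

# Each "expert" is just a random linear projection in this sketch.
experts = [rng.normal(size=(D_MODEL, D_MODEL)) for _ in range(NUM_EXPERTS)]
router_w = rng.normal(size=(D_MODEL, NUM_EXPERTS))

def moe_layer(x):
    """Route each token (row of x) to its top-k experts and mix their outputs."""
    logits = x @ router_w                               # (tokens, experts)
    top_idx = np.argsort(logits, axis=-1)[:, -TOP_K:]   # chosen experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gate_logits = logits[t, top_idx[t]]
        gates = np.exp(gate_logits) / np.exp(gate_logits).sum()  # softmax over chosen experts
        for gate, e in zip(gates, top_idx[t]):
            out[t] += gate * (x[t] @ experts[e])
    return out

tokens = rng.normal(size=(4, D_MODEL))   # 4 example "tokens"
print(moe_layer(tokens).shape)           # -> (4, 16)
```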

Benchmark Results

DeepSeek-V3 has shown outstanding performance across various benchmarks and is considered one of the most powerful open-source models available.

Agent Performance Evaluation

There was no publicly available data on DeepSeek-V3's performance as an agent, so we conducted our own evaluation using Allganize's proprietary LLM evaluation platform. The results showed that DeepSeek-V3 performed on par with or slightly below GPT-4o mini on agent tasks. This rapid assessment of its agent capabilities was made possible by our platform, which is specifically designed to evaluate the agent functionality of LLMs.

Agent benchmark evaluation results from Allganize's LLM evaluation platform.

New possibilities brought by LLM-based agents

Agents have long been considered a crucial path toward AGI (Artificial General Intelligence). An agent must be able to plan independently, make autonomous decisions, and solve problems across a wide range of tasks with human-like intelligence.

Before the advent of LLMs, most AI models were trained within limited environments on relatively simple problem scenarios. Early AI models functioned efficiently only within specific domains or constrained conditions, lacking the capability to handle complex problems beyond those boundaries. These limitations posed significant barriers for agents to make human-like decisions and solve problems across diverse domains.

However, recent advancements in LLMs have shown the potential to overcome these limitations. LLMs bring broad knowledge and problem-solving ability to diverse environments and have moved beyond merely answering simple questions, laying the foundation for agents capable of autonomously performing complex tasks. As a result, research on building agents that make decisions like humans has gained momentum, and LLM-based agents are expected to serve as automated systems that efficiently tackle complex challenges across many fields.

How LLMs perform as agents

An LLM can essentially be thought of as a function that generates appropriate responses to user-provided prompts. However, to solve complex problems, an LLM must go beyond simply generating answers and develop the capability to actually resolve issues. This requires enabling the LLM to act independently. By predefining various tools tailored to the problem at hand, providing information about each tool to the LLM, and allowing the LLM to select and utilize the appropriate tools, it can function as an agent. This approach, known as "tool calling," refers to the ability of an LLM to invoke external tools to solve more complex tasks.
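
A minimal, generic sketch of this tool-calling loop is shown below. Everything in it is illustrative: `call_llm` is a hypothetical stand-in for whatever chat-completion API is used, and the JSON tool-call format is an assumption rather than any specific vendor's protocol.

```python
# Generic tool-calling loop: describe the tools to the model, let it decide
# whether to emit a structured tool call, execute the tool, then ask the model
# to answer using the tool's result.
import json

def get_weather(city: str) -> str:
    """Example tool: a real agent would call an actual weather API here."""
    return f"It is 21°C and sunny in {city}."

# Tool registry: the descriptions are what the LLM "sees" in its prompt.
TOOLS = {
    "get_weather": {
        "fn": get_weather,
        "description": "Return the current weather for a city.",
        "parameters": {"city": "string"},
    },
}

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a chat-completion API. A real model would read
    # the tool descriptions in the prompt and decide whether to answer directly
    # or emit a structured tool call.
    if "Tool result:" in prompt:
        return "Based on the tool result, it is sunny and 21°C in Seoul."
    return json.dumps({"tool": "get_weather", "arguments": {"city": "Seoul"}})

def run_agent(user_message: str) -> str:
    tool_docs = json.dumps(
        {name: {"description": t["description"], "parameters": t["parameters"]}
         for name, t in TOOLS.items()},
        indent=2,
    )
    prompt = f"Tools available:\n{tool_docs}\n\nUser: {user_message}"
    reply = call_llm(prompt)
    try:
        decision = json.loads(reply)
    except json.JSONDecodeError:
        return reply                                    # model answered in plain text
    if decision.get("tool") in TOOLS:                   # model chose to call a tool
        result = TOOLS[decision["tool"]]["fn"](**decision["arguments"])
        return call_llm(f"{prompt}\n\nTool result: {result}\n\nAnswer the user.")
    return reply

print(run_agent("What's the weather like in Seoul?"))
```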

Additionally, as an agent, an LLM must possess the ability to remember and understand the flow of a conversation through continuous interaction with users. Accurately identifying user needs and taking corresponding actions is a critical function that goes beyond simple Q&A. To effectively perform as an agent, an LLM must combine knowledge of various domains, the ability to call tools for problem-solving, an understanding of conversational context, and the capability to appropriately utilize acquired information. Only when these abilities are integrated can an LLM successfully act as a proactive and autonomous agent.

Learning methods to enhance agent capabilities

Allganize has researched and developed training methods for an LLM based on the Qwen2.5 3B Instruct model, achieving excellent benchmark performance.

KoMTBench benchmark evaluation results

Successive training stages showed continued performance improvements over the base Qwen2.5 3B Instruct model.

This raised an important question: Can a high-performing LLM effectively function as an agent? The answer is that not all LLMs can perform the role of an agent effectively. Simply understanding requests and generating responses is not enough; tool calling abilities, understanding conversational context, and autonomous reasoning to solve complex problems are also required. Thus, basic language modeling capabilities alone are insufficient, necessitating additional training and fine-tuning to address these gaps.

FunctionChat benchmark evaluation results

The Qwen model we selected already had some degree of tool calling capability. When we examined the stage models' detailed metrics, scores for completion (answer generation), slot (asking for missing information), and relevance (declining tasks it cannot perform) were maintained or improved; these categories are characterized by natural-language output. In contrast, the call category, which requires responses in a specific structured format, showed a decline in scores.
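
To make the category distinction concrete, the snippet below contrasts a structured call output with the natural-language outputs of the other categories. It is a hypothetical illustration; the exact schema used by FunctionChat-Bench may differ.

```python
# Hypothetical illustration of the output categories discussed above.
expected_outputs = {
    # "call": a machine-parseable function invocation in a strict format
    "call": {"name": "get_weather", "arguments": {"city": "Seoul"}},
    # "completion": a natural-language answer built from a tool's result
    "completion": "It is currently 21°C and sunny in Seoul.",
    # "slot": a natural-language request for missing information
    "slot": "Which city would you like the weather for?",
    # "relevance": a natural-language refusal when no suitable tool exists
    "relevance": "I'm sorry, I don't have a tool that can book flights.",
}

for category, output in expected_outputs.items():
    print(f"{category}: {output}")
```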

In conclusion, for an LLM to properly function as an agent, it must go beyond simple language modeling to accurately understand context, effectively utilize appropriate tools, and make autonomous judgments. This enables the resolution of complex problems and requires specialized training data and ongoing tuning. Allganize is committed to developing models that can excel in performing as agents.

Efforts to evaluate LLMs that excel in performing as agents

Allganize is deeply focused on understanding how effectively existing LLMs can operate as agents to develop better-performing models. To this end, we have built an evaluation platform using various benchmarks to assess LLM capabilities and identify which models and training methods are best suited to enhancing agent performance.

For an LLM to succeed as an agent, it must go beyond simply responding to requests. It should select and execute appropriate tools, make meaningful decisions in complex conversational contexts, and work effectively in environments that involve multiple tools and diverse domains. This includes handling not only natural language queries but also solving problems that require multi-step interactions. Using these criteria, we comprehensively evaluated LLMs’ agent performance with three public benchmarks.

First, we used the BFCL V3 benchmark to assess LLMs' tool calling capabilities in various scenarios. BFCL V3 tests multi-turn conversations in which the model must choose the right function, supply correct arguments, ask for missing parameters, and recognize when none of the available functions apply.

To complement this, we used Kakao's FunctionChat-Bench, which consists of two components: single-call scenarios that check whether the model emits a correct, well-formed function call, and multi-turn dialogue scenarios that evaluate how the model uses tools over the course of a conversation.

Allganize Agent Benchmark on the LLM Evaluation Platform 

Allganize's LLM Evaluation Platform

Allganize's LLM evaluation platform analyzes how effectively LLMs can perform as agents through agent evaluation benchmarks, enabling the identification of models best suited for our agent services. Additionally, the platform comprehensively evaluates general language understanding, knowledge levels, and instruction-following capabilities. It employs over 10 benchmarks to assess models from multiple perspectives, focusing on improving both efficiency and convenience throughout the process.

In the past, evaluating a new model meant running a separate evaluation script for each benchmark, a time-consuming and labor-intensive process repeated for every model. To address this, Allganize developed a platform that automates the evaluation process and makes it more accessible for developers. Now, through a dashboard interface, developers only need to enter the name of a newly developed model; the platform automatically integrates the model's API, runs the benchmark evaluations, and analyzes the results. This streamlined process not only simplifies evaluations but also lets developers focus more on research and development.
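
To illustrate the kind of workflow the platform automates, here is a hedged sketch: register a model by name, fan benchmark jobs out against its API, and collect scores. The benchmark names, `query_model` function, and grading logic are illustrative assumptions, not the platform's actual internals.

```python
# Minimal sketch of an automated evaluation pipeline driven by a model name.
from dataclasses import dataclass
from typing import Callable

@dataclass
class BenchmarkResult:
    benchmark: str
    score: float

def query_model(model_name: str, prompt: str) -> str:
    # Hypothetical stand-in for the API integration the platform configures
    # once a model name is registered.
    return f"[{model_name}] response to: {prompt}"

def run_benchmark(model_name: str, benchmark: str, prompts: list[str],
                  grade: Callable[[str], float]) -> BenchmarkResult:
    """Send every prompt to the model and aggregate the graded scores."""
    scores = [grade(query_model(model_name, p)) for p in prompts]
    return BenchmarkResult(benchmark, sum(scores) / len(scores))

def evaluate(model_name: str) -> list[BenchmarkResult]:
    # Toy suites; a real platform would run 10+ benchmarks, including the
    # agent benchmarks discussed above.
    suites = {
        "toy-tool-calling": ["Call the weather tool for Seoul."],
        "toy-instruction-following": ["Reply with exactly one word."],
    }
    dummy_grader = lambda response: float(len(response) > 0)  # stand-in metric
    return [run_benchmark(model_name, name, prompts, dummy_grader)
            for name, prompts in suites.items()]

for result in evaluate("my-new-model"):
    print(f"{result.benchmark}: {result.score:.2f}")
```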

One significant issue with previous LLM benchmark evaluations was the lengthy evaluation time. Improvements were made to ensure efficient evaluation even with large datasets and complex benchmarks. For example, the ArenaHard benchmark, which previously took about 1 hour and 30 minutes to evaluate a single model, now takes only about 20 minutes on the LLM evaluation platform.
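
A common way to achieve this kind of speedup is to send evaluation requests concurrently rather than one at a time. The sketch below shows the general idea with Python's ThreadPoolExecutor; it illustrates the technique only and is not the platform's actual implementation, and `query_model` is a hypothetical stand-in for an LLM API call.

```python
# Concurrent benchmark evaluation: dispatch prompts in parallel instead of
# sequentially, so wall-clock time is dominated by a few batched round trips.
import time
from concurrent.futures import ThreadPoolExecutor

def query_model(prompt: str) -> str:
    time.sleep(0.5)                       # simulate network + generation latency
    return f"response to: {prompt}"

prompts = [f"ArenaHard-style question #{i}" for i in range(20)]

start = time.time()
with ThreadPoolExecutor(max_workers=8) as pool:
    responses = list(pool.map(query_model, prompts))
elapsed = time.time() - start

print(f"{len(responses)} prompts evaluated in {elapsed:.1f}s "
      f"(sequentially this would take ~{0.5 * len(prompts):.0f}s)")
```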

Automated Evaluation Page of the LLM Evaluation Platform 

On April 9, 2024, Allganize unveiled a financial-specialized LLM leaderboard. This has now been replaced with a new LLM evaluation platform that assesses the agent capabilities of LLMs. We plan to continuously update and enhance the new LLM evaluation platform introduced in today’s newsletter, as Allganize remains committed to advancing LLMs.

As we continue to push the boundaries of what’s possible, stay tuned for more insights and updates in the months ahead.

Here’s to a productive and AI-powered 2025! If you want to know more, contact us today.