The DeepSeek Shock: A ‘Cost-Effective’ Language Model Challenging GPT

Sungmin Woo

February 20, 2025

Hello, this is Sungmin Woo from the ENERZAi Business Development Team. Recently, there’s been a lot of buzz surrounding an AI-based chatbot developed by a Chinese AI startup called DeepSeek. This product is built on a Large Language Model (LLM) that DeepSeek developed in-house, and it has garnered significant attention in the market for being trained at a much lower cost compared to leading LLMs such as OpenAI’s GPT or Meta’s Llama.

In this post, we’ll explore why DeepSeek has captured the market’s attention to this extent and how they managed to develop a high-performing LLM at such a low cost.

What is DeepSeek?

DeepSeek originated from the Gen-AI Lab founded by the Chinese hedge fund High-Flyer, which initially focused on AI algorithms for trading. The company was established in 2023 by Liang Wenfeng, a co-founder of High-Flyer. Starting with the release of DeepSeek Coder in November 2023, they have continuously introduced new models, such as DeepSeek LLM, V2, and V3. Notably, they have made all of the models’ code, weights, and training methods open-source, drawing even more attention.

In particular, the launch of an AI chatbot app based on DeepSeek-V3 on January 20 instantly sparked an explosive surge of interest in DeepSeek. This AI assistant supports logical reasoning and real-time problem-solving, and it is especially strong in math and coding tasks. Reportedly, it can match the performance of OpenAI’s GPT-4o and Meta’s Llama 3.1. The most surprising aspect, however, is that it was trained for less than one-tenth of the cost of other LLMs of a similar scale.

According to DeepSeek, the DeepSeek-V3 model was trained in just 55 days using only 2,000 NVIDIA H800 GPUs, and the total training cost was a mere $5.57 million. Considering that over $100 million was reportedly spent to train OpenAI’s GPT-4, it’s clear why DeepSeek has become such a hot topic. In fact, one week after its release, DeepSeek’s AI chatbot app overtook ChatGPT to become the most downloaded free app on the U.S. iOS App Store.

DeepSeek’s rise to prominence has even raised questions about the competitiveness of established AI companies. On January 27, NVIDIA shares plunged 16.84% on the U.S. stock market, wiping out about $600 billion in market value. Other big tech firms investing heavily in AI also saw sharp declines, including Broadcom (down 18.15%), Oracle (down 13.39%), Arm (down 10.19%), and TSMC (down 13.33%), the world’s largest contract chipmaker.

While the performance benchmarks and training cost data released by DeepSeek require further verification, it seems undeniable that DeepSeek has delivered a major shock to today’s AI market.

Core Technology: What makes DeepSeek so special?

DeepSeek first introduced DeepSeek Coder in November 2023, then successively launched DeepSeek LLM, V2, and V3 before releasing the R1 series on January 20 of this year. The R1 series consists of:

  • DeepSeek-R1-Zero: Enhanced reasoning performance from DeepSeek-V3-Base through pure reinforcement learning

  • DeepSeek-R1: Trained with additional fine-tuning and reinforcement learning (building on the R1-Zero approach) for further performance gains

  • DeepSeek-R1-Distill: Created by distilling R1’s reasoning capability into open-source LLMs such as Llama and Qwen

Ultimately, the much-discussed R1 model shares the same architecture as the earlier V3 model. DeepSeek-R1 is said to contain 671 billion parameters, making it the largest open-source AI model released to date.

According to DeepSeek, its language models (V3 or R1) were trained at significantly lower costs than other existing language models like Llama 3.1 or GPT-4o, yet they deliver equal or even superior performance.

Indeed, they have demonstrated excellent performance across most benchmarks, particularly in mathematical reasoning and coding tasks.

So, how did DeepSeek drastically reduce the massive training costs associated with LLMs? Let’s dive into the training process of DeepSeek’s language model (R1) and take a closer look at the technologies applied along the way.

How DeepSeek-R1 is Trained

Generally, an LLM is trained in the following three steps:

  1. Language Modeling: Using large volumes of text data (books, documents, web pages, etc.) so that the base model learns language comprehension skills such as sentence structure, word usage, and context. This is typically done either by predicting missing words in a sentence (auto-encoding) or by predicting the next word from the preceding words (auto-regressive); a minimal sketch of the auto-regressive objective follows this list.

  2. Supervised Fine-Tuning (SFT): Further training the base model so that it can accurately interpret user instructions and produce appropriate answers.

  3. Preference Tuning: Adjusting the model so that its outputs align with human preferences, typically involving a reward model in reinforcement learning.
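To make step 1 concrete, below is a minimal sketch of the auto-regressive (next-word prediction) objective, written in PyTorch. The model, tensor shapes, and names are illustrative placeholders rather than any particular implementation.

    # Minimal sketch of the auto-regressive language-modeling objective (step 1).
    # "model" is any module mapping token ids (batch, seq_len) to logits of shape
    # (batch, seq_len, vocab_size); all names here are illustrative.
    import torch
    import torch.nn.functional as F

    def next_token_loss(model, token_ids):
        inputs = token_ids[:, :-1]    # tokens the model is allowed to see
        targets = token_ids[:, 1:]    # the "next word" at every position
        logits = model(inputs)        # (batch, seq_len - 1, vocab_size)
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),  # flatten positions
            targets.reshape(-1),
        )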

With this in mind, let’s examine how DeepSeek-R1 is trained.

(Training pipeline diagram source: https://newsletter.languagemodels.co/p/the-illustrated-deepseek-r1)

DeepSeek-R1 is a fine-tuned version of the DeepSeek-V3 Base model, sharing the same architecture. The R1 model was trained by adding about 600,000 reasoning data points (e.g., math, coding) and about 200,000 non-reasoning data points (e.g., writing, translation) on top of the V3 Base model. The training process can be divided broadly into two stages: SFT and reinforcement learning.

Supervised Fine-Tuning (SFT)

A noteworthy point during the SFT stage is that the roughly 800,000 data points used were not labeled by humans but rather generated by other DeepSeek AI models. Non-reasoning data came from DeepSeek-V3, while a separate “reasoning model” produced the reasoning data (referred to as the “Interim reasoning model” in the diagram).

DeepSeek chose this approach because, starting from just a few thousand labeled reasoning examples, it could build a model capable of generating the roughly 600,000 reasoning data points (each consisting of an instruction and the corresponding output) needed for training. This is closely related to the training of the R1-Zero model, which we’ll revisit in the reinforcement learning section below.
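For illustration, one such generated reasoning data point might look like the record below. The field names and content are hypothetical assumptions, not DeepSeek’s actual data schema.

    # Hypothetical shape of one generated reasoning data point used for SFT;
    # field names and content are illustrative, not DeepSeek's actual format.
    reasoning_example = {
        "instruction": "Write a Python function that returns the n-th Fibonacci number.",
        "reasoning": "Iterate from 0 to n, keeping track of the last two values...",
        "output": "def fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a",
    }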

Reinforcement Learning

The key reinforcement learning technique applied to R1 is GRPO (Group Relative Policy Optimization). This method enabled the R1-Zero model to achieve strong performance on reasoning tasks without additional SFT.

In the conventional reinforcement learning framework (Proximal Policy Optimization, PPO), a reward model scores the generated answers and a separate value model provides the baseline used for training. GRPO does not require a value model: it groups multiple outputs generated from the same input and uses the group’s average score (as measured by a reward model) as the baseline. Dropping the value model significantly reduces memory usage. In the R1 series, even the reward step for reasoning tasks is handled by a rule-based system, effectively removing the need for a reward model as well. As a result, there is minimal human intervention in the training process.
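The group baseline is easy to express in code. The sketch below computes GRPO-style advantages for one prompt under simple assumptions; it omits the rest of the objective (clipped policy ratios and the KL penalty) and uses made-up reward values.

    def group_relative_advantages(rewards):
        """GRPO-style advantages: each sampled output is scored relative to its
        own group. The group mean serves as the baseline, replacing the value
        model that PPO would use."""
        mean = sum(rewards) / len(rewards)
        std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
        return [(r - mean) / (std + 1e-8) for r in rewards]

    # Four sampled answers to the same prompt, scored by a reward function:
    print(group_relative_advantages([0.2, 0.5, 0.1, 0.9]))
    # Outputs above the group average get positive advantages and are reinforced.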

Let’s briefly see how GRPO works through a simple example:

In this example, a prompt asks for Python code that meets certain conditions, and the training applies the following rules:

  • Is the code written with correct Python syntax?

  • Does the code run successfully?

  • Does the code pass unit tests? (The unit tests themselves are generated by another LLM.)

Among four different outputs, the fourth one received the highest score. GRPO then updates the model so that it can generate high-scoring outputs more consistently.
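A rule-based reward of this kind could be implemented as a simple scoring function. The sketch below combines the three checks listed above into one scalar score; the weights and the way the unit test is executed are assumptions for illustration, not DeepSeek’s actual reward scheme.

    import ast
    import subprocess
    import sys
    import tempfile

    def rule_based_code_reward(code: str, unit_test: str) -> float:
        """Score one generated answer with rule-based checks (weights are illustrative)."""
        score = 0.0
        # 1. Is the code written with correct Python syntax?
        try:
            ast.parse(code)
            score += 0.2
        except SyntaxError:
            return score
        # 2./3. Does the code run, and does it pass the LLM-generated unit test
        # appended after it? (Both checks are folded into one execution here.)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code + "\n" + unit_test)
            path = f.name
        try:
            result = subprocess.run([sys.executable, path], capture_output=True, timeout=10)
        except subprocess.TimeoutExpired:
            return score
        if result.returncode == 0:
            score += 0.8
        return score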

The “Interim reasoning model” mentioned in the SFT section was trained using a similar approach.

Concretely, a few thousand cold-start reasoning data points generated by R1-Zero were post-processed to address R1-Zero’s known issues with low readability and language mixing, then used to fine-tune the V3 Base model, followed by reinforcement learning. DeepSeek then leveraged this trained model to easily acquire the 600,000 reasoning data points needed for training the R1 model.

In summary, the reinforcement learning process applied to R1 can be depicted as follows:

  • Reasoning: SFT on data generated by the Interim reasoning model, followed by rule-based reinforcement learning

  • Non-reasoning: SFT on data generated by V3, followed by additional reinforcement learning in a more conventional manner (using helpfulness and safety reward models, similar to the approach used for Llama)

Architecture

MoE (Mixture-of-Experts)

MoE is a structure in which different experts handle different tasks instead of a single model handling everything. In LLMs, the experts typically specialize at the token level: when an input arrives, a router sends each token to the most relevant experts, and their outputs are combined to produce the result.

Although R1 has around 671 billion parameters, only about 37 billion are reported to be activated for each token when a question is posed. DeepSeek’s MoE architecture is designed so that each expert concentrates on a more fine-grained domain of knowledge (fine-grained expert segmentation), with some experts always active to handle fundamental knowledge (shared experts). This allows DeepSeek’s LLM to use less memory and to process requests more quickly and efficiently than other models.
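The sketch below shows the basic routing idea with a shared expert that is always active. The dimensions, expert count, and top-k value are illustrative placeholders, and it evaluates every expert densely for clarity, which a real MoE layer would not do; it is not DeepSeek’s actual DeepSeekMoE implementation.

    import torch
    import torch.nn as nn

    class TinyMoE(nn.Module):
        """Toy MoE layer: a router picks the top-k experts for each token, and a
        shared expert processes every token. Sizes are illustrative only."""

        def __init__(self, dim=64, num_experts=8, top_k=2):
            super().__init__()
            self.router = nn.Linear(dim, num_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
                for _ in range(num_experts)
            )
            self.shared_expert = nn.Sequential(
                nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
            )
            self.top_k = top_k

        def forward(self, x):                                 # x: (tokens, dim)
            gate = self.router(x).softmax(dim=-1)             # routing probabilities
            weights, idx = gate.topk(self.top_k, dim=-1)      # keep only the top-k experts
            sparse_gate = torch.zeros_like(gate).scatter(-1, idx, weights)
            # Every expert is evaluated densely here for clarity; a real MoE layer
            # runs each expert only on the tokens routed to it.
            expert_out = torch.stack([e(x) for e in self.experts], dim=1)
            routed = (sparse_gate.unsqueeze(-1) * expert_out).sum(dim=1)
            return self.shared_expert(x) + routed             # shared expert sees every token

    out = TinyMoE()(torch.randn(10, 64))                      # 10 tokens, 64-dim each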

MLA (Multi-Head Latent Attention)

MLA is a technique that compresses the information in the KV cache (Key-Value cache), referenced during attention operations, into a latent space. This drastically reduces memory usage and accelerates response times.

When generating an answer (decoding), the attention operation computes a new Query at each token step, while the Keys and Values of previous tokens are reused from a cache known as the KV cache. This caching mechanism greatly speeds up inference, but it can also cause memory bottlenecks in LLMs, because the cache grows with the context window size (the maximum number of tokens the model can reference).
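A back-of-envelope calculation shows why. The model shape and context length below are hypothetical, not DeepSeek’s configuration, but the scale of the result illustrates the bottleneck.

    # Rough KV-cache size for a hypothetical dense model (numbers are illustrative).
    layers, heads, head_dim = 60, 64, 128     # assumed model shape
    context_len = 128_000                     # tokens kept in the cache
    bytes_per_value = 2                       # fp16 / bf16

    # Both a Key and a Value vector are cached for every layer, head, and token.
    kv_bytes = 2 * layers * heads * head_dim * context_len * bytes_per_value
    print(f"{kv_bytes / 2**30:.1f} GiB per sequence")   # about 234 GiB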

MLA was first introduced in DeepSeek-V2, released in May 2024. According to DeepSeek, MLA reduces KV cache memory usage by 93.3% compared to standard attention.
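The core idea can be sketched in a few lines: cache one small latent vector per token and up-project it to full Keys and Values only when they are needed. The dimensions below are illustrative, and real MLA includes details (such as decoupled positional-encoding keys) that are omitted here.

    import torch
    import torch.nn as nn

    dim, latent_dim, heads, head_dim = 4096, 512, 32, 128   # illustrative sizes

    down_proj = nn.Linear(dim, latent_dim, bias=False)      # compress the per-token state
    up_k = nn.Linear(latent_dim, heads * head_dim, bias=False)
    up_v = nn.Linear(latent_dim, heads * head_dim, bias=False)

    x = torch.randn(1, dim)             # hidden state of the newest token
    latent = down_proj(x)               # (1, latent_dim) -- only this is cached
    k, v = up_k(latent), up_v(latent)   # full Keys/Values recovered on demand

    # Cache footprint per token: latent_dim values instead of 2 * heads * head_dim.
    print(latent_dim, "vs", 2 * heads * head_dim)           # 512 vs 8192, ~16x smaller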

Implications

In a July 2024 interview, Anthropic CEO Dario Amodei estimated that roughly $100 million was spent on training the currently available GPT-4o. He noted that the LLM his company is currently developing will cost ten times more ($1 billion) to train, and he anticipates LLM training expenses could reach $100 billion within the next three years.

As AI continues to be rapidly adopted across a growing range of applications, and expectations for AI performance continue to rise, the costs associated with training and inference are skyrocketing. It’s well-known that the massive costs of developing and maintaining AI have been a significant barrier to AI research and widespread deployment. So far, these costs have been seen as an inevitable trade-off for achieving high performance.

Against this backdrop, the market’s response to DeepSeek suggests that the AI industry — once centered on performance alone — may be shifting toward a focus on achieving solid performance at a more reasonable cost. Furthermore, since DeepSeek has open-sourced the R1 model, research into developing AI models that are both high-performing and cost-effective will likely accelerate.

Optimization for AI: From Training to Inference

While DeepSeek’s language models also incorporate techniques that enhance inference performance (e.g., MLA), the core of DeepSeek’s competitive edge lies in optimizing the training stage to reduce costs. In contrast, ENERZAi’s Optimium aims to optimize the inference stage by accelerating performance and reducing inference costs.

Although the astronomical expenses of training LLMs currently draw the most attention, for companies offering AI-based products or services, success ultimately depends on inference performance within the target hardware environment and the recurring costs of inference.

The ENERZAi team’s Optimium offers a clear competitive advantage in inference performance and flexibility compared to other existing inference engines. It maximizes inference speed while preserving the accuracy of trained AI models, and it enables a single product to deploy models optimized for various hardware environments.

Optimium is currently in beta testing. The beta version supports inference optimization for CPU-based CNN models. Over the course of this year, we plan to expand hardware support to include GPUs and extend model support to Transformers. If you have any needs related to accelerating inference speed or cutting inference costs, we encourage you to apply for our beta.

https://wft8y29gq1z.typeform.com/to/fp059MY5
