February 3, 2025 – Over the past week, a new large language model from the Chinese company DeepSeek has not only thrown the tech world into turmoil, but has also dominated business news and the general news stream worldwide. We want to understand exactly what happened, and today we are talking with Jannik Malte Meissner, an AI entrepreneur and technology expert who trains and fine-tunes LLMs himself.
ARIC: Jannik, you’ve experienced the excitement surrounding DeepSeek R1 like the rest of us, but as an expert you certainly have a deeper perspective that interests us. We want to break down together what the technology behind DeepSeek’s new models is and what consequences this has for the AI scene, but also for all companies that want to use AI.
So, what exactly has happened at DeepSeek to make everyone so excited and nervous? Is there a serious background to this or is it just short-term hype?
Jannik Meissner: A quick note, as there is a lot of confusion in the reporting: DeepSeek V3, the basic model, has been around since December last year and stands out because it is based on a very efficient training process. Now comes DeepSeek R1, which is based on reinforcement learning (hence the R1) and follows a new post-training approach that can even compete with the best proprietary models from OpenAI and Anthropic. It is the first open model to top many benchmarks.
It has been claimed that the DeepSeek model was trained for just six million dollars. Philipp Schmid from Hugging Face said that this might be enough for the basic costs, i.e. for training the base model alone. But there is a lot more to the R1 model. Can you break down how such a model is trained in the first place, what other stages are involved, and estimate the additional costs?
The frequently cited costs refer primarily to a statement in the paper on DeepSeek V3, i.e. the base model on which R1 is built. However, this is only a hypothetical sum, based on an assumed price of two dollars per hour of use for the H800 GPUs employed here. In principle, I agree that the number of hours and the assumed price are realistic, but there are a few caveats. First of all, it must be said up front that High-Flyer, the company behind DeepSeek, does not rent the GPUs. My understanding is that they primarily use GPUs that are not needed for their core business (High-Flyer is a financial firm specializing in quantitative trading). Accordingly, two dollars per hour per GPU is not the actual cost.
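The arithmetic behind the headline figure can be sketched in a few lines; the total GPU-hour count and the two-dollar rate are the assumptions just discussed, not actual invoiced costs:

```python
# Back-of-the-envelope version of the widely cited training-cost figure.
# Assumption: roughly 2.8 million H800 GPU hours in total, billed at a
# hypothetical rental rate of $2 per GPU hour.
gpu_hours = 2_800_000
rate_usd_per_gpu_hour = 2.0

estimated_cost = gpu_hours * rate_usd_per_gpu_hour
print(f"${estimated_cost:,.0f}")  # prints "$5,600,000"
```

This is where the "just under six million dollars" headline comes from, before any of the hidden development costs are counted.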
In addition, many other costs are also incurred during the development period, for example in the hyperparameter search or the comparison of different initializations, the comparison of different methods in training and more, which have not been included here either. For these processes, many smaller models are often trained on a test basis in order to compare different aspects.
Large amounts of text data were also required for the first training phase, the so-called pre-training. The amount is roughly the same as what Meta used for Llama 3 pre-training. It can also be assumed that at least some of this data was generated by other models, such as OpenAI’s GPT models, Meta’s Llama 3 or Anthropic’s Claude. This also incurs costs that were not included in the calculation.
Ok, then why don’t you describe how such a model is developed? You’ve done this yourself many times, why don’t you take us through the process?
So if we proceed step by step:
- The first step is to prepare the data. This involves automatically filtering and sorting very large volumes of data. The paper does not go into this.
- The next step is to experiment to determine the optimal configuration of the model, i.e. the hyperparameters, the total amount of training data and optimizations for the training process. This is also not mentioned here.
- This is followed by the pre-training phase, which is specified as approximately 2.7 million GPU hours. This results in the DeepSeek V3 base model.
- This is followed by several post-training phases. The DeepSeek R1 paper, however, gives no information on how much computing power was used here. Before these phases were settled on, the DeepSeek R1-Zero experiment was carried out, in which only reinforcement learning was used. As this did not produce satisfactory results, that model was not published. Without it, however, the development of the final model would not have been possible.
The final model was first trained on data from the R1-Zero model and other unnamed sources, then the same method as for R1-Zero was applied, followed by supervised fine-tuning for text-heavy tasks and then a further reinforcement learning phase. However, there is no information on the costs involved.
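The stages above can be summarized as a simple outline; the stage names and descriptions are my paraphrase of what the papers describe, not DeepSeek's actual code:

```python
# Illustrative outline of the development stages described above.
# Compute figures are only disclosed for pre-training; the rest is unknown.
pipeline = [
    ("data_preparation",      "filter and sort very large text corpora"),
    ("hyperparameter_search", "many small trial runs to pick a configuration"),
    ("pre_training",          "~2.7M GPU hours -> DeepSeek V3 base model"),
    ("cold_start_sft",        "fine-tune on curated data, incl. R1-Zero outputs"),
    ("reasoning_rl",          "reinforcement learning, as in the R1-Zero experiment"),
    ("general_sft",           "supervised fine-tuning for text-heavy tasks"),
    ("final_rl",              "a further reinforcement learning phase"),
]

for stage, description in pipeline:
    print(f"{stage}: {description}")
```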
“there are some indications that models such as Llama, GPT-4o, O1 or Claude were used as so-called ‘teacher models’“
You just mentioned “other unknown sources”. One accusation made against DeepSeek is that Large Language Models (LLMs) such as GPT-4 were used to train the model (which is forbidden according to the ToS of the provider OpenAI) or that content was even copied illegally. Could there be something to it?
I think that’s very realistic. The models often refer to themselves as “ChatGPT from OpenAI”. Of course, these text fragments can often be found like this on the internet today, and they could possibly have been learned through simple web scraping, but there are some indications that models such as Llama, GPT-4o, o1 or Claude were used as so-called “teacher models”. This is something that does not only apply to DeepSeek, though. And it alone is not the reason for the good performance: in my opinion, it is more likely that this data was used to train text skills such as poetry and translation, and not the particularly prominent logical skills for programming tasks and mathematics.

Jannik Malte Meissner is a technology entrepreneur and software developer. He has specialized in deep learning since 2014. In the past, he founded companies in the areas of clean tech, IT infrastructure and retail analytics.
Jannik is co-founder of the startup Neuralfinity in Hamburg & London, which develops a training platform for customized, task-specific large language models and vision language models. His focus is particularly on the further development and scaling of transformer models.
Okay, let’s get back to the technology. What exactly is innovative about the DeepSeek model? Until now, the basic assumption has been the so-called scaling hypothesis: that we simply need to make the existing models (especially those with the Transformer architecture) larger and train them with more data. However, there are already deviations from this “naive” scaling assumption, e.g. through differentiated architectures such as Mixture of Experts and variations in the attention mechanisms, as well as so-called reasoning methods such as chain of thought. What is new about DeepSeek?
Post-training in particular is new here: although many efficiency gains have been achieved in pre-training, these tend to underline the previous scaling hypothesis. The current hot topic here is test-time compute. This describes an approach for the dynamic adaptation of computing resources during the inference phase of an LLM. The amount of computing power used is optimized based on the complexity of the respective task – simple tasks receive fewer resources, while more complex tasks are allocated more computing power. This is done through mechanisms such as iterative revisions or parallel sampling to maximize efficiency and avoid unnecessary computational overhead. This approach is not new, but DeepSeek R1 is the first open-weights model to successfully implement it.
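The idea of test-time compute scaling can be illustrated with a toy budget function; the thresholds, token budgets and sample counts here are invented purely for illustration:

```python
# Toy illustration of test-time compute: harder prompts receive a larger
# inference budget (more reasoning tokens, more parallel samples).
def inference_budget(difficulty: float) -> dict:
    """difficulty in [0, 1], e.g. from a cheap upstream classifier."""
    if difficulty < 0.3:   # simple lookup-style question
        return {"reasoning_tokens": 256, "parallel_samples": 1}
    if difficulty < 0.7:   # moderate multi-step task
        return {"reasoning_tokens": 2048, "parallel_samples": 4}
    return {"reasoning_tokens": 8192, "parallel_samples": 16}

print(inference_budget(0.1))  # minimal budget for an easy prompt
print(inference_budget(0.9))  # maximal budget for a hard prompt
```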
To achieve test-time compute scaling, the team behind R1 relies on a method called “Group Relative Policy Optimization” (GRPO), which they first presented in a paper last year. This is a form of reinforcement learning that does not require an additional external model. The method is relatively efficient: the model itself learns how to use test-time compute efficiently and can develop chain of thought and other techniques on its own, without prior input. Ultimately, however, according to the paper, it proved more expedient to apply this method only after the model had already been trained somewhat on manually selected examples.
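The core trick of GRPO, replacing a separate value model with a comparison inside a group of sampled answers, can be sketched as follows; this is a simplification of the actual training objective:

```python
import statistics

# Sketch of the group-relative advantage at the heart of GRPO: several
# answers are sampled for the same prompt, each gets a scalar reward, and
# each answer's advantage is its reward standardized against the rest of
# its group. No learned value/critic model is required.
def group_relative_advantages(rewards: list[float]) -> list[float]:
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:  # all answers scored equally: no learning signal
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# Four sampled answers to one prompt, scored 1.0 (correct) or 0.0 (wrong):
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Answers that beat their group's average get a positive advantage and are reinforced; answers below average are penalized.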
Reinforcement learning (RL) was also the breakthrough for AI models such as AlphaGo and AlphaFold. In both cases, RL took the existing models to a whole new level. Do you expect the same for LLMs and are there further leaps in quality to be expected from intelligent processes beyond more-data-more-compute?
In any case, it cannot be ruled out. The paper currently leaves open how the smaller models that were distilled from the R1 model will behave if another phase of reinforcement learning is added. I will be conducting a few experiments on this myself in the coming weeks.
Now let’s look at the economic impact. Nvidia’s share price has fallen by 17 percent within one day. Does the new efficiency mean that less hardware is needed, as LLMs can now be trained and operated with far fewer resources?
I think that’s a fallacy. I refer here to the Jevons paradox, which states that the more efficient training becomes, the more people will try it and use it, so that more compute is ultimately required.
In addition, many of the efficiency gains in pre-training were achieved by using features that were only available with Nvidia’s Hopper hardware. In order to make use of this, many people who still use the Ampere architecture or older GPUs will have to upgrade their hardware.
“The American bias is culturally closer to ours, which is why it is often less noticeable.”
With other Chinese products such as TikTok, the issue of data protection and political bias is a hot topic. DeepSeek also states in its terms of service that the data is used for training and is passed on to the Chinese authorities in accordance with Chinese law. That’s a real no-go for consumers and developers, isn’t it?
From a European perspective, the situation is similar to that with the US models. However, the data protection problems can at least be circumvented by running the model on your own servers, since it is available as an open-weights model. With the models behind ChatGPT, for example, this is only possible in the Microsoft Cloud, which, according to many experts, is ultimately subject to US law due to the Patriot Act and Cloud Act, even with European server locations.
As far as bias is concerned, this also applies to every model: they always reflect the bias and social context of those who train them. The American bias is culturally closer to ours, which is why it is often less noticeable. For this very reason, it would be nice if we in Europe were a little more ambitious in our approach to AI and also set out to develop our own “frontier models”. For many use cases, however, much smaller models are sufficient, of which we have more and more here in Europe. Mistral, for example, has just presented another new one today.
“the model (…) cannot be reproduced independently and, in my opinion, does not comply with the principle of open source”
Finally, a question about open source. The DeepSeek models are open source. The risk of an uncontrolled outflow of data would be averted through in-house operation. The “safeguards” observed in the DeepSeek app, which censor some of the content, are not included in the open source version either. Do we still have to worry that the model contains distortions and censorship that we can’t get out and that call into question the professional use of such a model? How can this be determined and possibly “tuned away”?
Unfortunately, open source is also a question of definition here. Unfortunately, neither the training code nor the data has been disclosed. So we only have a paper, open weights and inference and fine-tuning code. This means that the model cannot be reproduced independently and, in my opinion, does not comply with the principle of open source, only open weights.
Some biases are definitely included, as in all other models. Recognizing and removing these biases would certainly be an exciting open source project. I am not aware of anything publicly available, even if there is already research on this. It is well known that Google’s attempt to remove biases has gone wrong before.
Finally, a special question that interests all of us who are currently experimenting with RAG (Retrieval Augmented Generation). Isn’t the very lengthy “thought” process in the DeepSeek models excessive and inefficient for simple questions or generating answers based on RAG contexts? Can this be switched off? Or is it better to use other models?
It depends on the application. When it comes to finding solutions, e.g. for automation in software development processes, biotechnology or theoretical mathematics, a model like DeepSeek’s can bring a lot of added value if, for example, some code examples, internal APIs or other internal knowledge is linked to the chain-of-thought process.
For simple text and formulation work or question-and-answer applications that do not require complex problem-solving approaches, I would definitely recommend much smaller models, which are also much cheaper to run.
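The routing logic implied here can be sketched as a tiny dispatcher; the task categories and model names are placeholders, not recommendations:

```python
# Illustrative router: reasoning-heavy tasks go to a large reasoning model,
# while simple RAG question answering goes to a small, cheap instruct model.
REASONING_TASKS = {"code_automation", "biotech_analysis", "theoretical_math"}

def pick_model(task_type: str) -> str:
    if task_type in REASONING_TASKS:
        return "large-reasoning-model"   # e.g. an R1-class model
    return "small-instruct-model"        # e.g. a compact instruct model

print(pick_model("rag_question_answering"))  # prints "small-instruct-model"
print(pick_model("theoretical_math"))        # prints "large-reasoning-model"
```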
We thank you for these interesting insights and wish you continued success with your company Neuralfinity!
Interview: Werner Bogula
Notes:
- As ARIC, we recommend that you do not use DeepSeek – neither as an end user of the app or the web version, nor as a developer via the API.
- With our interviews, we want to introduce you to different perspectives and players in the field of AI. The positions of our interview partners do not necessarily reflect the positions of the ARIC.