DeepSeek Thoughts

January 28, 2025, by SineWave Ventures

The recent release of DeepSeek, an open-source reasoning model, has set off another artificial intelligence-induced shockwave.

DeepSeek is reported to be up to thirty times cheaper to train than comparable large language models. This has led people to question prior assumptions about how to improve model performance and to ask how such a large reduction in cost is possible. A number of key architectural changes are outlined in the technical report. Below, we’ll touch on a few of those advances: multi-token prediction; an improved “mixture of experts” framework; efficient overlap of computation and communication on the hardware; and training with mixed-precision arithmetic.

DeepSeek implements “Multi-Token Prediction,” allowing the model to predict multiple future tokens at each position during training rather than the industry-standard single next token. Essentially, by looking ahead at the next few tokens, the model learns to generate more contextually aware output.
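
To make the idea concrete, here is a minimal PyTorch sketch of what multi-token prediction can look like during training: a few extra output heads, each predicting the token one, two, or more positions ahead, with their losses averaged into the training objective. All class and variable names are illustrative, and DeepSeek-V3’s actual MTP modules are more elaborate than this.

```python
# Illustrative sketch of multi-token prediction (MTP) via extra output heads.
# This shows the general idea only; it is not DeepSeek's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenPredictionHead(nn.Module):
    def __init__(self, hidden_dim: int, vocab_size: int, num_future: int = 2):
        super().__init__()
        # One linear head per future offset: predicts token t+1, t+2, ...
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, vocab_size) for _ in range(num_future)
        )

    def forward(self, hidden_states: torch.Tensor, targets: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden_dim) from the backbone
        # targets:       (batch, seq_len) token ids
        loss = 0.0
        for offset, head in enumerate(self.heads, start=1):
            logits = head(hidden_states[:, :-offset])   # positions with a target
            shifted_targets = targets[:, offset:]        # the token at t + offset
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                shifted_targets.reshape(-1),
            )
        # Average the per-offset losses into one training signal.
        return loss / len(self.heads)
```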

DeepSeek is a “Mixture-of-Experts” (MoE) language model, where expert sub-models are trained on specific domains like math, coding, or general text. Others have implemented MoE language models before, but DeepSeek takes the additional step of learning routing weights that help decide which experts to use. When processing an input, DeepSeek first analyzes the context and then activates a small number of the most relevant expert models.
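
A rough sketch of that routing idea, assuming a simple learned softmax gate over a handful of feed-forward experts: each token is scored against every expert, but only the top few experts actually run. DeepSeek’s production routing (shared experts, load balancing across devices) involves considerably more machinery than this, and every name below is illustrative.

```python
# Illustrative top-k expert routing in a Mixture-of-Experts layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    def __init__(self, hidden_dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Learned router: scores each expert for every token.
        self.router = nn.Linear(hidden_dim, num_experts)
        # Each "expert" is a small feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_dim, 4 * hidden_dim),
                nn.GELU(),
                nn.Linear(4 * hidden_dim, hidden_dim),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_dim)
        scores = F.softmax(self.router(x), dim=-1)            # (tokens, experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Only the selected experts are evaluated for each token.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e
                if mask.any():
                    out[mask] += top_scores[mask, k : k + 1] * expert(x[mask])
        return out
```

Because only a few experts fire per token, the number of active parameters per forward pass stays a small fraction of the model’s total parameter count.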

DeepSeek overlaps computation and communication phases in a parallel pipeline to make the most of the available hardware, and designs the system for fast cross-node data transfer to eliminate communication bottlenecks.
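
The generic version of this overlap pattern can be illustrated with PyTorch’s asynchronous collectives: launch the cross-node transfer, keep computing, and only wait when the communicated data is actually needed. DeepSeek’s actual pipeline schedule and custom communication kernels go well beyond this sketch, which assumes `torch.distributed` has already been initialized and uses hypothetical function names.

```python
# Illustrative overlap of communication and computation with an
# asynchronous collective. Assumes dist.init_process_group() was called
# (e.g., with the NCCL backend) before this function is used.
import torch
import torch.distributed as dist

def overlapped_step(local_compute_fn, tensor_to_sync: torch.Tensor):
    # Launch the cross-node transfer without blocking; async_op=True
    # returns a work handle immediately instead of waiting for completion.
    work = dist.all_reduce(tensor_to_sync, async_op=True)
    # Do useful computation while the network transfer is in flight.
    result = local_compute_fn()
    # Synchronize only when the communicated tensor is actually needed.
    work.wait()
    return result, tensor_to_sync
```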

DeepSeek uses mixed-precision arithmetic to reduce GPU memory usage and computational cost. It adopts 8-bit floating-point (FP8) arithmetic for the massive number of matrix multiplications required to train a large language model, and limits higher-precision (FP32) computation to only those portions of training that require it. In doing so, DeepSeek has demonstrated that fine-grained quantization and numerical stability can be achieved simultaneously. The result is a significant reduction in training cost.
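
The quantization side of this can be sketched as follows: split a weight matrix into small tiles and give each tile its own scale before casting to FP8, so an outlier in one tile does not wash out precision in the others. This toy example only simulates the quantize/dequantize round trip (the block size and function name are illustrative); real FP8 training runs the matrix multiplications in FP8 on the hardware and keeps sensitive steps such as accumulation in higher precision.

```python
# Toy sketch of fine-grained (block-wise) FP8 quantization.
# Requires PyTorch >= 2.1 for the float8_e4m3fn dtype.
import torch

def blockwise_fp8_roundtrip(w: torch.Tensor, block: int = 128) -> torch.Tensor:
    # w: 2-D weight matrix with dimensions divisible by `block` (for brevity).
    out = torch.empty_like(w)
    fp8_max = 448.0  # largest finite value representable in FP8 E4M3
    for i in range(0, w.shape[0], block):
        for j in range(0, w.shape[1], block):
            tile = w[i:i + block, j:j + block]
            # Per-tile scale keeps each tile within FP8's dynamic range.
            scale = tile.abs().max().clamp(min=1e-12) / fp8_max
            q = (tile / scale).to(torch.float8_e4m3fn)              # quantize
            out[i:i + block, j:j + block] = q.to(w.dtype) * scale   # dequantize
    return out

w = torch.randn(256, 256)
w_q = blockwise_fp8_roundtrip(w)
print((w - w_q).abs().max())  # per-tile scaling keeps the round-trip error small
```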

The accomplishments of the DeepSeek system are impressive! Given that their work is open-source, others will undoubtedly follow suit quickly and modify their own architectures where appropriate.

In addition to shaking up ideas on how to do GenAI a bit, DeepSeek’s arrival on the scene should lead technologists to rethink some unchallenged strategies with respect to AI and to Chinese technology development.

The early excitement about generative AI was a reaction to the fact that it could be done at all. It unlocked significant advances in information processing and enabled new practical use cases for AI in the enterprise. Companies competed to improve and expand on the effectiveness of this new capability, but did not stop to ask whether their methods were the most efficient. Over time, its inefficiency, in terms of power, data, and dollars, became a roadblock to broad adoption. The emergence of DeepSeek demonstrates that there are still many improvements to be made to GenAI processing that will further propel it to success.

And what about the tech “war” with China? The United States very cleverly denied the Chinese access to our most capable chips, unintentionally encouraging them to be more creative and to find a solution that was not as resource-intensive as the accepted methods. A classic case of necessity being the mother of invention: America’s ham-fisted attempts to hold them back backfired. We need to rethink the nation’s relationship to the Chinese tech base. The United States won’t “win the war” by trying to block the Chinese out. Instead, it will win by being more innovative and building better tech.