Chinese AI lab DeepSeek recently launched an updated version of its R1 reasoning model. The new model performs strongly on several math and coding benchmarks, but DeepSeek has not disclosed the source of its training data. Some AI experts speculate that DeepSeek may have used data from Google’s Gemini family of AI models.
Sam Paech, a Melbourne-based developer who studies AI emotional intelligence, posted on X what he describes as evidence that DeepSeek’s latest model, R1-0528, shows a strong preference for words and expressions similar to those favored by Google’s Gemini 2.5 Pro, highlighting striking similarities between the two models’ output.
Earlier this year, OpenAI told the Financial Times it had found evidence linking DeepSeek to the practice of distillation, a technique for training smaller AI models on the outputs of larger, more capable ones. According to Bloomberg, Microsoft noticed large amounts of data moving through OpenAI developer accounts late last year, and OpenAI suspects those accounts are linked to DeepSeek.
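DeepSeek has not said how R1-0528 was trained, so as a point of reference only, here is a minimal sketch of classic logit-based knowledge distillation (the Hinton-style formulation): the student model is trained to match the teacher's temperature-softened output distribution rather than hard labels. All function names here are illustrative, and note that commercial APIs generally do not expose logits, so distillation from an API in practice usually means fine-tuning on generated text instead.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; a higher temperature softens the distribution,
    # exposing more of the teacher's "dark knowledge" about non-top classes.
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened outputs.

    The student minimizes this divergence so its predictions mimic
    the teacher's soft targets. (Illustrative sketch, not DeepSeek's
    actual training recipe.)
    """
    p = softmax(teacher_logits, temperature)   # teacher's soft targets
    q = softmax(student_logits, temperature)   # student's predictions
    eps = 1e-12                                # guard against log(0)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# A student that already matches the teacher incurs (near) zero loss;
# a mismatched student incurs a larger one.
teacher = [2.0, 1.0, 0.1]
print(distillation_loss(teacher, teacher))          # ~0.0
print(distillation_loss([0.1, 1.0, 2.0], teacher))  # noticeably larger
```

In a real training loop this KL term would be backpropagated through the student's parameters, often mixed with a standard cross-entropy loss on ground-truth labels.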
Although distillation is a common AI training method, OpenAI’s terms prohibit using its model outputs to build competing systems. Yet, distinguishing AI-generated data from human-written content remains challenging. The web now contains many AI-generated articles and posts, often called “AI slop.” Content farms and bots flood platforms like Reddit and X with synthetic text. This contamination makes it difficult to filter out AI-generated data during model training.
Despite this, some experts believe DeepSeek might have relied heavily on Gemini-generated content. Nathan Lambert, a researcher at the nonprofit AI2, wrote on X: “If I were DeepSeek, I would generate massive synthetic data from the best API model available.” Lambert reasoned that DeepSeek is flush with cash but short on GPUs, which makes the strategy an effective way to stretch its compute.
To counter distillation risks, AI companies have tightened security measures. In April, OpenAI began requiring ID verification for access to some advanced models; the process demands a government-issued ID from one of its supported countries. Notably, China, where DeepSeek is based, is not on that list.
Google also recently began “summarizing” the reasoning traces produced by models on its AI Studio platform, a step that complicates efforts to train rival models on Gemini outputs. Likewise, in May, Anthropic said it would begin summarizing its own models’ traces to protect its competitive edge.
We have reached out to Google for a comment and will update this story if we receive a response.
For more tech updates, visit DC Brief.