China's Large AI Models Likely to Lead the World, but Most 10,000-GPU Clusters Are Inefficient, Says Academician

TMTPOST -- China has significant advantages in developing large model solutions tailored to different industries, and could potentially lead the world, said Zheng Weimin, a member of the Chinese Academy of Engineering and a professor at Tsinghua University's Department of Computer Science and Technology.

Zheng made the remarks on Wednesday at a conference co-organized by Global Times, the Center for New Technology Development of the China Association for Science and Technology (CAST), and the Technology Innovation Research Center of Tsinghua University.

In 2024, China's large AI model industry was characterized by two main trends: the transition from foundational large models to multimodal models, and the integration of large models with industry applications, he noted.

Zheng explained the five key stages in the lifecycle of large models and identified the challenges at each step. The first stage is data acquisition. Large model training requires massive amounts of data, often in the billions of files. The difficulty lies in the frequent reading and processing of these files, which can be time-consuming and resource-intensive. 

The second stage is data preprocessing. Data often requires cleaning and transformation before it can be used for training. Zheng cited GPT-4 as an example, explaining that the model required 10,000 GPUs over the course of 11 months, with nearly half of that time spent on data preprocessing. This phase remains highly inefficient by current standards.

The most widely used software in the industry for this process is the open-source Spark platform. While Spark boasts an excellent ecosystem and strong scalability, its drawbacks include slower processing speeds and high memory demands. For instance, processing one terabyte of data could require as much as 20 terabytes of memory. Tsinghua University researchers are working on improvements by writing modules in C++ and employing various methods to reduce memory usage, potentially cutting preprocessing time by half.
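The memory problem Zheng describes comes from materializing a whole dataset at once. A minimal sketch (plain Python, not Spark, and with illustrative cleaning rules) of how the same kind of preprocessing can be written as a streaming pipeline so memory stays roughly flat regardless of corpus size:

```python
import re

def clean_records(lines):
    """Yield normalized, deduplicated, non-empty text records one at a time."""
    seen = set()  # hashes of records already emitted (dedup state)
    for line in lines:
        text = re.sub(r"\s+", " ", line).strip()  # collapse whitespace
        if not text:
            continue  # drop empty records
        key = hash(text)
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        yield text

corpus = ["Hello   world\n", "hello\n", "Hello   world\n", "   \n", "data cleaning\n"]
cleaned = list(clean_records(corpus))
print(cleaned)  # → ['Hello world', 'hello', 'data cleaning']
```

Because `clean_records` is a generator, only one record (plus the dedup index) is held in memory at a time; real systems replace the in-memory `seen` set with disk-backed or approximate structures to keep that bounded too.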

The third stage in the lifecycle is model training. This step demands substantial computational power and storage. Zheng emphasized the importance of system reliability during training. For example, in a system with 100,000 GPUs, if errors occur every hour, it can drastically reduce training efficiency. Although the industry has adopted a "pause and resume" method, where the system is paused every 40 minutes to record its state before continuing, this approach is still limited in its effectiveness.
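The "pause and resume" approach can be sketched as periodic checkpointing: save the training state every N steps, so a failure costs only the work since the last checkpoint rather than the whole run. The interval and state shape below are illustrative, not any real cluster's configuration:

```python
import json, os, tempfile

CKPT = os.path.join(tempfile.gettempdir(), "toy_ckpt.json")
CHECKPOINT_EVERY = 40  # save state every 40 steps (echoing the 40-minute interval)

def save_ckpt(step, loss):
    with open(CKPT, "w") as f:
        json.dump({"step": step, "loss": loss}, f)

def load_ckpt():
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)
    return {"step": 0, "loss": 100.0}  # fresh run: no checkpoint yet

def train(total_steps, crash_at=None):
    state = load_ckpt()  # resume from the last checkpoint if one exists
    step, loss = state["step"], state["loss"]
    while step < total_steps:
        step += 1
        loss *= 0.99  # stand-in for one real training step
        if step % CHECKPOINT_EVERY == 0:
            save_ckpt(step, loss)
        if crash_at is not None and step == crash_at:
            raise RuntimeError("simulated hardware fault")
    return step, loss

if os.path.exists(CKPT):
    os.remove(CKPT)
try:
    train(200, crash_at=95)   # fails at step 95 ...
except RuntimeError:
    pass
step, loss = train(200)       # ... then resumes from the step-80 checkpoint
print(step)  # → 200
```

The limitation Zheng points to is visible even in this toy: saving state costs time on every interval, and a failure still discards everything after the last checkpoint, which is why checkpointing alone scales poorly as error rates rise.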

The fourth stage is model fine-tuning, where a base large model is trained further for specific industries or applications. For example, a healthcare large model may be trained on hospital data to produce a specialized version for the medical field. Further fine-tuning can create models for more specific tasks, such as ultrasound analysis.
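The idea of fine-tuning can be shown with a toy numerical sketch: the "base" weights stay frozen, and only a small task-specific weight is adjusted on domain data. Real fine-tuning updates some or all parameters of a neural network; the one-dimensional model here is purely illustrative:

```python
def fine_tune(base_w, data, lr=0.1, epochs=200):
    head_w = 0.0  # the only trainable parameter; base_w is frozen
    for _ in range(epochs):
        for x, y in data:
            pred = base_w * x + head_w * x   # base knowledge + task-specific adjustment
            grad = 2 * (pred - y) * x        # d(squared error)/d(head_w)
            head_w -= lr * grad              # update the head only
    return head_w

# "Domain" data follows y = 3x, while the frozen base model encodes y = 2x,
# so the trained head should learn roughly +1 to close the gap.
head = fine_tune(base_w=2.0, data=[(1.0, 3.0), (2.0, 6.0), (0.5, 1.5)])
print(round(head, 3))  # → 1.0
```

The same pattern repeats down the hierarchy Zheng describes: a hospital fine-tunes the general model on medical data, and an ultrasound team fine-tunes that result again on its narrower dataset.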

AI chips play a critical role in the large model industry, and Zheng highlighted the need for greater domestic chip development. While China has made substantial progress in AI chips in recent years, challenges remain in ecosystem compatibility. For example, it may take years to port software written for Nvidia hardware to systems developed by Chinese companies. The industry's current strategy is to focus on improving software ecosystems to enable better linear scaling and support for multi-chip training.

Zheng further pointed out that building a domestic "10,000 GPU" system, although challenging, is essential. Such a system would need to be both functionally viable and supported by a strong software ecosystem. Additionally, heterogeneous chip-based training systems should be prioritized for their potential to accelerate AI development.

China's computing power has entered a new phase of rapid growth, largely driven by large model training and by projects such as the "Eastern Data, Western Computing" initiative, which is building a national computing network linking China's east and west. High-end AI chips are in heavy demand for large model training, while mid- to low-end chips remain underutilized, with utilization rates currently hovering around 30%. With proper development of China's software ecosystem, this rate could potentially rise to 60%.

At the event, Jiang Tao, co-founder and senior vice president of iFLYTEK, introduced "Feixing-1", China's first large-scale AI model computing platform. iFLYTEK's large models have already reached performance levels comparable to GPT-4 Turbo, surpassing GPT-4 in areas like mathematical reasoning and code generation, according to Jiang.

You Peng, president of Huawei Cloud AI and Big Data, shared his views on the future of the AI industry. He predicted that foundational models would likely be concentrated in the hands of three to five key players. However, the need for industry-specific models would continue to grow, creating opportunities for other companies to build specialized applications on top of these foundational models.

You summarized three key points from Huawei's AI-to-Business (AI To B) practices. First, not all companies need to build massive AI computing infrastructures, especially since many can leverage cloud-based solutions for efficient training, reinforcement learning and inference.

Second, companies may find it more cost-effective to apply mainstream foundational models to their specific use cases rather than training their own models.

Lastly, not all application companies need to pursue large models: smaller, specialized models remain valuable tools in specific domains, with large models serving as coordination systems.
