AI's Data-Driven Growth Bubble Set to Burst, Says Fudan University Professor

"We need to reassess the Scaling Law. Moreover, we must address these issues at their root, to stimulate the core cognitive abilities of large models and enhance their level of rationality," Xiao noted.

TMTPOST--The relentless expansion of AI large language models (LLMs) in line with the Scaling Law—driven by increasing training data, computational power and parameters—may soon hit its ceiling, warns Xiao Yanghua, a professor at Fudan University's School of Computer Science and Technology, and Director of the Shanghai Key Laboratory of Data Science.

"We need to reassess the Scaling Law. Moreover, we must address these issues at their root, to stimulate the core cognitive abilities of large models and enhance their level of rationality," Xiao noted.

At the 2024 Inclusion Conference on the Bund held from September 5 to 7, the "From Data for AI to AI for Data" forum projected that by 2026, the amount of new data generated by humans will fall behind the data needed for training models. By 2028, it is estimated that AI LLMs will deplete human-generated data.

This raises concerns that future models, whether trained on high-quality open datasets or information scraped from the internet, will eventually hit a bottleneck, making it difficult to achieve artificial general intelligence (AGI) that surpasses human capabilities.

Xiao emphasized that the crux of deploying AI models lies in data engineering. However, the current development of large models is characterized by "crude" data consumption and inefficient usage, far inferior to how humans process data. Furthermore, he pointed out that much of the data fed into these massive models could be considered "fluff," suggesting that AI LLMs are already approaching the point of exhausting genuinely useful data.

The rapid expansion of LLMs has led to an increase in the scale of data consumption. For instance, Meta's open-source model Llama 3 reportedly uses 15 trillion tokens, more than 200 times the size of the Library of Alexandria, which held approximately 70 GB of data. OpenAI's GPT-3.5 utilized 45 terabytes of text data, equivalent to 4.72 million copies of China's Four Great Classical Novels. GPT-4 went further, incorporating multi-modal data with a scale of hundreds of trillions of tokens.

Despite the impressive capabilities these models demonstrate, they still face significant challenges, including the infamous "hallucinations" and lack of domain-specific knowledge. OpenAI's GPT-4, for example, has an error rate of over 20%, largely due to the absence of high-quality data.

Xiao highlighted that data quality determines the "intelligence ceiling" of AI LLMs. Yet, around 80% of the data used in large-scale models may be redundant or erroneous, making the refinement of data quality and diversity critical for the future development and application of AI technology.

Xiao outlined three potential pathways to improve AI LLMs through high-quality data: synthetic data, private data, and personal data.

Xiao also believes that the current reliance on expanding model parameters—often with redundant information—may soon reach its limits. He advocates for a shift towards smaller, more refined models that retain only the most critical data, allowing AI to achieve higher levels of rationality and intelligence.

He argues that the current surge in generative AI models is a bubble that will inevitably burst, as the growth in high-quality data production is relatively slow. The challenges of controlling synthetic data quality and the limits of deductive reasoning will also cap AI's potential. Even if models are trained with parameters ten or a hundred times the size of the human brain, the limits of human cognition may prevent us from fully understanding or utilizing such superintelligent systems.

Ultimately, Xiao sees AI as a "mirror" that forces humanity to confront what lacks value in society and to focus on what truly matters. He concludes that AI's future will compel industries to return to their core values and drive humans to pursue more meaningful and valuable endeavors.

As the AI field continues to evolve, the debate over data quality, scaling limits, and the role of synthetic data will shape the next phase of development. But one thing is certain: the road to AGI will be paved with challenges that extend beyond mere data accumulation.
