Nanhu Computing Framework is a large-scale intelligent computing framework that enables efficient collaboration among computing, storage, and networking resources. It advances the decoupling of AI foundation model training from specific GPU types.

Supporting efficient collaboration across computing, networking, and storage resources, enabling multi-brand heterogeneous accelerators, and providing trillion-parameter large language model training capacity

Nanhu Computing Framework supports large language model training across various types of accelerators. It introduces the first automatic tuning framework for heterogeneous-cluster model training, addressing the extensive accelerator resources otherwise consumed by strategy tuning for large language models: the search for optimal training strategies runs on modest CPU resources instead. It supports adaptive uneven pipeline partitioning and automatic hybrid training strategy search, cutting tuning costs by over 90%.
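As a rough illustration of how a CPU-side strategy search can avoid occupying accelerators, the sketch below enumerates data-, tensor-, and pipeline-parallel degrees and ranks them with a toy analytical cost model. All function names, constants, and the cost formula are assumptions for illustration, not the framework's actual tuner.

```python
# Minimal sketch: rank hybrid parallel strategies with a CPU-only cost model.
# The cost terms and constants below are illustrative assumptions.
from itertools import product

def estimate_step_time(dp, tp, pp, params_b=1000.0):
    """Toy analytical cost: compute shrinks with dp*tp, pipeline bubbles grow
    with pp, communication grows with tp and dp. Real tuners calibrate these
    terms from profiled measurements."""
    compute = params_b / (dp * tp)              # per-step compute share
    bubble = (pp - 1) / (pp * 8)                # bubble fraction, 8 microbatches assumed
    comm = 0.05 * (tp - 1) + 0.02 * (dp - 1)    # tensor/data-parallel comm overhead
    return compute * (1 + bubble) + comm

def search_strategies(total_gpus=1024, num_layers=96):
    best = None
    for dp, tp, pp in product([1, 2, 4, 8, 16, 32, 64], repeat=3):
        if dp * tp * pp != total_gpus or num_layers % pp:
            continue
        # Uneven partitioning would give fewer layers to stages that also hold
        # embeddings; an even split is kept here for brevity.
        t = estimate_step_time(dp, tp, pp)
        if best is None or t < best[0]:
            best = (t, dp, tp, pp)
    return best

if __name__ == "__main__":
    t, dp, tp, pp = search_strategies()
    print(f"estimated best: dp={dp}, tp={tp}, pp={pp}, step~{t:.3f} (arbitrary units)")
```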

For fine-tuning, the framework proposes a hierarchical parameter-sharing method that reduces the number of fine-tuned parameters by 44.59% while maintaining model performance, significantly lowering computational resource demands during fine-tuning and better capturing both local and global information.
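A minimal sketch of one possible hierarchical parameter-sharing scheme is shown below, assuming PyTorch and a simple grouping in which layers within a group share a local adapter while all layers share one global adapter. The module names, group size, and bottleneck width are illustrative assumptions, not the method's actual design.

```python
# Illustrative only: share adapter parameters hierarchically across layer groups
# so that fewer parameters are fine-tuned. Names and sizes are assumptions.
import torch.nn as nn

class SharedAdapter(nn.Module):
    """A small bottleneck adapter whose weights can be reused by several layers."""
    def __init__(self, hidden, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

def build_hierarchical_adapters(num_layers=24, group_size=4, hidden=1024):
    """Layers within a group share one 'local' adapter; a single 'global'
    adapter is shared by all layers to capture cross-group information."""
    local = [SharedAdapter(hidden) for _ in range(num_layers // group_size)]
    global_adapter = SharedAdapter(hidden)
    # layer i applies local[i // group_size] followed by global_adapter
    return [(local[i // group_size], global_adapter) for i in range(num_layers)]
```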

It pioneers a hierarchical cache management strategy and batched writing techniques, achieving a 3.07× to 4.99× improvement in update performance over state-of-the-art persistent memory storage systems, significantly reducing read-write overhead in data processing and accelerating parameter updates during large language model training.
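The sketch below shows the general idea of batching updates in a fast in-memory tier and flushing them to a slower persistent backend in large writes. The class, threshold, and backend interface are illustrative assumptions; the article does not detail the framework's cache hierarchy.

```python
# Minimal sketch of batched writing in front of a slower persistent tier.
# The names and thresholds are illustrative assumptions, not Nanhu's design.
class BatchedWriteCache:
    def __init__(self, backend_write, batch_size=1024):
        self.backend_write = backend_write   # writes a list of (key, value) pairs
        self.batch_size = batch_size
        self.hot = {}                        # fast in-memory (DRAM) tier
        self.pending = []                    # updates waiting to be persisted

    def put(self, key, value):
        self.hot[key] = value
        self.pending.append((key, value))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def get(self, key):
        return self.hot.get(key)             # a real system falls back to slower tiers

    def flush(self):
        if self.pending:
            self.backend_write(self.pending)  # one large sequential write
            self.pending.clear()
```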

Facilitating cross-brand heterogeneous collaboration, providing high-speed communication across heterogeneous accelerators, and achieving a significant improvement in model training efficiency without compromising model quality

Nanhu Computing Framework achieves full compatibility with multiple mainstream accelerator brands, facilitating cross-brand heterogeneous collaboration. It is the first to achieve GPU Direct RDMA high-speed interconnect collective communication among multiple heterogeneous accelerators, enabling the construction of the Nanhu Collective Communication Library, which provides high-speed communication across heterogeneous accelerators. The framework provides the capacity for large-scale heterogeneous compute scheduling with high bandwidth and low latency. We also initiated the proposal for the national standard "Intelligent computing cluster-Test method of computing node interconnection."

By reconstructing the collective communication architecture, the framework consumes zero accelerator computing resources during communication, boosting All-to-All collective communication bandwidth 1.85× compared with traditional communication libraries. It supports FP8 mixed-precision training across multiple accelerator types. Through communication and memory optimizations, it achieves an improvement of over 30% in model training efficiency without compromising model quality.
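As background on the FP8 mixed-precision idea, the following sketch emulates per-tensor scaling into the E4M3 range followed by a low-precision cast (approximated in NumPy), with dequantization back to higher precision for accumulation. It is a conceptual illustration under stated assumptions, not Nanhu's implementation.

```python
# Conceptual FP8 (E4M3) per-tensor scaling, emulated in NumPy for illustration.
import numpy as np

FP8_E4M3_MAX = 448.0   # largest finite value representable in E4M3

def _round_to_e4m3(x):
    """Approximate E4M3 rounding by keeping ~3 explicit mantissa bits."""
    m, e = np.frexp(x)                       # x = m * 2**e with |m| in [0.5, 1)
    return np.ldexp(np.round(m * 16) / 16, e)

def quantize(x):
    """Scale into the E4M3 range, then apply the emulated low-precision cast."""
    scale = FP8_E4M3_MAX / max(float(np.abs(x).max()), 1e-12)
    q = _round_to_e4m3(np.clip(x * scale, -FP8_E4M3_MAX, FP8_E4M3_MAX))
    return q, scale

def dequantize(q, scale):
    """Rescale back; accumulation and weight updates stay in higher precision."""
    return q / scale
```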

Achieving high-availability intelligent operations and maintenance for 10,000-GPU-scale clusters, supporting efficient training of trillion-parameter large models on heterogeneous clusters

Nanhu Computing Framework enables fault detection within seconds, raising 10,000-GPU-scale cluster availability to 97%. Through intelligent fault detection and automated troubleshooting, the effective training time ratio reaches 98.1% for large language model training tasks.
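A minimal sketch of second-level fault detection via heartbeats follows; the timeout, identifiers, and recovery hook are assumptions for illustration, since the article does not describe the detection mechanism in detail.

```python
# Minimal heartbeat-based fault detection sketch; all names and thresholds are
# illustrative assumptions, not Nanhu's operations-and-maintenance stack.
import time

class HeartbeatMonitor:
    def __init__(self, timeout_s=5.0, on_fault=print):
        self.timeout_s = timeout_s
        self.on_fault = on_fault
        self.last_seen = {}                 # node id -> last heartbeat timestamp

    def heartbeat(self, node_id):
        """Called by each node's agent at a fixed interval."""
        self.last_seen[node_id] = time.monotonic()

    def check(self):
        """Called periodically by the controller; flags nodes that stopped reporting."""
        now = time.monotonic()
        for node_id, ts in list(self.last_seen.items()):
            if now - ts > self.timeout_s:
                self.on_fault(f"node {node_id} missed heartbeats; isolate it and resume from checkpoint")
                del self.last_seen[node_id]
```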

The framework has been successfully applied to the training of trillion-parameter large language models, supporting heterogeneous accelerator hybrid training with advantages in high compatibility, stability, and cost efficiency. It will promote efficient collaboration and industrial applications within heterogeneous clusters.
