
Introduction
Targeting the two core bottlenecks in embodied intelligence, the scarcity of training data and insufficient model generalization, this project initiates a new paradigm of "large-scale synthetic simulation data for pre-training, complemented by real-world data for post-training." Starting from fundamental research and extending to cross-industry practical applications, the project independently develops key technologies such as physics-realistic simulation, Real2Sim2Real, and end-to-end Vision-Language-Action (VLA) models, establishing a comprehensive technical system that spans data generation, model training, and application deployment.
Embodied AI Foundation Model Dual-Driven by Synthetic and Real Data
Addressing the two major bottlenecks of scarce data and poor generalization in embodied intelligence, Galbot has proposed a novel virtual-real data fusion training paradigm: "large-scale simulated synthetic data for pre-training + a small amount of real-world data for post-training." Starting from the underlying physics solver, the project builds a physics-realistic simulation data synthesis pipeline, integrates a real-world data acquisition system, and has assembled an industry-leading, comprehensive automated toolchain covering data production, labeling, review, training, and testing. This system iterates data production and model training in a closed loop at low cost and high efficiency, optimized for deployment goals. The project has produced multiple high-quality embodied intelligence datasets that address pressing strategic needs, including the billion-scale dexterous hand grasping dataset DexGraspNet 2.0, the spatial intelligence dataset SpatialNet covering 770,000 3D object assets, and the ultra-large-scale parts dataset GAPartNet.
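To make the recipe above concrete, the following is a minimal, hypothetical sketch of the two-stage procedure in PyTorch. The class TinyPolicy, the helpers make_fake_batches and train_stage, and all hyperparameters are illustrative assumptions, not Galbot's actual code, models, or data formats.

    # Hypothetical sketch of "large-scale synthetic pre-training + small real post-training".
    # Nothing here reflects Galbot's internal implementation; it only illustrates the idea.
    import torch
    import torch.nn as nn

    class TinyPolicy(nn.Module):
        """Stand-in for a vision-language-action policy: observation + instruction -> action."""
        def __init__(self, obs_dim=64, lang_dim=16, act_dim=7):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim + lang_dim, 128), nn.ReLU(), nn.Linear(128, act_dim))

        def forward(self, obs, lang):
            return self.net(torch.cat([obs, lang], dim=-1))

    def make_fake_batches(n_batches, batch=32):
        """Placeholder data; a real pipeline would stream simulated or teleoperated episodes."""
        return [(torch.randn(batch, 64), torch.randn(batch, 16), torch.randn(batch, 7))
                for _ in range(n_batches)]

    def train_stage(model, batches, epochs, lr):
        """One supervised stage over (observation, instruction, action) tuples."""
        opt = torch.optim.AdamW(model.parameters(), lr=lr)
        model.train()
        for _ in range(epochs):
            for obs, lang, act in batches:
                loss = nn.functional.mse_loss(model(obs, lang), act)
                opt.zero_grad()
                loss.backward()
                opt.step()
        return model

    model = TinyPolicy()
    # Stage 1: pre-train on large volumes of cheap, diverse simulated synthetic data.
    model = train_stage(model, make_fake_batches(1000), epochs=1, lr=1e-4)
    # Stage 2: post-train on a small amount of real-world data at a lower learning rate,
    # adapting the general policy to a specific deployment without discarding its priors.
    model = train_stage(model, make_fake_batches(10), epochs=3, lr=1e-5)

The point of the sketch is the asymmetry between the two stages: a large, cheap synthetic corpus establishes broad priors, and a small, expensive real-world set adapts the model to deployment conditions.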
Building on this foundation, the project developed GraspVLA, the world's first end-to-end embodied grasping foundation model. Its pre-training was based entirely on billions of vision-language-action synthetic data points, giving it strong Sim2Real capability: it achieves zero-shot generalized grasping in varied real-world scenes and supports rapid adaptation from a general model to an expert model through post-training on a small number of real samples. The project also trained NaVid, the world's first purely vision-based ("FSD"-style) navigation foundation model for embodied intelligence, and on that basis released TrackVLA, the first product-level navigation foundation model.
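At run time, an end-to-end policy of this kind is typically queried in a closed loop: the latest camera frame and the language command go in, and a low-level action comes out. The sketch below shows only this control-loop shape; the GraspVLAPolicy class, its predict() signature, and the camera and robot interfaces are hypothetical and do not describe the released model's actual API.

    # Hypothetical closed-loop use of a GraspVLA-style policy. Interfaces are invented
    # for illustration; the real model's inputs, outputs, and API may differ.
    import numpy as np

    class GraspVLAPolicy:
        """Stand-in policy: RGB image + instruction -> 7-DoF end-effector delta + gripper."""
        def predict(self, rgb: np.ndarray, instruction: str) -> np.ndarray:
            # A real policy would run a vision-language backbone and an action head here.
            return np.zeros(8, dtype=np.float32)

    def grasp_loop(policy, camera, robot, instruction, max_steps=200):
        """Zero-shot grasping loop: re-plan from the newest observation at every step."""
        for _ in range(max_steps):
            rgb = camera.read()                       # latest RGB frame
            action = policy.predict(rgb, instruction)
            robot.apply(action)                       # send pose delta + gripper command
            if robot.grasp_succeeded():
                return True
        return False

    # Usage (camera and robot stand in for real hardware drivers):
    # grasp_loop(GraspVLAPolicy(), camera, robot, "pick up the red mug")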
Cross-Domain Applications of Embodied Intelligence
Galbot adheres to the principle of building "practical and functional" intelligent robots. Its robots adopt a pragmatic design with dual 7-DOF arms, an omnidirectional wheeled base, and a foldable lifting structure, enabling deployment across commercial, industrial, household, and medical settings. The company has secured cumulative investment exceeding RMB 2.4 billion from leading backers such as Meituan, CATL, China Internet Investment, China Development Bank, BAIC Capital, SAIC Hengxu, and Hong Kong Investment Corporation (HKIC), ranking first in the domestic embodied intelligence sector and reaching unicorn status with a valuation above RMB 10 billion.
Currently, the company has established 15 autonomous robotic distribution warehouses in Beijing. These facilities are compatible with various complex shelf structures, including open shelving, precision retrieval systems, and drawers. Each approximately 40-square-meter store is equipped with over 5,000 types of medicines and 6,000 cargo lanes, with tens of thousands of medicine boxes managed solely by a single Galbot robot. Galbot has also launched its first intelligent retail store operated by robots at the New Zhongguan Darongcheng, supporting the sale of multiple product categories such as coffee, self-made beverages, snacks, cultural and creative products, and medicines. It plans to realize the "Ten Cities, Hundred Stores" initiative by the end of the year.
As the exclusive robot platform partner, Galbot hosted the International Artificial Intelligence Olympiad, with over 80 teams from 63 countries completing the competition on its software and hardware platform. The company also recently won the gold medal in the autonomous task scenario competition at the World Humanoid Robot Games. In addition, Galbot has established joint ventures, investments, and in-depth industrial application collaborations with industrial partners such as Bosch, Hyundai, CATL, Seres, and Zeekr.
The company has extended its robotic services to medical settings, including Xuanwu Hospital and West China Hospital. It has given multiple live demonstrations and briefings to senior Chinese leaders, including Premier Li Qiang, Vice Premier Ding Xuexiang, Chairman of the National Development and Reform Commission Zheng Shanjie, Beijing Municipal Party Secretary Yin Li, and Director of the National Data Administration Liu Liehong.
With R&D centers in Beijing, Shenzhen, Suzhou, and Hong Kong, Galbot has been selected as a member unit of the AI Standardization Committee under the Ministry of Industry and Information Technology. The company has published nearly 100 cutting-edge academic papers on embodied foundation models worldwide and holds 11 granted utility model patents and 6 software copyrights, all of which have been translated into practical applications. Its achievements have been showcased at events such as the World Robot Conference, the China International Fair for Trade in Services (CIFTIS), the International Conference on Intelligent Robots and Systems (IROS), the Zhongguancun Forum, and the Beijing TV Spring Festival Gala.
Galbot has been featured multiple times on programs like "CCTV News Broadcast" and "CCTV News," as well as other media outlets, and has been selected as one of NVIDIA's 14 global humanoid robot partners.
Embodied Intelligence Large Models as a Driving Force for Industrial Upgrading
In contrast to traditional robotic technologies, which are limited to simple, repetitive tasks executed from fixed programs, the company has achieved a generational leap in capability through a closed-loop integration of hardware, data, and algorithms, giving general-purpose service robots human-like generalization in manipulation.

The company has pioneered a globally innovative three-layer large-model system whose core concept is to achieve fast-responding, strongly generalizing embodied intelligence through compact 3D-vision models. The bottom layer is the hardware layer. The middle layer consists of embodied skill models that learn a series of skills, including generalized automatic mapping, generalized map navigation, object grasping, articulated object manipulation, mobile grasping, folding clothes, and hanging clothes, from 3D vision and Sim2Real simulation data. The top layer is a task planning model that employs multimodal large models such as GPT-4V and Emu2 as task planners to invoke the smaller skill models in the middle layer.

The company's Sim2Real technology enables seamless transfer from virtual to real environments: by rendering arbitrary scenes and objects it escapes real-world constraints, markedly improving the robots' generalization and addressing the global challenge of costly, inefficient data collection and processing. The company has also introduced the world's largest dexterous hand dataset, DexGraspNet, and the massive parts dataset, GAPartNet. Leveraging these two datasets, the three-layer system uses 3D vision to obtain point clouds of objects and parts, combining perception, pose estimation, and action composition to achieve generalized grasping across scenes, materials, shapes, and placements from human voice commands, with a current success rate of 95% and an expected future rate exceeding 99.9%. In parallel, the company continues to refine an end-to-end multimodal large model that outputs actions directly, maintaining high computational efficiency while ensuring precise, reliable operation across varied environments and with objects of complex materials.
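A simplified sketch of how the top-layer planner might invoke middle-layer skills is given below. The skill registry, the plan_with_vlm() stand-in for a GPT-4V/Emu2-style planner, and the hard-coded plan are hypothetical illustrations of the layering only, not the company's actual interfaces.

    # Hypothetical illustration of the three-layer structure: a multimodal planner (top)
    # decomposes an instruction into calls to specialised skill models (middle), which
    # ultimately command the hardware (bottom). All names and the plan are invented.
    from typing import Callable, Dict, List, Tuple

    # Middle layer: each skill is a compact, specialised model exposed as a callable.
    SKILLS: Dict[str, Callable[[str], bool]] = {
        "navigate_to": lambda place: True,        # generalized map navigation
        "grasp_object": lambda obj: True,         # point-cloud based generalized grasping
        "open_articulated": lambda part: True,    # articulated object manipulation
        "place_object": lambda place: True,
    }

    def plan_with_vlm(instruction: str) -> List[Tuple[str, str]]:
        """Stand-in for a GPT-4V / Emu2 style task planner. The plan is hard-coded here;
        a real planner would ground the instruction in the current scene."""
        return [("navigate_to", "shelf"),
                ("grasp_object", "medicine box"),
                ("navigate_to", "counter"),
                ("place_object", "counter")]

    def execute(instruction: str) -> bool:
        """Top layer dispatches each planned step to a middle-layer skill."""
        for skill, arg in plan_with_vlm(instruction):
            if not SKILLS[skill](arg):            # bottom layer executes via the hardware
                return False                       # replanning / recovery would go here
        return True

    print(execute("fetch the medicine box and bring it to the counter"))

The design intent this sketch illustrates is the separation of concerns: fast, specialised skill models handle real-time perception and control, while the slower multimodal planner handles open-ended task decomposition.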
Through a "theory-technology-standard" three-level transformation framework, the project supports precise operation in ten major scenarios, including household service robots and industrial flexible assembly, and has established six industry standards, promoting the large-scale application of embodied intelligence in smart retail, industrial manufacturing, services for an aging society, and smart manufacturing upgrades. Three key directions form an "operation-navigation-interaction" capability triangle: dexterous-hand models solve the challenge of practical tool use, navigation systems expand the boundaries of autonomous environmental cognition, and language-driven interaction bridges human intent and machine execution. Together, these advances drive the transformation of robots from "mechanical execution" to "active evolution," marking the dawn of a new era in which artificial intelligence transitions from data-driven to "physical interaction-driven" paradigms.
The World Internet Conference (WIC) was established as an international organization on July 12, 2022, headquartered in Beijing, China. It was jointly initiated by the Global System for Mobile Communications Association (GSMA), the National Computer Network Emergency Response Technical Team/Coordination Center of China (CNCERT), the China Internet Network Information Center (CNNIC), Alibaba Group, Tencent, and Zhijiang Lab.