
Baidu AI Researchers Introduce SE-MoE Proposing Elastic MoE Training with 2D Prefetch and Fusion Communication over Hierarchical Storage

Machine learning and deep learning have gained popularity in fields such as computer vision (CV) and natural language processing (NLP), which require analyzing large amounts of data such as images and text. Processing that data demands substantial computational resources. To address this, sparsely activated neural networks based on Mixture of Experts (MoE) have been used to train larger models with little or no additional computational cost while achieving better training results.

Despite these benefits, MoE models face several challenges, described below.

  1. Computational challenges: MoE models train less effectively because expert selection is often imbalanced. Workarounds such as auxiliary losses and stochastic experts are used to mitigate this (see the gating sketch after this list), but they shift the emphasis from computation toward routing and scheduling and put more pressure on CPUs than on GPUs.
  2. Communication challenges: Which parameters are activated in MoE depends directly on the input data, so unbalanced data causes severe load imbalance even when the routing method is efficient. When training requires inter-device communication, load imbalance forces some devices to sit idle while waiting for synchronous communication with their peers, which degrades performance.
  3. Storage limitations: The memory available on computing devices strongly limits MoE models. The performance of sparsely activated models is often determined by how quickly parameters can be fetched rather than by raw memory capacity: the storage tiers hold the same kind of data but differ in I/O latency, so parameters incur different wait times depending on where they reside. The challenge is therefore to build a unified, efficient storage-management scheme for sparsely activated networks.
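To make the first challenge concrete, below is a minimal NumPy sketch of top-1 expert gating with an auxiliary load-balancing loss, the kind of extra-loss workaround mentioned above. It is an illustrative example only, not code from SE-MoE, and the function name `top1_gate` is hypothetical.

```python
# Minimal sketch of top-1 expert gating with an auxiliary load-balancing loss
# (illustrative only; not SE-MoE's actual implementation).
import numpy as np

def top1_gate(logits: np.ndarray, num_experts: int):
    """Route each token to its highest-scoring expert and compute an
    auxiliary loss that penalizes uneven expert utilization."""
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)            # softmax over experts
    expert_ids = probs.argmax(axis=-1)                     # hard top-1 routing

    # Fraction of tokens dispatched to each expert vs. mean gate probability.
    dispatch_frac = np.bincount(expert_ids, minlength=num_experts) / len(expert_ids)
    mean_prob = probs.mean(axis=0)

    # The auxiliary loss is smallest when both distributions are uniform,
    # nudging the router toward a balanced expert load.
    aux_loss = num_experts * float(np.dot(dispatch_frac, mean_prob))
    return expert_ids, aux_loss

# Example: 8 tokens routed over 4 experts.
rng = np.random.default_rng(0)
ids, loss = top1_gate(rng.normal(size=(8, 4)), num_experts=4)
print(ids, loss)
```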

Accordingly, to overcome these challenges, the paper proposes an innovative unified framework for MoE training and inference. Its main contribution is SE-MoE, a new distributed system that can scale MoE models to trillions of parameters and fully exploit cluster resources, including high-bandwidth memory, CPU memory, and SSDs, to achieve efficient training scheduling. For dynamic-graph scheduling, an innovative ring-memory-based inference approach overlaps computation and communication as much as possible, yielding more efficient inference for larger-scale MoE models without extra machines. In addition, SE-MoE applies methods such as load balancing to improve performance without additional resources.
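The idea of overlapping parameter movement across a storage hierarchy with computation can be illustrated with a small, self-contained sketch. This is a conceptual toy under assumed names (`fetch`, `compute`, a simulated `STORAGE` dict), not the paper's 2D prefetch or ring-memory mechanism: parameters for the next step are fetched on a background thread while the current step computes.

```python
# Conceptual sketch (not SE-MoE's implementation): prefetch expert parameters
# from a slower tier (SSD/CPU stand-in) one step ahead of the layer currently
# being computed, so I/O overlaps with compute.
from concurrent.futures import ThreadPoolExecutor
import time

STORAGE = {f"expert_{i}": f"weights_of_expert_{i}" for i in range(4)}  # stand-in for SSD/CPU tiers

def fetch(name: str) -> str:
    time.sleep(0.05)            # simulated SSD/CPU -> device transfer latency
    return STORAGE[name]

def compute(weights: str) -> None:
    time.sleep(0.05)            # simulated expert forward pass

schedule = [f"expert_{i}" for i in range(4)]   # experts needed, in execution order
with ThreadPoolExecutor(max_workers=1) as io:
    next_weights = io.submit(fetch, schedule[0])
    for step, name in enumerate(schedule):
        weights = next_weights.result()        # wait only if the prefetch is not done yet
        if step + 1 < len(schedule):           # kick off the next transfer early
            next_weights = io.submit(fetch, schedule[step + 1])
        compute(weights)                       # compute overlaps the in-flight fetch
```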

The MoE training workflow is shown in Figure 1.

The experiments are divided into two parts: training efficiency and inference performance. The results show that SE-MoE outperforms the standard MoE system DeepSpeed, delivering roughly 28% speedup in single-node training and at least 33% speedup in multi-node training for MoE models with more than 100 billion parameters, while reducing GPU memory usage by roughly 12 GB per GPU. For inference on MoE models with more than 200 billion parameters, SE-MoE achieves roughly 13% speedup over DeepSpeed.

Experiments were also conducted to evaluate elastic MoE training and the effect of embedding partitioning on the MoE architecture. The results show that applying the embedding-partition method on a single node effectively reduces GPU memory usage: as the hidden size increases, the proposed approach cuts GPU memory by 22.4%, 24.2%, and 26.3%, respectively, while raising throughput by 4.2%, 11.2%, and 15.6%.
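For intuition on why partitioning an embedding table saves memory, here is a rough sketch of row-wise sharding across ranks. It is an assumption about the general technique, not the paper's exact scheme; the `lookup` helper is hypothetical and the cross-device communication is omitted.

```python
# Rough sketch of row-wise embedding partitioning: each rank stores only a
# slice of the vocabulary, so per-GPU memory for the table shrinks with the
# number of ranks (illustrative only; communication is omitted).
import numpy as np

vocab_size, hidden, num_ranks = 1000, 64, 4
rows_per_rank = vocab_size // num_ranks

# Each rank materializes only its shard of the full table.
shards = [np.random.rand(rows_per_rank, hidden).astype(np.float32)
          for _ in range(num_ranks)]

def lookup(token_id: int) -> np.ndarray:
    """Route a lookup to the rank that owns the row (gather step omitted)."""
    rank, local_row = divmod(token_id, rows_per_rank)
    return shards[rank][local_row]

vec = lookup(123)          # served by rank 0 in this layout
print(vec.shape)           # (64,)
```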

In summary, this article presented SE-MoE, a unified system for MoE training and inference that can benefit workloads in both NLP and CV. The work can be extended toward a combined training-and-inference system that accounts for parameter sparsity and scheduling across multiple dimensions. Such a unified system would help overcome the communication, computation, and storage bottlenecks of sparse training.

This article is written as a summary by Marktechpost Staff based on the research paper 'SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System'. All credit for this research goes to the researchers of this project. Check out the paper and GitHub.

