
Research Article | Open Access

Volume 2021 | Article ID 9879246 | https://doi.org/10.34133/2021/9879246

Qingliang Meng, Meiyu Huang, Yao Xu, Naijin Liu, Xueshuang Xiang, "Decentralized Distributed Deep Learning with Low-Bandwidth Consumption for Smart Constellations", Space: Science & Technology, vol. 2021, Article ID 9879246, 10 pages, 2021. https://doi.org/10.34133/2021/9879246

Decentralized Distributed Deep Learning with Low-Bandwidth Consumption for Smart Constellations

Received: 02 Aug 2021
Accepted: 09 Oct 2021
Published: 31 Oct 2021

Abstract

For space-based remote sensing systems, onboard intelligent processing based on deep learning has become an inevitable trend. To adapt to dynamic changes of the observation scenes, there is an urgent need to perform distributed deep learning onboard so as to fully utilize the plentiful real-time sensing data of the multiple satellites in a smart constellation. However, the network bandwidth of a smart constellation is very limited. Therefore, it is of great significance to study distributed training in a low-bandwidth environment. This paper proposes a Randomized Decentralized Parallel Stochastic Gradient Descent (RD-PSGD) method for distributed training over a low-bandwidth network. To reduce the communication cost, each node in RD-PSGD randomly transfers only part of the parameters of its local intelligent model to its neighbors. We further speed up the algorithm by optimizing the programming of random index generation and parameter extraction. For the first time, we theoretically analyze the convergence property of the proposed RD-PSGD and validate its advantage by simulation experiments on various distributed training tasks for image classification, covering different benchmark datasets and deep learning network architectures. The results show that, compared with the TopK-based methods, RD-PSGD can effectively save the time and bandwidth cost of distributed training and reduce the complexity of parameter selection. The method proposed in this paper provides a new perspective for the study of onboard intelligent processing, especially for online learning on a smart satellite constellation.

1. Introduction

With the breakthrough development of artificial intelligence and the rapid improvement of onboard computing and storage capabilities, it has become an inevitable trend for remote sensing satellite systems to directly generate the information required by users through intelligent processing onboard [1, 2]. As earth observation scenes usually present highly dynamic characteristics, the traditional working mode of training on the ground and predicting onboard cannot satisfy users' requirements for real-time and accurate perception. There is an urgent need to learn and update the intelligent model onboard to adapt to the dynamic changes of the scenes.

Affected by factors such as satellite orbits, payloads, physical characteristics of target objects, and imaging methods, more and more intelligent tasks, such as emergency observation of disaster areas and the search for the missing Malaysia Airlines aircraft, require the cooperation of multiple satellites. Therefore, relying only on the observation data of a single satellite makes it difficult to achieve precise learning of the global intelligent interpretation model for these cooperative tasks.

Benefiting from the development of satellite technology and the reduction of satellite development cost, the number of satellites in orbit has increased sharply and intersatellite networks have gradually been established, which lays the foundation for multisatellite collaboration, i.e., a smart constellation. Based on this collaborative working mode, it becomes possible to integrate the real-time sensing data and computing capabilities of multiple satellites through distributed deep learning technology. Compared with learning the intelligent model on only one satellite, distributed deep learning can achieve the overall optimization and global convergence of the intelligent model without global information or human intervention and thus improve the collaborative perception and cognitive capabilities of the space-based remote sensing system. Depending on how the tasks are parallelized across satellites, distributed training can be divided into two categories: model parallelism and data parallelism [3]. Model parallelism means training different parts of a network with multiple workers, which is mainly used for training very large models [4, 5]. In contrast, data parallelism refers to the strategy of partitioning the dataset into smaller splits [6] or collecting data on different devices independently, which is the scenario studied here.

However, due to the particularity of the operating environment of satellites, which differs from that of cluster systems on the ground, the network bandwidth of a smart constellation is often very limited. Therefore, it is of great significance and practical urgency to study distributed deep learning in a low-bandwidth environment. To deal with this problem, traditional distributed training methods can be improved in two aspects.

The first aspect is to use decentralized network structures [7–9]. In the traditional centralized network structure, all nodes need to transmit the trained parameters or gradients of their intelligent models to a central server, wait for the parameter or gradient fusion, and then receive the fused parameters or gradients from the central server. Instead, the decentralized network structure removes the central parameter server and allows all nodes to exchange parameters or gradients with adjacent nodes. In this way, the communication pressure is shared among the nodes, which avoids congestion and improves the real-time capability of distributed training.

The second aspect is to reduce data transmission and save bandwidth usage. This can be achieved by communication delay, quantization, and sparsification. These techniques can be used independently or in combination to develop a comprehensive distributed training framework, such as sparse binary compression [10]. Communication delay means communicating after training several batches locally instead of after every batch, which reduces the frequency of communication. This technique is used in Local SGD (Stochastic Gradient Descent) [11, 12], federated averaging [13], and federated learning [14]. Quantization means using low-precision values to replace the original precise parameters. For example, QSGD [15] (Quantized Stochastic Gradient Descent) adjusts the number of bits sent per iteration to smoothly trade off communication bandwidth against convergence time. The TernGrad approach [16] requires only three numerical levels, which aggressively reduces communication time. DoReFa-Net [17] stochastically quantizes gradients to low-bitwidth numbers.

This paper mainly focuses on using the sparsification technique to overcome the communication bottleneck in a low-bandwidth environment. In the sparsification method, only part of the network parameters or gradients is sent. For example, Alistarh et al. [18] proposed sorting the gradients in decreasing order of magnitude and truncating the gradient to its top-K components, and they proved the convergence of this TopK-based method analytically. Deep gradient compression [19] also uses the gradient magnitude as a simple heuristic for importance and employs momentum correction, local gradient clipping, momentum factor masking, and warm-up training to preserve accuracy. Tsuzuku et al. [20] used the variance of gradients as a signal for compression. AdaComp [21] adaptively tunes the compression rate based on local gradient activity. Amiri and Gündüz [22] considered the physical-layer aspects of wireless communication and proposed an analog computation scheme, A-DSGD (Analog Distributed Stochastic Gradient Descent).

We notice that all these methods choose the magnitude or variance as the indicator of importance, sort the gradients by importance, and then truncate the gradients to the top components. For a deep neural network with millions to billions of parameters, this process can be time-consuming due to its high complexity. In this paper, a novel method named RD-PSGD (Randomized Decentralized Parallel Stochastic Gradient Descent) for reducing communication bandwidth by parameter sparsification is proposed. Unlike existing methods utilizing TopK sparsification, in each iteration we select the parameters to be transferred in a random way, which greatly reduces the complexity of parameter screening. We prove, through both theoretical and experimental analysis, that this strategy still guarantees convergence, and we optimize the programming to fully leverage the advantage of random parameter sparsification.

The remainder of this paper is organized as follows: Section 2 proposes the RD-PSGD method for smart satellite constellations, and Section 3 presents the programming optimization for the proposed method. Section 4 validates our method by experiments. Conclusions and future works are presented in Section 5.

2. Methodology

In this section, we first introduce the distributed training framework for smart satellite constellations. Then, based on this framework, we briefly review a classic distributed deep learning training method, namely, the D-PSGD (Decentralized Parallel Stochastic Gradient Descent) method [7]. Lastly, motivated by the analysis of the communication complexity of D-PSGD, we propose our RD-PSGD method, which is more suitable for a low-bandwidth satellite constellation environment.

2.1. Distributed Training Framework for Smart Satellite Constellations

The distributed training framework for smart satellite constellations is shown in Figure 1. In the framework, each satellite collects remote sensing images in real time and stores them locally. Besides, each satellite is equipped with an intelligent model to perform a certain perception or cognitive task, such as object detection or scene classification on the collected remote sensing images. If every satellite keeps its intelligent model unchanged, it cannot deal with remote sensing images of dynamic scenes or objects; thus, it is necessary to learn and update the intelligent model onboard. However, training the intelligent model only on a satellite's own local dataset can hardly achieve overall optimization and global convergence. Instead, for distributed training on a satellite constellation, multiple satellites can be connected by intersatellite links to form a communication network, in which each satellite is called a worker node. A node not only trains the model using its own dataset but also exchanges and averages model parameters with adjacent nodes. In this way, the perception and cognitive capabilities of the satellite constellation can be fully utilized.

In this paper, we assume that the network has a fixed ring structure and that there is no centralized parameter server in the system. As mentioned earlier, this design can effectively avoid communication congestion. Nevertheless, the RD-PSGD method proposed later can be easily applied to other network structures.
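To make the ring topology concrete, the following sketch builds one common choice of mixing weights for such a network: each node averages with equal weight 1/3 over itself and its two ring neighbors. This is our own illustrative construction rather than code from the paper; any symmetric doubly stochastic matrix matching the ring would serve the same purpose, and the later D-PSGD and RD-PSGD sketches reuse it.

```python
import numpy as np

def ring_weight_matrix(num_nodes: int) -> np.ndarray:
    """Symmetric doubly stochastic mixing matrix for a fixed ring of satellites.

    Assumes num_nodes >= 3 so that the two ring neighbors are distinct.
    """
    W = np.zeros((num_nodes, num_nodes))
    for i in range(num_nodes):
        W[i, i] = 1.0 / 3.0                      # weight kept on the node itself
        W[i, (i - 1) % num_nodes] = 1.0 / 3.0    # left neighbor on the ring
        W[i, (i + 1) % num_nodes] = 1.0 / 3.0    # right neighbor on the ring
    return W

# For the 8-satellite constellation considered later, every row and column sums to 1.
assert np.allclose(ring_weight_matrix(8).sum(axis=0), 1.0)
```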

2.2. A Review of D-PSGD

The distributed training method proposed in this paper is built upon D-PSGD [7], which is a very popular decentralized distributed deep learning technique. Lian et al. [7] proved the convergence of D-PSGD and showed that D-PSGD outperforms centralized algorithms. D-PSGD [7] considers the following stochastic optimization problem:

min_{x ∈ R^N} f(x) := E_{ξ∼D} F(x; ξ),    (1)

where D is the training dataset and ξ is a data sample drawn from it. x ∈ R^N denotes the serialized parameter vector of an intelligent model with a specified deep learning network architecture, which is usually a convolutional neural network (such as ResNets [23]), and F(x; ξ) is a predefined loss function. The above optimization problem can be efficiently and effectively solved by the SGD algorithm [24].

To design parallel SGD algorithms on a decentralized network, the data are distributed onto all n nodes such that the original objective defined in (1) can be rewritten as

min_{x ∈ R^N} f(x) = (1/n) ∑_{i=1}^{n} E_{ξ∼D_i} F(x; ξ),    (2)

where D_i denotes the data distribution on the i-th node.

Define f_i(x) := E_{ξ∼D_i} F(x; ξ), with f(x) = (1/n) ∑_{i=1}^{n} f_i(x). There are two ways to distribute the data: shared data, where all nodes access the same dataset, i.e., D_i = D for all i; and local data with the same distribution, i.e., each node holds its own local dataset D_i that has the same distribution as D, which is the setting used in this paper.

2.2.1. Definitions and Notations

Throughout this paper, the following definitions and notations are used:
(i) ‖·‖ denotes the ℓ2 norm of a vector or the spectral norm of a matrix.
(ii) ‖·‖_F denotes the Frobenius norm of a matrix.
(iii) 1_n denotes the column vector of length n with 1 for all elements.
(iv) |S| denotes the size of a set S.
(v) supp(v) denotes the set of locations of nonzero entries in a vector v.
(vi) ∇f denotes the gradient of a function f.
(vii) x_i^(k) denotes the parameter vector at iteration k on the i-th node.
(viii) X_k denotes the concatenation of the local parameter vectors at iteration k, X_k := [x_1^(k), …, x_n^(k)].
(ix) W denotes the weight matrix, i.e., the network topology, satisfying (i) W = W^T and (ii) W 1_n = 1_n. We use deg(W) to denote the degree of the network W; for the ring-structured network considered in this paper, each row of W has nonzero entries only for the node itself and its two adjacent neighbors.
(x) M_m := diag(m) denotes the diagonal matrix built from a vector m of independent Bernoulli random variables, each of which takes the value 1 with probability p.
(xi) B(n, p) denotes the binomial distribution.

By the definitions and notations above, D-PSGD is stated in Algorithm 1.

Algorithm 1: D-PSGD on the i-th node.
Input: Initial parameter guess x_i^(0), learning rate γ, weight matrix W, and maximum iteration K.
Output: x_i^(K).
1: for k = 0, 1, …, K − 1 do
2:   Randomly sample ξ_i^(k) from the local dataset D_i;
3:   Compute the gradient at the current optimization parameters: g_i^(k) = ∇F(x_i^(k); ξ_i^(k));
4:   Compute the neighborhood weighted average by fetching optimization parameters from the neighbors: x̄_i^(k) = ∑_j W_ij x_j^(k);
5:   Update x_i^(k+1) = x̄_i^(k) − γ g_i^(k);
6: end for
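As an illustration of Algorithm 1 (a minimal sketch under our own naming, not the authors' implementation), one D-PSGD iteration on a single node can be written as follows, reusing the ring weight matrix sketched earlier and assuming a user-supplied stochastic gradient function.

```python
import numpy as np

def dpsgd_step(node_id, local_params, neighbor_params, W, grad_fn, lr):
    """One D-PSGD iteration on one node (steps 2-5 of Algorithm 1).

    neighbor_params: dict {j: parameter vector of node j} for the ring neighbors.
    grad_fn: returns a stochastic gradient at the current local parameters
             (sampling from the local dataset is left to the caller).
    """
    grad = grad_fn(local_params)
    # Step 4: neighborhood weighted average (own weight plus ring neighbors).
    averaged = W[node_id, node_id] * local_params
    for j, x_j in neighbor_params.items():
        averaged = averaged + W[node_id, j] * x_j
    # Step 5: apply the local stochastic gradient to the averaged parameters.
    return averaged - lr * grad
```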
2.3. RD-PSGD

It is easy to check that the communication complexity of D-PSGD is proportional to the full model size, since every synchronization transfers all N parameters. In a network with low communication capacity, D-PSGD may therefore suffer from latency. Here, we introduce a random transferring technique to reduce communication, named RD-PSGD (Randomized Decentralized Parallel Stochastic Gradient Descent). Specifically, in the process of model synchronization with adjacent nodes, only a part of the parameters of the intelligent model is randomly selected for transmission, which reduces the bandwidth cost and the complexity of parameter filtering compared with the TopK-based methods [18, 19]. The details are stated in Algorithm 2.

Algorithm 2: RD-PSGD on the i-th node.
Input: Initial parameter guess x_i^(0), learning rate γ, weight matrix W, and maximum iteration K.
Output: x_i^(K).
1: for k = 0, 1, …, K − 1 do
2:   At the i-th node, construct a random vector m_i^(k) ∈ {0, 1}^N, where each entry takes the value 1 with probability p, and then transfer this vector to the other nodes;
3:   Randomly sample ξ_i^(k) from the local dataset D_i;
4:   Compute the gradient at the current optimization parameters: g_i^(k) = ∇F(x_i^(k); ξ_i^(k));
5:   Compute the neighborhood weighted average by fetching the optimization parameters indicated in m_i^(k) from the neighbors: x̄_i^(k)[l] = ∑_j W_ij x_j^(k)[l] for l ∈ supp(m_i^(k));
6:   Let x̄_i^(k)[l] = x_i^(k)[l] for l ∉ supp(m_i^(k));
7:   Update x_i^(k+1) = x̄_i^(k) − γ g_i^(k);
8: end for
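The corresponding sketch for Algorithm 2 differs only in the random mask: a Bernoulli mask decides which entries are averaged with the neighbors (steps 2 and 5), while the remaining entries keep their local values (step 6). In a real constellation only the masked entries of each neighbor would be transmitted; here the full vectors are passed in for simplicity, and all names are our own.

```python
import numpy as np

def rdpsgd_step(node_id, local_params, neighbor_params, W, grad_fn, lr, p, rng):
    """One RD-PSGD iteration on one node (Algorithm 2), with mask probability p."""
    mask = rng.random(local_params.shape[0]) < p        # step 2: Bernoulli mask
    grad = grad_fn(local_params)                        # steps 3-4
    averaged = W[node_id, node_id] * local_params       # step 5: weighted average
    for j, x_j in neighbor_params.items():
        averaged = averaged + W[node_id, j] * x_j       # only x_j[mask] is needed
    synced = np.where(mask, averaged, local_params)     # step 6: keep own values
    return synced - lr * grad                           # step 7: local SGD update
```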

Now, we prove the convergence of RD-PSGD. Firstly, as in D-PSGD, we make a commonly used assumption on the weight matrix.

Assumption 1. W is a symmetric doubly stochastic matrix with ρ := max{|λ_2(W)|, |λ_n(W)|} < 1, where λ_i(W) denotes the i-th largest eigenvalue of W.

From a global view, at the k-th iteration, Algorithm 1 can be viewed as

X_{k+1} = X_k W − γ G_k,

where X_k := [x_1^(k), …, x_n^(k)] and G_k := [∇F(x_1^(k); ξ_1^(k)), …, ∇F(x_n^(k); ξ_n^(k))].

To prove the convergence of D-PSGD, a critical property of the weight matrix W is needed, i.e., Lemma 5 in the original publication of D-PSGD [7], which we reformulate as follows:

Lemma 2. Under Assumption 1, for any k ≥ 1, we have ‖W^k − (1/n) 1_n 1_n^T‖ ≤ ρ^k.

Similarly, at the k-th iteration, Algorithm 2 can be viewed as

x_i^(k+1) = M_i^(k) ∑_j W_ij x_j^(k) + (I − M_i^(k)) x_i^(k) − γ ∇F(x_i^(k); ξ_i^(k)),    (6)

where M_i^(k) := diag(m_i^(k)).

Denote the masked neighborhood weighted averaging operator by T_m, with sparsity ratio r := |supp(m)|/N (0 < r ≤ 1). Then, for each node, only the information specified in m needs to be transferred, so the communication complexity is almost a fraction r of that of D-PSGD; when r < 1, the communication complexity is reduced accordingly. Denote by T^k the operator obtained by composing T_m k times with independent masks. Compared with D-PSGD, we need to prove a property of T^k similar to that of W^k in Lemma 2, to complete the proof of the convergence of RD-PSGD:

Lemma 3. Under Assumption 1, for any k ≥ 1, the contraction bound of Lemma 2 also holds for the composite operator T^k with high probability.

Proof. Denote by m_i^(k) the random vector used to construct the mask in the i-th operator of T^k; each entry of m_i^(k) indicates whether the corresponding entry of the local optimization parameters will be averaged at the k-th iteration in Algorithm 2. Thus, m_i^(k) is a vector of independent Bernoulli random variables. By (6) and the definition of T_m, and since W is doubly stochastic, the deviation of each row of T^k from the uniform averaging vector (1/n) 1_n^T can be bounded in terms of the number of iterations at which the corresponding entries are actually averaged; by the notation above, this number follows a binomial distribution. Combining this bound with the independence of the Bernoulli entries yields the stated probability bound. This completes the proof.

Denote by ρ_T the resulting decay rate of T^k under the probability bound above; it plays the same role for RD-PSGD as ρ does for D-PSGD and can be estimated both theoretically and empirically.

Figure 2 shows the numerical and theoretical values of the decay rate ρ_T in a ring network of 8 nodes, with the sparsity ratio varied over its range under different minimal probabilities. The theoretical and numerical values of the convergence rate are in good agreement, indicating that RD-PSGD can meet the convergence requirements.

Assuming that the average speed of the intersatellite link is 2 MB/s, the effect of utilizing RD-PSGD is shown in Table 1. It can be seen that RD-PSGD reduces the bandwidth and time cost linearly and thus makes distributed training in a low-bandwidth environment practical.


Table 1: Bandwidth and time cost of D-PSGD and RD-PSGD (intersatellite link speed: 2 MB/s).

Model            Method     Bandwidth (MB)    Time (s)
ResNet-20 [23]   D-PSGD     45.7              22.8
                 RD-PSGD    4.6               2.3
ResNet-50 [23]   D-PSGD     97.7              48.8
                 RD-PSGD    9.8               4.9
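A quick arithmetic check of Table 1 under the stated assumption of a 2 MB/s average intersatellite link: the time column is simply the bandwidth cost divided by the link speed, so the roughly tenfold bandwidth reduction of RD-PSGD translates directly into a roughly tenfold time reduction.

```python
LINK_SPEED_MB_PER_S = 2.0   # assumed average intersatellite link speed

bandwidth_mb = {
    "ResNet-20, D-PSGD": 45.7, "ResNet-20, RD-PSGD": 4.6,
    "ResNet-50, D-PSGD": 97.7, "ResNet-50, RD-PSGD": 9.8,
}
for name, mb in bandwidth_mb.items():
    # Reproduces the time column of Table 1 up to rounding.
    print(f"{name}: {mb / LINK_SPEED_MB_PER_S:.2f} s")
```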

3. Programming Optimization

The refinement of the proposed RD-PSGD algorithm over D-PSGD also introduces additional programming overhead. For the D-PSGD method, a full cycle of parameter transmission for model synchronization, i.e., step 4 in Algorithm 1, can be divided into three parts: serialization of the parameters of the intelligent model (T_ser), communication of the parameters (T_comm), and deserialization of the parameters to recover the deep learning network structure of the intelligent model (T_deser); namely, the time cost of each cycle of parameter transmission for D-PSGD is

T_D-PSGD = T_ser + T_comm + T_deser.

For RD-PSGD, aiming at low-bandwidth communication of the parameters (reducing T_comm to roughly r·T_comm), extra steps are needed: generation and transmission of the random index that indicates which parameters need to be transmitted (T_idx), i.e., step 2 in Algorithm 2, as well as extraction of the parameters to be transmitted according to the random index (T_ext) and expansion of the extracted sparse parameters into dense network parameters (T_exp) in steps 5 and 6 of Algorithm 2; namely, the time cost of each cycle of parameter transmission for RD-PSGD is

T_RD-PSGD = T_ser + T_idx + T_ext + r·T_comm + T_exp + T_deser.

Then, the difference between T_D-PSGD and T_RD-PSGD is

T_D-PSGD − T_RD-PSGD = (1 − r)·T_comm − (T_idx + T_ext + T_exp).

Therefore, RD-PSGD has lower time cost only if

T_idx + T_ext + T_exp < (1 − r)·T_comm.    (19)

Equation (19) shows that (i) the generation and transmission of the random index and the extraction and expansion of the model parameters should be optimized as far as possible to give full play to the acceleration effect of RD-PSGD, and (ii) the lower the network bandwidth, the higher the value of T_comm, and thus the more obvious the acceleration.

In order to improve the acceleration effect of RD-PSGD, we optimize the programming from two aspects.

Random index generation. We first observe step 2 of Algorithm 2, in which the random index vector m is generated from a Bernoulli distribution with probability p. Suppose the total number of parameters of the intelligent model is N; this direct approach has to perform N random number generations and thresholding operations regardless of p. Since m is a binary vector, the indices of its elements with value 1 can be denoted by I = (i_1, i_2, …, i_s), where the size of I is s ≈ pN. The differences between adjacent elements of I are

d_j = i_j − i_{j−1},  j = 1, 2, …, s,  with i_0 := 0.

We can infer that the elements of d obey a geometric distribution. Therefore, if we transform the random index vector m into d, we need only perform s ≈ pN random number generations. Another advantage of transforming m into d is the reduced bandwidth cost when the mask is sparse. For example, when p is not too small, an 8-bit integer is enough to represent the value of each element of d; in total this employs about 8pN bits of data, which is less than the N bits of the 0-1 Boolean representation whenever p < 1/8.
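A minimal NumPy sketch of the two index-generation strategies described above; the function names, the over-draw chunk size, and the use of numpy.random.Generator are our own choices. Both produce the index set of a Bernoulli(p) mask, but the gap-based version draws only about pN random numbers instead of N.

```python
import numpy as np

def mask_indices_direct(num_params: int, p: float, rng: np.random.Generator) -> np.ndarray:
    # Direct approach: one Bernoulli draw per parameter (N random numbers).
    return np.flatnonzero(rng.random(num_params) < p)

def mask_indices_by_gaps(num_params: int, p: float, rng: np.random.Generator) -> np.ndarray:
    # Gap-based approach: the gaps between selected indices are geometric,
    # so only about p*N random numbers are drawn in total.
    chunk = max(int(1.2 * p * num_params) + 16, 16)
    pieces, last = [], -1
    while last < num_params - 1:
        gaps = rng.geometric(p, size=chunk)        # gaps are >= 1
        positions = last + np.cumsum(gaps)
        pieces.append(positions[positions < num_params])
        last = int(positions[-1])
    return np.concatenate(pieces)

rng = np.random.default_rng(0)
idx = mask_indices_by_gaps(25_600_000, 0.1, rng)   # roughly ResNet-50-sized vector
print(idx.size / 25_600_000)                        # close to the sparsity ratio 0.1
```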

Parameter extraction. Regarding the serialization, deserialization, extraction, and expansion operations in steps 5 and 6 of Algorithm 2, a naive implementation involves a series of time-consuming for-loop operations. To speed up RD-PSGD, the built-in functions of NumPy and PyTorch are used as much as possible to take advantage of dedicated CPU and GPU acceleration for vector and tensor operations.
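The following PyTorch sketch illustrates the kind of vectorized operations referred to above: serialization of the model into a flat vector, extraction of the selected entries, and expansion of the received sparse values back into a dense vector that keeps the local values elsewhere (step 6 of Algorithm 2). The helper names are our own, not the authors' API.

```python
import torch

def serialize(model: torch.nn.Module) -> torch.Tensor:
    # Flatten all parameters into one vector without per-entry Python loops.
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

def extract(flat_params: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    # Gather only the randomly selected entries (the values actually transmitted).
    # idx is expected to be a LongTensor of positions.
    return flat_params.index_select(0, idx)

def expand(received: torch.Tensor, idx: torch.Tensor, local_flat: torch.Tensor) -> torch.Tensor:
    # Scatter the received sparse entries into a copy of the local dense vector,
    # so unselected positions keep their local values.
    dense = local_flat.clone()
    dense.index_copy_(0, idx, received)
    return dense
```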

4. Experiments

We evaluate our RD-PSGD method on several distributed training tasks for image classification on different benchmark datasets and deep learning network architectures by ground simulation. Specifically, we study the performance of ResNet-20 [23] and VGG-16 [25] on CIFAR-10 [26] and ResNet-50 [23] on ImageNet-1k [27]. In our experiments, we test RD-PSGD on a ring-structured network consisting of 8 worker nodes, each of which is simulated by a workstation with an RTX 3070 GPU. The dataset is randomly split into 8 subsets to simulate the different data collected by each satellite. The algorithm is implemented using PyTorch with Gloo as the communication backend.
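The paper states only that the simulation uses PyTorch with the Gloo backend on 8 worker nodes arranged in a ring; the sketch below shows one plausible way to initialize such a worker, with the rendezvous address, port, and function name being our own assumptions rather than the authors' setup.

```python
import os
import torch.distributed as dist

def init_ring_worker(rank: int, world_size: int = 8):
    """Join the Gloo process group and return the ranks of the two ring neighbors."""
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")   # assumed rendezvous address
    os.environ.setdefault("MASTER_PORT", "29500")       # assumed rendezvous port
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)
    return (rank - 1) % world_size, (rank + 1) % world_size
```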

Models are trained using SGD with momentum and weight decay on every single node. The hyperparameter setup is as follows:
(1) Batch size: 32 for ResNet-20, 128 for VGG-16, and 64 for ResNet-50.
(2) Learning rate: for ResNet-20, starting from 0.1 and divided by 10 at the 80th and 120th epochs; for VGG-16, starting from 0.5 and divided by 2 every 25 epochs; for ResNet-50, starting from 0.1 and divided by 10 every 30 epochs.
(3) Number of epochs: for ResNet-20, the maximum number of epochs is 200, with 196 iterations per epoch on a single node; for VGG-16, 200 epochs in total; for ResNet-50, 90 epochs.
(4) Momentum: 0.9.
(5) Weight decay: 0.0001.
(6) Synchronization delay: for ResNet-20 and VGG-16, parameter synchronization is performed after every batch, while for ResNet-50, it is performed after every 100 batches.
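For concreteness, a configuration sketch matching items (1), (2), (4), and (5) above for the ResNet-20/CIFAR-10 case is given below; `model` and `train_set` are assumed to already exist, and the function name is our own.

```python
import torch

def build_resnet20_setup(model: torch.nn.Module, train_set):
    # Item (1): batch size 32 for ResNet-20.
    loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
    # Items (2), (4), (5): lr 0.1, momentum 0.9, weight decay 0.0001.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    # Item (2): divide the learning rate by 10 at the 80th and 120th epochs.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[80, 120], gamma=0.1)
    return loader, optimizer, scheduler
```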

The convergence and bandwidth-saving effect of RD-PSGD are analyzed below. The acceleration effect of the programming optimization is evaluated, and the time cost is compared with that of the TopK-based methods [18, 19].

4.1. Convergence of RD-PSGD

Figures 3–5 show the convergence of the loss function and prediction accuracy of different models with different sparsity ratios using RD-PSGD. As shown in Figure 3, training ResNet-20 on CIFAR-10 can achieve convergence under different sparsity ratios with no accuracy loss. When the sparsity ratio is 0.1, i.e., the transmitted model parameters are reduced by 10 times, the training accuracy can still reach more than 90%. A similar phenomenon is also present when training VGG-16 on CIFAR-10 and ResNet-50 on ImageNet-1k, as shown in Figures 4 and 5. These results demonstrate that the proposed RD-PSGD method can converge on different distributed training tasks under different sparsity ratios.

4.2. Bandwidth Cost

The bandwidth cost of one full cycle of transmitting ResNet-50 from one node to another is shown in Figure 6. When the sparsity ratio is close to 1, the bandwidth cost of RD-PSGD is higher than that of D-PSGD, because an extra vector containing the indices of the parameters is transmitted. Below a critical value of around 0.8, as the sparsity ratio continues to decrease, the bandwidth cost decreases approximately linearly.

4.3. Programming Optimization

We evaluate the time cost of training ResNet-50 on ImageNet for one epoch using different methods. Table 2 shows that, without programming optimization, the time cost of RD-PSGD is even higher than that of D-PSGD. After the programming optimization is applied, the average time cost of random index generation is reduced from 0.432 s to 0.056 s, a speedup of about 7.7×. Meanwhile, the average time cost of parameter extraction and expansion is reduced from 12.855 s to 0.431 s, a speedup of about 29.8×. The overall effect of the programming optimization is shown in Table 2, where the time cost of RD-PSGD is reduced to 781.15 s, less than that of D-PSGD.


Table 2: Time cost (s) of training ResNet-50 on ImageNet for one epoch.

                             1 node     8-node D-PSGD    8-node RD-PSGD
Time cost (s) (no opt.)      6069.36    838.03           1078.5
Time cost (s) (with opt.)    6069.36    838.03           781.15

As we mentioned earlier, the lower the bandwidth, the more obvious the acceleration. To prove this, we use the Trickle tool to limit the bandwidth to no more than 200 kb/s. We define the time cost of parameter synchronization in one epoch as the total time cost minus the time needed for gradient calculation and backpropagation. The result in Table 3 shows that the speedup ratio is indeed higher when the bandwidth is lower, which is closer to the reduction implied by the sparsity ratio.


Table 3: Time cost (s) of parameter synchronization in one epoch under high and low bandwidth.

                              D-PSGD    RD-PSGD
High bandwidth                87.7      33.9
Low bandwidth (≤200 kb/s)     285.6     58.3

4.4. Comparison with the TopK-Based Methods

Table 4 shows the time cost of parameter extraction for the TopK-based methods [18, 19] and RD-PSGD at sparsity ratios 0.1 and 0.5. The result indicates that RD-PSGD accelerates parameter extraction by roughly 4× to 7× compared with the TopK-based methods, by selecting the parameters to be transferred in a random way instead of screening them according to their magnitudes, which requires a sorting operation of high time complexity. A timing sketch comparing the two selection strategies is given after the table.


Table 4: Time cost (s) of parameter extraction for the TopK-based methods and RD-PSGD.

Sparsity ratio 0.1    ResNet-20    VGG-16    ResNet-50
TopK                  0.0060       0.369     0.659
RD-PSGD               0.0015       0.059     0.099

Sparsity ratio 0.5    ResNet-20    VGG-16    ResNet-50
TopK                  0.0134       0.932     1.619
RD-PSGD               0.0033       0.199     0.419
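To make the comparison concrete, the sketch below times the two selection strategies on a randomly generated parameter vector of roughly ResNet-50 size; it illustrates the difference in principle (sorting by magnitude versus random selection) and is not the benchmark code behind Table 4, with the sizes and names being our own.

```python
import time
import torch

def topk_select(flat_params: torch.Tensor, ratio: float) -> torch.Tensor:
    # TopK-style selection: rank entries by magnitude and keep the largest ones.
    k = max(1, int(ratio * flat_params.numel()))
    _, idx = torch.topk(flat_params.abs(), k)
    return idx

def random_select(flat_params: torch.Tensor, ratio: float) -> torch.Tensor:
    # RD-PSGD-style selection: keep a random subset, no magnitude ranking needed.
    mask = torch.rand_like(flat_params) < ratio
    return mask.nonzero(as_tuple=True)[0]

params = torch.randn(25_600_000)           # roughly ResNet-50-sized flat vector
for name, select in [("TopK", topk_select), ("Random", random_select)]:
    start = time.perf_counter()
    select(params, ratio=0.1)
    print(f"{name}: {time.perf_counter() - start:.3f} s")
```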

5. Conclusion and Future Work

This paper proposed RD-PSGD, a decentralized distributed training algorithm with low-bandwidth consumption for a smart constellation, which randomly selects a part of the model parameters to transmit. We prove the convergence of this algorithm theoretically and optimize the programming to further speed up its practical application. The experimental results show that the convergence and acceleration requirements in a low-bandwidth environment can be met and that the algorithm outperforms the TopK-based method in parameter extraction, which indicates that this is a promising method for future distributed deep learning on a space-based remote sensing system.

The work in this paper can be improved in the future. Firstly, the algorithm is tested on distributed training tasks on a labeled dataset, while the data used for onboard training are usually unlabeled. The algorithm can be extended for semisupervised or unsupervised training. Secondly, our current experiment is conducted in a cluster environment with fixed network topology and homogeneous nodes on the ground, using software to simulate the low-bandwidth intersatellite network. We will study the performance of the algorithm in a dynamic heterogeneous network and carry out the onboard verification and corresponding engineering optimization research in the future.

Data Availability

The datasets used in this paper include CIFAR10 (http://www.cs.toronto.edu/~kriz/cifar.html) and ImageNet-1k (https://image-net.org/challenges/LSVRC/2012/).

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this article.

Authors’ Contributions

X. Xiang and N. Liu conceived the idea of this study and supervised the study. X. Xiang performed theoretical proof and numerical analysis. Q. Meng, M. Huang, and Y. Xu conducted the experiments. Q. Meng and M. Huang performed data analysis. X. Xiang, Q. Meng, and M. Huang wrote the manuscript. All authors discussed the results and contributed to the final version of the manuscript. Qingliang Meng and Meiyu Huang contributed equally to this work.

Acknowledgments

This work is supported by the Beijing Nova Program of Science and Technology under Grant Z191100001119129 and the National Natural Science Foundation of China under Grant 61702520.

References

1. G. Giuffrida, L. Diana, F. de Gioia et al., "Cloudscout: a deep neural network for on-board cloud detection on hyperspectral images," Remote Sensing, vol. 12, no. 14, p. 2205, 2020.
2. H. Li, H. Zheng, C. Han, H. Wang, and M. Miao, "Onboard spectral and spatial cloud detection for hyperspectral remote sensing images," Remote Sensing, vol. 10, no. 1, p. 152, 2018.
3. H. Zhang, Machine Learning Parallelism Could Be Adaptive, Composable and Automated, Ph.D. thesis, Carnegie Mellon University, 2020.
4. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: pre-training of deep bidirectional transformers for language understanding," NAACL, 2019.
5. T. Brown, B. Mann, N. Ryder et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
6. X. Jia, S. Song, W. He et al., "Highly scalable deep learning training system with mixed-precision: training ImageNet in four minutes," 2018, https://arxiv.org/abs/1807.11205.
7. X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, "Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent," Advances in Neural Information Processing Systems, pp. 5330–5340, 2017.
8. H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, "D2: decentralized training over decentralized data," in International Conference on Machine Learning, PMLR, pp. 4848–4856, 2018.
9. X. Lian, W. Zhang, C. Zhang, and J. Liu, "Asynchronous decentralized parallel stochastic gradient descent," in International Conference on Machine Learning, PMLR, pp. 3043–3052, Stockholm, Sweden, 2018.
10. F. Sattler and S. Wiedemann, "Sparse binary compression: towards distributed deep learning with minimal communication," in 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8, Budapest, Hungary, 2019.
11. S. U. Stich, "Local SGD converges fast and communicates little," in 7th International Conference on Learning Representations, pp. 1–17, New Orleans, USA, 2019.
12. T. Lin, S. U. Stich, K. K. Patel, and M. Jaggi, "Don't use large mini-batches, use local SGD," in 8th International Conference on Learning Representations, pp. 1–13, Addis Ababa, Ethiopia, 2020.
13. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Artificial Intelligence and Statistics, pp. 1273–1282, Seattle, Washington, USA, 2017.
14. J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: strategies for improving communication efficiency," 2016, https://arxiv.org/abs/1610.05492.
15. D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, "QSGD: communication-efficient SGD via gradient quantization and encoding," Advances in Neural Information Processing Systems, vol. 30, pp. 1709–1720, 2017.
16. W. Wen, C. Xu, F. Yan et al., "TernGrad: ternary gradients to reduce communication in distributed deep learning," in Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 1508–1518, Long Beach, CA, USA, 2017.
17. S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, "DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients," 2016, https://arxiv.org/abs/1606.06160.
18. D. Alistarh, T. Hoefler, M. Johansson, S. Khirirat, N. Konstantinov, and C. Renggli, "The convergence of sparsified gradient methods," in Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 5977–5987, Montreal, Canada, 2018.
19. Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, "Deep gradient compression: reducing the communication bandwidth for distributed training," in 6th International Conference on Learning Representations, pp. 1–14, Vancouver, BC, Canada, 2018.
20. Y. Tsuzuku, H. Imachi, and T. Akiba, "Variance-based gradient compression for efficient distributed deep learning," in 6th International Conference on Learning Representations, pp. 1–12, Vancouver, BC, Canada, 2018.
21. C.-Y. Chen, J. Choi, D. Brand, A. Agrawal, W. Zhang, and K. Gopalakrishnan, "AdaComp: adaptive residual gradient compression for data-parallel distributed training," in Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, 2018.
22. M. Mohammadi Amiri and D. Gündüz, "Machine learning at the wireless edge: distributed stochastic gradient descent over-the-air," IEEE Transactions on Signal Processing, vol. 68, pp. 2155–2169, 2020.
23. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, Las Vegas, NV, USA, 2016.
24. O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, "Optimal distributed online prediction using mini-batches," Journal of Machine Learning Research, vol. 13, no. 1, pp. 165–202, 2012.
25. K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in 3rd International Conference on Learning Representations, pp. 1–14, San Diego, CA, USA, 2015.
26. A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Technical Report, University of Toronto, 2009.
27. O. Russakovsky, J. Deng, H. Su et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.

Copyright © 2021 Qingliang Meng et al. Exclusive Licensee Beijing Institute of Technology Press. Distributed under a Creative Commons Attribution License (CC BY 4.0).
