
        A Primal-Dual SGD Algorithm for Distributed Nonconvex Optimization

        2022-05-23 03:02:46
        IEEE/CAA Journal of Automatica Sinica, 2022, Issue 5

        Xinlei Yi, Shengjun Zhang, Tao Yang, Tianyou Chai, and Karl Henrik Johansson

        Abstract—The distributed nonconvex optimization problem of minimizing a global cost function formed by a sum of n local cost functions by using local information exchange is considered. This problem is an important component of many machine learning techniques with data parallelism, such as deep learning and federated learning. We propose a distributed primal-dual stochastic gradient descent (SGD) algorithm, suitable for arbitrarily connected communication networks and any smooth (possibly nonconvex) cost functions. We show that the proposed algorithm achieves the linear speedup convergence rate O(1/√(nT)) for general nonconvex cost functions and the linear speedup convergence rate O(1/(nT)) when the global cost function satisfies the Polyak-Łojasiewicz (P-Ł) condition, where T is the total number of iterations. We also show that the output of the proposed algorithm with constant parameters linearly converges to a neighborhood of a global optimum. We demonstrate through numerical experiments the efficiency of our algorithm in comparison with the baseline centralized SGD and recently proposed distributed SGD algorithms.

        I. INTRODUCTION

        Note that the algorithms proposed in the aforementioned references use at least gradient information of the cost functions, and sometimes even second- or higher-order information. However, in many applications explicit expressions of the gradients are unavailable or difficult to obtain. In this paper, we consider the case where each agent can only collect stochastic gradients of its local cost function and propose a distributed stochastic gradient descent (SGD) algorithm to solve (1). In general, SGD algorithms are suitable for scenarios where explicit expressions of the gradients are unavailable or difficult to obtain. For example, in some big data applications, such as empirical risk minimization, the actual gradient is calculated from the entire data set, which results in a heavy computational burden. A stochastic gradient can be calculated from a randomly selected subset of the data and is often an efficient replacement for the actual gradient. Other examples for which SGD algorithms are suitable include scenarios where data arrive sequentially, such as in online learning [10].
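The mini-batch idea described above can be made concrete with a small sketch. The least-squares loss, batch size, and sampling scheme below are illustrative assumptions, not the paper's setting; the point is only that the mini-batch gradient is an unbiased estimate of the full gradient computed from the whole data set.

```python
import numpy as np

def full_gradient(X, y, w):
    """Gradient of the least-squares empirical risk over the entire data set."""
    return X.T @ (X @ w - y) / len(y)

def stochastic_gradient(X, y, w, batch_size, rng):
    """Gradient estimated from a uniformly sampled mini-batch.

    Taking the expectation over the random index set recovers the
    full gradient, so this estimator is unbiased."""
    idx = rng.choice(len(y), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size
```

With `batch_size` equal to the data-set size the estimator reduces exactly to the full gradient; smaller batches trade variance for a much cheaper per-iteration cost.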

        A. Literature Review

        B. Main Contributions

        This paper provides positive answers for the above two questions. More specifically, the contributions of this paper are summarized as follows.

        i) We propose a distributed primal-dual SGD algorithm to solve the optimization problem (1). In the proposed algorithm,each agent maintains the primal and dual variable sequences and only communicates the primal variable with its neighbors.This algorithm is suitable for arbitrarily connected communication networks and any smooth (possibly nonconvex) cost functions.

        iv) We show in Theorems 4 and 5 that the output of our algorithm with constant parameters linearly converges to a neighborhood of a global optimum when the global cost function satisfies the P-Ł condition. Compared with [26], [46]–[49], which used the strong convexity assumption, we achieve a similar convergence result under weaker assumptions on the cost function.

        The detailed comparison of this paper with other related studies in the literature is summarized in Table I.

        C. Outline

        The rest of this paper is organized as follows. Section II presents the novel distributed primal-dual SGD algorithm.

        TABLE I COMPARISON OF THIS PAPER TO SOME RELATED WORKS

        II. DISTRIBUTED PRIMAL-DUAL SGD ALGORITHM

        This corresponds to our proposed distributed primal-dual SGD algorithm, which is presented in pseudo-code as Algorithm 1.

        Algorithm 1 Distributed Primal-Dual SGD Algorithm
        1: Input: parameters {αk}, {βk}, {ηk} ⊂ (0, +∞).
        2: Initialize: xi,0 ∈ Rp and vi,0 = 0p, ∀i ∈ [n].
        3: for k = 0, 1, ... do
        4:   for i = 1, ..., n in parallel do
        5:     Broadcast xi,k to Ni and receive xj,k from j ∈ Ni;
        6:     Sample stochastic gradient gi(xi,k, ξi,k);
        7:     Update xi,k+1 by (6a);
        8:     Update vi,k+1 by (6b).
        9:   end for
        10: end for
        11: Output: {xk}.
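Since the explicit updates (6a) and (6b) are not reproduced in this section, the following Python sketch uses one common consensus-penalty form of a primal-dual SGD update. The specific update formulas, the step-size choices, and the quadratic test problem are illustrative assumptions, not the paper's exact equations; only the overall structure (each agent keeps a primal and a dual variable and communicates only the primal one, encoded here through the Laplacian product `L @ x`) follows Algorithm 1.

```python
import numpy as np

def distributed_primal_dual_sgd(grads, L, x0, T, alpha, beta, eta):
    """Sketch of Algorithm 1 with one hypothesized form of (6a)-(6b).

    grads : list of n callables, grads[i](x_i) returns a (stochastic)
            gradient of agent i's local cost at x_i
    L     : n-by-n graph Laplacian (encodes neighbor communication)
    x0    : n-by-p array of initial primal variables x_{i,0}
    """
    n, p = x0.shape
    x = x0.copy()
    v = np.zeros((n, p))              # dual variables v_{i,0} = 0_p
    for _ in range(T):
        Lx = L @ x                    # uses only neighbors' primal variables
        g = np.array([grads[i](x[i]) for i in range(n)])
        x_new = x - eta * (g + alpha * Lx + beta * v)   # hypothesized (6a)
        v = v + eta * beta * Lx                          # hypothesized (6b)
        x = x_new
    return x
```

Because the all-ones vector lies in the null space of L, the dual variables always sum to zero, so at a fixed point the agents agree and the sum of local gradients vanishes, i.e., the consensus point is a stationary point of the global cost.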

        A. Find Stationary Points

        B. Find Global Optimum

        Remark 4: It should be highlighted that the P-Ł constant ν is not used to design the algorithm parameters. Therefore, the constant ν does not need to be known in advance. A similar convergence result to (19) was achieved by the distributed SGD algorithms proposed in [26], [46]–[49] when each local cost function is strongly convex, which is obviously stronger than the P-Ł condition assumed in Theorem 4. In addition to the strong convexity condition, [26] also assumed that each local cost function is Lipschitz-continuous. Some information related to the Lyapunov function and global parameters, which may be difficult to obtain, was furthermore needed to design the stepsize. Moreover, in [46]–[49], the strong convexity constant was needed to design the stepsize, and in [48], [49], a p-dimensional auxiliary variable, used to track the global gradient, was communicated between agents. The potential drawbacks of the results stated in Theorem 4 are that i) we consider undirected graphs rather than the directed graphs considered in [49]; and ii) we do not analyze the robustness to gradient noise as [46] did. We leave the extension to (time-varying) directed graphs and the robustness analysis as future research directions.

        Note that the unbiased assumption, i.e., Assumption 5, can be removed, as shown in the following.

        IV. SIMULATIONS

        In this section, we evaluate the performance of the proposed distributed primal-dual SGD algorithm through numerical experiments. All algorithms and agents are implemented and simulated in MATLAB R2018b, run on a desktop with an Intel Core i5-9600K processor, an Nvidia RTX 2070 Super, 32 GB RAM, and Ubuntu 16.04.

        A. Neural Networks

        We consider the training of neural networks (NN) for image classification on the MNIST database [59]. The same NN as in [28] is adopted for each agent, and the communication graph is generated randomly. The graph is shown in Fig. 1 and the corresponding Laplacian matrix L is given in (22). The corresponding mixing matrix W is constructed with Metropolis weights and is given in (23).
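The Metropolis-weight construction mentioned above follows a standard rule: each off-diagonal entry between neighbors is one over one plus the larger of the two degrees, and the diagonal absorbs the remainder so each row sums to one. A minimal sketch (the paper's specific matrix is given in (23); the path graph in the test below is only an illustration):

```python
import numpy as np

def metropolis_weights(adj):
    """Mixing matrix W from Metropolis weights for an undirected graph.

    W_ij = 1 / (1 + max(d_i, d_j)) if (i, j) is an edge,
    W_ii = 1 - sum_{j != i} W_ij, and W_ij = 0 otherwise.
    The result is symmetric and doubly stochastic."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j] and i != j:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W
```

Double stochasticity is what makes such a matrix usable for consensus-based averaging: repeated multiplication by W preserves the average of the agents' values.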

        Fig. 1. Connection topology.

        Each local neural network consists of a single hidden layer of 50 neurons, followed by a sigmoid activation layer, followed by an output layer of 10 neurons and another sigmoid activation layer. In this experiment, we use a subset of the MNIST data set. Each agent is assigned 2500 data points randomly, and at each iteration, only one data point is picked by the agent following a uniform distribution.
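The forward pass of this architecture can be sketched in a few lines; this is a plausible reading of the description above (784 MNIST inputs → 50 sigmoid units → 10 sigmoid outputs), with the weight shapes being assumptions rather than details taken from [28].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass of one agent's network: 784 -> 50 (sigmoid) -> 10 (sigmoid).

    x : (784,) flattened MNIST image
    W1: (50, 784), b1: (50,)   hidden layer
    W2: (10, 50),  b2: (10,)   output layer
    """
    h = sigmoid(W1 @ x + b1)
    return sigmoid(W2 @ h + b2)
```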

        We compare our proposed distributed primal-dual SGD algorithm with time-varying and fixed parameters (DPD-SGD-T and DPD-SGD-F) with state-of-the-art algorithms: the distributed momentum SGD algorithm (DM-SGD) [23], the distributed SGD algorithm (D-SGD-1) [26], [27], the distributed SGD algorithm (D-SGD-2) [28], D2 [36], the distributed stochastic gradient tracking algorithm (D-SGT-1) [37], [49], the distributed stochastic gradient tracking algorithm (D-SGT-2) [38], [48], and the baseline centralized SGD algorithm (C-SGD). We list all the parameters we choose in the NN experiment for each algorithm in Table II (note that the parameter names differ across the cited papers).

        TABLE II PARAMETERS IN EACH ALGORITHM IN NN EXPERIMENT

        We demonstrate the result in terms of the empirical risk function [60], which is given as

        Fig. 2 shows that the proposed distributed primal-dual SGD algorithm with time-varying parameters converges almost as fast as the distributed SGD algorithm in [26], [27] and faster than the distributed SGD algorithms in [28], [36]–[38], [48], [49] and the centralized SGD algorithm. Note that our algorithm converges slower than the distributed momentum SGD algorithm [23]. This is reasonable since that algorithm is an accelerated algorithm with an extra requirement on the cost functions, i.e., that the deviations between the gradients of the local cost functions are bounded, and it requires each agent to communicate three p-dimensional variables with its neighbors at each iteration. The slopes of the curves are however almost the same. The accuracy of each algorithm is given in Table III.

        Fig. 2. Empirical risk.

        TABLE III ACCURACY ON EACH ALGORITHM IN NN EXPERIMENT

        B. Convolutional Neural Networks

        Let us consider the training of a convolutional neural network (CNN) model. We build a CNN for each agent with five 3×3 convolutional layers using ReLU as the activation function, one average pooling layer with filters of size 2×2, one sigmoid layer of dimension 360, another sigmoid layer of dimension 60, and one softmax layer of dimension 10. In this experiment, we use the whole MNIST data set. We use the same communication graph as in the NN experiment above. Each agent is assigned 6000 data points randomly. We set the batch size to 20, which means that at each iteration each agent chooses 20 data points, again following a uniform distribution, to update the gradient. For each algorithm, we run 10 epochs to train the CNN model.

        We compare our algorithms DPD-SGD-T and DPD-SGD-F with the fastest algorithms from the neural network case, i.e., DM-SGD [23], D-SGD-1 [26], [27], and C-SGD. We list all the parameters we choose in the CNN experiment for each algorithm in Table IV.

        We demonstrate the training loss and the test accuracy of each algorithm in Figs. 3 and 4, respectively. Here we use the categorical cross-entropy loss, which is a softmax activation followed by a cross-entropy loss. We can see that our algorithms perform almost the same as DM-SGD and better than D-SGD-1 and the centralized C-SGD. The accuracy of each algorithm is given in Table V.
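The "softmax plus cross-entropy" composition mentioned above is standard and can be written in a few lines; this sketch is a generic definition, not the experiment's actual training code.

```python
import numpy as np

def softmax(z):
    """Softmax with the usual max-shift for numerical stability."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def categorical_cross_entropy(logits, label):
    """Categorical cross-entropy: softmax activation followed by the
    negative log-probability of the true class."""
    p = softmax(logits)
    return -np.log(p[label])
```

For uniform logits over c classes the loss equals log(c), and it approaches zero as the true class's logit dominates, which is why it is a natural training objective for the 10-class MNIST problem.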

        TABLE IV PARAMETERS IN EACH ALGORITHM IN CNN EXPERIMENT

        Fig. 3. CNN training loss.

        Fig. 4. CNN accuracy.

        TABLE V ACCURACY ON EACH ALGORITHM IN CNN EXPERIMENT

        V. CONCLUSIONS

        In this paper, we studied distributed nonconvex optimization. We proposed a distributed primal-dual SGD algorithm and derived its convergence rate. More specifically, the linear
