

PowerNet: Efficient Representations of Polynomials and Smooth Functions by Deep Neural Networks with Rectified Power Units

Journal of Mathematical Study, 2020, Issue 2

Bo Li, Shanshan Tang and Haijun Yu*

1 NCMIS & LSEC, Institute of Computational Mathematics and Scientific/Engineering Computing, Academy of Mathematics and Systems Science, Beijing 100190, China.

2 School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China.

3 China Justice Big Data Institute, Beijing 100043, China.

Abstract. Deep neural networks with rectified linear units (ReLU) have become more and more popular recently. However, the derivatives of the function represented by a ReLU network are not continuous, which limits the use of ReLU networks to situations where smoothness is not required. In this paper, we construct deep neural networks with rectified power units (RePU), which can give better approximations for smooth functions. Optimal algorithms are proposed to explicitly build neural networks with sparsely connected RePUs, which we call PowerNets, to represent polynomials with no approximation error. For general smooth functions, we first project the function onto its polynomial approximation, then use the proposed algorithms to construct the corresponding PowerNet. Thus, the error of the best polynomial approximation provides an upper bound of the best RePU network approximation error. For smooth functions in higher-dimensional Sobolev spaces, we use fast spectral transforms for tensor-product grid and sparse grid discretizations to obtain polynomial approximations. Our constructive algorithms show clearly a close connection between spectral methods and deep neural networks: PowerNets with n hidden layers can exactly represent polynomials up to degree s^n, where s is the power of the RePUs. The proposed PowerNets have potential applications in situations where high accuracy is desired or smoothness is required.

Key words: Deep neural network, rectified linear unit, rectified power unit, sparse grid, PowerNet.

        1 Introduction

Artificial neural networks (ANN) have been a hot research topic for several decades. Deep neural networks (DNN), a special class of ANNs with multiple hidden layers, have become increasingly popular in recent years. Since 2006, when efficient training methods were introduced by Hinton et al. [1], DNNs have brought significant improvements in several challenging problems including image classification, speech recognition, computational chemistry and the numerical solution of high-dimensional partial differential equations; see e.g. [2–6] and references therein.

The success of ANNs relies on the fact that they have good representation power. The universal approximation property of neural networks is well known: neural networks with one hidden layer of continuous/monotonic sigmoid activation functions are dense in the continuous function space C([0,1]^d) and in L^1([0,1]^d); see e.g. [7–9] for different proofs in different settings. Actually, for neural networks with non-polynomial C^∞ activation functions, the upper bound of the approximation error is of spectral type even when using only one hidden layer, i.e. an error rate ε = n^{-k/d} can be obtained theoretically for approximating functions in the Sobolev space W^k([-1,1]^d), where d is the number of dimensions and n is the number of hidden nodes in the neural network [10]. However, it is believed that one of the basic reasons behind the success of DNNs is the fact that deep neural networks have broader scopes of representation than shallow ones. Recently, several works have demonstrated or proved this in different settings. For example, by using a composition function argument, Poggio et al. [11] showed that deep networks can avoid the curse of dimensionality for an important class of problems corresponding to compositional functions. In the general function approximation aspect, it has been proved by Yarotsky [12] that DNNs using rectified linear units (abbr. ReLU, a non-smooth activation function defined as σ1(x) := max{0,x}) need at most O(ε^{-d/k} log(1/ε)) units and nonzero weights to approximate functions in the Sobolev space W^{k,∞}([-1,1]^d) within error ε. This is similar to the results for shallow networks with one hidden layer of C^∞ activation units, but only optimal up to a O(log|ε|) factor. Similar results for approximating functions in W^{k,p}([-1,1]^d) with p<∞ using ReLU DNNs are given by Petersen and Voigtlaender [13]. The significance of the works by Yarotsky [12] and Petersen and Voigtlaender [13] is that by using a very simple rectified nonlinearity, DNNs can obtain high-order approximation properties. It is also proved by E and Wang [14] that thin and deep ReLU networks can approximate analytic functions exponentially fast. Shallow networks do not possess such a property. Other works showing that deeper ReLU DNNs have better approximation properties include the work by He et al. [15] and the work by Opschoor et al. [16], which relate ReLU DNNs to finite element methods.

A basic fact used in the error estimates given in [12] and [13] is that x^2 and xy can be approximated by a ReLU network with O(log|ε|) layers. To remove this approximation error and the extra factor O(log|ε|) in the size of the neural networks, we proposed to use rectified power units (RePU) to construct exact neural network representations of polynomials [17]. The RePU function is defined as

σ_s(x) := x^s for x ≥ 0, and σ_s(x) := 0 for x < 0,

where s is a non-negative integer. When s=0, we have the Heaviside step function; when s=1, we have the commonly used ReLU function σ1. We call σ2 and σ3 the rectified quadratic unit (ReQU) and the rectified cubic unit (ReCU), for s=2,3 respectively. Note that some pioneering work has been done by Mhaskar and his coworkers (see e.g. [18], [19]) to give a theoretical upper bound of DNN function approximations by converting splines into RePU DNNs. However, for very smooth functions, their constructions of neural networks are not optimal and may not be numerically stable. The error bound obtained is quasi-optimal due to an extra log(k) factor, where k is related to the smoothness of the underlying functions. The extra log(k) factor was removed in our earlier work [17] by introducing explicit, optimal and stable constructions of ReQU networks to exactly represent polynomials. In this paper, we extend the results to deep networks using general RePUs with s>2.
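For concreteness, the following minimal Python sketch (our own illustration, not code from the paper) evaluates the RePU activation σ_s for s ≥ 1; with s=1 it reduces to ReLU and with s=2 to ReQU.

```python
import numpy as np

def repu(x, s=2):
    """Rectified power unit sigma_s(x) = max(0, x)**s (s=1 gives ReLU, s=2 gives ReQU)."""
    return np.maximum(0.0, x) ** s

# Example: unlike ReLU, ReQU is continuously differentiable at the origin.
x = np.linspace(-1.0, 1.0, 5)
print(repu(x, s=1))  # ReLU values
print(repu(x, s=2))  # ReQU values
```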

Compared with two other constructive approaches (the Qin Jiushao algorithm and the first-composition-then-combination method used in [18], [19], etc.), our constructions of RePU neural networks to represent polynomials are optimal in the number of network layers and hidden nodes. To approximate general smooth functions, we first approximate the function by its best polynomial approximation, then convert the polynomial approximation into a RePU network with optimal size. The conclusion of algebraic convergence for W^{k,2} functions and exponential convergence for analytic functions then follows straightforwardly. For multi-dimensional problems, we use the concept of sparse grids to improve the error estimates of neural networks and lessen the curse of dimensionality.

The main advantage of the ReLU function is that ReLU DNNs are relatively easier to train than DNNs using other analytic sigmoidal activation units in traditional applications; the latter suffer from the well-known, severe gradient vanishing phenomenon. However, ReLU networks have some limitations. For example, because the derivatives of a ReLU network function are not continuous, ReLU networks are hard to train when the loss function contains high-order derivatives of the network; in such cases, network functions with higher-order smoothness are desired. This gives some hints on why ReQU networks are used in the deep Ritz method recently proposed by E and Yu [20] to solve partial differential equations (PDEs).

The remaining part of this paper is organized as follows. In Section 2 we first show how to realize univariate polynomials and approximate smooth functions using RePU networks. Then we construct RePU network realizations of multivariate polynomials and general multivariate smooth functions in Section 3, with extensions to high-dimensional functions in sparse spaces given in Subsection 3.3. In Section 4, we present some preliminary numerical results to verify the numerical accuracy and stability of the proposed methods to construct RePU neural networks. A short summary is given in Section 5.

        2 Approximation of univariate smooth functions

        2.1 Basic properties of RePU networks

Our analysis relies upon the following fact: x, x^2, ..., x^s and xy can all be realized by a one-hidden-layer σs neural network with a small number of coefficients, as presented in the following lemma.
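As a minimal illustration of the flavor of this fact in the ReQU case (s=2), the well-known identities x^2 = σ2(x) + σ2(-x) and xy = [σ2(x+y) + σ2(-x-y) - σ2(x-y) - σ2(y-x)]/4 realize x^2 and xy exactly with a single hidden layer. The sketch below uses these identities only; the general-s coefficients of the paper's Lemma 2.1 are not reproduced here.

```python
import numpy as np

def requ(x):
    return np.maximum(0.0, x) ** 2  # sigma_2

def square_via_requ(x):
    # x^2 = sigma_2(x) + sigma_2(-x): one hidden layer with 2 ReQU units
    return requ(x) + requ(-x)

def product_via_requ(x, y):
    # xy = [(x+y)^2 - (x-y)^2]/4, each square realized as above: 4 ReQU units
    return (requ(x + y) + requ(-x - y) - requ(x - y) - requ(y - x)) / 4.0

rng = np.random.default_rng(0)
x, y = rng.standard_normal(), rng.standard_normal()
assert np.isclose(square_via_requ(x), x**2)
assert np.isclose(product_via_requ(x, y), x * y)
```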

Figure 1: The growth of the l∞ condition number of the Vandermonde matrices V_s corresponding to different sets of nodes {b_k, k=1,...,s}. The data for optimal symmetric nodes and optimal non-negative nodes are from [22].

Remark 2.1. The inverse of a Vandermonde matrix is inevitably involved in the solution of (2.17), which makes the formula (2.11) difficult to use for large s due to the geometric growth of the condition number of the Vandermonde matrix [21–23]. The condition numbers of the s×s Vandermonde matrices with three different choices of symmetric nodes are given in Figure 1. The three choices of symmetric nodes are Chebyshev nodes

and numerically calculated optimal nodes. The counterparts of these three different choices for non-negative nodes are also depicted in Figure 1. Most of the results are from [22]. For large s the numerical procedure to calculate the optimal nodes may not succeed. However, the growth rate of the l∞ condition number of Vandermonde matrices using Chebyshev nodes on [-1,1] is close to that of the optimal case, so we use the Chebyshev nodes (2.18) for large s. For smaller values of s, we use the numerically calculated optimal nodes, which are given for 2≤s≤6 in [21]:

Note that, in some special cases, if non-negative nodes are used, the number of activation functions in the network construction can be reduced. However, since the condition number in this case is larger than in the case of symmetric nodes, we will not consider using all non-negative nodes in this paper.
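The growth shown in Figure 1 is easy to reproduce numerically. The sketch below assumes the standard Chebyshev nodes b_k = cos((2k-1)π/(2s)) on [-1,1] (the precise node sets and matrix convention of [21,22] may differ) and computes the induced l∞ condition number of the s×s Vandermonde matrix.

```python
import numpy as np

def linf_condition_number_vandermonde(s):
    """l_inf condition number of the s x s Vandermonde matrix on Chebyshev nodes."""
    k = np.arange(1, s + 1)
    b = np.cos((2 * k - 1) * np.pi / (2 * s))        # Chebyshev nodes on [-1, 1]
    V = np.vander(b, N=s, increasing=True)           # V[i, j] = b_i ** j
    norm = lambda A: np.max(np.sum(np.abs(A), axis=1))  # induced l_inf matrix norm
    return norm(V) * norm(np.linalg.inv(V))

for s in (2, 4, 8, 16):
    print(s, linf_condition_number_vandermonde(s))   # grows roughly geometrically in s
```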

Based on Lemma 2.1, one can easily obtain the following results.

Corollary 2.1. A univariate polynomial of degree up to s can be exactly represented by a neural network with one hidden layer of 2s activation nodes. More precisely, by (2.11), we have

In the implementation of polynomials, operations of the form x^n y are frequently involved. The following lemma asserts that x^n y, 0≤n≤s-1, can be realized using only one hidden layer.

Lemma 2.2. Bivariate monomials x^n y, 0≤n≤s-1, can be realized as a linear combination of at most u_n activation units of σs(·) as

A graphical representation of this realization is sketched in Figure 2(f). Obviously, the numbers of nonzero weights in the first-layer and second-layer affine transformations are 3u_n and u_n, respectively.

Figure 2: Some shallow neural networks used as building bricks of the RePU DNNs. Here circles represent hidden nodes; squares represent input, output and intermediate variables. A "+" sign inside a circle or a square represents a nonzero bias.

        2.2 Optimal realizations of polynomials by RePU networks with no error

The basic properties of σs given in Lemma 2.1 and Lemma 2.2 can be used to construct neural network representations of any monomial and polynomial. We first present the result for monomials.

For x^n with 1≤n≤s, by Lemma 2.1, the numbers of layers, hidden units and nonzero weights required in a σs network to realize it are no more than 2, 2s and 6s+1, respectively. For n>s, we have the following theorem.

Figure 3: Sketch of a σs network realization of x^n. Here (k), k=1,...,m on the top part denotes the intermediate variables of the k-th hidden layer (the quantities beneath (k)).

Remark 2.2. It is easy to check that for any neural network with only one hidden σs layer, the corresponding neural network function is a piecewise polynomial of degree at most s, and for any neural network with k hidden σs layers, the corresponding network function is a piecewise polynomial of degree at most s^k. So x^n cannot be exactly represented by a σs neural network with fewer than ⌈log_s n⌉ hidden layers.

Remark 2.3. The detailed procedure presented in Lemma 2.1 and Theorem 2.1 is implemented in Algorithm 2.1. Note that this algorithm generates a σs DNN that represents the monomial x^n with the least (optimal) number of hidden layers. For large n and s, the numbers of nodes and nonzero weights in the network are of order O(s^2 log_s n) and O(s^4 log_s n), respectively, which are not optimal. To lessen the size of the constructed network for large s, one may implement the quantity in (2.33) in two steps. According to Lemma 2.1 and Lemma 2.2, this reduces both the number of nodes and the number of nonzero weights in the overall network but adds one more hidden layer. To keep the paper concise, we will not present the detailed implementation of this approach here. Instead, we will describe this approach in the σs network realization of polynomials.
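The depth count ⌈log_s n⌉ can be understood through the base-s expansion of n. The conceptual sketch below is an arithmetic illustration of the idea behind the construction sketched in Figure 3, not the paper's Algorithm 2.1: at each stage the running power x^{s^j} is obtained by one more s-th powering (one σs layer, Lemma 2.1), and the partial product is updated by an x^d·y unit with 0 ≤ d ≤ s-1 (Lemma 2.2).

```python
def monomial_in_stages(x, n, s):
    """Compute x**n via the base-s digits of n; the number of powering stages is
    about ceil(log_s(n)), mirroring the layer count of the network construction."""
    digits = []
    q = n
    while q > 0:
        digits.append(q % s)       # d_j in n = sum_j d_j * s**j, with 0 <= d_j <= s-1
        q //= s
    power, prod = x, 1.0
    for j, d in enumerate(digits):
        if d > 0:
            prod *= power ** d     # an x^d * y operation (Lemma 2.2)
        if j + 1 < len(digits):
            power = power ** s     # one sigma_s powering stage (Lemma 2.1)
    return prod

assert abs(monomial_in_stages(1.3, 10, 2) - 1.3**10) < 1e-12
```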

Now we consider converting univariate polynomials into σs networks. If we directly apply Lemma 2.1 and Theorem 2.1 to each monomial term of a polynomial of degree n and then combine them together, the resulting network is not optimal in terms of network size. Fortunately, there are several other ways to realize polynomials. Next, we first discuss two straightforward constructions. The first one is a direct implementation of Horner's method (also known as Qin Jiushao's algorithm):

To describe the algorithm iteratively, we introduce the following intermediate variables

Then we have y_0 = f(x). By implementing y_k iteratively using the realizations given in Lemmas 2.1 and 2.2 and stacking the implementations up, we obtain a σs neural network with n layers, in which each hidden layer has 4(s-1) activation units.
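For reference, a plain-arithmetic sketch of the Horner recursion is given below, assuming the standard convention y_n = a_n, y_k = x·y_{k+1} + a_k for the coefficients a_0,...,a_n (which is what we take the intermediate variables above to be). In the network version, each update y_k = x·y_{k+1} + a_k is realized by the one-hidden-layer units of Lemmas 2.1 and 2.2, one hidden layer per step.

```python
def horner(coeffs, x):
    """Evaluate p(x) = a_0 + a_1*x + ... + a_n*x**n, given coeffs = [a_0, ..., a_n]."""
    y = coeffs[-1]                  # y_n = a_n
    for a in reversed(coeffs[:-1]):
        y = x * y + a               # y_k = x * y_{k+1} + a_k
    return y                        # y_0 = p(x)

assert horner([1.0, -2.0, 3.0], 0.5) == 1.0 - 2.0 * 0.5 + 3.0 * 0.25
```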

The second way is the method used by Mhaskar and his coworkers (see e.g. [18,24]), which is based on the following proposition [25,26].

Proposition 2.1. Let m≥0, d≥1 be integers. Then every polynomial in d variables with total degree not exceeding m can be written as a linear combination of quantities of the form

Remark 2.4. Horner's method and Mhaskar's method have different properties. The first one is optimal in the number of nodes but uses too many hidden layers; the latter is optimal in the number of hidden layers, but its number of nodes is not optimal. Another issue with the latter approach is that one has to calculate the coefficients c_k, ω_k, b_k in (2.40), which is not an easy task. Note that, when d=1, Proposition 2.1 is equivalent to Lemma 2.1 and Corollary 2.1, from which we see that one needs to solve a Vandermonde system to obtain the coefficients. The Vandermonde matrix is known to have a very large condition number for large systems. A way to avoid solving a Vandermonde system is demonstrated in the proof of Lemma 2.2. However, from the explicit formulas given in (5.5)-(5.6), we see that when s is big, large coefficients with different signs coexist, which is bound to produce a large cancellation error. So, lifting the activation function from ρs to ρn directly is not a numerically stable approach.

We now propose a construction method that avoids solving large Vandermonde systems. At the same time, the networks we construct contain no very large coefficients.

Consider a polynomial p(x) of degree n greater than s. We first use a recursive procedure similar to the monomial case to construct a network with a minimal number of layers.

The above construction produces a network with m+2 layers, which is optimal. However, the numbers of nodes and nonzero weights are not optimal for large values of s. Next, we present an alternative construction in the following theorem that is optimal in both the number of layers and the number of nodes.

Theorem 2.2. If p(x) is a polynomial of degree n on R, then it can be represented exactly by a σs neural network with an optimal number of layers, and the numbers of nodes and non-zero weights are of order O(n) and O(sn), respectively.

Proof. 1) For polynomials of degree up to s, the formula (2.25) in Corollary 2.1 gives a one-hidden-layer network realization that satisfies the theorem.

2) Below, we give a realization with a much smaller number of nodes and nonzero weights by adding one more hidden layer. We describe the new construction in the following steps.

i) The first sub-network calculates z_0 = x^s and z_{0,1} = x using

where the number of nodes in this sub-network is N_1 = 2+2s.

        2.3 Error bounds of approximating univariate smooth functions

Now we analyze the error of approximating general smooth functions using RePU networks. Let Ω ⊂ R^d be the domain on which the function to be approximated is defined. For the one-dimensional case, we focus on Ω = I := [-1,1]. We denote the set of polynomials of degree up to N defined on Ω by P_N(Ω), or simply P_N. Let J_n^{α,β} be the Jacobi polynomial of degree n for n=0,1,..., which form a complete set of orthogonal bases in the weighted L^2 space with respect to the weight ω^{α,β} = (1-x)^α (1+x)^β, α,β>-1. To describe functions with high-order regularity, we define the Jacobi-weighted Sobolev space as [27]:

        3 Approximation of multivariate smooth functions

In this section, we discuss the approximation of multivariate smooth functions by RePU networks. Similar to the univariate case, we first study the representation of polynomials and then discuss the results for general smooth functions.

        3.1 Approximating multivariate polynomials

        3.2 Error bound of approximations of multivariate smooth functions

        3.3 High-dimensional smooth functions with sparse polynomial approximations

In the last section, we showed that a d-dimensional function with partial derivatives up to order m in L^2(I^d) can be approximated within error ε by a RePU neural network with complexity O(ε^{-d/m}). When m is much smaller than d, the network complexity has an exponential dependence on d. However, in many applications, high-dimensional problems may have low intrinsic dimension [29]; for those applications, we may first perform a dimension reduction, then use the σs neural network construction proposed above to approximate the reduced problem. On the other hand, for high-dimensional functions with bounded mixed derivatives, we can use sparse grid or hyperbolic cross approximation to lessen the curse of dimensionality.

        3.3.1 A brief review on hyperbolic cross approximations

We introduce the hyperbolic cross approximation by considering a tensor product function: f(x) = f_1(x_1) f_2(x_2) ··· f_d(x_d). Suppose that f_1,...,f_d have similar regularity and can be well approximated using a set of orthonormal bases {φ_k, k=1,2,...} as

where c and r ≥ 1 are constants depending on the regularity of f_i, and we write k̂ := max{1,k}. So we have an expansion for f as

Thus, to obtain the best approximation of f(x) using finitely many terms, one should take

is the hyperbolic cross index set. We call f_N defined by (3.8) a hyperbolic cross approximation of f.
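For concreteness, a minimal sketch of the hyperbolic cross index set in its usual form, X_N^d = {k ∈ N_0^d : ∏_i max(1, k_i) ≤ N} (we assume this standard definition; the normalization in (3.8) may differ), is given below.

```python
from itertools import product

def hyperbolic_cross_indices(d, N):
    """Indices k in N_0^d with prod_i max(1, k_i) <= N (standard hyperbolic cross)."""
    idx = []
    for k in product(range(N + 1), repeat=d):
        p = 1
        for ki in k:
            p *= max(1, ki)
        if p <= N:
            idx.append(k)
    return idx

# The set grows much more slowly than the full tensor-product grid of (N+1)^d indices.
print(len(hyperbolic_cross_indices(3, 8)), (8 + 1) ** 3)
```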

For general functions defined on I^d, we choose φ_k to be multivariate Jacobi polynomials, and define the hyperbolic cross polynomial space as

In practice, the exact hyperbolic cross projection is not easy to calculate. An alternative approach is sparse grids [31,32], which use hierarchical interpolation schemes to build a hyperbolic-cross-like approximation of high-dimensional functions [33,34].

        3.3.2 Error bounds of approximating some high-dimensional smooth functions

Now we discuss the RePU network approximation of high-dimensional smooth functions. Our approach is based on high-dimensional hyperbolic cross polynomial approximations. We first introduce the concept of a complete polynomial space. A linear polynomial space P_C is said to be complete if it satisfies the following: there exists a basis composed only of monomials belonging to P_C, and for any term p(x) in this basis, all of its derivatives belong to P_C. It is easy to verify that both the hyperbolic cross polynomial space and the sparse grid polynomial interpolation space (see [34,35]) are complete. For a complete polynomial space, we have the following RePU network representation result.
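The completeness of the hyperbolic cross space is easy to check numerically: differentiating a monomial lowers one exponent by one (up to a constant factor), and lowering any component of an index cannot increase the product ∏_i max(1, k_i). The small sketch below verifies this, reusing the hypothetical hyperbolic_cross_indices helper from the previous snippet.

```python
def is_complete(index_set):
    """Check: for every index k in the set, decreasing any positive component by one
    (the index of a derivative of the monomial x^k) gives an index still in the set."""
    S = set(index_set)
    for k in S:
        for i, ki in enumerate(k):
            if ki > 0 and tuple(kj - (j == i) for j, kj in enumerate(k)) not in S:
                return False
    return True

print(is_complete(hyperbolic_cross_indices(3, 8)))  # expected: True
```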

Theorem 3.4. Let P_C be a complete linear space of d-dimensional polynomials with dimension n. Then for any function f ∈ P_C, there exists a σs neural network having no more than ⋯+d hidden layers, no more than O(n) activation functions and O(sn) non-zero weights, that can represent f with no error. Here N_i is the maximum polynomial degree in the i-th dimension of P_C. We assume that n ≫ s, n ≫ d and N_i ≥ 1 for every i.

Remark 3.2. Here, we bound the weighted L^2 approximation error by using the corresponding hyperbolic cross spectral projection error estimate developed in [30]. However, the high-dimensional hyperbolic cross spectral projection is hard to calculate. In practice, we use the efficient sparse grid spectral transforms developed in [34] and [35] to approximate the projection. After a numerical network is built, one may further train it to obtain a network function that is more accurate than the sparse grid interpolation. Note that the fast sparse transform can be extended to tensor-product unbounded domains using the mapping method [36].

        4 Preliminary numerical experiments

        We present in this section some preliminary numerical results to verify that the proposed algorithms are numerically stable and efficient.

Figure 4: Numbers of hidden layers and nodes in σs (s=1,...,6) PowerNets realizing the monomial x^n within error 8×10^-15.

We first present results of realizing univariate monomials x^n within error 8×10^-15 by PowerNets using different RePUs σs in Figure 4, from which we see that larger values of s lead to fewer hidden layers, and that the number of layers in the ReLU (σ1) DNN is much larger than that of the other σs DNNs. Here the ReLU DNN is obtained by replacing every σ2 activation unit with a ReLU DNN subnet approximation in a corresponding σ2 DNN realization of x^n. From the right plot in Figure 4, we also observe that the number of nodes required in the ReLU DNN is much larger than that of the other σs DNNs, while the σs (s=2,3,4,5,6) networks have very close numbers of nodes for n>8.

In Figure 5, we present the numbers of nodes and nonzero weights in σs (s=1,...,6) PowerNets approximating the function exp(-x^2), based on polynomial approximations of different degrees. Again, we observe that σs (s>1) PowerNets use far fewer nodes and nonzero weights than the corresponding ReLU DNNs. For s≥2, the numbers of nodes and nonzero weights in the σs PowerNets have a very weak dependence on s.

In Figure 6, we present the L^2 errors of using σs (s≥2) PowerNets to approximate the 1-dimensional function exp(-x^2) and the 2-dimensional function sin(x)sin(y), based on polynomial approximations of different degrees. The errors are measured on the domains [-1,1] and [-1,1]^2 for the 1-D and 2-D problems, respectively. We observe that all the σs (s≥2) PowerNets exhibit spectral convergence like the classical polynomial approximation, which means our proposed algorithms are numerically accurate and stable.
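The spectral decay seen in Figure 6 is driven by the underlying polynomial approximation, which the PowerNet reproduces exactly by construction. A minimal sketch of that first stage is given below, using a Chebyshev least-squares fit of exp(-x^2) as a stand-in for the Jacobi/hyperbolic-cross projections used in the paper; the PowerNet conversion step is not included.

```python
import numpy as np

def chebyshev_approx_error(f, N, n_sample=2000):
    """Max error on [-1,1] of a degree-N Chebyshev least-squares fit to f
    (a proxy for the polynomial approximation error driving the PowerNet error)."""
    x = np.cos(np.pi * (np.arange(4 * N + 4) + 0.5) / (4 * N + 4))  # Chebyshev points
    coef = np.polynomial.chebyshev.chebfit(x, f(x), N)
    xs = np.linspace(-1.0, 1.0, n_sample)
    return np.max(np.abs(f(xs) - np.polynomial.chebyshev.chebval(xs, coef)))

f = lambda x: np.exp(-x**2)
for N in (4, 8, 16):
    print(N, chebyshev_approx_error(f, N))   # error decays spectrally in N
```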

        5 Summary

Figure 5: Numbers of nodes and nonzero weights in σs (s=1,...,6) PowerNets approximating the function exp(-x^2), based on polynomial approximations of different degrees.

Figure 6: The L^2 errors of σs PowerNets approximating exp(-x^2) and sin(x)sin(y).

In this paper, deep neural network realizations of univariate and multivariate polynomials using general RePUs as activation functions are proposed, together with detailed constructive algorithms. The constructed RePU neural networks have an optimal number of hidden layers and an optimal number of activation nodes. Using this construction, we also prove optimal upper error bounds for approximating smooth functions in Sobolev spaces with RePU networks. The optimality is indicated by the optimal nonlinear approximation theory developed by DeVore, Howard and Micchelli for the case where the network parameters depend continuously on the approximated function. The constructive proofs reveal a close connection between spectral methods and deep RePU network approximations.

Even though we did not apply the proposed RePU networks to any real applications in this paper, the good properties of the proposed networks suggest that they have potential advantages over other types of networks in approximating functions with good smoothness. In particular, they suit situations where the loss function contains derivatives of the network function.

        Appendix

This appendix is devoted to the proof of Lemma 2.2. We first present the following lemma, which can be proved by induction.

The lengths of those coefficient vectors are all 2(s-n)(n+1). The lemma is proved.
