1. Key Laboratory of Non-steady Aerodynamics and Flow Control of MIIT, College of Aerospace Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, P. R. China;
2. School of Mechanical Engineering, Anhui University of Technology, Maanshan 243002, P. R. China
Abstract: A graphics processing unit (GPU)-accelerated discontinuous Galerkin (DG) method is presented for solving two-dimensional laminar flows. The DG method is ported from the central processing unit (CPU) to the GPU by programming under the compute unified device architecture (CUDA) model. The CUDA kernel subroutines are designed to meet the requirements of the high order computations of the DG method. The corresponding data structures are constructed in a component-wise manner, and the thread hierarchy is manipulated in a cell-wise or edge-wise manner associated with the related integrals involved in solving the laminar Navier-Stokes equations, in which the inviscid and viscous flux terms are computed by the local Lax-Friedrichs scheme and the second scheme of Bassi & Rebay, respectively. A strong stability preserving Runge-Kutta scheme is then used for time marching of the numerical solutions. The resulting GPU-accelerated DG method is first validated on the traditional Couette flow problem with different mesh sizes and different orders of approximation, which shows that the expected orders of convergence are achieved. Numerical simulations of the typical flows over a circular cylinder and a NACA 0012 airfoil are then carried out, and the results are compared with analytical solutions or available experimental and numerical values reported in the literature, together with a performance analysis of the developed code in terms of GPU speedups. The computing times of the presented test cases are significantly reduced without loss of accuracy, and impressive speedups of up to 69.7 times are achieved by the present method in comparison with its CPU counterpart.
Key words: discontinuous Galerkin; GPU; compute unified device architecture (CUDA); Navier-Stokes equation; laminar flows
The discontinuous Galerkin (DG) method, proposed in the early 1970s[1], has become a popular high order method in recent years due to attractive features including stability, conservation and convergence[2]. Various achievements can be noted in many research fields such as computational fluid dynamics (CFD)[2-4], computational acoustics[5] and computational magneto-hydrodynamics[6]. It is reported that the computational time grows rapidly with the order of approximation of the high order DG method in comparison to low order methods, such as the traditional finite volume method (FVM) and the finite element method (FEM), which restrains the further application of the DG method in engineering[7]. Therefore, many research activities have focused on improving the efficiency of the DG method through modification and parallelization. In terms of modification, some influential schemes, like the implicit time marching scheme[3] and the multigrid scheme[4], have been successfully implemented in the DG method. Speedups of the DG method through parallelization can also be noted in much of the literature. Generally, most speedups are achieved on the central processing unit (CPU) with multiple processors[2,7-8], and only a few on the modern graphics processing unit (GPU) architecture[9-11]; hence GPU acceleration of the DG method is worth investigating, given that modern computers are mostly equipped with multiple GPUs.
Modern GPUs have hundreds of processing cores for specialized graphics rendering in parallel, which gives them orders of magnitude higher memory bandwidth and faster floating-point operations in comparison with traditional CPUs. As illustrated in Fig.1, the peak performance of NVIDIA GPUs is tens of times higher than that of contemporaneous Intel CPUs, and the gap between the NVIDIA GPU and the Intel CPU in peak performance, measured in floating point operations per second (FLOP/s), has increased over the last ten years. The fact that GPU hardware outperforms CPU hardware indicates that methods with GPU implementations have the potential to achieve high efficiency.
Fig.1 Floating point operations per second for CPU and GPU
However, in early years, parallel computing on GPU was a complicated exercise and mainly depended on low-level graphics programming. Recently, researchers have been able to use high-level programming languages thanks to the development of unified programming models, including OpenCL, OpenACC and the compute unified device architecture (CUDA). Among them, CUDA has been recognized as the most popular programming model by many research communities. In the CFD community, many low order CFD methods, like the finite difference method (FDM)[12], FVM[13], FEM[14] and the meshless method[15], have been ported from CPU to GPU under the CUDA model to achieve speedups of dozens of times.
Research activities can also be noted for GPU implementations of the high order DG method. For instance, Klöckner et al.[16] developed a 3D unstructured linear nodal DG code to solve the Maxwell equations. Siebenborn et al.[9-10] developed an explicit modal DG code to solve the 3D Euler equations. It was later modified by Karakus et al.[11] for solving incompressible laminar flows. It can be noted that the developed DG codes are of two types, nodal and modal. In general, a nodal DG scheme is often adopted together with an integral-free approach, which fits perfectly into the GPU programming model. However, the integral-free approach is not sufficient for nonlinear partial differential equations (PDEs). Therefore, one has to resort to the modal type of DG scheme for solving nonlinear PDEs like the problems considered in this paper. However, the computations of a modal DG solver are usually more complicated than those of its nodal counterpart due to the extra numerical integrations at each time step, which often leaves part of the GPU threads idle. Therefore, in order to achieve GPU speedups efficiently, appropriately designing the thread hierarchy and the related data structures for DG schemes of different approximation orders is still one of the major research problems.
In this paper, efforts are made to develop a modal type of GPU-accelerated DG method for solving two-dimensional laminar flows. Following the work of Fuhry et al.[17], which was developed for solving the two-dimensional Euler equations, the programming models of thread-per-cell and thread-per-edge are applied to solving the two-dimensional Navier-Stokes equations, in which the inviscid and viscous flux terms are computed by the local Lax-Friedrichs (LLF) scheme[18] and the second scheme of Bassi & Rebay (BR2)[3,19], respectively. A strong stability preserving (SSP) Runge-Kutta scheme[20] is then used for time marching of the numerical solutions. Thus, the thread hierarchy with the related data structures can be conveniently managed, which gives great flexibility for accommodating different DG schemes with different approximation orders. Most importantly, the DG scheme can be implemented in a unified manner that does not vary with the approximation order. The orders of convergence of the resulting GPU-accelerated DG method are first validated on the traditional Couette flow problem with different mesh sizes and different orders of approximation. Numerical simulations of the typical flows over a circular cylinder and a NACA 0012 airfoil are then carried out, and the obtained results are compared with analytical solutions or available experimental and numerical values reported in the literature, together with a performance analysis of the developed code in terms of GPU speedups.
Before implementing the DG method on GPU, we provide a brief description of the traditional DG method[3,19] for laminar flow simulations.
In this study, laminar flows are governed by the two-dimensional Navier-Stokes equations, which can be written in differential form as
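Assuming the compact divergence form consistent with the flux notation defined below, the governing system is

$$\frac{\partial \boldsymbol{U}}{\partial t}+\nabla\cdot\boldsymbol{F}_i(\boldsymbol{U})-\nabla\cdot\boldsymbol{F}_v(\boldsymbol{U},\nabla\boldsymbol{U})=\boldsymbol{0} \qquad (1)$$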
where U = (ρ, ρu, ρv, ρE)^T is the vector of conserved variables; F_i(U) and F_v(U, ∇U) are the inviscid and viscous flux terms, respectively, with
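In their standard form, with the shear stresses τ_ij and heat conduction terms q_j as defined below, the flux terms may be written as

$$\boldsymbol{F}_i(\boldsymbol{U})=\begin{pmatrix}\rho u & \rho v\\ \rho u^{2}+p & \rho u v\\ \rho u v & \rho v^{2}+p\\ u(\rho E+p) & v(\rho E+p)\end{pmatrix},\qquad \boldsymbol{F}_v(\boldsymbol{U},\nabla\boldsymbol{U})=\begin{pmatrix}0 & 0\\ \tau_{xx} & \tau_{xy}\\ \tau_{yx} & \tau_{yy}\\ u\tau_{xx}+v\tau_{xy}+q_x & u\tau_{yx}+v\tau_{yy}+q_y\end{pmatrix}$$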
where ρ is the fluid density; u, v are the velocity components along the x and y axes, respectively; E is the total energy per unit mass and p the pressure. Assuming the fluid is a perfect gas, the equation E = p/[(γ-1)ρ] + (u² + v²)/2 is satisfied, where γ is the ratio of specific heats; for air, γ = 1.4. In the viscous flux term, the components of shear stress and heat conduction are defined as
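Under the Stokes hypothesis, a standard definition consistent with the notation used here is

$$\tau_{ij}=\mu\left(\frac{\partial u_i}{\partial x_j}+\frac{\partial u_j}{\partial x_i}-\frac{2}{3}\,\delta_{ij}\,\frac{\partial u_k}{\partial x_k}\right),\qquad q_j=\frac{\mu C_p}{Pr}\,\frac{\partial T}{\partial x_j}$$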
where the subscripts i and j indicate the x and y directions, respectively; Pr is the Prandtl number, taken as 0.72 for laminar flows; and C_p is the specific heat capacity at constant pressure. The viscosity coefficient μ is calculated by the Sutherland formulation
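In its usual form, the Sutherland law reads

$$\mu=\mu_0\left(\frac{T}{T_0}\right)^{3/2}\frac{T_0+S}{T+S}$$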
where T_0 = 288.15 K, μ_0 = 1.716×10⁻⁵ kg/(m·s) and S = 110.4 K for air.
The DG method used in this paper is the well-known BR2 scheme[3,19]. To apply the BR2 scheme, Eq.(1) needs to be reformulated by replacing the gradient of the solution ∇U with an additional independent unknown Θ. Thus, Eq.(1) can be reshaped as
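A first-order system consistent with this substitution is

$$\boldsymbol{\Theta}=\nabla\boldsymbol{U} \qquad (5)$$
$$\frac{\partial \boldsymbol{U}}{\partial t}+\nabla\cdot\boldsymbol{F}_i(\boldsymbol{U})-\nabla\cdot\boldsymbol{F}_v(\boldsymbol{U},\boldsymbol{\Theta})=\boldsymbol{0} \qquad (6)$$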
The traditional DG approach is then applied to Eqs.(5) and (6), resulting in Eqs.(7) and (8) for cell Ω_e
where U_h is the polynomial approximation of the numerical solution and p the approximation order. U_h can be written as
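In the modal form adopted here, the expansion within each cell takes the form, with C_k(t) the modal coefficients (stored later in the array C of Section 2.2.1),

$$\boldsymbol{U}_h(\boldsymbol{x},t)=\sum_{k=1}^{N_p}\boldsymbol{C}_k(t)\,v_k(\boldsymbol{x}) \qquad (9)$$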
where v_k(x), the so-called basis or test functions, form a basis of the polynomial space P^k, and N_p = (p+1)(p+2)/2 is the total number of these functions. The Taylor basis functions[20] are chosen in the present article for their simplicity.
As there is no global continuity requirement for U_h in the DG method, the numerical fluxes H_aux, H_i and H_v in the edge integrals ①, ② and ③ need to be defined to handle this discontinuity
where (·)^+ and (·)^- denote the traces from the exterior and the interior of the cell, respectively. The inviscid numerical flux H_i is taken as the widely used LLF scheme[18] in this paper for its stability and simplicity. The auxiliary and viscous numerical fluxes, H_aux and H_v, proposed by Bassi and Rebay in their BR2 scheme, are written as
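In the standard BR2 form, with {·} denoting the average of the two traces and with r_e and η the local lifting operator and stabilization factor introduced below, these fluxes take the form

$$\boldsymbol{H}_{aux}=\{\boldsymbol{U}_h\}=\frac{1}{2}\left(\boldsymbol{U}_h^{+}+\boldsymbol{U}_h^{-}\right),\qquad \boldsymbol{H}_{v}=\left\{\boldsymbol{F}_v\!\left(\boldsymbol{U}_h,\nabla\boldsymbol{U}_h+\eta\,\boldsymbol{r}_e\right)\right\}$$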
After inserting the centered flux H_aux and integrating by parts for integral ①, Eq.(7) can be transformed as
It can be noted that the auxiliary variable Θ is the sum of the solution gradient ∇U and a global lifting operator R, i.e., Θ = ∇U + R, where R = Σ_e r_e and the r_e are local lifting operators related only to the edge e
A stabilization factor η is also introduced in the local lifting operator and is usually taken as the number of edges in each cell. Thus, Eq.(8) can be written as
After several mathematical manipulations, the final semi-discrete formulation of the DG method can be written as
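Schematically, with C_e the vector of modal coefficients of cell Ω_e and M_e the corresponding mass matrix, the semi-discrete system referred to as Eq.(18) takes the generic form

$$\boldsymbol{M}_e\,\frac{\mathrm{d}\boldsymbol{C}_e}{\mathrm{d}t}=\boldsymbol{R}_e(\boldsymbol{C}) \qquad (18)$$

where the right-hand side R_e collects the cell (volume) integral and edge (surface) integral contributions evaluated in Section 2.2.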
To simplify the calculations, all integrals are solved using Gaussian quadrature rules and computed in canonical cells[17] by establishing map functions between arbitrary cells in the mesh and the canonical triangle Γ = {−1 ≤ r, s ≤ 1, r + s ≤ −1} or the canonical quadrangle Γ = {−1 ≤ r ≤ 1, −1 ≤ s ≤ 1}, on which the Gaussian quadrature rules are known. These map functions Ψ_e: Ω_e → Γ are created to connect the physical coordinates x = (x, y) and the canonical-cell coordinates r = (r, s) for each cell Ω_e. Thus, integrations of any function in the physical space Ω_e are evaluated as
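For a generic integrand f, the mapping and its Gaussian quadrature approximation take the standard form (with r_q and w_q the quadrature points and weights on Γ)

$$\int_{\Omega_e} f(\boldsymbol{x})\,\mathrm{d}\Omega=\int_{\Gamma} f\big(\boldsymbol{x}(\boldsymbol{r})\big)\,J\,\mathrm{d}\Gamma\approx\sum_{q} w_q\,f\big(\boldsymbol{x}(\boldsymbol{r}_q)\big)\,J(\boldsymbol{r}_q)$$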
where DΨ_e is the Jacobian matrix of the mapping and J = |DΨ_e| is the Jacobian determinant, which stays positive when the edges of the cell are ordered counterclockwise. This method is suitable for cells with straight edges, but for cells with curved edges special treatment is usually needed. A strategy for curved cells from the work of Lübon et al.[21] is adopted for treating the curved triangular and quadrilateral cells involved in the present work.
The time integration of the semi-discrete system in Eq.(18) is accomplished by the SSP Runge-Kutta scheme[22], which is stable for a Courant number less than or equal to 1/(2p+1). In the present work, this scheme is adopted and the Courant number is fixed at 1/(2p+1) for all the test cases.
Among the available unified programming models, CUDA is recognized as the most popular one since it provides researchers with a high-level programming language for GPU computing. In the present work, the CUDA C programming model is used for the GPU implementation of the DG method. Some techniques involved in the GPU implementation, including the CUDA subroutines, the data structures and the thread hierarchy, will be discussed in this section after a brief introduction of the CUDA C programming model.
In the present work, an NVIDIA GTX TITAN GPU together with the CUDA C programming model[22] is employed for developing the GPU codes. The GTX TITAN GPU is based on the Kepler architecture, which contains a total of 14 streaming multiprocessors (SMs), each with 192 CUDA cores. In the CUDA programming model, when a parallel task (usually a kernel function) is executed on the GPU, a two-level thread hierarchy is created: All threads are organized into a set of thread blocks, and all of these thread blocks are then gathered into a thread grid, as shown in Fig.2. For a specified grid, each block has the same number of threads, which are sent to different cores of the same SM and executed in parallel.
Fig.2 Description of thread hierarchy
The efficiency of GPU parallelization is related to the usage of the memories available on the GPU architecture. The global memory, the registers and the constant memory need to be managed carefully due to their different access speeds. The slow-access global memory, located off the chip, has the capacity to store the major data but may introduce large latencies. Therefore, the global memory is managed by minimizing the number of accesses to it. Specifically, global transactions are coalesced by instructing nearby threads to access the same blocks of memory. Each thread has its own private registers located on the fast-access GPU chip, as shown in Fig.3. Although the registers available to each thread are limited, their capacity is used as much as possible thanks to their fast access. As usual, the low-capacity constant memory, which is reported to be almost as fast as the registers when all threads in a warp access the same location[23], is used to store a small amount of frequently accessed constant data.
Fig.3 Types of memories and their locations
As mentioned above, the GPU is ideally suited for computations that can be run on numerous data simultaneously in parallel, and performs poorly when dealing with logical judgements and branch structures. Therefore, the computing tasks to be executed on the GPU should be carefully selected. For the DG method described in Section 1.2, the computing tasks associated with the Runge-Kutta time-marching procedure are the most time-consuming part, and hence are ported to the GPU with the use of CUDA C, while the other parts related to pre- or post-processing are kept on the CPU. As shown in Fig.4, according to the Runge-Kutta scheme and the specific DG discretization in space, the time-marching procedure of the DG method has been split into a set of sub-procedures, which are calculated by kernel subroutines in the GPU implementation. Specifically, the subroutine DT is assigned to update the value of the time step; the subroutine ConVars computes the conserved variables appearing in Eq.(9); the subroutine Grad computes the gradients of the conservative variables based on the BR2 scheme; the subroutine Flux computes the numerical fluxes, including the LLF flux and the viscous numerical flux appearing in Eq.(14); the subroutine RHS computes the right-hand side of Eq.(18); the subroutine UP updates the solution based on the Runge-Kutta scheme; and the subroutine Res evaluates the computational residuals.
Fig.4 A general procedure of GPU-based DG solver
As discussed in Section 2.1, the registers available in each thread block are limited; thus, the subroutines mentioned above are further split into low-level kernels. Corresponding cell-based and edge-based kernels are then designed to compute the quantities related to cells and edges. For instance, the subroutine RHS consists of three kernels: Vol_Rhs_Kernel, Surf_Rhs_Kernel and MulMtx_Rhs_Kernel. Specifically, Vol_Rhs_Kernel is cell-based and used to compute the cell integrals appearing in Eq.(18); Surf_Rhs_Kernel is edge-based and developed to evaluate the edge integrals in Eq.(18). The cell and edge integrals are then combined and multiplied by the inverse of the mass matrix M in the cell-based kernel MulMtx_Rhs_Kernel. Listing 1 gives the corresponding code snippet of subroutine RHS.
Listing 1  Code snippet of subroutine RHS
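The listing can be summarized by the following minimal host-side sketch; the launch configuration, the counters nCells and nEdges, the block size nThreads and the empty kernel bodies are illustrative assumptions rather than the authors' exact interface.

// Kernel stubs; the actual bodies and argument lists follow Algorithms 1-3.
__global__ void Vol_Rhs_Kernel()    { /* cell integrals of Eq.(18)           */ }
__global__ void Surf_Rhs_Kernel()   { /* edge integrals of Eq.(18)           */ }
__global__ void MulMtx_Rhs_Kernel() { /* multiply by the inverse mass matrix */ }

// Host-side subroutine RHS: assemble the right-hand side of Eq.(18)
void RHS(int nCells, int nEdges, int nThreads)
{
    int cellBlocks = (nCells + nThreads - 1) / nThreads;   // thread-per-cell grids
    int edgeBlocks = (nEdges + nThreads - 1) / nThreads;   // thread-per-edge grid

    Vol_Rhs_Kernel  <<<cellBlocks, nThreads>>>();   // cell-based kernel
    Surf_Rhs_Kernel <<<edgeBlocks, nThreads>>>();   // edge-based kernel
    MulMtx_Rhs_Kernel<<<cellBlocks, nThreads>>>();  // cell-based kernel
}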
Kernels are developed in the same way for the remaining subroutines, such as ConVars mentioned above. To make the overall structure clear, the main subroutine of the Runge-Kutta time-marching procedure is given in Listing 2.
Listing 2  Code snippet of subroutine SSP_RKDG
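A minimal sketch of the corresponding time-marching driver is given below, assuming an explicit SSP Runge-Kutta loop with nStages stages; the subroutine interfaces, the step and stage counters and the empty bodies are illustrative assumptions only.

// Host-side wrappers around the CUDA kernels described above (bodies omitted).
void DT()          { /* update the time step from the CFL condition         */ }
void ConVars()     { /* conserved variables at the integral points, Eq.(9)  */ }
void Grad()        { /* gradients of the conserved variables (BR2 lifting)  */ }
void Flux()        { /* LLF and BR2 viscous numerical fluxes, Eq.(14)       */ }
void RHS()         { /* right-hand side of Eq.(18); see Listing 1           */ }
void UP(int stage) { /* SSP Runge-Kutta stage update                        */ }
void Res()         { /* evaluate the computational residuals                */ }

// Subroutine SSP_RKDG: explicit SSP Runge-Kutta time marching
void SSP_RKDG(int nSteps, int nStages)
{
    for (int n = 0; n < nSteps; ++n) {
        DT();                                   // new time step
        for (int s = 0; s < nStages; ++s) {     // Runge-Kutta stages
            ConVars();
            Grad();
            Flux();
            RHS();
            UP(s);
        }
        Res();                                  // monitor convergence
    }
}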
It can be noted that the whole program is organized around these clearly separated computational subroutines, which gives great flexibility for possible extensions: updated schemes can be accommodated by replacing the corresponding modules.
2.2.1 Data structure and memory management
As mentioned above, coalesced memory access is helpful for reducing the number of accesses to the global memory. In order to maximize coalesced memory access, the data structures are designed in a component-wise manner. Thus, data related to the same component can be stored contiguously. For example, the coefficients of the same order, which need to be fetched simultaneously, can be placed side by side in the solution coefficient array C (Eq.(9)), as shown in Fig.5. Each element of array C denotes the coefficient of cell i corresponding to the basis function v_j of the m-th conservative variable. Therefore, nearby threads access nearby addresses, as illustrated in Fig.6.
Fig.5 Resulting storage pattern of array C
Fig.6 Thread access pattern of array C
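As an illustration of this component-wise layout, a hypothetical index function for array C is sketched below; the index order (variable m, basis function j, cell i) and all names are assumptions made for this example only.

// Flattened index of the coefficient of cell i, basis function j and
// conservative variable m. Coefficients sharing j and m are contiguous in i,
// so neighbouring threads (one thread per cell) access neighbouring addresses
// and the global-memory transactions coalesce.
__device__ __forceinline__
int cIndex(int i, int j, int m, int nCells, int Np)
{
    return (m * Np + j) * nCells + i;
}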
Besides, constant data that are used frequently are stored in the constant memory to improve memory throughput. Specifically, the Taylor basis functions v_k and their gradients ∇v_k, together with the coefficients of the Gaussian quadrature rules in the canonical cells, are such constant data in the present GPU implementation, and they are all stored in constant memory. With the data structures developed and the memory management mentioned above, the related quantities, like the numerical fluxes and residuals, are computed by specific GPU kernels, as listed in Listing 1, which will be addressed in the next section.
2.2.2 Key kernels
In order to give a deeper understanding of the kernels constructed in Section 2.2, three representative key kernels are described in detail. The locations of the related variables in the different memories are listed in Table 1.
Table 1 Notations of variables
The first kernel, namely Flux_Kernel, is edge-based and used to calculate the numerical fluxes at the edge integral points. As shown in Algorithm 1, the thread index id is calculated in line 2, and the upper bound is checked in line 3 to avoid possible mistakes caused by computations beyond the bound. After that, a loop over the integral points is carried out in four steps. In Step 1, the necessary data in the global memory are fetched and assigned to the thread-private registers. The boundary conditions are enforced in Step 2. Notice that branch structures exist because of the different boundary treatments. In order to reduce inefficient branching among neighboring threads of the same thread warp (which strictly executes the same instructions synchronously, as detailed in Ref.[23]), a simple sorting of edges is performed during pre-processing to gather boundary edges of the same type together, which reduces the number of divergent branches to no more than N_type, the number of boundary types in the computational domain. In Step 3, the inviscid and viscous numerical fluxes are calculated by the specific schemes. Finally, the obtained values are sent back to the global memory in Step 4.
Algorithm 1  __global__ void Flux_Kernel()
Input  Global memory: conserved variables at the edge integral points Wf; gradients of the conserved variables at the edge integral points dxWf, dyWf; connectivity matrices L, R, Lpoint, Rpoint; normal vectors nx, ny.
Output  Flux at the edge integral points.
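The body of Algorithm 1 can be illustrated by the following self-contained sketch of its inviscid part: one thread per edge, with the LLF flux evaluated at every edge integral point. The component-wise storage layout, all array names, and the omission of Step 2 (boundary treatment) and of the BR2 viscous contribution are simplifying assumptions of this sketch.

#include <math.h>
#define GAMMA 1.4

// Normal Euler flux F(u)·n and local wave speed |Vn| + c of a state u.
__device__ void normalFlux(const double u[4], double nxe, double nye,
                           double f[4], double *lam)
{
    double rho = u[0], vx = u[1] / rho, vy = u[2] / rho;
    double p   = (GAMMA - 1.0) * (u[3] - 0.5 * rho * (vx * vx + vy * vy));
    double vn  = vx * nxe + vy * nye;
    f[0] = rho * vn;
    f[1] = u[1] * vn + p * nxe;
    f[2] = u[2] * vn + p * nye;
    f[3] = (u[3] + p) * vn;
    *lam = fabs(vn) + sqrt(GAMMA * p / rho);
}

// One thread per edge: LLF flux at each edge integral point (Steps 1, 3, 4).
__global__ void Flux_Kernel(int nEdges, int nGauss,
                            const double *WfL, const double *WfR,  // interior/exterior traces
                            const double *nx,  const double *ny,   // unit normals per edge
                            double *Hf)                            // numerical flux (output)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;   // line 2: thread index = edge index
    if (id >= nEdges) return;                         // line 3: upper-bound check

    for (int g = 0; g < nGauss; ++g) {                // loop over edge integral points
        int base = id * nGauss + g, stride = nEdges * nGauss;

        // Step 1: fetch the two traces from global memory into registers
        double uL[4], uR[4];
        for (int m = 0; m < 4; ++m) {
            uL[m] = WfL[m * stride + base];
            uR[m] = WfR[m * stride + base];
        }

        // Step 3: local Lax-Friedrichs flux H = 0.5(Fn(uL)+Fn(uR)) - 0.5 lam (uR - uL)
        double fL[4], fR[4], lamL, lamR;
        normalFlux(uL, nx[id], ny[id], fL, &lamL);
        normalFlux(uR, nx[id], ny[id], fR, &lamR);
        double lam = fmax(lamL, lamR);

        // Step 4: send the obtained values back to global memory
        for (int m = 0; m < 4; ++m)
            Hf[m * stride + base] = 0.5 * (fL[m] + fR[m]) - 0.5 * lam * (uR[m] - uL[m]);
    }
}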
The second kernel, namely Vol_Rhs_Kernel, is cell-based and used to calculate the cell integral contributions to the right-hand side of Eq.(18). As shown in Algorithm 2, the calculations are accomplished in three steps. In Step 1, a private array tRHSv is allocated in each thread. Accesses that would otherwise go to the global memory are served by this private array, so the number of global memory accesses is greatly reduced, which makes high speedups possible. Then, in Step 2, the calculation loop of the array is performed over each integral point in two sub-steps. Specifically, for each integral point within the loop, the related data, which previously reside in the global memory, are transferred to the registers of the thread in Step 2.1. With the data of the point available, the calculations associated with the array are carried out in Step 2.2. After that, the obtained values of the array are finally stored in the global memory in Step 3.
Algorithm 2  __global__ void Vol_Rhs_Kernel()
Input  Global memory: conserved variables at the cell integral points Wv; gradients of the conserved variables at the cell integral points dxWv, dyWv; geometry factors at the cell integral points rx, ry, sx, sy, Ja.
Constant memory: basis functions v at the cell integral points and their gradients vr, vs; weighted coefficients w at the cell integral points.
Output  Cell integral contribution.
The last kernel, Surf_Rhs_Kernel, which is edge-based, is designed to compute the edge integral contributions to the right-hand side of Eq.(18). As shown in Algorithm 3, two thread-private arrays, tRHSL and tRHSR, are allocated in Step 1. Then, in Step 2, a loop is performed to compute the values of these arrays over each integral point, which is carried out in two sub-steps similar to lines 9 to 16 of Algorithm 2. Finally, the obtained values of the arrays are sent to the global memory. Notice that a race condition[23] may occur in Step 3. Therefore, the atomic operator atomicAdd[23] is used in this kernel to bypass this problem.
Algorithm 3  __global__ void Surf_Rhs_Kernel()
Input  Global memory: flux at the edge integral points; geometry factors Js; connectivity matrices L, R, Lpoint, Rpoint.
Constant memory: basis functions v at the edge integral points; weighted coefficients w at the edge integral points.
Output  Edge integral contribution.
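Step 3 of Algorithm 3 can be illustrated by the following sketch, which scatters the thread-private arrays into a global right-hand-side array; the cell-major layout, the array names and the helper itself are assumptions made for this illustration. Note that the built-in double-precision atomicAdd requires compute capability 6.0 or higher, so on older (e.g. Kepler) devices it has to be emulated with atomicCAS.

// Step 3 of Surf_Rhs_Kernel: add the per-edge contributions tRHSL/tRHSR to the
// global right-hand-side array RHS of the left cell cellL and right cell cellR.
// Several edge threads may update the same cell concurrently, so atomicAdd is
// used to avoid the race condition.
__device__ void scatterEdgeRhs(const double *tRHSL, const double *tRHSR,
                               int nDofPerCell, int cellL, int cellR,
                               double *RHS)
{
    for (int k = 0; k < nDofPerCell; ++k) {
        atomicAdd(&RHS[cellL * nDofPerCell + k], tRHSL[k]);
        atomicAdd(&RHS[cellR * nDofPerCell + k], tRHSR[k]);
    }
}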
The CPU and GPU codes, developed in double precision with the methods described above, have been verified by simulating a set of typical flow problems, including Couette flows and flows over a circular cylinder and an aerodynamic airfoil. All simulations are performed on a Windows desktop platform equipped with an NVIDIA GTX TITAN GPU and an Intel Core i5-3450 CPU (see Table 2 for their specifications). The performances of the developed codes are analyzed after the numerical results are presented.
Table 2 Specifications of the Intel Core i5-3450 CPU and the NVIDIA GTX TITAN GPU
3.1.1 Couette flows
In order to examine the accuracy and convergence orders of the developed DG codes, the Couette flow between two parallel plates, where the upper plate moves at a constant velocity U and the bottom plate is fixed, is first selected and simulated with arbitrary grids. Assuming that the viscosity coefficient μ is constant, the exact solution of this Couette flow can be expressed as
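For constant μ, the classical solution consists of a linear velocity profile and a quadratic temperature profile; in the notation used here it takes the form (the pressure is uniform and the density follows from the equation of state)

$$u(y)=\frac{y}{H}\,U,\qquad T(y)=T_0+\left(T_1-T_0\right)\frac{y}{H}+\frac{Pr\,U^{2}}{2C_p}\,\frac{y}{H}\left(1-\frac{y}{H}\right)$$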
where H is the distance between the two plates and y the distance between the field point and the fixed bottom plate. Following the work in Ref.[24], the other parameters are taken as follows: The temperature of the bottom plate T_0 is set to 0.80 and the temperature of the upper plate T_1 to 0.85; the Mach number of the upper wall is Ma_∞ = 0.1 and the Reynolds number Re_∞ = 100. The computational domain (0 ≤ x ≤ 2H, 0 ≤ y ≤ H, H = 2) is discretized with unstructured grids and structured grids. As shown in Fig.7, three successive refinements are carried out for flow simulations with approximation order p from 1 to 3. The difference between the computed density and the exact density, measured in the L2 norm, is used to determine the order of convergence; the L2 errors are computed and listed in Table 3. The expected (p+1)th order of convergence of the present DG codes of approximation order p is achieved.
Table 3 L2 error and convergence order of Couette flow on unstructured and structured meshes
Fig.7 Successively refined meshes (Unstructured meshes: A to C; structured meshes: D to F)
3.1.2 Viscous flow past a circular cylinder
A compressible viscous flow past a cylinder[25-27] with the conditions Ma_∞ = 0.2 and Re_∞ = 40 is then selected and simulated for further validation. As shown in Fig.8, the computational domain Ω of [-15, 30]×[-15, 15], which is sufficiently large to eliminate the influence of the far-field boundary on the net drag and the separation bubble length, is discretized with 640 quadrilateral grid cells and 4 254 triangular ones for the simulations. The computed Mach number contours are shown in Fig.9. The smoothness of the computed contours is observed to improve notably as the approximation order increases. Besides, the corresponding streamlines of the flow field are also presented in Fig.9 in order to give a clear view of the separation bubbles. The computed total drag coefficients C_D and the lengths of the separation bubble normalized by the diameter d of the cylinder, s/d, are listed in Table 4, and they are all in good agreement with established results[25-28].
Fig.8 Computational mesh of the circular cylinder
Fig.9 Mach number contours and streamlines with different approximate orders
3.1.3 Viscous flow past NACA0012 airfoil
The last test case presented is a viscous flow past a NACA0012 airfoil. The free-stream conditions are a Mach number Ma_∞ = 0.5, an angle of attack α = 0° and Re_c = 5 000, in which the subscript c denotes that the chord length of the airfoil is used as the reference length of the Reynolds number. The adiabatic wall boundary condition is considered in this case. The computational domain is discretized with 170×30 quadrilateral cells (Fig.10) for the simulations. The computed Mach number contours of the 5th order results (p = 4) are presented in Fig.11, along with the streamlines near the trailing edge of the airfoil, in order to give a clear view of the separated flow patterns. It can be noted that flow separation occurs near the trailing edge of the airfoil, leading to the generation of two recirculation bubbles in the wake region. The corresponding pressure and skin friction coefficients, C_p and C_f, are presented in Fig.12, and they are in good agreement with the results reported in the open literature[29]. Additionally, in order to examine the effect of different approximation orders, the drag coefficients due to pressure, namely C_Dp, are further computed and listed in Table 5, together with the drag coefficients due to viscous stress C_Dv and the total drag coefficient C_D. In the column Method, the postfixes P1, P2, P3 and P4 indicate approximation orders of 1, 2, 3 and 4, respectively. It can be noted that the computed C_D increases as the approximation order increases, consistent with the listed reference data[25,29]. Specifically, the BR2-P3 and BR2-P4 schemes presented here give drag coefficients similar to the BR1-P3 and DDG-P2 results, with differences in C_Dp, C_Dv and C_D within 3%, 1% and 2%, respectively.
Table 5 Comparison of drag coefficients with different approximate orders
Fig.10 Computational mesh of NACA0012 airfoil
Fig.11 Mach number contours and streamlines near the trailing edge with approximate order p=4
Fig.12 Distributions of pressure and skin friction coefficients with p=4
3.2.1 Performance indicators
To compare the performances of the developed DG codes quantitatively, an absolute indicator, the unified time T_k, is first defined as T_k = σ T_iter / N_DOF
where σ is a constant for scaling the magnitude of the unified time index and is taken as 10^6 in this paper; T_iter is the computational time of a single iteration of the DG solver; and N_DOF is the number of degrees of freedom (DOFs), which depends on the number of cells N_cells, the number of conservative variables N_variables and the number of basis functions N_p. Thus, N_DOF = N_cells × N_variables × N_p. As a result, T_k can be interpreted as the time consumed in a single iteration for one million DOFs. Then, the performance of the GPU code compared with its CPU counterpart can be quantified by a relative indicator, the speedup ratio (SR), defined as SR = T_CPU / T_GPU
where T_CPU is the unified time of the CPU-based simulations and T_GPU that of the GPU-based simulations. It is well known that the key kernel subroutines dominate the overall performance of the program. Therefore, benchmark tests with the related performance analysis are first carried out for the key kernel subroutines before the performance of the whole program is presented.
3.2.2 Performances of key kernel subroutines
For an intuitive impression of the effect of GPU parallelization, tests of the key kernel subroutines are carried out by gradually moving the key procedures of the CPU serial code onto the GPU with computational kernels. Without loss of generality, the Couette flows presented in Section 3.1.1 are selected for these tests. The computational domain is uniformly covered by 768×256 quadrangular grid cells, which is much greater than the number of processor cores of the GPU used. Thus, the computations can fully load the massively parallel GPU architecture. As illustrated in Fig.13, the key kernel subroutines of seven procedures, namely DT, ConVars, Grad, Flux, RHS, Update and Res, are successively analyzed. To test the DG method with approximation order p = 1, the code is first executed on the CPU (the first-row ribbon, "all CPU", in Fig.13(a)), and the DT procedure is then moved onto the GPU using the corresponding kernel subroutines, while the other procedures are kept on the CPU (the second-row ribbon, "+DT", in Fig.13(a)). After that, the remaining procedures are incrementally moved one by one onto the GPU using CUDA C kernels until all the procedures of the CPU serial code are executed on the GPU (the last-row ribbon in Fig.13(a)). The tests of the DG method with the other approximation orders are carried out in the same way, as illustrated in Figs.13(b-d). It can be noted that gradual acceleration is achieved as more of the procedures are moved onto the GPU. The detailed time costs T_k of both the CPU- and GPU-based procedures are listed in Table 6, as well as the corresponding speedup ratios SR. It can also be noted that each procedure is significantly accelerated on the GPU (rows T_CPU and T_GPU), with speedup ratios of dozens of times (row SR in Table 6).
Table 6 Unified time of key subroutines on CPU and GPU together with corresponding speedup ratios
Fig.13 Acceleration progress of the key computational kernels, conducted incrementally by adding more GPU kernels to support the serial CPU implementation
Additionally, a benchmark test of the size effect is also performed by changing the number of threads in a block. A typical set of numbers, 8, 16, 32, 64, 128 and 256, is selected to configure the sizes of the thread blocks. The computing-time costs, varying with the size of the thread block, are illustrated in Figs.14(a-d) for the present GPU implementation of the DG methods with different approximation orders. In Fig.14, the different sizes of thread blocks and the corresponding unified times are denoted by columnar ribbons and their vertical heights, respectively. From the clear parabola-like distributions in Fig.14, an appropriate thread block size of 128 can be selected. This number is fixed in the following simulations.
Fig.14 Size effect of thread block
3.2.3 Performances of the proposed approach
The unstructured and structured cases outlined in Section 3.1.2 and Section 3.1.3, respectively, are used here for testing the performance of the whole approach. For both cases, successively refined grids are employed for all related CPU and GPU computations in order to examine the effect of mesh scale. The corresponding unified times and speedup ratios are listed in Table 7. It can be seen that GPU speedups of dozens of times, between 14.68 and 69.70, are achieved. It can also be observed that the unified time of the serial CPU implementation in each test case remains almost unchanged with increasing mesh size (the fourth column in Table 7), while the T_k of the GPU implementations decreases noticeably (the fifth column of Table 7). Fig.15 presents the computational speedup ratios with respect to different mesh scales. It can be observed that the speedup ratio keeps increasing with growing mesh scale until the total computing effort exceeds the capacity of the GPU architecture. The impressive GPU speedups associated with large mesh scales indicate the potential of the GPU-accelerated DG method for large-scale flow problems.
Fig.15 Speedup ratios with respect to different mesh scales
Table 7 Unified time of CPU and GPU implementations together with corresponding speedup ratio
The GPU implementation of the DG method with different approximation orders has been developed under the CUDA C programming model and successfully applied to solve two-dimensional laminar flows. The traditional program codes of the DG method can be easily ported from CPU to GPU by adopting the thread-per-cell and thread-per-edge models together with component-wise data storage. The resulting GPU-accelerated DG solver has been developed through the use of the designed CUDA C kernels, the constructed data storage structures and the manipulated thread hierarchy. The numerical results of the presented test cases are in good agreement with analytical solutions, available experimental data and other computational results reported in the literature. The performance of the GPU-accelerated DG solver is greatly improved, with speedup ratios of up to 69.70 in comparison with its CPU counterpart. The speedups increase with the approximation order until the capacity of the GPU architecture is exceeded. Besides, the present GPU implementation is modular, and hence has the flexibility to accommodate new schemes through module replacement.