Wei?Guang Li ·Cheng Chang ·Yao Qin ·Zi?Lu Wang ·Kai?Wen Li ·Li?Sheng Geng ·Hao Wu
Abstract In recent years, graphics processing units (GPUs) have been applied to accelerate Monte Carlo (MC) simulations for proton dose calculation in radiotherapy.Nonetheless, current GPU platforms, such as Compute Unified Device Architecture(CUDA) and Open Computing Language (OpenCL), suffer from cross-platform limitation or relatively high programming barrier.However, the Taichi toolkit, which was developed to overcome these difficulties, has been successfully applied to high-performance numerical computations.Based on the class II condensed history simulation scheme with various proton–nucleus interactions, we developed a GPU-accelerated MC engine for proton transport using the Taichi toolkit.Dose distributions in homogeneous and heterogeneous geometries were calculated for 110, 160, and 200 MeV protons and were compared with those obtained by full MC simulations using TOPAS.The gamma passing rates were greater than 0.99 and 0.95 with criteria of 2 mm, 2% and 1 mm, 1%, respectively, in all the benchmark tests.Moreover, the calculation speed was at least 5800 times faster than that of TOPAS, and the number of lines of code was approximately 10 times less than those of CUDA or OpenCL.Our study provides a highly accurate, efficient, and easy-to-use proton dose calculation engine for fast prototyping, beamlet calculation, and education purposes.
Keywords Proton therapy ·Monte Carlo dose calculation ·GPU acceleration ·Taichi
The availability of proton therapy has dramatically increased in recent years.According to the 2021 statistics of the Particle Therapy Co-Operative Group(PTCOG) [1],98 proton therapy facilities are operational worldwide, 19 facilities are under construction, and 4 facilities are being planned.Owing to the sharp falloff in the dose distribution(Bragg Peak), a proton beam can deliver most of the dose to tumors, thus protecting the surrounding healthy tissues.In addition, with the development of new types of beamline systems, such as Huazhong University of Science and Technology Proton Therapy Facility (HUST-PTF) [2]and treatment techniques, such as flash therapy [3], proton dose calculation requires much higher accuracy than that in conventional photon radiotherapy [4].Most current commercial proton-dose calculation engines used in treatment planning systems are based on analytical algorithms which can rapidly yield accurate dose results for homogeneous tissues.However, the predicted results are inaccurate when highly heterogeneous tissues are considered[5].In addition, proton therapy produces many secondary particles, the information on which is of great importance for analyzing their physical and biological effects [6].The analytical algorithms fail to provide detailed information on secondary fragments.
In contrast, Monte Carlo (MC) algorithms fully simulate particle reactions and model transport geometries, achieving accurate total and fragment dose results.Thus, MC simulations are considered the gold standard in dose calculations,although they are time-consuming.Currently, widely used MC tools, such as Geant4, Gate, Topas, and FLUKA, are based on Central Processing Units (CPUs).However, graphics processing units (GPUs) can integrate more computing units, which can simulate transport processes in parallel[7–9] and significantly reduce calculation time.Jia et al.developed the first GPU-based MC proton-dose calculation engine, called gPMC [10].The physics model was adapted from Fippel and Soukop [11], where proton transport was simulated in the class II condensed history scheme with a continuous slowing down of approximations.Ionization and elastic and inelastic interactions between the protons and materials were included.Chan et al.[12] applied a more precise nonelastic nuclear reaction model that incorporated the Bertini cascade and evaporation kernels by sacrificing speed.Both models are CUDA-based, and their applications are limited to Nvidia GPUs.New versions of gPMC, gPMC v2.0, and goMC were realized in the OpenCL framework[13, 14], facilitating cross-platform applications.Although OpenCL supports multiple GPU hardware, it is mainly developed using an Application Programming Interfaces(API) with a specific hardware development interface.
In 2019, Hu et al.proposed a new parallel programming language called Taichi for high-performance numerical computations [15–17].It supports most mainstream GPUs and thus works well in cross-platform environments.Taichi is embedded in Python and uses just-in-time (JIT) compiler frameworks to offload computing-intensive Python code to native GPU or CPU instructions.Using the @ti.kerneldecorator, Taichi's JIT compiler automatically compiles Python functions into an effi-cient GPU or CPU machine code for parallel execution.These features enable easy programming that significantly reduces the total number of lines of code compared to OpenCL and CUDA.Moreover, Taichi can be well integrated into mainstream Python deep learning ecosystems such as PyTorch,which integrate GPU high-performance computing with deep learning neural networks [18–20].Consequently, Taichi has been widely used in high-performance physics studies [16]and machine learning fields [21].However, Taichi has not yet been applied to dose calculations in radiotherapy.In this work, we developed a new cross-platform GPU-based MC proton dose calculation engine for the Taichi platform, named Taichi Proton Monte Carlo (TPMC), with a model of proton transport similar to those in [11, 14], except for a different inelastic nuclear reaction model.
The strategy for GPU acceleration in Taichi is similar to those in OpenCL and CUDA.Primary and secondary stacks were built to store primary protons and secondary particles,respectively.The GPU calculation threads first processed the primary protons to start transport and then saved all the secondary particles produced into the secondary stacks.After all the primary protons stopped or escaped from the region of interest, the secondary protons were transported on the GPU threads again and treated as primary protons, except that the information of the produced secondaries was not documented.Using the Taichi platform reduced the number of lines of code to only a few hundred.The computation time was reduced by allocating the deposited dose to different local dose counters to avoid the atomic addition effect.In our study, we prepared 110-, 160-, and 200 MeV proton pencil beams and four water and human tissue phantoms for dose calculation.The gamma passing rates, calculation speeds, and total number of lines of code were evaluated and compared with those obtained from Geant4-TOPAS and other GPU-based simulations.
The remainder of this study is organized as follows.In Sect.2, we describe the physical process of the TPMC, parallel simulation strategy, and Taichi GPU memory arrangement.The results of the dose calculations for Taichi and the other platforms are compared in Sect.3.Finally, in Sect.4, we conclude our study and discuss its current and future applications.
In this study, we employed the class II condensed history simulation scheme [22] and continuous slowing down approximations for proton transport.A simplified physics model was established to simulate the nuclear reactions of the primary and secondary protons.The physical process of proton transport was almost the same as those in Refs.[10, 11], which will be briefly introduced in the following section.On the other hand, the proton transport was accelerated in the Taichi framework on GPUs, where parallel threads were applied and a new dose counter model was developed to solve thread-block problems.
Protons were transported step by step in materials.The step size is defined as
where ΔSmaxis the maximum step size required to ensure that energy loss remains within 25% of the total kinetic energy or 2 mm [22] in each step.ΔSvoxis the distance between the particle position and voxel surface,ξis a random number in the range of [0–1], andσtotis the total reaction cross section.The step length in different materials ΔSwas first converted to its corresponding water equivalent length ΔSw(with the same energy loss in water)
whereLw(E) is the restricted stopping power of water,which was calculated using the Bethe-Bloch formula [25].To consider the effects of secondary electrons in the ionization process less than the cutoff energyTmine = 0.1 MeV, the actual energy loss was varied with a Gaussian fluctuation ΔEf=ΔE+N(0,σ2) , where the standard deviationσwas determined following Ref.[23] In addition, all secondary electrons were considered depositing their energy locally,as their short range compared to the computed tomography(CT) voxel size.Post-step reactions between protons and materials occurred after each step, including ionization,proton–proton elastic nuclear interactions, proton–oxygen elastic nuclear interactions, and proton–oxygen inelastic nuclear interactions.These interactions were treated in the same manner as in Ref.[10], except that the inelastic cross sections were extracted from MC simulations using Geant4[26] and tabulated as inputs to accelerate the MC simulation,as suggested by Wan et al.[12].
As mentioned in Sect.1, the implementation of CUDA is limited to the GPU of NVIDIA.Although OpenCL can solve this problem [13], it requires an additional application programming interface and lacks functional libraries,thereby creating higher programming barriers for medical physicists and developers.However, Taichi can overcome the drawbacks of these two languages [15].It is based on Python and can therefore conveniently use available packages such asrandom,numpy, andpandas.When applied to GPU acceleration, an easy outermost scope for-loop command automatically parallelizes work items on different threads in the Taichi kernel, whereas in CUDA and OpenCL, appropriate GPU machine codes must be designed[15].Moreover, switching from Nvidia cards to other GPU cards can be achieved simply by changing only the head files in Taichi, instead of (almost completely) rewriting the code in OpenCL.Finally, the compilation time and memory space are saved in Taichi because it is performed on the LLVM-based compiler Clang [15], and Taichi also tunes the parameters to best explore the GPU architecture.Table 1 briefly compares these three languages.
The cross-sectional data of proton-nucleus reactions and the density and stopping power distribution of the target were first loaded by the CPU and stored in the GPU global memory, which was called constantly during the simulations.The transport of primary protons was randomly simulated stepwise on different GPU threads until they stopped or moved out of the target.The automatic parallelization of particles on a GPU can be achieved using @ti.for-loops command in TPMC.Secondary particles were generated after inelastic nuclear interactions, and their information, including the location, energy, and direction, was recorded in the GPU global memory and pushed into the secondary particle stack.The transport of secondary particles was blocked until all the primary protons were simulated.Subsequently, we randomly allocated secondary protons to different threads for transport.
Although the particles were transported in parallel on the GPUs, the dose deposition process in each thread was not independent.Individual dose deposition on the global dose counter causes a thread block, where the dose information of other threads must wait until the current process of proton information transfer ends, known as atomic addition.This treatment was the primary reason for the time-consuming nature of the GPU-based MC dose engine.In this study, we applied the same strategy as in gPMC V2.0 [14] to mitigate the memory conflict problem.Eight local dose counters were prepared to balance the calculation efficiency and memory size.The deposited energy was randomly assigned to a dose counter after each step, and the total doses were summed for the eight counters after all the particle transports were completed.
The GPU implementation of the aforementioned process is schematically illustrated in Fig.1.Protons were simulated in batches with a batch size ofNbatch=65536 protons in all cases, except for the last batch in our simulation.AfterNpprimary protons were loaded into a batch,the transport simulation was initiated.It is noteworthy thatNpis equal toNbatch.In each batch, the dose information ofNpprimary protons,Nssecondary protons, other secondary fragments, and all tertiary particles was calculated following the physics model and saved in the local dose counterDr(r= 1–8).After all the particle transports were performed in a batch, the dose information inDrwas sent to the CPU memory and added to the total dose,Dtotal=Dtotal+∑r Dr.The simulation process ends when the number of simulated primary protonsNpriis equal to the number of total primary protonsNtotal.
Table 1 Development environment and bases of different platforms
In this study, all TPMC GPU dose calculation processes were performed on an NVIDIA Tesla V100 PCle 32 GB card system with 5120 processor cores.The operating frequency was 1230 MHz with 14 TFLOPS in single-precision mode.All the CPU procedures were performed on an Intel Xeon 8260 CPU with 24 processor cores system,processor base frequency of 2.4 GHz, and 1 TB of maximum memory.
We used a Gaussian pencil beam with a spot size of 5 mm and scattering angle of 0°.The number of primary protons simulated was 107, and the beam energies were 110, 160, and 200 MeV.Four phantoms of different levels of heterogeneity,containing water, lung tissue, and bone tissue were designed for testing, as shown in Fig.2.The size of the phantom was 10 cm × 10 cm × 30 cm, with a 1 mm × 1 mm × 1.5 mm scoring grid.The densities and elemental compositions of the materials are listed in Table 2.
The physics setting of TOPAS included the following Geant4 modules: g4em-standard_opt3, g4hphy_QGSP_BIC_HP,g4decay, g4ion-binarycascade, g4helastic_HP, and g4qstopping, following Testa et al.[27]
Fig.2 Size, shape, and materials of the four testing phantoms
Table 2 Densities and elemental compositions of the materials in the testing phantoms.The densities and composition values are in units of g/cm3 and mass weighted percentage, respectively
3.1.1 Depth–dose curve
Figure 3 shows the depth–dose curves obtained for TPMC and TOPAS in the pure water phantom (phantom in Fig.2).The dose values for beams with different energies are normalized by the maximum dose (Bragg peak) of 110 MeV.The dose differences between TPMC and TOPAS mainly occur around the Bragg peak regions.
3.1.2 Lateral dose verification
Lateral dose verification was performed by comparing the results for the four phantoms calculated by the TPMC and TOPAS at the same incident proton energy,E= 160 MeV.The relative error maps of the dose distributions are shown in Fig.4.In most of the regions with energy deposition,the results of TPMC are consistent with those of TOPAS.These differences mainly appear in the Bragg peak areas for phantoms (a), (b), and (c).Interestingly, for the pure water phantom (a), our predicted dose values are larger than those of TOPAS at energies before the Bragg peak, and TPMC decreases faster than TOPAS after the Bragg peak.These differences increase further when lung tissue is inserted into the proton beam pathway.However, the opposite results are obtained when the inserts were replaced with bone tissue,as shown in panels (b) and (c) of Fig.4.For the heterogeneous phantom, large differences appear along the extension line of the boundary between the two inserted materials,apart from the two Bragg peaks.Despite this difference, the Bragg peak regions of TPMC and TOPAS are consistent in all the test cases.
The influence of the heterogeneous medium was further studied by simulating two additional incident proton energies, 110 MeV and 200 MeV, for phantom (d).The lateral dose contour maps of the TPMC and TOPAS in the three cases, the corresponding lateral profiles at a depth of 6 cm(material boundary), and the two Bragg peaks are shown in Fig.5a–c.The results show that the lateral dose distribution in TPMC is consistent with that in TOPAS, regardless of the proton beam energy, phantom depth, or material difference.
3.1.3 Quantitative analysis
To evaluate the accuracy of the TPMC quantitatively, we calculated the gamma passing rates for the four testingphantoms with three incident proton beam energies.The criteria were set to 1 mm/1% and 2 mm/2%.A mask was added to each phantom whose height and width were set to 5 cm and 30 cm, respectively.The depth of the mask was 1.1 times the depth of the Bragg peak.A deeper Bragg peak was selected for the heterogeneous phantom.The gamma passing rates in various situations are listed in Table 3.We note that all the values are greater than 0.99 and 0.95 under the criteria of 2 mm/2% and 1 mm/1%,respectively.These quantitative results indicate that the TPMC is sufficiently precise for proton treatment planning.In addition, the gamma passing rates of the TPMC gradually decrease with increasing proton beam energy and complexity of the phantom.
Fig.3 (Color online) Depth–dose distribution (top) and relative dose difference (bottom)for the pure water phantom for mono-energetic proton pencil beams of 110 MeV (black peak at 91 mm), 160 MeV (red peak at 175 mm), and 200 MeV (blue peak at 258 mm), calculated by TPMC and TOPAS.The dose value for different energy beams is normalized with respect to the maximum dose (Bragg peak) of 110 MeV
Fig.4 (Color online) Lateral dose difference of TPMC and TOPAS in the four test phantoms for a 160 MeV mono-energetic proton pencil beam:a difference in water phantom, b difference in lung phantom, c difference in bone phantom, and d difference in heterogeneous phantom
3.1.4 CT image validation
The accuracy benchmark of TPMC against TOPAS was then performed on a CT-acquired patient phantom.A square beam with an energy of 110 MeV was incident.The dose distributions are shown in Fig.6.The TPMC dose calculation results for the horizontal and inclined injection beams are shown in Fig.6a, b, and the TOPAS results are plotted in Fig.6c, d.Figure 6e, f depicts the dose differences between the two MC results.For regions with doses above 50% of the maximum dose, the average dose difference is less than 1.5%.We also performed a gamma-index test (2 mm/2%), and the passing rate exceeds 99%.Focusing on the dose points within the 10%isodose lines, the passing rate is greater than 96% for both cases.
The efficiency of the TPMC was tested on an NVIDIA Tesla V100 GPU card system and an Intel Xeon 8260 CPU processor (192 cores) system.The latter was also used in the full-MC simulation using TOPAS.We built the CUDA and OpenCL acceleration programs based on gPMC [10] and gPMC V2.0 [14].For comparison, the same physical model was also encoded on the CUDA and OpenCL platforms.In all the cases processed by GPUs, the local dose counter was optimized to eight to reduce memory conflicts as much as possible.The simulation times for the 107incident protons for various beam energies, platforms, and devices are listed in Table 4.Clearly, the simulation time is primarily determined by the proton beam energies because highenergy protons have longer transport paths and induce more complicated nuclear reactions than low-energy protons.In addition, the simulation efficiency of the three GPU-based platforms was at the same level (within 10 s) and all met the clinical requirements.Owing to the acceleration of the GPUs and simplification of the physics model, the calculation time is generally approximately 7000 times faster than that of the full MC simulations by TOPAS.Taichi did not show explicit improvements in simulation time compared with CUDA and OpenCL because proton transport workflows are essentially the same in GPU-based platforms.
As mentioned in Sect.1, Taichi integrates numerous function code instructions for ease of use.We compared the total number of lines of code written in Taichi, OpenCL, andCUDA, as shown in Table 5.The number of lines of code for programming the physics model was the same for all the three platforms.However, Taichi significantly reduced the total number of lines of code, which was approximately 11 times less than OpenCL and eight times less than CUDA.Such differences mainly come from the GPU implementation because it is quite easy in Taichi where only two lines of code are needed, compared to the thousands of lines needed in other platforms.This improvement makes TPMC an easy-to-use package that allows algorithm developers,students, and medical physicists to focus more on the underlying physics of proton transport rather than on code-writing techniques.
Fig.5 (Color online) Lateral dose difference of TPMC and TOPAS in the heterogeneous phantoms for: a 110, b 160,and c 200 MeV mono-energetic proton pencil beams
Table 3 Masked Gamma passing rates of TPMC for four testing phantoms with 110, 160, and 200 MeV proton beams
To ensure the accuracy of our physics model, it is worth noting that the maximum depth–dose difference in the pure water phantom occurred around the Bragg peak region.This is because nuclear reactions play a dominant role in the production of many short-range secondary fragments.However, the energy of these fragments was assumed to be locally deposited without additional interactions in the TPMC.This difference increases with increasing incident proton energy, which can be understood in the same way that the higher the projectiles energy, the more the fragments produced.In addition, the lateral dose difference mainly originates from the approximate treatment of the delta electrons, whose influence on the proton transport directions is neglected in our physics model.The accuracy of TPMC decreased when other materials, such as lung or bone tissues, were inserted into water.The transport of delta electrons is probably non-negligible in low-density materials for lung tissues [10], while for bone tissues,proton-calcium inelastic nuclear interactions should be included.Moreover, the phantom with the heterogeneous medium exhibited the lowest gamma passing rate.The cross section of protons and materials might dramatically change when protons pass through the heterogeneous interface, but we only consider the material information of the current voxel in that case.These results indicate that the accuracy of TPMC may decrease further when the structure of the human body is more complicated.
The lateral dose differences of TPMC and TOPAS in Fig.4 exhibit opposite results when lung or bone tissues were inserted compared to those of the pure water phantom.This was attributed to the material conversion method used to calculate the water-equivalent steps in Eq.(2).For lung and bone tissues, although energy loss is assumed to be conserved in such conversion, the value offs(ρ,Tp) is not absolutely accurate, since it is obtained by fitting to experimental data [11].Thus, the energy of the protons after passing through the inserts (other than water) changed compared with the real situation, leading to a shift of approximately 1 mm in the Bragg peak position.
Although both CUDA and OpenCL can achieve better computation performance through numerous well-developed functions, the implementation of Taichi in proton dose calculations is still appealing because it supports multiple GPU devices and has a low programming barrier.Taichi also exhibits advantages in other aspects.In proton transport,Taichi continues to process all the particles on the GPUs,whereas OpenCL sends the information of the generated secondary protons from the GPU to the CPU memory after the simulation of each batch finishes.This information is returned to the GPU during secondary proton transport.Such cross-transportation may lead to a relatively low effi-ciency, as shown in Table 4, where the simulation time of OpenCL in each case is slightly longer than that of Taichi.In addition, Taichi supports most Python function libraries such as PyTorch and Numpy, which naturally integrate dose calculations within the framework of deep learning.This feature provides a new aspect for future work, namely, using convolution neural networks to improve the accuracy of the present study.Similar studies were conducted by Ryan Neph et al.[20], who used 3D Unet and high-noise dose maps(105incident particles) to predict low-noise dose distributions (107incident particles) in treatment planning.Another work by Wu et al.[28] improved the accuracy of the pencil beam algorithm using Unet.Both studies yielded promising results, indicating that deep learning has great potential for dose prediction.Finally, the great power of Taichi in gradient descent optimization can be applied to other aspects,such as image processing and dose optimization.Our goal is to develop a Taichi-based treatment planning system for proton therapy in future.
Fig.6 (Color online) Dose calculation result in the case of the patient.The first two rows represent the dose results from TPMC and TOPAS, respectively.The third row shows the relative dose difference between the two methods
Table 4 Comparison of the dose simulation time (in seconds) for 107 incident protons with various beam energies, platforms and devices
Table 5 Comparison of the total number of lines of code and number of lines for the physics model in Taichi, OpenCL and CUDA
We developed a graphics processing unit (GPU)-based Monte Carlo proton dose calculation engine within the Taichi framework.Protons are transported in the class II condensed history simulation scheme with various poststep proton–nucleus interactions.The simulations were parallelized on GPUs for acceleration.The results of the testing models indicated that the accuracy satisfied the clinical requirements adequately.The simulation speed was approximately 7000 times faster than that of the full MC simulation using TOPAS, and the number of lines of code was approximately 10 times less than those of CUDA and OpenCL.Our study provides a fast, accurate,and easy-to-use proton dose calculation engine for algorithm developers, students, and medical physicists.Further studies are needed to implement Taichi in other components of treatment planning, such as image processing and dose optimization.
Author contributionsAll authors contributed to the study conception and design.Material preparation, data collection and analysis were performed by Wei-Guang Li, Cheng Chang, Yao Qin, Zi-Lu Wang,Kai-Wen Li, Li-Sheng Geng and Hao Wu.The first draft of the manuscript was written by Wei-Guang Li, and all authors commented on previous versions of the manuscript.All authors read and approved the final manuscript.
Data availabilityThe data that support the findings of this study are openly available in Science Data Bank at https:// doi.org/ 10.57760/ scien cedb.j00186.00015 and https:// cstr.cn/ 31253.11.scien cedb.j00186.00015.
Conflict of interestThe authors declare that they have no competing interests.
Nuclear Science and Techniques2023年5期