Kai-Wei Sun (孫愷偉)¹ and Fa Wang (王垡)¹,²
¹International Center for Quantum Materials, School of Physics, Peking University, Beijing 100871, China
²Collaborative Innovation Center of Quantum Matter, Beijing 100871, China
Keywords: neural network, analytic continuation, quantum Monte Carlo
Numerical analytic continuation (AC) solves the following inversion problem: G(τ) = ∫dω K(τ,ω) A(ω).
The goal of AC is to extract the real-frequency spectrum A(ω) from the imaginary-time correlation function G(τ), which is typically obtained by Monte Carlo simulation. The spectrum A(ω) is required to be non-negative at every ω-point and subject to a certain sum rule ∫dω A(ω) = const. K(τ,ω) is the inversion kernel, whose form depends on the specific problem at hand. This study involves two kinds of inversion kernels K(τ,ω): K_F(τ,ω) = e^(-τω)/(1+e^(-βω)) and K_S(τ,ω) = e^(-τω). K_F(τ,ω) usually appears when calculating single-particle excitation spectra from measured Green's functions.[1,2] K_S(τ,ω) is usually involved when extracting dynamic structure factors from spin–spin correlation functions in some spin models.[3]
To carry out the actual calculation, τ and ω are often discretized as τ_i = τ_1,...,τ_M and ω_j = ω_1,...,ω_N. The target problem can then be reformulated as G(τ_i) = Σ_j K(τ_i,ω_j) A(ω_j) Δω. For simplicity, Δω will be absorbed into A(ω_j) by A(ω_j)Δω → A(ω_j) in further discussions. The sum rule is then discretized as Σ_i A(ω_i)Δω = const. At first sight this looks like a simple matrix-inversion problem, but it turns out to be a notoriously challenging task due to its ill-conditioned nature. In almost all cases, the corresponding condition numbers go far beyond the tolerance allowed by the machine precision of existing computers. Several methods have been proposed to solve this problem, such as the maximum entropy method (Maxent)[2] and stochastic analytic continuation (SAC).[4–6] Both succeed in extracting empirically correct spectra. However, these methods usually demand highly accurate simulated correlation functions G_sim(τ).
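The severity of this ill-conditioning is easy to check numerically. The sketch below (grid sizes reduced for speed; only the kernel form K_F is taken from the text, the rest is illustrative) builds the discretized fermion kernel and evaluates its condition number:

```python
import numpy as np

def fermi_kernel(tau, omega, beta):
    """Discretized K_F(tau, omega) = exp(-tau*omega) / (1 + exp(-beta*omega)),
    evaluated in a numerically stable way for both signs of omega."""
    t = tau[:, None]
    w = omega[None, :]
    with np.errstate(over="ignore", invalid="ignore"):
        pos = np.exp(-t * w) / (1.0 + np.exp(-beta * w))         # safe for w >= 0
        neg = np.exp((beta - t) * w) / (1.0 + np.exp(beta * w))  # safe for w < 0
    return np.where(w >= 0, pos, neg)

beta = 16.0
tau = np.linspace(0, beta, 64)        # coarse grids; the paper uses 512 x 1024
omega = np.linspace(-15, 15, 128)
K = fermi_kernel(tau, omega, beta)

# Condition number of the linear problem G = K @ A: astronomically large,
# so naive inversion of G -> A is hopeless at machine precision.
print(np.linalg.cond(K))
```

The condition number grows further as the grids are refined toward the 512×1024 discretization used in this work.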
As an emerging machine-learning technique, neural networks (NNs)[7] have experienced great success in a variety of physics-related domains. From the perspective of machine learning, analytic continuation can be viewed as a vector-to-vector prediction task, where G(τ_i) is mapped to A(ω_j). To construct a neural network capable of performing analytic continuation, both the network topology and the training set should be built appropriately. The common framework for this task usually contains several steps: (1) Build a neural network. (2) Synthesize spectra A_train for training purposes. (3) Calculate G_train by the forward mapping A → G. Note that the forward mapping is well-conditioned, so G_train can be determined exactly. (4) Train the network on the dataset pair (G_train, A_train) so that spectra predicted from G_train closely match A_train. (5) When developing and testing NNs, synthesize a testing set (G_test, A_test) and evaluate the performance of the trained NN on it. When using neural-network-based analytic continuation (NNAC) in actual tasks, apply the trained network to predict spectra A_pred from simulated correlation functions G_sim generated by Monte Carlo simulations. To mimic real-world simulated data, noises are usually added to correlation functions obtained from synthetic spectra such as G_train and G_test.
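As a concrete illustration of steps (2)–(4), the sketch below synthesizes a toy single-peak spectrum, applies the forward mapping, and adds Gaussian noise. The Gaussian peak, the kernel choice K_S, and all grid sizes are placeholders for illustration, not the datasets used in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step (2): a toy single-Gaussian spectrum on a coarse omega grid (illustrative only)
omega = np.linspace(-15, 15, 128)
A_train = np.exp(-(omega - 1.0) ** 2)
A_train /= A_train.sum()              # discrete sum rule: sum_j A(omega_j) = 1

# Step (3): forward mapping A -> G with the kernel K_S = exp(-tau*omega);
# the forward direction is well-conditioned, so G_train is exact.
tau = np.linspace(0, 4, 64)
K = np.exp(-tau[:, None] * omega[None, :])
G_train = K @ A_train

# Mimic Monte Carlo noise: G -> G + R, with R ~ N(0, sigma^2)
sigma = 1e-3
G_noisy = G_train + rng.normal(0.0, sigma, size=G_train.shape)
# (G_noisy, A_train) is one training pair fed to the network in step (4)
```

Repeating this with randomized peak parameters produces an arbitrarily large training set.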
In a relatively early study, Hongkee Yoon and coauthors[8] designed a network based mainly on fully-connected layers (FCLs).[7] In their research, both the training and testing sets are obtained from synthetic Gaussian-type multi-peak spectra, and Gaussian-distributed noises are added to G_train and G_test. The trained NN performs well on the testing set, as the predicted spectra are very close to the synthetic testing spectra. Several different network structures[9–12] trained on similar Gaussian-type datasets have also been proposed. In addition to synthetic datasets, NNAC has also been examined on some exactly solvable models, such as the one-dimensional transverse-field Ising model[13] and a harmonic oscillator linearly coupled to an ideal environment.[14] In these two studies, artificial training sets (G_train, A_train) are generated from exactly solved correlation functions and the corresponding spectra; different spectra in the training set correspond to different parameter values of the Hamiltonian being studied. Target spectra A_pred are predicted from correlation functions G_sim simulated by Monte Carlo techniques. Reference [13] points out that the neural network's prediction performance can be improved by adding uniform noises to the exactly solved Green's functions at each imaginary time in the training set.
Theoretically, we have no knowledge of the precise forms of the spectra before the target spectra are actually predicted; in particular, the knowledge that spectra are Gaussian-type is not expected to be available before prediction. This is an intriguing topic dubbed "data leakage"[15] in the field of machine learning. Data leakage occurs when information that is not expected to be available at prediction time is used in the training process. All the aforementioned articles about NNAC suffer from data leakage to some degree. In practice, we usually apply numerical analytic continuation to models that are not exactly solvable, where it is impossible to construct training sets from exactly solved spectra.
To design the training set, hints from experiments or traditional AC approaches such as Maxent should also be explored. It should be mentioned that NNAC is useful even when spectra have already been obtained from Maxent: NNAC performs better at least in highly noisy cases, as described in Ref. [14]. This topic will also be elaborated upon in this paper. In general, domain knowledge,[16,17] especially possible spectrum peak shapes, should be incorporated when designing the training set as much as feasible, but without data leakage. We then expect the trained NN to generalize[18,19] well enough to handle unobserved correlation functions like G_test and G_sim.
Intuitively, one expects better predictions of spectra when more information is incorporated. Monte Carlo simulations can provide information beyond the measured correlation functions themselves, such as the statistical errors of G(τ). Specifically, they provide information on two aspects of the statistical errors: the measured errors R(τ_i) of G(τ_i) at each τ_i, and the covariance of the correlation functions at different imaginary times.
This work avoids data leakage while synthesizing the training sets and incorporates the information of statistical errors to improve the performance of NNAC. With these means, NNAC has the potential to become a practical algorithm and a significant component of the Monte Carlo–analytic continuation toolchain. In Section 2, NNAC with the kernel K_F(τ,ω) is examined on synthetic data, where datasets synthesized from spectra with different types of shapes are addressed. In Section 3, NNAC with the kernel K_S(τ,ω) is applied to a one-dimensional Heisenberg chain as a real-world example of an AC problem. Conclusions are presented in the final section. We demonstrate that NNAC can acquire knowledge embedded within the inversion kernel, as detailed in Appendix A, and discuss the performance of NNAC when the input correlation functions are accurate in Appendix B. In Appendix C, we show how to incorporate information from Maxent into NNAC to enhance its robustness.
In this section, we design and test NNs on synthetic datasets, and develop principles for generating training sets. We first discuss three types of datasets, the training framework, and the actual training process. Noise-level matching between the training and testing sets is then explored, and the resulting spectra are compared with those from Maxent. Finally, the impact of measured noise shapes and time-displaced correlations is investigated.
Multi-peak spectra A(ω) are produced by summing over single peaks F(ω): A(ω) = (1/Z) Σ_i F_i(ω).
In the formula above, Z is a scaling constant ensuring that A(ω) obeys the sum rule. In this section, we assume ∫dω A(ω) = 1 for convenience. This paper involves three distinct peak types: asymmetric exponential power (ASEP), skew Gaussian (skew), and Lorentz. The ASEP single-peak curve reads
In the above formula, h, m, a1, a2, b1, and b2 are all control parameters. In this study, we set m ∈ [-5,5], a1, a2 ∈ [0.3,3], b1, b2 ∈ [1,3], and h ∈ [0.2,1]. The skew peak takes the form
where g ∈ [1,2], a ∈ [2,4], and h ∈ [0.2,1]. In this study, we investigate spectra containing one to four peaks. At least 10^5 samples are generated for each peak number by randomly selecting control parameters; in other words, a single dataset includes at least 4×10^5 samples. Training and testing sets of the same peak type are created independently.
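A generator for such random multi-peak spectra might look like the sketch below. Note that the precise ASEP formula is not reproduced in this text, so the piecewise exponential-power form used here is an assumed stand-in, chosen so that it reduces to a Gaussian profile when b1 = b2 = 2 and a1 = a2:

```python
import numpy as np

rng = np.random.default_rng(42)

def asep_peak(omega, h, m, a1, a2, b1, b2):
    """One asymmetric-exponential-power-like peak: different scale a and shape b
    on each side of the mode m. This functional form is an illustrative assumption."""
    left = h * np.exp(-((np.abs(omega - m) / a1) ** b1))
    right = h * np.exp(-((np.abs(omega - m) / a2) ** b2))
    return np.where(omega < m, left, right)

def random_spectrum(omega, n_peaks):
    """Sum n_peaks random peaks, then normalize to the sum rule sum_j A(omega_j) = 1."""
    A = np.zeros_like(omega)
    for _ in range(n_peaks):
        A += asep_peak(
            omega,
            h=rng.uniform(0.2, 1.0), m=rng.uniform(-5, 5),
            a1=rng.uniform(0.3, 3.0), a2=rng.uniform(0.3, 3.0),
            b1=rng.uniform(1.0, 3.0), b2=rng.uniform(1.0, 3.0),
        )
    return A / A.sum()   # normalization plays the role of the scaling constant Z

omega = np.linspace(-15, 15, 1024)
A = random_spectrum(omega, n_peaks=int(rng.integers(1, 5)))  # one to four peaks
```

Drawing at least 10^5 such samples per peak number reproduces the dataset sizes quoted above.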
The ASEP-type dataset has the most control parameters among the three types and thus contains the greatest diversity of spectra, while not explicitly containing spectra of the skew-type or Lorentz-type datasets. We expect the neural network to learn from the ASEP-type dataset and generalize well enough to achieve good performance on the other two datasets. It should be noted that, unlike some previous studies, we do not examine Gaussian-type spectra here, as they are explicitly included in the ASEP-type dataset when b1 = b2 = 2 and a1 = a2. Such explicit inclusion rarely occurs in real-world AC tasks, and the performance of NNAC would be overestimated on Gaussian-type testing sets.
The imaginary time τ ∈ [0,16] is discretized uniformly into 512 pieces, and the frequency domain ω ∈ [-15,15] is discretized into 1024 pieces. β is fixed to 16 in the fermion kernel e^(-τω)/(1+e^(-βω)).
Convolutional neural networks (CNNs)[20] are employed in this work. FCL-based neural networks were also evaluated in the early stage of this study but proved inferior to CNNs. Adding residual modules[21] or deep layer aggregation[22] also did not yield significant improvements; for deep layer aggregation, both iterative deep aggregation and hierarchical deep aggregation were attempted. Based on these considerations, we employ the neural network shown in Fig. 1. First, the 512-length G(τ_i) is transformed into a p-length vector via an FCL (labeled "Dense") and then reshaped into a 1×p matrix. This matrix can be regarded as an image that can be naturally processed by convolution layers. Next, this image is passed to a q-channel one-dimensional convolution layer "Conv1d", followed by the activation layer "Swish". Within the "Conv1d" layer, convolution kernels of size 1×3 are used. Within the activation layer, the activation function named "Swish"[23] is used; this function is both non-monotonic and smooth and, according to Ref. [23], may improve the overall performance of the neural network compared to the commonly used ReLU[24] activation function. This "convolution → activation" process is carried out n times. The q-channel image is then compressed by an average-pooling layer[25] and flattened into a pq/2-length vector. The flattened vector is mapped to a 1024-length vector by another "Dense" layer. Finally, the "SoftMax" layer presents predictions of the spectra, where the sum rule Σ_j A(ω_j) = 1 is naturally satisfied after the softmax operation. Tricks to reduce overfitting such as dropout[26] are not adopted here; instead, we recommend enlarging the training set when signs of overfitting emerge, since it is rather cheap to acquire data from synthetic spectra.
Fig. 1. The convolution-based structure of the neural network used in this work. Hyper-parameters are chosen to be n = 8, p = 64, and q = 64 in the actual training process.
Hyper-parameters are chosen to be n = 8, p = 64, and q = 64. To select appropriate hyper-parameters, we build an additional ASEP-type validation set, on which we evaluate NNs trained on the ASEP-type training set. When selecting hyper-parameters, the trade-off between performance and training time is taken into consideration.
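To make the tensor shapes and the final sum rule concrete, here is a minimal NumPy sketch of the forward pass with random weights. It mirrors the Dense → reshape → n×(Conv1d → Swish) → average-pooling → Dense → SoftMax pipeline described above, but it is an illustration only; the actual training in this work uses Keras:

```python
import numpy as np

rng = np.random.default_rng(0)

def swish(x):
    return x / (1.0 + np.exp(-x))   # swish(x) = x * sigmoid(x)

def conv1d_same(x, w):
    """x: (length, in_ch); w: (3, in_ch, out_ch). 'Same' padding, stride 1."""
    xp = np.pad(x, ((1, 1), (0, 0)))
    out = np.zeros((x.shape[0], w.shape[2]))
    for k in range(3):
        out += xp[k:k + x.shape[0]] @ w[k]
    return out

def forward(G, n=8, p=64, q=64, N=1024):
    """Randomly initialized forward pass; shapes follow the text's description."""
    x = G @ (0.01 * rng.standard_normal((G.size, p)))    # Dense: 512 -> p
    x = x.reshape(p, 1)                                  # the 1 x p "image"
    for _ in range(n):                                   # n x (Conv1d -> Swish)
        x = swish(conv1d_same(x, 0.1 * rng.standard_normal((3, x.shape[1], q))))
    x = x.reshape(p // 2, 2, q).mean(axis=1)             # average pooling (size 2)
    x = x.reshape(-1)                                    # flatten: p*q/2 vector
    z = x @ (0.01 * rng.standard_normal((x.size, N)))    # Dense: -> N
    z = np.exp(z - z.max())
    return z / z.sum()                                   # SoftMax: sum rule holds

A_pred = forward(np.ones(512))
```

By construction, the softmax output is non-negative and sums to one, so the discretized sum rule is satisfied for any input.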
We use the Kullback–Leibler divergence (KLD)[27] as the loss function, which takes the form KLD(A_true, A_pred) = Σ_j A_true(ω_j) ln[A_true(ω_j)/A_pred(ω_j)].
The KLD measures the difference (more precisely, the relative entropy) between the true distribution A_true and the predicted distribution A_pred, which makes it a natural choice for this task. Other commonly used loss functions include the mean absolute error (MAE) and the mean squared error (MSE), as shown below: MAE = (1/N) Σ_j |A_true(ω_j) - A_pred(ω_j)| and MSE = (1/N) Σ_j [A_true(ω_j) - A_pred(ω_j)]².
Like the MAE and MSE, the KLD has the property of positivity.
Empirically, spectra from NNs with MSE loss are often smoother than those from NNs with MAE loss, since MSE punishes large spectrum differences more severely. In this study, we did not observe a discernible performance difference between MSE-loss and KLD-loss NNs.
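For reference, the three losses can be written compactly as below; the small eps clip in the KLD is a common numerical safeguard against log(0), not something prescribed by the text:

```python
import numpy as np

def kld(A_true, A_pred, eps=1e-12):
    """Kullback-Leibler divergence: sum_j A_true_j * ln(A_true_j / A_pred_j)."""
    At = np.clip(A_true, eps, None)
    Ap = np.clip(A_pred, eps, None)
    return float(np.sum(A_true * np.log(At / Ap)))

def mae(A_true, A_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(A_true - A_pred)))

def mse(A_true, A_pred):
    """Mean squared error; punishes large deviations more severely than MAE."""
    return float(np.mean((A_true - A_pred) ** 2))
```

All three vanish when the prediction matches the truth and are positive otherwise (for normalized, non-negative spectra in the KLD case).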
NNs are programmed using the Keras toolkit[28] with a TensorFlow[29] backend. The Adam[30] optimizer is used for gradient descent. The early-stopping trick is utilized during training: the training process terminates when the KLD measured on the validation set does not drop for 20 epochs, where the validation set is generated in the same manner as the training set. Trained weights are then restored to those of the epoch with the lowest KLD. Each training task is repeated at least 5 times with different random seeds, and the KLDs shown in this paper are averaged among NNs trained with different seeds.
Fig. 2. Tracking the training process. (a) Relative losses, including KLD, MAE, and RMSE, vs. the number of trained epochs. The relative loss is defined as the loss after a given epoch divided by the loss after the first epoch. (b) A typical example of the convergence of one predicted spectrum to the true spectrum as the KLD decreases. The selected checkpoints are labeled by red dots in (a).
The training process is depicted in Fig. 2, where both the training set and the testing set are of ASEP type. Errors at noise level 10^-3 are introduced to G_train and G_test (the concept of noise level will be discussed later). Relative values of three statistics measured on the testing set are tracked throughout the training process in Fig. 2(a). We track RMSE = √MSE instead of MSE itself because the RMSE shares the same dimension as the MAE and KLD. The relative loss in this figure is defined as the loss after a given epoch divided by the loss after the first epoch. In Fig. 2(b), we show an example from the testing set of how one predicted spectrum approaches the true spectrum as the KLD decreases. Selected checkpoints are indicated by red dots in Fig. 2(a). While visualizing the training process, we use only 1000 samples per epoch because the statistics converge too quickly for visualization if the entire training set containing 4×10^5 samples is used. The complete training set is used in actual AC tasks hereafter.
In this study, model training on an RTX 3060 graphics card takes approximately 20 min on average. This is acceptable in the majority of circumstances, especially in contrast to the amount of time saved in the Monte Carlo simulation when highly accurate correlation functions are not required.
Correlation functions measured from Monte Carlo simulation inevitably contain statistical errors. To mimic these errors, Gaussian noises are added to G(τ_i) by G(τ_i) → G(τ_i) + R(τ_i), where R(τ_i) ~ N(0, σ²). Four noise levels are investigated in this work: σ = 10^-4, 10^-3, 3×10^-3, and 10^-2. σ in this formula can also be interpreted as the average absolute magnitude of the noise. At this stage, we assume G(τ_i) to be independently measured for each i. In real-world NNAC-based tasks, the noises of G_sim are measured from the Monte Carlo simulation, and the noises of the training set should be arranged accordingly. Besides, the noise level of the testing set should be the same as that of the simulated data to mimic real-world tasks.
To design the training set, a natural question arises: how should we set the noise level σ_train of the training set when the noise level σ_test of the testing set is known? We train NNs on training sets of different σ_train and apply these NNs to testing sets of different σ_test. The corresponding results are shown in Table 1 and Fig. 3. Table 1 contains the KLDs of spectra predicted from testing sets with different noise levels σ_test by NNs trained on training sets with different σ_train. The smallest KLD in each row (marked red) is obtained when the noise levels of the training set and the testing set match (σ_train = σ_test). Performance degrades but remains acceptable when σ_train increases beyond σ_test, whereas the opposite case σ_train < σ_test is much worse. For instance, the KLD is relatively small when (σ_train, σ_test) = (10^-2, 10^-4) but is large and unsatisfactory when (σ_train, σ_test) = (10^-4, 10^-2). That is because the information of ASEP(σ = 10^-4) is in some sense "contained" in ASEP(σ = 10^-2): for each curve in ASEP(σ = 10^-4) we may find similar samples with similar noises in ASEP(σ = 10^-2) if the datasets are large enough, given that noises are randomly selected, whereas the converse is not true. We train NNs with different noise levels and use them to predict one sample of G(τ_i) from the testing set with σ_test = 3×10^-3 and σ_test = 10^-2, as presented in Figs. 3(a) and 3(b), respectively. The resulting spectra become closer to the ground truth when σ_train is closer to σ_test. In Fig. 3(b), incorrect and unstable peaks are predicted by the NNs trained with σ_train = 10^-4 or 10^-3, whose KLDs are correspondingly large, as seen in Table 1.
Table 1. KLDs of spectra predicted from testing sets with different σ_test by NNs trained on training sets with different σ_train. In each row, the smallest KLD (marked red) is obtained when σ_train = σ_test. To determine the errors of the KLDs in the table, we train NNs with at least 10 distinct random seeds and calculate the statistical uncertainty of the KLDs of the spectra predicted by these NNs.
Fig. 3. Illustration of noise-level matching. The ground truths in both subfigures are the same curve. (a) Prediction of spectra from the testing set with σ = 3×10^-3 by NNs trained with different σ_train. The best spectrum is obtained when σ_train = σ_test = 3×10^-3. (b) Prediction of spectra from the testing set with σ = 10^-2 by NNs trained with different σ_train. The best spectrum is obtained when σ_train = σ_test = 10^-2. The predicted spectrum contains unstable peaks at wrong locations when σ_train = 10^-4 or 3×10^-3.
Note that in this part, data leakage is not intentionally avoided: the training set and the testing set are both of ASEP type. With the same σ_test, KLD differences caused by different σ_train may be relatively small, and taking datasets with different line shapes would introduce unnecessary complexity, possibly leading to unsound or even incorrect conclusions. From another perspective, we expect NNs to use the knowledge learned from the training set to predict correct spectra in actual tasks. Performance is usually slightly weakened when the line shapes of the testing set and training set differ. Therefore, we expect the NNs with proper σ_train to at least achieve good results on a testing set with the same line shape. The KLD results here do not represent the actual performance of the NNs in practical tasks.
With the knowledge of noise-level matching, G_train is hereafter designed to have the same noise level as G_test, and we are now ready to compare NNAC with traditional AC methods like Maxent. We train NNs on ASEP training sets and use them to predict ASEP-type, skew-type, and Lorentz-type spectra, respectively. The corresponding outcomes are depicted in Fig. 4. Figures 4(a), 4(b), and 4(c) show the KLDs of spectra predicted by the two methods on the ASEP, skew, and Lorentz datasets, respectively. Error bars of the KLDs are omitted in this and subsequent figures to make the graphs more comprehensible, as they are relatively small. Typical predicted results of the three peak types at noise level 3×10^-3 are shown in Figs. 4(d), 4(e), and 4(f). The performance of NNAC is comparable with Maxent at the lowest noise level 10^-4 but significantly outperforms Maxent at relatively high noise levels. The improvement is also obvious when the training set and the testing set are not of the same spectrum type. In the spectrum examples depicted in Figs. 4(d), 4(e), and 4(f), the peak locations are precisely predicted by NNAC, whereas Maxent does not provide accurate peak locations at this noise level; at some frequencies, Maxent may even give spurious signals of peaks. Peak heights predicted by NNAC are also more accurate and closer to the ground truths than Maxent's.
Spectra from Maxent in this section about the kernel K_F(τ,ω) are calculated mainly with the software "TRIQS/maxent"[31] so that the results can be easily checked. Various α-choosing algorithms are evaluated, where α is the penalty coefficient of the entropy term in the Maxent objective function.[2] Among the algorithms discussed in Ref. [31], "χ²-curvature", which is analogous to ?-Maxent,[32] and "Bryan" greatly outperform the others in terms of KLD on the tasks of interest. Between these two, "χ²-curvature" is marginally superior to the Bryan algorithm. We therefore use "χ²-curvature" in this work to ensure a level playing field for Maxent.
Fig. 4. Comparison with Maxent. NNs are trained on the ASEP dataset and applied to three different testing sets: ASEP, skew, and Lorentz. (a)–(c) KLDs of predicted results on the ASEP, skew, and Lorentz datasets, respectively, at different noise levels. (d)–(f) Typical spectra predicted at noise level 3×10^-3 by Maxent and NNAC; the ground truth is also shown for comparison. The performance of NNAC is comparable with Maxent when the dataset contains low-level noise but outperforms Maxent at high-level noise, even though the NNs are not trained on a dataset of the same type as the testing set.
In the preceding discussion, the noise R(τ_i) at each τ_i is assumed to be sampled from the same Gaussian distribution with the same variance, which is rarely the case in Monte Carlo simulation. We introduce the noise-shape multiplier λ(τ) to investigate the influence of the noise's dependence on imaginary time and assume R(τ_i) ~ N(0, σ(τ_i)²) with σ(τ_i) = λ(τ_i)σ. We refer to this dependence as the "noise shape" hereafter. These multipliers satisfy ∫λ(τ)dτ = 1 to ensure that datasets with the same σ but different noise shapes are at approximately the same noise level. λ(τ) of four distinct linear shapes, labeled A, B, C, and D, are displayed in Fig. 5(a).
To demonstrate the impact of the noise shape and how to appropriately arrange noises in the training set, we train NNs on ASEP-type training sets with equal noise (λ(τ) = 1) and with noise shape A, respectively. These trained NNs are applied to skew-type testing sets with noise shape A. The corresponding measured KLDs are presented in Fig. 5(b). Spectrum examples at noise level 3×10^-3 are shown in Fig. 5(c).
The origins of the different performances under different noise shapes can, to some extent, be explained by permutation feature importance (PFI),[33] despite the fact that neural networks are typically seen as black boxes. To calculate the PFI, we rearrange G(τ_i) randomly over samples at one certain time slice τ_i in the testing set, and the PFI at this time slice is defined by how much the resulting KLD increases. The PFI difference between NNs trained on datasets with linear noise shapes and on the equal-noise dataset is defined as [PFI_T(τ_i) - PFI_E(τ_i)]/[PFI_T(τ_i) + PFI_E(τ_i)], where PFI_E(τ_i) denotes the PFI of NNs trained on the equal-noise dataset and PFI_T(τ_i) that of NNs trained on the dataset with noise shape T ∈ {A, B, C, D}. The resulting relative PFI differences are shown in Fig. 5(d). A moving average over five adjacent points is carried out to make the curves smoother and clearer. The relative PFI curves and the λ(τ) curves increase or decrease in opposite directions, which means the NNs assign large feature importance to imaginary-time slices where G(τ_i) is less noisy.
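A sketch of the PFI computation for a generic trained model is given below; the `predict` interface and the choice of loss are hypothetical placeholders for whatever model and metric are in use:

```python
import numpy as np

def permutation_feature_importance(predict, G_test, A_test, loss, rng=None):
    """For each imaginary-time slice tau_i, shuffle G(tau_i) across test samples
    and record how much the loss increases relative to the unshuffled baseline.
    predict maps (n_samples, M) -> predicted spectra; loss(A_true, A_pred) -> float."""
    if rng is None:
        rng = np.random.default_rng(0)
    base = loss(A_test, predict(G_test))
    pfi = np.zeros(G_test.shape[1])
    for i in range(G_test.shape[1]):
        G_perm = G_test.copy()
        rng.shuffle(G_perm[:, i])          # permute one time slice over samples
        pfi[i] = loss(A_test, predict(G_perm)) - base
    return pfi
```

A slice whose shuffling barely changes the loss carries little feature importance; a large increase marks a slice the network relies on heavily.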
It should be emphasized that measured correlation functions do not often have linear-type noise shapes; instead, they frequently have exponential-like shapes. However, things become more subtle for exponential noise shapes, where it is harder to disentangle the effects of different noise levels and noise shapes. In light of these concerns, we only examine linear-type noise shapes here, and we believe the physical picture is similar in other scenarios.
Fig. 5. Influence of linear noise shapes. (a) Four types of shape multiplier λ(τ). (b) Noises of the testing set are of shape A. Two neural networks are trained on training sets with equal noise (λ(τ) = 1) and with noise shape A, respectively. KLDs of the trained neural networks are compared on the shape-A testing set at different noise levels. (c) Typical spectra predicted from G(τ) with shape-A noises (σ = 3×10^-3) by neural networks trained on the equal-noise and shape-A training sets, respectively. (d) The relative difference in PFI between the neural network trained on a training set with linearly shaped noise and that trained on a training set with uniformly shaped (λ(τ) = 1) noise, at various imaginary times.
In the previous discussion, we assumed that the correlation functions G(τ_i) at different imaginary times τ_i are independently measured, i.e., there is no mutual correlation between correlation functions at different imaginary times. In actual Monte Carlo simulations, independent measurements can indeed be achieved, but they come with significant computational overhead. Generally speaking, the correlation functions at different imaginary times are measured simultaneously on a single Markov chain, and thus there may be strong mutual correlations between them. The impact of these mutual correlations can be quantified by the covariance matrix Σ, defined as Σ_ij = 〈G(τ_i)G(τ_j)〉 - 〈G(τ_i)〉〈G(τ_j)〉.
Here 〈···〉 denotes the average over Monte Carlo samples. In practical analytic continuation tasks, the covariance matrix Σ can be obtained before synthesizing the training set. If it is required that the errors of the correlation functions in the training set or the testing set also satisfy a certain covariance matrix Σ, the errors should be drawn from the joint normal distribution, i.e., R ~ N(0, Σ):
1) Perform a singular value decomposition (SVD) of the covariance matrix, Σ = U D V^T. Since the covariance matrix is real and symmetric, we have U = V.
2) Define the diagonal matrix d by d_ii = √(D_ii).
3) Define another matrix L = Ud.
4) Generate a vector r whose elements are samples from the standard normal distribution, r_i ~ N(0,1).
5) Calculate R = Lr.
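The five-step recipe above can be sketched as follows; the toy covariance matrix is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)

def correlated_noise(Sigma, n_samples, rng):
    """Draw samples R ~ N(0, Sigma) following the SVD recipe in the text."""
    U, D, _ = np.linalg.svd(Sigma)       # step 1: Sigma = U D V^T (U = V here)
    L = U * np.sqrt(D)                   # steps 2-3: L = U d with d_ii = sqrt(D_ii)
    r = rng.standard_normal((Sigma.shape[0], n_samples))  # step 4: r_i ~ N(0, 1)
    return L @ r                         # step 5: each column R = L r ~ N(0, Sigma)

# Toy covariance: correlation decaying with the imaginary-time separation (illustrative)
M = 4
Sigma = np.fromfunction(lambda i, j: 0.8 ** np.abs(i - j), (M, M))
R = correlated_noise(Sigma, 200_000, rng)
Sigma_est = R @ R.T / R.shape[1]         # sample covariance should approach Sigma
```

Since L Lᵀ = U D Uᵀ = Σ, the generated noise has the prescribed covariance in the limit of many samples.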
Of course, one can also use the Cholesky decomposition Σ = LL^T to generate the matrix L directly, but it is widely accepted that SVD has better numerical stability than Cholesky decomposition. Maxent requires performing an SVD of the covariance matrix to obtain the inverses of the singular values; given that some singular values may be close to zero, this operation can be numerically unstable. NNAC, by contrast, circumvents this possible instability, since no inversion of singular values is needed.
To illustrate the influence of time-displaced correlations, we create a toy covariance matrix for the testing set by
In this work, we investigate correlation matrices with condition numbers of 10^3, 10^6, 10^9, and 10^12, respectively, by adjusting γ. U(τ) are generated at four noise levels σ ∈ {10^-4, 10^-3, 3×10^-3, 10^-2}.
NNs are trained on ASEP-type datasets and applied to skew-type testing sets with various noise levels and condition numbers. The training sets are designed in two ways: they may have zero correlation (R ~ N(0, Σ_0)) or the same correlation as the testing set (R ~ N(0, Σ)). Σ_0 has the same diagonal elements as Σ, but all off-diagonal elements of Σ_0 are 0. In Fig. 6(a), the condition number of the testing set is fixed to 10^12, and NNs are trained on datasets with or without time-displaced correlation at each noise level. As illustrated, the influence of the τ-correlation is not significant at low noise levels, but correlation mismatching may lead to incorrect predictions at high noise levels. In Fig. 6(b), the noise level of the testing set (and of the training set as well) is fixed to 10^-2; the KLDs are lower when the condition number is larger. The reason may be that R(τ_i) is dominated by only a few singular values of Σ, whose noise pattern is relatively easy for the NNs to learn. Spectrum examples with noise level 10^-2 and condition number 10^12 are shown in Fig. 6(c), containing spectra predicted by NNs trained with zero correlation or with the same correlation as the testing set, as well as the ground truth. Clearly, the predicted spectra contain peaks at wrong locations when the time-displaced correlation is not matched.
Fig. 6. Illustration of the influences of the time-displaced correlation of G(τ_i). The noise levels of the training set and the testing set are matched. (a) The condition number of the correlation matrix is fixed to 10^12. NNs trained on datasets without correlation may give wrong predictions if G(τ_i) in the testing set is correlated, especially when the noise level is high. (b) Noise levels are fixed to 10^-2; KLDs are shown with respect to different condition numbers. (c) Spectrum examples with condition number 10^12 and noise level 10^-2.
In this section, NNAC is carried out to extract dynamic structure factors of the spin-1/2 antiferromagnetic Heisenberg chain of length L, which reads H = J Σ_i S_i · S_{i+1},
where S_i represents the spin located on site i. Periodic boundary conditions are assumed, i.e., S_{L+1} = S_1. The imaginary-time-displaced spin–spin correlation of the z-component is measured by stochastic series expansion:[34] G_{i,j}(τ) = 〈S_i^z(τ) S_j^z(0)〉, with G_k(τ) = (1/L) Σ_{i,j} e^{-ik(r_i - r_j)} G_{i,j}(τ).
G_{i,j}(τ) is the time-displaced spin–spin correlation of the z-component between spin i and spin j. The target correlation function G_k(τ) in the wave-vector domain is then calculated via Fourier transformation. r_i denotes the location of spin i, where the lattice constant is set to 1. J is used as the energy unit. We set the inverse temperature β = 1 in the Monte Carlo simulation. In this work we focus on k = π, and G_k(τ) is denoted by G(τ) for simplicity. The AC task then reads G(τ) = ∫dω e^(-τω) A(ω), where A(ω) is the target dynamic structure factor. The corresponding sum rule is obtained by setting τ = 0, i.e., ∫dω A(ω) = G(0).
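The transform from G_{i,j}(τ) to G_k(τ) can be sketched as below; the 1/L normalization is a common convention and is assumed here, not quoted from the text:

```python
import numpy as np

def wavevector_correlation(G_ij, k):
    """Fourier transform of the real-space correlation G_{i,j}(tau) to wave vector k.
    G_ij: array of shape (L, L, n_tau). Lattice constant 1; 1/L normalization assumed."""
    L = G_ij.shape[0]
    r = np.arange(L)                                     # site positions r_i
    phase = np.exp(-1j * k * (r[:, None] - r[None, :]))  # e^{-ik(r_i - r_j)}
    # Sum the phase-weighted correlations over both site indices, for every tau
    return np.einsum("ij,ijt->t", phase, G_ij).real / L
```

For a translation-invariant chain the result depends only on k, and k = π selects the momentum studied in this work.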
The same NN structure and hyper-parameters are used as in the previous section. The frequency ω takes the range ω ∈ [-10,10]. The time domain and the frequency domain are discretized into 512 and 1024 pieces, respectively, as before. The spectrum of the Heisenberg chain can be regarded as a sum of δ functions at zero temperature; these δ functions broaden as the temperature increases. We perform quantum Monte Carlo simulation on a 32-site Heisenberg chain, where the δ functions are dense enough on the required energy scale δω ~ 0.02 that a smooth spectrum can be obtained. The stochastic series expansion approach with the loop-update[34] algorithm is used in the simulation. The spin–spin correlation is measured every 100 update steps so that auto-correlation can be ignored. The covariance matrix Σ is measured by Σ_ij = [〈G(τ_i)G(τ_j)〉 - 〈G(τ_i)〉〈G(τ_j)〉]/(N_s - 1), where N_s is the number of independent samples.
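The covariance estimator quoted above can be sketched as follows; note that the extra 1/(N_s - 1) factor effectively makes this the covariance of the averaged correlation function rather than of a single measurement:

```python
import numpy as np

def covariance_matrix(G_samples):
    """Estimate Sigma_ij from Ns independent measurements G_samples of shape (Ns, M),
    using Sigma_ij = [<G_i G_j> - <G_i><G_j>] / (Ns - 1) as quoted in the text."""
    Ns = G_samples.shape[0]
    dG = G_samples - G_samples.mean(axis=0)
    # (1/Ns) sum_s dG_i dG_j equals <G_i G_j> - <G_i><G_j>; then divide by (Ns - 1)
    return (dG.T @ dG / Ns) / (Ns - 1)
```

For uncorrelated unit-variance samples, the diagonal of this estimator approaches 1/N_s, the familiar variance of the mean.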
Spin–spin correlation functions are measured using different numbers of Monte Carlo samples to create datasets with different noise levels. In this section, noise levels are represented by the relative statistical error of G(0), which ranges from 3.8×10^-3 to 3.6×10^-2. The simulated G(τ) are divided by the corresponding G(0) before being fed into the neural networks, so that the sum rule is restored to ∫dω A(ω) = 1. The "SoftMax" layer then yields the correct sum rule, and the scale of the extracted spectra is recovered afterwards by multiplying by G(0).
First, we consider the case where mutual correlations exist between the simulated correlation functions at different imaginary times. In each Monte Carlo measurement step, we simultaneously measure the spin–spin correlation functions at every imaginary time. The condition numbers of the measured covariance matrices range from about 10^17 to 10^18. As described in Subsection 2.6, we employ the joint normal distribution R ~ N(0, Σ_0) or R ~ N(0, Σ) to generate the errors R of the training set. The diagonal elements of Σ_0 are the same as those of Σ, while all off-diagonal elements are 0.
In Fig. 7, we show the predicted spectra at different noise levels. The predicted spectra are labeled "Zero Corr." when mutual correlations between imaginary times are not incorporated in the training set (R ~ N(0, Σ_0)); when mutual correlations are considered while generating the training set, we label the predicted spectra "Same Corr.". The black dashed line in the figure corresponds to the spectrum given by Okamoto et al.[35] They used the finite-temperature Lanczos method (FTLM) to calculate the dynamic structure factors of an antiferromagnetic Heisenberg chain of length 24, where some artificial peak broadening was added to the results. Correlation functions obtained from independent Monte Carlo simulations may differ slightly, so we repeated the Monte Carlo simulation with the same update and measurement steps at least 5 times and performed analytic continuation accordingly to estimate the variance of the predicted spectra. The blue solid line in each subfigure of Fig. 7 represents the averaged spectrum, while the light blue area corresponds to the range within one standard deviation.
As shown in Figs. 7(a) and 7(d), the difference between the "Zero Corr." and "Same Corr." cases is more significant when the noise level is higher. The predicted spectrum is inaccurate and has a larger variance when imaginary-time mutual correlations are not taken into account. This difference shrinks as the noise decreases, as shown in Figs. 7(c) and 7(f). When the noise of the input correlation function is relatively small, there is no significant difference between the predicted spectra of the two cases.
It is worth noting that although the FTLM curve shown in Fig. 7 is obtained from a Heisenberg chain with L = 24 rather than the L = 32 considered in this work, at relatively high accuracies the results of NNAC agree well with the FTLM spectrum, at least in terms of peak position and peak width within one standard deviation.
Fig. 7. Dynamic structure factors of the one-dimensional antiferromagnetic Heisenberg chain predicted by NNAC. Monte Carlo simulations with the same update steps and measurement steps are performed independently at least 5 times to obtain different spin–spin correlation functions, and NNAC is then performed on each. The blue solid line in each subfigure represents the averaged spectrum predicted by NNAC from the independently measured correlation functions. The light blue area indicates the range of one standard deviation of the spectrum. Subfigures (a), (b), and (c) correspond to the training process without considering imaginary-time mutual correlations, while subfigures (d), (e), and (f) correspond to the training process that considers imaginary-time mutual correlations. Data of the black dashed line come from Okamoto et al.[35], who use the FTLM method to solve the dynamic structure factor of a one-dimensional Heisenberg chain of length 24.
Fig. 8. Spectra predicted by Maxent and NNAC in the absence of imaginary-time mutual correlation. Monte Carlo simulations with the same update steps and measurement steps are performed independently at least 5 times to obtain different spin–spin correlation functions, and NNAC is then performed on each. The blue solid line in each subfigure represents the averaged spectrum predicted by NNAC from the independently measured correlation functions. The light blue area indicates the range of one standard deviation of the spectrum. Subfigures (a), (b), and (c) correspond to the training process without considering imaginary-time mutual correlations, while subfigures (d), (e), and (f) correspond to the training process that considers imaginary-time mutual correlations. At the noise levels involved in this figure, the NNAC results are better than those of Maxent. Data of the black dashed line come from Okamoto et al.[35], who use the FTLM method to solve the dynamic structure factor of a one-dimensional Heisenberg chain of length 24.
Due to the high condition number of the covariance matrix Σ, Maxent cannot generate reasonable spectra in this situation. To compare the performance of Maxent and NNAC, we instead measure correlations at each imaginary time independently, so the off-diagonal elements of the corresponding covariance matrix are 0. The analytic continuation task then returns to the situation described in Subsection 2.5: the correlation function has no imaginary-time mutual correlation, but the noise still depends on imaginary time. As before, we perform Maxent and NNAC at different noise levels and show the results in Fig. 8. At all noise levels involved in the figure, spectra predicted by NNAC are more accurate than those given by Maxent, in agreement with Subsection 2.4.
Applications of neural-network-based analytic continuation were discussed in this paper. Numerical experiments were carried out on both synthetic datasets and Monte Carlo data. The main conclusion is that an NN can learn from a carefully designed training set and make good predictions of spectra without data leakage, surpassing Maxent in highly noisy cases. To ensure that the neural network acquires adequate knowledge to predict the target spectral functions, the training dataset should comprise a sufficient number of diverse spectral functions. Incorporating information about the measured statistical errors leads to better predictions: the G(τi) of the training set should match the simulated correlation functions in terms of the noise at each τi and of the time-displaced correlations.
While acceptable, the time required by NNAC is relatively long compared to Maxent. Improving the efficiency of model training may be a fruitful direction for future investigation. It may be possible to apply the idea of transfer learning[36] here, so that one does not need to train a model from scratch for each target spectrum but can instead begin with a pre-trained model. A more valuable and ambitious goal is to train a model that generalizes to any spectrum. The input to such a model should probably be the simulated correlation functions together with the accompanying covariance matrices, which contain most (if not all) of the information needed to perform analytic continuation.
In Subsection 2.5, we studied how the noise shape affects the predictions of NNAC by analyzing PFI. Here we also use PFI to study the feature importance for the spectrum at a specific frequency. The feature importance of the correlation function at imaginary time τi for the spectrum at frequency ωj is denoted I(τi, ωj) and can be calculated in the following steps:
1) Train the NN on the training set (Gtrain, Atrain).
2) Predict spectra Apred from the testing correlation functions Gtest.
3) Traverse each imaginary time τi:
3-a) Permute the features at τi of the testing correlation functions Gtest. Use the trained NN to predict spectra Aperm from the permuted Gtest.
3-b) For each sample in the testing set, calculate
Is(τi, ωj) = |A^s_perm(ωj) − A^s_pred(ωj)|,
where A^s is the s-th sample in the testing spectra.
3-c) Take the average over samples,
I(τi, ωj) = (1/Ns) ∑_s Is(τi, ωj),
where Ns is the number of samples in the testing set.
3-d) Rescale I(τi, ωj) by
I(τi, ωj) → I(τi, ωj) / ∑_i I(τi, ωj).
This step ensures that the sums of feature importance at each frequency are the same.
4) Rescale the feature importances into the range [0, 1],
I(τi, ωj) → [I(τi, ωj) − Imin] / (Imax − Imin).
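The steps above can be sketched as follows, assuming the per-sample quantity in step 3-b) is the absolute change of the predicted spectrum; the linear map standing in for the trained NN is purely illustrative:

```python
import numpy as np

def feature_importance(predict, G_test, rng):
    """Permutation feature importance I(tau_i, omega_j): shuffle the values
    of G at one imaginary time across the testing set and record how much
    the prediction at each frequency changes."""
    A_pred = predict(G_test)                           # step 2
    n_tau, n_omega = G_test.shape[1], A_pred.shape[1]
    I = np.empty((n_tau, n_omega))
    for i in range(n_tau):                             # step 3
        G_perm = G_test.copy()
        G_perm[:, i] = rng.permutation(G_perm[:, i])   # 3-a: permute tau_i
        A_perm = predict(G_perm)
        I[i] = np.abs(A_perm - A_pred).mean(axis=0)    # 3-b, 3-c
    I /= I.sum(axis=0, keepdims=True)                  # 3-d: equal column sums
    return (I - I.min()) / (I.max() - I.min())         # step 4: map to [0, 1]

# toy "network": a fixed linear map from G (3 times) to A (2 frequencies);
# the third imaginary time carries no information about either frequency
W = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
rng = np.random.default_rng(1)
G_test = rng.normal(size=(200, 3))
I = feature_importance(lambda G: G @ W, G_test, rng)
```

In this toy example the uninformative third imaginary time receives near-zero importance at both frequencies, as expected.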
Fig. A1. (a) Feature importance for each specific frequency. The training set is of ASEP-type and the testing set is of skew-type. The noise level of the correlation functions is 10^-4. Here we do not consider mutual correlation between imaginary times. The inversion kernel is KF(τ, ω). (b) The inversion kernel rescaled by Eqs. (A3) and (A4). To denoise the data, we have performed a moving average with a window length of 20 points in the imaginary-time direction in both subfigures.
Here we use ASEP as the training set, skew as the testing set, and KF(τ, ω) as the inversion kernel. The noise of the input correlation functions is set to 10^-4 and we do not incorporate mutual correlation between imaginary times. The resulting feature importance I(τi, ωj) is shown in the left part of Fig. A1. The inversion kernel KF(τ, ω) rescaled by Eqs. (A3) and (A4) is shown in the right part of Fig. A1. Indeed, the trained NN has successfully learned some knowledge contained in the inversion kernel: when ω is small, the trained NN relies on G(τ) at large τ; when ω is large, it relies on G(τ) at small τ.
Some algorithms, such as DMRG or tensor-network methods, can provide correlation functions accurate to double floating-point precision, which is approximately 10^-16. To handle analytic continuation from such accurate correlation functions, Maxent usually requires the manual addition of noise to the correlation functions. As an extension of this work, we consider the performance of NNAC when the error level is extremely low.
Fig. B1. Performance of NNAC when noise levels are very low. The training set is of ASEP-type and the testing set is of skew-type. (a) KLDs of spectra predicted by NNAC at different noise levels. (b) One typical example of the predicted spectra.
As in Subsection 2.3, we train the NN on the ASEP-type training set and evaluate it on the skew-type testing set at different noise levels. Mutual correlations between imaginary times are not incorporated. The corresponding results are shown in Fig. B1: KLDs of spectra predicted by NNAC at different noise levels in Fig. B1(a), and one typical example in Fig. B1(b). On the one hand, when the noise level is very low, NNAC can still be effective (extra noise is not needed); on the other hand, when the noise level is below 10^-5, reducing the noise further does not bring a significant improvement to NNAC's performance. When dealing with very accurate inputs, the NN's performance may be limited by the size of the network architecture and of the training set.
As mentioned earlier, NNAC requires the training set to be sufficiently abundant and diverse. A potential risk of NNAC is insufficient overlap between the spectra in the training set and the target spectrum. In this section, we attempt to incorporate knowledge from Maxent into the training set to obtain more robust predictions, ensuring that the performance of NNAC is at least no worse than that of Maxent. The basic idea is to generate an extra set of spectra from the spectrum produced by Maxent and incorporate this extra dataset into the training set.
Suppose the correlation function obtained from simulation is Gsim(τ) and the spectrum produced by Maxent is Amaxent(ω). As the first step, we generate a set of spectra Adeform based on the Maxent spectrum Amaxent(ω):
1) Decompose the spectrum Amaxent(ω) into individual peaks to obtain a series of single-peak curves Fi(ω), where i = 1, 2, ..., M.
2) Repeat Ndeform times to obtain a sufficient number of samples:
2-a) Deform each single-peak curve Fi(ω) to obtain Ji(ω).
2-b) Sum up the deformed curves Ji(ω) to get Q(ω) = Z^-1 ∑_i Ji(ω), where Z is a constant ensuring that Q(ω) satisfies the corresponding sum rule.
2-c) Add Q(ω) to the set of spectra Adeform.
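The generation loop might look like this; `deform_peak` is a placeholder for the deformation operations described later in this appendix (here it is only a random rescaling, for illustration):

```python
import numpy as np

def generate_dms(peaks, omega, n_deform, rng):
    """Build the extra dataset A_deform from the separated single-peak
    curves F_i(omega) of the Maxent spectrum."""
    def deform_peak(F):
        return rng.uniform(0.3, 1.0) * F              # illustrative only
    domega = omega[1] - omega[0]
    A_deform = []
    for _ in range(n_deform):
        Q = np.sum([deform_peak(F) for F in peaks], axis=0)  # 2-a, 2-b
        Q /= Q.sum() * domega                 # divide by Z: restore sum rule
        A_deform.append(Q)                    # 2-c
    return np.array(A_deform)

# toy separated peaks: two Gaussians on a frequency grid
omega = np.linspace(-5, 5, 201)
peaks = [np.exp(-0.5 * (omega - c) ** 2) for c in (-1.0, 1.5)]
A_deform = generate_dms(peaks, omega, n_deform=50, rng=np.random.default_rng(2))
```

Because Z is recomputed for every sample, each deformed spectrum satisfies the sum rule regardless of how the individual peaks were deformed.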
The above process involves two key operations: peak separation and deformation of single-peak curves. To carry out peak separation, we use the differential evolution algorithm to fit the Maxent-generated spectrum Amaxent(ω) with a multi-peak ASEP-type curve. Fitting residuals are then assigned to each single-peak curve so that the sum of the single-peak functions recovers the original spectrum Amaxent(ω). The steps of the peak separation are as follows:
1) Denote the trial spectrum as Atrial(ω) = ∑_{i=1}^{M} F^i_ASEP(ω), where FASEP(ω) is defined in Eq. (3). Each single-peak curve FASEP(ω) has 6 control parameters, so the trial spectrum has 6M control parameters.
2) Minimize MSE(Amaxent(ω), Atrial(ω)) to obtain the optimized spectrum Aoptimal(ω).
3) Denote the fitting residual as Aresidual(ω) = Amaxent(ω) − Aoptimal(ω). Assign the residual to each optimized single-peak curve in proportion to its local weight,
Fi(ω) = F^i_optimal(ω)[1 + Aresidual(ω)/Aoptimal(ω)],
so that ∑_i Fi(ω) = Amaxent(ω).
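A sketch of the fit-and-assign procedure; a Gaussian stands in for the 6-parameter ASEP curve of Eq. (3) to keep the example self-contained, and the proportional residual assignment is one way to make the separated curves sum back to Amaxent exactly:

```python
import numpy as np
from scipy.optimize import differential_evolution

def peak(omega, c, w, h):
    """Stand-in single-peak shape (Gaussian) replacing the ASEP curve."""
    return h * np.exp(-0.5 * ((omega - c) / w) ** 2)

def separate_peaks(omega, A_maxent, n_peaks=2):
    """Fit A_maxent with a sum of single peaks via differential evolution,
    then spread the fitting residual over the peaks in proportion to their
    local weight so that the separated curves sum back to A_maxent."""
    def mse(params):
        trial = np.sum([peak(omega, *params[3 * i:3 * i + 3])
                        for i in range(n_peaks)], axis=0)
        return np.mean((A_maxent - trial) ** 2)
    bounds = [(-5.0, 5.0), (0.1, 3.0), (0.0, 2.0)] * n_peaks
    res = differential_evolution(mse, bounds, seed=0, maxiter=300, tol=1e-12)
    F_opt = [peak(omega, *res.x[3 * i:3 * i + 3]) for i in range(n_peaks)]
    A_opt = np.sum(F_opt, axis=0)
    residual = A_maxent - A_opt
    # proportional assignment; guarded against division by ~0 in the tails
    weight = np.divide(F_opt, A_opt, out=np.zeros_like(np.asarray(F_opt)),
                       where=A_opt > 1e-12)
    return [F + residual * w for F, w in zip(F_opt, weight)]

# toy "Maxent spectrum": two overlapping Gaussian peaks
omega = np.linspace(-5, 5, 201)
A_maxent = peak(omega, -1.0, 0.8, 1.0) + peak(omega, 1.5, 1.2, 0.6)
F1, F2 = separate_peaks(omega, A_maxent)
```

Note that the residual assignment makes the reconstruction exact wherever the fitted spectrum is non-negligible, independently of how well the fit converged.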
Fig. C1. An example of peak separation. The spectrum to be separated is the blue solid line labeled "Original", which is of skew-type. The yellow and green dashed lines represent the two single-peak curves after separation. The sum of these two single-peak curves reproduces the original double-peak spectrum.
In Fig. C1, we show an example of peak separation with M = 2. The blue solid line represents a skew-type spectrum to be separated, and the yellow and green dashed lines represent the two single-peak curves after separation, labeled peak 1 and peak 2, respectively.
To deform a single-peak curve, we define four operations, namely "raise", "move", "shrink", and "distort", as shown in Fig. C2. Blue solid lines represent single-peak curves before deformation and yellow dashed lines represent single-peak curves after deformation. The peak being deformed in Fig. C2 is in fact "peak 2" of the peak separation in Fig. C1.
The operation "raise" is defined as Fafter(ω) = r_raise Fbefore(ω), where r_raise is the coefficient. This operation adjusts the peak height. In Fig. C2(a), we set r_raise = 1.2. The operation "move" is defined as Fafter(ω) = Fbefore(ω − ω_move). This operation adjusts the peak location without changing the shape. In Fig. C2(b), we set ω_move = 1. The operation "shrink" is defined as Fafter(ω) = Fbefore(ωc + r_shrink(ω − ωc)), where ωc is the peak location of the curve Fbefore(ω) and r_shrink is the shrinking coefficient. This operation can be performed separately on the left half (ω < ωc) and the right half (ω > ωc). In Fig. C2(c), we set the shrinking coefficient of the left half to r = 0.5 and do not deform the right half. The shrinking operation changes the peak width and, if the left and right shrinking coefficients differ, also the skewness of the single-peak curve. Given the distortion coefficient r_deform, the distortion operation shown in Fig. C2(d) is performed as follows:
1) Normalize the curve by its peak height, F̃before(ω) = Fbefore(ω)/Fbefore(ωc).
2) Define α = ln(r_deform)/ln(0.5) and deform F̃before(ω) by F̃after(ω) = [F̃before(ω)]^α; the deformed curve is then rescaled back by the peak height, Fafter(ω) = Fbefore(ωc) F̃after(ω).
As in the "shrink" case, different distortion coefficients can be assigned to the left half and the right half of the single-peak curve. In Fig. C2(d), only the left half is distorted and the right half is left unchanged. If the distortion coefficients for the left half and the right half are the same, this operation changes the kurtosis of the targeted single-peak curve.
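On a discretized frequency grid the four operations can be sketched as below; interpolation stands in for evaluating Fbefore at shifted arguments, and the power-law form of "distort" is an assumed reading of the α construction above:

```python
import numpy as np

def raise_peak(F, r):
    """'Raise': scale the peak height by r."""
    return r * F

def move(omega, F, shift):
    """'Move': F_after(w) = F_before(w - shift); shape unchanged."""
    return np.interp(omega - shift, omega, F, left=0.0, right=0.0)

def shrink(omega, F, r_left, r_right):
    """'Shrink': F_after(w) = F_before(wc + r (w - wc)); r may differ on
    the two sides of the peak location wc, changing width and skewness."""
    wc = omega[np.argmax(F)]
    r = np.where(omega < wc, r_left, r_right)
    return np.interp(wc + r * (omega - wc), omega, F, left=0.0, right=0.0)

def distort(omega, F, r_left, r_right):
    """'Distort' (assumed form): normalize to unit height, raise to the
    power alpha = ln(r)/ln(0.5), then restore the height; the point at
    half height is mapped to height r, changing the kurtosis."""
    h = F.max()
    wc = omega[np.argmax(F)]
    alpha = np.where(omega < wc,
                     np.log(r_left) / np.log(0.5),
                     np.log(r_right) / np.log(0.5))
    return h * (F / h) ** alpha

# toy single-peak curve
omega = np.linspace(-5.0, 5.0, 501)
F = np.exp(-0.5 * omega ** 2)
F_raised = raise_peak(F, 1.2)
F_moved = move(omega, F, 1.0)
F_shrunk = shrink(omega, F, 0.5, 1.0)  # r = 0.5 widens the left half only
F_same = distort(omega, F, 0.5, 0.5)   # r = 0.5 gives alpha = 1: unchanged
```

A distortion coefficient of exactly 0.5 gives α = 1 and leaves the curve unchanged, which is a convenient sanity check for the construction.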
We now show how to obtain a more robust training set by incorporating the Maxent-generated spectrum for the AC task of a skew-type double-peak spectrum. First, we define three datasets: LASEP (left ASEP), RASEP (right ASEP), and DMS (deformed Maxent spectra). Similar to Eqs. (2) and (3), spectra in LASEP and RASEP are still based on single-peak ASEP-type curves generated from random control parameters. The parameter ranges for LASEP are m ∈ [−3, 0], a1, a2 ∈ [0.3, 5], b1, b2 ∈ [1, 3], and h ∈ [0.2, 1]. For RASEP the ranges are m ∈ [0, 3], a1, a2 ∈ [0.3, 5], b1, b2 ∈ [1, 3], and h ∈ [0.2, 1]. Note that the peak locations ωc of spectra in LASEP are less than 0, while those of spectra in RASEP are greater than 0. DMS is the set of spectra Adeform deformed from the Maxent-generated spectrum, where the deformation coefficients are randomly selected in the ranges r_raise ∈ [0.3, 1], ω_move ∈ [−0.5, 0.5], r_shrink ∈ [0.25, 1.5], and r_deform ∈ [0.2, 0.8]. Predictions of NNs trained on the various training sets are shown in Fig. C3.
As before, we add noise to the true correlation function Gtrue(τ) to mimic simulated correlation functions, Gsim(τ) = Gtrue(τ) + R(τ), R(τ) ~ N(0, σ²). We generate at least five different "simulated" correlation functions for each noise level with different random seeds to evaluate the standard error of the resulting KLDs of the predicted spectra. The structure and training process of the NNs are the same as in Section 2, but the inverse temperature is adjusted to β = 4 to avoid unnecessary complexities; at this temperature Maxent can generate reasonable spectra.
Figures C3(a), C3(d), and C3(g) compare the performance of the NN trained on DMS with that of Maxent. Figure C3(a) shows KLDs of spectra predicted by the two methods at different noise levels. Even though the spectra in DMS are generated from Maxent-produced spectra, the NN trained on DMS can still provide more accurate predictions than Maxent itself. Examples of the predicted spectra at two noise levels are presented in Figs. C3(d) and C3(g).
Figures C3(b), C3(e), and C3(h) compare the performance of NNs trained on LASEP and on LASEP+DMS. LASEP does not contain peaks with locations greater than 0 (ωc > 0), while the ground truth contains one peak with ωc > 0. Unsurprisingly, the NN cannot learn enough to provide good predictions. However, if DMS is added to the training set, i.e., if the information from Maxent is incorporated, the NN makes accurate predictions, at least no worse than Maxent itself. Similarly, Figs. C3(c), C3(f), and C3(i) compare the results of NNs trained on DMS+RASEP and on RASEP alone. When the training set contains only RASEP, the spectra predicted by the neural network are inaccurate, with large KLDs. When DMS is added to the training set, the performance is again at least no worse than that of Maxent.
If a training set does not contain enough information for the NN to make good predictions, DMS can be added to ensure that NNAC performs at least as well as Maxent. In other words, DMS makes NNAC safer and more robust.
Fig. C2. Four types of deformation operations: raise, move, shrink, and distort. (a) Raise: this operation changes the peak height. (b) Move: this operation changes the peak location. (c) Shrink: this operation changes the peak width. Different shrinking coefficients can be defined for the left half and the right half, respectively. (d) Distort: this operation changes the kurtosis of the single-peak curve. Different distortion coefficients can be defined for the left half and the right half, in which case the skewness also changes.
Fig. C3. Comparison of the predictions from neural networks trained on five different training sets and from Maxent. The five training sets are DMS (deformed Maxent spectra), LASEP (left ASEP), RASEP (right ASEP), DMS+LASEP, and DMS+RASEP. The spectra in DMS are generated by deforming the Maxent-generated spectrum. The spectra in LASEP are of ASEP-type, but the peak locations of all single peaks are less than 0. The spectra in RASEP are also of ASEP-type, but the peak locations of all single peaks are greater than 0. (a) KLDs of spectra predicted by the NN trained on DMS and by Maxent. The DMS-trained NN outperforms Maxent at all four noise levels involved. (d) and (g) Examples of the spectra predicted by these two methods at two noise levels. (b) KLDs of spectra predicted by NNs trained on DMS+LASEP and on LASEP. (e) and (h) Examples of the spectra predicted by these two methods at two noise levels. (c) KLDs of spectra predicted by NNs trained on DMS+RASEP and on RASEP. (f) and (i) Examples of the spectra predicted by these two methods at two noise levels.
Acknowledgements
FW acknowledges support from the National Natural Science Foundation of China (Grant Nos. 12274004 and 11888101). Quantum Monte Carlo simulations were performed on TianHe-1A at the National Supercomputer Center in Tianjin.