Jie SUN
School of Cyber Science and Engineering, Ningbo University of Technology, Ningbo 315211, China
Abstract: Deep learning provides an effective way for automatic classification of cardiac arrhythmias, but in clinical decision-making, purely data-driven methods working as black boxes may lead to unsatisfactory results. A promising solution is combining domain knowledge with deep learning. This paper develops a flexible and extensible framework for integrating domain knowledge with a deep neural network. The model consists of a deep neural network to capture the statistical pattern between input data and the ground-truth label, and a knowledge module to guarantee consistency with the domain knowledge. These two components are trained interactively to bring the best of both worlds. The experiments show that the domain knowledge is valuable in refining the neural network prediction and thus improves accuracy.
Key words: Domain knowledge; Cardiac arrhythmia; Electrocardiogram (ECG); Clinical decision-making
In recent years, deep learning technology has provided a new and effective example for making clinical decisions from pathophysiologic data (National Center for Cardiovascular Diseases, 2019). Some works have achieved better performance than a human specialist (Hannun et al., 2019). These successful models are all data-based learning methods; that is, the models take raw electrocardiogram (ECG) data as input, extract features, and output a prediction based on the input data. However, pure data-driven methods may lead to unsatisfactory results due to an unbalanced, incomplete, or biased dataset, and may not meet the constraints prescribed by natural law. A promising solution is integrating domain knowledge in the neural network pipeline to correct the deviation.
In this paper, we propose a general framework to address the questions in ECG arrhythmia classification, including: (1) How to represent the clinical knowledge so that it can be injected into the deep learning architecture? (2) How can domain knowledge affect the deep neural network (DNN) learning process, when the learning is based on gradient descent and back propagation? (3) Does the integration really improve or reduce the performance of the DNN? And how?
In recent years, the DNN model has been applied in the diagnosis of different cardiac diseases, such as heart arrhythmias (Acharya et al., 2019; Baloglu et al., 2019). Although DNNs have had significant success, they still have limitations in specific tasks because they are purely data-driven and are highly dependent on the training data. A solution is to integrate prior knowledge in the training process, and a variety of approaches have been proposed.
Domain knowledge can be applied to select appropriate data before they are fed into the DNN model. There are 12 leads in a conventional ECG. The six leads I, II, III, aVR, aVL, and aVF are limb leads, and the six other leads V1, V2, V3, V4, V5, and V6 are precordial leads (Surawicz and Knilans, 2008). Some leads have more pathological value for detection of a particular disease; for example, leads V2, V3, V5, and aVL are more sensitive and valuable in detecting myocardial infarction, and thus the related leads are selected as input instead of all 12 leads (Liu WH et al., 2018).
Domain knowledge can also be applied to analyze the inherent correlation of the input data. A classification model called MBCRNet designs three branches and considers synchronization and orthogonality of multiple leads (Chen B et al., 2018) to explore the different features. The average accuracy is 87.04% and the sensitivity is 89.93%.
Domain knowledge can be integrated with the DNN by decision fusion methods. The DNN makes a prediction and the clinical knowledge model (represented as diagnosis rules) performs inference separately, and the two results are fused to obtain the final decision (Jin and Dong, 2017).
Many works leveraged domain knowledge to refine the prediction result of a DNN model, which is called post-processing in some literature. Zhou et al. (2017) used ensemble classifiers to divide the ECG records into two categories, premature ventricular contraction (PVC) and non-PVC, and then rule-based inference was performed for each category to further refine the prediction result. Singstad and Tronstad (2020) individually classified 27 cardiac abnormalities with the deep learning model and rule-based algorithm. If there was inconsistency between the two results, the DNN classification result was rewritten by the rule-based algorithm. Parvaneh et al. (2018) applied DenseNet to classify the ECG record into four categories. In view of the high misclassification between the categories “normal sinus rhythm (NSR)” and “other rhythm (O),” once the absolute difference between the predicted probabilities of the two categories was less than a heuristic threshold (0.4 in the paper), a binary classification will start working to make the final decision.
A variety of methods have been proposed to integrate knowledge with the DNN model and simultaneously perform training. This paper focuses on the use of logic, more specifically, first-order logic (FOL), to represent domain knowledge.
Rule distillation has been proposed to refine the knowledge represented by FOL rules for the DNN model, where the rules will force the DNN model to simulate the prediction of the rules during training through posterior regularization (Hu et al., 2016).
In this paper, we propose a generalized framework that enables integrated learning of the DNN and domain knowledge. The architecture is composed of three modules (Fig. 1): a baseline DNN classifier, a knowledge inference module, and a joint learning module. The DNN is an arbitrary neural network that takes a preprocessed signal as input and produces the probability of the category to which the input belongs. The knowledge inference module comprises a knowledge base and a rule-grounding, matching, and scoring (GMS) module. The outputs of the DNN model and the knowledge inference module are $n$-dimensional vectors, where $n$ is the number of categories. The joint learning module trains the DNN model and the knowledge inference module with backpropagation.
We prefer this method because the classification model can learn from the data and the domain knowledge jointly. The structural knowledge represented with FOL rules can be integrated into the neural network without changing the DNN model’s training process. Our method applies logic rules to represent domain knowledge, but the weight of each rule is not manually specified and will be regulated and optimized jointly with DNN weights during the learning process. Thus, the knowledge specification will also adapt to the meaningful data.
Logic is not differentiable, so many methods integrate logic rules as constraints or regularization terms of the DNN model, and perform relaxation to make them amenable to gradient-based learning. Semantic based regularization (SBR) represents the logic as a regularization term in the loss function to provide a penalty when the DNN model prediction violates the knowledge (Diligenti et al., 2017). Probabilistic soft logic (PSL) consists of a set of FOL rules and the satisfaction distance of the grounded rules is added to the loss function as a regularization term (Kimmig et al., 2012). Abductive learning is a framework that unifies machine learning and logical reasoning (Dai et al., 2019). In each training epoch, the conventional neural network is used to produce primitive logic facts, called pseudo-labels, and logical reasoning is used to revise incorrect pseudo-labels based on the domain knowledge. The revised labels are used to re-train the neural network in the next epoch.
Fig. 1 Architecture of the proposed method
The DNN classifier can be formalized as $F_c: X \to Y$, where $X$ is the preprocessed data and $Y \in \mathbb{R}^n$ is the output space. For the training data $\{(x_i, y_i)\}_{i=1}^{n}$, the output of the classifier is the probability $p_\theta(y_i \mid x_i)$ that input $x_i$ belongs to category $y_i$, and $\theta$ denotes the parameters of the neural network. The knowledge inference module can be formalized as $F_k: \tilde{X} \times Y \to C$, $C \in \mathbb{R}^+$, where $\tilde{X}$ is the raw data without preprocessing, and $C$ is the degree to which the input data matches the label. The input $\tilde{X}$ differs from the preprocessed data $X$ in that it is not chopped or padded into fixed-length segments for DNN processing, an operation that loses the valuable information carried by the discarded segments.
The objective of the framework is to train the neural network under constraints, to simultaneously minimize the classification mismatch and penalize the violation of the knowledge base. The cost function can be represented as
$L_C$ is used to force the sample to fit the real label, and $L_K$ is used to penalize the disagreement between the outputs of the two modules. $\lambda$ is a hyperparameter that trades off the knowledge inference module against the deep learning model.
where $l$ is the cross-entropy loss function.
$L_K$ is measured with the Kullback-Leibler divergence (Sankaran et al., 2016) in each training iteration:
where $p_k(\cdot)$ is the soft prediction of the knowledge inference module, detailed in Section 3.2.2.
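The joint objective described above can be sketched in Python for a single sample. This is a minimal illustration under stated assumptions, not the paper's implementation: the additive form $L = L_C + \lambda L_K$ and the divergence direction KL($p_k$ ‖ $p_\theta$) are assumptions, since the text does not spell them out, and the function names are hypothetical.

```python
import math

def cross_entropy(p_theta, label):
    """Cross-entropy l between the DNN prediction and the true class index."""
    return -math.log(max(p_theta[label], 1e-12))

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same categories."""
    return sum(pi * math.log(pi / max(qi, 1e-12)) for pi, qi in zip(p, q) if pi > 0)

def joint_loss(p_theta, p_k, label, lam=0.1):
    """Assumed form L = L_C + lam * L_K for one sample."""
    l_c = cross_entropy(p_theta, label)   # fit to the ground-truth label
    l_k = kl_divergence(p_k, p_theta)     # assumed direction: knowledge vs. DNN
    return l_c + lam * l_k

# Example with hypothetical probabilities over three categories.
p_theta = [0.7, 0.2, 0.1]   # DNN prediction
p_k = [0.5, 0.4, 0.1]       # knowledge-module soft prediction
loss = joint_loss(p_theta, p_k, label=0, lam=0.1)
```

With `lam=0` the knowledge term vanishes and the sketch reduces to plain cross-entropy training.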
3.2.1 Presentation of knowledge
We use fuzzy logic rules to represent the domain knowledge. An atom is a tuple of the form $p(x_1, x_2, \ldots, x_m)$, where $p \in P$, a given set of base predicates, and each $x_i$ is either a variable or a constant. A predicate $p$ is a relation defined by a unique feature extracted as an attribute of an object according to the domain knowledge, such as the permitted value range of the feature. A rule $r$ is a Horn clause of disjunctive predicates with one term in the conclusion part, and each rule is associated with a weight $\eta_r$ that represents the empirically preconfigured confidence of the rule, which can be initialized to 0 and is updated and learned during training.
The rules are stored in the knowledge base. When the training data is input, the features are extracted and the corresponding predicates are grounded. A grounded predicate is the instantiation of all the variables $x_i$. The set of grounded predicates is also called the Herbrand base, denoted as $G$. A rule is grounded by grounding all the predicates of the rule iteratively.
and $\nabla_\theta p_\theta$ can be computed using the usual neural network backpropagation.
Table 1 The soft truth computation of Łukasiewicz’s t-norm
The data input into the knowledge inference module is a complete signal without segmentation or dropout, to compensate for the information lost in the preprocessing. Specific features are extracted from the raw data. The GMS module uses the features to iteratively ground the variables in the rules, determine the satisfied rules, and compute the score of the satisfied rules. The mapping from features to atoms is called an interpretation $I$. The process is described as follows:
1. Atom translation: When the training data is input, the features are extracted and the corresponding predicates are grounded.
2. Predicate translation: There is a given set $P$ of base predicates determined according to the domain knowledge, and the predicates are defined as $p(x_1, x_2, \ldots, x_m)$. The predicates are grounded as $p$ or its negation $\neg p$.
3. Proposition translation: The proposition is translated into a combination of predicates with the logical operators conjunction ($\wedge$) and disjunction ($\vee$).
4. Rule translation: For a rule $r_{body} \to r_{head}$, the soft truth values of the antecedent and consequent of the rule are computed as $I(r_{body})$ and $I(r_{head})$, respectively, according to Table 1, and the distance under interpretation $I$ to satisfy the rule is defined as $d_r(I) = \max(I(r_{body}) - I(r_{head}), 0)$.
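The rule translation step can be illustrated with a short Python sketch. It uses the standard Łukasiewicz operators (conjunction $\max(0, a+b-1)$, disjunction $\min(1, a+b)$, negation $1-a$) and the distance to satisfaction defined in step 4. The example rule and its soft truth values are hypothetical and serve only to show the arithmetic.

```python
def l_and(a, b):
    """Lukasiewicz conjunction (t-norm) of two soft truth values in [0, 1]."""
    return max(0.0, a + b - 1.0)

def l_or(a, b):
    """Lukasiewicz disjunction (t-conorm)."""
    return min(1.0, a + b)

def l_not(a):
    """Lukasiewicz negation."""
    return 1.0 - a

def distance_to_satisfaction(body, head):
    """d_r(I) = max(I(r_body) - I(r_head), 0); zero means the rule is satisfied."""
    return max(body - head, 0.0)

# Hypothetical grounded rule: wide_qrs AND premature_beat -> PVC
wide_qrs, premature_beat, pvc = 0.9, 0.8, 0.4
body = l_and(wide_qrs, premature_beat)      # 0.9 + 0.8 - 1 = 0.7
d = distance_to_satisfaction(body, pvc)     # 0.7 - 0.4 = 0.3
```

A rule whose head is at least as true as its body has distance 0, so only violated groundings contribute a penalty.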
Given the grounded atoms, the GMS module derives a distribution over possible interpretations, and the probability density function is defined as
We aim to minimize the distance to rule satisfaction for each instance. We compute the distance with the GMS module and find the minimum of all possible rule grounding results.
The loss function $L$ can be minimized reliably if it is convex. By relaxing the logic rules using Łukasiewicz’s t-norm and restricting the rules to definite Horn clauses, the convexity of $L_K$ is guaranteed and the loss function can be optimized with the GMS method. Details of the convexity proof can be found in Giannini et al. (2019).
The hyperparameter $\lambda$ in Eq. (1) creates a tradeoff between the impact of the DNN module and that of the knowledge module. It is sampled from a Beta distribution, Beta($\beta$, $\beta$). The hyperparameter is selected by observing the best $F_1$ performance on the validation set, as shown in Fig. 5. We test the model performance under different choices with $\beta = 0.1$ and set $\lambda$ to 0.1.
where
Let $\eta$ denote the weight of the logic rules. The gradient of $L$ w.r.t. $\eta$ can be computed as
This section provides a concrete instance of our general framework in the task of ECG arrhythmia classification. We test the method in detection of eight arrhythmias against normal records from 12-lead ECG signals. The arrhythmias include atrial fibrillation (AF), first-degree atrioventricular block (I-AVB), left bundle branch block (LBBB), right bundle branch block (RBBB), premature atrial contraction (PAC), PVC, ST-segment depression (STD), and ST-segment elevation (STE).
Fig. 2 illustrates the baseline neural network architecture. The input signal in the form of 12×5000 is fed into the first convolutional block, followed by eight convolutional blocks with residual connection and a classification layer. The convolutional blocks have the same structure except for the first and last.
Fig. 2 The DNN model to detect eight arrhythmias against normal records from 12-lead ECG signals
The first convolutional block consists of a one-dimensional convolutional (1D Conv) layer, a batch normalization (BN) layer, and a rectified linear unit (ReLU) layer. BN normalizes the activations of each mini-batch to zero mean and unit variance to reduce the impact of internal covariate shift (Ioffe and Szegedy, 2015), which is the phenomenon that the input distribution of each layer changes with the parameters of the previous layers during training. The BN transformation can be applied to any activation in a network and enables a higher learning rate.
For the next eight blocks, each block consists of two convolutional layers. The filter (kernel) size of all the convolutional layers is 16 and the number of filters is $32 \times 2^k$, where $k$ starts at 0 and increases by 1 every four blocks. Following the pre-activation block design, we apply a BN layer and a ReLU layer before each convolutional layer. We apply residual connections by adding a shortcut between two consecutive convolutional blocks, so that the shortcut output is added to the output of the skipped block. Max pooling is an operation that computes the maximum value of a particular feature; it reduces the dimensionality of the output features significantly while providing a degree of translation invariance. We use max pooling of size 2 and stride 2 in the residual connection to guarantee that the feature maps being added have the same dimensionality.
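The filter schedule of the eight residual blocks follows directly from the rule stated above. A minimal Python sketch (the function name and defaults are illustrative):

```python
def filters_per_block(num_blocks=8, base=32, step=4):
    """Number of filters 32 * 2**k, with k increasing by 1 every `step` blocks."""
    return [base * 2 ** (i // step) for i in range(num_blocks)]

schedule = filters_per_block()
# The first four blocks use 32 filters and the last four use 64.
```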
The last convolution layer is used to integrate the feature vectors produced. The output of the last convolutional block is fed into a SoftMax regression layer, which corresponds to the probability distribution of the label to which the input ECG segment belongs. A fully connected (FC) layer contains nine cells corresponding to the nine categories.
The squeeze-and-excitation (SE) module is applied to refine the channel-wise feature maps. As shown in Fig. 3, the SE module consists of a global average pooling (GAP) layer and two FC layers, each with a different activation function. Given the input feature vector $X$, the GAP layer squeezes the global spatial information into a channel descriptor to capture channel-wise dependencies. The SE module produces a scalar $s$ to represent the importance of the channel in Eq. (8), where $\delta$ refers to the ReLU function and $\sigma$ refers to the Sigmoid function. The refined feature vector is shown in Eq. (9), where $s \cdot X$ refers to the channel-wise multiplication between feature vector $X$ and scalar $s$.
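The squeeze and excitation steps can be sketched in plain Python. The sketch assumes the usual SE form $s = \sigma(W_2\,\delta(W_1 z))$ with $z$ the GAP output, omits bias terms, and uses hypothetical weight matrices `w1` and `w2`; it is meant only to make the data flow concrete.

```python
import math

def se_module(x, w1, w2):
    """x: list of C channels, each a list of samples.
    w1: reduction weights (C_r rows of length C); w2: expansion weights (C rows of length C_r)."""
    # Squeeze: global average pooling gives one descriptor per channel.
    z = [sum(ch) / len(ch) for ch in x]
    # Excitation: FC + ReLU, then FC + Sigmoid (biases omitted for brevity).
    hidden = [max(0.0, sum(w * zj for w, zj in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden)))) for row in w2]
    # Scale: channel-wise multiplication s_c * X_c.
    refined = [[sc * v for v in ch] for sc, ch in zip(s, x)]
    return s, refined

# Two channels, two samples each; all-zero weights give s = 0.5 for every channel.
s, refined = se_module([[1.0, 3.0], [2.0, 2.0]], w1=[[0.0, 0.0]], w2=[[0.0], [0.0]])
```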
Fig. 3 The squeeze-and-excitation (SE) module
Domain knowledge is used in ECG arrhythmia detection to explore characteristics and improve the classification performance. Knowledge-based rules are aligned with diagnosis criteria according to the cardiologist’s experience and carry clinical meanings.
As shown in Fig. 4, one cardiac cycle in an ECG signal consists of the P-QRS-T waves. The P wave represents atrial depolarization, the QRS complex represents ventricular depolarization, and the ST segment and T wave represent ventricular repolarization (Goldberger et al., 2017). Considering that the symptoms of arrhythmias are different in each lead, the diagnostic rules of cardiac arrhythmia are extracted based on prior knowledge and clinical experience.
Fig. 4 The cardiac cycle in an ECG signal
When the sinus rhythm is normal, the P wave of lead II is always positive, the P wave of lead aVR is always negative, and the heart rate is between 60 bpm and 100 bpm.
BBB is diagnosed mainly from a widened QRS complex lasting longer than 0.12 s. In RBBB, the right ventricle depolarizes after the left ventricle, which is reflected in leads I, V6, and V1 (indicating the slow depolarization of the right ventricle in a left-to-right direction). Associated diagnostic features of RBBB include a wide slurred S wave in leads V5 and V6, ST segment depression, and T wave inversion in lead V1. In LBBB, the left ventricle depolarizes after the right ventricle. Associated features of LBBB include long R waves in leads V5 and V6 and a long S wave in lead V1 (Hamad, 2018).
STD and STE are the most widely used features for detection of ischemic disease and myocardial infarction (MI). ST deviation is measured as the height difference between the J point and the reference line. The J point is at the end of the QRS complex and the beginning of the ST segment. The PR segment is used as the reference line (baseline) for measuring the deviation of the ST segment. The finding is STE if the J point is 0.2 mV higher than the baseline, and STD if the J point is 0.05 mV lower than the baseline in leads V2 and V3 (O’Gara et al., 2013; Hanna and Glancy, 2015; Gupta et al., 2020). V5 is selected because it has the highest sensitivity in detecting myocardial ischemia (Crawford et al., 1999). Lead aVL is more suitable for diagnosing MI caused by left anterior descending (LAD) coronary artery occlusion, especially extensive anterior MI (Acharya et al., 2019).
The characteristic of AF is small waves of high frequency (350-600 bpm). The diagnosis of AF is the absence of P waves in all leads and short, irregular RR intervals. Atrial flutter and AF are related arrhythmias and often have similar appearance. The distinct features of AF are the totally irregular rhythm and variable wave morphology, which are constant and identical, respectively, in atrial flutter (Goldberger et al., 2017).
AVB is characterized by a prolonged PR interval. I-AVB occurs when the PR interval is ≥0.20 s. The associated clinical diagnosis criteria also include the electrical axis of the QRS complex. The normal mean QRS axis in adults lies in [-30°, +100°], and left deviation of the electrical axis (<-30°) is a noteworthy manifestation (Goldberger et al., 2017).
PAC can be diagnosed based on the P wave characteristics. Compared with the sinus P wave, a premature P wave has a different morphology and axis. A reverse P wave in lead II or III is a sign of PAC. In addition, it occurs earlier than the sinus P wave. A prolonged PR interval increases the probability of PAC. Lead aVR is used in detection (Gorgels et al., 2001).
PVC is recognized from a QRS complex that is wide (≥0.12 s) and abnormal in appearance. The premature ventricular impulse will replace a sinus beat and disrupt the regular interval between beats, which will lead to a prolonged RR interval.
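Several of the diagnostic criteria described above translate naturally into threshold predicates over extracted features. The following Python sketch is illustrative only: the thresholds are taken from the text, but the feature dictionary and its key names are hypothetical, and the crisp Boolean outputs stand in for the soft truth values used by the knowledge module.

```python
def sinus_rhythm_normal(f):
    """Positive P wave in lead II, negative P wave in aVR, heart rate 60-100 bpm."""
    return f["p_amp_II"] > 0 and f["p_amp_aVR"] < 0 and 60 <= f["heart_rate_bpm"] <= 100

def first_degree_avb(f):
    """I-AVB criterion: PR interval of at least 0.20 s."""
    return f["pr_interval_s"] >= 0.20

def wide_qrs(f):
    """Wide QRS complex (at least 0.12 s), as used in the PVC description."""
    return f["qrs_duration_s"] >= 0.12

# Hypothetical feature values for one record (amplitudes in mV, durations in s).
features = {
    "p_amp_II": 0.12,
    "p_amp_aVR": -0.08,
    "heart_rate_bpm": 72,
    "pr_interval_s": 0.22,
    "qrs_duration_s": 0.09,
}
fired = {
    "normal_sinus": sinus_rhythm_normal(features),
    "I-AVB": first_degree_avb(features),
    "wide_QRS": wide_qrs(features),
}
```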
The associated features are summarized in Table 2.
In this work, the dataset used is obtained from the China Physiological Signal Challenge (CPSC) (Liu FF et al., 2018), which includes 9831 12-lead ECG recordings sampled at 500 Hz. The training set is open to the public and the testing set is private. To validate our model with more data and augment the dataset to reduce class imbalance, we incorporate the PTB-XL database (Wagner et al., 2020). The records are shown in Table 3.
Table 2 ECG features extracted based on domain knowledge
Table 3 Number of recordings of datasets
To reduce the effect of class imbalance, we randomly divide the records of each class into five subsets and duplicate records of the classes with fewer records so that the number of records of each class is nearly equal. The five subsets are used for fivefold cross-validation.
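The duplication step can be sketched as random oversampling in Python. The text does not say how the records to copy are chosen, so uniform random selection with a fixed seed is an assumption here, and the sketch balances the classes exactly rather than "nearly".

```python
import random

def oversample(records_by_class, seed=0):
    """Duplicate randomly chosen records of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    target = max(len(recs) for recs in records_by_class.values())
    balanced = {}
    for label, recs in records_by_class.items():
        extra = [rng.choice(recs) for _ in range(target - len(recs))]
        balanced[label] = list(recs) + extra
    return balanced

# Hypothetical record identifiers for three classes.
data = {"NSR": ["n1", "n2", "n3", "n4"], "AF": ["a1", "a2"], "STE": ["s1"]}
balanced = oversample(data)
```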
3.2.2 GMS module
We randomly divide the publicly accessible records at a ratio of 70%:10%:20% for training, validation, and testing, respectively. Every recording is labeled as the normal type or one of the eight abnormal types. For a recording with more than one label, the classification result is considered correct if it is consistent with one of the labels. Before being fed into the model, all the ECG signals are denoised and filtered to remove baseline wander using a Daubechies 6 wavelet (Singh and Tiwari, 2006).
The DNN model requires the input signal to have a fixed length. The length of the CPSC recordings varies from 6 to 60 s, and the standard 12-lead ECG recording length is 10 s. The raw signals are therefore preprocessed to a fixed length of 10 s. Shorter recordings are padded to 10 s with data points copied from the same recording; longer recordings are split into several 10-s segments, only one of which is input into the model. To prevent the model from overfitting, we input different segments of the same recording in different training epochs. Each channel thus has 5000 preprocessed signal samples.
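The fixed-length preprocessing can be sketched for a single channel in Python. The text does not specify which data points are copied when padding, so repeating the recording from its start is an assumption; the `start` argument is a hypothetical way to select a different segment of a long recording in each epoch.

```python
def to_fixed_length(signal, target_len=5000, start=0):
    """Crop or pad one channel to `target_len` samples (10 s at 500 Hz)."""
    if not signal:
        raise ValueError("empty signal")
    if len(signal) >= target_len:
        # Long recording: take one segment, clamped so it fits inside the signal.
        start = max(0, min(start, len(signal) - target_len))
        return list(signal[start:start + target_len])
    # Short recording: pad by repeating samples from the same recording (assumed scheme).
    out = list(signal)
    while len(out) < target_len:
        out.extend(signal[:target_len - len(out)])
    return out

padded = to_fixed_length([0, 1, 2], target_len=7)
cropped = to_fixed_length(list(range(10)), target_len=4, start=8)
```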
Signal cropping will inevitably lead to loss of information. That is why we use the complete record for the knowledge inference module. The records do not need cropping, but do need further slicing into heartbeats to extract domain features. The ECG signals are segmented according to the location of the R peak using the Pan-Tompkins algorithm (Pan and Tompkins, 1985), which is regarded as the identification of a heartbeat. The length of each heartbeat is fixed at 600 ms (200 ms before the R peak and 400 ms after) with 300 sample points. The features described in Table 2 are computed based on the heartbeat segmentation.
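The heartbeat slicing can be sketched as follows. The sketch assumes the R-peak sample indices are already available (for example, from a Pan-Tompkins detector, which is not implemented here) and drops beats whose window would extend past the record boundaries, a choice the text does not specify.

```python
def heartbeat_windows(signal, r_peaks, fs=500, pre_s=0.2, post_s=0.4):
    """Cut a window of 200 ms before and 400 ms after each R peak (300 samples at 500 Hz)."""
    pre = int(round(pre_s * fs))    # 100 samples
    post = int(round(post_s * fs))  # 200 samples
    beats = []
    for r in r_peaks:
        if r - pre >= 0 and r + post <= len(signal):
            beats.append(signal[r - pre:r + post])
    return beats

# Synthetic record of 1000 samples with hypothetical R-peak positions.
record = list(range(1000))
beats = heartbeat_windows(record, r_peaks=[50, 400, 900])
```

Only the peak at sample 400 yields a complete window here; the peaks at 50 and 900 lie too close to the record edges.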
The proposed model is developed and trained using Python with the TensorFlow library (Abadi et al., 2016). The experiments are performed on a computer with one Intel Core i9-9900K CPU at 3.6 GHz, NVIDIA Quadro RTX5000, and 64 GB memory. The Adam optimization method (Kingma and Ba, 2015) is used to optimize the model with the learning rate=0.001, beta1=0.9, and beta2=0.999. The procedure is conducted five times to complete the fivefold training and validation plus test.
In our experiments, the performance of the proposed model is evaluated with the following statistical measures as shown in Eqs. (10)-(13): sensitivity (Sen), specificity (Spe), precision (Pre), and accuracy (Acc). Sen measures the ability of the model to avoid missing an abnormal heartbeat, and Spe evaluates how well our model avoids misjudging a normal heartbeat. Pre measures the correctly predicted positive observations. Acc represents the overall performance of the model in properly classifying a heartbeat. True positive (TP) and true negative (TN) indicate the numbers of heartbeats correctly predicted, while false positive (FP) and false negative (FN) indicate the numbers of heartbeats not predicted as labeled.
For each class $x$, the $F_1$ score is denoted as $F_{1x}$ and computed using Eq. (14), and the average $F_1$ score of the model is evaluated as Eq. (15):
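The measures can be sketched in Python from the confusion counts of one class. The sketch assumes Eqs. (10)-(15) follow the standard definitions (for example, $F_1$ as the harmonic mean of precision and sensitivity, and the average $F_1$ as the unweighted mean over classes); the counts in the example are hypothetical.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def class_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, precision, accuracy, and F1 for one class."""
    sen = safe_div(tp, tp + fn)
    spe = safe_div(tn, tn + fp)
    pre = safe_div(tp, tp + fp)
    acc = safe_div(tp + tn, tp + tn + fp + fn)
    f1 = safe_div(2 * pre * sen, pre + sen)
    return {"Sen": sen, "Spe": spe, "Pre": pre, "Acc": acc, "F1": f1}

def average_f1(per_class_f1):
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_f1) / len(per_class_f1)

m = class_metrics(tp=8, fp=2, tn=85, fn=5)
avg = average_f1([m["F1"], 1.0])
```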
The performance is shown in Table 4.
Table 4 Performance of the proposed model
To evaluate the effectiveness of our proposed model structure, we compare the performance measures of the proposed model with those of two other models. The first model (denoted as Expert in Table 5) uses the domain features described in Table 2 as the input of a classifier. We build a logistic regression on the extracted features. The second model (denoted as DNN) uses the DNN model described in Fig. 2, which uses convolutional neural network (CNN) blocks to extract the features of each lead, concatenates all 12 feature vectors together with a fully connected layer, then inputs the concatenated feature vectors to the classification layer, and outputs the probability distribution of the arrhythmia type. The $F_1$ scores of the three models are shown in Table 5.
Table 5 F1 score in form “Mean±STD” of different models in the fivefold cross-validation
To demonstrate the effect of domain knowledge on the performance of the classifier more directly, the confusion matrices without and with domain knowledge are shown in Tables 6 and 7, respectively. The confusion matrix records the actual and predicted classifications for each class and identifies the type of errors being made by the classifier. The row labels indicate the true class records to which each row belongs, and the column labels indicate the class predicted by our model for records in each column. Numbers in each grid show the number of records classified as the column label when its true class is indicated by the row label.
Table 6 The confusion matrix of the DNN model
Table 7 The confusion matrix of the proposed model
In the classification of ECG arrhythmia, there are some domain-specific issues making the result unsatisfactory, leaving space to introduce the augmentation of domain knowledge. The issues can be summarized as follows:
1. The influence of lost input data information: DNN models require input data to be preprocessed into segments of a fixed length, which may lead to loss of important information. For PAC or PVC, the premature beat appears just a few times in the record, while other arrhythmias, such as AF, appear in each ECG beat. In extreme cases, the premature beat appears only once. The DNN model will then miss the beat, because the record may be cropped and the characteristic beat discarded. In this case, the record will be misclassified as NSR. We remedy this issue with the knowledge module, which takes the complete record as the input without cropping. The module magnifies the importance of specific important concepts missing from the learning model.
2. The influence of the similarity among classes: The similarity among classes leads to a high number of false positives. From the confusion matrix of the DNN model in Table 6, we can see that the DNN model is not sensitive to STE and STD detection. The small change of the ST segment amplitude is easily affected by noise, baseline drift, and subject variability. STD and STE can be misclassified as NSR, which makes learning to recognize them from the training set a difficult task. The characteristic rules of specific leads aim to reduce the misclassification. Similarly, for the further classification of AF and atrial flutter, which are often confused because of their similar morphology, the difference between heart rates can be used as a distinguishing rule.
3. The influence of features of different importance: One important DNN model issue is that the influence of one feature is trivial and may be neglected if other features are normal. For example, atrial rhythm and sinus rhythm are easily confused. The pathological characteristic is P-wave anomaly. It is hard to distinguish when the amplitude of the P wave of a specific subject is very small and other features fall into the normal range. However, the logic rules can amplify the significance of a specific feature, and thus focus on the most discriminative part of the signal.
Let θ denote the parameter of the neural network. The gradient of L with respect to (w.r.t.) θ can be computed as
Fig. 5 Hyperparameter search for λ
When λ=0, the model reduces to a traditional CNN model. As λ grows, the performance improves, which shows that the logical rules of the knowledge module are essential for error-prone categories with very similar patterns or ignored features. However, an overly large λ leads to reduced performance, because the neural network's ability to automatically extract nonlinear relations may be significantly weakened by the logical rules, leading to high sensitivity and low precision. In addition, the knowledge module is domain-specific and is highly constrained by its classification accuracy and representation power, and thus the parameter will impact the generalizability of the model.
In summary, a proper weight of the domain knowledge module is helpful in unifying the advantages of neural networks and logic reasoning. It should be estimated in a task-specific way.
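The role of λ can be sketched as a weight that trades off a data-driven loss against a rule-consistency loss. The convex combination below, the function names, and the use of cross-entropy against rule-derived soft targets are assumptions made for illustration; the exact loss used in the paper may differ. The only property taken from the text is that λ=0 recovers the purely data-driven model.

```python
import math

EPS = 1e-12  # guards against log(0)


def cross_entropy(probs, label):
    """Data loss: negative log-probability of the ground-truth class."""
    return -math.log(max(probs[label], EPS))


def rule_loss(probs, rule_truth):
    """Rule loss (assumed form): cross-entropy between the soft targets
    produced by the knowledge module and the network prediction."""
    return -sum(q * math.log(max(p, EPS)) for p, q in zip(probs, rule_truth))


def combined_loss(probs, label, rule_truth, lam):
    """Assumed convex combination; lam = 0 gives the plain data loss."""
    return (1.0 - lam) * cross_entropy(probs, label) + lam * rule_loss(probs, rule_truth)


probs = [0.7, 0.2, 0.1]        # network output for three classes
rule_truth = [0.1, 0.8, 0.1]   # hypothetical truth degrees from the rules
for lam in (0.0, 0.5, 1.0):
    print(lam, round(combined_loss(probs, 0, rule_truth, lam), 4))
```

With this form, a larger λ pulls the prediction toward the rule output regardless of the label, which is one way to picture why an overly large weight can hurt precision.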
The learning rate and batch size impact the performance of the model. We conduct two contrast experiments: one varies the learning rate with the batch size unchanged, and the other varies the batch size with the learning rate fixed at 0.001.
The model is trained for a total of 50 epochs. Fig. 6 presents the loss curves with a batch size of 64. We test learning rates of 0.01, 0.001, and 0.0001, and find that the loss converges to a very low value as the number of epochs increases at each learning rate. With a learning rate of 0.001, the loss curve shows a stable convergence trend toward 0, while the other two curves fluctuate during training.
Fig. 6 The loss curves at different learning rates
By fixing the learning rate at 0.001, we test the model with different batch sizes. As illustrated in Table 8, the best performance is achieved at the batch size of 64. When the batch size is larger than 64, the F1 score decreases as the batch size increases.
Table 8 Performance when using different batch sizes
The average running time is about 70 s. Note that the model converges in a few minutes, also depending on the size and structure of the knowledge inference rules. The inference rules are designed in a concise and clear way to avoid recursive inference. Fortunately, rules in ECG classification are different from commonsense reasoning. For example, given two facts “Tom is Alice’s husband” and “John is Tom’s son,” a new fact, “John is Alice’s son,” can be deduced, and the process can continue until no new fact is generated. This technique is called forward chaining, and it can result in a deep proof path. The training time then depends on the scale of the proof path. ECG classification rules avoid the issue because the presence of two or more arrhythmias does not imply the presence of a new arrhythmia.
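Forward chaining as described above can be sketched in a few lines. The fact encoding as (relation, subject, object) triples, the function names, and the single family rule are illustrative assumptions; they are not part of the proposed knowledge module.

```python
def forward_chain(facts, rules):
    """Apply every rule repeatedly until no new fact is derived.
    Returns the closed fact set and the number of rounds that added facts."""
    facts = set(facts)
    rounds = 0
    while True:
        new_facts = set()
        for rule in rules:
            new_facts |= rule(facts) - facts
        if not new_facts:
            return facts, rounds
        facts |= new_facts
        rounds += 1


def son_of_spouse(facts):
    """Rule: son(x, y) and spouse(y, z) imply son(x, z)."""
    derived = set()
    for rel1, x, y in facts:
        if rel1 != "son":
            continue
        for rel2, a, b in facts:
            if rel2 == "spouse" and a == y:
                derived.add(("son", x, b))
    return derived


facts = {("spouse", "Tom", "Alice"), ("son", "John", "Tom")}
closed, rounds = forward_chain(facts, [son_of_spouse])
print(sorted(closed), rounds)
```

In commonsense knowledge bases, derived facts can trigger further rules, so the number of rounds (the depth of the proof path) grows with the rule set. When no rule takes a derived conclusion as a premise, as the text states for ECG rules, the loop stops after a single round.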
We conduct a comparative study of the proposed method and the state-of-the-art methods. The most frequently used neural networks in ECG classification tasks include CNNs, recurrent neural networks (RNNs), and their combination, convolutional recurrent neural networks (CRNNs).
CNNs have proved to be very powerful and effective models for extracting sophisticated features, and are popular in different classification tasks including ECG signal classification. The ECG signal is sampled as a time series, so a one-dimensional convolutional neural network (1D-CNN) is a preferred option. Although ECG segments can be transformed into a two-dimensional representation to suit conventional networks, we still take time series as input to avoid introducing confounding factors and to facilitate performance comparison. We conduct experiments with three popular CNN models as listed in Table 9: InceptionTime (Fawaz et al., 2020) (INCE for short), ResNet (He et al., 2016), and VGGNet (Simonyan and Zisserman, 2015). The model inputs are tensors of size 5000×12, and the last FC layer is re-adapted to work exclusively with nine classes. ResNet includes one convolutional layer, eight residual blocks with two convolutional layers per block, and one FC layer. A kernel size of 5 is used in the 1D convolutions. VGGNet includes 16 1D convolutional layers with a kernel size of 3. INCE includes six inception blocks with kernel sizes of 40, 20, and 10 in each block. The experiment details are the same as in our experiment setup.
RNNs are a natural fit for time-series data. We investigate long short-term memory (LSTM) (Mostayed et al., 2018), which comprises two hidden recurrent layers with 100 recurrent cells each and one FC classification layer. In most cases, an RNN is applied in combination with a CNN, i.e., CRNN, where the CNN is used as the feature extractor and the RNN is used to capture the temporal dependence of the time series. There are three different structures: CNN with LSTM (Luo et al., 2019), CNN with GRU (Chen TM et al., 2020), and CRNN with the attention module (Yao et al., 2020). To make the comparative study valid and sound, we select studies that use the same dataset and have approximately equal network depths.
Table 9 shows recent ECG classification results, with bold data denoting the best performance. The experiment results show that, although the proposed model is not the best for some specific classes, it achieves the highest average F1 score. The arrhythmia classes with the greatest performance improvement are PAC, PVC, STD, and STE. STD and STE could be misclassified as NSR without focusing on the deviation of the ST segment. PVC and PAC are characterized by premature beats, which occur arbitrarily in an ECG recording. A fixed-length CNN input may lose this characteristic information and make the record resemble the normal class. The knowledge module compensates for this by taking advantage of a domain-specific determinant.
Table 9 Performance comparison between the proposed method and the state-of-the-art methods
The next two best models are ResNet and CRNN with an attention mechanism. In comparison with the two models, our work achieves an increase of 5.4% and 9.8% on average, respectively.
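For reference, a per-class and average F1 score can be computed from a confusion matrix as sketched below. The sketch assumes that the average is the unweighted (macro) mean over classes, which is an assumption about the averaging scheme, and the 3×3 matrix is made-up toy data, not a result from Table 9.

```python
def per_class_f1(cm):
    """F1 per class from a square confusion matrix, where cm[i][j] counts
    records of true class i predicted as class j."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, truly other
        fn = sum(cm[k]) - tp                       # truly k, predicted other
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def macro_f1(cm):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(cm)
    return sum(scores) / len(scores)


toy_cm = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]
print([round(s, 3) for s in per_class_f1(toy_cm)], round(macro_f1(toy_cm), 3))
```

Because macro averaging weights every class equally, improvements on rare classes affect the average as much as improvements on common ones.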
The LSTM model alone does not perform well on the task, but combining it with a CNN leads to a significant performance improvement because of the CNN's strong ability to extract nonlinear features. Note that we do not examine RNN models with carefully designed input, which might achieve performance competitive with their convolutional counterparts.
In summary, our model attains similar or competitive results when compared to the available state-of-the-art models. Learning with knowledge injection will produce more representative features, thus avoiding overfitting. The rich feature space in the process of knowledge injection learning improves the sensitivity and specificity of the model. Compared with the above-mentioned methods, we believe that infusion of domain knowledge into the DNN model will reduce false alarms, improve interpretability, and provide robustness for practical applications.
In this study, we propose an automatic classification model for cardiac arrhythmia that combines DNN and domain knowledge. The model consists of a DNN to capture the statistical pattern between input data and the ground-truth label, and a knowledge module to guarantee consistency with the domain knowledge. These two components are trained interactively to bring the best of both worlds.
Our method answers the questions raised in Section 1 as follows: (1) Domain knowledge is represented by fuzzy logic rules, which map a proposition to a real value in the range [0,1], making the truth degree comparable to the probability vector. (2) Logic rules are non-differentiable but can be relaxed using the t-norm, so the derivative can be computed and the gradient descent method can be applied to train the model jointly. (3) The performance is improved because the knowledge inference module reduces the influence of lost input data information, similarity between classes, and features of different importance. Compared to the end-to-end DNN model, the F1 score of each arrhythmia of the knowledge-enhanced model increases, which means that the domain knowledge is helpful in learning information that the neural network cannot exploit.
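The relaxation in point (2) can be illustrated with a minimal sketch. It assumes the product t-norm and the implication derived from it, which is only one common choice and not necessarily the one used in this work; the rule, the feature names, and the truth degrees are hypothetical.

```python
def t_and(a, b):
    """Product t-norm: fuzzy conjunction."""
    return a * b


def t_not(a):
    """Standard fuzzy negation."""
    return 1.0 - a


def t_or(a, b):
    """Probabilistic sum: the t-conorm dual to the product t-norm."""
    return a + b - a * b


def implies(a, b):
    """a => b rewritten as (not a) or b, which equals 1 - a + a * b."""
    return t_or(t_not(a), b)


# Hypothetical rule: irregular_rr AND absent_p_wave => AF
irregular_rr, absent_p_wave = 0.9, 0.8  # truth degrees of the premises
p_af = 0.3                              # network probability for AF

body = t_and(irregular_rr, absent_p_wave)
satisfaction = implies(body, p_af)

# The relaxed rule is smooth in p_af: d(satisfaction)/d(p_af) equals body,
# checked here with a central finite difference.
h = 1e-6
grad = (implies(body, p_af + h) - implies(body, p_af - h)) / (2 * h)
print(round(satisfaction, 3), round(grad, 3))
```

On crisp inputs the relaxed operators agree with Boolean logic, while on values in between they yield a nonzero gradient with respect to the network output, which is what allows the rule term to be trained by back propagation.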
We have instantiated our method for the ECG arrhythmia classification task. The experiment shows that our model attains competitive results when compared to many existing approaches. The method can be applied to other decision-making fields to provide generalization, reduce data bias, and improve interpretability.
Compliance with ethics guidelines
Jie SUN declares that he has no conflict of interest.
Data availability
The data that support the findings of this study are openly available in China Physiological Signal Challenge 2018 at http://2018.icbeb.org/Challenge.html and PTB-XL database at https://physionet.org/content/ptb-xl/1.0.1/.
Frontiers of Information Technology & Electronic Engineering, 2023, Issue 1