Jianke HU, Yin ZHANG
College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
Abstract:Recently,graph neural networks(GNNs)have achieved remarkable performance in representation learning on graph-structured data. However, as the number of network layers increases, GNNs based on the neighborhood aggregation strategy deteriorate due to the problem of oversmoothing, which is the major bottleneck for applying GNNs to real-world graphs. Many efforts have been made to improve the process of feature information aggregation from directly connected nodes, i.e., breadth exploration. However, these models perform the best only in the case of three or fewer layers, and the performance drops rapidly for deep layers. To alleviate oversmoothing, we propose a nested graph attention network (NGAT), which can work in a semi-supervised manner. In addition to breadth exploration, a k-layer NGAT uses a layer-wise aggregation strategy guided by the attention mechanism to selectively leverage feature information from the kth-order neighborhood,i.e., depth exploration. Even with a 10-layer or deeper architecture, NGAT can balance the need for preserving the locality (including root node features and the local structure) and aggregating the information from a large neighborhood. In a number of experiments on standard node classification tasks, NGAT outperforms other novel models and achieves state-of-the-art performance.
Key words: Graph learning; Semi-supervised learning; Node classification; Attention
Recently, deep learning architectures such as convolutional neural networks(CNNs)have been applied with great improvement in computer vision tasks(Krizhevsky et al.,2012;Simonyan and Zisserman,2014;He et al.,2016). However,unlike speech,image, and video data that lie on the Euclidean domains, many machine learning tasks involve data that cannot be represented in a grid structure, e.g.,social networks and biological networks, where the entities are interdependent through multiple relationships. Such data are normally represented in the form of graphs, in which entities and relations correspond to nodes and edges,respectively.
In recent years, researchers have become more interested in applying deep learning architectures to graph domains, such as Random Walks algorithms (Perozzi et al., 2014; Grover and Leskovec,2016; Ribeiro et al., 2017), which use the topology of the graph to directly train the individual node embedding, but discard node features and label information. Many other approaches extend the convolutional operator to graphs. For example, spectral graph convolutional networks (Bruna et al.,2014; Defferrard et al., 2016) design the convolutional layer based on graph spectrum analysis,while non-spectral networks(Niepert et al., 2016;Atwood and Towsley, 2016; Hamilton et al., 2017; Kipf and Welling, 2017) follow a feature passing scheme (or neighborhood aggregation)and achieve state-of-theart performance in many graph-structured tasks such as node classification and link prediction.
Theoretically,a single graph convolutional layer serves as a localized node feature aggregator in the first-order neighborhood (Kipf and Welling, 2017).Similar to CNNs, stacking multiple graph convolutional layers can yield a larger receptive field. In semi-supervised settings,as depicted in Fig.1,where the labeled nodes are very sparsely distributed, a larger receptive field allows the unlabeled nodes to benefit from the label information of further nodes.However, current work (Li et al., 2018; Klicpera et al., 2019) shows that neighborhood aggregation is equivalent to Laplacian smoothing, and that excessively increasing the range of neighborhood aggregation results in the problem of oversmoothing.Oversmoothing causes each node in a connected component to converge to a similar embedding, which means that the features learned by the graph neural networks (GNNs) become indistinguishable. In nature, the learned node embedding is not suitable to describe the node’s local information. Hence, with deeper layers,graph convolutional networks(GCNs)might show worse performance. The issue occurs in many GNN models using such a neighborhood aggregation strategy, and becomes the major bottleneck for these models to be scalable to large graphs.
Fig. 1 The first- and second-order neighborhood of the root node (drawn in blue) in an undirected labelsparse graph. The unlabeled nodes are drawn in grey and the labeled nodes are drawn in red
Since the neighbors of each node might not be equally important,some interesting work(Veli?kovi? et al., 2018; Lee et al., 2019) uses attention mechanisms to construct a dynamic neighborhood in each layer to improve the breadth exploration (feature passing among neighboring nodes). Recently,Knyazev et al. (2019) performed analysis on the interpretability of GNN models that incorporate the attention mechanism, and described how the attention works.
However, attention-based GNNs still encounter the problem of oversmoothing. Therefore, we try to alleviate this model degradation, and propose an adaptive neighborhood aggregation algorithm inspired by previous achievements. In addition to using the attention mechanism in breadth exploration,we are more concerned about the information aggregation in depth exploration. Ak-layer GNN can obtain the information in thekth-order neighborhood.To efficiently leverage this information from different depths, we collect the hidden embedding that is learned by each intermediate layer and assign different weights to the corresponding layers. Thus, each node can enhance the attention bias against different neighborhood depths, and fuse the features to learn the depth-adaptive embedding. Evaluation of the node classification benchmarks shows the representation learning ability of our model. In summary,the main contributions of this paper are as follows:
1. We propose an end-to-end deep architecture for graph representation learning. It directly accepts graphs as inputs without any feature engineering, and provides high-quality node embedding as outputs.
2.We propose a novel neighborhood aggregation algorithm to alleviate the problem of oversmoothing. As far as we know, this is the first attempt to design a depth-wise feature aggregation scheme in self-attention fashion.
3.The experimental results on a number of semisupervised node classification datasets show that our nested graph attention network (NGAT) is highly competitive compared to other novel models for node classification.
We begin by introducing our notations. Formally,letG={V,E,A}denote an undirected graph,which consists of a finite set of nodesVwith|V|=N,a set of edgesE, and a symmetric adjacency matrixA(its entryAi,jrepresents the weight of an edgee= (vi,vj)∈Econnecting the nodesvi,vj∈V;when there is no edge between nodesiandj,Ai,j=0). We useX∈RN×dito represent the node feature matrix, andXvrepresents thedi-dimensional feature vector of nodev.N(v)={u∈V|(v,u)∈E}denotes a neighborhood function that maps to the set of neighboring nodes ofv, and(v) =N(v)∪{v}maps to the neighbors and nodevitself.
Collecting labeled data is very time-consuming.Semi-supervised learning methods address this problem by maximizing the use of unlabeled data, together with the labeled data,to build a robust model.In graph domains, we useXLandYLto represent the feature matrix and labels, respectively, for labeled nodes. A commonly applied semi-supervised loss function with an explicit graph-based regularization is as follows:
Here,Lsupdenotes a supervised loss function,λis a weighing factor, andLregis a regularization term. Label information is smoothed over the graph via some form of graph-based regularization, e.g.,the graph Laplacian regularization:Lreg(X) =fT(X)Δf(X), wheref(·) can be a differentiable embedding function,Δ=D-Ais the unnormalized graph Laplacian matrix, andDis a diagonal degree matrix with entryThere are many pioneer algorithms based on graph Laplacian regularization such as label propagation (Zhu et al., 2003), where the node’s label propagates to neighboring nodes according to their proximity using the graph Laplacian matrix. Manifold regularization (Belkin et al., 2006) combines a support vector machine (SVM) classifier and Laplacian regularized least squares to compute the supervised loss. Other standard approaches that use graph regularization in semi-supervised graph learning have been summarized in Chapelle et al. (2009).
GNNs are deep architectures applied to graph domains,and have been widely used in many graphdata tasks. For the node classification task which aims to predict the node’s label using the node features and graph structure, GNNs are designed to learn an embedding functionf:V →Rdhto generate adh-dimensional node embeddingZv. Then, a classifier acceptsZvas the input to predict the corresponding labelyvof nodev(?v∈V). Modern GNNs follow a neighborhood aggregation strategy, where they iteratively update the embedding of a node by aggregating the embeddings of its neighbors. Afterkiterations of aggregation,the final embedding of a node captures the features of neighboring nodes and structural information within thekth-order neighborhood. For ak-layer GNN, we useh(l)vto indicate the embedding of nodevlearned by thelthhidden layer,and commonly,the embeddingh(k)vof the last layer is used asZvfor downstream tasks.
We have presented several traditional approaches on semi-supervised graph representation learning in Section 2. In this section, we provide a brief overview of the recent advancements in using deep neural networks that are related to our work.
GCN (Kipf and Welling, 2017) is a powerful neural network that achieves state-of-the-art performance in semi-supervised node classification on graph-structured data. A typical case of GCN stacks two graph convolutional layers as follows:
whereW(0)andW(1)are learnable weight matrices trained over all labeled nodes.σ(·) is a non-linear activation function such as ReLU(x) = max(0,x),applied in an entry-wise manner. The softmax function is defined as softmax(xi) =exp(xi) withFurthermore,denotes the normalized symmetric adjacency matrix:
whered(v) measures the degree of nodev(?v∈V).
GCN-like architectures use a fixed adjacency matrixin every graph convolutional layer, and ignore the difference between different neighboring nodes. Veli?kovi? et al. (2018) believed that designing an aggregation strategy that is aware of this imbalance can make GNNs more powerful, and that building a dynamic adjacency matrix also alleviates the problem of oversmoothing.
Graph attention network(GAT)designs a feedforward network to compute the attention valueev,ubetween nodesvandu:
whereWais a weight vector that is shared over all nodes. The attention valueev,uis regarded as the edge weight between nodesvandu,and it is recomputed in each layer; therefore, the adjacency matrix is dynamic.
Almost at the same time of the publication of GAT, there is a similar work,attention-based graph neural network(Thekumparampil et al., 2018),that uses a simpler attention mechanism but matches the state-of-the-art performance in some node classification benchmarks.
Besides the two kinds of neural networks mentioned above,many other GNNs have been proposed recently. FastGCN (Chen et al., 2018) samples the nodes in each layer according to the importance distribution. GIN (Xu et al., 2019) uses a sum aggregator over neighboring nodes and a multi-layer perceptron after aggregating node features. SGC(Wu F et al., 2019) removes the non-linear activation between consecutive layers and collapses the weight matrices. LanczosNet (Liao et al., 2019)leverages the Lanczos algorithm to approximate the graph Laplacian to collect multi-scale information from graph convolutions. The thorough survey of graph neural networks can be found elsewhere(Zhou et al.,2018;Wu ZH et al.,2019).
In this section, we present the architecture of the NGAT, which deploys the node-wise and layerwise attention mechanisms. The schematic layout of NGAT is shown in Fig. 2. Algorithm 1 presents the process for generating the node embedding using the NGAT that has already been trained, and the parameters are fixed.
Fig.2 Schematic layout of a K-layer nested graph attention network(NGAT)for graph representation learning.The architecture consists of three parts: (1) the embedding layer does a linear transformation with non-linear activation; (2) the node attention layer aggregates the feature information from neighbors in attention fashion;(3) the layer(-wise) attention layer selectively aggregates the feature information from different depths, and outputs the final embedding Zv
We start with an embedding layer to map node features into a low-dimensional vector space,so that the inherent features of the nodes are visible in layer-wise attention, as will be introduced in Section 4.2. It can be expressed over all nodes as a simple matrix multiplication:
Algorithm 1 Embedding algorithm of NGAT Input: graph G, node features Xv, neighborhood function with self-loops ~N(v),?v∈V Model information: depth K, parameter W, nodewise aggregation function N-ATTN(·), and layerwise L-ATTN(·)Output: node embedding Zv 1: h(1)v ←σ(XvW(0)),?v∈V 2: for l =1,2,...,K-1 do 3: for v∈V do 4: h(l+1)v ←N-ATTN({h(l)u ,?u∈ ~N(v)})5: end for 6: end for 7: h(K+1)v ←L-ATTN(h(1)v ,h(2)v ,...,h(K)v)8: Zv ←h(K+1)v ,?v∈V
whereX∈RN×diis the node feature matrix, the weight matrixW(0)∈Rdi×dhis trainable as a part of the whole model, anddhis the number of hidden units.
A node-wise attention layer aims to improve the breadth exploration. It proceeds along lines 3—5 in Algorithm 1. The node attention mechanism is denoted by the placeholder N-ATTN. In each layer,node embeddings are updated in two stages: attention coefficient calculation and feature passing. We describe each stage in detail.
1. Attention coefficient calculation
We draw inspiration from Thekumparampil et al.(2018),which used cosine similarity to measure the correlation between different nodes. A learnable linear transformation is applied to improve the expressivity,as follows:
whereW(l)∈Rdh×dhis a trainable weight matrix.Compared to GAT which uses multi-layer perception (MLP) to calculate the attention value, cosine similarity is more economic, resulting in substantial savings in the gradient computation, as it does not need new parameters. To normalize these coefficients across all choices of different nodes,we use a softmax function that maps these coefficients into the probability distribution:
In each layer, the attention coefficients are calculated only in the first-order neighborhood of nodev(?v∈V), denoted as ~N(v).
2.Feature passing
Feature passing is the key step of feature aggregation over the graph. The attention-guided neighborhood aggregation can be summarized as a weighted linear combination across all choices of neighboring nodes and the node itself, as follows:
More compactly, in matrix form, an attention layer with a propagation matrixP∈RN×Nis defined as
Similar to the adjacency matrixA, the entryequals to the positive attention coefficient only when nodeuis directly connected to nodev, i.e.,otherwise,. The feature passing is dynamic because the propagation matrix may change over the layers. Moreover,using softmax normalization,each row ofP(l)sums to one,which ensures that the hidden state of the node will not be scaled after feature passing in the attention layer.
For node classification, many other GNNs use the node embeddingh(K)vof the last iteration to predict labels. Yet the features of beginning iterations may preserve locality (close to the root node), and they are better for prediction. Thus,after executingKiterations of layer transformations that include the first embedding layer and the subsequent nodewise attention layers, the input to the layer-wise attention layer is a finite set of all hidden embeddingsfor nodev. To consider an efficient depth exploration, we provide two attention mechanisms that are operated on all hidden embeddings: parametric and non-parametric attention mechanisms. This procedure is denoted by the placeholder L-ATTN in Algorithm 1.
1. Parametric attention mechanism
The parametric attention mechanism works in the self-attention setting, which allows the input features to be the criteria for the attention itself(Vaswani et al., 2017). As depicted in Fig. 3a, we use a single-layer perceptron to calculate the attention coefficients. It can be expressed as follows:
Here,Hv∈RK×dhis the matrix that consists of all hidden embeddings ofKlayers,Cv∈RKis the calculated attention coefficient vector (the subscriptvindicates that these notations are related to nodev,?v∈V),Θ∈Rdhis the learnable attention parameter shared across all nodes, andis a scaled operation.
To obtain sufficient attention information, in practice, we use the multi-attention mechanism to make the model more powerful.Mversions of the self-attention mechanism perform the transformation shown in Eq. (6) in parallel, and then yield a concatenatedMdh-dimensional output vector as the final representation:
Here,Civis theithattention coefficient vector calculated by the corresponding attention head, parameterΘin each attention head is independent, and‖indicates concatenation.
2. Non-parametric attention mechanism
We also use a simpler attention mechanism that proceeds without any parameters. Similar to the mechanism applied in the node-wise attention layer,cosine similarity is used here to calculate the attention coefficients. As illustrated in Fig. 3b, we measure the correlation between theKthhidden embedding (final iteration) and the previousK -1 hidden embeddings. The cosine function is defined as cos(x,y) =xy/(‖x‖‖y‖). After computingK -1 normalized attention coefficients, a weighted summation is performed to obtain the output embeddingh(K+1)v. The complete non-parametric attention mechanism is as follows:
Fig. 3 Two versions of layer-wise attention: (a) parametric attention mechanism (self-attention); (b) nonparametric attention mechanism
TheKthhidden embeddingh(K)vis not used in layer aggregation, and it is considered only as the attention criterion for coefficient calculation.
Here, we summarize the node embedding procedure of NGAT. First, an embedding layer is used to transform the input features into higher-level features to obtain sufficient expressive power, and also to retain the inherent features of the nodes when feeding into the layer-wise attention layer. Next,we useK -1 node-wise attention layers followed by a layer-wise attention layer. Node-wise attention improves the neighborhood aggregation strategy by distinguishing the importance of each neighboring node,and layer-wise attention is to selectively leverage these hidden embeddings of all layers to generate an informative embeddingZvfor nodev(?v∈V).
Similar to a standard MLP,we use linear regression with a softmax classifier for node classification:
Here,S∈RN×dcis the probability distribution of class prediction (dcis the number of classes),Z∈RN×dhis the final node embedding matrix,andW(K+1)∈Rdh×dcis the trainable weight matrix.The softmax function is applied in a row-wise manner. Note that if we apply the multi-head attention mechanism in layer-wise aggregation, the final node embedding will containMdhfeatures. Accordingly,the weight matrixW(K+1)∈RMdh×dc, whereMis the number of attention heads. For semi-supervised node classification, we jointly train NGAT and the multi-class softmax classifier using the gradient descent method. The optimization goal is to minimize the cross-entropy loss over all labeled nodes:
whereYLis the set of node indices that have labels.
The time complexity of a single node-wise attention layer computingdh-dimensional embedding may be expressed asO(Nd2h+|E|dh), whereNand|E|are the numbers of nodes and edges in the graph,respectively. The time complexity of the layer-wise attention head aggregatingK-layer hidden embeddings may beO(NKdh). The memory complexity isO(N2). If using the sparse representation for the adjacency matrixAand the node feature matrixX,the memory requirement can be reduced toO(|E|).
We evaluate NGAT on semi-supervised node classification datasets and compare its performance with those of other powerful GNNs. Our experimental results show that NGAT is highly competitive in learning representation of graph data.
We use five benchmark datasets for our experiments. Three are well-known citation networks,i.e.,PubMed (Namata et al., 2012), Cora, and Citeseer(Sen et al., 2008), where nodes and edges represent documents and citation links,respectively. The other two are new co-authorship networks: Microsoft Academic CS and Academic Physics from the KDD Cup 2016 Challenge (https://www.kdd.org/kddcup/view/kdd-cup-2016/Data). All datasets use the bag-of-words representations as node features, and the class labels indicate the research fields that each node (document/author) belongs to. The statistics of the datasets are reported in Table 1. We treat all graphs as undirected versions,and the adjacency matrix of each graph is binary.
Table 1 Statistics of the node classification datasets used for experiments
We construct an NGAT model with five nodewise attention layers (K= 6 with the first embedding layer). Each hidden layer has a fixed feature size ofdh=32. For the layer-wise attention mechanism,we evaluate both parametric and non-parametric versions, denoted by P-NGAT and NP-NGAT, respectively. P-NGAT applies the multi-head attention with four heads (M=4). For optimization, we choose the Adam optimizer (Kingma and Ba, 2014)with learning rate of 0.005. We fix the dropout(Srivastava et al., 2014)rate to bep= 0.5 applied after all linear layers and node-wise attention layers, and add anL2regularization withλ= 0.0005 on the model parameters. In practice, we have tried other hyperparameter settings and evaluated them on the validation set but did not find much difference. We train for a maximum of 500 epochs. However, the actual training time is considerably shorter since we use an early stopping criterion. Specifically,training stops if the validation loss does not decrease for 20 epochs. We reset the parameters of NGAT to the state with the lowest validation loss. In each epoch,we feed all training nodes as a batch, and the model parameters are updated through back propagation to reduce the cross-entropy training loss. For baseline approaches, we test GCN (Kipf and Welling,2017), AGNN (Thekumparampil et al., 2018), GAT(Veli?kovi? et al., 2018), and JK-Net with GCNfeature concatenation (Xu et al., 2018) from the released implementations, and thereafter, cite the performances of other novel GNNs such as DGI(Veli?kovi? et al., 2019), SGC (Wu F et al., 2019),LanczosNet (Liao et al., 2019), and AdaLNet (Liao et al., 2019)from their original papers.
1. Planetoid split
First,we evaluate all models on a common semisupervised train/dev/test set according to Planetoid split (Yang et al., 2016), i.e., 20 nodes per class for training, 500 nodes for validation, and 1000 nodes for test. Except the training set, the remaining nodes are label-invisible during training. The quantitative details are presented in Table 1. Table 2 reports the experimental results on Planetoid split.For each dataset,we evaluate the performance of all models for 10 runs, and compute the average test accuracy with standard deviation. We observe that both NP-NGAT and P-NGAT achieve the best accuracy and significantly outperform all baseline models (p <0.05 based on Student’st-test) on all five datasets, which shows that NGAT is highly competitive on the node classification task. Compared to JK-Net with six layers, which is similar to our model that uses a final aggregated layer for all hidden embeddings, the classification accuracy of using P-NGAT is improved by 2.5% on Cora, and 3.2% on PubMed. NP-NGAT also shows the same classification boost. This improvement shows that our attention-guided layer-wise aggregation is suitable for learning a structure-aware representation.Furthermore,NP-NGAT and P-NGAT both achieve significant performance in experiments, but there is a slight difference in the relatively large graphs(PubMed, Academic CS, and Academic Physics,where the number of nodes is 10 times that of Cora and Citeseer) such that P-NGAT outperforms NPNGAT by 0.6%roughly.
2.Random split
In the Planetoid split experiment,we sample the same number of labeled nodes per class for training.In the second experiment, we use Cora, Citeseer,and PubMed citation datasets,and test all the models on random split. The experimental settings follow Buchnik and Cohen(2018)with some slight changes.We retain the train/dev/test set of the same size as in the Planetoid split, but randomly sample the labeled nodes for training. For example,there are still 140 labeled nodes for training in the Cora dataset,but different classes might have different numbers of labeled nodes in the training set. We evaluate the performance of all models in such a random split for 10 runs,and in each run,we carry out a new random sampling and report the experimental results in Table 3. P-NGAT and NP-NGAT still achieve the best classification accuracy on the three datasets. Compared to the results in Table 2, the test accuracyof all models shows a decrease by about 3%—4% in the random split experiment. It is worth noting that JK-Net and NGAT have a relatively small drop compared to GCN,GAT,and AGNN.We think that it is due to the layer aggregation mechanism. As the node classes in the training set are unfairly distributed, a model is prone to be optimized toward the classes involving more labeled nodes. For those classes with fewer labeled nodes, the classification variance will accumulate as the number of layers increases. Hence,the features of earlier iterations are more helpful.JK-Net and NGAT both use node embeddings from all iterations while other models use only the node embedding of the final iteration,so the classification accuracy of models using layer aggregation is better than those of others.
Table 2 Node classification accuracy on Planetoid split from Yang et al. (2016)
3. Different aggregators
Table 4 shows the effect of using self-attention for layer-wise feature aggregation. We report the performance of self-attention (Self-Attn) and compare it to those of other simple aggregators that fuse the output of all hidden layers. The baseline“None”means not to use layer-wise aggregation. We explore three simple aggregators: Concatenation (Concat)directly combines the features of all layers; Max-Pooling (MaxPool) is the most straightforward way to select the informative layer for each feature coordinate; AveragePooling (AvgPool) is applied in a point-wise manner to fuse all features equally. We construct a model using five node-wise attention layers;in each experiment,we combine this base model with different aggregators, and then report the test accuracy averaged over 10 runs. The train/dev/test set is split according to the Planetoid split experiment. It can be easily seen that compared to the“None” aggregator,using layer-wise feature aggregation can greatly improve the accuracy of node classification. Moreover, the result of using self-attention is better than those of the other three simple aggregators, which shows that our self-attention aggregator can learn node embeddings with abundant structural information very well.
Table 3 Node classification accuracy on random split for citation network datasets
Table 4 Test accuracy of different layer-wise aggregators on citation network datasets
4. Size of training set
Since labeled data are very important for semisupervised learning,in the next experiment,we look into the influence of different numbers of labeled training nodes on classification performance. We retain the validation and test set of the same size as in the previous experiments,and change the size of thetraining set within{5,10,20,30,40,50,60}labeled nodes per class. Here,we choose AGNN rather than GAT as the baseline of attention-based models, because our breadth exploration uses cosine similarity to calculate attention coefficients, which is similar to the setting of AGNN. Therefore, the comparison with AGNN will show the improvement of NGAT using layer aggregation. We run GCN, AGNN, JKNet, and P-NGAT on the Cora dataset for 10 random seeds and report the average test accuracy in Fig.4. We can find that with a smaller number of labeled nodes,P-NGAT is more superior to other models in terms of classification accuracy. We attribute this improvement in accuracy to the fact that NGAT stacks more layers than other models,so the label information can be propagated over a larger neighborhood. Therefore, this experiment shows that deep GNNs perform better in scenarios with sparse labels. When the number of labeled nodes increases,all test models reach a similar performance level.
Fig. 4 Test accuracy averaged over 10 runs for different sizes of the training set (number of labeled nodes per class) on the Cora dataset. For smaller training set sizes, the performance of P-NGAT increases further
5.Model depth
In this experiment,we explore how the accuracy of node classification depends on the depth for different models. We add an ablation study to compare the performance when using only node-wise attention or layer-wise attention, and then empirically demonstrate NGAT’s capacity to alleviate the problem of oversmoothing. The baseline model is GCN. Node-wise GCN (NGCN) denotes an NGAT with only node-wise attention,while layer-wise GCN(LGCN) is a model with only layer-wise attention.We report the test accuracy averaged over 10 runs on five datasets: Cora, Citeseer, PubMed, Academic CS, and Acadamic Physics using Planetoid split (Yang et al., 2016). The results are summarized in Fig.5. For the datasets considered here,the best results of GCN and NGCN are both obtained with less than five layers. If the GCN has more than five layers, the classification accuracy drops rapidly,while the NGCN with more than 10 layers encounters accuracy decrease. It shows that node-wise attention can improve the accuracy,but cannot alleviate oversmoothing clearly. In addition, we can observe that compared to the results on the Cora and Citeseer datasets,the accuracy decrease on the PubMed,Academic CS, and Academic Physics datasets (whose number of nodes is 10 times that of Cora and Citeseer) is relatively small. For example, on the Cora dataset,the accuracy of the 25-layer GCN decreases by 29% compared to the accuracy of the 2-layer GCN, while the decrease is 9% on the Academic Physics dataset. If we use only layer-wise attention,depicted by LGCN,we find that it can maintain high classification accuracy even with deep layers, which shows the ability of overcoming the oversmoothing issue. It is worth noting that with both node-wise attention and layer-wise attention, P-NGAT can alleviate the problem of oversmoothing, and further improve the classification accuracy. Therefore, we can empirically conclude that even with deep layers, the node embeddings learned by P-NGAT are informative and distinguishable.
Fig. 5 Influence of model depth (number of layers) on node classification performance. The experimental results of P-NGAT (using self-attention layer-wise aggregation) are compared to those of a standard GCN model,an NGAT with only node-wise attention(NGCN),and an NGAT with only layer-wise attention(LGCN)on datasets Cora (a), Citeseer (b), PubMed (c), Academic CS (d), and Academic Physics (e). Markers denote test accuracy averaged over 10 runs
Based on the above experimental results, we conclude that NGAT can learn high-quality node embeddings in a semi-supervised manner. To obtain a deep understanding of the layer-wise attention mechanism, we train the 10-layer and 20-layer P-NGAT on the Cora dataset for comparison. The train/dev/test set is split according to the Planetoid split.
First, we present the classification results of 1000 test nodes using the t-SNE algorithm (van der Maaten and Hinton, 2008), as shown in Fig. 6.Numerically speaking, the 10-layer P-NGAT can achieve an 84.6% average accuracy, while the 20-layer counterpart shows 84.4% accuracy. It can be seen that there is no significant decrease in performance as the model goes deeper. The same result can be seen between the panels in Figs. 6a and 6b,and the class cluster visualization of the 20-layer PNGAT still has a clear distinction, which is close to the 10-layer version.
Fig. 6 Insight into NGAT performance on the Cora dataset. (a) and (b) are the t-SNE visualization of the classification of 1000 test nodes. Different colors mean different classes. Panel (a) is a 10-layer P-NGAT and panel (b) shows a 20-layer model. (c) depicts the layer importance of each hidden layer for representation.The numbers below the panel represent the index of the layer, and darker color means greater importance score
Second, we look for the preference of NGAT in different neighborhood depths. We define the concept of layer importance to characterize the contribution of the layers to the final node embeddings.
Definition 1(Layer importance) In multi-head self-attention settings,Civ∈RKis theithattention vector computed by the corresponding head for nodev(?v∈V). The layer importance LI(l)is defined as
whereMis the number of attention heads andCiv,lis thelthattention coefficient.
Fig. 6c visualizes the layer importance of each hidden layer in heatmap form. We find that the 10-layer P-NGAT emphasizes the node embeddings of the 3rdand 4thlayers, while the 20-layer P-NGAT emphasizes the layers from the 8thto 10thand from the 16thto 18th. Therefore, despite being a 10-layer or deeper architecture, NGAT pays sufficient attention to those shallow layers, and selectively fuses these features from all layers to generate a depthadaptive node embedding.
Lastly, we are interested in the nodes that are misclassified. We construct a six-layer P-NGAT,and present the quantitative statistics of truly and falsely classified nodes in Table 5. Table 5 shows that every truly classified nodevthas an average of 3.91 neighboring nodes. Among these neighbors, 89.4%nodes have the same label asvt, while the falsely classified nodes have smaller values in both the fields.We can conclude that the local structure and label sharing are beneficial to node embeddings. For each node in the graphs, more neighbors provide more mutual information, and more neighbors with the same label are helpful to generate a label-aware node embedding.
Table 5 Quantitative statistics of the test prediction results on the Cora dataset
In this paper, we have proposed a novel nested graph attention network (NGAT), which can learn high-quality node embeddings on graphs with very few labeled nodes. The intuitive motivation is to reduce the harm of oversmoothing,which renders node embeddings indistinguishably and hurts the classification accuracy as the number of network layers increases. NGAT improves both breadth and depth exploration, and allows the stacking of more layers within a feasible scope. Our work can be standardized as an extension of JK-Net (Xu et al., 2018).We have adopted a simple but powerful architecture for both neighborhood and layer aggregations.On many node classification benchmarks, NGAT achieves state-of-the-art performance compared to some novel GNNs. For future work, we aim to first explore ways such as sampling methods(Chen et al.,2018; Zou et al., 2019) to reduce the computational complexity of node embedding. Second, we try to apply NGAT to graph classification tasks.
Contributors
Jianke HU and Yin ZHANG designed the research.Jianke HU processed the data and drafted the paper. Yin ZHANG helped organize the paper. Jianke HU and Yin ZHANG revised and finalized the paper.
Compliance with ethics guidelines
Jianke HU and Yin ZHANG declare that they have no conflict of interest.
Frontiers of Information Technology & Electronic Engineering2022年3期