Chapter 2 Background

This chapter of the thesis will provide a brief overview of graphs and their applications in machine learning.

2.1 Preliminaries

A simple graph \(\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}\) is a data structure which defines the connections, or edges \((u, v) \in \mathcal{E}\), between a set of nodes \(\mathcal{V}\). A graph is often represented by its adjacency matrix \(A \in \mathbb{R}^{n\times n}\), where \(n = |\mathcal{V}|\) and \(A_{i,j}\) is the weight of the connection between nodes \(i\) and \(j\). In weighted graphs we have \(A_{i,j} \in \mathbb{R}\), whereas in unweighted graphs \(A_{i,j} \in \{0, 1\}\) indicates the binary presence or absence of a connection. There are many variations and extensions of the simple graph structure. For instance, a heterogeneous graph, also frequently referred to as a multi-relational or typed graph, is a graph in which each node and edge can be assigned a particular type label. Closely related are \(k\)-partite networks, which are heterogeneous networks in which nodes of the same type are never connected, as well as knowledge graphs, in which labelled edges are expressed as \((\mathrm{head}, \mathrm{relation}, \mathrm{tail})\) triples. A more complex structure is the multi-layer network \(\mathcal{G} = \{\mathcal{V}, \{\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_L\}\}\), in which one set of nodes is connected by multiple independent sets of edges. Heterogeneous graphs and knowledge graphs can both be expressed as multi-layer networks by storing the edges of each relation as an independent layer. A multi-layer network can be represented by an adjacency tensor \(\mathbf{A} = [A_1; A_2; \ldots; A_L] \in \mathbb{R}^{L\times n\times n}\), where each slice \(A_l\) is the adjacency matrix of layer \(l\). Finally, there are graphs in which an arbitrary number of nodes can be connected by a single edge. Such a graph is known as a hypergraph, and is said to be \(k\)-uniform if all of its edges have cardinality \(k\); a simple graph is then equivalent to a \(2\)-uniform hypergraph.
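To make these representations concrete, the following short sketch (using NumPy; the node indices and edge lists are purely illustrative) builds the adjacency matrix of a small unweighted graph and stacks per-relation adjacency matrices into the adjacency tensor of a multi-layer network.

```python
import numpy as np

# Illustrative unweighted, undirected graph on 4 nodes (indices 0..3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

n = 4
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = 1.0
    A[v, u] = 1.0  # undirected: the adjacency matrix is symmetric

# A multi-layer network: one edge set (layer) per relation type.
layers = {
    "follows":  [(0, 1), (1, 2)],
    "mentions": [(2, 3), (0, 3)],
}

# Adjacency tensor with shape (L, n, n): one adjacency matrix per layer.
A_tensor = np.zeros((len(layers), n, n))
for l, layer_edges in enumerate(layers.values()):
    for u, v in layer_edges:
        A_tensor[l, u, v] = 1.0
        A_tensor[l, v, u] = 1.0
```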

There are also a number of important algorithms which operate on graphs. For example, a random walk is generated by starting at a node and repeatedly sampling the next node at random from the current node’s neighbors.
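A uniform random walk can be sampled directly from this adjacency structure. The sketch below (NumPy-based, reusing the illustrative adjacency matrix `A` from above) draws a single walk by repeatedly choosing a random neighbor of the current node.

```python
import numpy as np

def random_walk(A, start, length, rng=None):
    """Sample a uniform random walk [w_0, ..., w_length] from adjacency matrix A."""
    rng = rng or np.random.default_rng()
    walk = [start]
    for _ in range(length):
        neighbors = np.flatnonzero(A[walk[-1]])  # nodes adjacent to the current node
        if len(neighbors) == 0:                  # dead end: stop the walk early
            break
        walk.append(int(rng.choice(neighbors)))
    return walk

# e.g. random_walk(A, start=0, length=5) might return [0, 2, 3, 2, 1, 0]
```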

Definitions of key terms.

| Term | Notation | Definition | Example(s) | Additional Names |
|------|----------|------------|------------|------------------|
| Node | \(v \in \mathcal{V}\) | A discrete object or entity. | | Vertex, entity |
| Edge | \((u, v) \in \mathcal{E}\) | A connection between two nodes. | | |
| Graph | \(\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}\) | A set or tuple of nodes and edges. | Social networks, academic citations, internet hyperlinks | Network |
| Multi-layer Network | \(\mathcal{G} = \{\mathcal{V}, \{\mathcal{E}_1, \ldots, \mathcal{E}_L \}\}\) | A set of \(L\) edge sets connecting a shared node set. | Temporal networks (e.g. monthly snapshots), protein interaction networks | Multiplex network |
| Adjacency Matrix | \(A \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}\) | A matrix representation of a graph, where \(A_{i,j} > 0\) if \((i, j) \in \mathcal{E}\). | | |
| Adjacency Tensor | \(\mathbf{A} \in \mathbb{R}^{L \times |\mathcal{V}| \times |\mathcal{V}|}\) | A tensor representation of a multi-layer network, where each slice along the first axis is the adjacency matrix of a given layer. | | |
| Neighbors | \(u \in \mathcal{N}(v)\) | All nodes which are connected to node \(v\). | | |
| Random Walk | \([w_0, w_1, \ldots, w_\ell]\) | A sequence of randomly sampled nodes starting from \(w_0 \in \mathcal{V}\), such that \(w_i \in \mathcal{N}(w_{i-1})\). | | |
| Heterogeneous Graph | Same as multi-layer network. | A multi-layer network in which each layer represents a different edge label; each node may optionally be assigned a label as well. | Academic networks | Multi-relational network, labelled graph |
| Knowledge Graph | \((h, r, t) \in \mathcal{E} \subseteq \mathcal{V} \times \mathcal{R} \times \mathcal{V}\) | A heterogeneous graph, generally having a very large (\(\gg 1000\)) number of edge labels. | Wikipedia | |

2.2 Applications of Machine Learning on Graphs

There are a number of tasks pertaining to graphs to which machine learning models can be applied. At its core, each task on a graph \(\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}\) fundamentally concerns a subset of its nodes, or node set, \(s = \{v_1, \ldots, v_k\} \in \mathcal{V}^k\). In general, we want to find a function (i.e., a machine learning model) \(f:\mathcal{V}^k \to \mathbb{R}^d\) such that the prediction \(\hat{y}_i = f(s_i)\) corresponds to a ground-truth label \(y_i\), which is done by minimizing some cost function \(\mathcal{L}\) aligned with the goal of the task.

Graph learning tasks are often classification-based, with the goal of mapping \(s\) to one of several class labels \(c_i \in \mathcal{C}\). Models are provided binary vectors \(y \in \{0, 1\}^{|\mathcal{C}|}\) as labels and produce probability scores \(\hat{y}\in \mathbb{R}^{|\mathcal{C}|}\) as output. The objective of single-label classification tasks is generally to find a model whose outputs \(\hat{y}\) minimize the cross-entropy cost function \(\mathcal{L} = -\sum_{i=1}^{|\mathcal{C}|}y_i\log(\hat{y}_i)\), or, when the model produces a single probability score (\(|\mathcal{C}|=1\)), the binary cross entropy \(\mathcal{L} = -(y \cdot \log(\hat{y}) + (1-y)\cdot \log(1-\hat{y}))\). If a node set can have more than one label, a setting known as multi-label classification, the sum of binary cross-entropy losses across all labels can be used. Alternatively, a graph learning task can be regression-based, where we wish to map a node set to a single scalar value \(s \to \mathbb{R}\). Mean squared error is commonly used as a cost function for regression tasks, with \(\mathcal{L}=(y-\hat{y})^2\).
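These cost functions can be written out directly. Below is a minimal NumPy sketch (with small illustrative inputs) of the cross-entropy, binary cross-entropy, and squared error losses for a single prediction.

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """Multi-class cross entropy: y is a one-hot vector, y_hat holds class probabilities."""
    return -np.sum(y * np.log(y_hat + eps))

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross entropy for a single probability score y_hat and label y in {0, 1}."""
    return -(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

def mean_squared_error(y, y_hat):
    """Squared error for a single regression target."""
    return (y - y_hat) ** 2

# Illustrative usage: a 3-class problem where the true class is the second one.
y = np.array([0.0, 1.0, 0.0])
y_hat = np.array([0.2, 0.7, 0.1])
print(cross_entropy(y, y_hat), binary_cross_entropy(1.0, 0.9), mean_squared_error(2.5, 2.1))
```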

In all tasks, we optimize a machine learning model on a training dataset, and evaluate its performance on an inference dataset to ensure that the model is capable of generalizing to unseen data. This necessitates the partitioning of a given graph dataset into training and inference sets, which requires special consideration for graph learning tasks. In the scenario where a node may appear in both the training and inference sets, the task is said to be transductive in nature. In contrast, an inductive task is one in which the nodes that appear in the training set are disjoint from those which appear in the inference set. A more extreme inductive scenario could involve completely separate graphs, where the model is trained on one or more graphs and then performs inference on fully novel graphs. Broadly speaking, transductive tasks measure a model’s ability to perform a task across an observed set of nodes, while inductive tasks measure a model’s ability to perform a task across an unseen set of nodes.

This section will provide a brief, non-comprehensive overview of several node-level (\(k=1\)), set-level (\(1 < k< |\mathcal{V}|\)), and graph-level (\(k=|\mathcal{V}|\), i.e. \(s=\mathcal{V}\)) tasks alongside their respective categorizations (classification vs. regression, transductive vs. inductive, etc.) and evaluation metrics.

2.2.1 Node-level Tasks

A node-level task considers a single node in a graph at a time; for example, we may wish to categorize a user in a social network, or to estimate the volume of traffic a web page will receive. The former case refers to node classification, wherein we try to find a function \(f: \mathcal{V} \to \mathcal{C}\) that maps a single given node to one of several class labels. We measure the quality of \(f\) using metrics such as accuracy and \(F_1\) score. The latter task is an example of node regression, where \(f\) must map a given node to a real-valued scalar and is evaluated using metrics such as mean absolute error. In both tasks, the dataset is partitioned node-wise into training and inference nodes \(V_\mathrm{train}, V_\mathrm{test} \subset \mathcal{V}\). In a transductive node-level task, the graph’s full set of edges is available during both training and inference, along with the nodes’ features \(X \in \mathbb{R}^{|\mathcal{V}|\times D}\) (should they exist); the only information withheld is the label of each inference node in \(V_\mathrm{test}\). In the inductive setting, however, the edges and features of the inference nodes are also withheld during training, meaning that the model will be evaluated on nodes it has never seen before.
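As a rough illustration of the difference between the two settings, the sketch below (NumPy; it assumes `A`, `X`, and `y` are an adjacency matrix, feature matrix, and label vector, and the split ratio is an arbitrary choice) builds a transductive split, where only the test labels are hidden, and an inductive split, where the test nodes' edges and features are removed from the training graph as well.

```python
import numpy as np

rng = np.random.default_rng(0)
n = A.shape[0]
test_nodes = rng.choice(n, size=max(1, n // 5), replace=False)
test_mask = np.zeros(n, dtype=bool)
test_mask[test_nodes] = True

# Transductive: the full graph and all features are visible; only test labels are hidden
# (here marked with -1 as a placeholder for "unknown").
transductive_train = {"A": A, "X": X, "y": np.where(test_mask, -1, y)}

# Inductive: test nodes, their edges, and their features are removed during training.
keep = ~test_mask
inductive_train = {"A": A[np.ix_(keep, keep)], "X": X[keep], "y": y[keep]}
```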

2.2.2 Set-level Tasks

Several tasks involve making predictions on sets of nodes with cardinality in \([2, |\mathcal{V}|-1]\), most notably link prediction. In link prediction, a model must identify edges that are missing from the graph. In the classical setting, edges consist of two nodes, so a model must accept an edge, a set \(s \in \mathcal{V}^2\), and perform binary classification to determine whether the edge belongs in the graph. Hyperedge link prediction generalizes this to higher values of \(k\) in hypergraphs. Binary link prediction can be evaluated using standard classification metrics such as accuracy, \(F_1\), precision, and recall. While most frequently treated as a binary classification task, link prediction can also be a regression task when a graph is weighted; in this case, a model attempts to map an edge to the correct real-valued weight. Link prediction can also be trained and evaluated in the ranking setting, where instead of classifying edges as valid or invalid, models must return a relative ordering over a set of edges such that the edges which do exist are ranked higher than those which do not. In this setting, models are trained with a ranking cost function such as the margin loss, \(\mathcal{L}=\max(0, m - \hat{y}_+ + \hat{y}_\neg)\), where \(\hat{y}_+\) is a model’s prediction for a positive (valid) edge and \(\hat{y}_\neg\) is a model’s prediction for a negative (invalid) edge. Intuitively, this cost function enforces that predictions for positive edges should exceed those for negative edges by at least the margin \(m\). Models trained for ranking-based link prediction can be evaluated using ranking metrics such as mean reciprocal rank (MRR) and mean average precision (MAP), and in the case of a weighted graph, graded relevance measures such as normalized discounted cumulative gain (NDCG).
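The margin-based ranking objective is straightforward to express; below is a minimal NumPy sketch (the scores and margin value are illustrative) of the margin loss over a batch of positive and negative edge scores.

```python
import numpy as np

def margin_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Hinge loss encouraging positive edge scores to exceed negative scores by `margin`."""
    return np.mean(np.maximum(0.0, margin - pos_scores + neg_scores))

# Illustrative usage: most positives are already ranked above the negatives.
pos = np.array([2.1, 1.7, 3.0])   # model scores for edges present in the graph
neg = np.array([0.3, 1.9, -0.5])  # model scores for sampled non-edges
print(margin_ranking_loss(pos, neg))
```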

For transductive link prediction, a random sample of the edges is withheld to form \(\mathcal{E}_\mathrm{test}\), and the remaining edges, as well as any node features, are available during training and inference. The inductive case is similar to inductive node classification in that a subset of nodes, along with their edges and features, is once again held out during training. At inference, however, not all of the edges associated with \(V_\mathrm{test}\) are made available as they are in node classification; only a subset of these nodes’ edges are provided, and the task of the model is to use these, along with any given node features, to identify the remaining held-out edges.
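As a concrete illustration, the sketch below (NumPy; it assumes the adjacency matrix `A` from earlier, and the held-out fraction is an arbitrary choice) constructs a transductive edge split by withholding a random sample of edges; the inductive split is only described in comments, since it depends on how the held-out nodes' edges are partitioned at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)
# Undirected edge list taken from the upper triangle of the adjacency matrix.
edges = np.array(list(zip(*np.nonzero(np.triu(A)))))

# Transductive split: withhold a random 10% of edges as the test set; the rest
# (plus any node features) remain visible during both training and inference.
perm = rng.permutation(len(edges))
n_test = max(1, len(edges) // 10)
test_edges, train_edges = edges[perm[:n_test]], edges[perm[n_test:]]

# Inductive split (sketch only): hold out a subset of nodes entirely during training;
# at inference, reveal only some of their incident edges and ask the model to
# recover the remaining held-out edges.
```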

A closely related task specific to knowledge graphs is question answering, where a model accepts a query graph \(Q=\{s, \tilde{\mathcal{E}}\}\), with \(s \in \mathcal{V}^k\) and \(\tilde{\mathcal{E}} \subset \mathcal{E}\), representing a logical question; \(Q\) is the subgraph of \(\mathcal{G}\) induced by \(s\). The goal is to find the entity which satisfies the query graph, and the task can be evaluated using link prediction metrics. There are additional set-level tasks, such as community detection; these tasks are outside the scope of this thesis, however, and are therefore left for further reading.

2.2.3 Graph-level Tasks

Finally, a graph-level task considers the scenario in which a single instance is constituted by an entire graph. These tasks are similar to node-level tasks, primarily taking the form of either classification or regression, with the main difference being that graph-level tasks are all inherently inductive to some extent, as each instance presents a novel graph (though nodes can of course appear in multiple graphs). Graph classification is the task of mapping a graph to a class label \(\mathcal{G} \to \mathcal{C}\), with real-world applications such as identifying a category for a group of connected documents on Wikipedia or detecting whether a discussion thread on a social media platform contains any instances of hate speech. It follows that graph regression is the task of mapping a graph to a real-valued scalar \(\mathcal{G} \to \mathbb{R}\), such as predicting the toxicity level of a chemical compound.

2.3 Graph Representation Learning

Representation learning (Bengio, Courville, & Vincent, 2013) is the task of computing \(d\)-dimensional real-valued feature vectors (or embeddings) \(x \in \mathbb{R}^d\) for each point in a given dataset. These feature vectors are then used as inputs to a downstream machine learning model to solve a classification or regression task, or for visualization and analysis of a dataset. In the context of graphs, this often means learning feature vectors for each node. The rise of neural-network based approaches to representation learning (Krizhevsky, Sutskever, & Hinton, 2012; Mikolov, Chen, Corrado, & Dean, 2013; Mikolov, Sutskever, Chen, Corrado, & Dean, 2013) made it possible to obtain high-quality feature vectors for large language and image datasets. This section will discuss how those approaches were extended to learn representations for graph-based data.

2.3.1 Random Walk and Factorization-based Methods

A common approach to learning word representations is to sample sentences from a large corpus and optimize the representation of each word to be closest to those of the words it most commonly co-occurs with. Perozzi et al. (2014) connected this process to performing random walks on graphs, resulting in the DeepWalk node embedding model. For a graph \(\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}\), DeepWalk learns a node embedding function \(\Phi: \mathcal{V} \to \mathbb{R}^d\) by drawing a corpus \(\mathcal{C} = \{w_i\}\) of random walks, where \(w_i = [v_0, v_1, \ldots, v_\ell], v_j \in \mathcal{V}\). The parameters of \(\Phi\) are optimized to maximize the co-occurrence probability of nodes which appear in close proximity in the same random walk(s) using the SkipGram method (Mikolov, Chen, et al., 2013). Each walk is split into windows of length \(2w+1\), and we minimize the negative log likelihood of the \(w\) nodes preceding and following the center node given the embedding of the center node:

\[ \min_{\Phi} \; -\log\mathrm{Pr}\left(\{ v_{i-w}, \ldots, v_{i-1}, v_{i+1}, \ldots, v_{i+w} \} \mid \Phi(v_i) \right) \]
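A minimal DeepWalk-style pipeline can then be sketched as follows: sample a corpus of walks and fit a SkipGram model over them. The example below reuses the `random_walk` helper sketched in Section 2.1 and assumes the gensim library (version 4 or later) provides the SkipGram implementation; all hyperparameter values are illustrative rather than those of the original paper.

```python
import numpy as np
from gensim.models import Word2Vec  # SkipGram implementation

def deepwalk_embeddings(A, dim=64, walks_per_node=10, walk_length=40, window=5):
    """Learn node embeddings by running SkipGram over a corpus of random walks."""
    n = A.shape[0]
    corpus = [
        [str(v) for v in random_walk(A, start=node, length=walk_length)]
        for node in range(n)
        for _ in range(walks_per_node)
    ]
    model = Word2Vec(
        sentences=corpus, vector_size=dim, window=window,
        min_count=0, sg=1, workers=1,  # sg=1 selects the SkipGram objective
    )
    return np.stack([model.wv[str(v)] for v in range(n)])
```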

A number of related approaches quickly followed, most notably Node2Vec (Grover & Leskovec, 2016), which generalized the random walk extraction approach proposed in DeepWalk to allow configuration of depth versus breadth and restart probability in each walk. A concurrent line of work examined factorization of the adjacency matrix and its powers as an avenue to obtain node embeddings. LINE (Tang, Qu, Wang, et al., 2015) is a highly efficient factorization-based approach which preserves both first- and second-order proximity in its resulting embedding space. This was extended to the heterogeneous setting by PTE (Tang, Qu, & Mei, 2015), which jointly optimizes the embeddings of word-word, word-document, and word-label bipartite networks to produce content-aware node embeddings. In fact, factorization-based approaches are particularly well suited to highly multi-relational networks and schema-rich knowledge graphs, whose adjacency tensors can be decomposed to obtain knowledge graph embeddings (KGE) (Balazevic, Allen, & Hospedales, 2019; Bordes, Usunier, García-Durán, Weston, & Yakhnenko, 2013; Kazemi & Poole, 2018; Nickel, Tresp, & Kriegel, 2011; Trouillon et al., 2017).

Interestingly, connections have been identified between these two lines of work. NetMF (Qiu et al., 2018; Xie et al., 2021) established equivalences between DeepWalk, LINE, PTE, and node2vec under the unified umbrella of matrix factorization. In addition, the recently proposed ReFactor-GNN (Chen et al., 2022) demonstrated how factorization-based methods can be recast as message-passing architectures, the subject of the following section.

2.3.2 Graph Neural Networks

2.3.2.1 Neural Message Passing

The above random walk and matrix factorization-based approaches integrate multi-layer neural networks with graph-theoretic objective functions in order to produce node embeddings. While these methods are able to preserve certain properties of the underlying graph, they lack the strong inductive bias found in neural network architectures specialized for other domains, such as Convolutional Neural Networks for image processing. In contrast, Graph Neural Networks (GNNs) are generally based on the message passing framework (Gilmer, Schoenholz, Riley, Vinyals, & Dahl, 2017), in which nodes’ representations are computed as a function of their neighbors’ representations. As explained by Bronstein, Bruna, Cohen, & Veličković (2021) and Veličković (2022), for a graph \(\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}\) with node representations \(\mathbf{X} \in \mathbb{R}^{|\mathcal{V}| \times d}\), the message passing procedure in GNNs can be summarized as a sequence of three steps:

  • Message: A function \(\psi: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^k\) which computes a message between two nodes as a function of their representations;
  • Aggregate: A permutation invariant function \(\bigoplus: \mathbb{R}^{s \times k} \to \mathbb{R}^l\) which aggregates a set of vector-valued messages;
  • Update: A function \(\phi: \mathbb{R}^k \times \mathbb{R}^l \to \mathbb{R}^m\) which updates a node’s representation as a function of its current representation and the aggregated messages of its neighbors.

Several rounds of message passing can be performed consecutively to propagate node representations across multiple hops, increasing a GNN’s receptive field. In this case, we compute the embedding \(\mathbf{h}_v\) of a node \(v\) at the \(i^{th}\) round of message passing as:

\[ \mathbf{h}^{(i)}_{v} = \phi \left( \mathbf{h}^{(i-1)}_{v}, \bigoplus_{u \in \mathcal{N}(v)} \psi(\mathbf{h}^{(i-1)}_{v}, \mathbf{h}^{(i-1)}_{u}) \right) \]

The resulting node representations are then used to optimize an objective function for a downstream task as described above. In node classification, for example, a linear map \(\rho: \mathbb{R}^m \to \mathbb{R}^c\) (where \(c\) is the number of class labels) is applied to each node’s embedding \(\mathbf{h}_v\) to predict an output label.
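The three steps above can be sketched as a single function parameterized by the message, aggregation, and update functions; concrete architectures differ only in how these three are instantiated. The NumPy sketch below is a schematic, dense, unbatched version (it assumes an adjacency matrix `A` and feature matrix `X` as before, and the default choices shown are illustrative).

```python
import numpy as np

def message_passing_round(A, H, message, aggregate, update):
    """One round of message passing over adjacency matrix A and node embeddings H."""
    H_new = []
    for v in range(A.shape[0]):
        neighbors = np.flatnonzero(A[v])
        if len(neighbors) == 0:
            messages = np.zeros((1, H.shape[1]))  # no neighbors: aggregate a single zero message
        else:
            messages = np.stack([message(H[v], H[u]) for u in neighbors])
        H_new.append(update(H[v], aggregate(messages)))
    return np.stack(H_new)

# Illustrative choices: identity messages, sum aggregation, additive update.
H1 = message_passing_round(
    A, X,
    message=lambda h_v, h_u: h_u,
    aggregate=lambda msgs: msgs.sum(axis=0),
    update=lambda h_v, agg: h_v + agg,
)
```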

2.3.2.2 Convolutional architectures

There are a number of instantiations of the message passing framework described above. An instance of a message passing model is considered convolutional if its message function satisfies \(\psi(\mathbf{h}_v, \mathbf{h}_u) = c_{v,u}\Psi(\mathbf{h}_u)\), where \(c_{v,u}\) is a fixed constant specific to the edge between nodes \(v\) and \(u\) (e.g., an edge weight). Kipf & Welling (2017) introduced the most notable instance of a convolutional message passing model, the Graph Convolutional Network (GCN). The GCN used \(\psi(\mathbf{h}_v, \mathbf{h}_u) =\frac{1}{|\mathcal{N}(v)|}\mathbf{h}_u\) and \(\bigoplus = \Sigma\); intuitively, this corresponds to taking the mean of neighboring nodes’ representations. A learnable weight matrix \(\mathbf{W}\) was used to parameterize \(\phi\). This approach showed strong performance on semi-supervised node classification as well as link prediction (Kipf & Welling, 2016). Subsequent efforts improved the efficiency of the GCN model by simplifying the model architecture to consist almost exclusively of message passing, eliminating non-linearities and collapsing model parameters (X. He et al., 2020; Wu et al., 2019). Additional follow-up work (Schlichtkrull et al., 2018) showed that the message-passing paradigm extends naturally to the multi-relational setting by applying a GCN to each relation type. Of course, this becomes quite inefficient with a large number of relations, making it necessary in most cases to apply basis decomposition in order to effectively share parameters between different relations. To leverage existing work on knowledge graph embedding and obtain embeddings for relations in addition to nodes, CompGCN (Vashishth, Sanyal, Nitin, & Talukdar, 2020) applied KGE scoring functions (Bordes et al., 2013; Nickel, Rosasco, & Poggio, 2016; Yang, Yih, He, Gao, & Deng, 2015) to improve the quality of the output embeddings. This may, however, be unnecessary; Degraeve, Vandewiele, Ongenae, & Hoecke (2022) show that the inductive bias provided by message passing may be enough on its own, even when using completely random embeddings and untrained model weights.
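Following the mean-aggregation description above, a single simplified GCN-style layer can be written in matrix form: a row-normalized adjacency multiplication followed by a learned linear map and a non-linearity. This is only a sketch of the convolutional flavor (assuming `A` and `X` as before, with random illustrative weights); it omits details of the published GCN such as self-loops and its exact normalization.

```python
import numpy as np

def gcn_layer(A, H, W):
    """Simplified convolutional layer: mean of neighbor embeddings, then a linear map + ReLU."""
    deg = A.sum(axis=1, keepdims=True)
    A_norm = A / np.maximum(deg, 1.0)       # row-normalize so aggregation is a neighbor mean
    return np.maximum(A_norm @ H @ W, 0.0)  # ReLU non-linearity

# Illustrative usage with random weights mapping d-dimensional inputs to 16 dimensions.
rng = np.random.default_rng(0)
W = rng.normal(size=(X.shape[1], 16))
H1 = gcn_layer(A, X, W)
```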

2.3.2.3 Attentional architectures

Another class of message passing models are attentional models, in which \(\psi(\mathbf{h}_v, \mathbf{h}_u) = \alpha(\mathbf{h}_v, \mathbf{h}_u)\Psi(\mathbf{h}_u)\), where \(\alpha: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}\) is a learnable function which maps a pair of node representations to a scalar weight, also known as an attention mechanism (Bahdanau, Cho, & Bengio, 2015; Vaswani et al., 2017). The output of the attention mechanism is a real-valued scalar signifying the importance of node \(u\) to node \(v\), and therefore the weight given to node \(u\)’s representation when computing node \(v\)’s representation. When the sum operator is used as the aggregation function, this corresponds to taking a weighted average over neighboring nodes’ representations. This is the approach taken by the Graph Attention Network (GAT) (Velickovic et al., 2018), which computes importance weights as \(\alpha(\mathbf{h}_v, \mathbf{h}_u) = \mathrm{softmax}(\mathbf{a}^T[\mathbf{W}\mathbf{h}_v; \mathbf{W}\mathbf{h}_u])\), where \(\mathbf{W} \in \mathbb{R}^{k \times d}\) and \(\mathbf{a} \in \mathbb{R}^{2k}\) are learnable parameters and the softmax is taken over the neighborhood \(\mathcal{N}(v)\).
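The attention coefficients can likewise be sketched directly from the formula above. The NumPy snippet below computes, for a single node \(v\), the softmax-normalized weights over its neighbors; all shapes and weights are illustrative, and the additional LeakyReLU applied before the softmax in the published GAT is omitted, matching the simplified formula above.

```python
import numpy as np

def gat_attention_weights(h_v, h_neighbors, W, a):
    """Attention weights of node v over its neighbors, following the simplified GAT formula."""
    scores = np.array([
        a @ np.concatenate([W @ h_v, W @ h_u]) for h_u in h_neighbors
    ])
    exp = np.exp(scores - scores.max())  # softmax over the neighborhood
    return exp / exp.sum()

# Illustrative usage: d = 8 input dimensions, k = 4 projected dimensions, 3 neighbors.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
a = rng.normal(size=(8,))                # 2k-dimensional attention vector
weights = gat_attention_weights(rng.normal(size=8), rng.normal(size=(3, 8)), W, a)
```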

2.3.3 Expressiveness of Graph Neural Networks

Despite their empirically strong results, message passing architectures are inherently limited by their connection to the Weisfeiler-Lehman (WL) algorithm (Leman & Weisfeiler, 1968). The WL algorithm assigns a color to each node in a graph; comparing the resulting colorings of two graphs provides a necessary (but not sufficient) condition for an isomorphism between them, known as the WL test. For a graph \(\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}\), each node \(v \in \mathcal{V}\) is assigned an initial color \(c_v^{(0)} \in \mathcal{C}\). This color is iteratively updated over \(k\) steps (or until convergence) as \(c_v^{(i)} = \mathrm{hash}\left(L_v^{(i-1)}, c_v^{(i-1)}\right)\), where \(L_v^{(i-1)} = \{c_u^{(i-1)}: u \in \mathcal{N}(v) \}\) is the multiset of neighboring colors and \(\mathrm{hash}\) is an injective hashing function mapping each (multiset, color) pair to a new color. Intuitively, a node’s color captures its \(k\)-hop neighborhood, and two nodes with the same \(k\)-hop neighborhood structure will receive the same colors. We can test for isomorphism between two graphs by applying the WL algorithm to each graph and comparing their color histograms; if they differ (i.e., one graph has a different number of nodes of a given color than the other), then we can reject the possibility of an isomorphism. Otherwise, the graphs may be isomorphic, but this cannot be confirmed, as the WL test is known to fail to distinguish certain pairs of non-isomorphic graphs.
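One iteration of this color refinement amounts to hashing each node's current color together with the multiset of its neighbors' colors. Below is a minimal Python sketch, using sorted tuples and the built-in `hash` as a stand-in for a truly injective hashing function.

```python
import numpy as np

def wl_refinement(A, colors, iterations):
    """Iteratively refine node colors; graphs can then be compared via their color histograms."""
    for _ in range(iterations):
        new_colors = []
        for v in range(A.shape[0]):
            neighbor_colors = sorted(colors[u] for u in np.flatnonzero(A[v]))
            # (multiset of neighbor colors, own color) -> new color via a (nominally injective) hash
            new_colors.append(hash((tuple(neighbor_colors), colors[v])))
        colors = new_colors
    return colors

# e.g. compare Counter(wl_refinement(A1, [0] * n1, k)) with Counter(wl_refinement(A2, [0] * n2, k))
```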

Note that the WL algorithm fits our definition of message passing with \(\phi = \mathrm{hash}\), \(\bigoplus = \mathrm{multiset}\) (a function which returns its arguments as a multiset), and \(\psi(\mathbf{h}_v, \mathbf{h}_u) = \mathbf{h}_u\). This indicates that message passing GNNs (MP-GNNs) are only as powerful as the WL test; that is, if the WL test is unable to distinguish a pair of non-isomorphic graphs, neither will an MP-GNN. In particular, if two distinct nodes have local \(k\)-hop neighborhoods that are indistinguishable by \(k\) iterations of the WL test, then they will receive identical representations from an MP-GNN, making certain tasks such as node classification impossible when the two nodes belong to different classes. Unfortunately, the limitations of the WL test are well-documented (Arvind, Köbler, Rattan, & Verbitsky, 2015; Kiefer, Schweitzer, & Selman, 2015), which has led to a stream of research examining to what extent MP-GNNs share the same limitations (Feng, Chen, Li, Sarkar, & Zhang, 2022; Morris, Lipman, et al., 2021; Morris, Fey, & Kriege, 2021; Xu, Hu, Leskovec, & Jegelka, 2019) and ways in which they can be overcome (Balcilar et al., 2021; Maron, Ben-Hamu, Serviansky, & Lipman, 2019; L. Zhao, Jin, Akoglu, & Shah, 2022). These efforts can be grouped into several classes of approaches, as follows.

2.3.3.0.1 Feature augmentations

In general, surpassing the limits of the WL test is done by augmenting nodes with additional information so as to distinguish their representations when message passing alone cannot. A simple approach is to append a one-hot encoding of each node’s unique ID to its embedding, which guarantees the distinguishability of all nodes; however, this technique is inherently unable to generalize to unseen nodes or to new graphs with more nodes than seen during training. We can instead use heuristics based on a node’s degree (C. Ying et al., 2021) or PageRank score (Postavaru et al., 2020) to derive uniquely identifying features for each node. In fact, even randomly generating additional features for each node has been shown to be a surprisingly effective method for boosting the expressiveness of GNNs (Abboud, Ceylan, Grohe, & Lukasiewicz, 2020; Degraeve et al., 2022; Egressy & Wattenhofer, 2022; Sato, Yamada, & Kashima, 2020).

2.3.3.0.2 Positional encodings

Another approach is to imbue each node’s representation with an indication of its global position within its graph, also known as a positional encoding (PE) (You, Ying, & Leskovec, 2019). PEs can help differentiate nodes which share similar local structures while residing in distant locations in a graph, and conversely, help capture proximities between nearby nodes with differing local structures. Dwivedi, Luu, Laurent, Bengio, & Bresson (2022) propose two positional encoding schemes, Random Walk PE (RWPE) and Laplacian PE (LapPE). For node \(i\), we obtain \(\mathrm{RWPE}(i) = [\mathbf{RW}_{i,i}, \mathbf{RW}^2_{i,i}, \ldots, \mathbf{RW}^k_{i,i}]\), where \(\mathbf{RW}^{l} = (AD^{-1})^l\) is the \(l\)-step transition matrix, meaning each dimension \(\mathrm{RWPE}(i)_j\) represents the probability that a \(j\)-step random walk starting at node \(i\) ends back at node \(i\). Similarly, \(\mathrm{LapPE}(i) = [\mathbf{U}_{i,1}, \mathbf{U}_{i,2}, \ldots, \mathbf{U}_{i,k}]\), where \(\mathbf{U} \in \mathbb{R}^{|\mathcal{V}| \times k}\) contains the eigenvectors corresponding to the \(k\) smallest non-trivial eigenvalues of the graph Laplacian. However, this approach is limited by the sign ambiguity of eigenvectors, which leads to \(2^k\) possible sign combinations for \(k\) eigenvectors, requiring the signs to be randomly flipped during training (Dwivedi et al., 2020; Kreuzer, Beaini, Hamilton, Létourneau, & Tossou, 2021). Additionally, many of the random walk and factorization-based methods discussed in Section 2.3.1, such as DeepWalk and LINE, can be considered positional encodings, as they too capture node proximities. One issue is that the embeddings produced by these methods are not stable across graphs, as an independent model must be fit on each graph. H. Wang, Yin, Zhang, & Li (2022) propose the Position Equivariant Graph Neural Network (PEG) as a solution; by updating node features and positional encodings separately, they maintain permutation equivariance and improve the stability of node embeddings.
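Both encodings can be computed directly from the adjacency matrix. The sketch below (NumPy; `k` is an illustrative choice, the eigenvector sign ambiguity discussed above is not handled, and the unnormalized Laplacian \(D - A\) is used, though normalized variants are also common) computes RWPE from powers of the random-walk transition matrix and LapPE from Laplacian eigenvectors.

```python
import numpy as np

def rwpe(A, k):
    """Random Walk PE: return-to-self probabilities over 1..k step walks."""
    D_inv = np.diag(1.0 / np.maximum(A.sum(axis=1), 1.0))
    RW = A @ D_inv
    pe, P = [], np.eye(A.shape[0])
    for _ in range(k):
        P = P @ RW
        pe.append(np.diag(P))        # diagonal entries (RW^l)_{i,i}
    return np.stack(pe, axis=1)      # shape (|V|, k)

def lappe(A, k):
    """Laplacian PE: entries of the k smallest non-trivial Laplacian eigenvectors."""
    L = np.diag(A.sum(axis=1)) - A   # unnormalized graph Laplacian D - A
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues returned in ascending order
    return eigvecs[:, 1:k + 1]       # drop the trivial (constant) eigenvector
```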

The vast array of approaches described above was comprehensively evaluated by Rampášek et al. (2022), who devised a highly general framework for graph learning, GraphGPS, combining positional and structural encodings (the latter discussed in Section 2.4.2.1.2), message passing layers, and global attention mechanisms. In GraphGPS, a combination of positional and structural encodings is used to obtain node features \(\mathbf{X}\) and edge features \(\mathbf{E}\). Each layer of GraphGPS uses a message passing layer \(\mathbf{X}_M = \phi(\mathbf{X}, \mathbf{E}, \mathbf{A})\) to propagate the node and edge features within local neighborhoods and a global attention layer \(\mathbf{X}_T = \alpha(\mathbf{X})\) to incorporate information from across the whole graph; finally, the local and global embeddings are aggregated by a multi-layer neural network \(\psi(\mathbf{X}_M + \mathbf{X}_T)\). This framework is not only provably expressive and scalable, but also highly modular, allowing the various components of the model (such as the choice of positional and structural encoding or message passing layer) to be substituted depending on the needs of the task at hand. The authors carefully examined each component of their framework, observing that while global attention can be helpful, the choices of positional and structural encodings and of message passing layer were far more important to improving model performance. This work is highly relevant to the present thesis from the perspective of designing and comprehensively evaluating a population of models within a standardized framework for graph learning tasks. However, GraphGPS presented a more general framework aiming for applicability to all graph learning tasks, whereas this work focuses solely on building generalizable, inductive models for the link prediction task. In the following section, we will briefly review other works which examine populations of models, also known as design spaces.

2.3.4 Design Space of Graph Neural Networks

With such a wide range of methods for learning graph representations, it is important to evaluate which methods are best suited for various applications. Examining populations of models in this way is also known as profiling a design space (Radosavovic, Johnson, Xie, Lo, & Dollár, 2019). The first large-scale study on design spaces for Graph Neural Networks was contributed by GraphGym (You et al., 2020), which proposed a number of fundamental “design dimensions” for GNNs, including:

  • Batch normalization
  • Dropout
  • Non-linear activation
  • Aggregation function
  • Number of message passing layers
  • Layer connectivity
  • Pre-processing layers
  • Post-processing layers
  • Batch size
  • Learning rate
  • Optimizer
  • Training epochs

With multiple choices for each dimension, there are over 10 million possible model configurations in this design space, making it unrealistic to evaluate every possible model. To make this tractable, the authors devised the Controlled Random Search (CRS) experimental procedure, which uses a randomized block design to evaluate each dimension. For example, consider a study wishing to find the optimal batch size \(\mathrm{BatchSize} \in \{32, 64, 128\}\) within the design space \(\{\mathrm{BatchNorm}: [\mathrm{True, False}], \mathrm{LR}: [0.01, 0.001, 0.0001]\}\). We would draw a random sample \(S\) of configurations from the Cartesian product of the free dimensions \(\mathrm{BatchNorm} \times \mathrm{LR}\), and then evaluate every configuration in the Cartesian product \(S \times \mathrm{BatchSize}\). This greatly reduces the overall number of experiments that must be run to identify the optimal batch size while simultaneously providing a consistent evaluation across all choices for the dimension of interest. This led to a number of interesting findings, such as batch normalization being more helpful than dropout, \(\mathrm{PReLU}\) being highly effective as a non-linear activation function, and a batch size of \(32\) being preferable to either \(16\) or \(64\). It also confirmed several empirically held beliefs, such as \(\mathrm{sum}\) being the best aggregation function, Adam being generally superior to SGD, and more training epochs leading to better performance.
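The batch-size example can be expressed as a short sampling procedure: draw random configurations over the free dimensions, then evaluate every value of the dimension of interest under each sampled configuration. The sketch below is only illustrative of this randomized block design; the dimension names and values follow the example above, and `train_and_evaluate` is a hypothetical stand-in for a full training run.

```python
import itertools
import random

def controlled_random_search(free_dims, target_dim, n_samples, train_and_evaluate, seed=0):
    """Evaluate each value of `target_dim` under a shared random sample of the free dimensions."""
    rng = random.Random(seed)
    free_configs = list(itertools.product(*free_dims.values()))
    sampled = rng.sample(free_configs, min(n_samples, len(free_configs)))
    results = {value: [] for value in target_dim["values"]}
    for config in sampled:
        base = dict(zip(free_dims.keys(), config))
        for value in target_dim["values"]:
            results[value].append(train_and_evaluate({**base, target_dim["name"]: value}))
    return results  # compare, e.g., the mean score achieved by each batch size

# Illustrative usage following the example above:
# controlled_random_search(
#     free_dims={"BatchNorm": [True, False], "LR": [0.01, 0.001, 0.0001]},
#     target_dim={"name": "BatchSize", "values": [32, 64, 128]},
#     n_samples=4, train_and_evaluate=my_training_function,
# )
```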

The methodology proposed by GraphGym was extended to the dynamic graph setting by ROLAND (You, Du, & Leskovec, 2022), who used CRS to conclude that batch normalization, skip-connectivity, and max aggregation were optimal design choices for their proposed architecture, and to the heterogeneous setting by T. Zhao et al. (2022), who found that their results on batch normalization, number of message passing layers, optimizer, and training epochs were similar to those of GraphGym. They also found that heterogeneous GNNs had several important differences, such as dropout being a prerequisite for good performance, which the authors suggest is due to the over-parameterization created by node and edge-type specific model weights. Finally, a particularly relevant study was conducted by Z. Wang, Zhao, & Shi (2022), who profiled the design space of models for the collaborative filtering (CF) task. Through their evaluation, the authors identified a “pruned” design space which was shown to contain a higher concentration of well-performing models, an important contribution for both future research and applications as it provides a significantly improved “starting point” for researchers and practitioners designing CF models. This work is of great interest to the present thesis as it studies the design space of collaborative filtering, which can be formulated as a link prediction problem. However, the study also examines non-GRL models specific to CF, whereas we focus solely on GRL-based models. Their study also focuses solely on performance in the transductive setting, as their generalization analysis examines how well their reduced design space transfers to new datasets, not how actual trained models can be applied to new datasets. This thesis goes beyond the transductive setting to investigate the applicability of existing trained models to completely novel datasets, evaluating their effectiveness in the inductive link prediction task where all nodes are previously unseen.

2.4 Link Prediction

In this section, we will review the existing body of literature concerning the application of deep learning methods to the link prediction task. We will first cover the earliest deep learning-based approaches to link prediction, which were followed by the emergence of Graph Neural Networks (GNNs). We will then discuss the challenge of inductive link prediction, including the approaches and datasets contributed to this line of research.