Chapter 2 Background

This chapter of the thesis will provide a brief overview of graphs and their applications in machine learning.

2.1 Preliminaries

A simple graph \(\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}\) is a data structure which defines the connections, or edges \((u, v) \in \mathcal{E}\), between a set of nodes \(\mathcal{V}\). A graph is often represented by its adjacency matrix \(A \in \mathbb{R}^{n\times n}\), where \(n = |\mathcal{V}|\) and \(A_{i,j}\) is the weight of the connection between nodes \(i\) and \(j\). In weighted graphs we have \(A_{i,j} \in \mathbb{R}\), whereas in unweighted graphs \(A_{i,j} \in \{0, 1\}\) indicates the binary presence or absence of a connection. There are many variations and extensions of the simple graph structure. For instance, a heterogeneous graph, also frequently referred to as a multi-relational or typed graph, is a graph in which each node and edge can be assigned a particular type label. Closely related are \(k\)-partite networks, which are heterogeneous networks in which nodes of the same type are never connected, as well as knowledge graphs, in which labelled edges are expressed as \((\mathrm{head}, \mathrm{relation}, \mathrm{tail})\) triples. A more complex structure is the multi-layer network \(\mathcal{G} = \{\mathcal{V}, \{\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_L\}\}\), in which one set of nodes is connected by multiple independent sets of edges. Heterogeneous graphs and knowledge graphs can both be expressed as multi-layer networks by storing the edges of each relation as an independent layer. A multi-layer network can be represented by an adjacency tensor \(\mathbf{A} = [A_1; A_2; \ldots; A_L] \in \mathbb{R}^{L\times n\times n}\), where each slice \(A_l\) is the adjacency matrix of layer \(l\). Finally, there are graphs in which an arbitrary number of nodes can be connected by a single edge. Such a graph is known as a hypergraph, and is said to be \(k\)-uniform if all of its edges have cardinality \(k\); a simple graph is then equivalent to a \(2\)-uniform hypergraph.
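To make these representations concrete, the following short sketch (using NumPy; the node indices and edge lists are purely illustrative) builds the adjacency matrix of a small unweighted graph and stacks per-relation adjacency matrices into the adjacency tensor of a multi-layer network.

```python
import numpy as np

# Illustrative unweighted, undirected graph on 4 nodes (indices 0..3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

n = 4
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = 1.0
    A[v, u] = 1.0  # undirected: the adjacency matrix is symmetric

# A multi-layer network: one edge set (layer) per relation type.
layers = {
    "follows":  [(0, 1), (1, 2)],
    "mentions": [(2, 3), (0, 3)],
}

# Adjacency tensor with shape (L, n, n): one adjacency matrix per layer.
A_tensor = np.zeros((len(layers), n, n))
for l, layer_edges in enumerate(layers.values()):
    for u, v in layer_edges:
        A_tensor[l, u, v] = 1.0
        A_tensor[l, v, u] = 1.0
```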

There are also a number of important algorithms which operate on graphs. For example, a random walk is generated by starting at a node and repeatedly sampling the next node at random from the current node’s neighbors.
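A uniform random walk can be sampled directly from this adjacency structure. The sketch below (NumPy-based, reusing the illustrative adjacency matrix `A` from above) draws a single walk by repeatedly choosing a random neighbor of the current node.

```python
import numpy as np

def random_walk(A, start, length, rng=None):
    """Sample a uniform random walk [w_0, ..., w_length] from adjacency matrix A."""
    rng = rng or np.random.default_rng()
    walk = [start]
    for _ in range(length):
        neighbors = np.flatnonzero(A[walk[-1]])  # nodes adjacent to the current node
        if len(neighbors) == 0:                  # dead end: stop the walk early
            break
        walk.append(int(rng.choice(neighbors)))
    return walk

# e.g. random_walk(A, start=0, length=5) might return [0, 2, 3, 2, 1, 0]
```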

Definitions of key terms.

| Term | Notation | Definition | Example(s) | Additional Names |
|------|----------|------------|------------|------------------|
| Node | \(v \in \mathcal{V}\) | A discrete object or entity. | | Vertex, entity |
| Edge | \((u, v) \in \mathcal{E}\) | A connection between two nodes. | | |
| Graph | \(\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}\) | A set or tuple of nodes and edges. | Social networks, academic citations, internet hyperlinks | Network |
| Multi-layer Network | \(\mathcal{G} = \{\mathcal{V}, \{\mathcal{E}_1, \ldots, \mathcal{E}_L \}\}\) | A set of \(L\) edge sets connecting a shared node set. | Temporal networks (e.g. monthly snapshots), protein interaction networks | Multiplex network |
| Adjacency Matrix | \(A \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}\) | A matrix representation of a graph, where \(A_{i,j} > 0\) if \((i, j) \in \mathcal{E}\). | | |
| Adjacency Tensor | \(\mathbf{A} \in \mathbb{R}^{L \times |\mathcal{V}| \times |\mathcal{V}|}\) | A tensor representation of a multi-layer network, where each slice along the first axis is the adjacency matrix of a given layer. | | |
| Neighbors | \(u \in \mathcal{N}(v)\) | All nodes which are connected to node \(v\). | | |
| Random Walk | \([w_0, w_1, \ldots, w_\ell]\) | A sequence of randomly sampled nodes starting from \(w_0 \in \mathcal{V}\), such that \(w_i \in \mathcal{N}(w_{i-1})\). | | |
| Heterogeneous Graph | Same as multi-layer network. | A multi-layer network in which each layer represents a different edge label; each node may optionally be assigned a label as well. | Academic networks | Multi-relational network, labelled graph |
| Knowledge Graph | \((h, r, t) \in \mathcal{E} \subseteq \mathcal{V} \times \mathcal{R} \times \mathcal{V}\) | A heterogeneous graph, generally having a very large (\(\gg 1000\)) number of edge labels. | Wikipedia | |

2.2 Applications of Machine Learning on Graphs

There are a number of tasks pertaining to graphs to which machine learning models can be applied. At its core, each task on a graph \(\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}\) fundamentally concerns a subset of its nodes, or node set, \(s = \{v_1, \ldots, v_k\} \in \mathcal{V}^k\). In general, we want to find a function (i.e., a machine learning model) \(f:\mathcal{V}^k \to \mathbb{R}^d\) such that the prediction \(\hat{y}_i = f(s_i)\) corresponds to a ground-truth label \(y_i\), which is done by minimizing some cost function \(\mathcal{L}\) aligned with the goal of the task.

Graph learning tasks are often classification-based, with the goal of mapping \(s\) to one of several class labels \(c_i \in \mathcal{C}\). Models are provided binary vectors \(y \in \{0, 1\}^{|\mathcal{C}|}\) as labels and produce probability scores \(\hat{y}\in \mathbb{R}^{|\mathcal{C}|}\) as output. The objective of single-label classification tasks is generally to find a model whose outputs \(\hat{y}\) minimize the cross-entropy cost function \(\mathcal{L} = -\sum_{i=1}^{|\mathcal{C}|}y_i\log(\hat{y}_i)\), or, when the model produces a single probability score (\(|\mathcal{C}|=1\)), the binary cross entropy \(\mathcal{L} = -(y \cdot \log(\hat{y}) + (1-y)\cdot \log(1-\hat{y}))\). If a node set can have more than one label, a setting known as multi-label classification, the sum of binary cross-entropy losses across all labels can be used. Alternatively, a graph learning task can be regression-based, where we wish to map a node set to a single scalar value \(s \to \mathbb{R}\). Mean squared error is commonly used as a cost function for regression tasks, with \(\mathcal{L}=(y-\hat{y})^2\).
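These cost functions can be written out directly. Below is a minimal NumPy sketch (with small illustrative inputs) of the cross-entropy, binary cross-entropy, and squared error losses for a single prediction.

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """Multi-class cross entropy: y is a one-hot vector, y_hat holds class probabilities."""
    return -np.sum(y * np.log(y_hat + eps))

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross entropy for a single probability score y_hat and label y in {0, 1}."""
    return -(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

def mean_squared_error(y, y_hat):
    """Squared error for a single regression target."""
    return (y - y_hat) ** 2

# Illustrative usage: a 3-class problem where the true class is the second one.
y = np.array([0.0, 1.0, 0.0])
y_hat = np.array([0.2, 0.7, 0.1])
print(cross_entropy(y, y_hat), binary_cross_entropy(1.0, 0.9), mean_squared_error(2.5, 2.1))
```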

In all tasks, we optimize a machine learning model on a training dataset, and evaluate its performance on an inference dataset to ensure that the model is capable of generalizing to unseen data. This necessitates the partitioning of a given graph dataset into training and inference sets, which requires special consideration for graph learning tasks. In the scenario where a node may appear in both the training and inference sets, the task is said to be transductive in nature. In contrast, an inductive task is one in which the nodes that appear in the training set are disjoint from those which appear in the inference set. A more extreme inductive scenario could involve completely separate graphs, where the model is trained on one or more graphs and then performs inference on fully novel graphs. Broadly speaking, transductive tasks measure a model’s ability to perform a task across an observed set of nodes, while inductive tasks measure a model’s ability to perform a task across an unseen set of nodes.

This section will provide a brief, non-comprehensive overview of several node-level (\(k=1\)), set-level (\(1 < k< |\mathcal{V}|\)), and graph-level (\(k=|\mathcal{V}|\), i.e. \(s=\mathcal{V}\)) tasks alongside their respective categorizations (classification vs. regression, transductive vs. inductive, etc.) and evaluation metrics.

2.2.1 Node-level Tasks

A node-level task considers a single node in a graph at a time; for example, we may wish to categorize a user in a social network, or to estimate the volume of traffic a web page will receive. The former case refers to node classification, wherein we try to find a function \(f: \mathcal{V} \to \mathcal{C}\) that maps a single given node to one of several class labels. We measure the quality of \(f\) using metrics such as accuracy and \(F_1\) score. The latter task is an example of node regression, where \(f\) must map a given node to a real-valued scalar and is evaluated using metrics such as mean absolute error. In both tasks, the dataset is partitioned node-wise into training and inference nodes \(V_\mathrm{train}, V_\mathrm{test} \subset \mathcal{V}\). In a transductive node-level task, the graph’s full set of edges is available during both training and inference, along with the nodes’ features \(X \in \mathbb{R}^{|\mathcal{V}|\times D}\) (should they exist); the only information withheld is the label of each inference node in \(V_\mathrm{test}\). In the inductive setting, however, the edges and features of the inference nodes are also withheld during training, meaning that the model will be evaluated on nodes it has never seen before.
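As a rough illustration of the difference between the two settings, the sketch below (NumPy; it assumes `A`, `X`, and `y` are an adjacency matrix, feature matrix, and label vector, and the split ratio is an arbitrary choice) builds a transductive split, where only the test labels are hidden, and an inductive split, where the test nodes' edges and features are removed from the training graph as well.

```python
import numpy as np

rng = np.random.default_rng(0)
n = A.shape[0]
test_nodes = rng.choice(n, size=max(1, n // 5), replace=False)
test_mask = np.zeros(n, dtype=bool)
test_mask[test_nodes] = True

# Transductive: the full graph and all features are visible; only test labels are hidden
# (here marked with -1 as a placeholder for "unknown").
transductive_train = {"A": A, "X": X, "y": np.where(test_mask, -1, y)}

# Inductive: test nodes, their edges, and their features are removed during training.
keep = ~test_mask
inductive_train = {"A": A[np.ix_(keep, keep)], "X": X[keep], "y": y[keep]}
```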

2.2.2 Set-level Tasks

Several tasks involve making predictions on sets of nodes with cardinality in \([2, |\mathcal{V}|-1]\), most notably link prediction. In link prediction, a model must identify edges that are missing from the graph. In the classical setting, edges consist of two nodes, so a model must accept an edge, a set \(s \in \mathcal{V}^2\), and perform binary classification to determine whether the edge belongs in the graph. Hyperedge link prediction generalizes this to higher values of \(k\) in hypergraphs. Binary link prediction can be evaluated using standard classification metrics such as accuracy, \(F_1\), precision, and recall. While most frequently treated as a binary classification task, link prediction can also be a regression task when a graph is weighted; in this case, a model attempts to map an edge to the correct real-valued weight. Link prediction can also be trained and evaluated in the ranking setting, where instead of classifying edges as valid or invalid, models must return a relative ordering over a set of edges such that the edges which do exist are ranked higher than those which do not. In this setting, models are trained with a ranking cost function such as the margin loss, \(\mathcal{L}=\max(0, m - \hat{y}_+ + \hat{y}_\neg)\), where \(\hat{y}_+\) is a model’s prediction for a positive (valid) edge and \(\hat{y}_\neg\) is a model’s prediction for a negative (invalid) edge. Intuitively, this cost function enforces that predictions for positive edges should exceed those for negative edges by at least the margin \(m\). Models trained for ranking-based link prediction can be evaluated using ranking metrics such as mean reciprocal rank (MRR) and mean average precision (MAP), and in the case of a weighted graph, graded relevance measures such as normalized discounted cumulative gain (NDCG).
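The margin-based ranking objective is straightforward to express; below is a minimal NumPy sketch (the scores and margin value are illustrative) of the margin loss over a batch of positive and negative edge scores.

```python
import numpy as np

def margin_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Hinge loss encouraging positive edge scores to exceed negative scores by `margin`."""
    return np.mean(np.maximum(0.0, margin - pos_scores + neg_scores))

# Illustrative usage: most positives are already ranked above the negatives.
pos = np.array([2.1, 1.7, 3.0])   # model scores for edges present in the graph
neg = np.array([0.3, 1.9, -0.5])  # model scores for sampled non-edges
print(margin_ranking_loss(pos, neg))
```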

For transductive link prediction, a random sample of the edges is withheld to form \(\mathcal{E}_\mathrm{test}\), and the remaining edges, as well as any node features, are available during training and inference. The inductive case is similar to inductive node classification in that a subset of nodes, along with their edges and features, is once again held out during training. At inference, however, not all of the edges associated with \(V_\mathrm{test}\) are made available as they are in node classification; only a subset of these nodes’ edges are provided, and the task of the model is to use these, along with any given node features, to identify the remaining held-out edges.
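As a concrete illustration, the sketch below (NumPy; it assumes the adjacency matrix `A` from earlier, and the held-out fraction is an arbitrary choice) constructs a transductive edge split by withholding a random sample of edges; the inductive split is only described in comments, since it depends on how the held-out nodes' edges are partitioned at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)
# Undirected edge list taken from the upper triangle of the adjacency matrix.
edges = np.array(list(zip(*np.nonzero(np.triu(A)))))

# Transductive split: withhold a random 10% of edges as the test set; the rest
# (plus any node features) remain visible during both training and inference.
perm = rng.permutation(len(edges))
n_test = max(1, len(edges) // 10)
test_edges, train_edges = edges[perm[:n_test]], edges[perm[n_test:]]

# Inductive split (sketch only): hold out a subset of nodes entirely during training;
# at inference, reveal only some of their incident edges and ask the model to
# recover the remaining held-out edges.
```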

A closely related task specific to knowledge graphs is question answering, where a model accepts a query graph \(Q=\{s, \tilde{\mathcal{E}}\}\), with \(s \in \mathcal{V}^k\) and \(\tilde{\mathcal{E}} \subset \mathcal{E}\), representing a logical question; \(Q\) is the subgraph of \(\mathcal{G}\) induced by \(s\). The goal is to find the entity which satisfies the query graph, and the task can be evaluated using link prediction metrics. There are additional set-level tasks, such as community detection; these tasks are outside the scope of this thesis, however, and are therefore left for further reading.

2.2.3 Graph-level Tasks

Finally, a graph-level task considers the scenario in which a single instance is constituted by an entire graph. These tasks are similar to node-level tasks, primarily taking the form of either classification or regression, with the main difference being that graph-level tasks are all inherently inductive to some extent, as each instance presents a novel graph (though nodes can of course appear in multiple graphs). Graph classification is the task of mapping a graph to a class label \(\mathcal{G} \to \mathcal{C}\), with real-world applications such as identifying a category for a group of connected documents on Wikipedia or detecting whether a discussion thread on a social media platform contains any instances of hate speech. It follows that graph regression is the task of mapping a graph to a real-valued scalar \(\mathcal{G} \to \mathbb{R}\), such as predicting the toxicity level of a chemical compound.

2.3 Graph Representation Learning

Representation learning (Bengio, Courville, & Vincent, 2013) is the task of computing \(d\)-dimensional real-valued feature vectors (or embeddings) \(x \in \mathbb{R}^d\) for each point in a given dataset. These feature vectors are then used as inputs to a downstream machine learning model to solve a classification or regression task, or for visualization and analysis of a dataset. In the context of graphs, this often means learning feature vectors for each node. The rise of neural-network based approaches to representation learning (Krizhevsky, Sutskever, & Hinton, 2012; Mikolov, Chen, Corrado, & Dean, 2013; Mikolov, Sutskever, Chen, Corrado, & Dean, 2013) made it possible to obtain high-quality feature vectors for large language and image datasets. This section will discuss how those approaches were extended to learn representations for graph-based data.

2.3.1 Random Walk and Factorization-based Methods

A common approach to learning word representations is to sample sentences from a large corpus and optimize the representation of each word to be closest to those of the words it most commonly co-occurs with. Perozzi et al. (2014) connected this process to performing random walks on graphs, resulting in the DeepWalk node embedding model. For a graph \(\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}\), DeepWalk learns a node embedding function \(\Phi: \mathcal{V} \to \mathbb{R}^d\) by drawing a corpus \(\mathcal{C} = \{w_i\}\) of random walks, where \(w_i = [v_0, v_1, \ldots, v_\ell], v_j \in \mathcal{V}\). The parameters of \(\Phi\) are optimized to maximize the co-occurrence probability of nodes which appear in close proximity in the same random walk(s) using the SkipGram method (Mikolov, Chen, et al., 2013). Each walk is split into windows of length \(2w+1\), and we minimize the negative log likelihood of the \(w\) nodes preceding and following the center node given the embedding of the center node:

\[ \min_{\Phi} \; -\log\mathrm{Pr}\left(\{ v_{i-w}, \ldots, v_{i-1}, v_{i+1}, \ldots, v_{i+w} \} \mid \Phi(v_i) \right) \]
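A minimal DeepWalk-style pipeline can then be sketched as follows: sample a corpus of walks and fit a SkipGram model over them. The example below reuses the `random_walk` helper sketched in Section 2.1 and assumes the gensim library (version 4 or later) provides the SkipGram implementation; all hyperparameter values are illustrative rather than those of the original paper.

```python
import numpy as np
from gensim.models import Word2Vec  # SkipGram implementation

def deepwalk_embeddings(A, dim=64, walks_per_node=10, walk_length=40, window=5):
    """Learn node embeddings by running SkipGram over a corpus of random walks."""
    n = A.shape[0]
    corpus = [
        [str(v) for v in random_walk(A, start=node, length=walk_length)]
        for node in range(n)
        for _ in range(walks_per_node)
    ]
    model = Word2Vec(
        sentences=corpus, vector_size=dim, window=window,
        min_count=0, sg=1, workers=1,  # sg=1 selects the SkipGram objective
    )
    return np.stack([model.wv[str(v)] for v in range(n)])
```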

A number of related approaches quickly followed, most notably Node2Vec (Grover & Leskovec, 2016), which generalized the random walk extraction approach proposed in DeepWalk to allow configuration of depth versus breadth and restart probability in each walk. A concurrent line of work examined factorization of the adjacency matrix and its powers as an avenue to obtain node embeddings. LINE (Tang, Qu, Wang, et al., 2015) is a highly efficient factorization-based approach which preserves both first- and second-order proximity in its resulting embedding space. This was extended to the heterogeneous setting by PTE (Tang, Qu, & Mei, 2015), which jointly optimizes the embeddings of word-word, word-document, and word-label bipartite networks to produce content-aware node embeddings. In fact, factorization-based approaches are particularly well suited to highly multi-relational networks and schema-rich knowledge graphs, whose adjacency tensors can be decomposed to obtain knowledge graph embeddings (KGE) (Balazevic, Allen, & Hospedales, 2019; Bordes, Usunier, García-Durán, Weston, & Yakhnenko, 2013; Kazemi & Poole, 2018; Nickel, Tresp, & Kriegel, 2011; Trouillon et al., 2017).

Interestingly, connections have been identified between these two lines of work. NetMF (Qiu et al., 2018; Xie et al., 2021) established equivalences between DeepWalk, LINE, PTE, and node2vec under the unified umbrella of matrix factorization. In addition, the recently proposed ReFactor-GNN (Chen et al., 2022) demonstrated how factorization-based methods can be recast as message-passing architectures, the subject of the following section.

2.3.2 Graph Neural Networks

2.3.2.1 Neural Message Passing

The above random walk and matrix factorization-based approaches integrate multi-layer neural networks with graph-theoretic objective functions in order to produce node embeddings. While these methods are able to preserve certain properties of the underlying graph, they lack the strong inductive bias found in neural network architectures specialized for other domains, such as Convolutional Neural Networks for image processing. In contrast, Graph Neural Networks (GNNs) are generally based on the message passing framework (Gilmer, Schoenholz, Riley, Vinyals, & Dahl, 2017), in which nodes’ representations are computed as a function of their neighbors’ representations. As explained by Bronstein, Bruna, Cohen, & Veličković (2021) and Veličković (2022), for a graph \(\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}\) with node representations \(\mathbf{X} \in \mathbb{R}^{|\mathcal{V}| \times d}\), the message passing procedure in GNNs can be summarized as a sequence of three steps:

  • Message: A function \(\psi: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^k\) which computes a message between two nodes as a function of their representations;
  • Aggregate: A permutation invariant function \(\bigoplus: \mathbb{R}^{s \times k} \to \mathbb{R}^l\) which aggregates a set of vector-valued messages;
  • Update: A function \(\phi: \mathbb{R}^k \times \mathbb{R}^l \to \mathbb{R}^m\) which updates a node’s representation as a function of its current representation and the aggregated messages of its neighbors.

Several rounds of message passing can be performed consecutively to propagate node representations across multiple hops, increasing a GNN’s receptive field. In this case, we compute the embedding \(\mathbf{h}_v\) of a node \(v\) at the \(i^{th}\) round of message passing as:

\[ \mathbf{h}^{(i)}_{v} = \phi \left( \mathbf{h}^{(i-1)}_{v}, \bigoplus_{u \in \mathcal{N}(v)} \psi(\mathbf{h}^{(i-1)}_{v}, \mathbf{h}^{(i-1)}_{u}) \right) \]

The resulting node representations are then used to optimize an objective function for a downstream task as described above. In node classification, for example, a linear map \(\rho: \mathbb{R}^m \to \mathbb{R}^c\) (where \(c\) is the number of class labels) is applied to each node’s embedding \(\mathbf{h}_v\) to predict an output label.
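The three steps above can be sketched as a single function parameterized by the message, aggregation, and update functions; concrete architectures differ only in how these three are instantiated. The NumPy sketch below is a schematic, dense, unbatched version (it assumes an adjacency matrix `A` and feature matrix `X` as before, and the default choices shown are illustrative).

```python
import numpy as np

def message_passing_round(A, H, message, aggregate, update):
    """One round of message passing over adjacency matrix A and node embeddings H."""
    H_new = []
    for v in range(A.shape[0]):
        neighbors = np.flatnonzero(A[v])
        if len(neighbors) == 0:
            messages = np.zeros((1, H.shape[1]))  # no neighbors: aggregate a single zero message
        else:
            messages = np.stack([message(H[v], H[u]) for u in neighbors])
        H_new.append(update(H[v], aggregate(messages)))
    return np.stack(H_new)

# Illustrative choices: identity messages, sum aggregation, additive update.
H1 = message_passing_round(
    A, X,
    message=lambda h_v, h_u: h_u,
    aggregate=lambda msgs: msgs.sum(axis=0),
    update=lambda h_v, agg: h_v + agg,
)
```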

2.3.2.2 Convolutional architectures

There are a number of instantiations of the message passing framework described above. An instance of a message passing model is considered convolutional if its message function satisfies \(\psi(\mathbf{h}_v, \mathbf{h}_u) = c_{v,u}\Psi(\mathbf{h}_u)\), where \(c_{v,u}\) is a fixed constant specific to the edge between nodes \(v\) and \(u\) (e.g., an edge weight). Kipf & Welling (2017) introduced the most notable instance of a convolutional message passing model, the Graph Convolutional Network (GCN). The GCN used \(\psi(\mathbf{h}_v, \mathbf{h}_u) =\frac{1}{|\mathcal{N}(v)|}\mathbf{h}_u\) and \(\bigoplus = \Sigma\); intuitively, this corresponds to taking the mean of neighboring nodes’ representations. A learnable weight matrix \(\mathbf{W}\) was used to parameterize \(\phi\). This approach showed strong performance on semi-supervised node classification as well as link prediction (Kipf & Welling, 2016). Subsequent efforts improved the efficiency of the GCN model by simplifying the model architecture to consist almost exclusively of message passing, eliminating non-linearities and collapsing model parameters (X. He et al., 2020; Wu et al., 2019). Additional follow-up work (Schlichtkrull et al., 2018) showed that the message-passing paradigm extends naturally to the multi-relational setting by applying a GCN to each relation type. Of course, this becomes quite inefficient with a large number of relations, making it necessary in most cases to apply basis decomposition in order to effectively share parameters between different relations. To leverage existing work on knowledge graph embedding and obtain embeddings for relations in addition to nodes, CompGCN (Vashishth, Sanyal, Nitin, & Talukdar, 2020) applied KGE scoring functions (Bordes et al., 2013; Nickel, Rosasco, & Poggio, 2016; Yang, Yih, He, Gao, & Deng, 2015) to improve the quality of the output embeddings. This may, however, be unnecessary; Degraeve, Vandewiele, Ongenae, & Hoecke (2022) show that the inductive bias provided by message passing may be enough on its own, even when using completely random embeddings and untrained model weights.
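Following the mean-aggregation description above, a single simplified GCN-style layer can be written in matrix form: a row-normalized adjacency multiplication followed by a learned linear map and a non-linearity. This is only a sketch of the convolutional flavor (assuming `A` and `X` as before, with random illustrative weights); it omits details of the published GCN such as self-loops and its exact normalization.

```python
import numpy as np

def gcn_layer(A, H, W):
    """Simplified convolutional layer: mean of neighbor embeddings, then a linear map + ReLU."""
    deg = A.sum(axis=1, keepdims=True)
    A_norm = A / np.maximum(deg, 1.0)       # row-normalize so aggregation is a neighbor mean
    return np.maximum(A_norm @ H @ W, 0.0)  # ReLU non-linearity

# Illustrative usage with random weights mapping d-dimensional inputs to 16 dimensions.
rng = np.random.default_rng(0)
W = rng.normal(size=(X.shape[1], 16))
H1 = gcn_layer(A, X, W)
```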

2.3.2.3 Attentional architectures

Another class of message passing models are attentional models, in which \(\psi(\mathbf{h}_v, \mathbf{h}_u) = \alpha(\mathbf{h}_v, \mathbf{h}_u)\Psi(\mathbf{h}_u)\), where \(\alpha: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}\) is a learnable function which maps a pair of node representations to a scalar weight, also known as an attention mechanism (Bahdanau, Cho, & Bengio, 2015; Vaswani et al., 2017). The output of the attention mechanism is a real-valued scalar signifying the importance of node \(u\) to node \(v\), and therefore the weight given to node \(u\)’s representation when computing node \(v\)’s representation. When the sum operator is used as the aggregation function, this corresponds to taking a weighted average over neighboring nodes’ representations. This is the approach taken by the Graph Attention Network (GAT) (Velickovic et al., 2018), which computes importance weights as \(\alpha(\mathbf{h}_v, \mathbf{h}_u) = \mathrm{softmax}(\mathbf{a}^T[\mathbf{W}\mathbf{h}_v; \mathbf{W}\mathbf{h}_u])\), where \(\mathbf{W} \in \mathbb{R}^{k \times d}\) and \(\mathbf{a} \in \mathbb{R}^{2k}\) are learnable parameters and the softmax is taken over the neighborhood \(\mathcal{N}(v)\).
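The attention coefficients can likewise be sketched directly from the formula above. The NumPy snippet below computes, for a single node \(v\), the softmax-normalized weights over its neighbors; all shapes and weights are illustrative, and the additional LeakyReLU applied before the softmax in the published GAT is omitted, matching the simplified formula above.

```python
import numpy as np

def gat_attention_weights(h_v, h_neighbors, W, a):
    """Attention weights of node v over its neighbors, following the simplified GAT formula."""
    scores = np.array([
        a @ np.concatenate([W @ h_v, W @ h_u]) for h_u in h_neighbors
    ])
    exp = np.exp(scores - scores.max())  # softmax over the neighborhood
    return exp / exp.sum()

# Illustrative usage: d = 8 input dimensions, k = 4 projected dimensions, 3 neighbors.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
a = rng.normal(size=(8,))                # 2k-dimensional attention vector
weights = gat_attention_weights(rng.normal(size=8), rng.normal(size=(3, 8)), W, a)
```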

2.3.3 Expressiveness of Graph Neural Networks

Despite their empirically strong results, message passing architectures are inherently limited by their connection to the Weisfeiler-Lehman (WL) algorithm (Leman & Weisfeiler, 1968). The WL algorithm assigns a color to each node in a graph; comparing the resulting colorings of two graphs provides a necessary (but not sufficient) condition for an isomorphism between them, known as the WL test. For a graph \(\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}\), each node \(v \in \mathcal{V}\) is assigned an initial color \(c_v^{(0)} \in \mathcal{C}\). This color is iteratively updated over \(k\) steps (or until convergence) as \(c_v^{(i)} = \mathrm{hash}\left(L_v^{(i-1)}, c_v^{(i-1)}\right)\), where \(L_v^{(i-1)} = \{c_u^{(i-1)}: u \in \mathcal{N}(v) \}\) is the multiset of neighboring colors and \(\mathrm{hash}\) is an injective hashing function mapping each (multiset, color) pair to a new color. Intuitively, a node’s color captures its \(k\)-hop neighborhood, and two nodes with the same \(k\)-hop neighborhood structure will receive the same colors. We can test for isomorphism between two graphs by applying the WL algorithm to each graph and comparing their color histograms; if they differ (i.e., one graph has a different number of nodes of a given color than the other), then we can reject the possibility of an isomorphism. Otherwise, the graphs may be isomorphic, but this cannot be confirmed, as the WL test is known to fail to distinguish certain pairs of non-isomorphic graphs.
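One iteration of this color refinement amounts to hashing each node's current color together with the multiset of its neighbors' colors. Below is a minimal Python sketch, using sorted tuples and the built-in `hash` as a stand-in for a truly injective hashing function.

```python
import numpy as np

def wl_refinement(A, colors, iterations):
    """Iteratively refine node colors; graphs can then be compared via their color histograms."""
    for _ in range(iterations):
        new_colors = []
        for v in range(A.shape[0]):
            neighbor_colors = sorted(colors[u] for u in np.flatnonzero(A[v]))
            # (multiset of neighbor colors, own color) -> new color via a (nominally injective) hash
            new_colors.append(hash((tuple(neighbor_colors), colors[v])))
        colors = new_colors
    return colors

# e.g. compare Counter(wl_refinement(A1, [0] * n1, k)) with Counter(wl_refinement(A2, [0] * n2, k))
```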

Note that the WL algorithm fits our definition of message passing with \(\phi = \mathrm{hash}\), \(\bigoplus = \mathrm{multiset}\) (a function which returns its arguments as a multiset), and \(\psi(\mathbf{h}_v, \mathbf{h}_u) = \mathbf{h}_u\). This indicates that message passing GNNs (MP-GNNs) are only as powerful as the WL test; that is, if the WL test is unable to distinguish a pair of non-isomorphic graphs, neither will an MP-GNN. In particular, if two distinct nodes have local \(k\)-hop neighborhoods that are indistinguishable by \(k\) iterations of the WL test, then they will receive identical representations from an MP-GNN, making certain tasks such as node classification impossible when the two nodes belong to different classes. Unfortunately, the limitations of the WL test are well-documented (Arvind, Köbler, Rattan, & Verbitsky, 2015; Kiefer, Schweitzer, & Selman, 2015), which has led to a stream of research examining to what extent MP-GNNs share the same limitations (Feng, Chen, Li, Sarkar, & Zhang, 2022; Morris, Lipman, et al., 2021; Morris, Fey, & Kriege, 2021; Xu, Hu, Leskovec, & Jegelka, 2019) and ways in which they can be overcome (Balcilar et al., 2021; Maron, Ben-Hamu, Serviansky, & Lipman, 2019; L. Zhao, Jin, Akoglu, & Shah, 2022). These efforts can be grouped into several classes of approaches, as follows.

2.3.3.0.1 Feature augmentations

In general, surpassing the limits of the WL test is done by augmenting nodes with additional information so as to distinguish their representations when message passing alone cannot. A simple approach is to append a one-hot encoding of each node’s unique ID to its embedding, which guarantees the distinguishability of all nodes; however, this technique is inherently unable to generalize to unseen nodes or to new graphs with more nodes than seen during training. We can instead use heuristics based on a node’s degree (C. Ying et al., 2021) or PageRank score (Postavaru et al., 2020) to derive uniquely identifying features for each node. In fact, even randomly generating additional features for each node has been shown to be a surprisingly effective method for boosting the expressiveness of GNNs (Abboud, Ceylan, Grohe, & Lukasiewicz, 2020; Degraeve et al., 2022; Egressy & Wattenhofer, 2022; Sato, Yamada, & Kashima, 2020).

2.3.3.0.2 Positional encodings

Another approach is to imbue each node’s representation with an indication of its global position within its graph, also known as a positional encoding (PE) (You, Ying, & Leskovec, 2019). PEs can help differentiate nodes which share similar local structures while residing in distant locations in a graph, and conversely, help capture proximities between nearby nodes with differing local structures. Dwivedi, Luu, Laurent, Bengio, & Bresson (2022) propose two positional encoding schemes, Random Walk PE (RWPE) and Laplacian PE (LapPE). For node \(i\), we obtain \(\mathrm{RWPE}(i) = [\mathbf{RW}_{i,i}, \mathbf{RW}^2_{i,i}, \ldots, \mathbf{RW}^k_{i,i}]\), where \(\mathbf{RW}^{l} = (AD^{-1})^l\) is the \(l\)-step transition matrix, meaning each dimension \(\mathrm{RWPE}(i)_j\) represents the probability that a \(j\)-step random walk starting at node \(i\) ends back at node \(i\). Similarly, \(\mathrm{LapPE}(i) = [\mathbf{U}_{i,1}, \mathbf{U}_{i,2}, \ldots, \mathbf{U}_{i,k}]\), where \(\mathbf{U} \in \mathbb{R}^{|\mathcal{V}| \times k}\) contains the eigenvectors corresponding to the \(k\) smallest non-trivial eigenvalues of the graph Laplacian. However, this approach is limited by the sign ambiguity of eigenvectors, which leads to \(2^k\) possible sign combinations for \(k\) eigenvectors, requiring the signs to be randomly flipped during training (Dwivedi et al., 2020; Kreuzer, Beaini, Hamilton, Létourneau, & Tossou, 2021). Additionally, many of the random walk and factorization-based methods discussed in Section 2.3.1, such as DeepWalk and LINE, can be considered positional encodings, as they too capture node proximities. One issue is that the embeddings produced by these methods are not stable across graphs, as an independent model must be fit on each graph. H. Wang, Yin, Zhang, & Li (2022) propose the Position Equivariant Graph Neural Network (PEG) as a solution; by updating node features and positional encodings separately, they maintain permutation equivariance and improve the stability of node embeddings.
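Both encodings can be computed directly from the adjacency matrix. The sketch below (NumPy; `k` is an illustrative choice, the eigenvector sign ambiguity discussed above is not handled, and the unnormalized Laplacian \(D - A\) is used, though normalized variants are also common) computes RWPE from powers of the random-walk transition matrix and LapPE from Laplacian eigenvectors.

```python
import numpy as np

def rwpe(A, k):
    """Random Walk PE: return-to-self probabilities over 1..k step walks."""
    D_inv = np.diag(1.0 / np.maximum(A.sum(axis=1), 1.0))
    RW = A @ D_inv
    pe, P = [], np.eye(A.shape[0])
    for _ in range(k):
        P = P @ RW
        pe.append(np.diag(P))        # diagonal entries (RW^l)_{i,i}
    return np.stack(pe, axis=1)      # shape (|V|, k)

def lappe(A, k):
    """Laplacian PE: entries of the k smallest non-trivial Laplacian eigenvectors."""
    L = np.diag(A.sum(axis=1)) - A   # unnormalized graph Laplacian D - A
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues returned in ascending order
    return eigvecs[:, 1:k + 1]       # drop the trivial (constant) eigenvector
```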

The vast array of approaches described above was comprehensively evaluated by Rampášek et al. (2022), who devised a highly general framework for graph learning, GraphGPS, combining positional and structural encodings (the latter discussed in Section 2.4.2.1.2), message passing layers, and global attention mechanisms. In GraphGPS, a combination of positional and structural encodings is used to obtain node features \(\mathbf{X}\) and edge features \(\mathbf{E}\). Each layer of GraphGPS uses a message passing layer \(\mathbf{X}_M = \phi(\mathbf{X}, \mathbf{E}, \mathbf{A})\) to propagate the node and edge features within local neighborhoods and a global attention layer \(\mathbf{X}_T = \alpha(\mathbf{X})\) to incorporate information from across the whole graph; finally, the local and global embeddings are aggregated by a multi-layer neural network \(\psi(\mathbf{X}_M + \mathbf{X}_T)\). This framework is not only provably expressive and scalable, but also highly modular, allowing the various components of the model (such as the choice of positional and structural encoding or message passing layer) to be substituted depending on the needs of the task at hand. The authors carefully examined each component of their framework, observing that while global attention can be helpful, the choices of positional and structural encodings and of message passing layer were far more important to improving model performance. This work is highly relevant to the present thesis from the perspective of designing and comprehensively evaluating a population of models within a standardized framework for graph learning tasks. However, GraphGPS presented a more general framework aiming for applicability to all graph learning tasks, whereas this work focuses solely on building generalizable, inductive models for the link prediction task. In the following section, we will briefly review other works which examine populations of models, also known as design spaces.

2.3.4 Design Space of Graph Neural Networks

With such a wide range of methods for learning graph representations, it is important to evaluate which methods are best suited for various applications. Examining populations of models in this way is also known as profiling a design space (Radosavovic, Johnson, Xie, Lo, & Dollár, 2019). The first large-scale study on design spaces for Graph Neural Networks was contributed by GraphGym (You et al., 2020), which proposed a number of fundamental “design dimensions” for GNNs, including:

  • Batch normalization
  • Dropout
  • Non-linear activation
  • Aggregation function
  • Number of message passing layers
  • Layer connectivity
  • Pre-processing layers
  • Post-processing layers
  • Batch size
  • Learning rate
  • Optimizer
  • Training epochs

With multiple choices for each dimension, there are over 10 million possible model configurations in this design space, making it unrealistic to evaluate every possible model. To make this tractable, the authors devised the Controlled Random Search (CRS) experimental procedure, which uses a randomized block design to evaluate each dimension. For example, consider a study wishing to find the optimal batch size \(\mathrm{BatchSize} \in \{32, 64, 128\}\) within the design space \(\{\mathrm{BatchNorm}: [\mathrm{True, False}], \mathrm{LR}: [0.01, 0.001, 0.0001]\}\). We would draw a random sample \(S\) of configurations from the Cartesian product of the free dimensions \(\mathrm{BatchNorm} \times \mathrm{LR}\), and then evaluate every configuration in the Cartesian product \(S \times \mathrm{BatchSize}\). This greatly reduces the overall number of experiments that must be run to identify the optimal batch size while simultaneously providing a consistent evaluation across all choices for the dimension of interest. This led to a number of interesting findings, such as batch normalization being more helpful than dropout, \(\mathrm{PReLU}\) being highly effective as a non-linear activation function, and a batch size of \(32\) being preferable to either \(16\) or \(64\). It also confirmed several empirically held beliefs, such as \(\mathrm{sum}\) being the best aggregation function, Adam being generally superior to SGD, and more training epochs leading to better performance.
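The batch-size example can be expressed as a short sampling procedure: draw random configurations over the free dimensions, then evaluate every value of the dimension of interest under each sampled configuration. The sketch below is only illustrative of this randomized block design; the dimension names and values follow the example above, and `train_and_evaluate` is a hypothetical stand-in for a full training run.

```python
import itertools
import random

def controlled_random_search(free_dims, target_dim, n_samples, train_and_evaluate, seed=0):
    """Evaluate each value of `target_dim` under a shared random sample of the free dimensions."""
    rng = random.Random(seed)
    free_configs = list(itertools.product(*free_dims.values()))
    sampled = rng.sample(free_configs, min(n_samples, len(free_configs)))
    results = {value: [] for value in target_dim["values"]}
    for config in sampled:
        base = dict(zip(free_dims.keys(), config))
        for value in target_dim["values"]:
            results[value].append(train_and_evaluate({**base, target_dim["name"]: value}))
    return results  # compare, e.g., the mean score achieved by each batch size

# Illustrative usage following the example above:
# controlled_random_search(
#     free_dims={"BatchNorm": [True, False], "LR": [0.01, 0.001, 0.0001]},
#     target_dim={"name": "BatchSize", "values": [32, 64, 128]},
#     n_samples=4, train_and_evaluate=my_training_function,
# )
```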

The methodology proposed by GraphGym was extended to the dynamic graph setting by ROLAND (You, Du, & Leskovec, 2022), who used CRS to conclude that batch normalization, skip-connectivity, and max aggregation were optimal design choices for their proposed architecture, and to the heterogeneous setting by T. Zhao et al. (2022), who found that their results on batch normalization, number of message passing layers, optimizer, and training epochs were similar to those of GraphGym. They also found that heterogeneous GNNs had several important differences, such as dropout being a prerequisite for good performance, which the authors suggest is due to the over-parameterization created by node and edge-type specific model weights. Finally, a particularly relevant study was conducted by Z. Wang, Zhao, & Shi (2022), who profiled the design space of models for the collaborative filtering (CF) task. Through their evaluation, the authors identified a “pruned” design space which was shown to contain a higher concentration of well-performing models, an important contribution for both future research and applications as it provides a significantly improved “starting point” for researchers and practitioners designing CF models. This work is of great interest to the present thesis as it studies the design space of collaborative filtering, which can be formulated as a link prediction problem. However, the study also examines non-GRL models specific to CF, whereas we focus solely on GRL-based models. Their study also focuses solely on performance in the transductive setting, as their generalization analysis examines how well their reduced design space transfers to new datasets, not how actual trained models can be applied to new datasets. This thesis goes beyond the transductive setting to investigate the applicability of existing trained models to completely novel datasets, evaluating their effectiveness in the inductive link prediction task where all nodes are previously unseen.

2.4 Link Prediction

In this section, we will review the existing body of literature concerning the application of deep learning methods to the link prediction task. We will first cover the earliest deep learning-based approaches to link prediction, which were followed by the emergence of Graph Neural Networks (GNNs). We will then discuss the challenge of inductive link prediction, including the approaches and datasets contributed to this line of research.