Chapter 3 Design Space for Inductive Link Prediction
In this section, we outline our methodology for profiling the design space of models for Inductive Link Prediction (ILP). We present our proposed standardized architecture consisting of several components, covering existing methods for each before detailing the experiments necessary to verify our hypotheses.
3.1 Standardized Architecture
To identify an optimal model (or set of models) for inductive link prediction, we require a standardized, modular architecture which allows us to swap in and out its various components to study the effects of each design choice. Furthermore, in order for the insights yielded by our experiments to be usable in real-world scenarios, our architecture must satisfy several criteria:
- Efficient. Realistic graph datasets often contain tens or hundreds of millions of nodes, and hundreds of millions - if not billions - of edges. Existing GNN architectures which operate on the full adjacency matrix of a graph are simply unable to scale to this size. As such, our framework must be able to operate on a reduced subset of a graph’s adjacency matrix if necessary.
- Effective. Of course, an efficient but poorly performing model is no more useful than a well-performing but inefficient one. Therefore, our framework must maintain strong performance even in resource-constrained scenarios.
- Generalizable. Finally, our framework must be fully inductive, meaning that it must not depend on (1) the specific identities of nodes in a given graph, or (2) the presence of pre-existing node or edge attributes for a given graph. The former criterion is important not only for cross-graph generalization, but also for efficiency, as it allows the model to operate on arbitrary subgraphs if necessary. The latter criterion increases the applicability of the framework to more realistic scenarios involving non-attributed graphs (such as many social networks).
Figure 3.1: Standardized architecture for inductive link prediction (ILP) models. Given an input graph, a model instance computes positional encodings on the full graph before sampling subgraphs around each target link. A structural encoding is derived for nodes in each subgraph, which is concatenated with their positional encoding to form positionally and structurally-aware node embeddings. The embeddings are propagated using local message passing layers before being pooled into a subgraph embedding which is used to classify the existence of each target link.
Given the above criteria, we propose a simple, standardized model architecture for inductive link prediction (ILP) comprising six components, shown in Figure 3.1. The modularity of our proposed architecture allows us to swap out different choices for each component to profile the overall design space. It does not depend on the presence of node or edge attributes, but can utilize them without any additional modification (though we choose not to in this work to focus on learning and generalizing solely from the structure of the graph). Our framework scales to large graphs by operating on \(k\)-hop subgraphs around each target link while (optionally) preserving global information. Concretely, an instance of our framework contains the following components:
- Global encoding (optional). Given an input graph \(\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}\), a global encoding module is used to extract global structure-aware node embeddings \(\mathbf{X}_g \in \mathbb{R}^{|\mathcal{V}| \times m}\). The purpose of this module is to generate representations that uniquely distinguish the nodes in a given graph to help differentiate nodes with isomorphic \(k\)-hop neighborhoods. This module must be efficient enough to operate on the full adjacency structure of the graph, and may also be omitted entirely if a graph’s local neighborhoods are sufficiently distinguishable. Candidates for this component include simple, lightweight heuristics such as node degree, centrality metrics, or even randomly generated features, as well as more sophisticated options such as positional encodings.
- Subgraph extraction. Following the SEAL framework (Zhang & Chen, 2018), for each target link \(\bar{e}_i = ((u, v), y_i): u, v \in \mathcal{V}, \bar{e}_i \notin \mathcal{E}, y_i \in \mathcal{Y}\) which we wish to predict, we extract a \(k\)-hop subgraph \(H_{u,v}^{(k)} = \{\mathcal{V}', \mathcal{E'}\} \subset \mathcal{G}\) enclosing \(u\) and \(v\). Each node and edge carries its respective attributes, including the global encoding from step 1 if applicable, into the subgraph \(H_{u,v}^{(k)}\) as \(\mathbf{X}_a \in \mathbb{R}^{|\mathcal{V}'| \times d}\) and \(\mathbf{X}_g \in \mathbb{R}^{|\mathcal{V}'| \times m}\) respectively. This step makes it tractable to operate on large graphs by limiting the receptive field of the model, while still preserving global information from step 1 and node/edge attributes if provided (which we ignore for the purposes of this study).
- Local encoding (optional). Given an enclosing subgraph \(H_{u,v}^{(k)}\), the local encoding component derives local structure-aware node embeddings \(\mathbf{X}_s \in \mathbb{R}^{|\mathcal{V}'| \times n}\). This component helps distinguish otherwise isomorphic \(k\)-hop subgraphs by computing context-sensitive representations with respect to the nodes \(u, v\) in the target link. These representations can either be real-valued embeddings, or integer labels \(\ell \in \mathbb{Z}_+^l\) which are used as inputs to an embedding layer. If computing integer labels, the labels must generalize between graphs. A simple example of a generalizable label is Distance Encoding (Li, Wang, Wang, & Leskovec, 2020), which returns a label tuple \(\mathbf{x}_q = (\mathrm{ssp}_{q,u}, \mathrm{ssp}_{q,v})\) for all \(q \in \mathcal{V}'\), representing the shortest path distance from every node in the subgraph to each node in the target link. The labels are then embedded and concatenated to form the local encoding. This step may also be omitted if global information is sufficient to distinguish nodes in the graph.
- Message passing (optional). Given an enclosing subgraph \(H_{u,v}^{(k)}\) with node attributes and global and local encodings \(\mathbf{X}_a\in\mathbb{R}^{|\mathcal{V}'| \times d},\mathbf{X}_g\in \mathbb{R}^{|\mathcal{V}'| \times m}, \mathbf{X}_s \in \mathbb{R}^{|\mathcal{V}'| \times n}, d,m,n \geq 0\), this component uses one or more message passing layers to propagate node embeddings throughout the local subgraph. Note that omissions of node attributes and global and local encodings are accommodated here by allowing \(d = 0\) (no attributes), \(m = 0\) (no global encoding), and \(n = 0\) (no local encoding), though we must have at least one of \(d > 0\), \(m > 0\), or \(n > 0\); otherwise there are no features to learn from. Any message passing layer can be used here by propagating the concatenated attributes and encodings \([\mathbf{X}_a; \mathbf{X}_g; \mathbf{X}_s]\) to incorporate information from all available aspects. Additional flexibility is also provided for specific MPNNs; for example, edge features can be passed to the MPNN layer if it supports them, and PEG (H. Wang et al., 2022) or GraphGPS (Rampášek et al., 2022) can be used as the message passing layer by using a positional encoding in step 1 and separately passing \([\mathbf{X}_a; \mathbf{X}_s]\) as node attributes and \(\mathbf{X}_g\) as positional encodings.
- Subgraph pooling. With node embeddings \(\mathbf{X} \in \mathbb{R}^{|\mathcal{V}'| \times p}\) that capture all aspects available in the enclosing subgraph \(H_{u,v}^{(k)}\), the subgraph pooling component maps \(\mathbb{R}^{|\mathcal{V}'|\times p} \to \mathbb{R}^c\). The pooling function must be size-agnostic and permutation-invariant, capable of aggregating the unordered set of node embeddings into a fixed-size subgraph embedding. The simplest choice is an arithmetic operation such as summation, while a more sophisticated approach might use a learnable aggregation function such as an attention mechanism.
- Subgraph classification. Finally, the fixed-size subgraph embedding is passed through a fully-connected classification layer mapping to an output \(\hat{y}_i \in \mathcal{Y}\), where the label space \(\mathcal{Y}\) is usually binary (\(\{0, 1\}\)) but can extend to multi-class and ordinal spaces, as well as a continuous space for regression problems, given a properly-chosen objective function. For this work, we focus on the traditional task of binary link prediction, and optimize the model end-to-end using the binary cross-entropy loss.
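The component flow above can be sketched end-to-end on a toy graph. The snippet below is a minimal, illustrative Python sketch and not the evaluated implementation: node degree stands in for the global encoding, BFS yields the \(k\)-hop enclosing subgraph, Distance Encoding tuples serve as the local encoding, and summation acts as the subgraph pooling; message passing and classification are omitted for brevity. All function names are our own.

```python
from collections import deque

def bfs_distances(adj, source, max_hops=None):
    """Hop distances from `source`; the building block for both subgraph
    extraction and Distance Encoding labels."""
    dist = {source: 0}
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        if max_hops is not None and dist[u] >= max_hops:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                frontier.append(v)
    return dist

def k_hop_subgraph(adj, u, v, k):
    """Subgraph extraction: union of the k-hop neighborhoods of u and v."""
    keep = set(bfs_distances(adj, u, k)) | set(bfs_distances(adj, v, k))
    return {n: [w for w in adj[n] if w in keep] for n in keep}

def distance_encoding(sub, u, v):
    """Local encoding: (distance to u, distance to v) label per node."""
    du, dv = bfs_distances(sub, u), bfs_distances(sub, v)
    return {n: (du.get(n, -1), dv.get(n, -1)) for n in sub}

# Toy graph: a 5-cycle stored as an adjacency dict.
adj = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
deg = {n: len(adj[n]) for n in adj}        # degree as a global encoding
sub = k_hop_subgraph(adj, 0, 2, k=1)       # enclosing subgraph for link (0, 2)
labels = distance_encoding(sub, 0, 2)      # structural labels per node
# Subgraph pooling: permutation-invariant sum of per-node features
# (here, the label tuple concatenated with the degree).
pooled = tuple(sum(x) for x in zip(*[labels[n] + (deg[n],) for n in sub]))
```

Because the pooled representation is a sum over an unordered node set, it is invariant to the iteration order of the subgraph, matching the permutation-invariance requirement of the pooling component.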
3.2 Proposed Design Space
We examine a variety of different global and local encodings as well as message passing layers to gauge their effects on generalization. For PEs and SEs, we include a \(null\) option to examine the effect of withholding each respective stage. We examine all \(PE \times SE \times MPNN\) combinations, except for \((null, null, \_)\) combinations (as there would be no node features) and \((null, \_, PEG)\) combinations (as PEG requires a positional encoding input).
Positional Encodings
Random Node Initialization (RNI).
Random Walk (RW).
Laplacian Eigenvectors (LE).
RandNE.
HOPE.
Structural Encodings
Distance Encoding (DE).
Distance Encoding Plus (DE+).
Double Radius Node Labeling (DRNL).
Message Passing Layers
Graph Convolutional Network (GCN).
Graph Isomorphism Network (GIN).
Graph Attention Network (GAT).
GraphSAGE.
PEG.
Table 3.1: Proposed design space for ILP models.

| Design Dimension | Choices |
|---|---|
| Positional Encoding | None, Degree, RW, RNI, HOPE, RandNE, LE |
| Structural Encoding | None, DE, DE+, DRNL |
| Message Passing Layer | GCN, GIN, GAT, GraphSAGE, PEG |
| Subgraph Pooling | Sort Pooling (\(k=30\)) |
| Hidden Layers | 1,2,3 |
| Hidden Channels | 32,64,128 |
| Optimizer | Adam, SGD |
| Learning Rate | 0.01, 0.001, 0.0001 |
| Batch Size | 64 |
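As a sanity check on the cardinalities used below, the design dimensions in Table 3.1 can be enumerated programmatically. The dictionary in this sketch is a hypothetical encoding of the table, and the two filters implement the invalid combinations noted in Section 3.2 (no features at all, and PEG without a positional encoding).

```python
from itertools import product
from math import prod

# Hypothetical encoding of the design dimensions from Table 3.1.
design_space = {
    "pe": ["None", "Degree", "RW", "RNI", "HOPE", "RandNE", "LE"],
    "se": ["None", "DE", "DE+", "DRNL"],
    "mpnn": ["GCN", "GIN", "GAT", "GraphSAGE", "PEG"],
    "pooling": ["SortPool-k30"],
    "layers": [1, 2, 3],
    "channels": [32, 64, 128],
    "optimizer": ["Adam", "SGD"],
    "lr": [0.01, 0.001, 0.0001],
    "batch_size": [64],
}
# Full (unfiltered) product of all dimensions: 7560 configurations.
total = prod(len(v) for v in design_space.values())

# Drop (null, null, _) combinations (no node features at all) and
# (null, _, PEG) combinations (PEG requires a positional encoding).
valid = [
    dict(zip(design_space, combo))
    for combo in product(*design_space.values())
    if not (combo[0] == "None" and combo[1] == "None")
    and not (combo[0] == "None" and combo[2] == "PEG")
]
```

The unfiltered total reproduces the \(7560\) figure used in Section 3.3; the filtered list is what an exhaustive sweep would actually have to train.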
3.3 Experimental Design
The following section details the specific experiments we will conduct to verify the hypotheses detailed in Section 1.2. Each experiment considers the design space \(\mathcal{D}\) shown in Table 3.1 across the task space \(\mathcal{T}\) shown in Table 4.1. Evaluating the full design and task spaces would require \(\prod_{d \in \mathcal{D}}|d| \cdot \sum_{t \in \mathcal{T}} ({}^{|t|}\!P_2) = 7560 \cdot 418 = 3,160,080\) different trials, each potentially repeated across multiple random seeds. To mitigate the prohibitive cost of such a large number of trials, we employ Controlled Random Search (CRS) (You et al., 2020) as an effective way to reduce the number of necessary trials by sampling. The process is demonstrated visually in Figure 3.2. As a simple example, suppose we hypothesize that DRNL is a superior structural encoding to DE+. This makes our design dimension of interest \(\hat{d}=SE\). We first draw \(|S|\) configurations \(S = [s_i = \{ s_{i, d}\sim d: d \in \mathcal{D}, d \neq \hat{d} \}]\); simply put, we randomly sample \(|S|\) configurations across all design dimensions (including tasks/datasets) other than the one of interest. The final set of experiments is obtained by taking the Cartesian product \(E = S \times \hat{d} = [(i, s_i, \hat{d}_j): i \in [1, |S|], \hat{d}_j \in \hat{d}]\) of the sampled configurations \(S\) with the possible values of our dimension of interest \(\hat{d}\), maintaining a reference \(i\) to the ID of the sample group (shown as Group No. in Figure 3.2). Conducting each experiment yields a performance metric, for which we compute rankings grouped by the sample group number \(i\). Intuitively, this represents the performance rank \(r_{\hat{d}_j, i}\) of each value \(\hat{d}_j\) within a group \(i\) where all other parameters are held constant.
Finally, we take the mean rank \(r_{\hat{d}_j} = \frac{1}{|S|}\sum_{i=1}^{|S|} r_{\hat{d}_j, i}\) for each \(\hat{d}_j\) to obtain the expected rank for each value of \(\hat{d}\) when all other parameters are held constant. In this case, with \(d_{SE} = \{DRNL, DE+\}\) and \(\hat{d} = d_{SE}\), we decrease the total number of required experiments from \(1,580,040\) to just \(|S| \cdot |\hat{d}| = 2|S|\), where \(|S|\) is specified by the user. Taking \(|S|=75\), for instance, only \(150\) trials need to be executed, a reduction of over \(10,000\) times.
Figure 3.2: Example of Controlled Random Search to evaluate whether DRNL or DE+ is a better structural encoding. We take the Cartesian product \(S \times \hat{d}\) of a set \(S\) of random configurations with the possible values for our dimension of interest \(\hat{d}\) to obtain our experiment set, and compute the average in-group rank for each value to determine the optimal choice. Based on the rows visible in the figure, DRNL would have an average rank of \(1.33\) versus DE+’s average rank of \(1.67\), making DRNL the preferable choice.
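The CRS procedure above can be sketched in a few lines. In this illustrative snippet, `run_trial` is a hypothetical stand-in for actually training and evaluating a model (it returns a synthetic score), and the non-interest dimensions are truncated for brevity.

```python
import random
from statistics import mean

random.seed(0)
# Truncated stand-ins for the dimensions other than the one of interest.
other_dims = {"pe": ["RW", "LE"], "mpnn": ["GCN", "GIN"], "lr": [0.01, 0.001]}
interest = ["DRNL", "DE+"]   # dimension of interest: structural encoding
n_groups = 75                # |S| sampled configuration groups

def run_trial(config, se):
    """Hypothetical stand-in for training/evaluating one configuration."""
    return random.random()

# Sample |S| configurations over every dimension except the one of interest...
samples = [{d: random.choice(vals) for d, vals in other_dims.items()}
           for _ in range(n_groups)]

# ...then cross with the values of interest and rank within each group.
ranks = {se: [] for se in interest}
for config in samples:
    scores = {se: run_trial(config, se) for se in interest}
    ordered = sorted(interest, key=lambda se: scores[se], reverse=True)
    for rank, se in enumerate(ordered, start=1):
        ranks[se].append(rank)

mean_rank = {se: mean(ranks[se]) for se in interest}  # lower is better
```

Within every group the two candidates receive ranks 1 and 2, so the mean ranks always sum to 3; the candidate with the lower mean rank would be the preferable choice.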
For all experiments, we use \(|S|=25\) for Controlled Random Search (CRS) and repeat each trial with \(3\) different random seeds, yielding \(75|\hat{d}|\) trials in total per experiment. We allow the source and target graphs to vary but sample them in pairs from the same dataset (e.g., Cora and PubMed, Twitch-ES and Twitch-PT, etc.) to ensure that task similarity remains relatively consistent between samples. As the graphs are also of varying size, we limit each trial to \(2500\) training steps. Each trial yields measurements of accuracy, AUC, F1, precision, and recall on the test split for both the source and target graphs.
3.3.1 Evaluation of Model Generalization
The first experiment we will conduct performs CRS on each design dimension \(d \in \mathcal{D}\) shown in Table 3.1, producing results for \(N = C \cdot \sum_{d \in \mathcal{D}} |d|\) trials in total, where \(C = 3|S| = 75\) is the number of trials per design choice. For each metric \(m \in \mathcal{M}\), we then compute the mean signed difference \(\Delta_m = \frac{1}{N}\sum_{i=1}^N (m_{i,t} - m_{i,s})\) where \(\Delta_m \in [-1, 1]\) (as all \(m \in [0, 1]\)), \(m_{i,s}\) is the value of a given metric on the source graph for trial \(i\), and \(m_{i,t}\) is defined similarly for the target graph. The value of \(\Delta_m\) will determine the validity of Hypothesis 1.1 by measuring the degree to which the performance of an ILP model changes when applied to unseen data. The null hypothesis \(H_\emptyset: \forall m: \Delta_m \geq 0\) asserts that ILP models will not show a decrease in performance from the source graph to the target graph on average, while the experimental hypothesis \(H_1: \forall m: \Delta_m < 0\) asserts the contrary.
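Given per-trial metric values, the mean signed difference reduces to a one-line computation. The metric values below are fabricated purely for illustration.

```python
# Mean signed difference between target- and source-graph values of a
# metric (here AUC) across trials; the numbers are made up.
source_auc = [0.91, 0.88, 0.93, 0.90]
target_auc = [0.85, 0.86, 0.89, 0.84]

delta = sum(t - s for s, t in zip(source_auc, target_auc)) / len(source_auc)
drops = delta < 0   # a negative delta indicates a performance drop (H1)
```

In this fabricated example the model loses AUC on every target graph, so the signed difference is negative and would count as evidence for the experimental hypothesis.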
In addition, this experiment will also provide the necessary evidence to verify Hypothesis 1.2. We will compute the mean of the ranks for each \(m\) of each design choice within each experiment group, as shown in Figure 3.2. This will yield an average rank \(r_{d_i,m}\) for each candidate choice \(d_i \in d, d \in \mathcal{D}\) and each metric \(m\), as well as the mean signed difference in source and target ranks \(\Delta_{r_{d_i}, m} \pm \sigma_{d_i, m}\) computed similarly to the above, which measures generalization ability. We obtain the average rank delta by design dimension \(\Delta_{r_d} = \frac{1}{|\mathcal{M}|\cdot|d|}\sum_{m \in \mathcal{M}}\sum_{d_i \in d} \Delta_{r_{d_i}, m}\) and the average standard deviation by design dimension \(\sigma_{d} = \frac{1}{2\cdot|\mathcal{M}|\cdot|d|}\sum_{m \in \mathcal{M}}\sum_{d_i \in d} \sigma_{d_i, m}\). These two measures describe the average change and spread of expected rank by design dimension, respectively. Using a one-way ANOVA with Bonferroni correction, we can determine the effect \(\eta^2_{d, r}\) of each design dimension on generalization as well as its variability \(\eta^2_{d, \sigma}\), with associated \(p\)-values \(p_{d, r}\) and \(p_{d, \sigma}\). The null for Hypothesis 1.2 states \(H_\emptyset: (\eta^2_{d,r} \to 0) \land (\eta^2_{d,\sigma} \to 0)\), which is to say that all design dimensions will have little to no effect on generalization with little to no variance. In contrast, the experimental hypothesis \(H_1: d \in \{\mathrm{PE, SE}\} \implies (\eta^2_{d,r} > 0) \land (\eta^2_{d,\sigma} > 0)\) asserts that the Positional Encoding and Structural Encoding design dimensions will have statistically significant effects on generalization with large variability between the choices for each.
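The effect size \(\eta^2\) used above is the between-group sum of squares over the total sum of squares from a one-way ANOVA, and can be computed directly. The per-group rank deltas below are made up for three hypothetical SE choices.

```python
from statistics import mean

# Made-up rank deltas for three choices of one design dimension (SE).
groups = {
    "DE":   [-0.10, -0.12, -0.08],
    "DE+":  [-0.05, -0.06, -0.04],
    "DRNL": [-0.01, -0.02, -0.03],
}
grand = mean(x for g in groups.values() for x in g)   # grand mean
# Between-group sum of squares: group sizes times squared mean offsets.
ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
# Total sum of squares: every observation against the grand mean.
ss_total = sum((x - grand) ** 2 for g in groups.values() for x in g)
eta_sq = ss_between / ss_total   # fraction of variance explained
```

A value of \(\eta^2\) near zero would support the null (the dimension barely matters), while a large value indicates that the choice within the dimension explains most of the variation in generalization.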
3.3.2 Identification of Condensed Design Space
Based on the results of the experiment described in Section 3.3.1, we will construct a condensed design subspace \(\hat{\mathcal{D}} = \{\hat{d}_i\}\) such that \(\forall i: \hat{d}_i \subset d_i\). Such a subspace would be of great practical utility to both practitioners and researchers interested in the ILP task, as it would greatly reduce the cardinality of the design space to be explored when searching for optimal models. For our proposed subspace to be useful, however, we must verify that it is actually superior to the full design space. As such, we will apply the same technique described in Section 3.3.1 on samples of models from both the proposed subspace and the full design space to obtain \(\Delta_{m, \hat{\mathcal{D}}}\) and \(\Delta_{m, \mathcal{D}}\) respectively. However, it is important that we control for the underlying datasets as well; as it may be easier or harder to generalize between certain datasets, we must ensure that both samples are evaluated on the same datasets. Therefore, we will obtain each sample by applying CRS using the source and target graph pair as the fixed parameter of interest. This means that we sample \(|S|\) models from both the subspace and the full space, and evaluate each sample across all source/target graph pairs. For this experiment, we have the null hypothesis \(H_\emptyset: \sum_{m \in \mathcal{M}}(\Delta_{m, \hat{\mathcal{D}}} - \Delta_{m, \mathcal{D}}) \leq 0\), signifying no improvement in generalization ability from the proposed subspace over the full design space, and the experimental hypothesis \(H_1: \sum_{m \in \mathcal{M}}(\Delta_{m, \hat{\mathcal{D}}} - \Delta_{m, \mathcal{D}}) > 0\) expecting the opposite.
3.3.3 Quantification of Task Difficulty
The previous experiments have constrained graph pairs to originate from the same domain (e.g., citation networks or social networks), under the assumption that these domains are not created equal; that some might be more (or less) challenging for ILP models to generalize between. In our final experiment, which covers Hypothesis 1.4, we wish to empirically validate this assumption by comparing the generalization abilities of the examined ILP models with spectral distance measures between the source and target graphs. We will sample a fixed set of \(M\) model architectures and \(P\) source-target graph pairs, removing the domain constraint to allow for combinations of graphs from different domains (such as a citation network \(\to\) social network pair). For each configuration \((m, (g_s, g_t)) \in M \times P\), we will evaluate the generalization ability \(\Delta_m\) of the model between the source and target graphs as well as the spectral distance \(d(g_s,g_t)\). The null hypothesis \(H_\emptyset\) for this experiment suggests that there should be no correlation between \(\Delta_m\) and \(d(g_s,g_t)\), while the experimental hypothesis \(H_1\) expects a negative correlation between the spectral distance and generalization ability; that is, the more different two graphs are, the more difficult it will be to generalize between them.
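The correlation test can be sketched as follows. Both lists below are synthetic placeholders for the measured generalization gaps and spectral distances, and the Pearson coefficient is computed from scratch so the sketch stays self-contained.

```python
from statistics import mean, stdev

# Synthetic placeholders: spectral distances d(g_s, g_t) and the
# corresponding generalization gaps Delta_m for five graph pairs.
spectral_dist = [0.1, 0.4, 0.7, 1.2, 1.5]
gen_gap       = [-0.01, -0.03, -0.06, -0.10, -0.12]

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

r = pearson(spectral_dist, gen_gap)   # a negative r would support H1
```

In this fabricated example the gap worsens monotonically as the spectral distance grows, so the coefficient is strongly negative, illustrating the pattern the experimental hypothesis predicts.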