Chapter 4 Evaluation
4.1 Methodology
| Dataset | Domain | # Graphs | Avg. # Nodes | Avg. # Edges | Avg. Degree | Density |
|---|---|---|---|---|---|---|
| Planetoid | Citation | 3 | 8584.00 | 36102.00 | 3.71 | 0.0017 |
| PPI | Biology | 2 (20) | 2495.50 | 66483.00 | 24.89 | 0.0200 |
| Amazon | Product | 2 | 10701.00 | 364942.00 | 33.44 | 0.0067 |
| Twitch | Social Network | 4 (6) | 5059.25 | 124394.75 | 26.65 | 0.1515 |
4.1.0.0.1 Datasets
Throughout all of our experiments, we consider graphs from four datasets spanning different domains: Planetoid (paper citations), PPI (protein-protein interaction), Amazon (product co-purchase), and Twitch (social relationships). Using graphs from multiple domains helps ensure that our results generalize across graphs of different sizes, densities, and interaction dynamics. We provide a brief overview of the datasets in Table 4.1. The number of graphs in each dataset is shown in the third column; for PPI and Twitch, we sample from the total number of graphs in the dataset, shown in parentheses. The remaining columns report the average number of nodes, number of edges, degree, and density of the graphs used in our experiments for each domain.
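The summary statistics in Table 4.1 can be reproduced directly from the graph objects. The sketch below assumes the PyTorch Geometric dataset wrappers (only Planetoid is shown); the exact graph selections and aggregation used for the table are not reproduced here.

```python
from torch_geometric.datasets import Planetoid

def summarize(graphs):
    """Average node count, edge count, degree, and density over a list of graphs.
    Edges are counted in both directions, as stored by PyTorch Geometric."""
    per_graph = []
    for g in graphs:
        n, e = g.num_nodes, g.num_edges
        per_graph.append((n, e, e / n, e / (n * (n - 1))))
    return {name: sum(col) / len(col)
            for name, col in zip(["nodes", "edges", "degree", "density"],
                                 zip(*per_graph))}

# Example: the three Planetoid citation graphs (Cora, CiteSeer, PubMed).
graphs = [Planetoid("data/planetoid", name)[0]
          for name in ("Cora", "CiteSeer", "PubMed")]
print(summarize(graphs))
```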
4.1.0.0.2 Experimental Procedure
We prepare the data for each trial by partitioning each graph’s edges into message passing, training, validation, and testing sets. We reserve \(10\%\) of edges for validation and \(20\%\) for testing, with the remaining edges divided between message passing and training supervision. We compute global encodings using only the training edges to prevent data leakage. We follow our proposed standardized model architecture comprising global and local encodings, message passing layers, and subgraph pooling, as described in Section 3.1. Each model is trained by minimizing the Binary Cross-Entropy (BCE) loss for a maximum of \(2500\) steps and evaluated on the complete test set. Our initial threshold for statistical significance is \(\alpha = 0.05\) for all hypothesis tests.
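As an illustration of the partitioning step, the following sketch splits an edge set into the four subsets described above. The validation and test fractions match the text; the 50/50 division of the remaining edges between message passing and training supervision is an assumption, and negative sampling and undirected-edge handling are omitted.

```python
import numpy as np

def split_edges(edge_index, val_frac=0.1, test_frac=0.2, mp_frac=0.5, seed=0):
    """Partition edges into message passing / training / validation / test sets.

    edge_index: (2, E) array of edges. Illustrative sketch only; the mp_frac
    ratio and other pipeline details are assumptions.
    """
    rng = np.random.default_rng(seed)
    num_edges = edge_index.shape[1]
    perm = rng.permutation(num_edges)

    n_test = int(test_frac * num_edges)
    n_val = int(val_frac * num_edges)
    test, val, rest = np.split(perm, [n_test, n_test + n_val])

    # The remaining edges are divided between message passing edges and
    # training supervision edges.
    n_mp = int(mp_frac * len(rest))
    mp, train = rest[:n_mp], rest[n_mp:]

    return {name: edge_index[:, idx] for name, idx in
            {"message_passing": mp, "train": train, "val": val, "test": test}.items()}
```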
4.2 Evaluation of Model Generalization
4.2.1 Aggregate analysis
Figure 4.1: Distribution of performance metrics on the cross-graph link prediction task.
Our first objective, Hypothesis 1.1, asserts that, in the aggregate, ILP models will demonstrate a moderate decrease in performance when applied to new data. We test this hypothesis by conducting a large number of trials in which various performance measures are obtained on the test edges of both the source and target graphs. In particular, we use Controlled Random Search (CRS) over each of our proposed design dimensions and aggregate the results to assess the overall behavior of ILP models. With \(|S|=25\) and each trial repeated across \(3\) different random seeds, we ultimately conduct \(2,025\) trials, obtaining the AUC, Accuracy, F1, Precision, and Recall measures for the test edges on both the source and target graphs.
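For reference, the following sketch shows one way such a Controlled Random Search could be organized: for each design dimension, a random base configuration is drawn and every choice of that dimension is evaluated while the other dimensions are held fixed, repeated across seeds. This mirrors the trial counts reported in this chapter, but the exact CRS procedure used in our experiments is simplified here.

```python
import random

def crs_trials(design_space, num_base=25, seeds=(0, 1, 2)):
    """Sketch of Controlled Random Search over the design space.

    For every design dimension, draw `num_base` random base configurations and
    sweep each choice of that dimension while holding the other dimensions
    fixed, repeating each configuration across seeds. Illustrative only.
    """
    trials = []
    for dim, choices in design_space.items():
        for _ in range(num_base):
            base = {d: random.choice(c) for d, c in design_space.items()}
            for choice in choices:
                for seed in seeds:
                    trials.append({**base, dim: choice, "seed": seed})
    return trials
```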
| Metric | Source | Target | \(t\) | p |
|---|---|---|---|---|
| AUC | 0.7476 (\(\pm 0.1469\)) | 0.7321 (\(\pm 0.1444\)) | 13.5506 | 2.45E-40 |
| Accuracy | 0.6567 (\(\pm 0.1275\)) | 0.6484 (\(\pm 0.1182\)) | 9.7255 | 3.67E-22 |
| F1 | 0.4973 (\(\pm 0.3047\)) | 0.4955 (\(\pm 0.2937\)) | 0.9092 | 1.82E-01 |
| Precision | 0.7385 (\(\pm 0.2653\)) | 0.7245 (\(\pm 0.2481\)) | 5.4389 | 3.02E-08 |
| Recall | 0.4678 (\(\pm 0.3306\)) | 0.4760 (\(\pm 0.3335\)) | -3.3605 | 1.00E+00 |
We use a one-sided paired-sample \(t\)-test with the null hypothesis \(H_\emptyset: \forall m: \: \bar{m}_s = \bar{m}_t\), namely that the means of the source and target samples are identical for each metric. The alternative hypothesis \(H_1: \exists m: \: \bar{m}_s - \bar{m}_t > 0\) states that there exists some metric for which performance decreases from the source graph to the target. Of course, evaluating these hypotheses repeatedly for multiple metrics increases our chance of a Type-I error. We control for this using Bonferroni correction, which sets our threshold for statistical significance at \(\alpha = \frac{0.05}{5} = 0.01\). Under this regime, we find a statistically significant decrease in AUC, Accuracy, and Precision when applying ILP models to the cross-graph link prediction task, but no significant difference in F1 or Recall. Nonetheless, we reject the null hypothesis \(H_\emptyset\) by demonstrating a performance decrease in one or more metrics, with the relevant significance measures and test statistics shown in Table 4.2.
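The test itself is straightforward to express with SciPy; a minimal sketch is given below, assuming `source_scores` and `target_scores` are dictionaries of per-trial metric arrays paired by trial (hypothetical names).

```python
from scipy import stats

def paired_decrease_tests(source_scores, target_scores, metrics, alpha=0.05):
    """One-sided paired t-tests (H1: source mean > target mean) with a
    Bonferroni-corrected significance threshold.

    `source_scores` and `target_scores` map metric name -> per-trial arrays,
    paired by trial (hypothetical structure).
    """
    threshold = alpha / len(metrics)  # Bonferroni correction over the metrics
    results = {}
    for m in metrics:
        t, p = stats.ttest_rel(source_scores[m], target_scores[m],
                               alternative="greater")
        results[m] = (t, p, p < threshold)
    return results
```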
Figure 4.1 visualizes the distribution of each metric in both the source and target settings, showing that, on average, the source distributions place their mass at higher values than their corresponding target distributions. This result indicates that practitioners may well experience difficulties when applying link prediction models to unseen data; however, it is still possible to achieve comparable (albeit slightly diminished) performance. It also signifies the need for models that generalize better to unseen data in order to mitigate these difficulties. In addition, we observe that in both settings, models achieve strong Precision but struggle with Recall, which may not be suitable for all applications. The existing literature has paid little attention to recall-oriented measures, which is seemingly reflected in these results; developing recall-optimized architectures is a potential direction for future work.
4.2.2 Performance by Design Dimension
This section examines the performance of each choice within each design dimension, across all metrics on both source and target datasets, to comprehensively capture the strengths and weaknesses of each.
4.2.2.0.1 Positional encoding
Figure 4.2 shows the surprising result that none of the positional encoding methods are as effective in this setting as a simple degree-based encoding. The degree-based PE outperforms all others in all metrics on both source and target graphs. Overall, the top three best-performing methods - degree, random node initialization, and random walk - can also arguably be considered the simplest methods, suggesting that high-powered positional encodings do not demonstrably improve performance relative to more lightweight alternatives, and are sometimes even worse than not using a positional encoding at all.
Figure 4.2: Ranking analysis of positional encodings across all metrics. Lower is better.
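For concreteness, a degree-based positional encoding can be as simple as a one-hot of each node's (clipped) degree; the sketch below is illustrative, and the clipping threshold and one-hot form are assumptions rather than the exact encoding used in our experiments.

```python
import torch

def degree_encoding(edge_index, num_nodes, max_degree=64):
    """Degree-based positional encoding: one-hot of each node's clipped degree.

    Illustrative sketch; the exact encoding used in the experiments
    (raw scalar vs. one-hot, clipping threshold) is an assumption.
    """
    deg = torch.bincount(edge_index[0], minlength=num_nodes)
    deg = deg.clamp(max=max_degree)
    return torch.nn.functional.one_hot(deg, num_classes=max_degree + 1).float()
```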
4.2.2.0.2 Structural encoding
Among structural encodings, it is clear that Double Radius Node Labeling (DRNL) and Distance Encoding Plus (DE+) are markedly superior to Distance Encoding (DE), as shown in Figure 4.3. Interestingly, there is a pronounced difference between all three options and None, indicating that using a structural encoding is vital for inductive link prediction.
Figure 4.3: Ranking analysis of structural encodings across all metrics. Lower is better.
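For reference, DRNL assigns each node in the enclosing subgraph a label derived from its distances to the two target nodes. The sketch below follows the hashing function proposed with SEAL; removing the opposite target node before computing each distance, as in the original formulation, is omitted for brevity.

```python
import networkx as nx

def drnl_labels(G, u, v):
    """Double Radius Node Labeling for the enclosing subgraph around link (u, v).

    Targets receive label 1, nodes unreachable from either target receive 0,
    and all other nodes are hashed from their distances to u and v.
    """
    d_u = nx.single_source_shortest_path_length(G, u)
    d_v = nx.single_source_shortest_path_length(G, v)
    labels = {}
    for node in G.nodes:
        if node in (u, v):
            labels[node] = 1
        elif node not in d_u or node not in d_v:
            labels[node] = 0  # unreachable from one of the targets
        else:
            du, dv = d_u[node], d_v[node]
            d = du + dv
            labels[node] = 1 + min(du, dv) + (d // 2) * (d // 2 + d % 2 - 1)
    return labels
```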
4.2.2.0.3 Message passing network
Figure 4.4 shows that the Graph Convolutional Network (GCN) consistently outperforms the alternatives in terms of AUC, accuracy, F1, and recall. However, GraphSAGE produces the highest precision, providing a good example of the importance of considering performance across multiple metrics. These findings suggest that practitioners should generally opt for the GCN, but that those focused on tasks requiring high precision should choose GraphSAGE instead.
Figure 4.4: Ranking analysis of message passing networks across all metrics. Lower is better.
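A minimal message passing encoder that swaps between the two convolution types might look as follows; the hidden size, activation, and layer count are illustrative and not the exact architecture of Section 3.1.

```python
import torch
from torch_geometric.nn import GCNConv, SAGEConv

class MPEncoder(torch.nn.Module):
    """Minimal message passing encoder; hyperparameters are illustrative."""

    def __init__(self, in_dim, hidden_dim=64, num_layers=1, conv="gcn"):
        super().__init__()
        Conv = GCNConv if conv == "gcn" else SAGEConv
        dims = [in_dim] + [hidden_dim] * num_layers
        self.convs = torch.nn.ModuleList(
            [Conv(dims[i], dims[i + 1]) for i in range(num_layers)])

    def forward(self, x, edge_index):
        for conv in self.convs:
            x = torch.relu(conv(x, edge_index))
        return x
```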
4.2.2.0.4 Message passing layers
When considering the number of message passing layers, we find a less-is-more effect. Figure 4.5 shows that one message passing layer is usually sufficient, if not outright superior, and that in almost every case using three message passing layers hampers performance. We believe this is likely an effect of the well-studied oversmoothing phenomenon in message passing networks, in which node embeddings converge toward a common point as the number of message passing rounds increases, ultimately making it difficult to distinguish positive and negative subgraphs.
Figure 4.5: Ranking analysis of message passing layers across all metrics. Lower is better.
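A simple way to observe this effect is to track how similar node embeddings become after each round of message passing; the diagnostic below (mean pairwise cosine similarity) is illustrative and was not part of the reported experiments.

```python
import torch
import torch.nn.functional as F

def mean_pairwise_cosine(h):
    """Mean pairwise cosine similarity of node embeddings h with shape (N, d).

    A value approaching 1 as more message passing layers are applied is a
    common symptom of oversmoothing.
    """
    h = F.normalize(h, dim=1)
    sim = h @ h.t()
    n = h.size(0)
    return (sim.sum() - n) / (n * (n - 1))  # exclude the diagonal self-similarities
```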
4.2.2.0.6 Optimizer
Figure 4.7 paints a somewhat unclear picture of which optimizer is best for the inductive link prediction task. Adam is superior in AUC and precision, while SGD is stronger in F1 and recall, and both are comparable in terms of accuracy. This aligns with findings from previous work that Adam is generally comparable or superior to SGD in a naive setting, though SGD can be better when tuned. Though the outright best choice is unclear, our findings help guide practitioners by demonstrating the nuanced differences between the two on a per-metric basis.
Figure 4.7: Ranking analysis of optimizers across all metrics. Lower is better.
4.2.2.0.7 Learning rate
Finally, Figure 4.8 shows that a lower learning rate of \(0.0001\) is generally better for performance. In addition, we again observe a difference in precision, which improves with a slightly higher learning rate of \(0.001\).
Figure 4.8: Ranking analysis of learning rate across all metrics. Lower is better.
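Both of these dimensions translate directly into standard PyTorch configuration; the helper below is a sketch covering the optimizer and learning rate choices in the design space, with all other optimizer hyperparameters left at their defaults (an assumption).

```python
import torch

def make_optimizer(model, name="adam", lr=1e-4):
    """Build the optimizer for a trial; names and values mirror the design
    space (Adam vs. SGD, learning rates 0.01 / 0.001 / 0.0001)."""
    if name == "adam":
        return torch.optim.Adam(model.parameters(), lr=lr)
    return torch.optim.SGD(model.parameters(), lr=lr)
```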
4.2.3 Dimension-level effects
Having considered the intra-dimension differences between the choices for each design dimension, we now turn our attention to inter-dimension differences. For Hypothesis 1.2, we wish to establish whether any design dimensions have a statistically significant impact on generalization ability. Such a difference would indicate that certain dimensions are more important than others when identifying well-generalizing models, allowing us to allocate compute resources more efficiently by spending less time optimizing dimensions that do not significantly impact performance. Note that this does not speak to the importance of the dimension itself, but rather to the range of choices for a given dimension; for example, the choice of learning rate may be very important, yet there may not be a significant difference between the examined values \(0.01\), \(0.001\), and \(0.0001\).
| Metric | F | p |
|---|---|---|
| Accuracy | 6.8304 | 3.48E-07 |
| AUC | 3.9559 | 6.14E-04 |
| F1 | 4.0272 | 5.14E-04 |
| Precision | 1.2034 | 3.01E-01 |
| Recall | 3.1296 | 4.69E-03 |
In order to test for a difference in performance metrics across design dimensions, we use a one-way ANOVA with Bonferroni correction (\(\alpha = \frac{0.05}{5} = 0.01\)). We use the same observations as in Section 4.2.1, in which we use Controlled Random Search to sample \(25 \cdot |d|\) trials for each design dimension \(d \in \mathcal{D}\) and repeat each trial with \(3\) random seeds, resulting in \(2,025\) total trials. For each metric from each trial we compute the generalization score as the difference between the metric on the source and target graphs. Our null hypothesis \(H_\emptyset: \forall \mu_{d_i, m}, \mu_{d_j, m}: \: \mu_{d_i, m} - \mu_{d_j, m} = 0, \; d_i, d_j \in \mathcal{D}\) asserts that, for each metric \(m\), the mean generalization ability is identical across all design dimensions. In contrast, our alternative hypothesis \(H_1: \exists \mu_{d_i, m}, \mu_{d_j, m}: \: \mu_{d_i, m} - \mu_{d_j, m} \neq 0, \; d_i, d_j \in \mathcal{D}\) asserts that, for each metric \(m\), there exists at least one pair of design dimensions with different means. Table 4.3 shows that we may reject the null hypothesis by observing a statistically significant difference in generalization ability between design dimensions for Accuracy, AUC, F1, and Recall. However, we are unable to reject the null hypothesis for Precision.
| Metric | Group 1 | Group 2 | Mean diff. | p |
|---|---|---|---|---|
| AUC | model | pe | -0.0156 | 0.00021 |
| AUC | layers | pe | -0.0122 | 0.0411 |
| Accuracy | pe | se | 0.0111 | 0.00154 |
| Accuracy | lr | pe | -0.0116 | 0.00228 |
| Accuracy | model | pe | -0.0148 | 5.37e-07 |
| Accuracy | channels | pe | -0.0125 | 0.000705 |
| F1 | layers | model | 0.0261 | 0.0114 |
| F1 | model | pe | -0.0277 | 0.000193 |
| Recall | model | pe | -0.0273 | 0.00564 |
This insight signifies that certain design dimensions may have a greater effect (positive or negative) on generalization ability than others. However, a limitation of the ANOVA test is that it does not tell us which dimensions differ. Having rejected the null hypothesis \(H_\emptyset\) for the Accuracy, AUC, F1, and Recall metrics, we perform a post-hoc analysis using Tukey’s Honestly Significant Difference (HSD) test to better understand the impact of each design dimension on generalization ability, with the significant pairwise differences shown in Table 4.4.
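Both steps are standard; a sketch using SciPy and statsmodels is shown below, assuming `gen_scores` maps each design dimension to its array of per-trial generalization scores (a hypothetical structure).

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def dimension_effects(gen_scores, alpha=0.01):
    """One-way ANOVA over design dimensions followed by Tukey's HSD post-hoc test.

    `gen_scores` maps each design dimension to an array of per-trial
    generalization scores (source metric minus target metric).
    """
    groups = list(gen_scores)
    f, p = stats.f_oneway(*[gen_scores[g] for g in groups])
    if p >= alpha:
        return f, p, None  # cannot reject the null hypothesis
    values = np.concatenate([np.asarray(gen_scores[g]) for g in groups])
    labels = np.concatenate([[g] * len(gen_scores[g]) for g in groups])
    return f, p, pairwise_tukeyhsd(values, labels)
```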
4.3 Reduced Design Space
| Metric | Positional encoding | Structural encoding | Message passing network | Message passing layers | Hidden channels | Optimizer | Learning rate |
|---|---|---|---|---|---|---|---|
| Source AUC | degree | drnl | GCN | 1 | 64 | Adam | 0.0001 |
| Source Accuracy | degree | drnl | GCN | 1 | 64 | Adam | 0.0001 |
| Source F1 | degree | drnl | GCN | 2 | 64 | SGD | 0.0001 |
| Source Precision | rni | de | GraphSAGE | 1 | 64 | Adam | 0.001 |
| Source Recall | degree | de+ | GCN | 2 | 32 | SGD | 0.0001 |
| Target AUC | degree | drnl | GCN | 1 | 64 | Adam | 0.001 |
| Target Accuracy | degree | drnl | GCN | 1 | 64 | SGD | 0.0001 |
| Target F1 | degree | drnl | GCN | 2 | 64 | SGD | 0.0001 |
| Target Precision | rni | de | GraphSAGE | 1 | 32 | Adam | 0.001 |
| Target Recall | degree | de+ | GCN | 2 | 32 | SGD | 0.0001 |
A key application of the above results is reducing the design space. The number of experiments required to evaluate the full design space is prohibitive (if not effectively intractable with finite resources), and even with Controlled Random Search, each design dimension must still be evaluated. This makes it costly to run enough trials to obtain a strong-performing model, and the cost will grow combinatorially for future work that introduces new design dimensions beyond the scope of the current study. As such, it would be extremely valuable to eliminate the choices in each dimension that we know are not worth exploring, greatly reducing the number of trials required to find well-performing models. Concretely, we seek a subspace \(\mathcal{D}' \subset \mathcal{D}\) such that the average performance of models in \(\mathcal{D}'\) is comparable or superior to those in \(\mathcal{D}\). To find such a subspace, we examine the best-performing choices for each design dimension \(d \in \mathcal{D}\) across all metrics on both the source and target graphs, shown in Table 4.5. Each row shows the best-performing choice in every design dimension for a given metric. We greedily select the columns of this table as our reduced design space; that is, the set of choices retained for each dimension is the union of the best-performing choices across all metrics, as sketched below.
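The greedy selection amounts to taking, for each dimension, the union of choices appearing in the corresponding column of Table 4.5. A sketch is given below, assuming a hypothetical `best_choices` mapping from each metric to its row of best-performing choices.

```python
def reduce_design_space(best_choices):
    """Collect, per design dimension, the union of best-performing choices
    across all metrics (the columns of Table 4.5).

    `best_choices` maps each metric (e.g. "Source AUC") to a dict of
    {design dimension: best choice}; this structure is assumed for illustration.
    """
    reduced = {}
    for per_metric in best_choices.values():
        for dimension, choice in per_metric.items():
            reduced.setdefault(dimension, set()).add(choice)
    return reduced
```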
| Design Dimension | Choices |
|---|---|
| Positional encoding | degree, rni |
| Structural encoding | de, de+, drnl |
| Message passing network | GCN, GraphSAGE |
| Message passing layers | 1, 2 |
| Hidden channels | 32, 64 |
| Optimizer | Adam, SGD |
| Learning rate | 0.0001, 0.001 |
The resulting design space \(\mathcal{D}'\) is shown in Table 4.6. However, for this space to provide genuine utility as a high-quality starting point that can accelerate future research, it must be shown to produce comparable or superior performance to the full space of models. This is the goal of our third research objective, Hypothesis 1.3. Our null hypothesis \(H_\emptyset: \forall m: \: \bar{m}_\mathcal{D} = \bar{m}_{\mathcal{D}'}\) states that the mean of each metric is identical between the full and pruned design spaces, while our experimental hypothesis \(H_1: \exists m: \: \bar{m}_{\mathcal{D}'} - \bar{m}_{\mathcal{D}} > 0\) states that there is one or more metrics whose mean is higher for the reduced space than for the full space, indicating the superiority of the reduced space. We investigate our hypotheses by using classical random sampling to conduct \(100\) trials from each of the full and reduced spaces, computing performance metrics on the target graph. We determine the significance of our results using a one-way ANOVA with Bonferroni correction. Unfortunately, while we observe that the pruned design space outperforms the full space in every metric, we are unable to reject the null hypothesis \(H_\emptyset\), as the test statistics do not reach statistical significance, as shown in Table 4.7. We hypothesize that this is a result of our selection procedure for the reduced space, which may have been too lenient in the options it preserved; identifying an optimal procedure for selecting a reduced design space is therefore a valuable direction for future work.
| Metric | Full | Pruned | F | p |
|---|---|---|---|---|
| Target Accuracy | 0.6531 (\(\pm 0.1303\)) | 0.6647 (\(\pm 0.1324\)) | 0.3904 | 0.5328 |
| Target AUC | 0.7205 (\(\pm 0.1559\)) | 0.7384 (\(\pm 0.1612\)) | 0.6341 | 0.4268 |
| Target F1 | 0.5515 (\(\pm 0.2817\)) | 0.6138 (\(\pm 0.2353\)) | 2.8661 | 0.0920 |
| Target Precision | 0.6486 (\(\pm 0.2456\)) | 0.6894 (\(\pm 0.1760\)) | 1.8215 | 0.1787 |
| Target Recall | 0.5596 (\(\pm 0.3263\)) | 0.6442 (\(\pm 0.2923\)) | 3.7175 | 0.0553 |
4.4 Task Difficulty
Finally, we wish to characterize the impact of the underlying data on the ability of a given model to generalize to unseen data. Our prior belief is that the more similar the source and target graphs are, the smaller the performance degradation will be. This belief is founded on the notion that models learn dynamics specific to particular domains, which may make it harder to generalize. We extend the results of the experiment in Section 4.3 by computing the similarity between the source and target graphs for each trial using the Pyramid Match graph kernel [@???], which yields a real-valued similarity score \(\varsigma \in [0, 1]\) for each pair of graphs. Our null hypothesis \(H_\emptyset\) states that there will be no correlation between \(\varsigma\) and the performance scores, while our alternative hypothesis states the contrary.
Figure 4.9: Correlation between graph similarity and model generalization (\(r=-0.06, p = 0.0572\)).
Interestingly, we are unable to reject the null hypothesis \(H_\emptyset\), observing a very weak Pearson correlation of just \(-0.06\) (\(p = 0.0572\)). While the notion is intuitive, we ultimately cannot conclusively state that any datasets are more or less difficult to generalize between. This result has potentially interesting implications, as it suggests that the generalization ability of ILP models may be agnostic to the underlying data, opening the door to greater exploration of topics such as pre-training.
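A sketch of this analysis is given below, assuming the GraKeL implementation of the Pyramid Match kernel and hypothetical lists of source/target graphs and per-trial generalization scores; `with_labels=False` and `normalize=True` are assumptions made so that the kernel value lies in \([0, 1]\) without node labels.

```python
from grakel import graph_from_networkx
from grakel.kernels import PyramidMatch
from scipy.stats import pearsonr

def similarity_scores(source_graphs, target_graphs):
    """Pyramid Match similarity for each (source, target) pair of networkx graphs.

    Assumes the GraKeL library; with_labels=False and normalize=True are
    assumptions so that the score lies in [0, 1] without requiring node labels.
    """
    sims = []
    for src, tgt in zip(source_graphs, target_graphs):
        kernel = PyramidMatch(with_labels=False, normalize=True)
        K = kernel.fit_transform(list(graph_from_networkx([src, tgt])))
        sims.append(K[0, 1])
    return sims

# Pearson correlation between similarity and per-trial generalization scores
# (hypothetical list `gen_scores` aligned with the graph pairs):
# r, p = pearsonr(similarity_scores(src_graphs, tgt_graphs), gen_scores)
```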