Chapter 4 Evaluation
4.1 Methodology
| Dataset | Domain | # Graphs | Avg. # Nodes | Avg. # Edges | Avg. Degree | Density |
|---|---|---|---|---|---|---|
| Planetoid | Citation | 3 | 8584.00 | 36102.00 | 3.71 | 0.0017 |
| PPI | Biology | 2 (20) | 2495.50 | 66483.00 | 24.89 | 0.0200 |
| Amazon | Product | 2 | 10701.00 | 364942.00 | 33.44 | 0.0067 |
| Twitch | Social Network | 4 (6) | 5059.25 | 124394.75 | 26.65 | 0.1515 |
4.1.0.0.1 Datasets
Throughout all of our experiments, we consider graphs from four datasets spanning different domains: Planetoid (paper citations), PPI (protein-protein interaction), Amazon (product co-purchase), and Twitch (social relationships). Using graphs from multiple domains helps ensure that our results generalize across graphs of different sizes, densities, and interaction dynamics. We provide a brief overview of the datasets in Table 4.1. The number of graphs in each dataset is shown in the third column; for PPI and Twitch, we sample from the total number of graphs in the dataset, shown in parentheses. The remaining columns report the average number of nodes, number of edges, degree, and density of the graphs used in our experiments for each domain.
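The summary statistics in Table 4.1 can be reproduced directly from the graph objects. The sketch below assumes the PyTorch Geometric dataset wrappers (only Planetoid is shown); the exact graph selections and aggregation used for the table are not reproduced here.

```python
from torch_geometric.datasets import Planetoid

def summarize(graphs):
    """Average node count, edge count, degree, and density over a list of graphs.
    Edges are counted in both directions, as stored by PyTorch Geometric."""
    per_graph = []
    for g in graphs:
        n, e = g.num_nodes, g.num_edges
        per_graph.append((n, e, e / n, e / (n * (n - 1))))
    return {name: sum(col) / len(col)
            for name, col in zip(["nodes", "edges", "degree", "density"],
                                 zip(*per_graph))}

# Example: the three Planetoid citation graphs (Cora, CiteSeer, PubMed).
graphs = [Planetoid("data/planetoid", name)[0]
          for name in ("Cora", "CiteSeer", "PubMed")]
print(summarize(graphs))
```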
4.1.0.0.2 Experimental Procedure
We prepare the data for each trial by partitioning each graph’s edges into message passing, training, validation, and testing sets. We reserve \(10\%\) of edges for validation and \(20\%\) for testing, with the remaining edges divided between message passing and training supervision. We compute global encodings using only the training edges to prevent data leakage. We follow our proposed standardized model architecture comprising global and local encodings, message passing layers, and subgraph pooling, as described in Section 3.1. Each model is trained by minimizing the Binary Cross-Entropy (BCE) loss for a maximum of \(2500\) steps and evaluated on the complete test set. Our initial threshold for statistical significance is \(\alpha = 0.05\) for all hypothesis tests.
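As an illustration of the partitioning step, the following sketch splits an edge set into the four subsets described above. The validation and test fractions match the text; the 50/50 division of the remaining edges between message passing and training supervision is an assumption, and negative sampling and undirected-edge handling are omitted.

```python
import numpy as np

def split_edges(edge_index, val_frac=0.1, test_frac=0.2, mp_frac=0.5, seed=0):
    """Partition edges into message passing / training / validation / test sets.

    edge_index: (2, E) array of edges. Illustrative sketch only; the mp_frac
    ratio and other pipeline details are assumptions.
    """
    rng = np.random.default_rng(seed)
    num_edges = edge_index.shape[1]
    perm = rng.permutation(num_edges)

    n_test = int(test_frac * num_edges)
    n_val = int(val_frac * num_edges)
    test, val, rest = np.split(perm, [n_test, n_test + n_val])

    # The remaining edges are divided between message passing edges and
    # training supervision edges.
    n_mp = int(mp_frac * len(rest))
    mp, train = rest[:n_mp], rest[n_mp:]

    return {name: edge_index[:, idx] for name, idx in
            {"message_passing": mp, "train": train, "val": val, "test": test}.items()}
```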
4.2 Evaluation of Model Generalization
4.2.1 Aggregate analysis
Figure 4.1: Distribution of performance metrics on the cross-graph link prediction task.
Our first objective, Hypothesis 1.1, asserts that, in the aggregate, ILP models will demonstrate a moderate decrease in performance when applied to new data. We test this hypothesis by conducting a large number of trials in which various performance measures are obtained on the test edges of both the source and target graphs. In particular, we use Controlled Random Search (CRS) over each of our proposed design dimensions and aggregate the results to assess the overall behavior of ILP models. With \(|S|=25\) and each trial repeated across \(3\) different random seeds, we ultimately conduct \(2,025\) trials, obtaining the AUC, Accuracy, F1, Precision, and Recall measures for the test edges on both the source and target graphs.
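For reference, the following sketch shows one way such a Controlled Random Search could be organized: for each design dimension, a random base configuration is drawn and every choice of that dimension is evaluated while the other dimensions are held fixed, repeated across seeds. This mirrors the trial counts reported in this chapter, but the exact CRS procedure used in our experiments is simplified here.

```python
import random

def crs_trials(design_space, num_base=25, seeds=(0, 1, 2)):
    """Sketch of Controlled Random Search over the design space.

    For every design dimension, draw `num_base` random base configurations and
    sweep each choice of that dimension while holding the other dimensions
    fixed, repeating each configuration across seeds. Illustrative only.
    """
    trials = []
    for dim, choices in design_space.items():
        for _ in range(num_base):
            base = {d: random.choice(c) for d, c in design_space.items()}
            for choice in choices:
                for seed in seeds:
                    trials.append({**base, dim: choice, "seed": seed})
    return trials
```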
| Metric | Source | Target | \(t\) | p |
|---|---|---|---|---|
| AUC | 0.7476 (\(\pm 0.1469\)) | 0.7321 (\(\pm 0.1444\)) | 13.5506 | 2.45E-40 |
| Accuracy | 0.6567 (\(\pm 0.1275\)) | 0.6484 (\(\pm 0.1182\)) | 9.7255 | 3.67E-22 |
| F1 | 0.4973 (\(\pm 0.3047\)) | 0.4955 (\(\pm 0.2937\)) | 0.9092 | 1.82E-01 |
| Precision | 0.7385 (\(\pm 0.2653\)) | 0.7245 (\(\pm 0.2481\)) | 5.4389 | 3.02E-08 |
| Recall | 0.4678 (\(\pm 0.3306\)) | 0.4760 (\(\pm 0.3335\)) | -3.3605 | 1.00E+00 |
We use a one-sided paired-sample \(t\)-test with the null hypothesis \(H_\emptyset: \forall m: \: \bar{m}_s = \bar{m}_t\), namely that the means of the source and target samples are identical for each metric. The alternative hypothesis \(H_1: \exists m: \: \bar{m}_s - \bar{m}_t > 0\) states that there exists some metric for which performance decreases from the source graph to the target. Of course, evaluating these hypotheses repeatedly for multiple metrics increases our chance of a Type-I error. We control for this using Bonferroni correction, which sets our threshold for statistical significance at \(\alpha = \frac{0.05}{5} = 0.01\). Under this regime, we find a statistically significant decrease in AUC, Accuracy, and Precision when applying ILP models to the cross-graph link prediction task, but no significant difference in F1 or Recall. Nonetheless, we reject the null hypothesis \(H_\emptyset\) by demonstrating a performance decrease in one or more metrics, with the relevant significance measures and test statistics shown in Table 4.2.
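The test itself is straightforward to express with SciPy; a minimal sketch is given below, assuming `source_scores` and `target_scores` are dictionaries of per-trial metric arrays paired by trial (hypothetical names).

```python
from scipy import stats

def paired_decrease_tests(source_scores, target_scores, metrics, alpha=0.05):
    """One-sided paired t-tests (H1: source mean > target mean) with a
    Bonferroni-corrected significance threshold.

    `source_scores` and `target_scores` map metric name -> per-trial arrays,
    paired by trial (hypothetical structure).
    """
    threshold = alpha / len(metrics)  # Bonferroni correction over the metrics
    results = {}
    for m in metrics:
        t, p = stats.ttest_rel(source_scores[m], target_scores[m],
                               alternative="greater")
        results[m] = (t, p, p < threshold)
    return results
```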
Figure 4.1 visualizes the distribution of each metric in both the source and target settings, showing that, on average, the source distributions place their mass at higher values than their corresponding target distributions. This result indicates that practitioners may well experience difficulties when applying link prediction models to unseen data; however, it is still possible to achieve comparable (albeit slightly diminished) performance. It also signifies the need for models that generalize better to unseen data in order to mitigate these difficulties. In addition, we observe that in both settings, models achieve strong Precision but struggle with Recall, which may not be suitable for all applications. The existing literature has paid little attention to recall-oriented measures, which is seemingly reflected in these results; developing recall-optimized architectures is a potential direction for future work.
4.2.2 Performance by Design Dimension
This section examines the performance of each choice within each design dimension, across all metrics on both source and target datasets, to comprehensively capture the strengths and weaknesses of each.
4.2.2.0.1 Positional encoding
Figure 4.2 shows the surprising result that none of the positional encoding methods are as effective in this setting as a simple degree-based encoding. The degree-based PE outperforms all others in all metrics on both source and target graphs. Overall, the top three best-performing methods - degree, random node initialization, and random walk - can also arguably be considered the simplest methods, suggesting that high-powered positional encodings do not demonstrably improve performance relative to more lightweight alternatives, and are sometimes even worse than not using a positional encoding at all.
Figure 4.2: Ranking analysis of positional encodings across all metrics. Lower is better.
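For concreteness, a degree-based positional encoding can be as simple as a one-hot of each node's (clipped) degree; the sketch below is illustrative, and the clipping threshold and one-hot form are assumptions rather than the exact encoding used in our experiments.

```python
import torch

def degree_encoding(edge_index, num_nodes, max_degree=64):
    """Degree-based positional encoding: one-hot of each node's clipped degree.

    Illustrative sketch; the exact encoding used in the experiments
    (raw scalar vs. one-hot, clipping threshold) is an assumption.
    """
    deg = torch.bincount(edge_index[0], minlength=num_nodes)
    deg = deg.clamp(max=max_degree)
    return torch.nn.functional.one_hot(deg, num_classes=max_degree + 1).float()
```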
4.2.2.0.2 Structural encoding
Among structural encodings, it is clear that Double Radius Node Labeling (DRNL) and Distance Encoding Plus (DE+) are markedly superior to Distance Encoding (DE), as shown in Figure 4.3. Interestingly, there is a pronounced difference between all three options and None, indicating that using a structural encoding is vital for inductive link prediction.
Figure 4.3: Ranking analysis of structural encodings across all metrics. Lower is better.
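For reference, DRNL assigns each node in the enclosing subgraph a label derived from its distances to the two target nodes. The sketch below follows the hashing function proposed with SEAL; removing the opposite target node before computing each distance, as in the original formulation, is omitted for brevity.

```python
import networkx as nx

def drnl_labels(G, u, v):
    """Double Radius Node Labeling for the enclosing subgraph around link (u, v).

    Targets receive label 1, nodes unreachable from either target receive 0,
    and all other nodes are hashed from their distances to u and v.
    """
    d_u = nx.single_source_shortest_path_length(G, u)
    d_v = nx.single_source_shortest_path_length(G, v)
    labels = {}
    for node in G.nodes:
        if node in (u, v):
            labels[node] = 1
        elif node not in d_u or node not in d_v:
            labels[node] = 0  # unreachable from one of the targets
        else:
            du, dv = d_u[node], d_v[node]
            d = du + dv
            labels[node] = 1 + min(du, dv) + (d // 2) * (d // 2 + d % 2 - 1)
    return labels
```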
4.2.2.0.3 Message passing network
Figure 4.4 shows that the Graph Convolutional Network (GCN) consistently outperforms the alternatives in terms of AUC, accuracy, F1, and recall. However, GraphSAGE produces the highest precision, providing a good example of the importance of considering performance across multiple metrics. These findings suggest that practitioners should generally opt for the GCN, but that those focused on tasks requiring high precision should choose GraphSAGE instead.
Figure 4.4: Ranking analysis of message passing networks across all metrics. Lower is better.
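A minimal message passing encoder that swaps between the two convolution types might look as follows; the hidden size, activation, and layer count are illustrative and not the exact architecture of Section 3.1.

```python
import torch
from torch_geometric.nn import GCNConv, SAGEConv

class MPEncoder(torch.nn.Module):
    """Minimal message passing encoder; hyperparameters are illustrative."""

    def __init__(self, in_dim, hidden_dim=64, num_layers=1, conv="gcn"):
        super().__init__()
        Conv = GCNConv if conv == "gcn" else SAGEConv
        dims = [in_dim] + [hidden_dim] * num_layers
        self.convs = torch.nn.ModuleList(
            [Conv(dims[i], dims[i + 1]) for i in range(num_layers)])

    def forward(self, x, edge_index):
        for conv in self.convs:
            x = torch.relu(conv(x, edge_index))
        return x
```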
4.2.2.0.4 Message passing layers
When considering the number of message passing layers, we find a less-is-more effect. Figure 4.5 shows that one message passing layer is usually sufficient, if not outright superior, and that in almost every case using three message passing layers hampers performance. We believe this is likely an effect of the well-studied oversmoothing phenomenon in message passing networks, in which node embeddings converge toward a common point as the number of message passing rounds increases, ultimately making it difficult to distinguish positive and negative subgraphs.
Figure 4.5: Ranking analysis of message passing layers across all metrics. Lower is better.
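A simple way to observe this effect is to track how similar node embeddings become after each round of message passing; the diagnostic below (mean pairwise cosine similarity) is illustrative and was not part of the reported experiments.

```python
import torch
import torch.nn.functional as F

def mean_pairwise_cosine(h):
    """Mean pairwise cosine similarity of node embeddings h with shape (N, d).

    A value approaching 1 as more message passing layers are applied is a
    common symptom of oversmoothing.
    """
    h = F.normalize(h, dim=1)
    sim = h @ h.t()
    n = h.size(0)
    return (sim.sum() - n) / (n * (n - 1))  # exclude the diagonal self-similarities
```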
4.2.2.0.6 Optimizer
Figure 4.7 paints a somewhat unclear picture of which optimizer is best for the inductive link prediction task. Adam is superior in AUC and precision, while SGD is stronger in F1 and recall, and both are comparable in terms of accuracy. This aligns with findings from previous work that Adam is generally comparable or superior to SGD in a naive setting, though SGD can be better when tuned. Though the outright best choice is unclear, our findings help guide practitioners by demonstrating the nuanced differences between the two on a per-metric basis.
Figure 4.7: Ranking analysis of optimizers across all metrics. Lower is better.
4.2.2.0.7 Learning rate
Finally, Figure 4.8 shows that a lower learning rate of \(0.0001\) is generally better for performance. In addition, we again observe a difference in precision, which improves with a slightly higher learning rate of \(0.001\).
Figure 4.8: Ranking analysis of learning rate across all metrics. Lower is better.
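Both of these dimensions translate directly into standard PyTorch configuration; the helper below is a sketch covering the optimizer and learning rate choices in the design space, with all other optimizer hyperparameters left at their defaults (an assumption).

```python
import torch

def make_optimizer(model, name="adam", lr=1e-4):
    """Build the optimizer for a trial; names and values mirror the design
    space (Adam vs. SGD, learning rates 0.01 / 0.001 / 0.0001)."""
    if name == "adam":
        return torch.optim.Adam(model.parameters(), lr=lr)
    return torch.optim.SGD(model.parameters(), lr=lr)
```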
4.2.3 Dimension-level effects
Having considered the intra-dimension differences between the choices for each design dimension, we now turn our attention to inter-dimension differences. For Hypothesis 1.2, we wish to establish whether any design dimensions have a statistically significant impact on generalization ability. Such a difference would indicate that certain dimensions are more important than others when identifying well-generalizing models, allowing us to allocate compute resources more efficiently by spending less time optimizing dimensions that do not significantly impact performance. Note that this does not speak to the importance of the dimension itself, but rather to the range of choices for a given dimension; for example, the choice of learning rate may be very important, yet there may not be a significant difference between the examined values \(0.01\), \(0.001\), and \(0.0001\).
| Metric | F | p |
|---|---|---|
| Accuracy | 6.8304 | 3.48E-07 |
| AUC | 3.9559 | 6.14E-04 |
| F1 | 4.0272 | 5.14E-04 |
| Precision | 1.2034 | 3.01E-01 |
| Recall | 3.1296 | 4.69E-03 |
In order to test for a difference in performance metrics across design dimensions, we use a one-way ANOVA with Bonferroni correction (\(\alpha = \frac{0.05}{5} = 0.01\)). We use the same observations as in Section 4.2.1, in which we use Controlled Random Search to sample \(25 \cdot |d|\) trials for each design dimension \(d \in \mathcal{D}\) and repeat each trial with \(3\) random seeds, resulting in \(2,025\) total trials. For each metric from each trial we compute the generalization score as the difference between the metric on the source and target graphs. Our null hypothesis \(H_\emptyset: \forall \mu_{d_i, m}, \mu_{d_j, m}: \: \mu_{d_i, m} - \mu_{d_j, m} = 0, \; d_i, d_j \in \mathcal{D}\) asserts that, for each metric \(m\), the mean generalization ability is identical across all design dimensions. In contrast, our alternative hypothesis \(H_1: \exists \mu_{d_i, m}, \mu_{d_j, m}: \: \mu_{d_i, m} - \mu_{d_j, m} \neq 0, \; d_i, d_j \in \mathcal{D}\) asserts that, for each metric \(m\), there exists at least one pair of design dimensions with different means. Table 4.3 shows that we may reject the null hypothesis by observing a statistically significant difference in generalization ability between design dimensions for Accuracy, AUC, F1, and Recall. However, we are unable to reject the null hypothesis for Precision.
| Metric | Group 1 | Group 2 | Mean diff. | p |
|---|---|---|---|---|
| AUC | model | pe | -0.0156 | 0.00021 |
| AUC | layers | pe | -0.0122 | 0.0411 |
| Accuracy | pe | se | 0.0111 | 0.00154 |
| Accuracy | lr | pe | -0.0116 | 0.00228 |
| Accuracy | model | pe | -0.0148 | 5.37e-07 |
| Accuracy | channels | pe | -0.0125 | 0.000705 |
| F1 | layers | model | 0.0261 | 0.0114 |
| F1 | model | pe | -0.0277 | 0.000193 |
| Recall | model | pe | -0.0273 | 0.00564 |
This insight signifies that certain design dimensions may have a greater effect (positive or negative) on generalization ability than others. However, a limitation of the ANOVA test is that it does not tell us which dimensions differ. Having rejected the null hypothesis \(H_\emptyset\) for the Accuracy, AUC, F1, and Recall metrics, we perform a post-hoc analysis using Tukey’s Honestly Significant Difference (HSD) test to better understand the impact of each design dimension on generalization ability, with the significant pairwise differences shown in Table 4.4.
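Both steps are standard; a sketch using SciPy and statsmodels is shown below, assuming `gen_scores` maps each design dimension to its array of per-trial generalization scores (a hypothetical structure).

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def dimension_effects(gen_scores, alpha=0.01):
    """One-way ANOVA over design dimensions followed by Tukey's HSD post-hoc test.

    `gen_scores` maps each design dimension to an array of per-trial
    generalization scores (source metric minus target metric).
    """
    groups = list(gen_scores)
    f, p = stats.f_oneway(*[gen_scores[g] for g in groups])
    if p >= alpha:
        return f, p, None  # cannot reject the null hypothesis
    values = np.concatenate([np.asarray(gen_scores[g]) for g in groups])
    labels = np.concatenate([[g] * len(gen_scores[g]) for g in groups])
    return f, p, pairwise_tukeyhsd(values, labels)
```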
4.3 Reduced Design Space
| Metric | Positional encoding | Structural encoding | Message passing network | Message passing layers | Hidden channels | Optimizer | Learning rate |
|---|---|---|---|---|---|---|---|
| Source AUC | degree | drnl | GCN | 1 | 64 | Adam | 0.0001 |
| Source Accuracy | degree | drnl | GCN | 1 | 64 | Adam | 0.0001 |
| Source F1 | degree | drnl | GCN | 2 | 64 | SGD | 0.0001 |
| Source Precision | rni | de | GraphSAGE | 1 | 64 | Adam | 0.001 |
| Source Recall | degree | de+ | GCN | 2 | 32 | SGD | 0.0001 |
| Target AUC | degree | drnl | GCN | 1 | 64 | Adam | 0.001 |
| Target Accuracy | degree | drnl | GCN | 1 | 64 | SGD | 0.0001 |
| Target F1 | degree | drnl | GCN | 2 | 64 | SGD | 0.0001 |
| Target Precision | rni | de | GraphSAGE | 1 | 32 | Adam | 0.001 |
| Target Recall | degree | de+ | GCN | 2 | 32 | SGD | 0.0001 |
A key application of the above results is reducing the design space. The number of experiments required to evaluate the full design space is prohibitive (if not effectively intractable with finite resources), and even with Controlled Random Search, each design dimension must still be evaluated. This makes it costly to run enough trials to obtain a strong-performing model, and the cost will grow combinatorially for future work that introduces new design dimensions beyond the scope of the current study. As such, it would be extremely valuable to eliminate the choices in each dimension that we know are not worth exploring, greatly reducing the number of trials required to find well-performing models. Concretely, we seek a subspace \(\mathcal{D}' \subset \mathcal{D}\) such that the average performance of models in \(\mathcal{D}'\) is comparable or superior to those in \(\mathcal{D}\). To find such a subspace, we examine the best-performing choices for each design dimension \(d \in \mathcal{D}\) across all metrics on both the source and target graphs, shown in Table 4.5. Each row shows the best-performing choice in every design dimension for a given metric. We greedily select the columns of this table as our reduced design space; that is, the set of choices retained for each dimension is the union of the best-performing choices across all metrics, as sketched below.
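The greedy selection amounts to taking, for each dimension, the union of choices appearing in the corresponding column of Table 4.5. A sketch is given below, assuming a hypothetical `best_choices` mapping from each metric to its row of best-performing choices.

```python
def reduce_design_space(best_choices):
    """Collect, per design dimension, the union of best-performing choices
    across all metrics (the columns of Table 4.5).

    `best_choices` maps each metric (e.g. "Source AUC") to a dict of
    {design dimension: best choice}; this structure is assumed for illustration.
    """
    reduced = {}
    for per_metric in best_choices.values():
        for dimension, choice in per_metric.items():
            reduced.setdefault(dimension, set()).add(choice)
    return reduced
```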
| Design Dimension | Choices |
|---|---|
| Positional encoding | degree, rni |
| Structural encoding | de, de+, drnl |
| Message passing network | GCN, GraphSAGE |
| Message passing layers | 1, 2 |
| Hidden channels | 32, 64 |
| Optimizer | Adam, SGD |
| Learning rate | 0.0001, 0.001 |
The resulting design space \(\mathcal{D}'\) is shown in Table 4.6. However, for this space to provide genuine utility as a high-quality starting point that can accelerate future research, it must be shown to produce comparable or superior performance to the full space of models. This is the goal of our third research objective, Hypothesis 1.3. Our null hypothesis \(H_\emptyset: \forall m: \: \bar{m}_\mathcal{D} = \bar{m}_{\mathcal{D}'}\) states that the mean of each metric is identical between the full and pruned design spaces, while our experimental hypothesis \(H_1: \exists m: \: \bar{m}_{\mathcal{D}'} - \bar{m}_{\mathcal{D}} > 0\) states that there is one or more metrics whose mean is higher for the reduced space than for the full space, indicating the superiority of the reduced space. We investigate our hypotheses by using classical random sampling to conduct \(100\) trials from each of the full and reduced spaces, computing performance metrics on the target graph. We determine the significance of our results using a one-way ANOVA with Bonferroni correction. Unfortunately, while we observe that the pruned design space outperforms the full space in every metric, we are unable to reject the null hypothesis \(H_\emptyset\), as the test statistics do not reach statistical significance, as shown in Table 4.7. We hypothesize that this is a result of our selection procedure for the reduced space, which may have been too lenient in the options it preserved; identifying an optimal procedure for selecting a reduced design space is therefore a valuable direction for future work.
| Metric | Full | Pruned | F | p |
|---|---|---|---|---|
| Target Accuracy | 0.6531 (\(\pm 0.1303\)) | 0.6647 (\(\pm 0.1324\)) | 0.3904 | 0.5328 |
| Target AUC | 0.7205 (\(\pm 0.1559\)) | 0.7384 (\(\pm 0.1612\)) | 0.6341 | 0.4268 |
| Target F1 | 0.5515 (\(\pm 0.2817\)) | 0.6138 (\(\pm 0.2353\)) | 2.8661 | 0.0920 |
| Target Precision | 0.6486 (\(\pm 0.2456\)) | 0.6894 (\(\pm 0.1760\)) | 1.8215 | 0.1787 |
| Target Recall | 0.5596 (\(\pm 0.3263\)) | 0.6442 (\(\pm 0.2923\)) | 3.7175 | 0.0553 |
4.4 Task Difficulty
Finally, we wish to characterize the impact of the underlying data on the ability of a given model to generalize to unseen data. Our prior belief is that the more similar the source and target graphs are, the smaller the performance degradation will be. This belief is founded on the notion that models learn dynamics specific to particular domains, which may make it harder to generalize. We extend the results of the experiment in Section 4.3 by computing the similarity between the source and target graphs for each trial using the Pyramid Match graph kernel [@???], which yields a real-valued similarity score \(\varsigma \in [0, 1]\) for each pair of graphs. Our null hypothesis \(H_\emptyset\) states that there will be no correlation between \(\varsigma\) and the performance scores, while our alternative hypothesis states the contrary.
Figure 4.9: Correlation between graph similarity and model generalization (\(r=-0.06, p = 0.0572\)).
Interestingly, we are unable to reject the null hypothesis \(H_\emptyset\), observing a very weak Pearson correlation of just \(-0.06\) (\(p = 0.0572\)). While the notion is intuitive, we ultimately cannot conclusively state that any datasets are more or less difficult to generalize between. This result has potentially interesting implications, as it suggests that the generalization ability of ILP models may be agnostic to the underlying data, opening the door to greater exploration of topics such as pre-training.
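A sketch of this analysis is given below, assuming the GraKeL implementation of the Pyramid Match kernel and hypothetical lists of source/target graphs and per-trial generalization scores; `with_labels=False` and `normalize=True` are assumptions made so that the kernel value lies in \([0, 1]\) without node labels.

```python
from grakel import graph_from_networkx
from grakel.kernels import PyramidMatch
from scipy.stats import pearsonr

def similarity_scores(source_graphs, target_graphs):
    """Pyramid Match similarity for each (source, target) pair of networkx graphs.

    Assumes the GraKeL library; with_labels=False and normalize=True are
    assumptions so that the score lies in [0, 1] without requiring node labels.
    """
    sims = []
    for src, tgt in zip(source_graphs, target_graphs):
        kernel = PyramidMatch(with_labels=False, normalize=True)
        K = kernel.fit_transform(list(graph_from_networkx([src, tgt])))
        sims.append(K[0, 1])
    return sims

# Pearson correlation between similarity and per-trial generalization scores
# (hypothetical list `gen_scores` aligned with the graph pairs):
# r, p = pearsonr(similarity_scores(src_graphs, tgt_graphs), gen_scores)
```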