Chapter 1 Introduction
1.1 Motivation
Complex networks are ubiquitous. From the technology powering our everyday communications and interactions to the structures that form our DNA, networks are a vital component of modern life. Their wide-ranging relevance and importance have spurred a large body of research focused on understanding their sophisticated dynamics. For example, modeling travel patterns as networks can help forecast traffic levels (Cui, Henrickson, Ke, & Wang, 2018; Jiang & Luo, 2021) to more accurately estimate arrival times (Derrow-Pinion et al., 2021; Fang et al., 2020; D. Wang, Zhang, Cao, Li, & Zheng, 2018) and optimize the availability of suitable alternatives (Jiang, 2022; Yao et al., 2018), or even help prevent or mitigate the spread of infectious diseases in epidemics and pandemics (Colizza, Barrat, Barthelemy, & Vespignani, 2005; Kamp, Moslonka-Lefebvre, & Alizon, 2013; So, Chu, Tiwari, & Chan, 2020). Similarly, understanding the interactions occurring in biological structures can help advance modern medicine by facilitating the discovery of novel drugs (Abbas et al., 2021; MacLean, 2021). Networks can also represent our interests and preferences: many recommendation scenarios can be framed as bipartite graphs between users and items or products, with network-theoretic algorithms providing data-driven recommendations that enhance the end-user experience by suggesting new and useful items (Daud, Ab Hamid, Saadoon, Sahran, & Anuar, 2020; Huang, Li, & Chen, 2005; Talasu, Jonnalagadda, Pillai, & Rahul, 2017).
Many of these applications are founded on the fundamental task of link prediction (Adamic & Adar, 2003; Liben-Nowell & Kleinberg, 2007; Popescul & Ungar, 2003; Taskar, Wong, Abbeel, & Koller, 2003), which concerns the identification of missing or future edges in complex networks. The importance of link prediction has led to a significant body of research on developing effective and generally applicable techniques, from heuristic methods (Zhou, Lü, & Zhang, 2009) to learning-based approaches (Backstrom & Leskovec, 2010; Hasan, Chaoji, Salem, & Zaki, 2006; Leskovec, Huttenlocher, & Kleinberg, 2010), especially those based on deep learning (Islam, Aridhi, & Smaïl-Tabbone, 2020; Mutlu, Oghaz, Rajabi, & Garibay, 2020). Deep learning methods for link prediction have proven to be quite powerful, automatically learning representations for nodes, edges, and entire graphs without the need for complex feature engineering. An important caveat is that these methods are often transductive, meaning that they learn representations specific to each node in a graph (Grover & Leskovec, 2016; Perozzi, Al-Rfou, & Skiena, 2014; Tang, Qu, Wang, et al., 2015). This is a powerful approach, as it allows the representations to be optimized end-to-end, but it comes with drawbacks that limit its applicability in realistic settings where the full scope of the graph may not be visible during training. This is often the case in domains such as social networks and recommender systems, in which new users register for a service, or in temporal networks where new nodes appear over time. In these scenarios, a transductive model will be unable to generalize to the unseen nodes, as their representations were not optimized during training. While there are solutions (Ma, Guo, Ren, Tang, & Yin, 2020; Perozzi et al., 2014), they are often complex or computationally infeasible to apply in practice, as they require additional training before performing inference. In contrast, inductive approaches (Hamilton, Ying, & Leskovec, 2017; Velickovic et al., 2018) use existing numeric properties (or attributes) associated with each node instead of learning a unique representation for every node; for example, a node (an academic paper) in a citation network may be represented as a binary vector \(\mathbf{x} \in \{0, 1\}^{|V|}\), where \(\mathbf{x}_j = 1\) if the paper contains the \(j\)-th word of a vocabulary \(V\) and \(0\) otherwise. This approach is much simpler, but such properties are not always available in practice, limiting its applicability as well.
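To make the attribute-based representation concrete, the following is a minimal sketch of how such binary bag-of-words node features might be constructed for a citation network. The vocabulary and paper texts are hypothetical placeholders, not data used in this thesis.

```python
# Minimal sketch: binary bag-of-words node features for a citation network.
# The vocabulary and paper texts below are hypothetical placeholders.
import numpy as np

vocabulary = ["graph", "neural", "network", "protein", "traffic"]
word_index = {word: j for j, word in enumerate(vocabulary)}

# Each node (paper) is described only by its text, not by a learned embedding,
# so the same featurization applies to papers never seen during training.
papers = {
    "paper_a": "graph neural network for traffic forecasting",
    "paper_b": "protein interaction network analysis",
}

def bag_of_words(text: str) -> np.ndarray:
    """x_j = 1 if the j-th vocabulary word occurs in the text, else 0."""
    x = np.zeros(len(vocabulary), dtype=np.int64)
    for token in text.lower().split():
        if token in word_index:
            x[word_index[token]] = 1
    return x

features = {pid: bag_of_words(text) for pid, text in papers.items()}
print(features["paper_a"])  # [1 1 1 0 1]
```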
An ideal solution for link prediction in realistic settings would remedy these challenges by automatically and inductively learning node representations end-to-end from the structure of the graph, such that the same model remains effective in the presence of new nodes and edges, and potentially even across entirely different graphs. This ideal model would open the door to new advancements in many real-world applications of link prediction by providing a general, flexible approach capable of handling unseen data without depending on the presence of node attributes. As such, this thesis will focus on the questions and considerations surrounding the construction of a fully inductive, learning-based approach to link prediction, which we detail in the following section.
1.2 Research Objectives
The overarching goal of this thesis is to quantify the ability of inductive link prediction (ILP) models to generalize to unseen data. The objectives of this work are formalized below. In particular, we will evaluate the ability of ILP models to generalize to an entirely new graph, the most challenging inductive learning scenario, in which all test-time data is completely novel. In this section, we lay out our research questions and corresponding hypotheses concerning different aspects of the proposed task, as well as the methodology and measure(s) needed to sufficiently answer each question.
Research Question 1.1 (How well do ILP models generalize to unseen data?) The primary objective of this thesis is to assess the ability of inductive link prediction models to generalize to unseen data. We will quantify this by computing performance metrics (such as AUC, accuracy, or F1-score) for a given model on both seen and unseen data, and measuring (1) the raw generalization ability, as shown by performance on unseen data, and (2) the relative generalization ability, as shown by the percentage increase or decrease in performance from seen to unseen data. We will consider these measures at an aggregate level to determine the overall generalization ability of the model architectures within our proposed framework. We hypothesize that, in the aggregate, ILP models will exhibit a decrease in performance when applied to unseen data.
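As an illustration of these two measures, the following is a minimal sketch of how they could be computed; the metric values are hypothetical placeholders, and the full evaluation protocol is detailed in Section 3.

```python
# Minimal sketch of the two generalization measures described above; the
# AUC values are hypothetical placeholders rather than actual results.
def relative_generalization(seen: float, unseen: float) -> float:
    """Percentage change in a performance metric from seen to unseen data."""
    return 100.0 * (unseen - seen) / seen

seen_auc, unseen_auc = 0.91, 0.84               # hypothetical AUC values
raw = unseen_auc                                # raw generalization: unseen performance
rel = relative_generalization(seen_auc, unseen_auc)
print(f"raw = {raw:.2f}, relative = {rel:+.1f}%")  # raw = 0.84, relative = -7.7%
```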
Research Question 1.2 (What components of an ILP model affect its ability to generalize to unseen data?) It follows from our first objective that there will be variation in ILP models' ability to generalize. We will perform an in-depth examination by investigating how generalization varies depending on the choice of the various components of our framework (e.g., the choice of message passing network). This will be measured by examining the results from RQ1 at an individual level and analyzing how raw and relative generalization are affected by each choice. Our prior belief is that most facets of our framework will show fairly small differences in performance, but that the components pertaining to the construction of initial node representations (positional and structural encodings) will show greater variation across choices and will be the most impactful on generalization overall.
Research Question 1.3 (Is there a subset of ILP model architectures that perform better on unseen data?) To derive actionable insights from our work, we will show the existence of a compact set of ILP model architectures that exhibit better generalization ability relative to the full space of possible architectures. We will show this by using the results of RQ2 to obtain a set of optimal architectures, which we will evaluate in the same way as in RQ1. We will compare the results obtained from the reduced space of architectures to the results obtained from the full space in RQ1 to validate the optimality of our proposed reduced space. Our hypothesis is that our reduced space of optimal architectures will demonstrate a moderate, statistically significant improvement relative to the full space.
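One plausible way to operationalize this comparison (not necessarily the exact procedure used in this thesis) is a one-sided nonparametric test over per-architecture scores on unseen data; the score distributions below are synthetic placeholders.

```python
# Sketch: test whether the reduced architecture space outperforms the full
# design space, using a one-sided Mann-Whitney U test on per-architecture
# unseen-data scores. All scores here are synthetic placeholders.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
full_space_scores = rng.normal(loc=0.78, scale=0.05, size=100)     # hypothetical
reduced_space_scores = rng.normal(loc=0.82, scale=0.04, size=30)   # hypothetical

stat, p_value = mannwhitneyu(reduced_space_scores, full_space_scores,
                             alternative="greater")
print(f"U = {stat:.1f}, p = {p_value:.4f}")
```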
Research Question 1.4 (Are some datasets more difficult to generalize to than others?) Finally, we wish to examine whether all ILP tasks are created equal; that is, whether it is harder for models to generalize to some datasets than to others. Understanding the difficulty of a given dataset is important for researchers and practitioners alike in order to properly contextualize the performance measures obtained from their experiments. The knowledge that it is easy to generalize between, for example, product co-purchase networks indicates to researchers that it would not be sufficient to examine only graphs from that domain and that more diverse datasets are required; it also signals to a practitioner developing a recommender system that applying online or continual learning techniques may be unnecessary, as ILP models would be able to generalize to unseen users with strong performance. We will measure this by evaluating relative generalization in two scenarios: intra- and inter-domain transfer. Intra-domain transfer will compare generalization between two different graphs in the same domain (e.g., between citation networks), while inter-domain transfer will compare generalization between graphs from different domains (e.g., from a citation network to a social network). Crucially, we will also compute the similarity of each pair of graphs in order to quantify the relationship between graph similarity and model generalization. We expect that the more similar two graphs are, the better the relative generalization: models should generalize more easily between similar graphs and less easily between dissimilar ones.
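As a sketch of how this last measurement could be carried out, the snippet below correlates pairwise graph similarity with relative generalization. The graph names, similarity values, and scores are purely hypothetical, and the similarity measure itself is left abstract here.

```python
# Sketch of the RQ4 measurement: correlate pairwise graph similarity with
# relative generalization. All values below are hypothetical placeholders.
from scipy.stats import spearmanr

# (source graph, target graph) -> (similarity, relative generalization in %)
transfer_results = {
    ("citation_a", "citation_b"): (0.85,  -3.1),   # intra-domain transfer
    ("citation_b", "citation_a"): (0.82,  -4.5),
    ("citation_a", "social_a"):   (0.32, -18.0),   # inter-domain transfer
    ("social_a", "copurchase_a"): (0.41, -12.6),
}

similarities = [sim for sim, _ in transfer_results.values()]
rel_generalization = [rel for _, rel in transfer_results.values()]

rho, p_value = spearmanr(similarities, rel_generalization)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```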
| Objective | Hypothesis | Measures |
|---|---|---|
| RQ1 | ILP models will exhibit a decrease in performance when applied to unseen data. | Aggregated raw and relative performance measures (AUC, accuracy, F1) |
| RQ2 | Model components which construct initial node representations, such as positional and structural encodings, will show the highest variance and be the most impactful on generalization. | Individual raw and relative performance measures |
| RQ3 | There exists a reduced subspace of model architectures with superior ability to generalize to unseen data. | Comparison of aggregated measures between full and reduced model spaces |
| RQ4 | Models will be able to generalize more easily between similar graphs and less easily between dissimilar graphs. | Correlation between graph similarity and aggregate relative measures |
The remainder of this thesis will focus on comprehensively answering the above questions. We will first lay the foundations of our work by defining important concepts and notation in Section 2.1, followed by a brief background on the applications of machine learning to graphs in Section 2.2. Sections 2.3 and 2.4 will review the existing body of work concerning deep learning methods for graph representation learning, particularly for link prediction. With the groundwork solidified, Section 3 will detail the experimental methodology used to answer the listed research questions. We begin by proposing a standardized framework for inductive link prediction based on modern graph representation learning (GRL) techniques, such as positional and structural encodings and message passing layers, which is both effective and efficient (a minimal illustrative sketch of this pipeline is given after the contribution list below); we describe the cross-graph ILP task, which we use to evaluate each model’s ability to generalize in the most difficult setting; we then utilize the controlled random search method (You, Ying, & Leskovec, 2020) to efficiently analyze the various model instances within our framework and identify the optimal choices for each design dimension; and finally, we demonstrate that our proposed set of optimal architectures contains a higher concentration of well-performing models in comparison to the full “design space” (You et al., 2020) of model architectures, accelerating future research in the area and lowering practitioners’ barrier to entry by narrowing the search space of candidate models. The results of our experiments are discussed in Section 4, wherein we determine the validity of our hypotheses and discuss their implications. We then describe the software developed throughout this thesis in Section 5, before concluding with a review of our contributions, which can be summarized as follows:
- We conduct a novel study on the generalization abilities of numerous model architectures for the inductive link prediction task. To the best of our knowledge, no previous work has performed a comprehensive evaluation of ILP models. Not only do we evaluate the effectiveness of many architectures, we go further by analyzing specific model components in order to identify key design choices to construct effective ILP models. We also facet our analysis by metric, demonstrating how the best model depends on the task and specific objective at hand.
- We propose a standardized framework for inductive link prediction based on positional and structural encodings, message passing networks, and subgraph pooling. The proposed framework is efficient, effective, and generalizable, making it usable for real-world applications. This framework encapsulates the many model architectures evaluated as part of our study, allowing us to zero in on the most important facets of the framework and the best choices to maximize performance.
- We identify an optimal subspace of model architectures that perform well on the inductive link prediction task. This greatly reduces the number of experiments required to find a well-performing model, benefitting both practitioners and researchers by reducing the barrier to entry in running their own experiments.
- We analyze the effect of the underlying data on generalization ability. The difficulty of generalizing may not be equal across all pairs of graphs, especially between particularly similar (or dissimilar) graphs. To contextualize our observed results, we compare aggregated performance measures to the similarity between the source and target graphs, allowing us to quantify the impact of the underlying data and identify which tasks are easier or harder than others.
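Finally, as referenced in the chapter roadmap above, the following is a minimal, heavily simplified sketch of the kind of pipeline our framework standardizes: structural encodings, message passing, and subgraph pooling for scoring a candidate link. The degree-based encoding, single mean-aggregation layer, and layer sizes are placeholder assumptions for illustration only, not the architectures evaluated in this thesis.

```python
# Illustrative sketch (not the thesis's exact architecture) of the pipeline:
# structural encoding -> message passing -> subgraph pooling -> link score.
import torch
import torch.nn as nn

def degree_encoding(adj: torch.Tensor) -> torch.Tensor:
    """A simple structural encoding: each node's normalized degree."""
    deg = adj.sum(dim=1, keepdim=True)
    return deg / deg.max()

class MeanMessagePassing(nn.Module):
    """One mean-aggregation message passing layer."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neighbor_mean = (adj @ x) / deg
        return torch.relu(self.lin(neighbor_mean))

class ILPModel(nn.Module):
    """Encode -> message passing -> pool the candidate edge's subgraph -> score."""
    def __init__(self, hidden: int = 16):
        super().__init__()
        self.mp = MeanMessagePassing(1, hidden)
        self.score = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, adj: torch.Tensor, u: int, v: int) -> torch.Tensor:
        x = degree_encoding(adj)   # inductive initial features (no node IDs)
        h = self.mp(x, adj)        # message passing over the graph
        # Subgraph pooling: average the representations of the two candidate
        # endpoints and their immediate neighbors.
        mask = adj[u].bool() | adj[v].bool()
        mask[u] = mask[v] = True
        pooled = h[mask].mean(dim=0)
        return torch.sigmoid(self.score(pooled))  # probability-like link score

adj = torch.tensor([[0, 1, 1, 0],
                    [1, 0, 1, 0],
                    [1, 1, 0, 1],
                    [0, 0, 1, 0]], dtype=torch.float)
print(ILPModel()(adj, 0, 3).item())  # score for the candidate edge (0, 3)
```

Because the initial features are computed from graph structure alone, the same (trained) model can, in principle, be applied unchanged to a graph it has never seen, which is exactly the setting studied in this thesis.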