# Graph Foundation Models

Mar 10, 2024

# [Contact Info]

Gmail: jeongiitae6@gmail.com

LinkedIn: https://www.linkedin.com/in/ii-tae-jeong/

# [Contents]

Paper link: https://arxiv.org/pdf/2402.02216.pdf

## Keywords

graph foundation model, recipe for success in graph deep learning, essential challenges and approaches in graph techniques

# Foundation model

- The term “Foundation Model” comes up constantly in artificial intelligence today. A likely cause is the exponential growth of usable models and data as computing resources expand. The trend has shifted from squeezing the most out of a few models and datasets under limited resources to choosing effectively among an abundance of available models and data.
- Today’s paper is not about the architecture improvements covered in previous Omakase issues; it is a survey of how to design a Foundation Model. It should be very helpful for anyone planning an efficient Foundation Model within their organization. Rather than simply asserting that large datasets and popular baseline architectures work best, it examines why Foundation Models succeeded in natural language processing and computer vision, and why carrying those reasons over to graphs could pay off. The paper is readable and credible, and its key points are well organized; it works well as a to-do list when designing a Graph Foundation Model.
- To start with a definition: a Foundation Model is a model that efficiently extracts knowledge during pre-training and then generalizes well across diverse domains and downstream tasks. In other words, it is a kind of Swiss Army knife.
- A crucial ingredient of such a universal model is the vocabulary. The core question is how well the unique information, or knowledge, that the Foundation Model possesses is organized and encoded.
- In natural language processing, the vocabulary is used to split a given sentence into words, phrases, and symbols and map each to an entry. In computer vision, images are split into tokens (patches), each mapped to a vocabulary entry. So how is a vocabulary used for graphs?
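To make the idea concrete, here is a minimal sketch of what a vocabulary does in NLP. The `build_vocab`/`encode` helpers and the whitespace tokenizer are illustrative assumptions, not the paper’s method:

```python
# Toy illustration of an NLP "vocabulary": map text units to discrete IDs.
# (Hypothetical whitespace tokenizer, for illustration only.)

def build_vocab(corpus):
    """Assign an integer ID to every unique token in the corpus."""
    vocab = {}
    for sentence in corpus:
        for token in sentence.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def encode(sentence, vocab, unk_id=-1):
    """Map a sentence to vocabulary IDs; unseen tokens fall back to unk_id."""
    return [vocab.get(tok, unk_id) for tok in sentence.lower().split()]

corpus = ["graphs connect things", "models learn from graphs"]
vocab = build_vocab(corpus)
print(encode("graphs learn", vocab))  # [0, 4]
```

The question the paper raises is what the analogous discrete, reusable units should be for graphs.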

# Graph Data and Graph Foundation model

- Before discussing the graph vocabulary, let’s consider the advantage and the fundamental limitation of graph data, both of which come down to ‘connection’. How things are connected is the essence of graph data, and that connectivity can be read from both a positive and a negative perspective.
- On the positive side, because information propagates along connections, users can adjust how much weight information carries, i.e., they can design the message-passing process as they wish. This flexibility in how information flows is a clear strength.
- On the negative side, because the information flow can be designed arbitrarily, subjectivity creeps in. Connected nodes exchange information, so the result depends heavily on which connections exist, and that dependence can cut either way. It can also be hard to decide what should count as a node and what as an edge.

- Figure 1 illustrates this perspective. Both (v1, v2) and (v3, v4) have two links each; what distinguishes them is that (v1, v2) sit in the left subgraph and (v3, v4) in the right one.
- We can spot this difference easily from the figure. But if we ask a computer to represent (v1, v2) and (v3, v4) with quantitative embedding values, they will likely receive identical numbers, because their local structures are identical.
- This shows that an additional mechanism is needed to express that structures can be the same while contexts differ, which is a significant drawback and fundamental limitation of graph data.
- Conversely, if we can tell a deep learning model that the structures match but the contexts differ, performance is likely to improve.
- This ability to ‘adequately represent different contexts’ is called transferability in the Graph Foundation Model literature. In essence, the Foundation Model must distinguish the contexts of different datasets and express how they differ.
- Organizing this distinguishing information systematically in a vocabulary, and using it at the right time, is the essence of a Graph Foundation Model.
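The indistinguishability described above can be reproduced with 1-WL color refinement, the classic expressiveness test for message passing. The sketch below uses a hypothetical graph of two disjoint triangles standing in for Figure 1; the `wl_colors` helper is illustrative:

```python
# 1-WL color refinement: nodes with identical local structure receive
# identical colors, even when they sit in different (disconnected) subgraphs.
# (Hypothetical two-triangle graph standing in for the paper's Figure 1.)

def wl_colors(adj, iterations=3):
    """Run Weisfeiler-Leman color refinement; return final node colors."""
    colors = {v: 0 for v in adj}  # start with one uniform color
    for _ in range(iterations):
        # Each node's new signature: its color plus its neighbors' colors.
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        colors = {v: palette[signatures[v]] for v in adj}
    return colors

# Two disjoint triangles: (v1, v2, a) on the left, (v3, v4, b) on the right.
adj = {
    "v1": ["v2", "a"], "v2": ["v1", "a"], "a": ["v1", "v2"],
    "v3": ["v4", "b"], "v4": ["v3", "b"], "b": ["v3", "v4"],
}
colors = wl_colors(adj)
print(colors["v1"] == colors["v3"])  # True: WL cannot tell the subgraphs apart
```

Standard message-passing GNNs are bounded by this test, which is why (v1, v2) and (v3, v4) end up with the same embeddings without an extra context mechanism.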

# Graph Vocabulary with Positive / Negative Transferability

- How can we measure whether this distinction is made well? The paper presents guidelines for a Graph Vocabulary tailored to graph tasks such as node classification, link prediction, and graph classification, from three perspectives: network analysis, expressiveness, and stability.
- Network analysis is used to check the approximate distribution of the data, since it can quantify structural patterns. From the transferability perspective, graphs with similar structures are likely to share similar characteristics, so these distributions can gauge whether transfer will work positively or negatively.
- Expressiveness concerns how to guarantee the uniqueness of nodes and links within a graph. Because uniqueness depends on node-edge combinations, the key is to capture it at high quality and build a vocabulary of diverse patterns.
- If the function that maps graphs into vocabulary entries is expressive enough, then when a similar pattern appears later, it can be matched to the pattern entry (a ‘meta’) recorded for it, and that entry is reused at inference time.
- Stability sets boundaries on how faithfully graph data is preserved when converted into vector form. The key is to push expressiveness further, toward isomorphism invariance that also tolerates minor perturbations.
- From the transferability perspective, adding fine-grained, perturbation-aware constraints to expressiveness makes the stored patterns more diverse. The more diverse these patterns are, the more situations can be inferred from them, which is how generalization, the heart of a Foundation Model, improves further.
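As a toy instance of the network-analysis perspective, one might compare degree distributions as a crude signal of structural similarity between two graphs. The helpers below are an illustrative heuristic I am assuming for this sketch, not the paper’s procedure:

```python
# A crude network-analysis check for transferability: compare the degree
# distributions of two graphs. (Illustrative heuristic, not from the paper.)
from collections import Counter

def degree_distribution(adj):
    """Return the normalized degree histogram of a graph."""
    degrees = [len(nbrs) for nbrs in adj.values()]
    counts = Counter(degrees)
    n = len(degrees)
    return {d: c / n for d, c in counts.items()}

def l1_distance(p, q):
    """L1 distance between two degree histograms (0 = identical)."""
    support = set(p) | set(q)
    return sum(abs(p.get(d, 0.0) - q.get(d, 0.0)) for d in support)

path = {1: [2], 2: [1, 3], 3: [2]}            # path graph: degrees 1, 2, 1
triangle = {1: [2, 3], 2: [1, 3], 3: [1, 2]}  # triangle: degrees 2, 2, 2
print(l1_distance(degree_distribution(path), degree_distribution(triangle)))
```

A small distance suggests structurally similar graphs, where positive transfer is more plausible; a large distance warns that transfer may work negatively.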

# Additional considerations: neural scaling laws, model and data scaling, and pretext task design

- While the previous section dealt with how to effectively quantify the core patterns of graphs, this section discusses improving graph deep learning models from the perspectives of data and model scale, as well as task design.

## Situations where the neural scaling law is effective and when it’s not

- As mentioned in the context of transferability, the neural scaling law builds on the observation that data with similar structural distributions ultimately exhibit similar characteristics, so scaling up can improve generalization. In other words, there are universal situations where the neural scaling law applies.
- However, this universality breaks down when humans intervene directly in graph construction and task specification, i.e., in choosing nodes and links and deciding which labels to predict. In such cases, applying the scaling-law formula may be inappropriate.
- For example, OGBN-ARXIV (Hu et al., 2020) and ARXIV-YEAR (Lim et al., 2021) are two node classification datasets with identical graph information. The only difference is that OGBN-ARXIV uses paper categories, while ARXIV-YEAR uses publication years as labels, leading to conflicting properties of homophily and heterophily (Mao et al., 2023a).
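That contrast can be made concrete with the edge homophily ratio: the fraction of edges whose endpoints share a label. The toy graph and labels below are illustrative stand-ins, not the actual OGBN-ARXIV data:

```python
# Edge homophily ratio: fraction of edges whose endpoints share a label.
# The same graph can be homophilous under one labeling and heterophilous
# under another, echoing the OGBN-ARXIV vs ARXIV-YEAR contrast.
# (Toy graph and labels, not the actual datasets.)

def edge_homophily(edges, labels):
    """Share of edges connecting same-label endpoints (1.0 = homophilous)."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]                # one fixed graph
labels_topic = {0: "A", 1: "A", 2: "B", 3: "B"}         # category-like labels
labels_year = {0: 2019, 1: 2020, 2: 2019, 3: 2020}      # year-like labels
print(edge_homophily(edges, labels_topic))  # 0.5
print(edge_homophily(edges, labels_year))   # 0.0
```

The graph never changes; only the labeling does, yet the two tasks have opposite homophily properties, which is why a single scaling recipe cannot be assumed to transfer between them.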

## Designing Pretext Tasks to Compensate for Scarcity of Label Data

- This part addresses what to do when labeled data is relatively scarce. It discusses graph contrastive learning, generative self-supervised learning, and next-token prediction, and explains why each of these three methodologies can help improve performance.
- Additionally, from the perspective of neural scaling laws, it covers data scaling, heterogeneity, synthetic graphs, and model scaling. For more detail, see the referenced papers or the fourth week of February’s Graph Omakase, where similar topics were explored in greater depth.
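As one example of such a pretext task, a graph-contrastive objective can be sketched with an InfoNCE loss over two augmented views of the same nodes. This is a minimal NumPy illustration under the assumption that a GNN encoder over two augmented graphs would normally produce `z1` and `z2`; here random vectors stand in for those embeddings:

```python
# Sketch of a graph-contrastive pretext loss (InfoNCE): embeddings of the
# same node under two augmented views are pulled together, other nodes
# pushed apart. (Minimal illustration; a real pipeline would produce
# z1, z2 with a GNN encoder over two augmented graphs.)
import numpy as np

def info_nce(z1, z2, tau=0.5):
    """InfoNCE loss between row-aligned embeddings of two graph views."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / tau                        # pairwise cosine similarities
    sim = sim - sim.max(axis=1, keepdims=True)   # stabilize the softmax
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # positives on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))                      # stand-in node embeddings
aligned = info_nce(z, z + 0.01 * rng.normal(size=(8, 16)))  # matched views
mismatched = info_nce(z, np.roll(z, 1, axis=0))             # broken pairing
print(aligned < mismatched)  # True: matched views give a lower loss
```

The loss needs no labels at all, which is exactly why contrastive pretext tasks help when labeled data is scarce.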

# Insight

## When to Use Graphs with LLMs

- The key is to convey information based on a holistic view of the graph, in addition to the sequential token patterns LLMs already capture. Since LLMs are built on Transformers, their tokens are effectively trained on a fully connected graph.
- Although this is sometimes described as applying a graph perspective, strictly speaking the model is trained with weights on a heterogeneous graph in which only the node types differ (a graph with different types of nodes), while all edge types are the same.
- Graphs can ultimately supplement this by embedding a semantic perspective, where semantic context is reflected and learned on each edge.
- The paper also offers insights on how to adopt subgraph methods, on promising pretext tasks and architecture designs, and on why a GFM (Graph Foundation Model) is necessary, so I recommend taking a look.
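The fully-connected-graph view of attention can be sketched directly: one round of unparameterized self-attention is message passing in which every pair of tokens shares a soft edge. The helper below is a minimal single-head sketch without learned projections, an assumption made to keep the graph analogy visible, not an actual Transformer layer:

```python
# Self-attention as message passing on a fully connected token graph:
# every token aggregates from every token, with softmax attention weights
# acting as soft edge weights. (Minimal single-head sketch, no learned
# query/key/value projections.)
import numpy as np

def attention_message_passing(x):
    """One round of message passing where the 'graph' is fully connected."""
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)                # edge weights for all pairs
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = weights / weights.sum(axis=1, keepdims=True)  # row softmax
    return weights @ x                           # aggregate neighbor messages

tokens = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy token vectors
out = attention_message_passing(tokens)
print(out.shape)  # (3, 2)
```

Every token receives a message from every other token with the same edge type, which is the "all edge types are the same" structure described above; edge-level semantics are exactly what a graph view could add.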

# Recap

- We have traveled from Foundation Models to Graph Foundation Models: how to plan and apply foundation models to graph data, how to generalize across graphs from the transferability perspective, and finally how to handle model and data scaling from a deep learning standpoint.
- If you want to design and use Foundation Models within your company and are unsure which process to follow, this paper can serve as an excellent educational guide.