Neural Scaling Laws on Graphs: do you believe there is a strong relationship between model size, data size, and inference performance in the graph domain?

# [Contact Info]

Gmail: jeongiitae6@gmail.com

# Neural Scaling Laws on Graphs

[Content]

- How does performance relate to model size? To training-data size? Does performance keep improving with more training? These questions are the subject of a field known as neural scaling laws.
- Neural scaling laws have been studied extensively in natural language processing and image processing, underpinning results such as ChatGPT: the principle that more, higher-quality data yields better models has held up well in those domains. But does the same law apply to graphs?
- Unlike the text and images used in natural language and image processing, graphs are irregular, live in non-Euclidean spaces, and have flexible structure, making them a fundamentally different data type. In other words, the structural characteristics inherent to the data vary widely from one dataset to another.
- For example, in text processing, sentences are segmented into tokens that are all interconnected, and these relationships are used for learning. There is no case where some tokens have many connections while others have few, so the connectivity stays even.
- Similarly, in image processing, an image is divided into patches and the relationships between patches are learned to capture context, so uneven connectivity is again rare.
- But what about graphs? Graphs often exhibit skewed degree distributions, with some nodes having many connections and others few. Moreover, domains such as molecular structures or protein configurations extend into three-dimensional space, beyond the two-dimensional x,y plane, adding complexity and nuance to what must be represented.
- In summary, unlike the uniform connectivity of text and image processing, graphs present non-uniform distributions and the need to represent three-dimensional structure. Any exploration of neural scaling laws on graphs must therefore account for these non-uniform distributions and extra dimensional complexity.
- The paper’s authors aim to examine when, where, and how these laws apply, and why certain phenomena occur, from both a ‘model’ and a ‘data’ perspective.
- To validate neural scaling laws, one approach is to treat model parameters, training-set size, and amount of computation as independent variables, with test error as the dependent variable.
- An obvious question concerns the relationship between model parameters and training-set size, and the potential for overfitting. Interestingly, prior papers cited in the related work address exactly this question and find an inverse relationship: exceeding a certain ratio of model parameters to dataset size can hurt performance.
- To investigate how performance varies with dataset size and model size, the experiments split model size into depth and width, fixing depth while adjusting width. Here, depth is the number of layers and width is the hidden dimension. For instance, a model with depth 6 and width 1024 has six layers whose hidden representations are 1024-dimensional.
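As a rough sanity check on what depth and width mean for model size, here is a minimal sketch (not the paper's code) that counts parameters for a plain stack of dense layers; the 128-dimensional input and 10 output classes are hypothetical choices, and real GNN layers add message-passing weights on top.

```python
def param_count(depth, width, in_dim, out_dim):
    """Rough parameter count (weights + biases) for a stack of
    dense layers; treat it as an order-of-magnitude sketch, since
    actual GNN layers carry extra message-passing parameters."""
    dims = [in_dim] + [width] * depth + [out_dim]
    return sum(i * o + o for i, o in zip(dims, dims[1:]))

# A depth-6, width-1024 model as described above, with hypothetical
# 128-dimensional input features and 10 output classes:
print(param_count(6, 1024, 128, 10))  # → 5390346
```

Note how doubling the width roughly quadruples the parameter count, while adding a layer only adds one more width-by-width block: this is why sweeping depth and width separately gives different scaling curves.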

## Insight

The observations made in the paper, and the authors' insights about them, are presented below, split into one section each.

**Observation 1. The model and data scaling behaviors in the graph domain can be described by neural scaling laws.**

- Recall the equation mentioned at the beginning relating the three factors to performance? The authors find that this equation is indeed meaningful: in graph tasks, as model and dataset size increase, performance can improve accordingly, and they note the potential this implies.
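To illustrate the functional form such laws take, here is a small self-contained sketch using synthetic numbers (not the paper's measurements): test error is generated from a power law error(n) = a·n^(−b), and the exponent is recovered by least squares on the log-log points.

```python
import math

# Synthetic illustration (made-up numbers, not the paper's results):
# test error following a clean power law error(n) = a * n**(-b),
# the functional form that scaling-law studies fit to measurements.
a, b = 2.0, 0.5
sizes = [10 ** k for k in range(2, 7)]        # training-set sizes
errors = [a * n ** (-b) for n in sizes]

# Recover the exponent b by least squares on the log-log points:
# log(error) = log(a) - b * log(n), so the slope is -b.
xs = [math.log(n) for n in sizes]
ys = [math.log(e) for e in errors]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
    (x - mx) ** 2 for x in xs
)
b_hat = -slope
print(round(b_hat, 3))  # → 0.5
```

On real measurements the points scatter around the line, but the same log-log fit yields the scaling exponent the paper reports per task.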

**Observation 2. The model scaling collapses in the graph domain often happen due to the overfitting.**

- The paper notes that graph datasets are relatively scarce compared to natural language and image datasets, making overfitting more likely; this is what drives so-called model scaling collapse. Model scaling collapse refers to the phenomenon where the model is too large relative to the training data, so scaling the model up fails to improve performance.
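A toy illustration with made-up loss curves (not figures from the paper): in an overfitting regime, training loss keeps falling while held-out loss bottoms out and turns back up, which is the signature one would watch for.

```python
# Hypothetical training curves illustrating the overfitting behind
# model scaling collapse: the model keeps improving on training loss,
# but validation loss rises once capacity outstrips the data.
train_loss = [1.0, 0.6, 0.35, 0.2, 0.1, 0.05]
val_loss = [1.1, 0.7, 0.50, 0.45, 0.55, 0.70]

def overfit_epoch(val):
    """First epoch at which validation loss starts rising again,
    or None if it is still decreasing."""
    for t in range(1, len(val)):
        if val[t] > val[t - 1]:
            return t
    return None

print(overfit_epoch(val_loss))  # → 4
```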

**Observation 3. Deep graph models with varying depths will have different model scaling behaviors.**

- Observation 3 shows that the case-by-case principle, a common truth in deep learning, also applies to graph deep learning: there is no formula dictating that a certain model is always best across all circumstances, tasks, or architectures.
- Accordingly, a wide range of experiments is run without preconceptions, and the outcomes reveal that different models exhibit different model scaling behaviors.
- For instance, some models improve as layers and parameters increase up to a certain point, after which performance declines, whereas other models improve steadily regardless of the number of layers.
- The experiments in Figures 4, 5, and 6 vary training epochs, parameter counts, and model architectures. In simpler terms, the approach was to experiment extensively, running more training, increasing dimensions, swapping models, adding layers, essentially testing every scenario, and Observations 2 and 3 were derived from these results.
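The two behaviors described above can be sketched with made-up accuracy curves (illustrative numbers, not values from the paper's figures):

```python
# Two hypothetical architectures as depth grows: model A peaks and
# then degrades, model B keeps improving monotonically.
depths = [2, 4, 6, 8, 10]
acc_a = [0.70, 0.78, 0.81, 0.76, 0.70]  # rises, then collapses
acc_b = [0.68, 0.72, 0.75, 0.77, 0.78]  # steady improvement

# The "best depth" therefore differs per architecture:
best_a = depths[acc_a.index(max(acc_a))]
best_b = depths[acc_b.index(max(acc_b))]
print(best_a, best_b)  # → 6 10
```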

**Observation 4. The number of edges is a better data metric for data scaling law compared to the number of graphs.**

- Since each graph has its own topology (the number of edges attached to each node), the paper argues that measuring training data by the number of graphs can be inefficient for graph tasks. It suggests instead that training data be sized with the complexity inside each graph in mind, and tests this hypothesis in Figures 8 and 9.
- Even at the same parameter count, the results show that structurally more complex graph topologies perform better. This indicates that composing training data to adequately reflect graph complexity is crucial for model performance in graph-based tasks.
- Therefore, accounting for within-graph complexity when building training data is an effective approach to graph tasks, and could serve as a foundation for better results in future applications of graph data.
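A minimal sketch of the two data metrics on a hypothetical two-graph dataset: counting graphs treats a tiny triangle and a 100-edge path as equal amounts of data, while counting edges does not.

```python
# Hypothetical mini-dataset: each graph is an edge list.
graphs = [
    [(0, 1), (1, 2), (2, 0)],             # 3-edge triangle
    [(i, i + 1) for i in range(100)],     # 100-edge path
]

# Graph count sees two equal units of data; edge count, the metric
# Observation 4 favors, weighs each graph by its actual complexity.
num_graphs = len(graphs)
num_edges = sum(len(g) for g in graphs)
print(num_graphs, num_edges)  # → 2 103
```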

**Observation 5. Node classification, link prediction, and graph classification tasks could follow the same form of data scaling law with the number of edges as data metric.**

- So far the focus has been mainly on graph classification, but the authors also explore, as a bonus, whether the principle that complex graph structures lead to higher performance carries over to node classification and link prediction. They find that it does.
- This insight is particularly valuable when working with large-scale graphs and deciding when to run experiments. I used to simply inspect a graph dataset's distribution to understand its structure before training; going forward, this paper will likely be a frequent reference for thought experiments about how likely my current data is to overfit. That said, since the paper concentrates on graph classification, there is still much to consider for other tasks such as node classification and link prediction.