YOU MUST EAT THIS GRAPH NEWS, GRAPH OMAKASE. May, Week 4

Jeong Yitae
May 28, 2023


A Generalization of Transformer Networks to Graphs

[https://arxiv.org/pdf/2012.09699.pdf]

Introduction

How far can the Transformer's reach extend? It is hard to guess. With the Transformer now the dominant model across nearly all kinds of data, this paper proposes an extended version that applies the Transformer architecture to graphs.

**This paper was published by Professor Xavier Bresson**, who teaches the ‘CS6208’ course. If you want to study combinations of Transformers and graphs, I think the materials at the link will be useful.

Preliminary

NLP transformer vs. Graph transformer

The Transformer architecture has been successfully applied to both graph and natural language processing (NLP) tasks, but there are some differences in their applications. Here’s a comparison:

  1. Input Representation:
  • Graph: In graph applications, the input typically consists of a graph structure, represented as nodes and edges, along with their associated features. Nodes can have attributes, and edges can have weights or labels.
  • NLP: In NLP applications, the input usually consists of sequential data, such as sentences or documents, represented as sequences of tokens. Each token may have an associated embedding or encoding.
  2. Attention Mechanism:
  • Graph: Graph-based Transformers employ attention mechanisms to capture relationships between nodes in a graph. Attention weights are computed based on the similarity or relevance between nodes, allowing the model to attend to different parts of the graph during processing.
  • NLP: NLP Transformers also use attention mechanisms, but they primarily focus on capturing dependencies and relationships between words or tokens in a sentence or document. Self-attention allows the model to attend to different words at different positions, capturing contextual information.
  3. Task-Specific Layers:
  • Graph: Graph-based Transformers often include additional layers or modules designed specifically for graph-related tasks, such as graph convolutional layers or graph pooling operations. These layers help to incorporate graph structure and leverage it for various tasks like node classification, graph generation, or link prediction.
  • NLP: NLP Transformers typically include task-specific layers like classification or regression heads on top of the Transformer backbone. These layers are responsible for mapping the learned representations to specific output tasks such as sentiment analysis, machine translation, or question answering.
  4. Output Interpretation:
  • Graph: In graph applications, the output of a graph-based Transformer can vary depending on the task. It could involve node-level predictions, graph-level predictions, or even graph structure modifications.
  • NLP: In NLP applications, the output of an NLP Transformer is often focused on tasks like text classification, language modeling, machine translation, or named entity recognition, which typically involve generating predictions or text.
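
To make the attention-mechanism difference in point 2 concrete, here is a minimal sketch of my own (not code from the paper): NLP-style self-attention lets every token attend to every other token, while a graph Transformer restricts attention to pairs connected by an edge. The toy sizes, edge list, and projection matrices below are arbitrary assumptions for illustration.

```python
# Dense (NLP) self-attention vs. neighborhood-masked (graph) attention.
import torch
import torch.nn.functional as F

d = 16                       # hidden size (illustrative)
n = 5                        # number of tokens / nodes
h = torch.randn(n, d)        # token or node features

# Toy edge list of a path graph 0-1-2-3-4, treated as undirected.
edges = torch.tensor([[0, 1], [1, 2], [2, 3], [3, 4]])

# Shared query/key projections.
Wq, Wk = torch.randn(d, d), torch.randn(d, d)
scores = (h @ Wq) @ (h @ Wk).T / d ** 0.5   # raw n x n attention scores

# NLP-style attention: every position attends to every other position.
attn_nlp = F.softmax(scores, dim=-1)

# Graph-style attention: mask out pairs that are not neighbors (self-loops kept).
mask = torch.eye(n, dtype=torch.bool)
mask[edges[:, 0], edges[:, 1]] = True
mask[edges[:, 1], edges[:, 0]] = True
attn_graph = F.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)

print(attn_nlp)     # dense attention over all pairs
print(attn_graph)   # zeros outside the 1-hop neighborhood
```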

Summary

One of the extended features is a layer that can incorporate edge features. It makes the point that connectivity design, the biggest advantage of graphs, can be brought into the graph Transformer. Previously, parameter updates were driven almost entirely by attention computed from node features; this extended version adjusts how that attention intervenes so that it better reflects the graph itself.
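
The sketch below is my own simplification of that idea, not the paper's exact layer (which is multi-head and uses per-dimension edge scores): each edge (i, j) carries a feature vector that is projected into a gate and multiplied into the attention score for that pair before the softmax, so the connection itself shapes how much neighbor j contributes to node i. All sizes and weight names here are illustrative assumptions.

```python
# Edge-feature-modulated sparse attention over a toy graph.
import torch
import torch.nn.functional as F

d = 16
n, m = 5, 4
h = torch.randn(n, d)                            # node features
edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 2, 3, 4]])        # source, target of each edge
e = torch.randn(m, d)                            # one feature vector per edge

Wq, Wk, We = (torch.randn(d, d) for _ in range(3))
q, k = h @ Wq, h @ Wk

src, dst = edge_index
# Score per edge: scaled dot product of the endpoints, gated by the edge feature.
score = (q[dst] * k[src]).sum(-1) / d ** 0.5     # shape (m,)
edge_gate = (e @ We).sum(-1) / d ** 0.5          # one scalar gate per edge
score = score * edge_gate

# Softmax over the incoming edges of each target node (sparse attention).
attn = torch.zeros(m)
for node in range(n):
    idx = (dst == node).nonzero(as_tuple=True)[0]
    if len(idx):
        attn[idx] = F.softmax(score[idx], dim=0)

# Aggregate neighbor messages weighted by attention (nodes with no incoming
# edges simply keep a zero message in this toy version).
out = torch.zeros(n, d)
out.index_add_(0, dst, attn.unsqueeze(-1) * h[src])
print(out.shape)   # (5, 16): updated node representations
```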

Positional encoding is a technique frequently used to alleviate isomorphism issues (nodes that are structurally indistinguishable). It is very interesting to compare and interpret how performance changes from that perspective against the positional encodings used in NLP. I recommend simply adding LapPE to the node features before they enter the first layer and checking how performance changes depending on whether you do or not.
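
For that ablation, here is a small sketch of Laplacian positional encoding (LapPE) in the spirit of the paper: eigenvectors of the graph Laplacian for the k smallest non-trivial eigenvalues are attached to the node features before the first layer. The graph, feature sizes, and the simple concatenation below are my own illustrative choices.

```python
# Laplacian eigenvector positional encoding, attached to node features.
import numpy as np
import networkx as nx

def lap_pe(G: nx.Graph, k: int = 4) -> np.ndarray:
    """Return a (num_nodes, k) positional encoding from Laplacian eigenvectors."""
    L = nx.normalized_laplacian_matrix(G).toarray()
    eigvals, eigvecs = np.linalg.eigh(L)       # eigenvalues in ascending order
    pe = eigvecs[:, 1:k + 1]                   # skip the trivial ~0 eigenvector
    # Eigenvectors are defined only up to sign; random sign flips are commonly
    # applied during training so the model does not latch onto one sign.
    pe *= np.random.choice([-1.0, 1.0], size=(1, pe.shape[1]))
    return pe

G = nx.cycle_graph(8)
x = np.random.randn(8, 16)                     # toy node features
x_with_pe = np.concatenate([x, lap_pe(G)], axis=1)
print(x_with_pe.shape)                         # (8, 20): features with LapPE
```

Running the model once with `x_with_pe` and once with plain `x` is the comparison suggested above.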

Insight

What LayerNorm really does for Attention in Transformers

Blog : [https://lessw.medium.com/what-layernorm-really-does-for-attention-in-transformers-4901ea6d890e]

On the Expressivity Role of LayerNorm in Transformers’ Attention

Paper : [https://arxiv.org/abs/2305.02582]

These materials report results that run contrary to the paper above, which claims that batch normalization performs better than layer normalization. There seem to be many ideas here that could become future work, so I thought it would be good to share them with readers and serve them as this week's omakase side dishes.
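
As a quick aid for readers who want to reproduce that comparison, the snippet below (not tied to either linked article's experiments) just shows the mechanical difference behind it: LayerNorm normalizes each token or node across its feature dimension, while BatchNorm normalizes each feature across the batch, so swapping one for the other in a Transformer block is essentially a one-line change.

```python
# LayerNorm vs. BatchNorm: which axis gets normalized.
import torch
import torch.nn as nn

x = torch.randn(32, 64)          # 32 nodes/tokens with 64 features

ln = nn.LayerNorm(64)
bn = nn.BatchNorm1d(64)

y_ln = ln(x)                     # per-row mean ~0, var ~1 (across features)
y_bn = bn(x)                     # per-column mean ~0, var ~1 (across the batch)

print(y_ln.mean(dim=1)[:3])      # ~0 for each row
print(y_bn.mean(dim=0)[:3])      # ~0 for each feature
```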

Smallworldness in Hypergraphs

[https://arxiv.org/abs/2304.08904]

Introduction

It starts from a famous experiment. The small-world experiment began with the curiosity, ‘What are the chances that two people who do not know each other at all are connected through acquaintances?’, and showed that a message can reach a designated target person within about six steps.

Theories developed since then have contributed to the field to this day, but this paper points out that all of them assume only small-world, pair-wise settings.

So how does the paper overcome this limitation? With hypergraphs. There are also group-wise situations (I will call it a group because three or more nodes are tied together at once).

But if such a situation occurs on a small-world network, can everyone still be connected within just six steps, as in the small-world experiment? And if it is not six steps, which of the two situations does the network fall into?

** pair-wise: node-to-node connection, 1:1

** group-wise: node-to-node connection, 1:N, where N means more than two nodes

Preliminaries

  1. Small-world networks: These are networks that exhibit both high clustering and short path lengths. They are neither completely random nor completely ordered but instead have a “small-world” structure.
  2. Regular lattices: These are networks in which each node has the same number of neighbors, and the connections between nodes are arranged in a regular pattern.
  3. Random graphs: These are networks in which the connections between nodes are formed randomly, without any specific pattern or structure.
  4. Watts-Strogatz algorithm: This is a mechanism for generating small-world networks by starting with a regular lattice and randomly rewiring connections with a given probability.
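
Here is a short networkx sketch of the standard pairwise Watts-Strogatz procedure from item 4, together with the two quantities the summary below relies on: the average clustering coefficient and the average shortest path length. (This is the classic pairwise version, not the paper's hypergraph variant; the parameter values are arbitrary.) Small-worldness shows up as a range of rewiring probabilities p where clustering is still high while path lengths are already short.

```python
# Watts-Strogatz small-world networks: clustering and path length vs. rewiring p.
import networkx as nx

n, k = 200, 6                      # 200 nodes, each wired to its 6 nearest neighbors
for p in [0.0, 0.01, 0.1, 1.0]:    # from regular lattice (p=0) to random graph (p=1)
    G = nx.connected_watts_strogatz_graph(n, k, p, seed=42)
    C = nx.average_clustering(G)
    L = nx.average_shortest_path_length(G)
    print(f"p={p:<5} clustering={C:.3f} avg shortest path={L:.2f}")
```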

Summary

Rewiring algorithms are used to study how hypergraph structure affects the generated networks (regular lattice, random graph). To compare the results before and after rewiring, two metrics are used: one indicating how much clustering forms in the network, and one indicating the shortest path length between pairs of nodes.

As a result, the transition characteristics of the q-uniform hypergraph are similar to the pairwise case, but changing the hyperedge order narrows the range of rewiring probabilities at which the transition occurs; the effect depends on the hyperedge order.
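
To get a feel for what "rewiring a q-uniform hypergraph" means, here is a heavily simplified toy of my own construction, not the paper's algorithm: start from a hyper-ring of q consecutive nodes per hyperedge, rewire each member with probability p, then inspect the pairwise clique expansion with the same two metrics as above. The clique expansion is a simplification; the paper works with hypergraph-native quantities.

```python
# Toy q-uniform hypergraph rewiring, evaluated via its pairwise clique expansion.
import random
import networkx as nx

def hyper_ring(n: int, q: int) -> list[set[int]]:
    """q-uniform hyperedges over consecutive nodes on a ring."""
    return [set((i + j) % n for j in range(q)) for i in range(n)]

def rewire(hyperedges: list[set[int]], n: int, p: float) -> list[set[int]]:
    """With probability p, replace each member of a hyperedge by a random node.
    (Hyperedges may occasionally shrink if the new node is already a member;
    that is acceptable for this illustration.)"""
    out = []
    for he in hyperedges:
        new = set(he)
        for v in list(new):
            if random.random() < p:
                new.discard(v)
                new.add(random.randrange(n))
        out.append(new)
    return out

def clique_expansion(hyperedges: list[set[int]], n: int) -> nx.Graph:
    """Pairwise projection: connect every pair of nodes sharing a hyperedge."""
    G = nx.empty_graph(n)
    for he in hyperedges:
        members = sorted(he)
        G.add_edges_from((u, v) for i, u in enumerate(members) for v in members[i + 1:])
    return G

n, q = 200, 3
for p in [0.0, 0.01, 0.1, 1.0]:
    G = clique_expansion(rewire(hyper_ring(n, q), n, p), n)
    if nx.is_connected(G):
        print(f"p={p:<5} clustering={nx.average_clustering(G):.3f} "
              f"avg shortest path={nx.average_shortest_path_length(G):.2f}")
```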

Insight

I think this paper offers insight into whether group-wise transmission is beneficial or not. For example, consider a situation where people do not only talk 1:1 but also 1:N; I think it becomes easy to understand if you compare whether the information passed along in those conversations, word-of-mouth information, spreads well or poorly.

Although there are clear limitations, in that the experiments use generated networks and do not consider node or edge properties within them, I found the core novelty of the design and the procedure used to demonstrate it very fresh.



Written by Jeong Yitae

LinkedIn: jeongyitae. I'm a graph and network data enthusiast, from hardware to software (applications).
