You MUST EAT THIS GRAPH NEWS, Graph Omakase. February, Week 1

Jeong Yitae
5 min read · Feb 5, 2023


## Network knowledge

Static and dynamic robustness

Dynamic!

  • Do you remember robustness? Last week’s Omakase covered the ‘static’ context; this week is the ‘dynamic’ context! The biggest difference between static and dynamic is the mechanics. It’s also a concept used in epidemiological investigations of COVID-19: the SIR model. The key is to keep an eye on network conditions that change in real time, such as someone being treated and someone being infected. So how do we make a prediction? Well…! Rather than a closed-form prediction, it is more appropriate to simulate how load is transmitted between nodes (see the minimal SIR sketch after this list).
  • The keys to dynamic robustness are ‘time’ and ‘load’. The idea is to infer cascade phenomena by tracking where load flows from and to at a specific time. Through this, we quantify the phenomenon with parameters such as capacity and tolerance, identify which parts are vulnerable, and respond quickly. The concept works not only for diseases, but for any data made up of connections, such as Internet network systems.
  • From a business perspective, it’s also used in viral marketing! Complex-network knowledge is tied to so many areas, not only information spread on SNS but also instantaneous scalability and the ripple effects delivered to other communities :) Isn’t it interesting?!
  • What I’m telling you now is only a fraction of ‘Complex Networks: Structure and Dynamics’. It contains analyses of various phenomena such as rumor spreading, so I recommend taking a look!
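To make the “load flowing over time” idea concrete, here is a minimal sketch of a discrete-time SIR simulation on a network. It assumes networkx, and the infection/recovery rates and the Barabási–Albert topology are illustrative choices of mine, not values from any of the papers above.

```python
# A minimal sketch of a discrete-time SIR simulation on a network.
# beta/gamma are illustrative parameters, not from the article.
import random
import networkx as nx

def simulate_sir(G, beta=0.1, gamma=0.05, seeds=1, steps=50, rng=random.Random(42)):
    """Track S/I/R counts over time as infection spreads along edges."""
    state = {v: "S" for v in G}                # everyone starts susceptible
    for v in rng.sample(list(G), seeds):       # seed the initial infections
        state[v] = "I"
    history = []
    for _ in range(steps):
        infected = [v for v in G if state[v] == "I"]
        for v in infected:
            for u in G.neighbors(v):           # load/disease flows along edges
                if state[u] == "S" and rng.random() < beta:
                    state[u] = "I"
            if rng.random() < gamma:           # infected node recovers
                state[v] = "R"
        history.append({s: sum(1 for v in G if state[v] == s) for s in "SIR"})
    return history

G = nx.barabasi_albert_graph(1000, 3)          # hub-heavy topology: prone to cascades
print(simulate_sir(G)[-1])                     # final S/I/R counts
```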

NodeAug: Semi-Supervised Node Classification with Data Augmentation

https://dl.acm.org/doi/abs/10.1145/3394486.3403063

  • It’s an idea that comes to mind right away when you’re in charge of an imbalanced-data task: data augmentation. Data augmentation is also possible on graphs. In this paper, graph augmentation is carried out by modifying node attributes and the graph structure. The paper also covers how to train on large graphs through subgraph mini-batch training.
  • Data augmentation requires not only a distribution or manipulation stage, but also a step to verify that the augmented data is appropriate. First, augmentation here means randomly removing/replacing edges connected to a node and reconnecting them. The verification step then measures the KL divergence between the graph’s distributions before and after the remove/replace, and minimizes it. It’s very simple, but there are very interesting ideas in the removal process; if you’re going to read the paper, I recommend keeping this in mind (a toy sketch follows this list).
  • Every experiments section follows the simple logic that the authors’ ideas are good, and this paper is no exception. But there is an insight here that is useful for experimental interpretation: which graph embeddings data augmentation is effective for, and why the results came out that way. The method is applied to representative models such as GCN, GAT, and LGCN, and among them GCN’s performance improves enough to stand out. I think it’s worth reading why in the paper! If you hold the view that no data augmentation is universal, it will lead to interesting ideas. 🙂
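As a toy illustration of “perturb, then verify”: the sketch below randomly removes and adds edges, then compares degree distributions before and after with KL divergence. It assumes networkx/numpy/scipy, and the degree-distribution proxy is my simplification for illustration, not NodeAug’s actual objective.

```python
# A toy sketch of graph augmentation with a KL-divergence sanity check.
# The perturbation and the degree-distribution proxy are illustrative only.
import networkx as nx
import numpy as np
from scipy.stats import entropy

def perturb_edges(G, p_remove=0.05, p_add=0.05, rng=np.random.default_rng(0)):
    """Randomly remove existing edges and reconnect with new random ones."""
    H = G.copy()
    for e in list(H.edges()):
        if rng.random() < p_remove:
            H.remove_edge(*e)
    nodes = list(H)
    n_new = int(p_add * G.number_of_edges())
    while n_new > 0:
        u, v = rng.choice(nodes, size=2, replace=False)
        if not H.has_edge(u, v):
            H.add_edge(u, v)
            n_new -= 1
    return H

def degree_kl(G, H, bins=20):
    """KL divergence between degree distributions before/after augmentation."""
    d1 = [d for _, d in G.degree()]
    d2 = [d for _, d in H.degree()]
    hi = max(max(d1), max(d2)) + 1
    p, _ = np.histogram(d1, bins=bins, range=(0, hi), density=True)
    q, _ = np.histogram(d2, bins=bins, range=(0, hi), density=True)
    return entropy(p + 1e-9, q + 1e-9)   # small epsilon avoids division by zero

G = nx.karate_club_graph()
H = perturb_edges(G)
print("KL(before || after) =", degree_kl(G, H))
```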

ByteGraph: A Graph Database for TikTok

https://www.mydistributed.systems/2023/01/bytegraph-graph-database-for-tiktok.html?m=1

  • How does TikTok, famous for its short-video platform, manage data? Intuitively, real-time processing is a very important factor. So they built their own database to manage large volumes of real-time data: ByteGraph!
  • Low latency and high scalability: to satisfy both conditions, they designed the database around three elements: 1. the edge-tree, 2. adaptive optimization, 3. geographic replication.
  • To give a brief description of the three elements: the edge-tree creates and manages a node’s adjacency list in a B-tree structure. Since it’s a tree, it has different kinds of nodes, right? Root, meta, and edge nodes manage the sizes required to read and write data within an upper bound and a lower bound. The upper/lower bound can be thought of as a kind of data-size boundary that decreases or increases disk I/O as the data approaches the bound (see the toy sketch after this list).
  • Adaptive optimization consists of a dynamic thread pool, which decides which thread pool can most efficiently handle an incoming data request, and a secondary edge-tree for when scanning the entire edge-tree would be inefficient because no key matches the data.
  • Geographic replication consists of a single data center, cross-data-center replication within the same region, and cross-data-center replication across different regions, all in a fault-tolerance and high-availability context. You can think of it as a strategy for deploying the DB according to each geographic characteristic.
    - I think it’s a useful reference for anyone curious about designing the high-scalability, low-latency elements that matter at the production level. I remember the experiments compared against many famous GDBs, especially Neptune. It’s amazing that ByteGraph shows high performance encompassing OLTP, OLSP, and OLAP…!
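To make the edge-tree idea more tangible, here is a toy sketch of an adjacency list kept in sorted, size-bounded pages that split when they exceed an upper bound, as at the leaf level of a B-tree. The class name, page structure, and bound values are my assumptions for illustration, not ByteGraph’s actual implementation.

```python
# A toy edge-tree: one vertex's adjacency list in sorted, size-bounded pages.
# Exceeding the upper bound splits a page, keeping per-page I/O cost bounded.
import bisect

PAGE_UPPER_BOUND = 4   # toy value; splits are triggered above this size

class EdgeTree:
    """Adjacency list for one source vertex, stored as sorted edge pages."""
    def __init__(self):
        self.pages = [[]]  # sorted pages of destination vertex ids

    def add_edge(self, dst):
        page = self._find_page(dst)
        bisect.insort(page, dst)
        if len(page) > PAGE_UPPER_BOUND:    # crossing the upper bound
            i = self.pages.index(page)      # triggers a page split
            mid = len(page) // 2
            self.pages[i:i + 1] = [page[:mid], page[mid:]]

    def _find_page(self, dst):
        # "meta" level: route to the page whose key range covers dst
        for page in self.pages:
            if not page or dst <= page[-1]:
                return page
        return self.pages[-1]

    def neighbors(self):
        return [d for page in self.pages for d in page]

t = EdgeTree()
for d in [5, 1, 9, 3, 7, 2]:
    t.add_edge(d)
print(t.pages)        # [[1, 2, 3], [5, 7, 9]] after one split
print(t.neighbors())  # [1, 2, 3, 5, 7, 9]
```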

NFT Wash Trading in the Ethereum Blockchain

https://arxiv.org/pdf/2212.01225.pdf

  • Do you remember NFTs, the crypto assets that were once a craze? NFT valuation was mainly based on an index called rarity. However, NFTs often trade at high prices even though their rarity index is low. Of course, there are many possible reasons, but many of these cases are malicious actions that inflate prices through collusion between users.
  • This paper deals with how those behaviors typically appear. Also, do you remember the strongly connected component we studied the other day? Complex-network knowledge is used here too, to distinguish suspicious behavior within a large amount of data. The part that explains why connected components fit this problem is also interesting (a minimal sketch follows this list).
  • Although interest in NFTs has waned a lot by now, we don’t know the future! I recommend it because the paper gives a good overview of how tx (transaction) detection is typically performed by combining graphs and blockchains 🙂 It’s easy and fun!
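Here is a minimal sketch of the strongly-connected-component intuition: wallets that keep selling an NFT around a closed loop collapse into one SCC. It assumes networkx; the wallet names and the “SCC of size > 1” heuristic are my illustration, not the paper’s exact pipeline.

```python
# A minimal sketch of flagging wash-trading candidates with SCCs.
# Wallet names and the size > 1 heuristic are illustrative only.
import networkx as nx

# Directed trade graph: an edge (u, v) means wallet u sold an NFT to wallet v.
trades = [
    ("A", "B"), ("B", "C"), ("C", "A"),   # closed loop: tokens return to A
    ("D", "E"),                           # ordinary one-way sale
]
G = nx.DiGraph(trades)

# In a wash-trading ring, the same wallets keep trading among themselves,
# so they collapse into one strongly connected component of size > 1.
suspicious = [scc for scc in nx.strongly_connected_components(G) if len(scc) > 1]
print(suspicious)   # one SCC containing A, B, and C
```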

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

https://arxiv.org/pdf/1910.02054.pdf

  • This paper discusses parallelism, and how data parallelism can have a positive effect on model training. It’s easy to say, but I think it’s a paper about engineering skill.
  • It collects various experiments and interpretations by experts on topics such as pipeline parallelism, model parallelism, and CPU offloading. I think it will be of great help to engineers who handle trillion-parameter-scale models (a conceptual sketch follows).
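As a conceptual sketch of the memory-optimization idea: in ZeRO stage 1, each data-parallel rank keeps only a 1/N shard of the optimizer state instead of a full replica. The pure-Python toy below is my illustration of that partitioning; real systems (e.g. DeepSpeed) implement it with collectives such as reduce-scatter and all-gather.

```python
# A conceptual, single-process sketch of ZeRO stage-1 style partitioning:
# each rank owns optimizer state for only a 1/WORLD_SIZE shard of the model.
import numpy as np

WORLD_SIZE = 4
params = np.zeros(16, dtype=np.float32)          # the full (flattened) model

# Each rank owns optimizer state (e.g. a momentum buffer) for its shard only,
# cutting per-rank optimizer memory from O(P) to O(P / WORLD_SIZE).
shards = np.array_split(np.arange(params.size), WORLD_SIZE)
opt_state = {rank: {"m": np.zeros(len(idx))} for rank, idx in enumerate(shards)}

def step(full_grad, lr=0.1, beta=0.9):
    """Each rank updates only its own shard; shards are then 'all-gathered'."""
    for rank, idx in enumerate(shards):
        m = opt_state[rank]["m"]
        m[:] = beta * m + (1 - beta) * full_grad[idx]   # momentum on shard only
        params[idx] -= lr * m                           # update the owned shard

step(np.ones_like(params))
print(params[:4])   # every element updated, but state was never replicated
```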

Big graph with Graph Embedding

This is the second installment of the big-graph series: dealing with big graphs through models. This is where [OGB LSC](https://ogb.stanford.edu/docs/lsc/) has many researchers competing at this very moment.

**Recent Advances in Efficient and Scalable Graph Neural Networks**

https://www.chaitjo.com/post/efficient-gnns/#scalable-and-resource-efficient-gnn-architectures

  • If you look at the ‘Scalable and Resource-Efficient GNN Architectures’ section, it contains all the topics I’m discussing today. The main concept is that feature aggregation incurs excessive memory overhead, so pre-computing it and feeding the result into an MLP (linear layers) makes the model light. Starting with SGC, the motif behind LightGCN, it is explained well (a minimal sketch follows this list).
  • If your goal is to become a GNN ML engineer, I think it’s a must-read blog. It contains various engineering skills and practical graph-embedding papers. It’s a quality post of a kind that’s hard to find.
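Here is a minimal sketch of the pre-computation idea behind SGC: propagate features K hops once, offline, then train only a linear model on the result. Pure numpy; the normalization follows the usual GCN convention, and the toy graph is my own example.

```python
# A minimal sketch of SGC-style pre-computation: compute S^K X once, offline,
# then any linear model/MLP can train on it in plain mini-batches with no
# neighborhood sampling at training time.
import numpy as np

def sgc_features(A, X, K=2):
    """Return S^K X, where S = D^-1/2 (A + I) D^-1/2 (normalized adjacency)."""
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ A_hat @ D_inv_sqrt         # symmetric normalization
    for _ in range(K):                          # K rounds of propagation,
        X = S @ X                               # done once as preprocessing
    return X

# Toy graph: a 4-node path with 2 features per node.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.default_rng(0).normal(size=(4, 2))
print(sgc_features(A, X, K=2))                  # smoothed features, ready for an MLP
```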

All resources were posted on this channel.

https://www.graphusergroup.com/


Written by Jeong Yitae (LinkedIn: jeongyitae), a graph and network data enthusiast from hardware to software (applications).
