Graph FDS | Before we implement the model, we must know the trends in FDS papers - 03

Jeong Yitae
Jan 29, 2024

H2-FDetector: A GNN-based Fraud Detector with Homophilic and Heterophilic Connections

[Paper Info]

[Main Architecture]

“Even at this moment, scammers are actively developing and implementing fraud tactics to maliciously misappropriate goods from innocent citizens. Particularly with the advancement of online channels, there has been a surge in online fraud cases compared to offline, due to increased interactions across various platforms.

The reason for this increase can be attributed to scammers applying ‘hyper-personalized fraud’ tactics that take the characteristics of their targets into account. For example, impersonation crimes that use the keyword ‘children’ to target the elderly are increasing. The younger demographic, with its limited social experience, is increasingly targeted by crimes that impersonate public authorities and exploit fears of future employment restrictions.

To counteract this type of hyper-personalized fraud, new methods are necessary. While various preventative measures exist, this paper proposes an embedding method that simultaneously considers the characteristics of ‘similar’ and ‘dissimilar’ patterns. Previous methods focused solely on embedding ‘similar’ patterns and so omitted the dissimilar ones. That omission matters, because non-criminal patterns are also a crucial feature for telling criminal usage patterns apart.

The paper frames this in terms of ‘homophily’ and ‘heterophily’, concepts traditionally discussed under the heading of ‘similarity’. Briefly, homophily measures whether the nodes connected to a given node share its characteristics, while heterophily measures how dissimilar they are.

As mentioned, accurately measuring ‘homophily’ and ‘heterophily’ in the network and reflecting this in embeddings is crucial.
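
One common way to put a number on this is the edge homophily ratio: the fraction of edges whose two endpoints carry the same label. The sketch below is a minimal illustration in plain PyTorch, not necessarily the exact measure used in the paper; the function name and the toy graph are made up for the example.

```python
import torch


def edge_homophily_ratio(edge_index: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of edges whose two endpoints share the same label."""
    src, dst = edge_index
    same_label = labels[src] == labels[dst]
    return same_label.float().mean().item()


# Toy graph (hypothetical): nodes 0-2 are normal (0), node 3 is fraudulent (1).
edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 2, 3, 0]])
labels = torch.tensor([0, 0, 0, 1])

ratio = edge_homophily_ratio(edge_index, labels)
print(f"homophily: {ratio:.2f}, heterophily: {1 - ratio:.2f}")
```

In fraud graphs the fraudulent nodes are usually a small minority that connects mostly to normal nodes, so the heterophilic share of their edges tends to be high.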

- Label Identification:
Edge label information is used to learn and quantify the unique characteristics of each edge, by feeding it into an embedding layer and then an MLP layer that differentiates each edge’s characteristics. The labels used are edge types, and an auxiliary loss on these edge types is added to refine them and adjust the training loss.

- Connection Aggregation:
Based on the quantified characteristics of the edges, homophilic and heterophilic information is aggregated. Traditional attention is merely a weighted sum, which makes it difficult to tell homophilic and heterophilic characteristics apart. To resolve this, attention is applied separately to the subgraph of each connection type around the central node, and the results are concatenated.

- Prototype Extraction:
To reflect inter-class and intra-class characteristics, prototypes of typical normal and abnormal nodes are built based on distance. The difference in each node’s distance from these representative nodes then serves as an additional criterion for ‘normal’ versus ‘abnormal’.
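
The Prototype Extraction step can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not the paper's exact formulation: each prototype is taken as the mean embedding of the labeled nodes in a class, and a node is scored by the difference between its distances to the two prototypes. The helper names `class_prototypes` and `prototype_scores` are hypothetical.

```python
import torch


def class_prototypes(emb: torch.Tensor, labels: torch.Tensor, num_classes: int = 2) -> torch.Tensor:
    """Assumed prototype definition: mean embedding of the nodes in each class."""
    return torch.stack([emb[labels == c].mean(dim=0) for c in range(num_classes)])


def prototype_scores(emb: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Distance to the normal prototype minus distance to the fraud prototype.

    A positive score means the node lies closer to the fraud prototype.
    """
    dist = torch.cdist(emb, prototypes, p=2)  # shape: [num_nodes, num_classes]
    return dist[:, 0] - dist[:, 1]


# Toy embeddings (hypothetical): two normal nodes near the origin, two fraud nodes near (1, 1).
emb = torch.tensor([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [0.9, 1.1]])
labels = torch.tensor([0, 0, 1, 1])

protos = class_prototypes(emb, labels)
scores = prototype_scores(emb, protos)
pred = (scores > 0).long()
print(pred.tolist())
```

In training, the same distances could feed a loss term that pulls nodes toward their own class prototype and pushes them away from the other one.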

The proposed methodology demonstrates superior performance compared to the CARE-GNN and PC-GNN architectures, which rely on undersampling. This matters especially in the financial sector, where labeling each transaction type is challenging and data distributions are therefore more imbalanced than in other industries. The model proposed in this paper performs well on such imbalanced data, which underlines the importance of heterophily.

Additionally, Figure 3 shows a correlation between the distribution of the model performance results and the measured degree of imbalance. In particular, when data imbalance is high, methods that consider heterophily are confirmed to perform better than undersampling.

Consequently, the method reflecting heterophily is deemed very important. Section 4.4 (Ablation study) compares model variants that exclude the attention-based heterophily component with variants that exclude the prototype-based inter-class and intra-class distance component. The prototype method was shown to be superior, which is interpreted as better distinguishing and reflecting heterophily characteristics based on distance from the given anchor node.

Measuring homophily and heterophily through edge connection types is a very original and persuasive idea. Particularly in an era where the transformer structure is recognized as dominant, this paper reflects the trend well with various ideas on how to utilize attention.”

Although the paper did not come with code to reproduce it, we can implement it using several pytorch_geometric modules. Below, I pick out the useful concepts and match each one to a module.

1. edge-type subgraph

The graph must be a heterogeneous graph, so we need to use ‘HeteroData’. The relevant source code is below.

code from [https://pytorch-geometric.readthedocs.io/en/2.4.0/_modules/torch_geometric/data/hetero_data.html#HeteroData.edge_type_subgraph]

def edge_type_subgraph(self, edge_types: List[EdgeType]) -> 'HeteroData':
    r"""Returns the subgraph induced by the given :obj:`edge_types`, *i.e.*
    the returned :class:`HeteroData` object only contains the edge types
    which are included in :obj:`edge_types`, and only contains the node
    types of the end points which are included in :obj:`node_types`."""
    edge_types = [self._to_canonical(e) for e in edge_types]

    data = copy.copy(self)
    for edge_type in self.edge_types:
        if edge_type not in edge_types:
            del data[edge_type]
    node_types = set(e[0] for e in edge_types)
    node_types |= set(e[-1] for e in edge_types)
    for node_type in self.node_types:
        if node_type not in node_types:
            del data[node_type]
    return data

2. multi-head attention by relation type

import torch
from torch.nn import ModuleDict
from torch_geometric.data import Data
from torch_geometric.nn import GATConv

# Assuming a simple graph structure
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]], dtype=torch.long)
edge_type = torch.tensor([0, 1, 1, 0], dtype=torch.long)  # Example edge types
x = torch.tensor([[-1], [0], [1]], dtype=torch.float)  # Node features

data = Data(x=x, edge_index=edge_index)


class RelationSpecificGAT(torch.nn.Module):
    def __init__(self, in_channels, out_channels, num_relations):
        super().__init__()
        self.out_channels = out_channels
        # One attention layer per relation type.
        # heads=1 keeps the example small; raise `heads` for multi-head attention.
        self.convs = ModuleDict({
            f'rel_{i}': GATConv(in_channels, out_channels, heads=1)
            for i in range(num_relations)
        })

    def forward(self, x, edge_index, edge_type):
        # Initialize output features as zeros (same device and dtype as x)
        out = x.new_zeros(x.size(0), self.out_channels)

        # .tolist() yields plain ints, so the key is 'rel_0' rather than 'rel_tensor(0)'
        for et in edge_type.unique().tolist():
            mask = edge_type == et
            ei = edge_index[:, mask]  # Edges of this relation only
            out = out + self.convs[f'rel_{et}'](x, ei)
        return out


# Define and run the model
model = RelationSpecificGAT(in_channels=1, out_channels=10, num_relations=2)
output = model(data.x, data.edge_index, edge_type)

print(output)

3. auxiliary loss and prototyping to capture the characteristics of heterophily and homophily

import torch


def auxiliary_loss(node_embeddings, edge_index, alpha=0.5):
    # Calculate pairwise distances between all node embeddings (O(N^2) memory)
    distance_matrix = torch.cdist(node_embeddings, node_embeddings, p=2)

    # Extract distances for actual edges
    actual_distances = distance_matrix[edge_index[0], edge_index[1]]

    # Homophily loss: encourage smaller distances between connected nodes
    homophily_loss = actual_distances.mean()

    # Heterophily loss: encourage larger distances between randomly sampled node pairs.
    # The pairs are not filtered, so a few may be connected nodes or self-pairs.
    num_edges = edge_index.size(1)
    random_indices = torch.randint(
        0, node_embeddings.size(0), (2, num_edges), device=node_embeddings.device
    )
    random_distances = distance_matrix[random_indices[0], random_indices[1]]
    heterophily_loss = 1 / (random_distances.mean() + 1e-6)

    # Combine losses
    return alpha * homophily_loss + (1 - alpha) * heterophily_loss
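
The Label Identification step described earlier works on edge labels rather than raw distances, so a version closer to that description could look like the sketch below. It rests on my own assumptions: an edge counts as homophilic when both endpoints share a label, only edges with two labeled endpoints contribute, and the names `EdgeLabelClassifier` and `edge_label_loss` are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EdgeLabelClassifier(nn.Module):
    """Hypothetical MLP predicting whether an edge is homophilic (1) or heterophilic (0)."""

    def __init__(self, emb_dim, hidden_dim=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, emb, edge_index):
        src, dst = edge_index
        # Concatenate the two endpoint embeddings of every edge
        return self.mlp(torch.cat([emb[src], emb[dst]], dim=-1)).squeeze(-1)


def edge_label_loss(logits, edge_index, labels, train_mask):
    """Binary cross-entropy on edges whose two endpoints are both labeled."""
    src, dst = edge_index
    known = train_mask[src] & train_mask[dst]
    if not known.any():
        return logits.sum() * 0.0  # no supervised edges: zero loss, graph kept intact
    target = (labels[src] == labels[dst]).float()
    return F.binary_cross_entropy_with_logits(logits[known], target[known])


torch.manual_seed(0)
emb = torch.randn(4, 8)  # stand-in for learned node embeddings
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 0]])
labels = torch.tensor([0, 0, 1, 1])
train_mask = torch.tensor([True, True, True, False])  # node 3 is unlabeled

clf = EdgeLabelClassifier(emb_dim=8)
logits = clf(emb, edge_index)
loss = edge_label_loss(logits, edge_index, labels, train_mask)
print(loss.item())
```

The resulting term would be added to the node classification loss with a weighting factor, and the predicted edge labels could then decide which edges are aggregated as homophilic and which as heterophilic.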

Written by Jeong Yitae

Linkedin : jeongyitae I'm the graph and network data enthusiast from hardware to software(application)
