Graph FDS | before we implement the model , we must know the trend of FDS paper — 04

6 min readFeb 4, 2024

Pick and Choose: A GNN-based Imbalanced Learning Approach for Fraud Detection

[paperInfo]

[Architecture]

Alibaba Group data shows that only 0.5% of users fail to repay their credit debts, being labeled as defaulters.

Despite various attempts to distinguish these individuals, two main challenges have hindered inference performance. Firstly, irrelevant relationship (edge) information often acts as noise. For example, when spammers temporarily use legitimate IDs to post spam reviews, the authenticity of these IDs dilutes the distinction between spam and non-spam, making it difficult to differentiate. From a machine learning perspective, both feature-based and label-based similarities become muddled with mixed 0s and 1s, leading to unclear differentiation criteria and, consequently, decreased performance.

Secondly, malicious users deliberately avoid interactions with each other by setting rules to disguise their connections, making it appear as if they are unrelated. When the same individual conducts multiple transactions using different IDs, although the transaction patterns are similar, the fact that they are performed under various IDs complicates the identification of these patterns based on the transaction ‘nodes,’ thus degrading performance.

To address these issues, this paper proposes focusing on effectively extracting and distinguishing meaningful ‘pattern’ information by carefully selecting a relatively small amount of abnormal transaction patterns compared to normal ones.

The core idea is straightforward, as indicated by the terms Pick and Choose. To minimize imbalance during mini-batch training, a label-balanced sampling is employed, where the sampling probability is crucial. By setting the probability with the adjacency matrix of the node as the numerator and the label distribution as the denominator, the probability of being included in the subgraph decreases as the denominator (label distribution) increases, and vice versa. This approach ensures a balanced subgraph by regulating based on the abundance or scarcity of labels.

Furthermore, to enhance inference performance from the extracted subgraph, a distance-based function is used for filtering, aiming to increase the amount of meaningful information. This involves over-sampling nodes with fewer labels and under-sampling those with more labels, based on their relative distance, but not indiscriminately. Instead, the decision to sample is guided by the significance of the information each node holds, utilizing a distance-based function.

Unlike traditional distance functions like the Euclidean, which do not consider label information, this idea introduces a fully connected layer that considers label information. The layer’s weights connected to Fraud and those connected to Benign are used to determine similarity; smaller differences indicate similarity, leading to over-sampling, while larger differences lead to under-sampling.

The final model, combining under-sampled and over-sampled results, is trained by minimizing two loss values: the existing GNN loss and an additional loss from the distance-based function derived through the Pick and Choose method.

Experimental results, compared against similar models as baselines, demonstrate the effectiveness of this approach. Specifically, the discussion in RQ1 highlights why this idea performs well in terms of fraud detection, specialized sampling, and focus on minority classes.

However, an interesting observation from the ablation study in Table 5 shows that contrary to expectations, the model (\p) without the label-balanced sampler outperforms both the model (\c) without the neighborhood sampler and the final model applying both. This suggests that, in some cases, simple random sampling followed by under and over sampling may be more effective than extracting subgraphs based on label distribution, which could lead to significant information loss due to the underrepresentation of certain label classes. This interpretation is personal, and the paper attributes the results to potential information loss when extracting subgraphs for less represented label classes.

this section i fetch the code from ‘codelink’ what i wrote introduction. i think core things in this idea were three. 1. aggregation 2. sampling 3. decision sampling from distance function result.

1. Aggregation both inter-intra.

###########################################################
#####################  InterAgg  ##########################
###########################################################

# extract 1-hop neighbor ids from adj lists of each single-relation graph
to_neighs = []
for adj_list in self.adj_lists:
 to_neighs.append([set(adj_list[int(node)]) for node in nodes])

# find unique nodes and their neighbors used in current batch
unique_nodes = set.union(set.union(*to_neighs[0]), set.union(*to_neighs[1]),
       set.union(*to_neighs[2], set(nodes)))

# calculate label-aware scores
if self.cuda:
 batch_features = self.features(torch.cuda.LongTensor(list(unique_nodes)))
 pos_features = self.features(torch.cuda.LongTensor(list(self.train_pos)))
else:
 batch_features = self.features(torch.LongTensor(list(unique_nodes)))
 pos_features = self.features(torch.LongTensor(list(self.train_pos)))
batch_scores = self.label_clf(batch_features)
pos_scores = self.label_clf(pos_features)
id_mapping = {node_id: index for node_id, index in zip(unique_nodes, range(len(unique_nodes)))}

# the label-aware scores for current batch of nodes
center_scores = batch_scores[itemgetter(*nodes)(id_mapping), :]

# get neighbor node id list for each batch node and relation
r1_list = [list(to_neigh) for to_neigh in to_neighs[0]]
r2_list = [list(to_neigh) for to_neigh in to_neighs[1]]
r3_list = [list(to_neigh) for to_neigh in to_neighs[2]]

# assign label-aware scores to neighbor nodes for each batch node and relation
r1_scores = [batch_scores[itemgetter(*to_neigh)(id_mapping), :].view(-1, 2) for to_neigh in r1_list]
r2_scores = [batch_scores[itemgetter(*to_neigh)(id_mapping), :].view(-1, 2) for to_neigh in r2_list]
r3_scores = [batch_scores[itemgetter(*to_neigh)(id_mapping), :].view(-1, 2) for to_neigh in r3_list]

# count the number of neighbors kept for aggregation for each batch node and relation
r1_sample_num_list = [math.ceil(len(neighs) * self.thresholds[0]) for neighs in r1_list]
r2_sample_num_list = [math.ceil(len(neighs) * self.thresholds[1]) for neighs in r2_list]
r3_sample_num_list = [math.ceil(len(neighs) * self.thresholds[2]) for neighs in r3_list]

# intra-aggregation steps for each relation
# Eq. (8) in the paper
r1_feats, r1_scores = self.intra_agg1.forward(nodes, labels, r1_list, center_scores, r1_scores, pos_scores, r1_sample_num_list, train_flag)
r2_feats, r2_scores = self.intra_agg2.forward(nodes, labels, r2_list, center_scores, r2_scores, pos_scores, r2_sample_num_list, train_flag)
r3_feats, r3_scores = self.intra_agg3.forward(nodes, labels, r3_list, center_scores, r3_scores, pos_scores, r3_sample_num_list, train_flag)

# get features or embeddings for batch nodes
if self.cuda and isinstance(nodes, list):
 index = torch.LongTensor(nodes).cuda()
else:
 index = torch.LongTensor(nodes)
self_feats = self.features(index)

# number of nodes in a batch
n = len(nodes)

# concat the intra-aggregated embeddings from each relation
# Eq. (9) in the paper
cat_feats = torch.cat((self_feats, r1_feats, r2_feats, r3_feats), dim=1)

combined = F.relu(cat_feats.mm(self.weight).t())

return combined, center_scores

###########################################################
#####################  IntraAgg  ##########################
###########################################################


# find the unique nodes among batch nodes and the filtered neighbors
unique_nodes_list = list(set.union(*samp_neighs))
unique_nodes = {n: i for i, n in enumerate(unique_nodes_list)}

# intra-relation aggregation only with sampled neighbors
mask = Variable(torch.zeros(len(samp_neighs), len(unique_nodes)))
column_indices = [unique_nodes[n] for samp_neigh in samp_neighs for n in samp_neigh]
row_indices = [i for i in range(len(samp_neighs)) for _ in range(len(samp_neighs[i]))]
mask[row_indices, column_indices] = 1
if self.cuda:
  mask = mask.cuda()
  num_neigh = mask.sum(1, keepdim=True)
  mask = mask.div(num_neigh)  
# mean aggregator
if self.cuda:
 self_feats = self.features(torch.LongTensor(nodes).cuda())
 embed_matrix = self.features(torch.LongTensor(unique_nodes_list).cuda())
else:
 self_feats = self.features(torch.LongTensor(nodes))
 embed_matrix = self.features(torch.LongTensor(unique_nodes_list))
agg_feats = mask.mm(embed_matrix)  # single relation aggregator
cat_feats = torch.cat((self_feats, agg_feats), dim=1)  # concat with last layer
to_feats = F.relu(cat_feats.mm(self.weight))
return to_feats, samp_scores

2. choose step neighbors

###########################################################
############  choose step neighbors  ######################
###########################################################

samp_neighs = []
samp_score_diff = []
for idx, center_score in enumerate(center_scores):
 center_score = center_scores[idx][0]
 neigh_score = neigh_scores[idx][:, 0].view(-1, 1)
 center_score_neigh = center_score.repeat(neigh_score.size()[0], 1)
 neighs_indices = neighs_list[idx]
 num_sample = sample_list[idx]

 # compute the L1-distance of batch nodes and their neighbors
 score_diff_neigh = torch.abs(center_score_neigh - neigh_score).squeeze()
 sorted_score_diff_neigh, sorted_neigh_indices = torch.sort(score_diff_neigh, dim=0, descending=False)
 selected_neigh_indices = sorted_neigh_indices.tolist()

 # top-p sampling according to distance ranking
 if len(neigh_scores[idx]) > num_sample + 1:
  selected_neighs = [neighs_indices[n] for n in selected_neigh_indices[:num_sample]]
  selected_score_diff = sorted_score_diff_neigh.tolist()[:num_sample]
 else:
  selected_neighs = neighs_indices
  selected_score_diff = score_diff_neigh.tolist()
  if isinstance(selected_score_diff, float):
   selected_score_diff = [selected_score_diff]

 if center_labels[idx] == 1:
  num_oversample = int(num_sample * sample_rate)
  center_score_minor = center_score.repeat(minor_scores.size()[0], 1)
  score_diff_minor = torch.abs(center_score_minor - minor_scores[:, 0].view(-1, 1)).squeeze()
  sorted_score_diff_minor, sorted_minor_indices = torch.sort(score_diff_minor, dim=0, descending=False)
  selected_minor_indices = sorted_minor_indices.tolist()
  selected_neighs.extend([minor_list[n] for n in selected_minor_indices[:num_oversample]])
  selected_score_diff.extend(sorted_score_diff_minor.tolist()[:num_oversample])

 samp_neighs.append(set(selected_neighs))
 samp_score_diff.append(selected_score_diff)

return samp_neighs, samp_score_diff

3. Choose step test

###########################################################
###############  choose step test  ########################
###########################################################

samp_neighs = []
samp_scores = []
for idx, center_score in enumerate(center_scores):
 center_score = center_scores[idx][0]
 neigh_score = neigh_scores[idx][:, 0].view(-1, 1)
 center_score = center_score.repeat(neigh_score.size()[0], 1)
 neighs_indices = neighs_list[idx]
 num_sample = sample_list[idx]

 # compute the L1-distance of batch nodes and their neighbors
 score_diff = torch.abs(center_score - neigh_score).squeeze()
 sorted_scores, sorted_indices = torch.sort(score_diff, dim=0, descending=False)
 selected_indices = sorted_indices.tolist()

 # top-p sampling according to distance ranking and thresholds
 if len(neigh_scores[idx]) > num_sample + 1:
  selected_neighs = [neighs_indices[n] for n in selected_indices[:num_sample]]
  selected_scores = sorted_scores.tolist()[:num_sample]
 else:
  selected_neighs = neighs_indices
  selected_scores = score_diff.tolist()
  if isinstance(selected_scores, float):
   selected_scores = [selected_scores]

 samp_neighs.append(set(selected_neighs))
 samp_scores.append(selected_scores)

return samp_neighs, samp_scores

[Contact info]

email — jeongiitae6@gmail.com

linkedin — https://www.linkedin.com/in/ii-tae-jeong/