Interpretable ICO Fraud Prediction Model by Fusing Multi-Source Heterogeneous Data
Introduction
Initial Coin Offerings (ICOs) have emerged as one of the most prominent applications of blockchain technology, representing a novel financing model in which project issuers raise capital by selling project tokens to investors. Compared with traditional financing methods such as equity financing or initial public offerings (IPOs), ICOs offer lower barriers to entry, faster fundraising, and higher efficiency. Since the launch of the first ICO, Mastercoin, in July 2013, the ICO market has grown rapidly, particularly during the 2017-2018 boom. In the first three quarters of 2018, blockchain companies raised $2 billion through ICOs, far more than the $350 million raised through traditional venture capital. As of January 2021, more than 5,728 ICO projects had collectively raised over $27 billion, making ICOs the most popular financing method for blockchain startups.
However, the decentralized and anonymous nature of blockchain technology has also made ICOs a breeding ground for fraudulent activities, including scams, black-market transactions, and money laundering. A notorious example is the 2018 Vietnamese Modern Tech ICO scam, where the company raised $658 million from approximately 32,000 investors before disappearing. This case is just one among many fraudulent incidents that have plagued the cryptocurrency space. Studies estimate that between 2017 and 2020, around 80% of ICOs were either scams or failed projects, with investors losing over $1.9 billion in 2020 alone. In China, the situation is equally concerning, with reports indicating that 90% of ICOs are suspected of intentional fraud, while less than 1% are genuinely used for project development.
Given these challenges, there is an urgent need for reliable ICO fraud prediction and early warning systems to protect investors and maintain a stable financial environment. Existing research on ICO fraud detection primarily relies on machine learning and deep learning techniques. However, these approaches often suffer from limitations such as single-source data dependency, high computational costs, and lack of model interpretability. To address these shortcomings, this paper proposes an interpretable ICO fraud prediction model, IICOFP, which integrates multi-source heterogeneous data, employs advanced feature engineering techniques, and leverages the SHAP framework for enhanced interpretability.
Related Work
Research on ICO fraud prediction has evolved along two main directions: machine learning-based approaches and deep learning-based approaches. Machine learning models typically rely on feature engineering to extract meaningful information from ICO project descriptions, whitepapers, and websites. For instance, Bian et al. developed the first machine learning-based ICO fraud prediction model, IcoRating, which achieved a precision of 83%, a recall of 77%, and an F1-score of 80%. While this model laid a foundation for subsequent research, its performance was limited by the quality of its feature engineering.
Deep learning models, on the other hand, leverage natural language processing (NLP) and neural networks to analyze raw data from whitepapers and websites. For example, Di et al. proposed a graph neural network (GNN) model for ICO classification, but its F1-score was only 59%, falling short of practical application requirements. Similarly, Xu et al. introduced an A-BiRNN model that achieved an F1-score of 73.2% but offered limited interpretability.
Despite these efforts, existing models face two critical challenges: (1) reliance on single-source data, which may lead to incomplete or inaccurate feature representations, and (2) a lack of interpretability, which hinders trust and decision-making in real-world applications. The proposed IICOFP model addresses these gaps by integrating multi-source heterogeneous data, improving feature engineering, and incorporating the SHAP framework for model explainability.
Methodology
Problem Formulation
The IICOFP model is designed as a binary classification system that predicts whether a newly launched ICO is likely to be fraudulent. Given a dataset D with a feature space X^M (where M is the number of features) and a label space Y, the model learns a function f that maps input features to predicted labels. The goal is to optimize this function to achieve high accuracy, precision, recall, and F1-score while maintaining interpretability.
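The formulation above can be written compactly as follows. This is a notational sketch only: the sample count N and the label encoding Y = {0, 1} are assumptions introduced here, since the text states only that the task is binary.

```latex
D = \{(x_i, y_i)\}_{i=1}^{N}, \qquad x_i \in X^M, \qquad y_i \in Y = \{0, 1\}

f : X^M \to Y, \qquad \hat{y}_i = f(x_i)
```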
Feature Engineering
A key innovation of IICOFP is its integration of multi-source heterogeneous data, including:
- Basic Project Information: Token sale duration, team size, and project categories.
- Rating Scores: Comprehensive ratings from platforms like ICObench and ICOmarks.
- Social Media Presence: Number of active social media platforms and engagement metrics.
- Technical Indicators: GitHub repository activity and whitepaper validity.
To preprocess this data, the following steps are applied:
- Missing Value Imputation: Features with missing values are filled using statistical methods (e.g., mode for categorical data).
- One-Hot Encoding: Categorical variables like country names are converted into binary vectors.
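- Tomek-Link Undersampling: A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbors. Removing the majority-class member of each link cleans noisy and borderline samples near the class boundary and improves class balance.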
- Lasso Feature Selection: Reduces dimensionality by retaining only the most predictive features.
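The first three preprocessing steps can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the toy country codes, the 2-D feature values, and the choice of which class is the majority are all hypothetical, and Lasso feature selection is left out to keep the sketch short.

```python
from collections import Counter
import math

def impute_mode(values):
    """Replace None entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def one_hot(values):
    """Encode each category as a binary vector (columns sorted by name)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

def nearest(i, X):
    """Index of the sample closest to X[i] under Euclidean distance."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))

def tomek_link_undersample(X, y, majority_label):
    """Drop majority-class samples that take part in a Tomek link.

    Two samples form a Tomek link when they are mutual nearest
    neighbours and carry different labels.
    """
    nn = [nearest(i, X) for i in range(len(X))]
    drop = set()
    for i, j in enumerate(nn):
        if nn[j] == i and y[i] != y[j]:
            drop.add(i if y[i] == majority_label else j)
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]

# Hypothetical categorical feature with one missing value.
countries = impute_mode(["SG", None, "US", "SG"])
encoded, columns = one_hot(countries)

# Hypothetical 2-D features; label 0 is assumed to be the majority class.
X = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.1, 1.0],
     [2.0, 2.0], [2.3, 2.0], [0.0, 0.5]]
y = [0, 0, 1, 0, 1, 1, 0]
X_clean, y_clean = tomek_link_undersample(X, y, majority_label=0)
```

In this toy data the samples at [1.0, 1.0] and [1.1, 1.0] are mutual nearest neighbours with different labels, so the majority-class point [1.1, 1.0] is removed and the other six samples are kept.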
Model Training with GBDT
The Gradient Boosting Decision Tree (GBDT) algorithm is chosen for its ability to handle complex, noisy, and high-dimensional data. GBDT builds an ensemble of decision trees sequentially, with each tree correcting the errors of its predecessor. This iterative process enhances the model’s robustness and predictive power. Key advantages of GBDT include:
• Non-Linear Relationship Capture: Effectively models intricate patterns in ICO data.
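• Residual Optimization: Each new tree is fit to the residuals (the negative gradient of the loss) of the current ensemble, so later trees concentrate on the samples that are still predicted poorly.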
• Regularization: Controls overfitting through learning rate adjustments and tree depth limits.
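The sequential, residual-fitting idea behind GBDT can be illustrated with a small pure-Python sketch that boosts depth-1 trees (stumps) on the log-loss. It is a simplified teaching example, not the configuration used in the paper: the function names, the two toy features (rating, team size), the label encoding (1 = fraud), and the hyperparameters are all assumptions, and library implementations differ in detail.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, residuals):
    """Least-squares fit of a depth-1 tree to the residuals.

    Returns (feature_index, threshold, left_value, right_value).
    """
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[f] <= t]
            right = [r for row, r in zip(X, residuals) if row[f] > t]
            if not left or not right:
                continue
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lv) ** 2 for r in left)
                   + sum((r - rv) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, f, t, lv, rv)
    return best[1:]

def stump_output(stump, row):
    f, t, lv, rv = stump
    return lv if row[f] <= t else rv

def fit_boosted_stumps(X, y, n_trees=20, learning_rate=0.5):
    """Each round fits a stump to y - p, the negative log-loss gradient."""
    scores = [0.0] * len(X)  # raw log-odds; 0.0 means p = 0.5
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        # The learning rate shrinks each tree's correction (regularization).
        scores = [s + learning_rate * stump_output(stump, row)
                  for s, row in zip(scores, X)]
    return {"stumps": stumps, "learning_rate": learning_rate}

def predict_proba(model, row):
    score = sum(model["learning_rate"] * stump_output(s, row)
                for s in model["stumps"])
    return sigmoid(score)

def log_loss(model, X, y):
    total = 0.0
    for row, yi in zip(X, y):
        p = min(max(predict_proba(model, row), 1e-12), 1 - 1e-12)
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(X)

# Hypothetical rows: [overall rating, team size]; label 1 = fraud.
X = [[4.2, 12], [3.9, 9], [4.5, 15], [2.1, 3], [2.8, 2], [1.9, 4]]
y = [0, 0, 0, 1, 1, 1]
model = fit_boosted_stumps(X, y)
```

Comparing `log_loss` for a model trained with `n_trees=1` against the 20-tree model shows the training loss falling as trees are added, which is the "each tree corrects its predecessor" behaviour described above. A smaller `learning_rate` slows that descent and acts as a regularizer.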
Interpretability with SHAP
To enhance transparency, the SHAP (SHapley Additive exPlanations) framework is employed to analyze feature contributions. SHAP values quantify how much each feature shifts an individual prediction away from the model's average output, enabling stakeholders to understand which factors drive fraud classification. For example, a high SHAP value for “low team rating” indicates that this feature strongly influences the model’s decision to flag an ICO as fraudulent.
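To make the idea of additive feature contributions concrete, the sketch below computes exact Shapley values for a single instance by enumerating feature coalitions. It is an illustration under several assumptions: the scoring function is hand-written and hypothetical (not the trained IICOFP model), features outside a coalition are replaced by a single baseline row, and the team size of 3 is invented. The SHAP library uses far more efficient algorithms for tree ensembles.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance.

    Features outside a coalition take their baseline value. The cost is
    exponential in the number of features, so this only suits toy cases.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_fraud_score(features):
    """Hypothetical hand-written score; higher means more fraud-like."""
    rating, team_size, tokens_billion = features
    score = 0.5
    score += 0.3 if rating < 3.0 else -0.1
    score += 0.1 if team_size < 5 else -0.1
    score += 0.1 if tokens_billion > 1.0 else 0.0
    return score

# Features: [overall rating, team size, token supply in billions].
baseline = [4.0, 10, 0.5]   # hypothetical "typical" project
instance = [2.8, 3, 5.5]    # rating and supply echo the case study; team size is invented
phi = shapley_values(toy_fraud_score, instance, baseline)
```

The contributions sum to the difference between the instance's score and the baseline score, which is the additivity property that SHAP relies on. In this toy example positive values push toward fraud; in a real model the sign depends on which class is encoded as positive.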
Experimental Results
Performance Evaluation
The IICOFP model is evaluated using standard metrics: accuracy, precision, recall, F1-score, and AUC. On a test set of 237 ICOs (121 fraudulent and 116 successful), the model achieves:
• Accuracy: 87.76%
• Precision: 85.37%
• Recall: 90.52%
• F1-Score: 87.87%
• AUC: 87.82%
These results represent a 2%-10% improvement over existing models like IcoRating, XGBoost, and A-BiRNN.
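As a sanity check on how these metrics relate to one another, the sketch below recomputes them from a confusion matrix. The four counts are not reported in the source; they are an inferred set of values that sums to the 237 test samples and reproduces the reported percentages, and they imply a positive class of 116 samples. Treat them as an assumption for illustration only.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    balanced_accuracy = (recall + specificity) / 2
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
    }

# Inferred counts (an assumption, not taken from the source).
metrics = classification_metrics(tp=105, fp=18, fn=11, tn=103)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Under these counts the balanced accuracy equals the reported AUC figure, which is what an AUC computed from hard 0/1 predictions rather than from predicted probabilities would yield. Whether the paper computed AUC that way is not stated in the source.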
Comparative Analysis
Table 4 compares IICOFP with state-of-the-art models, demonstrating its superior performance. For instance, while XGBoost achieves an F1-score of 80.42%, IICOFP reaches 87.87%. Similarly, IICOFP's AUC of 87.82% exceeds that of logistic regression (59.3%); no AUC is given for the GNN model.
Key Findings
- Critical Fraud Indicators:
• Low Overall Rating: Projects with poor expert ratings are more likely to be scams.
• Small Team Size: Fraudulent ICOs often have fewer team members, indicating limited execution capability.
• Limited Social Media Presence: Genuine projects actively engage with investors through multiple platforms.
• Excessive Token Supply: Issuing an unusually high number of tokens is a red flag.
• Prolonged Fundraising Periods: Fraudsters may extend token sales to attract more victims.
- Feature Interactions:
• High-rated projects tend to have larger teams (Figure 7a).
• Shorter token sale durations correlate with higher success rates (Figure 7c).
• Active social media engagement boosts investor confidence (Figure 7d).
- Case Study: A fraudulent ICO with 5.5 billion tokens, a low rating (2.8), and no CEO photo was correctly classified due to negative SHAP contributions from these features (Figure 8).
Conclusion
The IICOFP model advances ICO fraud detection by integrating multi-source data, optimizing feature engineering, and leveraging interpretable machine learning. Its high performance and transparency make it a valuable tool for investors and regulators. Future work will address adversarial attacks (e.g., data poisoning) and concept drift to ensure long-term reliability.
doi.org/10.19734/j.issn.1001-3695.2024.05.0220