Fraud Detection in P2P Lending
Hao Wang - Funplus AI Lab, Beijing, China
-- In the early 2010s, P2P financial companies started to take off in China. The boom in personal lending lasted for nearly a decade, until the Chinese government recently began to tighten regulations to remove illegal practices and deflate bubbles. Fraud detection is crucial for P2P money lending companies because of the strikingly high default and fraud rates among borrowers. On average, a Chinese P2P lending company sees a fraud rate higher than 10% among its borrowers, a very high number compared with conventional banks. A high fraud rate not only means lost money but also drives up interest rates, so credit risk modeling and fraud detection are critical for any P2P corporation. In our paper “Detection of Fraudulent Users in P2P Financial Market”, we demonstrate how we tackled these crucial technical and business problems at HC Financial Service Group using simple machine learning technologies.
Unlike many other IT areas where deep learning has become the prevailing technology, in fraudulent user detection shallow learning is still king. Gradient boosted decision trees and their variants, such as xgboost and lightgbm, together with ensembles of them, dominate common practice. Feature engineering is a vital preprocessing step for these algorithms to work well, and the kinds of data available determine, in turn, what feature engineering can do and how effective it can be.
The data input to our algorithm pipeline is composed of four categories:
Financial Information: Features in this group include the user's personal income, car payments, house rent, etc.
Work Information: Features in this group include the income of the user's company, how long the company has existed, etc.
Transaction Information: Features in this group include the amount of money the user borrows in the transaction, whether the user has submitted applications before, etc.
Demographic Information: Features in this group include the number of the user's family members, etc.
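To make the four groups concrete, here is a hypothetical borrower record flattened into a feature vector; the field names and values are illustrative only, not the actual columns used in the paper:

```python
# Hypothetical feature schema illustrating the four input groups;
# names and values are made up for illustration.
borrower = {
    # Financial information
    "personal_income": 85_000.0,
    "car_payment": 1_200.0,
    "house_rent": 3_500.0,
    # Work information
    "company_income": 2_000_000.0,
    "company_age_years": 6,
    # Transaction information
    "loan_amount": 50_000.0,
    "has_prior_application": True,
    # Demographic information
    "family_members": 3,
}

# Tree-based models consume a flat numeric vector per user
feature_vector = [float(v) for v in borrower.values()]
print(len(feature_vector))  # 8
```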
We iterate through different machine learning models, from linear models such as logistic regression to ensemble methods like random forest and xgboost, and compare the evaluation metrics across experiments. Financial fraud detection is a class imbalance problem: only a small fraction of users misbehave. This poses difficulties for machine learning techniques and calls for evaluation metrics other than raw precision and recall at a fixed threshold. We therefore choose AUC as the evaluation metric. AUC is insensitive to the class ratio in the data distribution and is applicable to most class imbalance problems.
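The appeal of AUC for imbalanced data can be seen from its rank-based definition: it is the probability that a randomly chosen positive (fraudulent) user is scored above a randomly chosen negative one, so the class ratio never enters the formula. A minimal numpy sketch, ignoring tied scores for brevity:

```python
import numpy as np

def auc(y_true, scores):
    """Rank-based AUC: the probability that a random positive
    outranks a random negative. Only ranks enter the formula,
    not the class ratio, which is why AUC suits imbalanced data."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# One fraudulent user among nine legitimate ones: heavy imbalance
y = [0] * 9 + [1]
s = [0.1, 0.2, 0.15, 0.3, 0.25, 0.1, 0.2, 0.3, 0.4, 0.9]
print(auc(y, s))  # 1.0: the positive outranks every negative
```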
We discovered that ensemble methods produce better results than linear models, which is consistent with common knowledge in the industry. We compare the two most prominent models in our paper, namely random forest and xgboost. We resort to PCA and tanh as our two major feature engineering options: PCA because we would like to reduce the dimensionality of our data to a more manageable number, and tanh so that the large, ill-distributed numbers in the input are normalized.
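A minimal numpy sketch of the two preprocessing options; the data here is synthetic and the exact scaling used in the paper may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
# Heavy-tailed, income-like synthetic features (stand-in for real data)
X = rng.lognormal(mean=8, sigma=2, size=(100, 5))

# Option 1: standardize, then squash with tanh so extreme
# values are compressed into (-1, 1)
Z = (X - X.mean(axis=0)) / X.std(axis=0)
X_tanh = np.tanh(Z)

# Option 2: PCA via SVD, keeping the top-k principal components
def pca(X, k):
    Xc = X - X.mean(axis=0)          # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T             # project onto top-k directions

X_pca = pca(X, 2)
print(X_tanh.shape, X_pca.shape)  # (100, 5) (100, 2)
```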
In our experiments, we collected 50K users, trained the model in a class-balanced setting, and tested on data with the real class imbalance. A simple feature-engineering step such as PCA or tanh yields an AUC greater than 0.8 for both the random forest and xgboost models. The best model and feature combination is xgboost with tanh, which yields an AUC as high as 0.88 for most parameter settings generated by grid search.
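One simple way to realize the balanced-training, imbalanced-testing setup is to downsample the majority class for training only, while evaluation keeps the original distribution. A sketch with synthetic labels at roughly the 10% fraud rate mentioned above (the paper does not specify its exact resampling scheme):

```python
import numpy as np

rng = np.random.default_rng(42)

def balance(X, y):
    """Downsample the majority (legitimate) class so training sees
    a 1:1 ratio; the held-out test set stays imbalanced."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    neg_sample = rng.choice(neg, size=len(pos), replace=False)
    idx = np.concatenate([pos, neg_sample])
    rng.shuffle(idx)
    return X[idx], y[idx]

# Synthetic data with ~10% fraud rate, as reported for Chinese P2P lenders
y = (rng.random(1000) < 0.1).astype(int)
X = rng.normal(size=(1000, 4))

Xb, yb = balance(X, y)
print(yb.mean())  # 0.5: the training set is now class balanced
```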
Our research is one of the first published technologies used in P2P financial companies. Industry-wide, the AUC for fraud detection ranges between 0.8 and 0.9. Fraud detection is such an important process in financial institutions that a swarm of startups specializing in data collection and fraud detection have emerged on the Chinese market. However, since 2018 the Chinese government has tightened its data security policy and strengthened law enforcement, and a large number of data companies have been removed from the market due to poor practices. The lesson: technology makes money, but it must also abide by the law.