Categorical Feature Encoding Techniques for Improved Classifier Performance when Dealing with Imbalanced Data of Fraudulent Transactions

Dalia Breskuvienė; Gintautas Dzemyda

doi:10.15837/ijccc.2023.3.5433

Authors

Dalia Breskuvienė Data Science and Digital Technologies Institute, Vilnius University, Lithuania
Gintautas Dzemyda Data Science and Digital Technologies Institute, Vilnius University, Lithuania

DOI:

https://doi.org/10.15837/ijccc.2023.3.5433

Keywords:

imbalanced data, classifier, feature encoding, high-cardinality, fraud detection

Abstract

Fraudulent transaction data tend to have several categorical features with high cardinality. This complicates data preprocessing when the categories in such features have no natural order or meaningful mapping to numerical values. Although many encoding techniques exist, their impact on massive, highly imbalanced datasets has not been thoroughly evaluated.

Our study used two transaction datasets in which fraudulent transactions make up less than 1% of the records. Six encoding methods were employed, each belonging to either the target-agnostic or the target-based group. The experiments involved several machine-learning techniques, including ensemble learning as well as linear and non-linear learning approaches.

Our study emphasizes the importance of choosing an encoding approach that suits both the imbalanced dataset and the machine learning algorithm. Target-based encoding techniques can enhance model performance significantly. Among the encoding methods assessed, the James-Stein and Weight of Evidence (WOE) encoders were the most effective, whereas the CatBoost encoder may not be optimal for imbalanced datasets. Moreover, the curse of dimensionality must be kept in mind when employing encoding techniques such as hashing and One-Hot encoding.
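To make the contrast between target-based and target-agnostic encoding concrete, the following is a minimal sketch of a smoothed Weight of Evidence encoder next to One-Hot encoding. It is not the implementation used in the study: the function names `woe_fit` and `woe_transform`, the `smoothing` parameter, and the toy `merchant` and `fraud` columns are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def woe_fit(categories: pd.Series, target: pd.Series, smoothing: float = 0.5) -> pd.Series:
    """Learn a category -> Weight of Evidence mapping from training data.

    WOE(c) = ln( P(c | fraud) / P(c | non-fraud) ). The additive smoothing term
    is an illustrative assumption that keeps the ratio finite when a category
    has no fraud (or no non-fraud) rows.
    """
    stats = target.groupby(categories).agg(["sum", "count"])
    pos = stats["sum"]                      # fraud rows per category
    neg = stats["count"] - stats["sum"]     # non-fraud rows per category
    total_pos = target.sum()
    total_neg = len(target) - total_pos
    pos_share = (pos + smoothing) / (total_pos + 2 * smoothing)
    neg_share = (neg + smoothing) / (total_neg + 2 * smoothing)
    return np.log(pos_share / neg_share)


def woe_transform(categories: pd.Series, mapping: pd.Series) -> pd.Series:
    """Replace each category by its WOE value; unseen categories get 0 (neutral)."""
    return categories.map(mapping).fillna(0.0)


# Hypothetical toy data: 10 transactions, 3 merchants, 2 frauds.
train = pd.DataFrame({
    "merchant": ["A"] * 6 + ["B"] * 3 + ["C"],
    "fraud":    [0, 0, 0, 0, 0, 1,  0, 0, 1,  0],
})

woe = woe_fit(train["merchant"], train["fraud"])
encoded = woe_transform(train["merchant"], woe)

# One-Hot encoding creates one column per category; WOE always yields one column.
one_hot_width = pd.get_dummies(train["merchant"]).shape[1]

# A category never seen in training ("Z") falls back to the neutral value 0.
encoded_new = woe_transform(pd.Series(["B", "Z"]), woe)
```

On this toy data merchant B, with one fraud in three transactions, receives a higher WOE value (about 0.59) than merchant A, with one fraud in six (about -0.20). The One-Hot representation needs three columns here and would need one column per category on a high-cardinality feature, whereas the WOE representation stays at a single numeric column. In practice the mapping should be fitted on training data only, so that target information does not leak into validation or test sets.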
