A Feature Engineering and Ensemble Learning Based Approach for Repeated Buyers Prediction

Authors

  • Mingyang Zhang Beijing Forestry University, China
  • Jiayue Lu National Science Library, Chinese Academy of Sciences, Beijing, China
  • Ning Ma Beijing Forestry University, China
  • T.C. Edwin Cheng The Hong Kong Polytechnic University, Hong Kong, China
  • Guowei Hua Beijing Jiaotong University, China

DOI:

https://doi.org/10.15837/ijccc.2022.6.4988

Keywords:

feature engineering; ensemble learning; fusion model; repeat buyer prediction

Abstract

The global e-commerce market is growing rapidly, but the percentage of repeat buyers is low. According to Tmall, the repurchase rate is only 6.1%, while research shows that a 5% increase in the repurchase rate can lead to a 25% to 95% increase in profit. To increase the repurchase rate, merchants need to identify potential repeat buyers and convert them into repurchasers. In this paper we build a prediction model of repeat purchasers using Tmall’s dataset. First, we construct high-quality features for e-commerce scenarios through manual construction and algorithmic selection. We introduce the synthetic minority oversampling technique (SMOTE) to address the data imbalance problem and improve prediction performance. Then we train classical classifiers, including a factorization machine and logistic regression, and ensemble learning classifiers, including extreme gradient boosting (XGBoost) and light gradient boosting machine (LightGBM). Finally, we construct a two-layer fusion model based on the Stacking algorithm to further enhance prediction performance. The results show that, through data imbalance processing, feature engineering, and model fusion, the model's area under the curve (AUC) value improves by 0.01161. Our findings provide important implications for managing e-commerce platforms and their merchants.
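
To make the oversampling step named in the abstract concrete, the following is a minimal sketch of the core SMOTE idea: each synthetic minority sample is an interpolation between a real minority sample and one of its nearest minority neighbours. It is not the authors' implementation. The function name `smote`, its parameters, and the toy data are assumptions chosen for illustration, and a real pipeline would more likely rely on an established library.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples.

    Hypothetical helper for illustration only: `minority` is a list of
    equal-length numeric feature vectors from the minority class.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        # The new point lies on the segment between base and neighbour.
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class (assumed data, not taken from the Tmall dataset).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_synthetic=6, k=2)
```

Because every synthetic point is a convex combination of two real minority points, the new samples stay inside the region already occupied by the minority class, which is why the technique rebalances the classes without simply duplicating rows.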

Published

2022-12-14
