XU Ning, LI Weijia, ZHOU Bo, LIU Yun, LI Jie
[Objective] The cost of distribution network engineering is influenced by multidimensional factors such as scale and capacity, equipment and material costs, and geographical conditions. Traditional statistical methods (e.g., linear regression) struggle to handle high-dimensional nonlinear data effectively, while existing machine learning approaches, despite incorporating feature reduction techniques, still exhibit limitations. For instance, principal component analysis (PCA) sacrifices prediction accuracy for dimensionality reduction, and grey relational analysis (GRA) ignores feature interactions. Therefore, there is an urgent need for a prediction method that retains critical feature information while accounting for complex inter-feature relationships. This study integrated recursive feature elimination (RFE) with the random forest (RF) algorithm to develop an RFE-RF prediction model, aiming to resolve feature redundancy and nonlinear modeling challenges. [Methods] A technical framework of “feature selection-model construction-experimental validation” was adopted. For feature selection, RFE was employed: models were trained iteratively, and the features with the smallest predictive contributions were eliminated step by step until an optimal feature subset remained. For model construction, the RF algorithm was utilized. Based on ensemble learning principles, RF constructs multiple decision trees and averages their outputs, which mitigates overfitting and enhances model robustness. RF is also insensitive to noisy data and quantifies feature importance, providing a reliable feature ranking criterion for RFE. By embedding RFE into the RF training process, a closed-loop optimization workflow was established. [Results] Experimental validation used data from 190 distribution network engineering projects provided by a power grid company, covering 21 initial features such as voltage level, line length, and equipment costs.
Categorical features were numerically encoded while preserving their original distribution characteristics. Through five-fold cross-validation and root mean square error (RMSE) optimization, an optimal subset of 12 features was identified, including key factors such as line length, comprehensive cable price, and voltage level. Compared with traditional linear regression (LR), RF, and mutual information-based RF (MI-RF) algorithms, the RFE-RF algorithm achieved a mean absolute error (MAE) of 8.6579 and a mean absolute percentage error (MAPE) of 6.97% on the test set, significantly outperforming the other algorithms. The MAE of RFE-RF on the test set was only about 4.5% higher than on the training set, indicating a lower overfitting risk and demonstrating that feature selection effectively enhances model stability. [Conclusion] Feature selection is pivotal for improving the accuracy of distribution network cost prediction. RFE dynamically eliminates redundant features through an iterative process, substantially reducing data dimensionality and noise interference. The RFE-RF model combines high precision with strong interpretability, reduces MAE significantly compared to traditional models, and clearly quantifies the impact weight of each feature on cost. This study applies the combination of RFE and RF to cost prediction for distribution network engineering, addressing challenges in feature interaction and redundancy filtering and providing a new paradigm for data modeling in complex engineering systems. The model serves as a precise cost prediction tool for power grid enterprises, aiding investment decisions and cost control, thus advancing intelligent and refined construction of distribution networks. Moreover, it reveals the impact mechanism of feature selection on the generalization capability of machine learning models, offering a practical reference for feature optimization in high-dimensional nonlinear datasets.
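The RFE-RF workflow summarized in this abstract can be illustrated with a minimal sketch. The abstract does not name a software library, so the use of scikit-learn, the hyperparameters, the 80/20 split, and the synthetic data below are all assumptions made for illustration; only the problem size (190 samples, 21 features), the five-fold cross-validation, and the RMSE selection criterion come from the text. The numbers it prints have no relation to the reported results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, train_test_split

# Synthetic stand-in for the project data (assumption): 190 samples, 21 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(190, 21))
# Hypothetical cost function: only features 0, 1 and 2 carry signal,
# with an interaction term so the relationship is nonlinear.
y = 100 + 10 * X[:, 0] + 5 * X[:, 1] * X[:, 2] + rng.normal(scale=1.0, size=190)

# Assumed 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# RFE wrapped around a random forest: at each step the feature with the
# lowest RF importance is dropped, and the subset size with the best
# five-fold cross-validated RMSE is kept.
selector = RFECV(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    step=1,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_root_mean_squared_error",
    min_features_to_select=1,
)
selector.fit(X_train, y_train)

selected = np.flatnonzero(selector.support_)  # indices of retained features
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Refit a forest on the selected features and evaluate with MAE and MAPE.
final_model = RandomForestRegressor(n_estimators=50, random_state=0)
final_model.fit(X_train_sel, y_train)

pred_train = final_model.predict(X_train_sel)
pred_test = final_model.predict(X_test_sel)

mae_train = mean_absolute_error(y_train, pred_train)
mae_test = mean_absolute_error(y_test, pred_test)
# MAPE in percent; y stays far from zero in this toy data set.
mape_test = float(np.mean(np.abs((y_test - pred_test) / y_test)) * 100)

print("selected feature indices:", selected.tolist())
print("train MAE:", mae_train, "test MAE:", mae_test, "test MAPE (%):", mape_test)
```

Comparing `mae_train` with `mae_test` mirrors the train/test gap the abstract uses as an overfitting indicator, and `final_model.feature_importances_` gives the per-feature weights referred to in the conclusion.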