Addressing Missing Data in Environmental Technologies: Optimizing Air Quality Monitoring with Random Forest and MissForest
DOI:
https://doi.org/10.21771/jrtppi.2025.v16.no1.p23-31
Keywords:
Air Quality, Imputation, Missing Values, Random Forest, missForest
Abstract
Air quality monitoring often suffers from missing data caused by technical glitches, equipment malfunctions, or other failures. This study uses PM2.5 and PM10 datasets from station 6 and computes multiple weighted probabilities for the imputation experiments. Missing values are introduced at rates of 10%, 40%, and 70% through different amputation methods, and the Random Forest and missForest techniques are then used for imputation. missForest consistently outperforms Random Forest across all scenarios, with accuracy above 96% for PM2.5 and PM10 under left, middle, and right multiple-weight-probability amputation, even at high missing-data levels. Overall, missForest attains its highest accuracy (over 97%) for the Air Quality Index at the lower and middle missing-value proportions.
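The amputation step described in the abstract, introducing missing values at a fixed rate with left, middle, or right weighting, can be illustrated with a minimal sketch. This is not the authors' procedure: the function name `ampute`, the rank-based weights, and the synthetic series are all assumptions made for illustration, and the Random Forest and missForest imputation steps are not shown.

```python
import math
import random


def ampute(values, proportion, kind="right", seed=0):
    """Return a copy of `values` with round(proportion * n) entries set to None.

    kind="left"  makes low values more likely to go missing,
    kind="right" makes high values more likely to go missing,
    kind="mid"   makes values near the median more likely to go missing.
    The exponential weights are a hypothetical choice, not taken from the study.
    """
    rng = random.Random(seed)
    n = len(values)

    # Rank-based score in [-1, 1]: -1 for the smallest value, +1 for the largest.
    order = sorted(range(n), key=lambda i: values[i])
    score = [0.0] * n
    for rank, i in enumerate(order):
        score[i] = 2.0 * rank / (n - 1) - 1.0 if n > 1 else 0.0

    if kind == "left":
        weights = [math.exp(-2.0 * s) for s in score]
    elif kind == "right":
        weights = [math.exp(2.0 * s) for s in score]
    elif kind == "mid":
        weights = [math.exp(-4.0 * s * s) for s in score]
    else:
        raise ValueError("kind must be 'left', 'mid', or 'right'")

    # Weighted sampling without replacement: draw a key u ** (1 / w) per entry
    # and mark the k entries with the largest keys as missing.
    k = round(proportion * n)
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    chosen = {i for _, i in sorted(keys, reverse=True)[:k]}
    return [None if i in chosen else v for i, v in enumerate(values)]


# Synthetic placeholder series, not data from the study.
pm25 = [10.0 + 0.5 * i for i in range(50)]
for rate in (0.1, 0.4, 0.7):
    masked = ampute(pm25, rate, kind="right", seed=1)
    print(rate, sum(v is None for v in masked))
```

Because the original values are kept alongside the masked copy, any imputation method applied to `masked` can afterwards be scored against the known truth at the positions that were set to `None`.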
License
Copyright (c) 2025 Titin Agustin Nengsih, Indrawata Wardhana, M. Nazori Madjid

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.