Addressing Missing Data in Environmental Technologies: Optimizing Air Quality Monitoring with Random Forest and MissForest

Authors

  • Titin Agustin Nengsih UIN Sulthan Thaha Saifuddin Jambi, Indonesia
  • Indrawata Wardhana UIN Sulthan Thaha Saifuddin Jambi, Indonesia
  • M. Nazori Madjid UIN Sulthan Thaha Saifuddin Jambi, Indonesia

DOI:

https://doi.org/10.21771/jrtppi.2025.v16.no1.p23-31

Keywords:

Air Quality, Imputation, Missing Values, Random Forest, missForest

Abstract

  Air quality monitoring often suffers from missing data caused by technical glitches, equipment malfunctions, or other failures. This study uses PM2.5 and PM10 datasets from station 6 and computes multiple weighted probabilities for the amputation (missing-value generation) step. Missing values are introduced at rates of 10%, 40%, and 70% through different amputation methods, and the Random Forest and missForest techniques are then used for imputation. missForest consistently outperforms Random Forest across all scenarios, with accuracy above 96% for PM2.5 and PM10 under left, middle, and right weighted-probability amputation, even at high missing-data levels. For the Air Quality Index, missForest attains its highest accuracy (over 97%) at the lower and middle missing-value proportions.
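
  The amputation step summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the abstract does not specify how the weighted probabilities are computed, so the sketch assumes logistic weights on standardized values, and the function names, the weight shapes, and the synthetic readings are all hypothetical. It covers only the generation of missing values at a chosen rate with left, middle, or right weighting, not the Random Forest or missForest imputation itself.

```python
import math
import random
import statistics


def missingness_weights(values, kind):
    """Return one positive weight per value (larger = more likely to be removed).

    Assumption: weights come from a logistic curve on the standardized value.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values) or 1.0  # guard against constant series
    weights = []
    for v in values:
        z = (v - mean) / sd
        right = 1.0 / (1.0 + math.exp(-z))  # grows with the value
        if kind == "right":
            w = right                        # high values go missing more often
        elif kind == "left":
            w = 1.0 - right                  # low values go missing more often
        elif kind == "mid":
            w = 4.0 * right * (1.0 - right)  # values near the mean go missing
        else:
            raise ValueError("kind must be 'left', 'mid', or 'right'")
        weights.append(max(w, 1e-12))
    return weights


def ampute(values, proportion, kind, seed=0):
    """Replace round(len(values) * proportion) entries with None.

    Entries are chosen by weighted sampling without replacement
    (exponential-key method), so the missing rate is exact.
    """
    rng = random.Random(seed)
    n_missing = round(len(values) * proportion)
    weights = missingness_weights(values, kind)
    # Smaller key = selected first; keys are Exponential(rate=w) draws.
    keys = [(-math.log(1.0 - rng.random()) / w, i) for i, w in enumerate(weights)]
    drop = {i for _, i in sorted(keys)[:n_missing]}
    return [None if i in drop else v for i, v in enumerate(values)]


# Synthetic stand-in for a pollutant series (not the station 6 data).
readings = [20 + 0.5 * i for i in range(200)]
for rate in (0.10, 0.40, 0.70):
    amputed = ampute(readings, rate, "right", seed=42)
    print(rate, sum(v is None for v in amputed))  # 20, 80, 140 missing values
```

  Fixing the number of removed entries, rather than drawing each entry independently, keeps the 10%, 40%, and 70% rates exact, which makes accuracy comparisons across imputation methods easier to interpret.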

Published

2025-05-28

How to Cite

Nengsih, T. A., Wardhana, I., & Madjid, M. N. (2025). Addressing Missing Data in Environmental Technologies: Optimizing Air Quality Monitoring with Random Forest and MissForest. Jurnal Riset Teknologi Pencegahan Pencemaran Industri, 16(1), 23–31. https://doi.org/10.21771/jrtppi.2025.v16.no1.p23-31