Abstract
Missing observations of fine particulate matter (PM2.5) distort air pollution studies by reducing the available concentration information. While machine learning (ML) and statistical methods are commonly used for imputation, they typically rely on external datasets, limiting reproducibility. This study addresses this gap by evaluating five techniques, namely positive matrix factorization (PMF), random forest (RF), denoising autoencoder (DAE), multiple imputation by chained equations (MICE), and k-nearest neighbor (kNN), for imputing missing PM2.5 concentrations from 25 districts in Seoul, South Korea, without external data. First, a completely filled dataset was obtained. Then, some observations were artificially masked to mimic the actual missingness rate. Using 5-fold cross-validation, imputation accuracy was assessed via the mean absolute percentage error (MAPE). PMF showed the lowest MAPE (19.1 %), outperforming RF (21.3 %), DAE (23.7 %), MICE (24.6 %), and kNN (25.9 %). The concentrations imputed by the PMF analysis were sufficiently accurate to be used in air pollution studies with missing data, provided that their uncertainties are taken into account. The superior accuracy of PMF is attributed to its ability to resolve latent factors that represent spatial patterns contributing to PM2.5 in Seoul and to use them to impute missing values. These spatial patterns grouped the 25 districts into six areas, each associated with PM2.5 concentrations from specific districts that are mainly affected by the same pollution sources. This work demonstrates that PMF outperforms the ML and statistical methods in imputing missing concentrations and additionally identifies spatial PM2.5 patterns across multiple sites without external data. Missing PM2.5 data in Seoul should therefore be imputed using PMF analysis for reliable air quality investigations.
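
The mask-and-score evaluation summarized above can be illustrated with a minimal sketch. It is not the study's implementation: the helper names (`mask_entries`, `impute_row_mean`, `mape`), the 20 % masking rate, and the synthetic concentration matrix are all assumptions made for illustration, and a simple same-time-step mean stands in for PMF and the other four methods. The 5-fold cross-validation is omitted; only a single masking round is shown.

```python
import random


def mask_entries(data, missing_rate, seed=0):
    """Copy `data` and set a random fraction of entries to None.

    Returns the masked copy and the list of (row, col) positions hidden.
    """
    rng = random.Random(seed)
    positions = [(i, j) for i in range(len(data)) for j in range(len(data[0]))]
    n_masked = round(missing_rate * len(positions))
    masked_positions = rng.sample(positions, n_masked)
    masked = [row[:] for row in data]
    for i, j in masked_positions:
        masked[i][j] = None
    return masked, masked_positions


def impute_row_mean(masked):
    """Placeholder imputer (NOT PMF): fill each gap with the mean of the
    values observed at the other sites for the same time step."""
    filled = []
    for row in masked:
        observed = [v for v in row if v is not None]
        fallback = sum(observed) / len(observed) if observed else 0.0
        filled.append([fallback if v is None else v for v in row])
    return filled


def mape(truth, imputed, positions):
    """Mean absolute percentage error (%) over the masked positions only."""
    errors = [
        abs(imputed[i][j] - truth[i][j]) / abs(truth[i][j]) for i, j in positions
    ]
    return 100.0 * sum(errors) / len(errors)


# Synthetic, made-up matrix: rows are time steps, columns are sites.
truth = [[20.0 + 2.0 * t + s for s in range(5)] for t in range(10)]

# Hide 20 % of the entries (hypothetical rate), impute, and score.
masked, hidden = mask_entries(truth, missing_rate=0.2, seed=42)
filled = impute_row_mean(masked)
print(f"MAPE on masked entries: {mape(truth, filled, hidden):.1f} %")
```

Scoring only the artificially hidden positions is what makes the comparison possible, since the true values at those positions are known. Any of the five methods could be substituted for `impute_row_mean` as long as it accepts the masked matrix and returns a filled one.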
