Positive matrix factorization outperforms machine learning in imputing missing PM₂.₅ and further identifying spatial patterns in multi-sites without external data

Abstract

Missing observations of fine particulate matter (PM₂.₅) distort air pollution studies by reducing the available concentration information. While machine learning (ML) and statistical methods are commonly used for imputation, they typically rely on external datasets, which limits reproducibility. This study addresses this gap by evaluating five techniques, namely positive matrix factorization (PMF), random forest (RF), denoising autoencoder (DAE), multiple imputation by chained equations (MICE), and k-nearest neighbor (kNN), to impute missing PM₂.₅ concentrations from 25 districts in Seoul, South Korea, without external data. First, a completely filled dataset was obtained. Then, some observations were artificially masked to mimic the actual missingness rate. Using 5-fold cross-validation, imputation accuracy was assessed via the mean absolute percentage error (MAPE). PMF showed the lowest MAPE (19.1 %), outperforming RF (21.3 %), DAE (23.7 %), MICE (24.6 %), and kNN (25.9 %). The concentrations imputed by PMF were sufficiently accurate to be used in air pollution studies with missing data, provided that uncertainties are considered. The superior accuracy of PMF is attributed to its ability to resolve latent factors that represent spatial patterns contributing to PM₂.₅ in Seoul and to use them to impute missing values. The spatial patterns grouped the 25 districts into six areas, each associated with PM₂.₅ concentrations from specific districts that are mainly affected by the same pollution sources. This work demonstrates that PMF outperforms ML and statistical methods in accurately imputing missing concentrations and in further identifying spatial PM₂.₅ patterns across multiple sites without external data. Missing PM₂.₅ data in Seoul should be imputed using PMF analysis for reliable air quality investigations.
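
The mask-then-score evaluation described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the study's implementation: the data are synthetic (500 time steps by 25 sites built from 3 latent factors), the 10 % mask rate is arbitrary, a single random mask replaces the 5-fold cross-validation, and a non-negative matrix factorization fitted only to the observed entries stands in for PMF. PMF is normally fitted with per-observation uncertainty weights, whereas this stand-in uses binary observed/missing weights. The names `mape` and `masked_nmf_impute` are hypothetical helpers.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))


def masked_nmf_impute(X, observed, n_factors, n_iter=500, seed=0, eps=1e-9):
    """Fit X ~ G @ F with non-negative G and F on observed entries only.

    Uses multiplicative updates for the binary-weighted least-squares
    objective (a simplified stand-in for PMF, not PMF itself) and returns
    the full reconstruction G @ F, which supplies the imputed values.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    G = rng.random((n_rows, n_factors)) + 0.1   # time-varying contributions
    F = rng.random((n_factors, n_cols)) + 0.1   # per-site (spatial) profiles
    W = observed.astype(float)                  # 1 = observed, 0 = missing
    XW = np.where(observed, X, 0.0)             # missing entries never enter the fit
    for _ in range(n_iter):
        G *= (XW @ F.T) / ((W * (G @ F)) @ F.T + eps)
        F *= (G.T @ XW) / (G.T @ (W * (G @ F)) + eps)
    return G @ F


# Synthetic, completely filled "concentration" matrix (assumed, not real data).
rng = np.random.default_rng(42)
n_hours, n_sites, n_true = 500, 25, 3
G_true = rng.gamma(2.0, 2.0, size=(n_hours, n_true)) + 0.5
F_true = rng.uniform(0.2, 1.0, size=(n_true, n_sites))
X = (G_true @ F_true) * rng.lognormal(0.0, 0.05, size=(n_hours, n_sites))

# Artificially mask a fraction of entries, impute them, and score only those.
mask_rate = 0.10                                # hypothetical missingness rate
held_out = rng.random(X.shape) < mask_rate
X_hat = masked_nmf_impute(X, ~held_out, n_factors=n_true)
score = mape(X[held_out], X_hat[held_out])
print(f"MAPE on held-out entries: {score:.1f} %")
```

In this sketch the rows of `F` play the role of the latent spatial patterns: sites with similar loadings on a factor would be grouped into the same area. The number of factors is fixed at the value used to generate the data; with real observations it would have to be chosen, for example by comparing held-out error across candidate values.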

Atmospheric Chemistry Monitoring and Modeling (ACMM) Laboratory homepage
