Data Preprocessing and Feature Engineering of Air Quality Forecast in Hunan Province
Received: 2022-03-07  Revised: 2022-10-26
DOI:10.19316/j.issn.1002-6002.2023.04.18
Keywords: machine learning; data preprocessing; feature engineering; air quality forecast
Funding: 2020 Key Project of the Hunan Provincial Meteorological Bureau (XQKJ20A001); National Natural Science Foundation of China (41271095)
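Authors and affiliations
Li Xisheng (李细生): Hunan Key Laboratory of Meteorological Disaster Prevention and Reduction, Changsha, Hunan 410118; Zhuzhou Meteorological Bureau, Zhuzhou, Hunan 412003
Chen Yuan (陈媛): Hunan Key Laboratory of Meteorological Disaster Prevention and Reduction, Changsha, Hunan 410118
Luo Huini (罗慧妮): Zhuzhou Meteorological Bureau, Zhuzhou, Hunan 412003
Zhang Kefei (张克非): Zhuzhou Meteorological Bureau, Zhuzhou, Hunan 412003
Yu Yuzhi (喻雨知): Changsha Meteorological Bureau, Changsha, Hunan 410017
Li Qiaoyuan (李巧媛): Hunan Key Laboratory of Meteorological Disaster Prevention and Reduction, Changsha, Hunan 410118
Zhang Hua (张华): Zhuzhou Meteorological Bureau, Zhuzhou, Hunan 412003
Yi Fei (易飞): Zhuzhou Meteorological Bureau, Zhuzhou, Hunan 412003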
Abstract:
      To improve the accuracy of air quality forecasts, a library of machine learning models was established that integrates meteorological and environmental observation data, combines machine learning with numerical weather prediction, and provides a relatively long forecast lead time with relatively high prediction accuracy. Taking the air quality forecasts of six cities in Hunan Province (Changsha, Zhuzhou, Xiangtan, Yiyang, Changde, and Yueyang) as an example, data preprocessing and feature engineering methods were applied to the models, and the following conclusions were drawn: ① Data preprocessing, which includes sample collection, data cleaning, missing value handling, and outlier removal, contributes substantially to the stability of model predictions. ② Combining point, line, and surface features helps describe the full formation and dissipation process of pollutants. After the transmission index C was introduced, the Zhuzhou model predicted transmission-type pollution processes markedly better, with classification accuracy for mild, moderate, and severe pollution improving by 23.6%, 16.6%, and 30.0%, respectively. After the static stability index W was introduced, the correlation coefficient of the Changsha model on the PM2.5 concentration test rose from 0.938 to 0.959 and the root mean square error fell from 10.33 to 8.46, and the model's extreme-value forecasts for weather with moderate or heavier pollution were closer to observations. In the Yiyang model, the systematic underestimation of high-concentration samples was reduced, and forecasts for weather with mild or heavier pollution were substantially corrected. ③ Ranking features by random forest feature importance can greatly reduce the number of features, which improves the interpretability and stability of the models.
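The abstract names missing value handling and outlier removal as preprocessing steps and reports the correlation coefficient and root mean square error as test metrics, but it does not say how any of them were computed. The following is a minimal sketch under stated assumptions, not the authors' procedure: linear interpolation for gaps, a median-absolute-deviation rule with a threshold of 3.5 for outliers, and hypothetical function names and sample values chosen purely for illustration.

```python
import math
from statistics import mean, median


def drop_outliers(series, threshold=3.5):
    """Replace values whose modified z-score exceeds `threshold` with None.

    Assumption: a median-absolute-deviation (MAD) rule; the paper's actual
    outlier criterion is not given in the abstract.
    """
    observed = [v for v in series if v is not None]
    med = median(observed)
    mad = median([abs(v - med) for v in observed])
    if mad == 0:
        return list(series)  # no spread, nothing can be flagged
    return [
        None if v is not None and 0.6745 * abs(v - med) / mad > threshold else v
        for v in series
    ]


def fill_missing(series):
    """Fill None gaps by linear interpolation; edge gaps copy the nearest value."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = series[left] + weight * (series[right] - series[left])
    return filled


def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(mean([(o - p) ** 2 for o, p in zip(observed, predicted)]))


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    mo, mp = mean(observed), mean(predicted)
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


# Made-up hourly PM2.5 readings with one gap and one implausible spike.
raw = [35.0, 38.0, None, 44.0, 900.0, 47.0, 50.0]
cleaned = fill_missing(drop_outliers(raw))
```

Outliers are removed before gaps are filled so that a spike cannot contaminate the interpolated neighbours. In practice the threshold and the interpolation scheme would need to be tuned to each pollutant and station.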