Pandas Example: Analyzing Beijing's PM Index


Requirements

Using the provided CSV file, compute statistics on Beijing's air quality:

1. Summarize how the yearly average of the PM index changes over time.
2. Summarize how the PM index changes across months 1-12 of each year.

See the attachment for the CSV file.

Analysis

Analyzing the problem

Problem 1 is really asking for the yearly mean of the PM index, i.e. group by year and compute the mean;
Problem 2 is really asking for the monthly mean of the PM index, i.e. group by year and month and compute the mean.
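
Before touching the real file, the grouping idea can be sketched on a tiny made-up DataFrame (the data below is illustrative only, not from the CSV):

import pandas as pd

# Toy data standing in for the real file (values are made up)
toy = pd.DataFrame({
    'year':  [2010, 2010, 2011, 2011],
    'month': [1, 2, 1, 2],
    'pm':    [90.0, 110.0, 80.0, 120.0],
})

# Problem 1 pattern: group by year and average
print(toy.groupby('year')['pm'].mean())

# Problem 2 pattern: group by year and month and average
print(toy.groupby(['year', 'month'])['pm'].mean())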

Analyzing the CSV file

import numpy as np
import pandas as pd

# Open the file
FileNameStr = 'BeijingPM20100101_20151231.csv'
df = pd.read_csv(FileNameStr, encoding='utf-8')

# Do not truncate the column display
pd.set_option('display.max_columns', None)
# Print summary information
print("--------------head--------------")
print(df.head())
print("------------describe------------")
print(df.describe())
print("--------------info--------------")
print(df.info())
print("================================")

Running this produces the following output:

--------------head--------------
   No  year  month  day  hour  season  PM_Dongsi  PM_Dongsihuan  \
0   1  2010      1    1     0       4        NaN            NaN   
1   2  2010      1    1     1       4        NaN            NaN   
2   3  2010      1    1     2       4        NaN            NaN   
3   4  2010      1    1     3       4        NaN            NaN   
4   5  2010      1    1     4       4        NaN            NaN   

   PM_Nongzhanguan  PM_US Post  DEWP  HUMI    PRES  TEMP cbwd    Iws  \
0              NaN         NaN -21.0  43.0  1021.0 -11.0   NW   1.79   
1              NaN         NaN -21.0  47.0  1020.0 -12.0   NW   4.92   
2              NaN         NaN -21.0  43.0  1019.0 -11.0   NW   6.71   
3              NaN         NaN -21.0  55.0  1019.0 -14.0   NW   9.84   
4              NaN         NaN -20.0  51.0  1018.0 -12.0   NW  12.97   

   precipitation  Iprec  
0            0.0    0.0  
1            0.0    0.0  
2            0.0    0.0  
3            0.0    0.0  
4            0.0    0.0  
------------describe------------
                 No          year         month           day          hour  \
count  52584.000000  52584.000000  52584.000000  52584.000000  52584.000000   
mean   26292.500000   2012.499772      6.523962     15.726609     11.500000   
std    15179.837614      1.707485      3.448452      8.798896      6.922252   
min        1.000000   2010.000000      1.000000      1.000000      0.000000   
25%    13146.750000   2011.000000      4.000000      8.000000      5.750000   
50%    26292.500000   2012.000000      7.000000     16.000000     11.500000   
75%    39438.250000   2014.000000     10.000000     23.000000     17.250000   
max    52584.000000   2015.000000     12.000000     31.000000     23.000000   

             season     PM_Dongsi  PM_Dongsihuan  PM_Nongzhanguan  \
count  52584.000000  25052.000000   20508.000000     24931.000000   
mean       2.491100     89.154439      92.560806        88.643737   
std        1.116988     87.239267      88.027434        88.041166   
min        1.000000      3.000000       3.000000         3.000000   
25%        1.000000     24.000000      28.000000        24.000000   
50%        2.000000     64.000000      68.000000        62.000000   
75%        3.000000    124.000000     127.000000       122.000000   
max        4.000000    737.000000     672.000000       844.000000   

         PM_US Post          DEWP          HUMI          PRES          TEMP  \
count  50387.000000  52579.000000  52245.000000  52245.000000  52579.000000   
mean      95.904241      2.074554     54.602421   1016.465442     12.587040   
std       91.643772     14.222059     25.991338     10.295070     12.098527   
min        1.000000    -40.000000      2.000000    991.000000    -19.000000   
25%       27.000000    -10.000000     31.000000   1008.000000      2.000000   
50%       69.000000      2.000000     55.000000   1016.000000     14.000000   
75%      132.000000     15.000000     78.000000   1025.000000     23.000000   
max      994.000000     28.000000    100.000000   1046.000000     42.000000   

                Iws  precipitation          Iprec  
count  52579.000000   52100.000000   52100.000000  
mean      23.261829      19.258683      19.519008  
std       49.281706    4381.035532    4381.036040  
min        0.450000       0.000000       0.000000  
25%        1.790000       0.000000       0.000000  
50%        4.920000       0.000000       0.000000  
75%       21.020000       0.000000       0.000000  
max      585.600000  999990.000000  999990.000000  
--------------info--------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52584 entries, 0 to 52583
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   No               52584 non-null  int64  
 1   year             52584 non-null  int64  
 2   month            52584 non-null  int64  
 3   day              52584 non-null  int64  
 4   hour             52584 non-null  int64  
 5   season           52584 non-null  int64  
 6   PM_Dongsi        25052 non-null  float64
 7   PM_Dongsihuan    20508 non-null  float64
 8   PM_Nongzhanguan  24931 non-null  float64
 9   PM_US Post       50387 non-null  float64
 10  DEWP             52579 non-null  float64
 11  HUMI             52245 non-null  float64
 12  PRES             52245 non-null  float64
 13  TEMP             52579 non-null  float64
 14  cbwd             52579 non-null  object 
 15  Iws              52579 non-null  float64
 16  precipitation    52100 non-null  float64
 17  Iprec            52100 non-null  float64
dtypes: float64(11), int64(6), object(1)
memory usage: 7.2+ MB
None
================================

After consulting the documentation, the variables have the following meanings:

No: row number
year: year
month: month
day: day of the month
hour: hour
season: season
PM_Dongsi: PM2.5 concentration from the Dongsi (东四) station (ug/m^3)
PM_Dongsihuan: PM2.5 concentration from the Dongsihuan (东四环) station (ug/m^3)
PM_Nongzhanguan: PM2.5 concentration from the Nongzhanguan (农展馆) station (ug/m^3)
PM_US Post: PM2.5 concentration from the US Post (ug/m^3)
DEWP: dew point (°C), the temperature to which, at constant pressure, air must be cooled for its water vapor to reach saturation and condense into liquid water
TEMP: temperature (°C)
HUMI: humidity (%)
PRES: pressure (hPa)
cbwd: combined wind direction
Iws: cumulative wind speed (m/s)
precipitation: hourly precipitation (mm)
Iprec: cumulative precipitation (mm)

It is not hard to see that only columns 2, 3, 7, 8, 9, and 10 are relevant to the task, but all four PM columns have missing values to varying degrees.

Analyzing missing data

import numpy as np
import pandas as pd

# Open the file, reading only columns 7 through 10 (the four PM columns)
FileNameStr = 'BeijingPM20100101_20151231.csv'
df = pd.read_csv(FileNameStr, encoding='utf-8', usecols=[6, 7, 8, 9])

# Print information

# Total number of rows
print("Total of rows: {}".format(len(df.index)))

# Count missing values in a single column
# print("The number of missing data in PM_Dongsi: {}".format(len(df.index) - len(df['PM_Dongsi'].dropna())))
# For brevity, loop over all columns to get each column's missing count
for col in df:
    # dropna() drops missing entries, so the difference is the number of missing values in the column
    print("The number of missing data in {}: {}".format(col, len(df.index) - len(df[col].dropna())))

# Count rows where every column is missing; how='all' drops a row only when all of its columns are NaN
print("The number of missing data in ALL: ", len(df.index) - len(df.dropna(how='all')))

Output:

Total of rows: 52584
The number of missing data in PM_Dongsi: 27532
The number of missing data in PM_Dongsihuan: 32076
The number of missing data in PM_Nongzhanguan: 27653
The number of missing data in PM_US_Post: 2197
The number of missing data in ALL:  1894

PM_US_Post is clearly the most complete column, but there are still rows in which all four values are missing.
For reliability, I take the average of the four stations as the PM value of each record.
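
As a side note (a quick sketch, not part of the original analysis), df.isna().sum() gives the same per-column missing counts in one call, and mean(axis=1) skips NaN by default, which is exactly what makes the four-station average usable even when some stations are missing:

import numpy as np
import pandas as pd

# Tiny made-up frame mimicking the four PM columns
pm = pd.DataFrame({
    'PM_Dongsi':       [np.nan, 80.0],
    'PM_Dongsihuan':   [np.nan, np.nan],
    'PM_Nongzhanguan': [np.nan, 60.0],
    'PM_US_Post':      [120.0, 100.0],
})

# Per-column missing counts, equivalent to the dropna() loop above
print(pm.isna().sum())

# Row-wise mean ignores NaN: row 0 -> 120.0, row 1 -> (80 + 60 + 100) / 3 = 80.0
print(pm.mean(axis=1))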

Note

The column name PM_US Post in the original file contains a space, which can interfere with some operations; it is recommended to replace ' ' with '_'.
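
One way to do that rename (a sketch; the original post does not show this step):

import pandas as pd

FileNameStr = 'BeijingPM20100101_20151231.csv'
df = pd.read_csv(FileNameStr, encoding='utf-8')

# Replace spaces in all column names with underscores, e.g. 'PM_US Post' -> 'PM_US_Post'
df.columns = df.columns.str.replace(' ', '_')
print(df.columns.tolist())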

Solution

Problem 1:

import numpy as np
import pandas as pd

# Open the file, reading the year column and the four PM columns
FileNameStr = 'BeijingPM20100101_20151231.csv'
df = pd.read_csv(FileNameStr, encoding='utf-8', usecols=[1, 6, 7, 8, 9])

# Add a column holding the row-wise average of the four PM columns
# iloc[:, 1:5] selects columns 2 through 5 (the PM columns); mean(axis=1) averages across each row
# (the default axis=0 would average down each column instead)
df['PM_ave'] = df.iloc[:, 1:5].mean(axis=1)
# Group by 'year', take the mean of 'PM_ave', and save the result to a file
df.groupby('year')['PM_ave'].mean().to_csv("solution1.csv")
# Also print a copy for convenience
print(df.groupby('year')['PM_ave'].mean())

Result:

year    PM_ave
2010    104.0457298
2011    99.09324038
2012    90.53876764
2013    98.40266354
2014    93.9177037
2015    85.85894216
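
To make the year-over-year change easier to see, the grouped result can also be plotted (a sketch assuming matplotlib is installed; plotting is not part of the original solution):

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('BeijingPM20100101_20151231.csv', encoding='utf-8', usecols=[1, 6, 7, 8, 9])
df['PM_ave'] = df.iloc[:, 1:5].mean(axis=1)
yearly = df.groupby('year')['PM_ave'].mean()

# Bar chart of the yearly averages
yearly.plot(kind='bar')
plt.ylabel('PM2.5 (ug/m^3)')
plt.title('Yearly average PM2.5 in Beijing, 2010-2015')
plt.tight_layout()
plt.show()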

Problem 2:

import numpy as np
import pandas as pd

# Open the file, reading the year and month columns and the four PM columns
FileNameStr = 'BeijingPM20100101_20151231.csv'
df = pd.read_csv(FileNameStr, encoding='utf-8', usecols=[1, 2, 6, 7, 8, 9])

# Add a column holding the row-wise average of the four PM columns
# iloc[:, 2:6] selects columns 3 through 6 (the PM columns); mean(axis=1) averages across each row
# (the default axis=0 would average down each column instead)
df['PM_ave'] = df.iloc[:, 2:6].mean(axis=1)
# Group by 'year' and 'month', take the mean of 'PM_ave', and save the result to a file
df.groupby(['year', 'month'])['PM_ave'].mean().to_csv("solution2.csv")
# Also print a copy for convenience
print(df.groupby(['year', 'month'])['PM_ave'].mean())

Result:

year    month    PM_ave
2010    1    90.40366972
2010    2    97.23994039
2010    3    94.04654443
2010    4    80.0724234
2010    5    87.07191316
2010    6    109.0389381
2010    7    123.4260753
2010    8    97.68343195
2010    9    122.792735
2010    10    118.7843666
2010    11    138.3840361
2010    12    97.11574697
2011    1    44.87369985
2011    2    150.2901786
2011    3    57.99198718
2011    4    91.72067039
2011    5    65.10814607
2011    6    108.7946554
2011    7    107.3864865
2011    8    103.7338004
2011    9    94.96940195
2011    10    145.5568182
2011    11    109.434965
2011    12    108.7213997
2012    1    118.9223881
2012    2    84.44202899
2012    3    96.47432432
2012    4    87.83588317
2012    5    90.96671491
2012    6    96.63418079
2012    7    80.6497093
2012    8    81.16532905
2012    9    59.95224719
2012    10    94.95135135
2012    11    87.43696275
2012    12    109.1872964
2013    1    183.1952703
2013    2    113.5664683
2013    3    114.5726927
2013    4    63.04780093
2013    5    89.14852151
2013    6    111.3548611
2013    7    74.93283917
2013    8    67.92361111
2013    9    85.71782407
2013    10    102.2087814
2013    11    85.1462963
2013    12    90.31776434
2014    1    107.9117384
2014    2    160.5138889
2014    3    103.1832437
2014    4    92.16064815
2014    5    64.95855735
2014    6    59.15462963
2014    7    91.7999552
2014    8    65.66823688
2014    9    68.23263889
2014    10    135.2697133
2014    11    106.3375
2014    12    76.62253584
2015    1    110.0227375
2015    2    103.4455605
2015    3    94.48342294
2015    4    79.39699074
2015    5    61.16756272
2015    6    60.33240741
2015    7    60.22950269
2015    8    45.89605735
2015    9    50.92476852
2015    10    77.2578405
2015    11    125.803125
2015    12    162.1789875
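
For easier comparison across years, the year/month result can be reshaped into a 6 x 12 table with unstack() (a sketch, not part of the original solution):

import pandas as pd

df = pd.read_csv('BeijingPM20100101_20151231.csv', encoding='utf-8', usecols=[1, 2, 6, 7, 8, 9])
df['PM_ave'] = df.iloc[:, 2:6].mean(axis=1)
monthly = df.groupby(['year', 'month'])['PM_ave'].mean()

# Rows = years, columns = months 1 through 12
table = monthly.unstack('month')
print(table.round(1))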

Attachment

Click to download


Author: BoyInTheSun
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit BoyInTheSun when reposting!