Pandas空值识别,空值填充,空值过滤方法超详细解读
目录
Pandas默认空值处理方式
空值识别函数
isnull()/notnull()
用于查看不同行或列的空值个数
用于查看不同行或列的空值占比
空值填充函数
fillna()
使用0来填充空值
使用每列均值来填充空值
使用每列前面一个值来填充空值
使用每列后面一个值来填充空值
使用每行前一个值来填充空值
使用每行后一个值来填充空值
interpolate()
空值过滤函数
使用isnull()/notnull+any()/all()进行过滤
筛选出不含空值的行
筛选出有空值的行
dropna()
Pandas默认空值处理方式
pandas中默认会将None值转为np.nan进行处理,np.nan为float类型数据
以下为本篇博客使用的带有空值的DataFrame,可以看到输出的DataFrame中的NaN就是np.nan
import pandas as pd
import numpy as np
np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1)
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
# 0 1 2 3 4 5
# 0 16 29 24.0 20.0 17.0 16.0
# 1 28 20 NaN 13.0 17.0 12.0
# 2 11 21 15.0 11.0 10.0 21.0
# 3 21 26 19.0 25.0 24.0 NaN
# 4 28 21 29.0 NaN NaN 28.0
# 5 16 18 16.0 27.0 13.0 23.0
空值识别函数
isnull()/notnull()
isnull()的作用是将DataFrame中为空值的地方变为True,其余地方为False,notnull()则相反,不为空值的地方为True,其余为False
import pandas as pd
import numpy as np
np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1)
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.isnull())
print(data.notnull())
# 0 1 2 3 4 5
# 0 16 29 24.0 20.0 17.0 16.0
# 1 28 20 NaN 13.0 17.0 12.0
# 2 11 21 15.0 11.0 10.0 21.0
# 3 21 26 19.0 25.0 24.0 NaN
# 4 28 21 29.0 NaN NaN 28.0
# 5 16 18 16.0 27.0 13.0 23.0
# 0 1 2 3 4 5
# 0 False False False False False False
# 1 False False True False False False
# 2 False False False False False False
# 3 False False False False False True
# 4 False False False True True False
# 5 False False False False False False
# 0 1 2 3 4 5
# 0 True True True True True True
# 1 True True False True True True
# 2 True True True True True True
# 3 True True True True True False
# 4 True True True False False True
# 5 True True True True True True
用于查看不同行或列的空值个数
若需要显示不同行或列的空值个数,由于布尔类型数据在参与运算的时候True会默认转为1,False会默认转为0,所以只需要isnull()再加上sum()聚合函数即可,通过设置其axis参数来进行每行或每列的求和,值就为空值个数
import pandas as pd
import numpy as np
np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1,index=list('ABCDEF'),columns=list('abcdef'))
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.isnull().sum(axis=0))
print(data.isnull().sum(axis=1))
# a b c d e f
# A 16 29 24.0 20.0 17.0 16.0
# B 28 20 NaN 13.0 17.0 12.0
# C 11 21 15.0 11.0 10.0 21.0
# D 21 26 19.0 25.0 24.0 NaN
# E 28 21 29.0 NaN NaN 28.0
# F 16 18 16.0 27.0 13.0 23.0
# a 0
# b 0
# c 1
# d 1
# e 1
# f 1
# dtype: int64
# A 0
# B 1
# C 0
# D 1
# E 2
# F 0
# dtype: int64
可以看到这里值为0的时候是沿行方向进行每一列的求和运算,为1的时候是沿列方向进行每一行的求和运算,得到的就为空值个数
用于查看不同行或列的空值占比
若要查看空值占比同样也只需要加上mean()聚合函数即可,mean()为和再除上个数,在布尔类型数据进行如此运算后,结果就为空值占比,同样也可以指定axis来选择行或列进行计算
import pandas as pd
import numpy as np
np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1,index=list('ABCDEF'),columns=list('abcdef'))
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.isnull().mean(axis=0))
print(data.isnull().mean(axis=1))
# a b c d e f
# A 16 29 24.0 20.0 17.0 16.0
# B 28 20 NaN 13.0 17.0 12.0
# C 11 21 15.0 11.0 10.0 21.0
# D 21 26 19.0 25.0 24.0 NaN
# E 28 21 29.0 NaN NaN 28.0
# F 16 18 16.0 27.0 13.0 23.0
# a 0.000000
# b 0.000000
# c 0.166667
# d 0.166667
# e 0.166667
# f 0.166667
# dtype: float64
# A 0.000000
# B 0.166667
# C 0.000000
# D 0.166667
# E 0.333333
# F 0.000000
# dtype: float64
空值填充函数
空值填充一般会用到fillna()函数
fillna()
fillna()可以选用指定值作为空值填充,通过配置value参数即可设置填充值,并且支持列索引对齐的广播赋值,注意这里不支持行索引对齐的广播赋值机制,即使改变axis的值也不可以,并且支持通过配置method参数进行指定方式的空值赋值,method可选的有‘ffill’和‘bfill’,分别指用前一个值来填充空值和用后一个值来填充空值
使用0来填充空值
import pandas as pd
import numpy as np
np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1,index=list('ABCDEF'),columns=list('abcdef'))
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.fillna(value=0))
# a b c d e f
# A 16 29 24.0 20.0 17.0 16.0
# B 28 20 NaN 13.0 17.0 12.0
# C 11 21 15.0 11.0 10.0 21.0
# D 21 26 19.0 25.0 24.0 NaN
# E 28 21 29.0 NaN NaN 28.0
# F 16 18 16.0 27.0 13.0 23.0
# a b c d e f
# A 16 29 24.0 20.0 17.0 16.0
# B 28 20 0.0 13.0 17.0 12.0
# C 11 21 15.0 11.0 10.0 21.0
# D 21 26 19.0 25.0 24.0 0.0
# E 28 21 29.0 0.0 0.0 28.0
# F 16 18 16.0 27.0 13.0 23.0
使用每列均值来填充空值
import pandas as pd
import numpy as np
np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1,index=list('ABCDEF'),columns=list('abcdef'))
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.fillna(value=data.mean(axis = 0)))
# a b c d e f
# A 16 29 24.0 20.0 17.0 16.0
# B 28 20 NaN 13.0 17.0 12.0
# C 11 21 15.0 11.0 10.0 21.0
# D 21 26 19.0 25.0 24.0 NaN
# E 28 21 29.0 NaN NaN 28.0
# F 16 18 16.0 27.0 13.0 23.0
# a b c d e f
# A 16 29 24.0 20.0 17.0 16.0
# B 28 20 20.6 13.0 17.0 12.0
# C 11 21 15.0 11.0 10.0 21.0
# D 21 26 19.0 25.0 24.0 20.0
# E 28 21 29.0 19.2 16.2 28.0
# F 16 18 16.0 27.0 13.0 23.0
使用每列前面一个值来填充空值
import pandas as pd
import numpy as np
np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1,index=list('ABCDEF'),columns=list('abcdef'))
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.fillna(method='ffill',axis=0))
# a b c d e f
# A 16 29 24.0 20.0 17.0 16.0
# B 28 20 NaN 13.0 17.0 12.0
# C 11 21 15.0 11.0 10.0 21.0
# D 21 26 19.0 25.0 24.0 NaN
# E 28 21 29.0 NaN NaN 28.0
# F 16 18 16.0 27.0 13.0 23.0
# a b c d e f
# A 16 29 24.0 20.0 17.0 16.0
# B 28 20 24.0 13.0 17.0 12.0
# C 11 21 15.0 11.0 10.0 21.0
# D 21 26 19.0 25.0 24.0 21.0
# E 28 21 29.0 25.0 24.0 28.0
# F 16 18 16.0 27.0 13.0 23.0
这里设置了axis为0表示沿行方向计算
使用每列后面一个值来填充空值
import pandas as pd
import numpy as np
np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1,index=list('ABCDEF'),columns=list('abcdef'))
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.fillna(method='bfill',axis=0))
#
# a b c d e f
# A 16 29 24.0 20.0 17.0 16.0
# B 28 20 NaN 13.0 17.0 12.0
# C 11 21 15.0 11.0 10.0 21.0
# D 21 26 19.0 25.0 24.0 NaN
# E 28 21 29.0 NaN NaN 28.0
# F 16 18 16.0 27.0 13.0 23.0
# a b c d e f
# A 16 29 24.0 20.0 17.0 16.0
# B 28 20 15.0 13.0 17.0 12.0
# C 11 21 15.0 11.0 10.0 21.0
# D 21 26 19.0 25.0 24.0 28.0
# E 28 21 29.0 27.0 13.0 28.0
# F 16 18 16.0 27.0 13.0 23.0
使用每行前一个值来填充空值
import pandas as pd
import numpy as np
np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1,index=list('ABCDEF'),columns=list('abcdef'))
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.fillna(method='ffill',axis=1))
# a b c d e f
# A 16 29 24.0 20.0 17.0 16.0
# B 28 20 NaN 13.0 17.0 12.0
# C 11 21 15.0 11.0 10.0 21.0
# D 21 26 19.0 25.0 24.0 NaN
# E 28 21 29.0 NaN NaN 28.0
# F 16 18 16.0 27.0 13.0 23.0
# a b c d e f
# A 16.0 29.0 24.0 20.0 17.0 16.0
# B 28.0 20.0 20.0 13.0 17.0 12.0
# C 11.0 21.0 15.0 11.0 10.0 21.0
# D 21.0 26.0 19.0 25.0 24.0 24.0
# E 28.0 21.0 29.0 29.0 29.0 28.0
# F 16.0 18.0 16.0 27.0 13.0 23.0
这里axis为1表示沿列方向计算
使用每行后一个值来填充空值
import pandas as pd
import numpy as np
np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1,index=list('ABCDEF'),columns=list('abcdef'))
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.fillna(method='bfill',axis=1))
# a b c d e f
# A 16 29 24.0 20.0 17.0 16.0
# B 28 20 NaN 13.0 17.0 12.0
# C 11 21 15.0 11.0 10.0 21.0
# D 21 26 19.0 25.0 24.0 NaN
# E 28 21 29.0 NaN NaN 28.0
# F 16 18 16.0 27.0 13.0 23.0
# a b c d e f
# A 16.0 29.0 24.0 20.0 17.0 16.0
# B 28.0 20.0 13.0 13.0 17.0 12.0
# C 11.0 21.0 15.0 11.0 10.0 21.0
# D 21.0 26.0 19.0 25.0 24.0 NaN
# E 28.0 21.0 29.0 28.0 28.0 28.0
# F 16.0 18.0 16.0 27.0 13.0 23.0
interpolate()
interpolate()可以进行线性插值,通过设置axis来选择进行行方向的线性插值还是列方向的线性插值
import pandas as pd
import numpy as np
np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1,index=list('ABCDEF'),columns=list('abcdef'))
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.interpolate(axis=0))
# a b c d e f
# A 16 29 24.0 20.0 17.0 16.0
# B 28 20 NaN 13.0 17.0 12.0
# C 11 21 15.0 11.0 10.0 21.0
# D 21 26 19.0 25.0 24.0 NaN
# E 28 21 29.0 NaN NaN 28.0
# F 16 18 16.0 27.0 13.0 23.0
# a b c d e f
# A 16 29 24.0 20.0 17.0 16.0
# B 28 20 19.5 13.0 17.0 12.0
# C 11 21 15.0 11.0 10.0 21.0
# D 21 26 19.0 25.0 24.0 24.5
# E 28 21 29.0 26.0 18.5 28.0
# F 16 18 16.0 27.0 13.0 23.0
import pandas as pd
import numpy as np
np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1,index=list('ABCDEF'),columns=list('abcdef'))
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.interpolate(axis=1))
# a b c d e f
# A 16 29 24.0 20.0 17.0 16.0
# B 28 20 NaN 13.0 17.0 12.0
# C 11 21 15.0 11.0 10.0 21.0
# D 21 26 19.0 25.0 24.0 NaN
# E 28 21 29.0 NaN NaN 28.0
# F 16 18 16.0 27.0 13.0 23.0
# a b c d e f
# A 16.0 29.0 24.0 20.000000 17.000000 16.0
# B 28.0 20.0 16.5 13.000000 17.000000 12.0
# C 11.0 21.0 15.0 11.000000 10.000000 21.0
# D 21.0 26.0 19.0 25.000000 24.000000 24.0
# E 28.0 21.0 29.0 28.666667 28.333333 28.0
# F 16.0 18.0 16.0 27.000000 13.000000 23.0
空值过滤函数
使用isnull()/notnull+any()/all()进行过滤
筛选出不含空值的行
import pandas as pd
import numpy as np
np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1,index=list('ABCDEF'),columns=list('abcdef'))
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.loc[data.notnull().all(axis=1)])
# a b c d e f
# A 16 29 24.0 20.0 17.0 16.0
# B 28 20 NaN 13.0 17.0 12.0
# C 11 21 15.0 11.0 10.0 21.0
# D 21 26 19.0 25.0 24.0 NaN
# E 28 21 29.0 NaN NaN 28.0
# F 16 18 16.0 27.0 13.0 23.0
# a b c d e f
# A 16 29 24.0 20.0 17.0 16.0
# C 11 21 15.0 11.0 10.0 21.0
# F 16 18 16.0 27.0 13.0 23.0
筛选出有空值的行
import pandas as pd
import numpy as np
np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1,index=list('ABCDEF'),columns=list('abcdef'))
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.loc[data.isnull().any(axis=1)])
# a b c d e f
# A 16 29 24.0 20.0 17.0 16.0
# B 28 20 NaN 13.0 17.0 12.0
# C 11 21 15.0 11.0 10.0 21.0
# D 21 26 19.0 25.0 24.0 NaN
# E 28 21 29.0 NaN NaN 28.0
# F 16 18 16.0 27.0 13.0 23.0
# a b c d e f
# B 28 20 NaN 13.0 17.0 12.0
# D 21 26 19.0 25.0 24.0 NaN
# E 28 21 29.0 NaN NaN 28.0
dropna()
dropna()相当于是集成后的删除空值所在列或行的函数,
axis参数用于指定是删除行还是列,为0表示删除有空值的行,为1代表删除有空值的列,
how参数可以指定为‘any’或‘all’,any表示有空值就删,all表示全为空则删,
subset参数要求赋值一个列表,比如在进行行删除的时候,可以选择用于判断是否具有空值的列,反之在进行列删除的时候可以选择判断是否具有空值的行,
inplace参数表示是否覆盖原内存
import pandas as pd
import numpy as np
np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1,index=list('ABCDEF'),columns=list('abcdef'))
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.dropna(axis=1,how='any',subset=['E','D']))
print(data.dropna(axis=0,how='any',subset=['e','d']))
# a b c d e f
# A 16 29 24.0 20.0 17.0 16.0
# B 28 20 NaN 13.0 17.0 12.0
# C 11 21 15.0 11.0 10.0 21.0
# D 21 26 19.0 25.0 24.0 NaN
# E 28 21 29.0 NaN NaN 28.0
# F 16 18 16.0 27.0 13.0 23.0
# a b c
# A 16 29 24.0
# B 28 20 NaN
# C 11 21 15.0
# D 21 26 19.0
# E 28 21 29.0
# F 16 18 16.0
# a b c d e f
# A 16 29 24.0 20.0 17.0 16.0
# B 28 20 NaN 13.0 17.0 12.0
# C 11 21 15.0 11.0 10.0 21.0
# D 21 26 19.0 25.0 24.0 NaN
# F 16 18 16.0 27.0 13.0 23.0