yjs12: Handling missing values in pandas
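1. How missing values are represented
Normally pandas represents a missing value as NaN, but some files use a different symbol of their own instead (for example "?").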
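To make the difference concrete, here is a small sketch with made-up data (the column names and values are hypothetical, not taken from the files used later). It shows that pandas only counts a real NaN as missing, while a placeholder such as "?" is treated as an ordinary string.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one column with a real NaN,
# one column where the gap is written as the string "?".
toy = pd.DataFrame({
    "salary": [26.5, np.nan, 30.96],
    "salary_raw": ["26.5", "?", "30.96"],
})

print(toy)

# Count missing values per column: only the real NaN is detected.
missing_counts = pd.isnull(toy).sum()
print(missing_counts)
```

The "?" column reports zero missing values, which is why such symbols have to be converted to NaN before any missing-value check.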
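2. How to convert a missing-value symbol to NaN
xxx.replace(to_replace="...", value=np.nan)
replace returns a new object and leaves xxx unchanged, so assign the result, e.g. xxx = xxx.replace(...).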
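A minimal sketch of this conversion, assuming the placeholder is "?" and using a tiny in-memory CSV (the text and values are made up for illustration). It also shows one possible alternative, the na_values parameter of pd.read_csv, which converts the symbol while the file is being read.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content where "?" marks a missing salary.
csv_text = "PLAYER,SALARY_MILLIONS\nA,26.5\nB,?\nC,30.96\n"

# Option 1: read first, then replace. replace returns a new DataFrame.
raw = pd.read_csv(io.StringIO(csv_text))
cleaned = raw.replace(to_replace="?", value=np.nan)
# The column was read as text because of "?", so convert it to numbers.
cleaned["SALARY_MILLIONS"] = pd.to_numeric(cleaned["SALARY_MILLIONS"])

# Option 2: tell read_csv which symbol means "missing".
at_read = pd.read_csv(io.StringIO(csv_text), na_values=["?"])

print(cleaned)
print(at_read)
```

Note that raw itself is left untouched by replace; only cleaned contains the NaN.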
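3. Checking whether there are missing values
1. pd.notnull(xxx): returns False at every position where a value is missing
2. pd.isnull(xxx): returns True at every position where a value is missing
When checking for missing values, we usually want one overall answer, not a True/False table for every cell.
So these functions are usually combined with NumPy's all or any:
notnull: every missing position is False, and the outer np.all returns False as soon as not everything is True, so a single missing value makes the final result False: np.all(pd.notnull(data))
isnull: every missing position is True, and the outer np.any returns True as soon as any True exists, so a single missing value makes the final result True: np.any(pd.isnull(data))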
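The two pairings can be tried on a small hypothetical DataFrame (the values below are made up). The sketch also adds a per-column count with isnull().sum(), which is not part of the notes above but is often handy for seeing where the gaps are.

```python
import numpy as np
import pandas as pd

# Hypothetical data with exactly one missing value.
data = pd.DataFrame({
    "W": [46, 54, 51],
    "SALARY_MILLIONS": [26.5, np.nan, 30.96],
})

# np.all + notnull: True only if nothing is missing.
no_missing = bool(np.all(pd.notnull(data)))

# np.any + isnull: True as soon as one value is missing.
has_missing = bool(np.any(pd.isnull(data)))

# Number of missing values in each column.
missing_per_column = data.isnull().sum()

print(no_missing, has_missing)
print(missing_per_column)
```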
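4. Replacing and deleting missing values
# Replace: xx = xxx.fillna(replacement_value). With the default inplace=False the original data is not changed and a filled copy is returned. With inplace=True the original data is modified in place, so call it on its own line rather than relying on its return value.
# Delete: xx = xxx.dropna(). By default dropna does not change the original data; it returns a new object.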
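A short sketch of these semantics on a standalone Series (hypothetical values). It only illustrates which object ends up changed; the fill value 0.0 is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

salary = pd.Series([26.5, np.nan, 30.96], name="SALARY_MILLIONS")

# Default behaviour: a new Series is returned, salary is unchanged.
filled = salary.fillna(0.0)
dropped = salary.dropna()

# inplace=True: the object the method is called on is modified.
in_place = salary.copy()
in_place.fillna(0.0, inplace=True)

print(salary.isnull().sum())    # the original still has its NaN
print(filled.tolist())
print(len(dropped))
print(in_place.isnull().sum())
```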
A common way to write it:
for i in data_drop.columns:
    if np.any(pd.isnull(data_drop[i])):
        # dropna returns a new object, so assign the result back
        data_drop = data_drop.dropna(subset=[i])
print(data_drop)
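As a self-contained sketch of the same loop pattern, the example below fills each column that has gaps with that column's mean and, separately, drops the rows that contain gaps. The DataFrame is made up, and the mean fill assumes that every column with missing values is numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only numeric columns contain NaN here.
data = pd.DataFrame({
    "PLAYER": ["A", "B", "C", "D"],
    "W": [46.0, np.nan, 51.0, 31.0],
    "SALARY_MILLIONS": [26.5, 26.5, np.nan, 31.0],
})

# Work on an explicit copy so that data itself stays untouched.
data_replace = data.copy()
for col in data_replace.columns:
    if data_replace[col].isnull().any():
        # Assign the filled column back, otherwise the result is lost.
        data_replace[col] = data_replace[col].fillna(data_replace[col].mean())

# Drop every row that has at least one missing value.
data_drop = data.dropna()

print(data_replace)
print(data_drop)
```

Using data.copy() rather than plain assignment matters: data_replace = data would only create a second name for the same DataFrame.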
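Code:
# Handling missing values
import pandas as pd
import numpy as np
# 1. Load the file
data = pd.read_csv("E:/研究生/机器学习/百度云笔记/data/s1.csv")
# Plain assignment does not copy: both names refer to the same DataFrame as data
data_replace = data
data_drop = data
# 2. Check whether there are missing values
print(np.all(pd.notnull(data)))
# notnull: every missing position is False; the outer np.all returns False as soon as not everything is True, so any missing value gives False
print(np.any(pd.isnull(data)))
# isnull: every missing position is True; the outer np.any returns True as soon as any True exists, so any missing value gives True
# Note the pairing: np.all goes with notnull, np.any goes with isnull
# 3. Find missing values, then replace or delete them
# Replace: xxx.fillna(replacement_value, inplace=True or False)
for i in data_replace.columns:
    if np.any(pd.isnull(data_replace[i])):
        # inplace=False returns a filled copy; it is not assigned here,
        # so data_replace keeps its NaN. Assign the result to keep it.
        data_replace[i].fillna(data_replace[i].mean(), inplace=False)
print(data)
print(data_replace)
# Delete: xxx.dropna()
for i in data_drop.columns:
    if np.any(pd.isnull(data_drop[i])):
        # dropna also returns a new object; without assignment data_drop is unchanged
        data_drop[i].dropna()
print(data)
print(data_drop)
# What if the missing values are not in NaN form?
data1 = pd.read_csv("E:/研究生/机器学习/百度云笔记/data/Salary_1.csv")
# replace returns a new DataFrame; without assignment data1 still contains "?"
data1.replace(to_replace="?", value=np.nan)
print(data1)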
Result:
False
True
Unnamed: 0 Rk PLAYER ... PACE W SALARY_MILLIONS
0 0 1 Russell Westbrook ... 102.31 46 26.50
1 1 2 James Harden ... 102.98 54 26.50
2 2 3 Isaiah Thomas ... 99.84 51 26.50
3 3 4 Anthony Davis ... 100.19 31 NaN
4 4 6 DeMarcus Cousins ... 97.11 30 NaN
5 5 7 Damian Lillard ... 99.68 38 24.33
6 6 8 LeBron James ... 98.38 51 30.96
7 7 9 Kawhi Leonard ... 95.79 54 31.30

[8 rows x 13 columns]
Unnamed: 0 Rk PLAYER ... PACE W SALARY_MILLIONS
0 0 1 Russell Westbrook ... 102.31 46 26.50
1 1 2 James Harden ... 102.98 54 26.50
2 2 3 Isaiah Thomas ... 99.84 51 26.50
3 3 4 Anthony Davis ... 100.19 31 NaN
4 4 6 DeMarcus Cousins ... 97.11 30 NaN
5 5 7 Damian Lillard ... 99.68 38 24.33
6 6 8 LeBron James ... 98.38 51 30.96
7 7 9 Kawhi Leonard ... 95.79 54 31.30

[8 rows x 13 columns]
Unnamed: 0 Rk PLAYER ... PACE W SALARY_MILLIONS
0 0 1 Russell Westbrook ... 102.31 46 26.50
1 1 2 James Harden ... 102.98 54 26.50
2 2 3 Isaiah Thomas ... 99.84 51 26.50
3 3 4 Anthony Davis ... 100.19 31 NaN
4 4 6 DeMarcus Cousins ... 97.11 30 NaN
5 5 7 Damian Lillard ... 99.68 38 24.33
6 6 8 LeBron James ... 98.38 51 30.96
7 7 9 Kawhi Leonard ... 95.79 54 31.30

[8 rows x 13 columns]
Unnamed: 0 Rk PLAYER ... PACE W SALARY_MILLIONS
0 0 1 Russell Westbrook ... 102.31 46 26.50
1 1 2 James Harden ... 102.98 54 26.50
2 2 3 Isaiah Thomas ... 99.84 51 26.50
3 3 4 Anthony Davis ... 100.19 31 NaN
4 4 6 DeMarcus Cousins ... 97.11 30 NaN
5 5 7 Damian Lillard ... 99.68 38 24.33
6 6 8 LeBron James ... 98.38 51 30.96
7 7 9 Kawhi Leonard ... 95.79 54 31.30

[8 rows x 13 columns]
Rk PLAYER POSITION AGE ... PIE PACE W SALARY_MILLIONS
0 1 Russell Westbrook PG 28 ... 23.0 102.31 46 26.5
1 2 James Harden PG 27 ... 19.0 102.98 54 26.5
2 3 Isaiah Thomas PG 27 ... 16.1 99.84 51 26.5
3 4 Anthony Davis C 23 ... 19.2 100.19 31 ?
4 6 DeMarcus Cousins C 26 ... 17.8 97.11 30 16.96
5 7 Damian Lillard PG 26 ... 15.9 99.68 38 24.33
6 8 LeBron James SF 32 ... 18.3 98.38 51 30.96
7 9 Kawhi Leonard SF 25 ... 17.4 95.79 54 ?

[8 rows x 12 columns]
Process finished with exit code 0
Notes:
1. any and all are NumPy functions; any pairs with isnull, and all pairs with notnull.
2. How isnull, notnull, fillna, and dropna are written:
pd.~null(dataset); dataset.dropna()
3. Note the find-and-replace pattern: for i in data.columns, then data[i]...
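For comparison with the per-column loop in note 3, here is one possible loop-free variant. It is only a sketch on made-up data: it assumes that filling with the column mean is what you want and that only numeric columns have gaps. DataFrame.mean(numeric_only=True) returns one mean per numeric column, and fillna accepts that Series and matches it to columns by name.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in two numeric columns.
data = pd.DataFrame({
    "PLAYER": ["A", "B", "C"],
    "PACE": [102.31, np.nan, 99.84],
    "SALARY_MILLIONS": [26.5, 26.5, np.nan],
})

# One mean per numeric column; the text column is skipped.
column_means = data.mean(numeric_only=True)

# Fill each column's gaps with its own mean; data is not modified.
filled = data.fillna(column_means)

print(filled)
```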