章节2 行走数据江湖,只需一行代码
目录
- 6. 函数填充,计算列
- 6.1 excel操作
- 6.2 pandas操作1
- 6.3 pandas操作2
- 8. 数据筛选、过滤,[绘图前的必备功课]
- 8.1 excel操作
- 8.2 Python操作
http://sa.mentorx.net 蔓藤教育
6. 函数填充,计算列
书的编号、书的名字、标价、折扣、最终价钱
最终价钱Price=ListPrice * Discount = 标价 * 折扣
6.1 excel操作
6.2 pandas操作1
import pandas as pd
books = pd.read_excel('C:/Temp/Books.xlsx', index_col='ID')
print(books)
填充Price列,在excel中我们操作的是单元格,而pandas中我们操作的是列
- 操作符的重载:比如下面的 *(乘号操作符),当它左右两边是两列时,它就把两列前后对其,一个单元格乘以一个单元格的乘起来。
import pandas as pd
books = pd.read_excel('C:/Temp/Books.xlsx', index_col='ID')
books['Price'] = books['ListPrice'] * books['Discount']
print(books)
乘以 一个数也是可以的:
import pandas as pd
books = pd.read_excel('C:/Temp/Books.xlsx', index_col='ID')
books['Price'] = books['ListPrice'] * 0.8
print(books)
用循环来迭代DataFrame:(有点类似excel的单元格对单元格操作)
import pandas as pd
books = pd.read_excel('C:/Temp/Books.xlsx', index_col='ID')
for i in books.index:
books['Price'].at[i] = books['ListPrice'].at[i] * books['Discount'].at[i]
print(books)
运算的时候,不想从头到尾运算,而是从其中的某一段开始运算
import pandas as pd
books = pd.read_excel('C:/Temp/Books.xlsx', index_col='ID')
for i in range(5, 16):
books['Price'].at[i] = books['ListPrice'].at[i] * books['Discount'].at[i]
print(books)
6.3 pandas操作2
现在,每本书要涨价2元,
import pandas as pd
books = pd.read_excel('C:/Temp/Books.xlsx', index_col='ID')
books['ListPrice'] = books['ListPrice'] + 2
print(books)
调用series的apply()函数来实现上面的功能:
import pandas as pd
def add_2(x):
return x + 2
books = pd.read_excel('C:/Temp/Books.xlsx', index_col='ID')
books['ListPrice'] = books['ListPrice'].apply(add_2)
print(books)
将得到上图同样的结果。
进一步简化代码:
import pandas as pd
books = pd.read_excel('C:/Temp/Books.xlsx', index_col='ID')
books['ListPrice'] = books['ListPrice'].apply(lambda x: x + 2)
print(books)
8. 数据筛选、过滤,[绘图前的必备功课]
8.1 excel操作
筛选,18<=年龄<=30的学生的分数状况,且分数>80的学生
全部选中,然后筛选即可。
8.2 Python操作
读取的时候,将 ‘ID’ 作为 index ,
import pandas as pd
students = pd.read_excel('C:/Temp/Students.xlsx', index_col='ID')
筛选数据:用函数的形式来表达条件
pd.series有apply()方法,
import pandas as pd
students = pd.read_excel('C:/Temp/Students.xlsx', index_col='ID')
students = students.loc[students['Age'].apply[age_18_to_30]]
print(students)
import pandas as pd
def age_18_to_30(a):
return 18 <= a <30
def level_a(s):
return 85 <=s <=100
students = pd.read_excel('C:/Temp/Students.xlsx', index_col='ID')
students = students.loc[students['Age'].apply[age_18_to_30]].loc[students['Score'].apply(level_a)]
print(students)
另一种写法:
students = students.loc[students.Age.apply[age_18_to_30]].loc[students.Score.apply(level_a)]
进一步优化代码:
students = students.loc[students.Age.apply[lambda a: 18<=a<30]].loc[students.Score.apply(lambda s: 85<=s<=100)]
代码太长,可以打一个 空格+ ’ \ ',然后回车即可
students = students.loc[students.Age.apply[lambda a: 18<=a<30]] \
.loc[students.Score.apply(lambda s: 85<=s<=100)]