pandas

Creating a DataFrame/Series

import pandas as pd

pd.DataFrame({'Bob':['I liked it.','It was awful.'],'Sue':['Pretty good.','Bland.']},index=['Product A','Product B'])
pd.Series([30,35,40],index=['2015 Sales','2016 Sales','2017 Sales'],name='ProductA')

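Accessing a DataFrame (indexing)

Conditional indexing

Conditional indexing uses loc (iloc does not accept a boolean Series as a mask). Do not combine conditions with Python's logical operators and / or / not; use & | ~ instead, together with isin and isnull/notnull. Wrap each comparison in parentheses, because & and | bind more tightly than == and >=.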

reviews.loc[ (reviews.country=='Italy') & (reviews.points>=90) ]
reviews.loc[ (reviews.country=='Italy') | (reviews.points>=90) ]
reviews.loc[ reviews.country.isin(['Italy','France']) ]
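A minimal sketch of these operators on a tiny, made-up `reviews` table (the data is hypothetical, not the dataset these notes refer to). It also shows `~` for negation and the error raised when Python's `and` is used by mistake.

```python
import pandas as pd

# Hypothetical stand-in for the `reviews` table used in these notes.
reviews = pd.DataFrame({
    'country': ['Italy', 'France', 'Italy', 'US', None],
    'points': [92, 88, 85, 95, 90],
})

# Element-wise AND / OR; each comparison needs its own parentheses.
italy_and_90 = reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]
italy_or_90 = reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]

# Membership test, missing-value test, and negation.
it_or_fr = reviews.loc[reviews.country.isin(['Italy', 'France'])]
has_country = reviews.loc[reviews.country.notnull()]
not_italy = reviews.loc[~(reviews.country == 'Italy')]

# Python's `and` asks for a single truth value of a whole Series and fails.
try:
    reviews.loc[(reviews.country == 'Italy') and (reviews.points >= 90)]
    and_failed = False
except ValueError as err:
    and_failed = True
    print("`and` fails:", err)
```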

Conditional indexing also works without loc: df[df['column'] > 0]

The advantage of loc is that it can select rows and columns at the same time:

df.loc[df['column'] > 0, 'other_column']
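As a small illustration, assuming a made-up `df` with the two column names used above, the sketch below selects rows by condition and a column by label in one step, and then uses the same form to assign values.

```python
import pandas as pd

# Hypothetical frame; the index labels are arbitrary.
df = pd.DataFrame(
    {'column': [-1, 2, 3], 'other_column': ['a', 'b', 'c']},
    index=['x', 'y', 'z'],
)

# Row mask plus a single column label gives a Series.
selected = df.loc[df['column'] > 0, 'other_column']

# Row mask plus a list of labels gives a DataFrame.
selected_frame = df.loc[df['column'] > 0, ['other_column']]

# The same row/column form can be used on the left side of an assignment.
df.loc[df['column'] > 0, 'other_column'] = 'positive'
```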

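Exploring the data

reviews.head(n=5) returns the column headers and the first n rows (5 by default); outside a notebook, wrap it in print() to see the output

.sample(n) returns n random rows

reviews.points.describe() reports the count, mean, standard deviation, min, quartiles (including the median), and max of a numeric column. It does not report the mode; for a text column it reports count, unique, top (the most frequent value), and freq instead. Each statistic can also be computed on its own, e.g. .mean() .median()

reviews.taster_name.unique() returns the distinct values in a column

reviews.taster_name.value_counts() returns how many times each value occurs in a column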
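A quick sketch of these inspection calls on a made-up four-row `reviews` table (hypothetical data, chosen so the results are easy to check by hand):

```python
import pandas as pd

# Hypothetical stand-in for the `reviews` table.
reviews = pd.DataFrame({
    'points': [80, 90, 100, 90],
    'taster_name': ['Ann', 'Bob', 'Ann', 'Ann'],
})

print(reviews.head(2))                        # first two rows
stats = reviews.points.describe()             # summary of a numeric column
names = reviews.taster_name.unique()          # distinct values, in order of appearance
counts = reviews.taster_name.value_counts()   # occurrences per value, most frequent first

print(stats)
print(counts)
```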

Use map for element-wise transforms (the function receives a single value at a time):

review_points_mean = reviews.points.mean()
reviews.points.map(lambda p : p - review_points_mean)

DataFrame.apply() can run a function on every row

def remean_points(row):
	row.points = row.points - review_points_mean
	return row

# axis='columns' passes each row to the function; axis='index' would pass each column
reviews.apply(remean_points, axis='columns')

Note that neither map nor apply modifies the original table; both return a new Series/DataFrame
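The sketch below checks this on a small, made-up `reviews` table. The explicit `row.copy()` is my own defensive addition, not something the notes require; to keep a result, assign it back yourself.

```python
import pandas as pd

# Hypothetical stand-in for the `reviews` table.
reviews = pd.DataFrame({'points': [80, 90, 100], 'price': [10.0, 20.0, 30.0]})
review_points_mean = reviews.points.mean()

# map: element-wise on one Series, returns a new Series.
centered = reviews.points.map(lambda p: p - review_points_mean)

def remean_points(row):
    row = row.copy()  # work on a copy so the input row is never touched
    row['points'] = row['points'] - review_points_mean
    return row

# apply with axis='columns': the function is called once per row.
centered_frame = reviews.apply(remean_points, axis='columns')

# The original column is unchanged after both calls.
print(reviews.points.tolist())
```

To store the result, assign it explicitly, for example `reviews['points_centered'] = centered`.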

Vectorized operations directly on whole columns

review_points_mean=reviews.points.mean()
reviews.points-review_points_mean
reviews.country+" - "+reviews.region_1
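A short sketch of these column-wise operations on made-up data. One detail worth knowing: a missing value on either side of the string concatenation makes that row's result missing as well. The `fillna('')` at the end is just one possible workaround, not something these notes prescribe.

```python
import pandas as pd

# Hypothetical stand-in for the `reviews` table; region_1 has one missing value.
reviews = pd.DataFrame({
    'points': [80, 90, 100],
    'country': ['Italy', 'France', 'US'],
    'region_1': ['Etna', None, 'Napa'],
})

review_points_mean = reviews.points.mean()
centered = reviews.points - review_points_mean          # scalar applied to every row
combined = reviews.country + " - " + reviews.region_1    # row-by-row string concatenation

# Replace missing regions with an empty string before concatenating.
combined_filled = reviews.country + " - " + reviews.region_1.fillna('')
```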

Get the index label of the maximum value with idxmax

bargain_idx=(reviews.points/reviews.price).idxmax()
bargain_wine=reviews.loc[bargain_idx, 'title']
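A minimal sketch with made-up data and a deliberately non-default index, to show that idxmax returns an index label (so it pairs with loc, not iloc):

```python
import pandas as pd

# Hypothetical stand-in for the `reviews` table, indexed 10/20/30 on purpose.
reviews = pd.DataFrame(
    {
        'title': ['Wine A', 'Wine B', 'Wine C'],
        'points': [90, 85, 88],
        'price': [30.0, 10.0, 44.0],
    },
    index=[10, 20, 30],
)

ratio = reviews.points / reviews.price       # 3.0, 8.5, 2.0
bargain_idx = ratio.idxmax()                 # label of the largest ratio
bargain_wine = reviews.loc[bargain_idx, 'title']
print(bargain_idx, bargain_wine)
```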

Count how many descriptions (strings) contain a given keyword

n_trop=reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity=reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts=pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
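The same count can also be written with the vectorized string accessor. The sketch below runs both versions on four made-up descriptions; `regex=False` makes `str.contains` do a plain substring match, like Python's `in`.

```python
import pandas as pd

# Hypothetical descriptions.
reviews = pd.DataFrame({'description': [
    'tropical fruit notes',
    'fruity and bright',
    'dry, tropical, fruity',
    'oaky',
]})

# Version from the notes: map a boolean test, then sum the True values.
n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()

# Equivalent vectorized form.
n_trop_vec = reviews.description.str.contains("tropical", regex=False).sum()
n_fruity_vec = reviews.description.str.contains("fruity", regex=False).sum()

descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
print(descriptor_counts)
```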

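pandas.DataFrame.where

DataFrame.where(cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False, raise_on_error=None)

This signature comes from an older pandas release; pandas 2.x no longer accepts errors, try_cast, or raise_on_error.

data.sort_values("Team", inplace=True)
filter1 = data["Team"] == "Atlanta Hawks"
filter2 = data["Age"] > 24
data.where(filter1 & filter2, inplace=True)

Rows that do not satisfy cond are replaced entirely with NaN, so the shape of the table is unchanged (DataFrame.mask does the opposite: rows that satisfy cond are replaced with NaN)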
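A small sketch contrasting where, mask, and plain boolean filtering on a made-up three-row `data` table (hypothetical values; only the column names come from the example above):

```python
import pandas as pd

# Hypothetical stand-in for the `data` table.
data = pd.DataFrame({
    'Team': ['Atlanta Hawks', 'Atlanta Hawks', 'Boston Celtics'],
    'Age': [22, 27, 30],
})

filter1 = data['Team'] == 'Atlanta Hawks'
filter2 = data['Age'] > 24
cond = filter1 & filter2          # True only for the second row

kept = data.where(cond)           # same shape; non-matching rows become NaN
masked = data.mask(cond)          # same shape; matching rows become NaN
rows_only = data[cond]            # boolean filtering drops non-matching rows

print(kept)
print(masked)
print(rows_only)
```

Use where/mask when the result must stay aligned with the original index; use boolean filtering when the non-matching rows should disappear.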

How to use concat() in pandas - Zhihu (zhihu.com)
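Independently of the linked article, here is a minimal sketch of the most common pd.concat calls, using two made-up frames `a` and `b`:

```python
import pandas as pd

# Two hypothetical frames with the same column and the same index labels.
a = pd.DataFrame({'x': [1, 2]}, index=[0, 1])
b = pd.DataFrame({'x': [3, 4]}, index=[0, 1])

# Stack rows; the original index labels are kept, so they repeat.
stacked = pd.concat([a, b])

# Stack rows and renumber the index from 0.
renumbered = pd.concat([a, b], ignore_index=True)

# Place frames side by side, aligned on the index.
side_by_side = pd.concat([a, b.rename(columns={'x': 'y'})], axis=1)
```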

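Iterating over a DataFrame by row/column:

Correct approach:

for i, row in df.iterrows():
	# i is the row's index label; row is the row data (a Series); row.iloc[0] is the first column of that row
	print(row.iloc[0])

Incorrect approach (the index labels are not necessarily 0, 1, 2, ...!):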

for i in range(df.shape[0]) :
	print(df.loc[i][0])

To iterate by column, use for col in df.columns
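The sketch below reproduces the pitfall on a made-up `df` whose index is 10/20/30: looking rows up by `range()` values through loc fails, while iterrows, positional iloc, and column iteration all work.

```python
import pandas as pd

# Hypothetical frame with a non-contiguous index.
df = pd.DataFrame({'name': ['a', 'b', 'c'], 'value': [1, 2, 3]}, index=[10, 20, 30])

# Row iteration: i is the label, row is a Series.
first_cols = []
for i, row in df.iterrows():
    first_cols.append(row.iloc[0])

# loc is label-based, so label 0 does not exist here.
try:
    df.loc[0]
    loc_failed = False
except KeyError:
    loc_failed = True

# If positions are really wanted, iloc is the positional counterpart.
by_position = [df.iloc[pos, 0] for pos in range(df.shape[0])]

# Column iteration: col is the column name.
column_values = {col: df[col].tolist() for col in df.columns}
print(first_cols, loc_failed, by_position, column_values)
```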