pandas

Creating a DataFrame / Series

pd.DataFrame({'Bob':['I liked it.','It was awful.'],'Sue':['Pretty good.','Bland.']},index=['Product A','Product B'])
pd.Series([30,35,40],index=['2015 Sales','2016 Sales','2017 Sales'],name='ProductA')

Accessing a DataFrame (indexing)

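With data.col or data['col'], select the column first, then the row (e.g. data['col'][0]).
With iloc and loc, select the row first, then the column.
- Differences between iloc and loc: 1. iloc slices are half-open (end excluded), loc slices are closed (end included). 2. loc indexes by label (or a boolean mask), iloc indexes by integer position.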
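
The two differences can be checked on a small frame. This is a minimal sketch on made-up data (the frame, its string labels and the variable names are assumptions for illustration, not part of the original notes):

```python
import pandas as pd

# Hypothetical toy data; string labels make the label/position split visible.
df = pd.DataFrame(
    {"points": [87, 90, 92, 85], "price": [15.0, 20.0, 65.0, 12.0]},
    index=["a", "b", "c", "d"],
)

# iloc: integer positions, end excluded -> rows at positions 0 and 1
by_position = df.iloc[0:2]

# loc: labels, end included -> rows "a", "b" and "c"
by_label = df.loc["a":"c"]

# Row first, then column, for both accessors
cell_by_position = df.iloc[1, 0]        # row position 1, column position 0
cell_by_label = df.loc["b", "points"]   # row label "b", column label "points"

print(by_position.index.tolist())  # ['a', 'b']
print(by_label.index.tolist())     # ['a', 'b', 'c']
```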

Conditional indexing

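For conditional indexing use loc (iloc rejects a boolean Series; it only accepts a plain boolean array). Do not use Python's logical operators and / or / not; use & | ~ together with isin and isnull/notnull, and wrap each comparison in parentheses.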

reviews.loc[ (reviews.country=='Italy') & (reviews.points>=90) ]
reviews.loc[ (reviews.country=='Italy') | (reviews.points>=90) ]
reviews.loc[ reviews.country.isin(['Italy','France']) ]

Conditional indexing also works without loc: df[df['column'] > 0]

But the advantage of loc is that it can select rows and columns at the same time:

df.loc[df['column'] > 0, 'other_column']
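
A self-contained sketch of the conditional-indexing rules, on a made-up reviews frame (the data and variable names are assumptions for illustration). It also shows why Python's `and` cannot be used on a Series:

```python
import pandas as pd

# Hypothetical sample standing in for the reviews table.
reviews = pd.DataFrame({
    "country": ["Italy", "France", "Italy", "US"],
    "points": [92, 88, 85, 95],
    "price": [30.0, None, 12.0, 50.0],
})

# AND / OR: parenthesize each comparison, because & and | bind tighter than == and >=
italy_and_90 = reviews.loc[(reviews.country == "Italy") & (reviews.points >= 90)]
italy_or_90 = reviews.loc[(reviews.country == "Italy") | (reviews.points >= 90)]

# Membership test and missing-value test
it_or_fr = reviews.loc[reviews.country.isin(["Italy", "France"])]
priced = reviews.loc[reviews.price.notnull()]

# Row mask plus column label in a single loc call
high_prices = reviews.loc[reviews.points >= 90, "price"]

# `and` asks for the truth value of a whole Series, which is ambiguous
try:
    reviews.loc[(reviews.country == "Italy") and (reviews.points >= 90)]
    and_failed = False
except ValueError:
    and_failed = True
```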

Getting to know the data

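reviews.head(n = 5) returns the first n rows (5 by default) together with the column headers; in a script the result must be passed to print to be shown

.sample(n) returns n random rows

reviews.points.describe() summarizes a column: for a numeric column it gives the count, mean, standard deviation, min, quartiles (including the median) and max; for a text column it gives the count, number of unique values, most frequent value (the mode) and its frequency. Each statistic can also be computed on its own, e.g. .mean() .median()

reviews.taster_name.unique() returns the distinct values of a column

reviews.taster_name.value_counts() returns how many times each value occurs in a column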
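
A short sketch of these summary calls on made-up data (the values and variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample standing in for the reviews table.
reviews = pd.DataFrame({
    "points": [80, 90, 90, 100],
    "taster_name": ["Ann", "Bob", "Ann", "Ann"],
})

numeric_summary = reviews.points.describe()     # count, mean, std, min, 25%, 50%, 75%, max
text_summary = reviews.taster_name.describe()   # count, unique, top, freq

tasters = reviews.taster_name.unique()          # distinct values, in order of first appearance
counts = reviews.taster_name.value_counts()     # occurrences per value, most frequent first

print(numeric_summary["mean"])   # 90.0
print(text_summary["top"])       # Ann
```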

Element-wise operations with map (the function takes a single argument, the value of one element):

review_points_mean = reviews.points.mean()
reviews.points.map(lambda p : p - review_points_mean)

.apply() can operate on each row of a DataFrame

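def remean_points(row):
    # row is a Series holding one row, indexed by the column labels
    row.points = row.points - review_points_mean
    return row

# axis='columns' (same as axis=1): the function is called once per row;
# axis='index' (same as axis=0) would call it once per column instead
reviews.apply(remean_points, axis='columns')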

Note that neither map nor apply modifies the original table; both return a new object
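
The sketch below puts map and apply side by side on made-up data and confirms that the source column is left untouched. The helper points_per_price and the sample values are assumptions for illustration:

```python
import pandas as pd

# Hypothetical sample standing in for the reviews table.
reviews = pd.DataFrame({"points": [80, 90, 100], "price": [10.0, 20.0, 40.0]})
review_points_mean = reviews.points.mean()  # 90.0

# Series.map: the function receives one value at a time
centered = reviews.points.map(lambda p: p - review_points_mean)

# DataFrame.apply with axis="columns": the function receives one row (a Series) at a time
def points_per_price(row):
    return row["points"] / row["price"]

ratio = reviews.apply(points_per_price, axis="columns")

# Both calls returned new objects; assign a result back to keep it
reviews["centered"] = centered
```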

Vectorized operations directly on whole columns

review_points_mean=reviews.points.mean()
reviews.points-review_points_mean
reviews.country+" - "+reviews.region_1

Get the index label of the maximum value with idxmax

bargain_idx=(reviews.points/reviews.price).idxmax()
bargain_wine=reviews.loc[bargain_idx, 'title']
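
idxmax returns an index label rather than a position, which is why the lookup goes through loc. A minimal sketch on made-up data with a deliberately non-default index (data and labels are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample; the index is deliberately not 0..n-1.
reviews = pd.DataFrame(
    {"title": ["Wine A", "Wine B", "Wine C"],
     "points": [90, 88, 95],
     "price": [30.0, 11.0, 95.0]},
    index=[10, 20, 30],
)

ratio = reviews.points / reviews.price            # 3.0, 8.0, 1.0
bargain_idx = ratio.idxmax()                      # index label of the largest ratio
bargain_wine = reviews.loc[bargain_idx, "title"]  # label lookup, so use loc

print(bargain_idx, bargain_wine)  # 20 Wine B
```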

Count how many descriptions (strings) contain a given keyword

n_trop=reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity=reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts=pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
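
A runnable version of the keyword count on made-up descriptions (the sample text is an assumption for illustration). The `in` test is case-sensitive, so the sketch also shows one possible alternative, Series.str.contains, which can ignore case and treat missing values as non-matches:

```python
import pandas as pd

# Hypothetical sample descriptions.
reviews = pd.DataFrame({"description": [
    "Tropical fruit aromas",                  # capital T: missed by the case-sensitive test
    "fruity and fresh, with tropical notes",
    "dry and tannic",
]})

# Summing booleans counts the True values
n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=["tropical", "fruity"])

# Vectorized alternative: plain substring match, case-insensitive, NaN counted as no match
n_trop_any_case = reviews.description.str.contains(
    "tropical", case=False, regex=False, na=False
).sum()
```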

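pandas.DataFrame.where: DataFrame.where(cond, other=nan, inplace=False, axis=None, level=None). Older pandas versions also accepted errors='raise', try_cast=False and raise_on_error=None; these parameters were deprecated and are no longer part of the signature in pandas 2.x.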

data.sort_values("Team", inplace = True)
filter1 = data["Team"]=="Atlanta Hawks"
filter2 = data["Age"]>24
data.where(filter1 & filter2, inplace = True)

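With a boolean Series as cond, every row that does not satisfy cond is replaced entirely with NaN (DataFrame.mask does the opposite: rows that satisfy cond are replaced entirely with NaN). The shape of the table is unchanged.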
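
A small sketch contrasting where, mask and plain boolean indexing on made-up data (the player rows are assumptions for illustration). It uses the non-inplace form so the original frame stays available for comparison:

```python
import pandas as pd

# Hypothetical sample data.
data = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Team": ["Atlanta Hawks", "Atlanta Hawks", "Boston Celtics"],
    "Age": [30, 22, 28],
})

filter1 = data["Team"] == "Atlanta Hawks"
filter2 = data["Age"] > 24

kept = data.where(filter1 & filter2)    # rows failing the condition become NaN
hidden = data.mask(filter1 & filter2)   # rows meeting the condition become NaN
selected = data[filter1 & filter2]      # boolean indexing drops the other rows instead

print(kept.shape, selected.shape)  # (3, 3) (1, 3)
```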

Usage of concat() in pandas: see the Zhihu article "pandas中concat()的用法" (zhihu.com)

Iterating over a pandas DataFrame by row / by column:

Correct approach:

for i, row in df.iterrows():
    # i is the row's index label, row is the row data (a Series);
    # row.iloc[0] is the first column of that row (by position)
    print(row.iloc[0])

Incorrect approach (the index labels are not necessarily a contiguous 0..n-1 range!):

for i in range(df.shape[0]) :
	print(df.loc[i][0])

To iterate by column, use for col in df.columns
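
The sketch below reproduces the failure on a frame whose index is not 0..n-1, and shows position-based access with iloc as one possible workaround. The data and variable names are assumptions for illustration:

```python
import pandas as pd

# Hypothetical frame with a non-contiguous index, e.g. what remains after filtering rows.
df = pd.DataFrame({"name": ["x", "y", "z"], "score": [1, 2, 3]}, index=[10, 20, 30])

first_values = []
for i, row in df.iterrows():
    # i is the index label (10, 20, 30); row is a Series holding that row
    first_values.append(row.iloc[0])  # first column, by position

# range(df.shape[0]) yields 0, 1, 2, and none of these is a label here
try:
    df.loc[0]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# If positions are what is wanted, iloc works regardless of the index labels
by_position = [df.iloc[i, 0] for i in range(df.shape[0])]

# Column-wise iteration: df.columns yields the column labels
column_names = [col for col in df.columns]
```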