Data Mining Workflow: Data Visualization and Preprocessing

Once we have a dataset in hand, the first step of data mining is of course to inspect and preprocess it. This post is a personal summary of these two aspects.

import: Loading the Packages

Step 0: import the packages.

Usually numpy, pandas, scipy, seaborn, and matplotlib are must-haves.
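
A typical import cell covering everything the snippets below rely on (the aliases match the code in this post; scipy's stats/norm and scikit-learn's StandardScaler are used in the preprocessing sections):

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler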

To suppress the annoying warnings, we also need:

import warnings
warnings.filterwarnings('ignore')

because we only care about errors, not warnings 😂

Set seaborn's plotting style:

sns.set_style('whitegrid')

and, in a Jupyter notebook:

%matplotlib inline

Observation and Visualization

Generally, we should start by looking at what columns are available and at the distributions of the target and the features. A pandas DataFrame's columns attribute lists the fields, while value_counts() and describe() summarize individual columns. For example:

df_train.columns

#value_counts() of a categorical column (train_df here comes from a different example dataset)
int_level = train_df['interest_level'].value_counts()

df_train['SalePrice'].describe()

describe() reports the summary statistics of the column, e.g.:

count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

The distribution can also be visualized with seaborn:

sns.distplot(df_train['SalePrice']);  # distplot became histplot/displot in seaborn >= 0.11

Next we can look at the skewness and kurtosis of the target:

print("Skewness: %f" % df_train['SalePrice'].skew())
print("Kurtosis: %f" % df_train['SalePrice'].kurt())

Feature-Target Relationships

Numeric features can be shown with a scatter plot:

var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));

For categorical features, use a box plot instead:

var = 'OverallQual'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

Time features are treated similarly:

var = 'YearBuilt'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
plt.xticks(rotation=90);

Feature-Feature Relationships

A heatmap of the correlation matrix shows the relationships among all features at once:

#correlation matrix
corrmat = df_train.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);

The correlation matrix gives an overall picture of how the features relate; as a next step, a zoomed heatmap can pick out a few features and examine the relationships among them:

#saleprice correlation matrix
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

The real heavy artillery is the scatter plot matrix:

#scatterplot
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(df_train[cols], size=2.5)  # 'size' was renamed to 'height' in seaborn >= 0.9
plt.show();

Preprocessing

Missing Data

First, look at which features contain missing values:

#missing data
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

This prints the number and percentage of missing entries for each feature that has them.

Of course, different situations call for different treatments of missing values. Here we simply drop every feature with more than one missing value, then delete the one sample whose Electrical value is missing; a sketch of one alternative follows the code.

#dealing with missing data
df_train = df_train.drop((missing_data[missing_data['Total'] > 1]).index, axis=1)
df_train = df_train.drop(df_train.loc[df_train['Electrical'].isnull()].index)
df_train.isnull().sum().max()  # should now be 0
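
As one alternative, a minimal sketch that imputes instead of dropping, filling numeric columns with the median and categorical columns with the most frequent value (this rule of thumb is an assumption, not part of the original recipe):

#illustrative alternative: impute instead of dropping
for col in df_train.columns[df_train.isnull().any()]:
    if df_train[col].dtype == 'object':
        #categorical: fill with the most frequent value
        df_train[col] = df_train[col].fillna(df_train[col].mode()[0])
    else:
        #numeric: fill with the median, which is robust to outliers
        df_train[col] = df_train[col].fillna(df_train[col].median())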

Outliers

For the single-feature case, we can standardize the data and look at the extremes:

#standardizing data
saleprice_scaled = StandardScaler().fit_transform(df_train['SalePrice'].values[:, np.newaxis])
low_range = saleprice_scaled[saleprice_scaled[:, 0].argsort()][:10]
high_range = saleprice_scaled[saleprice_scaled[:, 0].argsort()][-10:]
print('outer range (low) of the distribution:')
print(low_range)
print('\nouter range (high) of the distribution:')
print(high_range)
outer range (low) of the distribution:
[[-1.83820775]
 [-1.83303414]
 [-1.80044422]
 [-1.78282123]
 [-1.77400974]
 [-1.62295562]
 [-1.6166617 ]
 [-1.58519209]
 [-1.58519209]
 [-1.57269236]]

outer range (high) of the distribution:
[[ 3.82758058]
 [ 4.0395221 ]
 [ 4.49473628]
 [ 4.70872962]
 [ 4.728631  ]
 [ 5.06034585]
 [ 5.42191907]
 [ 5.58987866]
 [ 7.10041987]
 [ 7.22629831]]

For two-feature relationships, just as with feature vs. target, concat first and then scatter:

#bivariate analysis saleprice/grlivarea
var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));

Outliers can then simply be dropped. For example:
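
A minimal sketch, assuming the points to delete are the ones that stand out in the GrLivArea scatter plot above, with very large living area but suspiciously low price (the thresholds are illustrative assumptions):

#assumed thresholds: drop points with very large GrLivArea but low SalePrice
outlier_idx = df_train[(df_train['GrLivArea'] > 4000)
                       & (df_train['SalePrice'] < 300000)].index
df_train = df_train.drop(outlier_idx)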

Normality

A histogram and a normal probability plot reveal the skewness and kurtosis of the data and whether they follow a normal distribution:

sns.distplot(df_train['SalePrice'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)


For positive skew, taking the log usually works well:

#applying log transformation
df_train['SalePrice'] = np.log(df_train['SalePrice'])

#transformed histogram and normal probability plot
sns.distplot(df_train['SalePrice'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)


Sometimes we encounter a feature with many zero values whose histogram and normal probability plot show skew. One way to handle it: take the log of the nonzero entries and leave the zeros alone.

#create a binary indicator column (one is enough because it's a binary categorical feature)
#if area > 0 it gets 1, for area == 0 it gets 0
df_train['HasBsmt'] = 0
df_train.loc[df_train['TotalBsmtSF'] > 0, 'HasBsmt'] = 1

#transform only the nonzero entries
df_train.loc[df_train['HasBsmt'] == 1, 'TotalBsmtSF'] = np.log(df_train['TotalBsmtSF'])

For the target, if the skew is too large, log1p can be applied; compare the distributions before and after:

matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)
prices = pd.DataFrame({"price": train["SalePrice"],
                       "log(price + 1)": np.log1p(train["SalePrice"])})
prices.hist()

To transform the full dataset, concat test and train and process them in one go:

all_data = pd.concat((train.loc[:, "MSSubClass":"SaleCondition"],
                      test.loc[:, "MSSubClass":"SaleCondition"]))
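
What "processing together" looks like depends on the features; here is a hedged sketch (the get_dummies/fillna choices are illustrative assumptions) that also splits the frame back afterwards:

#illustrative joint processing: one-hot encode, fill NAs, then split back
all_data = pd.get_dummies(all_data)
all_data = all_data.fillna(all_data.mean())
X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]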

Finally, the predictions need to be mapped back to the original scale; np.expm1 inverts np.log1p:

lasso_preds = np.expm1(model_lasso.predict(X_test))
