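Python grade assessment system (a brief look at score-based analysis in Python)
Case study: this dataset records the scores of individual students. Below we analyze it to judge whether a student is suited to pursue further study.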
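Dataset features
1. GRE score (290 to 340)
2. TOEFL score (92 to 120)
3. University rating (1 to 5)
4. SOP, statement of purpose, taken here as the student's own motivation (1 to 5)
5. LOR, strength of the recommendation letter (1 to 5)
6. CGPA (6.8 to 9.92)
7. Research experience (0 or 1)
8. Chance of admit, taken here as the likelihood of going on to a master's degree (0.34 to 0.97)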
1. Import packages
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import os, sys
2. Load and inspect the dataset
    df = pd.read_csv("d:\\machine-learning\\score\\admission_predict.csv", sep=",")
    print('there are', len(df.columns), 'columns')
    for c in df.columns:
        sys.stdout.write(str(c) + ', ')

    there are 9 columns
    Serial No., GRE Score, TOEFL Score, University Rating, SOP, LOR , CGPA, Research, Chance of Admit , 

There are 9 columns in total.
    df.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 400 entries, 0 to 399
    Data columns (total 9 columns):
    Serial No.           400 non-null int64
    GRE Score            400 non-null int64
    TOEFL Score          400 non-null int64
    University Rating    400 non-null int64
    SOP                  400 non-null float64
    LOR                  400 non-null float64
    CGPA                 400 non-null float64
    Research             400 non-null int64
    Chance of Admit      400 non-null float64
    dtypes: float64(4), int64(5)
    memory usage: 28.2 KB

Dataset summary:
1. There are 9 columns: serial number, GRE score, TOEFL score, university rating, SOP, LOR, CGPA, research experience, and chance of admit.
2. The dataset contains no null values.
3. There are 400 records in total.
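The three facts in this summary (column count, absence of nulls, row count) can also be checked in code instead of being read off the `df.info()` printout. A minimal sketch, assuming a tiny hand-made frame (`demo`, with made-up values) in place of the real CSV:

```python
import pandas as pd

# Hypothetical three-row stand-in for the real CSV; the values are made up.
demo = pd.DataFrame({
    "GRE Score": [337, 324, 316],
    "TOEFL Score": [118, 107, 104],
    "CGPA": [9.65, 8.87, 8.00],
})

n_rows, n_cols = demo.shape                   # (rows, columns)
n_missing = int(demo.isnull().sum().sum())    # total number of missing cells

print(n_rows, n_cols, n_missing)
```

On the real data the same two expressions would be applied to `df`.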
    # Tidy the column name (strip the trailing space)
    df = df.rename(columns={'Chance of Admit ': 'Chance of Admit'})
    # Show the first 5 rows
    df.head()
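3. Correlation between features
    fig, ax = plt.subplots(figsize=(10, 10))
    sns.heatmap(df.corr(), ax=ax, annot=True, linewidths=0.05, fmt='.2f', cmap='magma')
    plt.show()
Conclusion: 1. The features most likely to influence whether a student goes on to a master's are the GRE score, CGPA, and TOEFL score.
2. LOR, SOP, and research have comparatively less influence.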
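Reading a heatmap by eye can be supplemented by sorting the correlations with the target column. The sketch below uses a made-up frame with hypothetical columns `strong`, `unrelated`, and `target` rather than the admission data; it only illustrates the pandas calls involved.

```python
import pandas as pd

# Made-up data: `strong` is an exact multiple of the target, while
# `unrelated` is constructed to have zero linear correlation with it.
toy = pd.DataFrame({
    "strong": [2, 4, 6, 8, 10],
    "unrelated": [1, 0, 1, 0, 1],
    "target": [1, 2, 3, 4, 5],
})

# Absolute correlation of every feature with the target, strongest first.
ranked = (
    toy.corr()["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)
print(ranked)
```

Sorting by absolute value keeps strongly negative correlations near the top as well, which a color scale can make easy to overlook.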
4. Data visualization and bivariate analysis
4.1 Number of candidates with research experience
    print("not having research:", len(df[df.Research == 0]))
    print("having research:", len(df[df.Research == 1]))
    y = np.array([len(df[df.Research == 0]), len(df[df.Research == 1])])
    x = np.arange(2)
    plt.bar(x, y)
    plt.title("research experience")
    plt.xlabel("candidates")
    plt.ylabel("frequency")
    plt.xticks(x, ('not having research', 'having research'))
    plt.show()
Conclusion: 219 candidates have research experience, and 181 did no research as undergraduates.
4.2 TOEFL scores
    y = np.array([df['TOEFL Score'].min(), df['TOEFL Score'].mean(), df['TOEFL Score'].max()])
    x = np.arange(3)
    plt.bar(x, y)
    plt.title('toefl score')
    plt.xlabel('level')
    plt.ylabel('toefl score')
    plt.xticks(x, ('worst', 'average', 'best'))
    plt.show()
Conclusion: the lowest score is 92 and the highest is the full score, so these candidates have strong English results.
4.3 GRE scores
    df['GRE Score'].plot(kind='hist', bins=200, figsize=(6, 6))
    plt.title('gre score')
    plt.xlabel('gre score')
    plt.ylabel('frequency')
    plt.show()
Conclusion: scores around 310 and 330 are the most common.
4.4 CGPA versus university rating
    plt.scatter(df['University Rating'], df['CGPA'])
    plt.title('cgpa scores for university ratings')
    plt.xlabel('university rating')
    plt.ylabel('cgpa')
    plt.show()
Conclusion: the better the university, the higher its students' CGPA tends to be.
4.5 GRE score versus CGPA
    plt.scatter(df['GRE Score'], df['CGPA'])
    plt.title('cgpa for gre scores')
    plt.xlabel('gre score')
    plt.ylabel('cgpa')
    plt.show()
Conclusion: the higher the CGPA, the higher the GRE score; the two are strongly correlated.
4.6 TOEFL score versus GRE score
    df[df['CGPA'] >= 8.5].plot(kind='scatter', x='GRE Score', y='TOEFL Score', color='red')
    plt.xlabel('gre score')
    plt.ylabel('toefl score')
    plt.title('cgpa >= 8.5')
    plt.grid(True)
    plt.show()
Conclusion: in most cases GRE and TOEFL scores are positively correlated, but a high GRE score does not guarantee a high TOEFL score.
4.7 University rating versus going on to a master's
    s = df[df['Chance of Admit'] >= 0.75]['University Rating'].value_counts().head(5)
    s.plot(kind='bar', figsize=(20, 10), cmap='Pastel1')
    plt.title('university ratings of candidates with a 75% acceptance chance')
    plt.xlabel('university rating')
    plt.ylabel('candidates')
    plt.show()
Conclusion: students from higher-ranked universities are more likely to continue their studies.
4.8 SOP versus CGPA
    plt.scatter(df['CGPA'], df['SOP'])
    plt.xlabel('cgpa')
    plt.ylabel('sop')
    plt.title('sop for cgpa')
    plt.show()
Conclusion: students with a high CGPA show stronger motivation to pursue a master's (higher SOP).
4.9 SOP versus GRE score
    plt.scatter(df['GRE Score'], df['SOP'])
    plt.xlabel('gre score')
    plt.ylabel('sop')
    plt.title('sop for gre score')
    plt.show()
Conclusion: students with stronger motivation to pursue a master's (higher SOP) tend to have higher GRE scores.
5. Models
5.1 Prepare the dataset
    # Load the dataset
    df = pd.read_csv('d:\\machine-learning\\score\\admission_predict.csv', sep=',')
    serialno = df['Serial No.'].values
    df.drop(['Serial No.'], axis=1, inplace=True)
    df = df.rename(columns={'Chance of Admit ': 'Chance of Admit'})
    # Separate features from the target, then split into train and test sets
    y = df['Chance of Admit'].values
    x = df.drop(['Chance of Admit'], axis=1)
    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

    # Normalize the features
    from sklearn.preprocessing import MinMaxScaler
    scalex = MinMaxScaler(feature_range=(0, 1))
    x_train[x_train.columns] = scalex.fit_transform(x_train[x_train.columns])
    # Reuse the training minimum and maximum; refitting on the test set would leak information
    x_test[x_test.columns] = scalex.transform(x_test[x_test.columns])
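A small sketch of the scaling step, assuming made-up rows that stand in for a GRE-like and a CGPA-like column. It shows the usual pattern of learning the minimum and maximum from the training rows only and then applying the same mapping to the test rows; the array names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up rows: column 0 is GRE-like, column 1 is CGPA-like.
train = np.array([[290.0, 6.8], [340.0, 9.8], [315.0, 8.3]])
test = np.array([[300.0, 7.4], [345.0, 9.2]])

scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train)  # learns min and max from train
test_scaled = scaler.transform(test)        # reuses the training min and max

print(train_scaled)
print(test_scaled)
```

A test value above the training maximum (345 here) maps to a number slightly above 1. That is expected, and it is the price of keeping test data out of the fitting step.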
5.2 Regression
5.2.1 Linear regression
    from sklearn.linear_model import LinearRegression
    lr = LinearRegression()
    lr.fit(x_train, y_train)
    y_head_lr = lr.predict(x_test)
    print('real value of y_test[1]: ' + str(y_test[1]) + ' -> predict value: ' + str(lr.predict(x_test.iloc[[1], :])))
    print('real value of y_test[2]: ' + str(y_test[2]) + ' -> predict value: ' + str(lr.predict(x_test.iloc[[2], :])))
    from sklearn.metrics import r2_score
    print('r_square score: ', r2_score(y_test, y_head_lr))
    y_head_lr_train = lr.predict(x_train)
    print('r_square score(train data):', r2_score(y_train, y_head_lr_train))
5.2.2 Random forest regression
    from sklearn.ensemble import RandomForestRegressor
    rfr = RandomForestRegressor(n_estimators=100, random_state=42)
    rfr.fit(x_train, y_train)
    y_head_rfr = rfr.predict(x_test)
    print('real value of y_test[1]: ' + str(y_test[1]) + ' -> predict value: ' + str(rfr.predict(x_test.iloc[[1], :])))
    print('real value of y_test[2]: ' + str(y_test[2]) + ' -> predict value: ' + str(rfr.predict(x_test.iloc[[2], :])))
    from sklearn.metrics import r2_score
    print('r_square score: ', r2_score(y_test, y_head_rfr))
    y_head_rfr_train = rfr.predict(x_train)
    print('r_square score(train data):', r2_score(y_train, y_head_rfr_train))
5.2.3 Decision tree regression
    from sklearn.tree import DecisionTreeRegressor
    dt = DecisionTreeRegressor(random_state=42)
    dt.fit(x_train, y_train)
    y_head_dt = dt.predict(x_test)
    print('real value of y_test[1]: ' + str(y_test[1]) + ' -> predict value: ' + str(dt.predict(x_test.iloc[[1], :])))
    print('real value of y_test[2]: ' + str(y_test[2]) + ' -> predict value: ' + str(dt.predict(x_test.iloc[[2], :])))
    from sklearn.metrics import r2_score
    print('r_square score: ', r2_score(y_test, y_head_dt))
    y_head_dt_train = dt.predict(x_train)
    print('r_square score(train data):', r2_score(y_train, y_head_dt_train))
5.2.4 Comparing the three regression methods
    y = np.array([r2_score(y_test, y_head_lr), r2_score(y_test, y_head_rfr), r2_score(y_test, y_head_dt)])
    x = np.arange(3)
    plt.bar(x, y)
    plt.title('comparison of regression algorithms')
    plt.xlabel('regression')
    plt.ylabel('r2_score')
    plt.xticks(x, ("linearregression", "randomforestreg.", "decisiontreereg."))
    plt.show()
Conclusion: among the regression algorithms, linear regression performs best.
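The comparison rests on the R² score, so it may help to see what that number measures. The sketch below computes the coefficient of determination by hand with NumPy on made-up values; `r_squared` is a helper written for this illustration, not a library function.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation the model misses
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Made-up values, not taken from the admission data.
actual = [0.5, 0.6, 0.7]
predicted = [0.5, 0.6, 0.8]
score = r_squared(actual, predicted)
print(score)
```

A score of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean; here one error of 0.1 on three points already halves the score.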
5.2.5 Comparing the three regression methods with the actual values
    red = plt.scatter(np.arange(0, 80, 5), y_head_lr[0:80:5], color='red')
    blue = plt.scatter(np.arange(0, 80, 5), y_head_rfr[0:80:5], color='blue')
    green = plt.scatter(np.arange(0, 80, 5), y_head_dt[0:80:5], color='green')
    black = plt.scatter(np.arange(0, 80, 5), y_test[0:80:5], color='black')
    plt.title('comparison of regression algorithms')
    plt.xlabel('index of candidate')
    plt.ylabel('chance of admit')
    plt.legend([red, blue, green, black], ['lr', 'rfr', 'dt', 'real'])
    plt.show()
Conclusion: about 70% of the candidates in the dataset are likely to go on to a master's; the plot also shows that some points are still not predicted well.
5.3 Classification
5.3.1 Prepare the data
    df = pd.read_csv('d:\\machine-learning\\score\\admission_predict.csv', sep=',')
    serialno = df['Serial No.'].values
    df.drop(['Serial No.'], axis=1, inplace=True)
    df = df.rename(columns={'Chance of Admit ': 'Chance of Admit'})
    y = df['Chance of Admit'].values
    x = df.drop(['Chance of Admit'], axis=1)
    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
    from sklearn.preprocessing import MinMaxScaler
    scalex = MinMaxScaler(feature_range=(0, 1))
    x_train[x_train.columns] = scalex.fit_transform(x_train[x_train.columns])
    # Reuse the training minimum and maximum instead of refitting on the test set
    x_test[x_test.columns] = scalex.transform(x_test[x_test.columns])
    # If the chance is greater than 0.8 the label is 1, otherwise 0
    y_train_01 = [1 if each > 0.8 else 0 for each in y_train]
    y_test_01 = [1 if each > 0.8 else 0 for each in y_test]
    y_train_01 = np.array(y_train_01)
    y_test_01 = np.array(y_test_01)
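The thresholding at the end of this block can also be written as a single vectorized comparison. A minimal sketch on made-up values (the array `chances` is illustrative):

```python
import numpy as np

# Made-up admission chances, not taken from the dataset.
chances = np.array([0.92, 0.80, 0.81, 0.45])

# Strictly greater than 0.8 maps to 1, everything else to 0.
labels = (chances > 0.8).astype(int)
print(labels)
```

Note that a value of exactly 0.80 lands in class 0, because the comparison is strict.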
5.3.2 Logistic regression
    from sklearn.linear_model import LogisticRegression
    lrc = LogisticRegression()
    lrc.fit(x_train, y_train_01)
    print('score: ', lrc.score(x_test, y_test_01))
    print('real value of y_test_01[1]: ' + str(y_test_01[1]) + ' -> predict value: ' + str(lrc.predict(x_test.iloc[[1], :])))
    print('real value of y_test_01[2]: ' + str(y_test_01[2]) + ' -> predict value: ' + str(lrc.predict(x_test.iloc[[2], :])))

    # Confusion matrix on the test set
    from sklearn.metrics import confusion_matrix
    cm_lrc = confusion_matrix(y_test_01, lrc.predict(x_test))
    f, ax = plt.subplots(figsize=(5, 5))
    sns.heatmap(cm_lrc, annot=True, linewidths=0.5, linecolor='red', fmt='.0f', ax=ax)
    plt.title('test for test dataset')
    plt.xlabel('predicted y values')
    plt.ylabel('real y value')
    plt.show()

    from sklearn.metrics import recall_score, precision_score, f1_score
    print('precision_score is : ', precision_score(y_test_01, lrc.predict(x_test)))
    print('recall_score is : ', recall_score(y_test_01, lrc.predict(x_test)))
    print('f1_score is : ', f1_score(y_test_01, lrc.predict(x_test)))

    # Confusion matrix on the training set
    cm_lrc_train = confusion_matrix(y_train_01, lrc.predict(x_train))
    f, ax = plt.subplots(figsize=(5, 5))
    sns.heatmap(cm_lrc_train, annot=True, linewidths=0.5, linecolor='blue', fmt='.0f', ax=ax)
    plt.title('test for train dataset')
    plt.xlabel('predicted y values')
    plt.ylabel('real y value')
    plt.show()
Conclusion: 1. According to the confusion matrix, logistic regression misclassifies 23 training samples, and 72 candidates are identified as wanting to go on to a master's.
2. It misclassifies 7 samples on the test set.
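The misclassification counts quoted in these conclusions are read from the off-diagonal cells of a confusion matrix. The sketch below shows that arithmetic on a hypothetical 2x2 matrix (the numbers are invented, not the article's results), using scikit-learn's layout where rows are true classes and columns are predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix: rows are true classes, columns are predictions.
cm = np.array([[8, 1],
               [2, 9]])

tn, fp, fn, tp = cm.ravel()
misclassified = fp + fn                 # the two off-diagonal cells
precision = tp / (tp + fp)              # share of predicted positives that are right
recall = tp / (tp + fn)                 # share of true positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(misclassified, precision, recall, f1)
```

Precision and recall separate the two kinds of error, which a single count of mistakes cannot do.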
5.3.3 Support vector machine (SVM)
    from sklearn.svm import SVC
    svm = SVC(random_state=1, kernel='rbf')
    svm.fit(x_train, y_train_01)
    print('score: ', svm.score(x_test, y_test_01))
    print('real value of y_test_01[1]: ' + str(y_test_01[1]) + ' -> predict value: ' + str(svm.predict(x_test.iloc[[1], :])))
    print('real value of y_test_01[2]: ' + str(y_test_01[2]) + ' -> predict value: ' + str(svm.predict(x_test.iloc[[2], :])))

    # Confusion matrix on the test set
    from sklearn.metrics import confusion_matrix
    cm_svm = confusion_matrix(y_test_01, svm.predict(x_test))
    f, ax = plt.subplots(figsize=(5, 5))
    sns.heatmap(cm_svm, annot=True, linewidths=0.5, linecolor='red', fmt='.0f', ax=ax)
    plt.title('test for test dataset')
    plt.xlabel('predicted y values')
    plt.ylabel('real y value')
    plt.show()

    from sklearn.metrics import recall_score, precision_score, f1_score
    print('precision_score is : ', precision_score(y_test_01, svm.predict(x_test)))
    print('recall_score is : ', recall_score(y_test_01, svm.predict(x_test)))
    print('f1_score is : ', f1_score(y_test_01, svm.predict(x_test)))

    # Confusion matrix on the training set
    cm_svm_train = confusion_matrix(y_train_01, svm.predict(x_train))
    f, ax = plt.subplots(figsize=(5, 5))
    sns.heatmap(cm_svm_train, annot=True, linewidths=0.5, linecolor='blue', fmt='.0f', ax=ax)
    plt.title('test for train dataset')
    plt.xlabel('predicted y values')
    plt.ylabel('real y value')
    plt.show()
Conclusion: 1. According to the confusion matrix, the SVM misclassifies 22 training samples, and 70 candidates are identified as wanting to go on to a master's.
2. It misclassifies 8 samples on the test set.
5.3.4 Naive Bayes
    from sklearn.naive_bayes import GaussianNB
    nb = GaussianNB()
    nb.fit(x_train, y_train_01)
    print('score: ', nb.score(x_test, y_test_01))
    print('real value of y_test_01[1]: ' + str(y_test_01[1]) + ' -> predict value: ' + str(nb.predict(x_test.iloc[[1], :])))
    print('real value of y_test_01[2]: ' + str(y_test_01[2]) + ' -> predict value: ' + str(nb.predict(x_test.iloc[[2], :])))

    # Confusion matrix on the test set
    from sklearn.metrics import confusion_matrix
    cm_nb = confusion_matrix(y_test_01, nb.predict(x_test))
    f, ax = plt.subplots(figsize=(5, 5))
    sns.heatmap(cm_nb, annot=True, linewidths=0.5, linecolor='red', fmt='.0f', ax=ax)
    plt.title('test for test dataset')
    plt.xlabel('predicted y values')
    plt.ylabel('real y value')
    plt.show()

    from sklearn.metrics import recall_score, precision_score, f1_score
    print('precision_score is : ', precision_score(y_test_01, nb.predict(x_test)))
    print('recall_score is : ', recall_score(y_test_01, nb.predict(x_test)))
    print('f1_score is : ', f1_score(y_test_01, nb.predict(x_test)))

    # Confusion matrix on the training set
    cm_nb_train = confusion_matrix(y_train_01, nb.predict(x_train))
    f, ax = plt.subplots(figsize=(5, 5))
    sns.heatmap(cm_nb_train, annot=True, linewidths=0.5, linecolor='blue', fmt='.0f', ax=ax)
    plt.title('test for train dataset')
    plt.xlabel('predicted y values')
    plt.ylabel('real y value')
    plt.show()
Conclusion: 1. According to the confusion matrix, naive Bayes misclassifies 20 training samples, and 78 candidates are identified as wanting to go on to a master's.
2. It misclassifies 7 samples on the test set.
5.3.5 Random forest classifier
    from sklearn.ensemble import RandomForestClassifier
    rfc = RandomForestClassifier(n_estimators=100, random_state=1)
    rfc.fit(x_train, y_train_01)
    print('score: ', rfc.score(x_test, y_test_01))
    print('real value of y_test_01[1]: ' + str(y_test_01[1]) + ' -> predict value: ' + str(rfc.predict(x_test.iloc[[1], :])))
    print('real value of y_test_01[2]: ' + str(y_test_01[2]) + ' -> predict value: ' + str(rfc.predict(x_test.iloc[[2], :])))

    # Confusion matrix on the test set
    from sklearn.metrics import confusion_matrix
    cm_rfc = confusion_matrix(y_test_01, rfc.predict(x_test))
    f, ax = plt.subplots(figsize=(5, 5))
    sns.heatmap(cm_rfc, annot=True, linewidths=0.5, linecolor='red', fmt='.0f', ax=ax)
    plt.title('test for test dataset')
    plt.xlabel('predicted y values')
    plt.ylabel('real y value')
    plt.show()

    from sklearn.metrics import recall_score, precision_score, f1_score
    print('precision_score is : ', precision_score(y_test_01, rfc.predict(x_test)))
    print('recall_score is : ', recall_score(y_test_01, rfc.predict(x_test)))
    print('f1_score is : ', f1_score(y_test_01, rfc.predict(x_test)))

    # Confusion matrix on the training set
    cm_rfc_train = confusion_matrix(y_train_01, rfc.predict(x_train))
    f, ax = plt.subplots(figsize=(5, 5))
    sns.heatmap(cm_rfc_train, annot=True, linewidths=0.5, linecolor='blue', fmt='.0f', ax=ax)
    plt.title('test for train dataset')
    plt.xlabel('predicted y values')
    plt.ylabel('real y value')
    plt.show()
Conclusion: 1. According to the confusion matrix, the random forest misclassifies 0 training samples, and 88 candidates are identified as wanting to go on to a master's.
2. It misclassifies 5 samples on the test set.
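Zero errors on the training set alongside several on the test set is typical of a random forest: accuracy on data the model has already seen says little about new data. The sketch below contrasts training accuracy with 5-fold cross-validated accuracy on synthetic, noisy data. The data, the seeds, and `n_estimators=50` are assumptions made for this illustration, not the article's setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, noisy data: only the first feature carries any signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=1.0, size=200) > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=1)
train_acc = model.fit(X, y).score(X, y)          # accuracy on data it has seen
cv_scores = cross_val_score(model, X, y, cv=5)   # accuracy on held-out folds

print(train_acc, cv_scores.mean())
```

The held-out score is the more honest estimate when models are compared with one another.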
5.3.6 Decision tree classifier
    from sklearn.tree import DecisionTreeClassifier
    dtc = DecisionTreeClassifier(criterion='entropy', max_depth=3)
    dtc.fit(x_train, y_train_01)
    print('score: ', dtc.score(x_test, y_test_01))
    print('real value of y_test_01[1]: ' + str(y_test_01[1]) + ' -> predict value: ' + str(dtc.predict(x_test.iloc[[1], :])))
    print('real value of y_test_01[2]: ' + str(y_test_01[2]) + ' -> predict value: ' + str(dtc.predict(x_test.iloc[[2], :])))

    # Confusion matrix on the test set
    from sklearn.metrics import confusion_matrix
    cm_dtc = confusion_matrix(y_test_01, dtc.predict(x_test))
    f, ax = plt.subplots(figsize=(5, 5))
    sns.heatmap(cm_dtc, annot=True, linewidths=0.5, linecolor='red', fmt='.0f', ax=ax)
    plt.title('test for test dataset')
    plt.xlabel('predicted y values')
    plt.ylabel('real y value')
    plt.show()

    from sklearn.metrics import recall_score, precision_score, f1_score
    print('precision_score is : ', precision_score(y_test_01, dtc.predict(x_test)))
    print('recall_score is : ', recall_score(y_test_01, dtc.predict(x_test)))
    print('f1_score is : ', f1_score(y_test_01, dtc.predict(x_test)))

    # Confusion matrix on the training set
    cm_dtc_train = confusion_matrix(y_train_01, dtc.predict(x_train))
    f, ax = plt.subplots(figsize=(5, 5))
    sns.heatmap(cm_dtc_train, annot=True, linewidths=0.5, linecolor='blue', fmt='.0f', ax=ax)
    plt.title('test for train dataset')
    plt.xlabel('predicted y values')
    plt.ylabel('real y value')
    plt.show()
Conclusion: 1. According to the confusion matrix, the decision tree misclassifies 20 training samples, and 78 candidates are identified as wanting to go on to a master's.
2. It misclassifies 7 samples on the test set.
5.3.7 K-nearest neighbors classifier
    from sklearn.neighbors import KNeighborsClassifier
    # Try k = 1..49 and plot the test accuracy for each value
    scores = []
    for each in range(1, 50):
        knn_n = KNeighborsClassifier(n_neighbors=each)
        knn_n.fit(x_train, y_train_01)
        scores.append(knn_n.score(x_test, y_test_01))
    plt.plot(range(1, 50), scores)
    plt.xlabel('k')
    plt.ylabel('accuracy')
    plt.show()

    knn = KNeighborsClassifier(n_neighbors=7)
    knn.fit(x_train, y_train_01)
    print('score 7 : ', knn.score(x_test, y_test_01))
    print('real value of y_test_01[1]: ' + str(y_test_01[1]) + ' -> predict value: ' + str(knn.predict(x_test.iloc[[1], :])))
    print('real value of y_test_01[2]: ' + str(y_test_01[2]) + ' -> predict value: ' + str(knn.predict(x_test.iloc[[2], :])))

    # Confusion matrix on the test set
    from sklearn.metrics import confusion_matrix
    cm_knn = confusion_matrix(y_test_01, knn.predict(x_test))
    f, ax = plt.subplots(figsize=(5, 5))
    sns.heatmap(cm_knn, annot=True, linewidths=0.5, linecolor='red', fmt='.0f', ax=ax)
    plt.title('test for test dataset')
    plt.xlabel('predicted y values')
    plt.ylabel('real y value')
    plt.show()

    from sklearn.metrics import recall_score, precision_score, f1_score
    print('precision_score is : ', precision_score(y_test_01, knn.predict(x_test)))
    print('recall_score is : ', recall_score(y_test_01, knn.predict(x_test)))
    print('f1_score is : ', f1_score(y_test_01, knn.predict(x_test)))

    # Confusion matrix on the training set
    cm_knn_train = confusion_matrix(y_train_01, knn.predict(x_train))
    f, ax = plt.subplots(figsize=(5, 5))
    sns.heatmap(cm_knn_train, annot=True, linewidths=0.5, linecolor='blue', fmt='.0f', ax=ax)
    plt.title('test for train dataset')
    plt.xlabel('predicted y values')
    plt.ylabel('real y value')
    plt.show()
Conclusion: 1. According to the confusion matrix, k-nearest neighbors misclassifies 22 training samples, and 71 candidates are identified as wanting to go on to a master's.
2. It misclassifies 7 samples on the test set.
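For readers new to the method, the voting rule behind k-nearest neighbors fits in a few lines of NumPy. This is a bare-bones sketch on made-up one-dimensional data. `knn_predict` is a helper written for this illustration and makes no attempt to match scikit-learn's tie-breaking or performance.

```python
import numpy as np

def knn_predict(X_train, y_train, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - query, axis=1)  # Euclidean distance to each point
    nearest = np.argsort(distances)[:k]                   # indices of the k closest points
    votes = np.bincount(y_train[nearest])                 # count labels among those neighbours
    return int(votes.argmax())

# Made-up one-dimensional data: two well-separated groups.
X_demo = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_demo, y_demo, np.array([1.5]), k=3))
print(knn_predict(X_demo, y_demo, np.array([9.0]), k=3))
```

A larger k smooths the decision at the cost of blurring boundaries between groups, which is why the accuracy curve over k is worth plotting.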
5.3.8 Comparing the classifiers
    y = np.array([lrc.score(x_test, y_test_01), svm.score(x_test, y_test_01), nb.score(x_test, y_test_01),
                  dtc.score(x_test, y_test_01), rfc.score(x_test, y_test_01), knn.score(x_test, y_test_01)])
    x = np.arange(6)
    plt.bar(x, y)
    plt.title('comparison of classification algorithms')
    plt.xlabel('classification')
    plt.ylabel('score')
    plt.xticks(x, ("logisticreg.", "svm", "gnb", "dec.tree", "ran.forest", "knn"))
    plt.show()
Conclusion: random forest and naive Bayes both achieve relatively high scores.
5.4 Clustering
5.4.1 Prepare the data
    df = pd.read_csv('d:\\machine-learning\\score\\admission_predict.csv', sep=',')
    df = df.rename(columns={'Chance of Admit ': 'Chance of Admit'})
    serialno = df['Serial No.']
    df.drop(['Serial No.'], axis=1, inplace=True)
    # Min-max normalize every column using its own minimum and maximum
    df = (df - df.min()) / (df.max() - df.min())
    y = df['Chance of Admit']
    x = df.drop(['Chance of Admit'], axis=1)
5.4.2 Dimensionality reduction
    from sklearn.decomposition import PCA
    # Project the features onto a single principal component
    pca = PCA(n_components=1, whiten=True)
    pca.fit(x)
    x_pca = pca.transform(x)
    x_pca = x_pca.reshape(400)
    dictionary = {'x': x_pca, 'y': y}
    data = pd.DataFrame(dictionary)
    print('pca data:', data.head())
    print()
    print('original data:', df.head())
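When features are reduced to a single component, it is worth checking how much of the variance that component keeps. The sketch below reads `explained_variance_ratio_` from a fitted `PCA` on made-up points that lie exactly on a line, a deliberately degenerate case in which one component keeps everything. Real data would give a smaller ratio.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up points lying exactly on a line, so one component captures all variance.
points = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])

pca_demo = PCA(n_components=1)
projected = pca_demo.fit_transform(points)        # shape (n_samples, 1)
ratio = pca_demo.explained_variance_ratio_[0]     # share of variance kept

print(projected.shape, ratio)
```

A low ratio would suggest that a one-dimensional plot hides much of the structure in the features.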
5.4.3 K-means clustering
    from sklearn.cluster import KMeans
    # Elbow method: within-cluster sum of squares (WCSS) for k = 1..14
    wcss = []
    for k in range(1, 15):
        kmeans = KMeans(n_clusters=k)
        kmeans.fit(x)
        wcss.append(kmeans.inertia_)
    plt.plot(range(1, 15), wcss)
    plt.xlabel('kmeans')
    plt.ylabel('wcss')
    plt.show()

    df["Serial No."] = serialno
    kmeans = KMeans(n_clusters=3)
    clusters_knn = kmeans.fit_predict(x)
    df['label_kmeans'] = clusters_knn

    # Clusters plotted against the serial number
    plt.scatter(df[df.label_kmeans == 0]["Serial No."], df[df.label_kmeans == 0]['Chance of Admit'], color="red")
    plt.scatter(df[df.label_kmeans == 1]["Serial No."], df[df.label_kmeans == 1]['Chance of Admit'], color="blue")
    plt.scatter(df[df.label_kmeans == 2]["Serial No."], df[df.label_kmeans == 2]['Chance of Admit'], color="green")
    plt.title("k-means clustering")
    plt.xlabel("candidates")
    plt.ylabel("chance of admit")
    plt.show()

    # Clusters plotted against the single PCA component
    plt.scatter(data.x[df.label_kmeans == 0], data[df.label_kmeans == 0].y, color="red")
    plt.scatter(data.x[df.label_kmeans == 1], data[df.label_kmeans == 1].y, color="blue")
    plt.scatter(data.x[df.label_kmeans == 2], data[df.label_kmeans == 2].y, color="green")
    plt.title("k-means clustering")
    plt.xlabel("x")
    plt.ylabel("chance of admit")
    plt.show()
Conclusion: the dataset splits into three groups: students who have decided to go on to a master's, students who have given up on it, and students who are still undecided but fairly likely to continue.
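The elbow plot relies on `inertia_`, the within-cluster sum of squared distances, which shrinks as k grows. The sketch below makes this concrete on six made-up points that form three tight pairs. The points, `n_init=10`, and `random_state=0` are assumptions chosen so that the toy example is reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up points forming three tight, well-separated pairs.
pts = np.array([[0, 0], [0, 1], [10, 10], [10, 11], [20, 20], [20, 21]], dtype=float)

wcss_demo = []
for k in (1, 2, 3):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pts)
    wcss_demo.append(km.inertia_)   # within-cluster sum of squares for this k

labels_demo = km.labels_            # labels from the last fit, k = 3
print(wcss_demo, labels_demo)
```

The drop from k = 2 to k = 3 is large here because the third cluster removes the last merged pair; beyond the true number of groups the curve flattens, and that bend is the "elbow".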
5.4.4 Hierarchical clustering
    from scipy.cluster.hierarchy import linkage, dendrogram
    merg = linkage(x, method='ward')
    dendrogram(merg, leaf_rotation=90)
    plt.xlabel('data points')
    plt.ylabel('euclidean distance')
    plt.show()

    from sklearn.cluster import AgglomerativeClustering
    # Ward linkage always uses Euclidean distance, so no distance argument is needed
    hierarchical_cluster = AgglomerativeClustering(n_clusters=3, linkage='ward')
    clusters_hierarchical = hierarchical_cluster.fit_predict(x)
    df['label_hierarchical'] = clusters_hierarchical

    # Clusters plotted against the serial number
    plt.scatter(df[df.label_hierarchical == 0]["Serial No."], df[df.label_hierarchical == 0]['Chance of Admit'], color="red")
    plt.scatter(df[df.label_hierarchical == 1]["Serial No."], df[df.label_hierarchical == 1]['Chance of Admit'], color="blue")
    plt.scatter(df[df.label_hierarchical == 2]["Serial No."], df[df.label_hierarchical == 2]['Chance of Admit'], color="green")
    plt.title('hierarchical clustering')
    plt.xlabel('candidates')
    plt.ylabel('chance of admit')
    plt.show()

    # Clusters plotted against the single PCA component
    plt.scatter(data[df.label_hierarchical == 0].x, data.y[df.label_hierarchical == 0], color='red')
    plt.scatter(data[df.label_hierarchical == 1].x, data.y[df.label_hierarchical == 1], color='blue')
    plt.scatter(data[df.label_hierarchical == 2].x, data.y[df.label_hierarchical == 2], color='green')
    plt.title('hierarchical clustering')
    plt.xlabel('x')
    plt.ylabel('chance of admit')
    plt.show()
Conclusion: the hierarchical clustering result agrees with the k-means result, and it additionally confirms k = 3 as the number of clusters.
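Agreement between two clusterings can be measured rather than judged from plots. One option is the adjusted Rand index, available in scikit-learn as `adjusted_rand_score`. The sketch below uses hypothetical label vectors, not the article's actual cluster assignments.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical label vectors from two clustering runs on the same six points.
labels_kmeans = [0, 0, 1, 1, 2, 2]
labels_hier = [2, 2, 0, 0, 1, 1]    # same grouping, different cluster ids
labels_other = [0, 1, 0, 1, 0, 1]   # a genuinely different grouping

same = adjusted_rand_score(labels_kmeans, labels_hier)
different = adjusted_rand_score(labels_kmeans, labels_other)
print(same, different)
```

The index ignores how clusters are numbered, so identical groupings score 1 even when the ids differ; applied to `df['label_kmeans']` and the hierarchical labels, it would put a number on "the results agree".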
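Summary: working through this introductory dataset teaches
1. some ways to visualize features
2. how to call the sklearn API
3. how to compare the quality of different models
Code and dataset: https://github.com/mounment/python-data-analyze/tree/master/kaggle/score
Original article: https://www.cnblogs.com/luhuajun/p/10361463.html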