My first Kaggle kernel transcription.
I learned new things about how to use scalers and about GridSearchCV.
Kaggle Support Vector Machine + PCA transcription
Source: https://www.kaggle.com/faressayah/support-vector-machine-pca-tutorial-for-beginner
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid') # seaborn style: white background with grid lines
# Load the data
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
cancer.keys() # 'data' holds the explanatory variables, 'target' holds the labels
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
col_names = list(cancer.feature_names)
col_names.append('target')
df = pd.DataFrame(np.c_[cancer.data,cancer.target], columns=col_names)
df.head()
 | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | 0.0 |
1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | 0.0 |
2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | 0.0 |
3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | 0.0 |
4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | 0.0 |
5 rows × 31 columns
# Quick look at the data
cancer.target_names # class names of the target variable
array(['malignant', 'benign'], dtype='<U9')
df.describe()
# The explanatory variables differ widely in scale!
# SVM is sensitive to data with such large scale differences
 | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | ... | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 |
mean | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.096360 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | 0.062798 | ... | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 | 0.627417 |
std | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.079720 | 0.038803 | 0.027414 | 0.007060 | ... | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 | 0.483918 |
min | 6.981000 | 9.710000 | 43.790000 | 143.500000 | 0.052630 | 0.019380 | 0.000000 | 0.000000 | 0.106000 | 0.049960 | ... | 12.020000 | 50.410000 | 185.200000 | 0.071170 | 0.027290 | 0.000000 | 0.000000 | 0.156500 | 0.055040 | 0.000000 |
25% | 11.700000 | 16.170000 | 75.170000 | 420.300000 | 0.086370 | 0.064920 | 0.029560 | 0.020310 | 0.161900 | 0.057700 | ... | 21.080000 | 84.110000 | 515.300000 | 0.116600 | 0.147200 | 0.114500 | 0.064930 | 0.250400 | 0.071460 | 0.000000 |
50% | 13.370000 | 18.840000 | 86.240000 | 551.100000 | 0.095870 | 0.092630 | 0.061540 | 0.033500 | 0.179200 | 0.061540 | ... | 25.410000 | 97.660000 | 686.500000 | 0.131300 | 0.211900 | 0.226700 | 0.099930 | 0.282200 | 0.080040 | 1.000000 |
75% | 15.780000 | 21.800000 | 104.100000 | 782.700000 | 0.105300 | 0.130400 | 0.130700 | 0.074000 | 0.195700 | 0.066120 | ... | 29.720000 | 125.400000 | 1084.000000 | 0.146000 | 0.339100 | 0.382900 | 0.161400 | 0.317900 | 0.092080 | 1.000000 |
max | 28.110000 | 39.280000 | 188.500000 | 2501.000000 | 0.163400 | 0.345400 | 0.426800 | 0.201200 | 0.304000 | 0.097440 | ... | 49.540000 | 251.200000 | 4254.000000 | 0.222600 | 1.058000 | 1.252000 | 0.291000 | 0.663800 | 0.207500 | 1.000000 |
8 rows × 31 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 mean radius 569 non-null float64
1 mean texture 569 non-null float64
2 mean perimeter 569 non-null float64
3 mean area 569 non-null float64
4 mean smoothness 569 non-null float64
5 mean compactness 569 non-null float64
6 mean concavity 569 non-null float64
7 mean concave points 569 non-null float64
8 mean symmetry 569 non-null float64
9 mean fractal dimension 569 non-null float64
10 radius error 569 non-null float64
11 texture error 569 non-null float64
12 perimeter error 569 non-null float64
13 area error 569 non-null float64
14 smoothness error 569 non-null float64
15 compactness error 569 non-null float64
16 concavity error 569 non-null float64
17 concave points error 569 non-null float64
18 symmetry error 569 non-null float64
19 fractal dimension error 569 non-null float64
20 worst radius 569 non-null float64
21 worst texture 569 non-null float64
22 worst perimeter 569 non-null float64
23 worst area 569 non-null float64
24 worst smoothness 569 non-null float64
25 worst compactness 569 non-null float64
26 worst concavity 569 non-null float64
27 worst concave points 569 non-null float64
28 worst symmetry 569 non-null float64
29 worst fractal dimension 569 non-null float64
30 target 569 non-null float64
dtypes: float64(31)
memory usage: 137.9 KB
df.columns # mean/error/worst
Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
'mean smoothness', 'mean compactness', 'mean concavity',
'mean concave points', 'mean symmetry', 'mean fractal dimension',
'radius error', 'texture error', 'perimeter error', 'area error',
'smoothness error', 'compactness error', 'concavity error',
'concave points error', 'symmetry error', 'fractal dimension error',
'worst radius', 'worst texture', 'worst perimeter', 'worst area',
'worst smoothness', 'worst compactness', 'worst concavity',
'worst concave points', 'worst symmetry', 'worst fractal dimension',
'target'],
dtype='object')
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
Each of the base features a-j above comes in three variants: mean, se (standard error), and worst (the mean of the three largest values of that feature). For example, 'mean radius' and 'radius error' are the mean tumor radius and its standard error, while 'worst radius' is the mean of the three largest radius values.
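Because every base feature appears in mean / error / worst variants, the three column groups can also be selected programmatically instead of typing them out. A small sketch, relying only on the column names already present in df (variable names are illustrative):
mean_cols  = [c for c in df.columns if c.startswith('mean ')]   # the 10 'mean' features
error_cols = [c for c in df.columns if c.endswith(' error')]    # the 10 'error' features
worst_cols = [c for c in df.columns if c.startswith('worst ')]  # the 10 'worst' features
print(len(mean_cols), len(error_cols), len(worst_cols))         # 10 10 10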
# Pairplot of the 'mean' variables
sns.pairplot(df, hue='target', vars=['mean radius', 'mean texture', 'mean perimeter', 'mean area',
'mean smoothness', 'mean compactness', 'mean concavity',
'mean concave points', 'mean symmetry', 'mean fractal dimension'])
# vars : list of variable names
# Variables within ``data`` to use, otherwise use every column with a numeric datatype
<seaborn.axisgrid.PairGrid at 0x1d1e8bac850>
The explanatory variables show linear relationships with one another -> multicollinearity
# Look at the target variable
sns.countplot(x='target', data=df)
<AxesSubplot:xlabel='target', ylabel='count'>
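The same class balance can also be read off numerically (this dataset has 212 malignant and 357 benign samples):
df['target'].value_counts() # 1.0 (benign) = 357, 0.0 (malignant) = 212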
# Correlations among the explanatory variables
plt.figure(figsize=(20,10))
sns.heatmap(df.corr(), annot=True,square=True,cmap='RdBu')
# annot=True writes the data value inside each cell
# square=True draws each cell as a square
# 'RdBu' is a diverging Color Brewer palette: diverging palettes emphasize both ends, so they suit data where both low and high values are of interest
<AxesSubplot:>
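To read the heatmap numerically, the most strongly correlated feature pairs can be listed directly. A minimal sketch using the same correlation matrix (variable names are illustrative):
corr = df.drop(columns='target').corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep each pair only once
upper.stack().sort_values(ascending=False).head(10)                # radius / perimeter / area pairs dominate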
Model training
from sklearn.model_selection import cross_val_score # k-fold CV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
# A pipeline runs several transformation steps in a fixed order.
# scikit-learn provides the Pipeline class to chain successive transformations.
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# StandardScaler: standard scaling using the mean and standard deviation
# MinMaxScaler: scales each feature so its minimum and maximum become 0 and 1
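As a quick reminder of what the two scalers actually compute, a toy example (values chosen arbitrarily, one feature only):
toy = np.array([[1.0], [2.0], [10.0]])        # a single feature with a skewed scale
StandardScaler().fit_transform(toy).ravel()   # (x - mean) / std
MinMaxScaler().fit_transform(toy).ravel()     # (x - min) / (max - min) -> range [0, 1]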
# Split the data
X = df.drop('target', axis=1)
y = df.target
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(398, 30) (171, 30) (398,) (171,)
- SVM
# linear kernel SVM
from sklearn.svm import LinearSVC
model = LinearSVC(loss='hinge', dual=True)
model.fit(X_train,y_train)
# The support vector machine loss function is called the hinge loss
# squared_hinge is simply the hinge loss squared; plain hinge is the more common choice
# (a small sketch after this cell's output computes the hinge loss by hand)
# The ConvergenceWarning below is expected here, most likely because the features are not scaled yet
# loss : {'hinge', 'squared_hinge'}, default='squared_hinge'
# Specifies the loss function. 'hinge' is the standard SVM loss
# (used e.g. by the SVC class) while 'squared_hinge' is the square of the hinge loss.
# dual : bool, default=True
# Select the algorithm to either solve the dual or primal
# optimization problem. Prefer dual=False when n_samples > n_features.
C:\Users\Hera\anaconda3\lib\site-packages\sklearn\svm\_base.py:976: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
warnings.warn("Liblinear failed to converge, increase "
LinearSVC(loss='hinge')
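For reference, the hinge loss mentioned above can be computed by hand from the model's decision function. A minimal sketch, assuming the labels are recoded to ±1 (y_pm1 is an illustrative name):
y_pm1 = np.where(y_test == 1, 1, -1)                                   # hinge loss expects labels in {-1, +1}
np.mean(np.maximum(0, 1 - y_pm1 * model.decision_function(X_test)))    # mean of max(0, 1 - y * f(x))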
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = model.predict(X_test)
confusion_matrix(y_test, y_pred)
array([[ 53, 10],
[ 0, 108]], dtype=int64)
# Accuracy of LinearSVC
accuracy_score(y_test,y_pred)
0.9415204678362573
# linear-kernel SVC
from sklearn.svm import SVC
model = SVC(kernel='linear')
model.fit(X_train,y_train)
SVC(kernel='linear')
y_pred = model.predict(X_test)
confusion_matrix(y_test,y_pred)
array([[ 59, 4],
[ 2, 106]], dtype=int64)
accuracy_score(y_test,y_pred)
0.9649122807017544
# polynomial kernel SVM
# too slow on my machine, so I gave up waiting for it to finish..
model = SVC(kernel='poly', degree=3, gamma='auto', coef0=1, C=5) # coef0 only matters for the poly and sigmoid kernels
model.fit(X_train, y_train)
# radial Kernel
model = SVC(kernel='rbf', gamma=0.5, C=0.1)
model.fit(X_train,y_train)
Preparing the data for SVM
- Numerical inputs: SVM assumes the explanatory variables are numeric, so categorical variables must be converted to dummy variables (a small example follows this list)
- Binary classification: the basic SVM is a binary classifier, although regression and multi-class classification are also possible
- This dataset already satisfies both conditions (all explanatory variables are numeric and y is binary)
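This dataset has no categorical columns, but if it did, the dummy-variable conversion mentioned above could look like this ('tumor_site' is a purely hypothetical column, shown only for illustration):
demo = pd.DataFrame({'tumor_site': ['left', 'right', 'left', 'center']})  # hypothetical categorical feature
pd.get_dummies(demo, columns=['tumor_site'])                              # one 0/1 indicator column per category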
# Scale the explanatory variables
# StandardScaler: standard scaling using the mean and standard deviation
# MinMaxScaler: scales each feature so its minimum and maximum become 0 and 1
# Build the pipeline
pipeline = Pipeline([
    ('min_max_scaler', MinMaxScaler()),
    ('std_scaler', StandardScaler())
])
# (Note: chaining MinMaxScaler and StandardScaler this way ends up equivalent to StandardScaler alone,
#  since standardizing a min-max-scaled column gives the same result as standardizing the raw column.)
X_train = pipeline.fit_transform(X_train)
X_test = pipeline.transform(X_test)
# However, you must NOT apply scaled_X_test = scaler.fit_transform(X_test) to the test data.
# Doing so makes the scaler discard the statistics it learned when fitting the training data
# and re-estimate them from the test data instead.
# That is why fit_transform() must never be applied to the test set.
# To avoid this kind of mistake, it is convenient to wrap preprocessing such as scaling in a Pipeline, as above, so the fitted transformation is reused.
# Key point: in supervised learning, the training and test sets must receive the same transformation!
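A standalone sketch of why the test set must reuse the training statistics; it re-splits the data locally and does not modify the variables used above (X_tr, X_te, ok, bad are illustrative names):
X_tr, X_te, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler().fit(X_tr)              # statistics learned on the training set only
ok  = scaler.transform(X_te)                     # test set mapped with the TRAINING mean/std
bad = StandardScaler().fit_transform(X_te)       # wrong: statistics re-estimated on the test set
np.allclose(ok, bad)                             # False -> the two transformations differ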
X_train
array([[-0.12348985, -0.29680142, -0.17050713, ..., -0.84082156,
-0.8563616 , -0.76574773],
[-0.22826757, -0.65795149, -0.25377521, ..., -0.37706655,
-1.3415819 , -0.41480748],
[ 0.14553402, -1.23056444, 0.24583328, ..., -0.04762652,
-0.08997059, 0.4882635 ],
...,
[ 0.03226081, -0.55578404, -0.08064356, ..., -1.26179013,
-0.6828391 , -1.27672587],
[-0.05552593, 0.10949242, -0.04684166, ..., 1.07924018,
0.4755842 , 1.25530227],
[-0.56525537, 0.32333128, -0.619825 , ..., -0.61952313,
-0.30366032, -0.84348042]])
# Fit the SVM on the scaled data
model = SVC(kernel='linear')
model.fit(X_train,y_train)
SVC(kernel='linear')
y_pred = model.predict(X_test)
confusion_matrix(y_test,y_pred)
array([[ 61, 2],
[ 2, 106]], dtype=int64)
accuracy_score(y_test,y_pred)
0.9766081871345029
-> Scaling the explanatory variables improved the predictive accuracy
from sklearn.model_selection import GridSearchCV
# Choosing hyperparameters -> GridSearchCV is used in most cases
param_grid = {'C': [0.01,0.1,0.5,1,10,100], # default=1.0
'gamma': [1,0.75,0.5,0.25,0.1,0.01,0.001], # if 'auto', uses 1 / n_features.
'kernel': ['rbf','poly','linear']}
grid = GridSearchCV(SVC(), param_grid=param_grid,cv=5,iid=True,refit=True, verbose=1)
# refit=True (default): after the search, the estimator is refit on the whole training set with the best parameters
# (since refit=True, grid.best_estimator_ can be used directly for prediction; see the note after the tuned-model results below)
# note: the iid parameter is deprecated in recent scikit-learn versions (see the warning below) and can simply be omitted
# verbose: controls the progress messages printed during the grid search
# verbose=0 (default): no messages
# verbose=1: brief messages
# verbose=2: a message per hyperparameter combination
grid.fit(X_train,y_train)
Fitting 5 folds for each of 126 candidates, totalling 630 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 630 out of 630 | elapsed: 3.0s finished
C:\Users\Hera\anaconda3\lib\site-packages\sklearn\model_selection\_search.py:847: FutureWarning: The parameter 'iid' is deprecated in 0.22 and will be removed in 0.24.
warnings.warn(
GridSearchCV(cv=5, estimator=SVC(), iid=True,
param_grid={'C': [0.01, 0.1, 0.5, 1, 10, 100],
'gamma': [1, 0.75, 0.5, 0.25, 0.1, 0.01, 0.001],
'kernel': ['rbf', 'poly', 'linear']},
verbose=1)
grid.best_params_
{'C': 0.1, 'gamma': 1, 'kernel': 'linear'}
svm_clf = SVC(C=0.1,gamma=1,kernel='linear')
svm_clf.fit(X_train,y_train)
SVC(C=0.1, gamma=1, kernel='linear')
y_pred = svm_clf.predict(X_test)
print(confusion_matrix(y_pred,y_test))
print(accuracy_score(y_pred,y_test))
[[ 61 1]
[ 2 107]]
0.9824561403508771
-> Accuracy improved further
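Since refit=True, the fitted grid object already holds the best model refit on the full training set, so re-declaring SVC by hand is optional. An equivalent call (it should reproduce the accuracy above; grid.predict(X_test) also works when refit=True):
print(accuracy_score(y_test, grid.best_estimator_.predict(X_test)))  # same model as svm_clf above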
# Re-split and scale the data (StandardScaler)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
# Store the PCA-reduced features separately -> they will be used for modelling again later
X_train2 = pca.fit_transform(X_train) # Z1,Z2,Z3
X_test2 = pca.transform(X_test)
print(X_train2.shape, X_test2.shape) # confirm that only the 3 PC scores were kept
(398, 3) (171, 3)
X_train.shape
(398, 30)
Applying the concept by hand (matrix algebra)
pca = PCA()
PCscore_train = pca.fit_transform(X_train)
PCscore_test = pca.transform(X_test)
eigen_value = pca.explained_variance_
eigen_value.shape
(30,)
eigen_vector = pca.components_.transpose() # eigenvectors as columns (transpose of pca.components_)
# the first column is the eigenvector for the first (largest) eigenvalue, e1
# the second column is the eigenvector for the second eigenvalue, e2
# the third column is the eigenvector for the third eigenvalue, e3, ... up to e30
eigen_vector.shape
(30, 30)
np.dot(X_train,eigen_vector) # PC scores Z1...Z30: project the (already standardized) data onto the eigenvectors
array([[-3.08484180e+00, -2.15870446e+00, -3.39874585e-01, ...,
3.08256329e-02, -1.30215619e-02, 1.01519951e-02],
[-2.18264681e+00, -6.17571155e-01, 4.47207650e-01, ...,
4.52299177e-02, 4.55773527e-03, 9.01158415e-04],
[ 2.04995887e+00, 2.32895331e+00, 1.16940721e+00, ...,
3.67990822e-02, 4.25992147e-03, -6.02878883e-03],
...,
[-4.55371608e+00, -3.14400031e+00, -2.12369915e-02, ...,
1.46758031e-02, 3.30986949e-03, -1.37144339e-02],
[ 7.20423555e-01, 5.47831401e-01, -2.74887748e+00, ...,
-2.87129689e-02, -1.81522701e-02, -3.54864423e-03],
[-3.41936854e+00, -1.33970253e+00, -7.96429006e-03, ...,
-3.12956806e-03, 1.28114834e-02, 6.54333122e-04]])
PCscore_train
array([[-3.08484180e+00, -2.15870446e+00, -3.39874585e-01, ...,
3.08256329e-02, -1.30215619e-02, 1.01519951e-02],
[-2.18264681e+00, -6.17571155e-01, 4.47207650e-01, ...,
4.52299177e-02, 4.55773527e-03, 9.01158415e-04],
[ 2.04995887e+00, 2.32895331e+00, 1.16940721e+00, ...,
3.67990822e-02, 4.25992147e-03, -6.02878883e-03],
...,
[-4.55371608e+00, -3.14400031e+00, -2.12369915e-02, ...,
1.46758031e-02, 3.30986949e-03, -1.37144339e-02],
[ 7.20423555e-01, 5.47831401e-01, -2.74887748e+00, ...,
-2.87129689e-02, -1.81522701e-02, -3.54864423e-03],
[-3.41936854e+00, -1.33970253e+00, -7.96429006e-03, ...,
-3.12956806e-03, 1.28114834e-02, 6.54333122e-04]])
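The two arrays match because X_train was already centered by StandardScaler (PCA subtracts the column means before projecting, and those means are ~0 here). A quick check:
np.allclose(np.dot(X_train, eigen_vector), PCscore_train)  # True, up to floating-point error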
eigen_value = pca.explained_variance_ # eigenvalues
choose = []
for i in eigen_value:
a = i/np.sum(eigen_value)
choose.append(a)
choose # the ratio drops sharply from Z3 onward -> keeping only Z1 and Z2 already looks sufficient (see the cross-check after this list)
[0.43167479744322024,
0.19845651750077126,
0.09733159192997186,
0.0653157415624998,
0.05212151207016257,
0.04198959644450524,
0.022634609138970724,
0.01682668652604661,
0.012946900414362205,
0.012094099143067838,
0.010571854106342564,
0.008992780948138657,
0.008094113609679003,
0.005107617279145752,
0.0028272393829968587,
0.0022931012582374033,
0.001982055813007543,
0.0017973165949155163,
0.001662632381205944,
0.0010560996593097865,
0.00091814344492389,
0.0009039266942731821,
0.0007827532266454116,
0.0005688006003322696,
0.0005033805633839237,
0.00024412978420170582,
0.0002231853409664187,
4.963272673588701e-05,
2.5302419153286345e-05,
3.881992826491497e-06]
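The same ratios are available directly as pca.explained_variance_ratio_, and their cumulative sum is a convenient way to decide how many components to keep:
np.allclose(pca.explained_variance_ratio_, choose)   # True: the loop above recomputes this attribute
np.cumsum(pca.explained_variance_ratio_)[:3]         # variance explained by Z1, Z1+Z2, Z1+Z2+Z3 (~0.43, 0.63, 0.73)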
- PCA visualization
plt.figure(figsize=(8,6))
plt.scatter(X_train2[:,0],X_train2[:,1],c=y_train,cmap='plasma')
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
Text(0, 0.5, 'Second principal component')
-> Even with only two principal components (Z1, Z2), the classes are separated well!
# hyperparameter tuning
param_grid = {'C': [0.01,0.1,0.5,1,10,100],
'gamma': [1,0.75,0.5,0.25,0.1,0.01,0.001],
'kernel': ['rbf','poly','linear']}
grid = GridSearchCV(SVC(),param_grid=param_grid, refit=True, verbose=1, cv=5, iid=True)
grid.fit(X_train2,y_train)
print(grid.best_params_)
Fitting 5 folds for each of 126 candidates, totalling 630 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
{'C': 1, 'gamma': 0.75, 'kernel': 'poly'}
[Parallel(n_jobs=1)]: Done 630 out of 630 | elapsed: 18.1s finished
C:\Users\Hera\anaconda3\lib\site-packages\sklearn\model_selection\_search.py:847: FutureWarning: The parameter 'iid' is deprecated in 0.22 and will be removed in 0.24.
warnings.warn(
# Fit the model
svm_clf2 = SVC(C=1,gamma=0.75,kernel='poly')
svm_clf2.fit(X_train2,y_train)
SVC(C=1, gamma=0.75, kernel='poly')
# Check the predictive performance
y_pred = svm_clf2.predict(X_test2)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
[[ 59 4]
[ 2 106]]
0.9649122807017544
-> High accuracy is achievable with only three variables (Z1, Z2, Z3)