My first Kaggle kernel transcription.
I learned new things about how to use scalers and about GridSearchCV.
Kaggle Support Vector Machine + PCA transcription
Source: https://www.kaggle.com/faressayah/support-vector-machine-pca-tutorial-for-beginner
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid') # seaborn style: white background with grid lines
# Load the data
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
cancer.keys() # 'data' holds the explanatory variables, 'target' holds the labels
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
col_names = list(cancer.feature_names)
col_names.append('target')
df = pd.DataFrame(np.c_[cancer.data,cancer.target], columns=col_names)
df.head()
 | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | 0.0 |
1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | 0.0 |
2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | 0.0 |
3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | 0.0 |
4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | 0.0 |
5 rows × 31 columns
# Quick look at the data
cancer.target_names # class names of the target variable
array(['malignant', 'benign'], dtype='<U9')
df.describe()
# The explanatory variables differ widely in scale!
# SVM is sensitive to data with such large scale differences
 | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | ... | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 |
mean | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.096360 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | 0.062798 | ... | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 | 0.627417 |
std | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.079720 | 0.038803 | 0.027414 | 0.007060 | ... | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 | 0.483918 |
min | 6.981000 | 9.710000 | 43.790000 | 143.500000 | 0.052630 | 0.019380 | 0.000000 | 0.000000 | 0.106000 | 0.049960 | ... | 12.020000 | 50.410000 | 185.200000 | 0.071170 | 0.027290 | 0.000000 | 0.000000 | 0.156500 | 0.055040 | 0.000000 |
25% | 11.700000 | 16.170000 | 75.170000 | 420.300000 | 0.086370 | 0.064920 | 0.029560 | 0.020310 | 0.161900 | 0.057700 | ... | 21.080000 | 84.110000 | 515.300000 | 0.116600 | 0.147200 | 0.114500 | 0.064930 | 0.250400 | 0.071460 | 0.000000 |
50% | 13.370000 | 18.840000 | 86.240000 | 551.100000 | 0.095870 | 0.092630 | 0.061540 | 0.033500 | 0.179200 | 0.061540 | ... | 25.410000 | 97.660000 | 686.500000 | 0.131300 | 0.211900 | 0.226700 | 0.099930 | 0.282200 | 0.080040 | 1.000000 |
75% | 15.780000 | 21.800000 | 104.100000 | 782.700000 | 0.105300 | 0.130400 | 0.130700 | 0.074000 | 0.195700 | 0.066120 | ... | 29.720000 | 125.400000 | 1084.000000 | 0.146000 | 0.339100 | 0.382900 | 0.161400 | 0.317900 | 0.092080 | 1.000000 |
max | 28.110000 | 39.280000 | 188.500000 | 2501.000000 | 0.163400 | 0.345400 | 0.426800 | 0.201200 | 0.304000 | 0.097440 | ... | 49.540000 | 251.200000 | 4254.000000 | 0.222600 | 1.058000 | 1.252000 | 0.291000 | 0.663800 | 0.207500 | 1.000000 |
8 rows × 31 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 mean radius 569 non-null float64
1 mean texture 569 non-null float64
2 mean perimeter 569 non-null float64
3 mean area 569 non-null float64
4 mean smoothness 569 non-null float64
5 mean compactness 569 non-null float64
6 mean concavity 569 non-null float64
7 mean concave points 569 non-null float64
8 mean symmetry 569 non-null float64
9 mean fractal dimension 569 non-null float64
10 radius error 569 non-null float64
11 texture error 569 non-null float64
12 perimeter error 569 non-null float64
13 area error 569 non-null float64
14 smoothness error 569 non-null float64
15 compactness error 569 non-null float64
16 concavity error 569 non-null float64
17 concave points error 569 non-null float64
18 symmetry error 569 non-null float64
19 fractal dimension error 569 non-null float64
20 worst radius 569 non-null float64
21 worst texture 569 non-null float64
22 worst perimeter 569 non-null float64
23 worst area 569 non-null float64
24 worst smoothness 569 non-null float64
25 worst compactness 569 non-null float64
26 worst concavity 569 non-null float64
27 worst concave points 569 non-null float64
28 worst symmetry 569 non-null float64
29 worst fractal dimension 569 non-null float64
30 target 569 non-null float64
dtypes: float64(31)
memory usage: 137.9 KB
df.columns # mean/error/worst
Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
'mean smoothness', 'mean compactness', 'mean concavity',
'mean concave points', 'mean symmetry', 'mean fractal dimension',
'radius error', 'texture error', 'perimeter error', 'area error',
'smoothness error', 'compactness error', 'concavity error',
'concave points error', 'symmetry error', 'fractal dimension error',
'worst radius', 'worst texture', 'worst perimeter', 'worst area',
'worst smoothness', 'worst compactness', 'worst concavity',
'worst concave points', 'worst symmetry', 'worst fractal dimension',
'target'],
dtype='object')
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
Each of the base features a-j above comes in three variants: mean, se (standard error), and worst (the mean of the three largest values of that feature). For example, 'mean radius' and 'radius error' are the mean tumor radius and its standard error, while 'worst radius' is the mean of the three largest radius values.
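Because every base feature appears in mean / error / worst variants, the three column groups can also be selected programmatically instead of typing them out. A small sketch, relying only on the column names already present in df (variable names are illustrative):
mean_cols  = [c for c in df.columns if c.startswith('mean ')]   # the 10 'mean' features
error_cols = [c for c in df.columns if c.endswith(' error')]    # the 10 'error' features
worst_cols = [c for c in df.columns if c.startswith('worst ')]  # the 10 'worst' features
print(len(mean_cols), len(error_cols), len(worst_cols))         # 10 10 10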
# Pairplot of the 'mean' variables
sns.pairplot(df, hue='target', vars=['mean radius', 'mean texture', 'mean perimeter', 'mean area',
'mean smoothness', 'mean compactness', 'mean concavity',
'mean concave points', 'mean symmetry', 'mean fractal dimension'])
# vars : list of variable names
# Variables within ``data`` to use, otherwise use every column with a numeric datatype
<seaborn.axisgrid.PairGrid at 0x1d1e8bac850>
The explanatory variables show linear relationships with one another -> multicollinearity
# Look at the target variable
sns.countplot(x='target', data=df)
<AxesSubplot:xlabel='target', ylabel='count'>
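The same class balance can also be read off numerically (this dataset has 212 malignant and 357 benign samples):
df['target'].value_counts() # 1.0 (benign) = 357, 0.0 (malignant) = 212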
# Correlations among the explanatory variables
plt.figure(figsize=(20,10))
sns.heatmap(df.corr(), annot=True,square=True,cmap='RdBu')
# annot=True writes the data value inside each cell
# square=True draws each cell as a square
# 'RdBu' is a diverging Color Brewer palette: diverging palettes emphasize both ends, so they suit data where both low and high values are of interest
<AxesSubplot:>
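To read the heatmap numerically, the most strongly correlated feature pairs can be listed directly. A minimal sketch using the same correlation matrix (variable names are illustrative):
corr = df.drop(columns='target').corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep each pair only once
upper.stack().sort_values(ascending=False).head(10)                # radius / perimeter / area pairs dominate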
Model training
from sklearn.model_selection import cross_val_score # k-fold CV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
# A pipeline runs several transformation steps in a fixed order.
# scikit-learn provides the Pipeline class to chain successive transformations.
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# StandardScaler: standard scaling using the mean and standard deviation
# MinMaxScaler: scales each feature so its minimum and maximum become 0 and 1
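As a quick reminder of what the two scalers actually compute, a toy example (values chosen arbitrarily, one feature only):
toy = np.array([[1.0], [2.0], [10.0]])        # a single feature with a skewed scale
StandardScaler().fit_transform(toy).ravel()   # (x - mean) / std
MinMaxScaler().fit_transform(toy).ravel()     # (x - min) / (max - min) -> range [0, 1]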
# Split the data
X = df.drop('target', axis=1)
y = df.target
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(398, 30) (171, 30) (398,) (171,)
- SVM
# linear kernel SVM
from sklearn.svm import LinearSVC
model = LinearSVC(loss='hinge', dual=True)
model.fit(X_train,y_train)
# The support vector machine loss function is called the hinge loss
# squared_hinge is simply the hinge loss squared; plain hinge is the more common choice
# (a small sketch after this cell's output computes the hinge loss by hand)
# The ConvergenceWarning below is expected here, most likely because the features are not scaled yet
# loss : {'hinge', 'squared_hinge'}, default='squared_hinge'
# Specifies the loss function. 'hinge' is the standard SVM loss
# (used e.g. by the SVC class) while 'squared_hinge' is the square of the hinge loss.
# dual : bool, default=True
# Select the algorithm to either solve the dual or primal
# optimization problem. Prefer dual=False when n_samples > n_features.
C:\Users\Hera\anaconda3\lib\site-packages\sklearn\svm\_base.py:976: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
warnings.warn("Liblinear failed to converge, increase "
LinearSVC(loss='hinge')
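For reference, the hinge loss mentioned above can be computed by hand from the model's decision function. A minimal sketch, assuming the labels are recoded to ±1 (y_pm1 is an illustrative name):
y_pm1 = np.where(y_test == 1, 1, -1)                                   # hinge loss expects labels in {-1, +1}
np.mean(np.maximum(0, 1 - y_pm1 * model.decision_function(X_test)))    # mean of max(0, 1 - y * f(x))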
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = model.predict(X_test)
confusion_matrix(y_test, y_pred)
array([[ 53, 10],
[ 0, 108]], dtype=int64)
# Accuracy of LinearSVC
accuracy_score(y_test,y_pred)
0.9415204678362573
# linear-kernel SVC
from sklearn.svm import SVC
model = SVC(kernel='linear')
model.fit(X_train,y_train)
SVC(kernel='linear')
y_pred = model.predict(X_test)
confusion_matrix(y_test,y_pred)
array([[ 59, 4],
[ 2, 106]], dtype=int64)
accuracy_score(y_test,y_pred)
0.9649122807017544
# polynomial kernel SVM
# too slow on my machine, so I gave up waiting for it to finish..
model = SVC(kernel='poly', degree=3, gamma='auto', coef0=1, C=5) # coef0 only matters for the poly and sigmoid kernels
model.fit(X_train, y_train)
# radial Kernel
model = SVC(kernel='rbf', gamma=0.5, C=0.1)
model.fit(X_train,y_train)
Preparing the data for SVM
- Numerical inputs: SVM assumes the explanatory variables are numeric, so categorical variables must be converted to dummy variables (a small example follows this list)
- Binary classification: the basic SVM is a binary classifier, although regression and multi-class classification are also possible
- This dataset already satisfies both conditions (all explanatory variables are numeric and y is binary)
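This dataset has no categorical columns, but if it did, the dummy-variable conversion mentioned above could look like this ('tumor_site' is a purely hypothetical column, shown only for illustration):
demo = pd.DataFrame({'tumor_site': ['left', 'right', 'left', 'center']})  # hypothetical categorical feature
pd.get_dummies(demo, columns=['tumor_site'])                              # one 0/1 indicator column per category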
# Scale the explanatory variables
# StandardScaler: standard scaling using the mean and standard deviation
# MinMaxScaler: scales each feature so its minimum and maximum become 0 and 1
# Build the pipeline
pipeline = Pipeline([
    ('min_max_scaler', MinMaxScaler()),
    ('std_scaler', StandardScaler())
])
# (Note: chaining MinMaxScaler and StandardScaler this way ends up equivalent to StandardScaler alone,
#  since standardizing a min-max-scaled column gives the same result as standardizing the raw column.)
X_train = pipeline.fit_transform(X_train)
X_test = pipeline.transform(X_test)
# However, you must NOT apply scaled_X_test = scaler.fit_transform(X_test) to the test data.
# Doing so makes the scaler discard the statistics it learned when fitting the training data
# and re-estimate them from the test data instead.
# That is why fit_transform() must never be applied to the test set.
# To avoid this kind of mistake, it is convenient to wrap preprocessing such as scaling in a Pipeline, as above, so the fitted transformation is reused.
# Key point: in supervised learning, the training and test sets must receive the same transformation!
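A standalone sketch of why the test set must reuse the training statistics; it re-splits the data locally and does not modify the variables used above (X_tr, X_te, ok, bad are illustrative names):
X_tr, X_te, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler().fit(X_tr)              # statistics learned on the training set only
ok  = scaler.transform(X_te)                     # test set mapped with the TRAINING mean/std
bad = StandardScaler().fit_transform(X_te)       # wrong: statistics re-estimated on the test set
np.allclose(ok, bad)                             # False -> the two transformations differ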
X_train
array([[-0.12348985, -0.29680142, -0.17050713, ..., -0.84082156,
-0.8563616 , -0.76574773],
[-0.22826757, -0.65795149, -0.25377521, ..., -0.37706655,
-1.3415819 , -0.41480748],
[ 0.14553402, -1.23056444, 0.24583328, ..., -0.04762652,
-0.08997059, 0.4882635 ],
...,
[ 0.03226081, -0.55578404, -0.08064356, ..., -1.26179013,
-0.6828391 , -1.27672587],
[-0.05552593, 0.10949242, -0.04684166, ..., 1.07924018,
0.4755842 , 1.25530227],
[-0.56525537, 0.32333128, -0.619825 , ..., -0.61952313,
-0.30366032, -0.84348042]])
# Fit the SVM on the scaled data
model = SVC(kernel='linear')
model.fit(X_train,y_train)
SVC(kernel='linear')
y_pred = model.predict(X_test)
confusion_matrix(y_test,y_pred)
array([[ 61, 2],
[ 2, 106]], dtype=int64)
accuracy_score(y_test,y_pred)
0.9766081871345029
-> Scaling the explanatory variables improved the predictive accuracy
from sklearn.model_selection import GridSearchCV
# Choosing hyperparameters -> GridSearchCV is used in most cases
param_grid = {'C': [0.01,0.1,0.5,1,10,100], # default=1.0
'gamma': [1,0.75,0.5,0.25,0.1,0.01,0.001], # if 'auto', uses 1 / n_features.
'kernel': ['rbf','poly','linear']}
grid = GridSearchCV(SVC(), param_grid=param_grid,cv=5,iid=True,refit=True, verbose=1)
# refit=True (default): after the search, the estimator is refit on the whole training set with the best parameters
# (since refit=True, grid.best_estimator_ can be used directly for prediction; see the note after the tuned-model results below)
# note: the iid parameter is deprecated in recent scikit-learn versions (see the warning below) and can simply be omitted
# verbose: controls the progress messages printed during the grid search
# verbose=0 (default): no messages
# verbose=1: brief messages
# verbose=2: a message per hyperparameter combination
grid.fit(X_train,y_train)
Fitting 5 folds for each of 126 candidates, totalling 630 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 630 out of 630 | elapsed: 3.0s finished
C:\Users\Hera\anaconda3\lib\site-packages\sklearn\model_selection\_search.py:847: FutureWarning: The parameter 'iid' is deprecated in 0.22 and will be removed in 0.24.
warnings.warn(
GridSearchCV(cv=5, estimator=SVC(), iid=True,
param_grid={'C': [0.01, 0.1, 0.5, 1, 10, 100],
'gamma': [1, 0.75, 0.5, 0.25, 0.1, 0.01, 0.001],
'kernel': ['rbf', 'poly', 'linear']},
verbose=1)
grid.best_params_
{'C': 0.1, 'gamma': 1, 'kernel': 'linear'}
svm_clf = SVC(C=0.1,gamma=1,kernel='linear')
svm_clf.fit(X_train,y_train)
SVC(C=0.1, gamma=1, kernel='linear')
y_pred = svm_clf.predict(X_test)
print(confusion_matrix(y_pred,y_test))
print(accuracy_score(y_pred,y_test))
[[ 61 1]
[ 2 107]]
0.9824561403508771
-> Accuracy improved further
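Since refit=True, the fitted grid object already holds the best model refit on the full training set, so re-declaring SVC by hand is optional. An equivalent call (it should reproduce the accuracy above; grid.predict(X_test) also works when refit=True):
print(accuracy_score(y_test, grid.best_estimator_.predict(X_test)))  # same model as svm_clf above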
# Re-split and scale the data (StandardScaler)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
# Store the PCA-reduced features separately -> they will be used for modelling again later
X_train2 = pca.fit_transform(X_train) # Z1,Z2,Z3
X_test2 = pca.transform(X_test)
print(X_train2.shape, X_test2.shape) # confirm that only the 3 PC scores were kept
(398, 3) (171, 3)
X_train.shape
(398, 30)
Applying the concept by hand (matrix algebra)
pca = PCA()
PCscore_train = pca.fit_transform(X_train)
PCscore_test = pca.transform(X_test)
eigen_value = pca.explained_variance_
eigen_value.shape
(30,)
eigen_vector = pca.components_.transpose() # eigenvectors as columns (transpose of pca.components_)
# the first column is the eigenvector for the first (largest) eigenvalue, e1
# the second column is the eigenvector for the second eigenvalue, e2
# the third column is the eigenvector for the third eigenvalue, e3, ... up to e30
eigen_vector.shape
(30, 30)
np.dot(X_train,eigen_vector) # PC scores Z1...Z30: project the (already standardized) data onto the eigenvectors
array([[-3.08484180e+00, -2.15870446e+00, -3.39874585e-01, ...,
3.08256329e-02, -1.30215619e-02, 1.01519951e-02],
[-2.18264681e+00, -6.17571155e-01, 4.47207650e-01, ...,
4.52299177e-02, 4.55773527e-03, 9.01158415e-04],
[ 2.04995887e+00, 2.32895331e+00, 1.16940721e+00, ...,
3.67990822e-02, 4.25992147e-03, -6.02878883e-03],
...,
[-4.55371608e+00, -3.14400031e+00, -2.12369915e-02, ...,
1.46758031e-02, 3.30986949e-03, -1.37144339e-02],
[ 7.20423555e-01, 5.47831401e-01, -2.74887748e+00, ...,
-2.87129689e-02, -1.81522701e-02, -3.54864423e-03],
[-3.41936854e+00, -1.33970253e+00, -7.96429006e-03, ...,
-3.12956806e-03, 1.28114834e-02, 6.54333122e-04]])
PCscore_train
array([[-3.08484180e+00, -2.15870446e+00, -3.39874585e-01, ...,
3.08256329e-02, -1.30215619e-02, 1.01519951e-02],
[-2.18264681e+00, -6.17571155e-01, 4.47207650e-01, ...,
4.52299177e-02, 4.55773527e-03, 9.01158415e-04],
[ 2.04995887e+00, 2.32895331e+00, 1.16940721e+00, ...,
3.67990822e-02, 4.25992147e-03, -6.02878883e-03],
...,
[-4.55371608e+00, -3.14400031e+00, -2.12369915e-02, ...,
1.46758031e-02, 3.30986949e-03, -1.37144339e-02],
[ 7.20423555e-01, 5.47831401e-01, -2.74887748e+00, ...,
-2.87129689e-02, -1.81522701e-02, -3.54864423e-03],
[-3.41936854e+00, -1.33970253e+00, -7.96429006e-03, ...,
-3.12956806e-03, 1.28114834e-02, 6.54333122e-04]])
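The two arrays match because X_train was already centered by StandardScaler (PCA subtracts the column means before projecting, and those means are ~0 here). A quick check:
np.allclose(np.dot(X_train, eigen_vector), PCscore_train)  # True, up to floating-point error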
eigen_value = pca.explained_variance_ # eigenvalues
choose = []
for i in eigen_value:
a = i/np.sum(eigen_value)
choose.append(a)
choose # the ratio drops sharply from Z3 onward -> keeping only Z1 and Z2 already looks sufficient (see the cross-check after this list)
[0.43167479744322024,
0.19845651750077126,
0.09733159192997186,
0.0653157415624998,
0.05212151207016257,
0.04198959644450524,
0.022634609138970724,
0.01682668652604661,
0.012946900414362205,
0.012094099143067838,
0.010571854106342564,
0.008992780948138657,
0.008094113609679003,
0.005107617279145752,
0.0028272393829968587,
0.0022931012582374033,
0.001982055813007543,
0.0017973165949155163,
0.001662632381205944,
0.0010560996593097865,
0.00091814344492389,
0.0009039266942731821,
0.0007827532266454116,
0.0005688006003322696,
0.0005033805633839237,
0.00024412978420170582,
0.0002231853409664187,
4.963272673588701e-05,
2.5302419153286345e-05,
3.881992826491497e-06]
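The same ratios are available directly as pca.explained_variance_ratio_, and their cumulative sum is a convenient way to decide how many components to keep:
np.allclose(pca.explained_variance_ratio_, choose)   # True: the loop above recomputes this attribute
np.cumsum(pca.explained_variance_ratio_)[:3]         # variance explained by Z1, Z1+Z2, Z1+Z2+Z3 (~0.43, 0.63, 0.73)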
- PCA visualization
plt.figure(figsize=(8,6))
plt.scatter(X_train2[:,0],X_train2[:,1],c=y_train,cmap='plasma')
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
Text(0, 0.5, 'Second principal component')
-> Even with only two principal components (Z1, Z2), the classes are separated well!
# hyperparameter tuning
param_grid = {'C': [0.01,0.1,0.5,1,10,100],
'gamma': [1,0.75,0.5,0.25,0.1,0.01,0.001],
'kernel': ['rbf','poly','linear']}
grid = GridSearchCV(SVC(),param_grid=param_grid, refit=True, verbose=1, cv=5, iid=True)
grid.fit(X_train2,y_train)
print(grid.best_params_)
Fitting 5 folds for each of 126 candidates, totalling 630 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
{'C': 1, 'gamma': 0.75, 'kernel': 'poly'}
[Parallel(n_jobs=1)]: Done 630 out of 630 | elapsed: 18.1s finished
C:\Users\Hera\anaconda3\lib\site-packages\sklearn\model_selection\_search.py:847: FutureWarning: The parameter 'iid' is deprecated in 0.22 and will be removed in 0.24.
warnings.warn(
# Fit the model
svm_clf2 = SVC(C=1,gamma=0.75,kernel='poly')
svm_clf2.fit(X_train2,y_train)
SVC(C=1, gamma=0.75, kernel='poly')
# Check the predictive performance
y_pred = svm_clf2.predict(X_test2)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
[[ 59 4]
[ 2 106]]
0.9649122807017544
-> High accuracy is achievable with only three variables (Z1, Z2, Z3)