24.05.29 Ensemble model/ Neural network

Ensemble model

Boosting

Each new tree is trained on the data that the previous trees got wrong.

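Types of boosting models
- AdaBoost: gives more weight to misclassified samples
- Gradient boosting: looks at the errors (failures) of each tree and builds a new tree to correct them (the training error can be driven close to 0)
  - XGBoost (extreme gradient boosting): fast computation / helps prevent overfitting
  - LightGBM
  - CatBoost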
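
The AdaBoost idea of giving misclassified samples more weight can be sketched for a single round as below. This is a minimal illustration of the standard update rule, not code from the lecture; the function name `adaboost_reweight` and the toy inputs are assumptions, and the weighted error is assumed to lie strictly between 0 and 1.

```python
import math

def adaboost_reweight(weights, misclassified):
    """Return (alpha, new_weights) after one AdaBoost round.

    weights: current sample weights (summing to 1)
    misclassified: list of booleans, True where the weak learner was wrong
    """
    # Weighted error of the weak learner (assumed to be strictly between 0 and 1)
    err = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    # Importance of this learner: larger when the weighted error is smaller
    alpha = 0.5 * math.log((1 - err) / err)
    # Increase the weights of misclassified samples, decrease the others
    updated = [w * math.exp(alpha if wrong else -alpha)
               for w, wrong in zip(weights, misclassified)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

weights = [0.2] * 5                                   # uniform start
misclassified = [False, False, True, False, False]    # one sample was wrong
alpha, new_weights = adaboost_reweight(weights, misclassified)
print(alpha, new_weights)
```

After the update, the single misclassified sample carries as much weight as the four correct ones combined, so the next weak learner is pushed to get it right.
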
Finding the weights with which the model best predicts the correct answer is the main goal of machine learning and deep learning.
Concept

Repeat until the error gets close to 0 (in practice, until a set number of trees has been built).
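
The "fit a new tree to the mistakes and repeat" loop can be sketched in plain Python as follows. This is a simplified illustration, assuming squared-error regression on 1-D data with depth-1 trees (stumps); it is not how XGBoost is implemented internally, and the helper names `fit_stump` and `mse` and the toy data are made up for this example.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on 1-D xs that minimizes squared error."""
    best = None
    for t in sorted(set(xs))[1:]:  # candidate thresholds (both sides non-empty)
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0, 5.2, 4.8]

learning_rate = 0.5
preds = [sum(ys) / len(ys)] * len(ys)   # start from the mean prediction
errors = [mse(ys, preds)]
for _ in range(20):                     # fixed number of boosting rounds
    residuals = [y - p for y, p in zip(ys, preds)]   # the current "mistakes"
    stump = fit_stump(xs, residuals)                 # new tree fitted to them
    preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    errors.append(mse(ys, preds))

print(errors[0], errors[-1])   # training error before and after boosting
```
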

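Factors to set when using XGBoost

objective: specify which type of problem the task belongs to (e.g. binary classification)
learning_rate: the step size used when updating the weights.
e.g. with a large learning_rate, each step toward the minimum is large, so the update can jump over the minimum and diverge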
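
The effect of the step size can be seen on a toy problem. The sketch below assumes plain gradient descent on f(w) = w**2, which is only an analogy for the learning_rate note above (in XGBoost itself the parameter scales each new tree's contribution); the function name and the two learning rates are chosen for illustration.

```python
def gradient_descent(learning_rate, steps=20, w=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) and return the final w."""
    for _ in range(steps):
        grad = 2 * w
        w = w - learning_rate * grad
    return w

small = gradient_descent(learning_rate=0.1)   # each step shrinks w toward 0
large = gradient_descent(learning_rate=1.1)   # each step overshoots the minimum
print(abs(small), abs(large))
```

With the small rate the distance to the minimum shrinks every step; with the large rate every step lands farther away on the opposite side, so the value diverges.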

*Coding

import time

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib_inline.backend_inline

import sklearn
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

import xgboost
from xgboost import XGBClassifier

Load the files

train = pd.read_csv("../240529/hr_train_scaling.csv", index_col=0)
test = pd.read_csv("../240529/hr_test_scaling.csv", index_col=0)

Separate features (x) and target (y) for the train/test data

# All columns except the last are features; the code assumes "attrition" is the last column
x_train = train[train.columns[:-1]].values
y_train = train["attrition"].values
x_test = test[test.columns[:-1]].values
y_test = test["attrition"].values
Random forest modeling
- The grid_search() function uses cross-validation to find the best parameters for a given machine learning model and hyperparameter grid,
- then evaluates the best model on the test data
  
  def grid_search(params, model, core):
      # params: dictionary of hyperparameter values to try
      # model: machine learning model to tune
      # core: number of CPU cores for parallel processing; -1 uses all available cores
      model_grid = GridSearchCV(model,
                                params,
                                cv=5,
                                return_train_score=True,  # also record scores on the training folds
                                n_jobs=core)
      # Run the grid search on the training data
      model_grid.fit(x_train, y_train)
  
      print('Best parameters: ', model_grid.best_params_)
      # best_score_ is the mean cross-validated score of the best parameter set
      print('Best cross-validation accuracy: ', model_grid.best_score_)
  
      model_best = model_grid.best_estimator_
      pred_best = model_best.predict(x_test)
      print('Test accuracy of the best model: ', metrics.accuracy_score(y_test, pred_best))
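
What a cross-validated grid search does internally can be sketched without any library. The example below is an illustration only, not the GridSearchCV implementation: it assumes a tiny 1-D k-nearest-neighbour classifier as the model, interleaved folds, and made-up names (`knn_predict`, `cv_accuracy`, `param_grid`).

```python
from itertools import product

def knn_predict(train, x, k):
    """Majority vote among the k nearest 1-D training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def cv_accuracy(data, k, n_folds=5):
    """Mean validation accuracy over n_folds interleaved folds."""
    scores = []
    for fold in range(n_folds):
        valid = [p for i, p in enumerate(data) if i % n_folds == fold]
        train = [p for i, p in enumerate(data) if i % n_folds != fold]
        hits = sum(knn_predict(train, x, k) == y for x, y in valid)
        scores.append(hits / len(valid))
    return sum(scores) / n_folds

# Toy data: the label is 1 when x >= 10
data = [(x, int(x >= 10)) for x in range(20)]
param_grid = {"k": [1, 3, 5]}

# Try every combination in the grid and keep its cross-validated score
results = []
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    results.append((cv_accuracy(data, **params), params))

best_score, best_params = max(results, key=lambda r: r[0])
print(best_params, best_score)
```

The number of model fits is the number of grid combinations times the number of folds, which is why large grids are slow and why n_jobs matters.
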
- Check accuracy
  
  # oob_score=True: evaluate generalization performance using the out-of-bag samples
  rfc = RandomForestClassifier(oob_score=True, random_state=209)
  
  # Train the random forest on x_train and y_train
  rfc.fit(x_train, y_train)
  
  pred_rfc = rfc.predict(x_test)
  print('The accuracy of the RFC is', metrics.accuracy_score(y_test, pred_rfc))
- Compute the OOB score for different numbers of trees and plot the relationship between the number of trees and the OOB score
  
  oob_score = []
  estimators = list(range(100, 400, 50))
  
  for i in estimators:
      rfc = RandomForestClassifier(n_estimators=i,
                                   oob_score=True,
                                   random_state=209,
                                   # n_jobs=-1
                                   )
      rfc.fit(x_train, y_train)
      # Append the trained model's OOB score to the oob_score list
      oob_score.append(rfc.oob_score_)
  
  fig, ax = plt.subplots(figsize=(5, 4))
  ax.plot(estimators, oob_score, marker="o")
  ax.set(xlabel="Number of trees",
         ylabel="OOB SCORE",
         title="OOB SCORE by number of trees");
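
Where the out-of-bag samples come from can be shown with a single bootstrap draw. This is a minimal sketch using only the standard library; the sample size of 1000 is an arbitrary assumption, and it illustrates the sampling idea rather than scikit-learn's internal code.

```python
import random

random.seed(209)
n_samples = 1000

# Bootstrap: draw n_samples row indices with replacement (one tree's training set)
bootstrap = [random.randrange(n_samples) for _ in range(n_samples)]
in_bag = set(bootstrap)

# Rows never drawn are "out of bag" for this tree and can be used to validate it
out_of_bag = [i for i in range(n_samples) if i not in in_bag]

oob_fraction = len(out_of_bag) / n_samples
print(oob_fraction)   # expected to be near (1 - 1/n)**n, roughly 0.37 for large n
```

Because each tree leaves out about a third of the rows, the forest can score every row using only the trees that did not see it, which is what oob_score_ reports.
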
XGBoost
- XGBoost modeling (binary classification problem)
  
  # booster="gbtree": use gradient boosted trees as the booster
  # objective="binary:logistic": use logistic regression as the objective for binary classification
  xgb = XGBClassifier(booster="gbtree",
                      objective="binary:logistic")
  xgb.fit(x_train, y_train)
  
  # Check the accuracy
  pred_xgb = xgb.predict(x_test)
  print('The accuracy of the xgboost is', metrics.accuracy_score(y_test, pred_xgb))
- Specify the parameter grid used to find the best_estimator
  
  # max_depth: maximum depth of each tree
  # learning_rate: learning rate
  # n_estimators: number of trees
  params = {"max_depth": [1, 2, 3],
            "learning_rate": np.arange(0.01, 0.1, 0.001),
            "n_estimators": np.arange(100, 300, 30)}
  
  # XGBClassifier: create the XGBoost classifier object
  # booster="gbtree": use the decision-tree-based booster
  # objective="binary:logistic": use logistic regression as the objective for binary classification
  base_xgb = XGBClassifier(booster="gbtree",
                           objective="binary:logistic")
- Use grid_search() to find the best parameters among the candidate hyperparameters
  
  grid_search(params, base_xgb, -1)
- Create and train the XGBoost classifier with the selected parameters
  
  # booster="gbtree": use gradient boosted trees, XGBoost's default booster
  # objective="binary:logistic": logistic regression objective for binary classification
  # learning_rate=0.0301, max_depth=3, n_estimators=250: selected learning rate, tree depth, and number of trees
  xgb_best = XGBClassifier(booster="gbtree",
                           objective="binary:logistic",
                           learning_rate=0.0301,
                           max_depth=3,
                           n_estimators=250)
  
  # Train the model on x_train and y_train
  xgb_best.fit(x_train, y_train)
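- Use Matplotlib to plot the feature importances of the two models, the random forest (rfc) and XGBoost (xgb), as bar charts
  
  # rfc_best is not defined anywhere in these notes; it is assumed to be a
  # RandomForestClassifier refit with its best grid-search parameters
  # (substitute rfc to run the snippet as-is)
  fig, ax = plt.subplots(figsize=(8, 3))
  
  ax.bar(train.columns[:-1],
         rfc_best.feature_importances_,
         color="tab:blue",
         label="rfc",
         alpha=0.3)
  
  ax.bar(train.columns[:-1],
         xgb_best.feature_importances_,
         color="tab:red",
         label="xgb",
         alpha=0.3)
  
  ax.legend()
  plt.setp(ax.get_xticklabels(), ha="right", rotation=45);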

Neural network

A perceptron is a concept that mimics the role of a single neuron
Purpose of perceptron: find the weight w that can best match the input (x) and the target (t)
--> find the weight values that make the output for each input match its target well

Hidden node

Take the weighted sum of input values and perform a non-linear activation
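In other words, a node combines the information from several variables in its own way -> then decides how much of it to pass on to the next stage (activation)
The parameters (unknowns) of every neural network are its weights
Every activation is non-linear (if it were linear, the whole network would reduce to a linear model, so linear regression would be enough)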
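
A single hidden node can be sketched as a weighted sum followed by an activation. The function name `hidden_node`, the bias term, and the choice of tanh as the default activation are assumptions for illustration; the notes themselves only mention weights.

```python
import math

def hidden_node(inputs, weights, bias=0.0, activation=math.tanh):
    """Weighted sum of the inputs followed by a non-linear activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias   # combine the inputs
    return activation(z)                                     # decide how much to pass on

out = hidden_node([1.0, 2.0], [0.5, 0.25])   # z = 1*0.5 + 2*0.25 = 1.0
print(out)
```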

Representative activation functions
- Sigmoid (logistic function, the inverse of the logit): historically the most commonly used activation, (0, 1) range, learning speed is relatively slow
- Tanh: similar to sigmoid but (-1, 1) range, learning speed is relatively fast
- ReLU (rectified linear unit): very fast learning speed, easy to compute (no exponential function)
  --> values below 0 are treated as 0
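
The three activations listed above can be written with the standard library alone. This is a small sketch for comparing their output ranges; the sample inputs are arbitrary.

```python
import math

def sigmoid(z):
    """Squashes z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squashes z into (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """Passes positive values through and maps negative values to 0."""
    return max(0.0, z)

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z))
```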

How do we know that the relationship is accurately found?
- Use a loss function (how close the output y is to the target t)
  # the target is the correct answer we already have; y is the value predicted by the perceptron
  - Regression: squared loss is commonly used
  # squared loss = the square of the difference between the target and y
  - Classification: cross-entropy is usually used
  - Cost function: the average of loss function values
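
The loss and cost definitions above can be sketched as follows. The cross-entropy shown is the binary form, which is an assumption since the notes do not say which variant was used; the function names and the toy targets and outputs are also illustrative.

```python
import math

def squared_loss(t, y):
    """Regression loss: squared difference between target t and output y."""
    return (t - y) ** 2

def cross_entropy(t, y):
    """Binary classification loss; t is 0 or 1, y is a probability in (0, 1)."""
    return -(t * math.log(y) + (1 - t) * math.log(1 - y))

def cost(loss, targets, outputs):
    """Cost: the average of the per-sample loss values."""
    return sum(loss(t, y) for t, y in zip(targets, outputs)) / len(targets)

reg_cost = cost(squared_loss, [1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
clf_cost = cost(cross_entropy, [1, 0], [0.9, 0.2])
print(reg_cost, clf_cost)
```
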
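Gradient Descent
- Check whether the gradient is 0
  - Yes: the current weights are optimal, so learning stops
  - No: the current weights are not optimal, so learning continues
- How are the weights improved?
  - Move in the direction opposite to the gradient until the loss reaches its minimum (where the gradient is 0)
- How far should the weights move?
  - A little at a time until they fit; the right step size is ambiguous, and it is what the learning rate controls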
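
The check-gradient, move, repeat loop can be sketched on a one-weight model. This is a minimal example assuming the model y = w * x with mean squared error as the cost; the function name `fit_weight`, the tolerance, the learning rate, and the toy data are all choices made for illustration.

```python
def fit_weight(xs, ts, learning_rate=0.05, tol=1e-8, max_steps=10_000):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0
    for step in range(max_steps):
        # Derivative of mean((t - w*x)**2) with respect to w
        grad = sum(-2 * x * (t - w * x) for x, t in zip(xs, ts)) / len(xs)
        if abs(grad) < tol:        # gradient is (almost) 0: stop learning
            break
        w -= learning_rate * grad  # move against the gradient
    return w, step

xs = [1.0, 2.0, 3.0, 4.0]
ts = [2.0, 4.0, 6.0, 8.0]   # targets generated with w = 2
w, steps = fit_weight(xs, ts)
print(w, steps)
```

In floating point the gradient rarely becomes exactly 0, so the sketch stops once it falls below a small tolerance instead.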

P.S. I apologize to the original authors for not citing the sources of the reference materials.
