24.05.27 알고리즘 개념(Decision Tree)

Algorithm 종류

Decision Tree
- Classfication/ Regression 모두 가능
- 최대 장점: 미리 예측할 수 있음 (초반에 많이 사용) = set of rules(일련의 규칙)을 알려줌
  e.g. 알파고의 경우 "여기에 둬라"라는 답만 줄뿐, "왜" 인지에 대해서는 답을 못주지만, Decision Tree의 경우 이유를 알려줌(규칙 제공)

Classification Tree

Classify or predict an outcome based on a set of predictors
The output is a set of rules
Terminology

CART(classification and regression tree --> Decision tree의 종류)
- Advatages
  - Normalization, missing value 등의 data가 있어도 어느정도 돌아감 (한마디로 대충해도 어느정도는 된다)
  - Numerical and cetegorical data를 모두 다룰 수 있음
- Disadvantages
  - 한번에 변수 하나씩
Recursive Partitioning
- Partition the data in a parent node into two child nodes using a certain value of a certain variable
(data를 만족스러운 결과가 나올때까지 child node로 부터 반복적으로 구분하겠다)
- Select the split point to maximize the purity of the child nodes (분할하는 기준은 child node의 purity가 줄어드는 것)
- Gini-index(for cetegorical variable) and the variance (for numerical variable) are used to measure the impurity of a
node
- Measuring Impurity I: Gini Index
  - Gini index for rectangle A containing m class

- 한 영역이 전부 같은 범주이면 = 0
- 각각 범주가 균등하게 퍼저있으면 = 0.5
- 따라서 0.47이면 균등하게 퍼져있다고 판단 됨(불순도가 가장 높은 수준)

위의 parent node를 child node 두개로 나눴을 때의 Gini Index 값 구하기

- 각각의 영역에 대해서 Gini Index를 따로 계산한 다음, 각각의 공간에 대한 가중치를 곱해 총 I를 구했을 때
Information gain이 0.13 발생 (즉, purity가 좋아졌다)

Pruning
- leaf node의 purity가 1이 됐을 때 (full tree)는 학습하지 말하야할 noise 값 까지 학습 된 것
(y = f(x)까지가 아닌 y = f(x) + b --> 다른 data에서는 맞아떨어지지 않을 수 있음)
- 이를 방지하기 위해서 가지치기를 하다가 어느 선에서 끊어줌

- 끊어주는 기준 = cost complexity

Cost complexity

- CC(T) = cost complexity of a tree
- ERR(T) = proportion of misclassified records in the validation data (오분류된 기록의 비율 = validation errer)
- Alpha = penalty factor attached to the tree size (set by the user)
- L(T) = number of leaf nodes
- Alpha = 가중치 (보정값)
# 즉, 1) nodes 수가 많더라도 validation error 가 낮은 것을 선택. 2) validation error 가 같다면 nodes 수 적은 것을
선택하겠다.

Regression Tree

Output of a leaf node

Prediction of node

Impurity
- 최초 x1 평균이 15였을 것, y1 평균 = 10, y2 평균 =20
- 각각의 data와 평균의 변수의 합
# 즉, 평균에서 얼마나 떨어져있는가로 regression tree에서는 impurity를 계산함
# Information gain = 250
Information Gain

즉 parent node의 impurity index - 각각의 child node의 impurity index를 빼준 값

'TIL(2024y) > Machine learning' 카테고리의 다른 글

24.06.10 Review (Summary/ Linear Regression with Pytorch) (0)	2024.06.11
24.05.28 Ensemble model (Bagging/ Random tree) (0)	2024.05.28
24.05.24 통계 / scaling / data split (0)	2024.05.26
24.05.23 데이터 시각화 (0)	2024.05.23
24.05.22 머신러닝 개념 및 이상치/결측치 해결 (0)	2024.05.22

내 블로그 - 관리자 홈 전환	`Q` `Q`
새 글 쓰기	`W` `W`

글 수정 (권한 있는 경우)	`E` `E`
댓글 영역으로 이동	`C` `C`

이 페이지의 URL 복사	`S` `S`
맨 위로 이동	`T` `T`
티스토리 홈 이동	`H` `H`
단축키 안내	`Shift` + `/` `⇧` + `/`

Happy Life

24.05.27 알고리즘 개념(Decision Tree)

Algorithm 종류

Classification Tree

Regression Tree

'TIL(2024y) > Machine learning' 카테고리의 다른 글

티스토리툴바

단축키

내 블로그

블로그 게시글

모든 영역

24.05.27 알고리즘 개념(Decision Tree)

Algorithm 종류

Classification Tree

Regression Tree

'TIL(2024y) > Machine learning' 카테고리의 다른 글

'TIL(2024y)/Machine learning' Related Articles

티스토리툴바

단축키

내 블로그

블로그 게시글

모든 영역