[Lab Study Assignment 2] Presentation Prep

st오리🐥 2025. 1. 11. 10:31

5-minute presentation in total

2 minutes: brief summary of the theory covered
- Decision tree classifier
- Decision tree regressor
- Random forest
- GridSearchCV (sklearn)

3 minutes: code review that follows the order and flow of the theory

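1️⃣ Decision Tree Classifier

• A Decision Tree Classifier is a model that classifies data by splitting it hierarchically.

• Goal: split the given data recursively so that impurity is minimized, choosing the best split criterion (feature and threshold) at each split.

• Impurity is measured with entropy or the Gini index.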
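
The two impurity measures can be sketched in a few lines. This is a minimal illustration rather than the code used in the study session; the function names `entropy`, `gini`, and `information_gain` are chosen here for the example.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def information_gain(parent, left, right, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the two children."""
    w_left = len(left) / len(parent)
    w_right = len(right) / len(parent)
    return impurity(parent) - (w_left * impurity(left) + w_right * impurity(right))

# Toy labels: two samples of class 0 and two of class 1
parent = np.array([0, 0, 1, 1])
print(entropy(parent))
print(gini(parent))
print(information_gain(parent, parent[:2], parent[2:]))
```

For a perfectly balanced two-class node the entropy is 1 bit and the Gini index is 0.5; a split that separates the classes completely recovers the full parent impurity as gain.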

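2️⃣ Decision Tree Regressor

• A Decision Tree Regressor is used to predict continuous values. It splits the data recursively in the same way as the classifier, and each leaf node predicts the mean target value of the training samples that fall into that leaf.

• Impurity is measured by variance, and the data is split in the direction that minimizes it (that is, the split with the largest variance reduction is chosen).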
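
As a rough sketch of the variance criterion, the helper below scores a candidate split by how much it reduces the size-weighted variance of the targets. The names `variance_reduction` and `leaf_value` are hypothetical and only serve this example.

```python
import numpy as np

def variance_reduction(parent, left, right):
    """Parent variance minus the size-weighted variance of the two children."""
    w_left = len(left) / len(parent)
    w_right = len(right) / len(parent)
    return float(np.var(parent) - (w_left * np.var(left) + w_right * np.var(right)))

def leaf_value(targets):
    """A leaf predicts the mean target of the samples it contains."""
    return float(np.mean(targets))

# Toy targets: two low values followed by two high values
y = np.array([1.0, 1.0, 5.0, 5.0])

good = variance_reduction(y, y[:2], y[2:])          # separates low from high
bad = variance_reduction(y, y[[0, 2]], y[[1, 3]])   # mixes low and high

print(good, bad)
print(leaf_value(y[:2]), leaf_value(y[2:]))
```

The split that groups similar targets together removes all of the variance, while the mixed split removes none, so the tree would prefer the first one.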

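3️⃣ Random Forest

• Random Forest is an ensemble technique that combines multiple decision trees to improve predictive performance.

• Each tree is trained on a random sample of the data using random subsets of the features, and the final prediction combines the trees' outputs: by voting for classification, by averaging for regression.

• Because the randomness makes the trees less correlated with one another, combining them reduces overfitting and gives more stable predictions than a single tree.

[Characteristics]
1) Bagging (Bootstrap Aggregating): each tree is trained on a bootstrap sample of the training data (data drawn at random with replacement).

2) Random Feature Selection: when splitting a node, only a random subset of the features is considered instead of all of them, which reduces the correlation between trees and strengthens the ensemble effect.

3) Prediction aggregation: for classification the trees' predictions are combined by majority voting; for regression the predicted values are averaged.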
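
Random feature selection can be illustrated as follows. This is a sketch under assumptions: the helper `random_feature_subset` is hypothetical, and taking the square root of the feature count is just one common heuristic for classification, not something the post specifies.

```python
import numpy as np

def random_feature_subset(n_total_features, n_features, rng):
    """Pick n_features distinct column indices to consider at one node split."""
    return rng.choice(n_total_features, size=n_features, replace=False)

rng = np.random.default_rng(0)

n_total = 4                     # e.g. the four iris measurements
n_sub = int(np.sqrt(n_total))   # assumed heuristic: sqrt of the feature count

# Each node split draws its own subset, so different trees (and different
# nodes within a tree) end up looking at different features.
subsets = [random_feature_subset(n_total, n_sub, rng) for _ in range(3)]
for i, cols in enumerate(subsets):
    print(f"split {i}: candidate feature indices {sorted(cols.tolist())}")
```

Sampling without replacement matters here: a split should never evaluate the same feature twice within one subset.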

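# Code #

1) Class definition: defines the number of trees to build, the maximum depth of each tree, the minimum number of samples required to split a node, the number of random features to consider at each node split, and the list of individual decision trees that make up the random forest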

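import numpy as np                 # used by the sampling and prediction methods
from collections import Counter    # used for majority voting

class RandomForest:
    def __init__(self, n_trees=10, max_depth=10, min_samples_split=2, n_features=None):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.n_features = n_features  # number of random features per split
        self.trees = []               # fitted trees are stored here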

2) Training -> fit method
* Store each tree in the self.trees list
* Draw a random sample of the data via bootstrap sampling (_bootstrap_samples)
* Train a new decision tree on each bootstrap sample
* Append the trained tree to self.trees

def fit(self, X, y):
    self.trees = []  # reset so that repeated calls to fit start fresh
    for _ in range(self.n_trees):
        # DecisionTree is the single-tree class, defined separately (not shown here)
        tree = DecisionTree(max_depth=self.max_depth,
                            min_samples_split=self.min_samples_split,
                            n_features=self.n_features)
        X_sample, y_sample = self._bootstrap_samples(X, y)
        tree.fit(X_sample, y_sample)
        self.trees.append(tree)

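3) Bootstrap sampling: draw rows at random with replacement from the input data X and the labels y to create a new data sample

def _bootstrap_samples(self, X, y):
    n_samples = X.shape[0]
    # sample n_samples row indices with replacement
    idxs = np.random.choice(n_samples, n_samples, replace=True)
    return X[idxs], y[idxs]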
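
To see what sampling with replacement does in practice, the sketch below counts how many distinct rows end up in one bootstrap sample. It uses `np.random.default_rng` with a fixed seed for reproducibility, which is a choice made for this example rather than what the method above uses.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000

# Same idea as _bootstrap_samples: n_samples indices drawn with replacement
idxs = rng.choice(n_samples, size=n_samples, replace=True)

# Rows that were drawn at least once ("in-bag")
in_bag = np.unique(idxs)
in_bag_fraction = in_bag.size / n_samples

# Rows never drawn ("out-of-bag"); the tree never sees these during training
oob_mask = np.ones(n_samples, dtype=bool)
oob_mask[in_bag] = False

print(f"in-bag fraction: {in_bag_fraction:.3f}")
print(f"out-of-bag rows: {int(oob_mask.sum())}")
```

For large n the expected in-bag fraction approaches 1 - 1/e, roughly 63%, so each tree trains on a noticeably different slice of the data, which is where the diversity between trees comes from.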

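4) Prediction:

* Run the prediction for every tree (tree.predict(X))
* Collect the predictions for each sample and derive the final prediction by majority vote
* Determine the most frequent value for each sample via self._most_common_label(pred)

def predict(self, X):
    # shape (n_trees, n_samples): one row of predictions per tree
    predictions = np.array([tree.predict(X) for tree in self.trees])
    # shape (n_samples, n_trees): one row of votes per sample
    tree_preds = np.swapaxes(predictions, 0, 1)
    predictions = np.array([self._most_common_label(pred) for pred in tree_preds])
    return predictions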

5) Majority voting: returns the most frequent value among the predicted labels, which implements the prediction aggregation step of the random forest

def _most_common_label(self, y):
    counter = Counter(y)
    most_common = counter.most_common(1)[0][0]
    return most_common
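
A small worked example may make the axis swap and the vote easier to follow. The prediction values below are made up for illustration, and the averaging at the end shows how the regression case described earlier could be aggregated (the class above only implements voting).

```python
import numpy as np
from collections import Counter

def most_common_label(y):
    """Standalone version of _most_common_label for this example."""
    return Counter(y).most_common(1)[0][0]

# Made-up predictions from 3 trees for 3 samples: shape (n_trees, n_samples)
per_tree = np.array([
    [0, 1, 1],   # tree 1
    [0, 1, 0],   # tree 2
    [1, 1, 0],   # tree 3
])

# Swap axes so each row holds all votes for one sample: (n_samples, n_trees)
per_sample = np.swapaxes(per_tree, 0, 1)

final = np.array([most_common_label(votes) for votes in per_sample])
print(final)  # [0 1 0]

# Regression variant: average the trees' outputs per sample instead of voting
per_tree_reg = np.array([[1.0, 2.0], [3.0, 4.0]])
print(per_tree_reg.mean(axis=0))  # [2. 3.]
```

With an odd number of trees and two classes a tie cannot occur; otherwise `Counter.most_common` resolves ties by first occurrence.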

4️⃣ GridSearchCV (sklearn)

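GridSearchCV is used to optimize the hyperparameters of a machine learning model: it evaluates every combination in a given parameter grid with cross-validation and selects the combination with the best score.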
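
A minimal usage sketch is shown here. It makes several assumptions: it uses scikit-learn's own `DecisionTreeClassifier` (not the hand-written class in this post), loads iris via `load_iris` instead of a local CSV file, and the grid values are picked arbitrarily for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate values to try; 3 x 3 = 9 combinations in total (illustrative only)
param_grid = {
    "max_depth": [2, 3, 4],
    "min_samples_split": [2, 3, 5],
}

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid=param_grid,
    cv=5,                 # 5-fold cross-validation for every combination
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)            # best combination found on the grid
print(round(search.best_score_, 3))   # its mean cross-validated accuracy
```

After fitting, `search.best_estimator_` holds a model refit on all of the data with the best combination, since `refit=True` is the default.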

1. First, as a step common to both models, load the data and then separate it into the input data X and the target data Y.

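import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the iris data; skip the file's own header row and use explicit column names
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'type']
data = pd.read_csv("iris.csv", skiprows=1, header=None, names=col_names)

# Features are all columns but the last; the target is the last column,
# reshaped to (n_samples, 1) so it can be concatenated with X in fit()
X = data.iloc[:, :-1].values
Y = data.iloc[:, -1].values.reshape(-1, 1)

# Hold out 20% of the data for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=41)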

2. The Decision Tree Classifier is the model used for classification.

class Node():
    def __init__(self, feature_index=None, threshold=None, left=None, right=None, info_gain=None, value=None):
        self.feature_index = feature_index
        self.threshold = threshold
        self.left = left
        self.right = right
        self.info_gain = info_gain
        self.value = value

class DecisionTreeClassifier():
    def __init__(self, min_samples_split=3, max_depth=3):
        self.root = None
        self.min_samples_split = min_samples_split
        self.max_depth = max_depth

    def build_tree(self, dataset, curr_depth=0):
        # Compute the best split criterion
        pass

    def fit(self, X, Y):
        dataset = np.concatenate((X, Y), axis=1)
        self.root = self.build_tree(dataset)

    def predict(self, X):
        predictions = [self.make_prediction(x, self.root) for x in X]
        return predictions
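
The outline above calls `make_prediction` without showing it. One way such a traversal could look is sketched below; it assumes that leaf nodes are the ones with `value` set and that samples with a feature value at or below the threshold go left. The tree and the labels "A" and "B" are hand-built placeholders, not results from the post.

```python
class Node:
    def __init__(self, feature_index=None, threshold=None, left=None,
                 right=None, info_gain=None, value=None):
        self.feature_index = feature_index
        self.threshold = threshold
        self.left = left
        self.right = right
        self.info_gain = info_gain
        self.value = value  # set only on leaf nodes (assumption)

def make_prediction(x, node):
    """Walk from the given node down to a leaf and return the leaf's value."""
    if node.value is not None:           # reached a leaf
        return node.value
    if x[node.feature_index] <= node.threshold:
        return make_prediction(x, node.left)
    return make_prediction(x, node.right)

# Hand-built toy tree with a single split on feature index 2
root = Node(feature_index=2, threshold=2.5,
            left=Node(value="A"), right=Node(value="B"))

print(make_prediction([5.1, 3.5, 1.4, 0.2], root))  # A
print(make_prediction([6.0, 3.0, 4.5, 1.5], root))  # B
```

Inside the class this would be a method taking `self`, matching the call `self.make_prediction(x, self.root)` in `predict`.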

3. The Decision Tree Regressor predicts continuous values and splits the data using variance reduction.

class DecisionTreeRegressor():
    def __init__(self, min_samples_split=3, max_depth=3):
        self.root = None
        self.min_samples_split = min_samples_split
        self.max_depth = max_depth

    def build_tree(self, dataset, curr_depth=0):
        # Split the data based on variance
        pass

    def fit(self, X, Y):
        dataset = np.concatenate((X, Y), axis=1)
        self.root = self.build_tree(dataset)

    def predict(self, X):
        predictions = [self.make_prediction(x, self.root) for x in X]
        return predictions
