XGBoost Custom Objective and Evaluation Metric

Machine Learning & Deep Learning 2020. 6. 9. 00:54

First, let me briefly go over what XGBoost is and what role the objective function plays in training a model.

XGBoost stands for eXtreme Gradient Boosting. Like AdaBoost and Gradient Tree Boosting, it is a boosting algorithm.

Boosting, along with bagging, is one of the ensemble techniques.

Bagging trains its models in parallel and takes each model's output into account in the final decision,

whereas boosting trains models sequentially, each one taking the previous model's results into account. In other words, the models depend on one another.
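
To make the sequential dependence concrete, here is a minimal toy sketch of boosting on one-dimensional data. It is not how XGBoost is implemented; the helper names fit_stump and boost, the learning rate, and the data are all assumptions made for illustration. Each round fits a decision stump to the residuals left by the previous rounds, so a learner cannot be trained before the ones that precede it.

```python
import numpy as np

def fit_stump(x, residual):
    """Fit a one-split regression stump to the residuals (toy weak learner)."""
    best = None
    # Try every threshold except the largest x, so both sides are non-empty.
    for t in np.unique(x)[:-1]:
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return lambda q: np.where(q <= t, left_value, right_value)

def boost(x, y, rounds=20, lr=0.5):
    """Sequentially add stumps, each trained on what the previous ones got wrong."""
    pred = np.full_like(y, y.mean(), dtype=float)
    mse_history = []
    for _ in range(rounds):
        stump = fit_stump(x, y - pred)   # depends on all earlier rounds via pred
        pred = pred + lr * stump(x)
        mse_history.append(np.mean((y - pred) ** 2))
    return pred, mse_history

# Made-up data: one period of a sine wave.
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x)
pred, mse_history = boost(x, y)
```

In bagging, by contrast, the loop body would not read pred at all, so the rounds could run in any order or in parallel.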

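* An ensemble is a machine learning technique that combines several weak learners to improve performance.

The mathematical motivation for ensembling (bias and variance) and the various related techniques will be covered in a later post.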

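XGBoost can be applied to both classification and regression problems.

The 'booster' setting, one of XGBoost's (general) parameters, selects the type of base learner the model is built from (the default is gbtree). Whether the model performs classification or regression is decided separately, by the objective.

(For the other parameters, see https://xgboost.readthedocs.io/en/latest/parameter.html)

There are three kinds of booster, listed below.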

booster     base learner                  notes
gbtree      decision-tree-based models    -
gblinear    linear functions              -
dart        decision-tree-based models    gbtree + dropout (a technique for preventing overfitting)

According to https://www.statworx.com/at/blog/xgboost-tree-vs-linear/,

gbtree and gblinear performed similarly on classification problems, while gbtree did better on regression problems.

However, this result comes from one particular simulation, so it is hard to generalize from it.

In Python, XGBoost is usually used through either the Learning API or the Scikit-Learn API (a.k.a. the Scikit-Learn Wrapper interface).

(The Python package exposes other interfaces as well, such as the core data structures. See the link below for details.)

https://xgboost.readthedocs.io/en/latest/python/python_api.html

Example code follows.

import xgboost as xgb

# 1. Using the Learning API
# Call xgb.train(). params (dict) and dtrain (xgb.DMatrix) are assumed to be defined already.
model = xgb.train(params, dtrain, num_boost_round=10)

# 2. Using the Scikit-Learn API
# Declare a model, then call model.fit() to train it.

# for classification
model = xgb.XGBClassifier()

# for regression
model = xgb.XGBRegressor()

model.fit(train_X, train_y)

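When a machine learning model is trained, training moves in the direction that reduces the value of the objective function (which measures the error).

That is, the parameters are adjusted so that the objective function is minimized.

Within XGBoost, an objective function takes the model's predictions and the true labels as input and returns the gradient and the hessian.

(~~I'll cover this part in another post as well. Once I try to explain it in detail, there is no end to it.~~)

So customizing the objective function means changing the gradient and hessian to suit our purpose.

(Since the hessian is the derivative of the gradient, strictly speaking we are changing the form of the underlying function itself to suit our purpose.)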

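An evaluation function does not affect how the model is fit; it is the function used to assess the model's performance. (The one exception is early stopping, where the evaluation metric decides when training ends.)

Commonly used examples include recall, specificity, precision, f1-score, AUC, MSE, MAE, R², and Adjusted R².

In XGBoost, an evaluation function takes the model's predictions and the true labels as input, just like an objective function, and returns two things: the metric name and the computed value.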

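Here is how to do it.

In the Learning API, use the obj and feval arguments of the train method;

in the Scikit-Learn API, use objective, which is set when the model is declared, and eval_metric of the fit method, which is used when training the model.

Mapping our own objective function and evaluation function to these arguments is all the customization requires.

(These argument names are the ones in use when this post was written. Newer XGBoost releases deprecate feval in favor of custom_metric and expect eval_metric in the model constructor rather than in fit, so check the documentation for the installed version.)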

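Let's look at this in detail with example code.

The usual objective function is MSE, the sum of squared errors divided by the number of data points.

Here we will skip the division by the number of data points and use the fourth power of the error as the objective function.

objective function = (predicted value - actual value)^4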
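
Because XGBoost only ever sees the gradient and hessian, it helps to work out the derivatives before writing the objective. For this loss, the first derivative with respect to the prediction is 4(predicted value - actual value)^3 and the second derivative is 12(predicted value - actual value)^2. The sketch below compares those formulas with finite differences; the function names and sample values are made up for illustration.

```python
import numpy as np

def loss(preds, labels):
    """Per-sample quartic loss."""
    return (preds - labels) ** 4

def grad_hess(preds, labels):
    """Analytic first and second derivatives of the quartic loss w.r.t. preds."""
    err = preds - labels
    return 4 * err**3, 12 * err**2

preds = np.array([0.5, 2.0, -1.0])
labels = np.array([1.0, 1.0, 1.0])
eps = 1e-4

# Central finite differences for the first and second derivative.
num_grad = (loss(preds + eps, labels) - loss(preds - eps, labels)) / (2 * eps)
num_hess = (loss(preds + eps, labels) - 2 * loss(preds, labels)
            + loss(preds - eps, labels)) / eps**2

grad, hess = grad_hess(preds, labels)
print(np.allclose(num_grad, grad, rtol=1e-4, atol=1e-6))
print(np.allclose(num_hess, hess, rtol=1e-3, atol=1e-4))
```

The same check works for any other custom loss: swap in the new loss and derivative formulas and confirm that the two agree.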

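* Note that when customizing the objective this way, the error must be raised to an even power.

With an odd power, the gradient loses its sign (it is never negative whether the prediction is too high or too low), and the model does not learn.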
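
A small numerical illustration of this point, using made-up values: for a cubic loss the gradient 3(predicted value - actual value)^2 is the same for a prediction that is too low and one that is too high, while the quartic gradient changes sign with the error.

```python
import numpy as np

labels = np.array([1.0, 1.0, 1.0])
preds = np.array([0.0, 1.0, 2.0])   # below, equal to, and above the label
err = preds - labels

# Odd power: loss = err**3, so grad = 3 * err**2, which carries no direction.
grad_cubic = 3 * err**2

# Even power: loss = err**4, so grad = 4 * err**3, which keeps the sign of err.
grad_quartic = 4 * err**3

print(grad_cubic)
print(grad_quartic)
```

With the cubic gradient, every sample tells the booster to move in the same direction, so predictions that are already too low keep being pushed lower.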

For the evaluation metric that measures performance, we will use the mean absolute error instead of MSE.

evaluation function = mean of |predicted value - actual value|

For the Learning API:

# 1 : Learning API
# Both the objective function and the evaluation function take preds (np.ndarray) and dtrain (xgb.DMatrix) as parameters.
import numpy as np
import xgboost as xgb

def custom_objective(preds, dtrain):
    labels = dtrain.get_label()
    # first and second derivatives of (preds - labels)**4 with respect to preds
    grad = 4*(preds - labels)**3
    hess = 12*(preds - labels)**2
    return grad, hess

def custom_evaluation(preds, dtrain):
    labels = dtrain.get_label()
    # returns (metric name, mean absolute error)
    return 'custom_error', np.absolute(np.subtract(preds, labels)).mean()

# Example of building DMatrix objects
dtrain=xgb.DMatrix(data=train_X, label=train_y)

dval=xgb.DMatrix(data=val_X, label=val_y)

dtest=xgb.DMatrix(data=test_X, label=test_y)

watchlist=[(dtrain, "train"), (dval, "val")]

# model training
# note: with num_boost_round=10, early_stopping_rounds=40 can never trigger; raise num_boost_round if early stopping is wanted
model = xgb.train(params=params, dtrain=dtrain, evals=watchlist, early_stopping_rounds=40, num_boost_round=10, obj=custom_objective, feval=custom_evaluation)
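
Before starting a training run, the two functions can be sanity-checked by hand. The sketch below does this without XGBoost by using a hypothetical stand-in class, FakeDMatrix, that only implements get_label(); it is not part of the library, and the sample values are made up.

```python
import numpy as np

class FakeDMatrix:
    """Hypothetical stand-in for xgb.DMatrix; only get_label() is provided."""
    def __init__(self, label):
        self._label = np.asarray(label, dtype=float)

    def get_label(self):
        return self._label

def custom_objective(preds, dtrain):
    labels = dtrain.get_label()
    grad = 4 * (preds - labels) ** 3
    hess = 12 * (preds - labels) ** 2
    return grad, hess

def custom_evaluation(preds, dtrain):
    labels = dtrain.get_label()
    return 'custom_error', np.absolute(np.subtract(preds, labels)).mean()

dtrain = FakeDMatrix([1.0, 2.0, 3.0])
preds = np.array([2.0, 2.0, 1.0])      # errors: 1, 0, -2

grad, hess = custom_objective(preds, dtrain)
name, value = custom_evaluation(preds, dtrain)
print(grad, hess, name, value)
```

Note that the hessian is exactly zero wherever a prediction matches its label, which is a property of this quartic loss rather than of the API.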

For the Scikit-Learn API:

# 2 : Using the Scikit-Learn API
# The objective function takes labels (np.ndarray), preds (np.ndarray);
# the evaluation function takes preds (np.ndarray), dtrain (xgb.DMatrix) as parameters.
import numpy as np
import xgboost as xgb

def custom_objective(labels, preds):
    """
    Note that the parameters of custom_objective are in the
    opposite order from the earlier (preds, labels)!
    """

    grad = 4*(preds - labels)**3
    hess = 12*(preds - labels)**2
    return grad, hess

def custom_evaluation(preds, dtrain):
    labels = dtrain.get_label()
    return 'custom_error', np.absolute(np.subtract(preds, labels)).mean()

model=xgb.XGBRegressor(objective=custom_objective)

# model training
model.fit(train_X, train_y, eval_set=[(train_X, train_y), (val_X, val_y)], eval_metric=custom_evaluation, verbose=True)
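
The argument order matters because the gradient of the quartic loss is an odd function of the error: swapping the two arguments flips its sign while leaving the hessian unchanged, and nothing raises an error. A minimal sketch with made-up values (the helper name quartic_grad_hess is hypothetical):

```python
import numpy as np

def quartic_grad_hess(labels, preds):
    """Gradient and hessian of (preds - labels)**4 with respect to preds."""
    err = preds - labels
    return 4 * err**3, 12 * err**2

labels = np.array([1.0, 2.0, 3.0])
preds = np.array([2.0, 2.0, 1.0])

# Correct order: (labels, preds)
grad_ok, hess_ok = quartic_grad_hess(labels, preds)

# Arguments accidentally swapped: (preds, labels)
grad_swapped, hess_swapped = quartic_grad_hess(preds, labels)

print(grad_ok, grad_swapped)   # same magnitudes, opposite signs
print(hess_ok, hess_swapped)   # identical
```

A swapped signature therefore fails silently: training runs, but each boosting round moves the predictions away from the labels.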