The Kaggle ICR competition is currently underway. It is a classic tabular data-mining competition and a good entry point for beginners. This article walks through a baseline solution for ICR.
- Competition name: ICR - Identifying Age-Related Conditions
- Task type: data mining (binary classification)
- https://www.kaggle.com/competitions/icr-identify-age-related-conditions
Competition Task
The competition data comprise over fifty anonymized health characteristics linked to three age-related conditions. Your goal is to predict whether a subject has been diagnosed with one of these conditions, which makes this a binary classification problem.
Step 1: Read the Dataset
- train.csv: training set
- test.csv: test set
- greeks.csv: supplementary metadata for the training set
import pandas as pd

COMP_PATH = "/kaggle/input/icr-identify-age-related-conditions"
train = pd.read_csv(f"{COMP_PATH}/train.csv")
test = pd.read_csv(f"{COMP_PATH}/test.csv")
sample_submission = pd.read_csv(f"{COMP_PATH}/sample_submission.csv")
greeks = pd.read_csv(f"{COMP_PATH}/greeks.csv")
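The greeks metadata align with the training rows via the Id column, and its Alpha column is used for stratification later. A minimal sketch of joining the two frames, using small synthetic stand-ins for the competition files:

```python
import pandas as pd

# Synthetic stand-ins for train.csv and greeks.csv (illustrative only)
train = pd.DataFrame({"Id": ["a", "b", "c"], "AB": [0.2, 0.5, 0.9], "Class": [0, 1, 0]})
greeks = pd.DataFrame({"Id": ["a", "b", "c"], "Alpha": ["A", "B", "A"]})

# Join the metadata onto the training rows by Id so that
# greeks['Alpha'] can be used for a stratified split later
merged = train.merge(greeks, on="Id", how="left")
print(merged[["Id", "Class", "Alpha"]])
```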
Step 2: Define a Custom Evaluation Metric
The competition is scored with balanced log loss. To stay consistent with the leaderboard, we can define the metric ourselves; defining a matching custom objective function is also an option.
import numpy as np

def competition_log_loss(y_true, y_pred):
    N_0 = np.sum(1 - y_true)
    N_1 = np.sum(y_true)
    p_1 = np.clip(y_pred, 1e-15, 1 - 1e-15)
    p_0 = 1 - p_1
    log_loss_0 = -np.sum((1 - y_true) * np.log(p_0)) / N_0
    log_loss_1 = -np.sum(y_true * np.log(p_1)) / N_1
    return (log_loss_0 + log_loss_1) / 2
def balanced_log_loss(y_true, y_pred):
    N_0 = np.sum(1 - y_true)
    N_1 = np.sum(y_true)
    p_1 = np.clip(y_pred, 1e-15, 1 - 1e-15)
    p_0 = 1 - p_1
    log_loss_0 = -np.sum((1 - y_true) * np.log(p_0))
    log_loss_1 = -np.sum(y_true * np.log(p_1))
    w_0 = 1 / N_0
    w_1 = 1 / N_1
    balanced_log_loss = 2 * (w_0 * log_loss_0 + w_1 * log_loss_1) / (w_0 + w_1)
    return balanced_log_loss / (N_0 + N_1)
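As a quick sanity check, the metric can be exercised on a toy example (the function is restated so the snippet runs standalone; the input values are illustrative):

```python
import numpy as np

def competition_log_loss(y_true, y_pred):
    # Restated from above so this snippet is self-contained
    N_0 = np.sum(1 - y_true)
    N_1 = np.sum(y_true)
    p_1 = np.clip(y_pred, 1e-15, 1 - 1e-15)
    p_0 = 1 - p_1
    log_loss_0 = -np.sum((1 - y_true) * np.log(p_0)) / N_0
    log_loss_1 = -np.sum(y_true * np.log(p_1)) / N_1
    return (log_loss_0 + log_loss_1) / 2

y_true = np.array([0, 0, 0, 1])
y_pred = np.array([0.1, 0.2, 0.3, 0.8])
# Each class contributes equally despite the 3:1 imbalance
print(round(competition_log_loss(y_true, y_pred), 4))  # 0.2258
```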
Step 3: Define the Data Split
Because the class distribution in the dataset is imbalanced, it is advisable to stratify the validation split on the original metadata (greeks) or on the competition label.
from sklearn.model_selection import StratifiedKFold

df = train.copy()  # assign folds on the training set
kf = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)
df['fold'] = -1
for fold, (train_idx, test_idx) in enumerate(kf.split(df, greeks['Alpha'])):
    df.loc[test_idx, 'fold'] = fold

df.groupby('fold')["Class"].value_counts()
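A small self-contained illustration of why stratification matters: with StratifiedKFold, every fold preserves the label proportions of the full dataset (synthetic labels below stand in for greeks['Alpha']):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced labels standing in for greeks['Alpha']
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((100, 1))

skf = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)
for fold, (_, test_idx) in enumerate(skf.split(X, y)):
    # Every fold keeps the 80/20 ratio of the full dataset
    print(fold, np.bincount(y[test_idx]))  # each fold prints [16  4]
```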
Step 4: Model Training and Validation
Since this is a typical tabular data-mining competition, LightGBM is a natural choice. During training we can tune hyperparameters and add early stopping.
We also record each fold's validation score and use it to weight that fold's model in the final blend, which is itself a simple form of ensembling.
from lightgbm import LGBMClassifier

weights = []
for fold in range(5):
    train_df = df[df['fold'] != fold]
    valid_df = df[df['fold'] == fold]
    valid_ids = valid_df.Id.values.tolist()

    X_train, y_train = train_df.drop(['Id', 'Class', 'fold'], axis=1), train_df['Class']
    X_valid, y_valid = valid_df.drop(['Id', 'Class', 'fold'], axis=1), valid_df['Class']

    # Train and validate with LightGBM
    lgb = LGBMClassifier(boosting_type='goss', learning_rate=0.06733232950390658,
                         n_estimators=50000, early_stopping_round=300, random_state=42,
                         subsample=0.6970532011679706, colsample_bytree=0.6055755840633003,
                         class_weight='balanced', metric='none', is_unbalance=True, max_depth=8)
    # Note: with metric='none', pass a custom eval_metric wrapping
    # balanced_log_loss so early stopping has something to monitor
    lgb.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])

    # Score this fold with the competition metric
    y_pred = lgb.predict_proba(X_valid)[:, 1]
    balanced_logloss = balanced_log_loss(y_valid, y_pred)

    # Store each fold's weight: lower-loss folds get larger weights
    weights.append(1 / balanced_logloss)
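The recorded weights can then blend the five fold models' test predictions into one weighted average. A sketch under the assumption that test_preds_per_fold (a hypothetical name, not from the original code) holds each fold's predicted probabilities:

```python
import numpy as np

# Hypothetical per-fold test predictions (rows: folds, columns: test samples)
test_preds_per_fold = np.array([
    [0.10, 0.80],
    [0.20, 0.70],
    [0.15, 0.90],
])
# Hypothetical per-fold weights, e.g. 1 / balanced_logloss of each fold
weights = np.array([2.0, 1.0, 1.0])

# Normalize the weights and blend: better-scoring folds contribute more
w = weights / weights.sum()
test_preds = np.average(test_preds_per_fold, axis=0, weights=w)
print(test_preds)  # fold-weighted probabilities: [0.1375, 0.8]
```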
Step 5: Model Prediction
# final_valid_predictions (dict of Id -> out-of-fold probabilities) and
# test_preds (blended test probabilities) are accumulated during training
final_valid_predictions = pd.DataFrame.from_dict(final_valid_predictions, orient="index").reset_index()
final_valid_predictions.columns = ['Id', 'class_0', 'class_1']
final_valid_predictions.to_csv(r"oof.csv", index=False)
test_dict = {}
test_dict.update(dict(zip(test.Id.values.tolist(), test_preds)))
submission = pd.DataFrame.from_dict(test_dict, orient="index").reset_index()
submission.columns = ['Id', 'class_0', 'class_1']
submission.to_csv(r"submission.csv", index=False)
submission
Code: https://www.kaggle.com/code/chaitanyagiri/icr-2023-single-lgbm-0-12-cv-0-16-lb