Kaggle Competition Summary: Google Sign Language Recognition


Google Isolated Sign Language Recognition

https://www.kaggle.com/c/asl-signs/

Competition type: deep learning, time series

Background

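Every day in the United States, 33 babies are born with permanent hearing loss. About 90% of them are born to hearing parents, many of whom may not know American Sign Language (ASL). Without sign language, deaf babies are at risk of language deprivation syndrome.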

PopSign is a smartphone game app that makes learning ASL fun, interactive, and accessible. Players match videos of ASL signs with bubbles containing written English words in order to pop them.

Task

The goal of this competition is to classify isolated American Sign Language (ASL) signs. You will build a TensorFlow Lite model that makes predictions on a specified dataset.


Dataset

train_landmark_files/: one file per sequence, storing the spatial positions of the hand landmarks in each frame.

train.csv: the sign labels.

Evaluation metric

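The evaluation metric is simple classification accuracy. Participants submit a TensorFlow Lite model file. The model must take one or more landmark frames as input and return a float vector (the predicted probability of each sign class) as output.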

The model must produce a prediction for a single sample in under 100 ms, and the model weight file must be smaller than 40 MB.
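To make the metric concrete, here is a minimal sketch of how accuracy could be computed from the per-class probability vectors that a model returns. The function name and the toy numbers are illustrative assumptions, not part of the competition code.

```python
import numpy as np

def classification_accuracy(probs: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of samples whose highest-probability class equals the label."""
    predicted = probs.argmax(axis=1)
    return float((predicted == labels).mean())

# Toy example: 4 samples, 3 sign classes (made-up numbers).
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
])
labels = np.array([0, 1, 2, 1])
accuracy = classification_accuracy(probs, labels)
print(accuracy)  # 0.75 (the last sample is misclassified)
```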

Winning solutions

1st place

https://www.kaggle.com/competitions/asl-signs/discussion/406684

My solution combines a 1D CNN and a Transformer, trained from scratch on all of the training data (competition data only) and submitted as a 4-seed ensemble.

I initially used PyTorch + GPU, but later switched to TensorFlow + Colab TPU (tpuv2-8) to ensure compatibility with TensorFlow Lite.

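When the correlation between frames is strong, a 1D CNN is more effective than a Transformer. In my experiments, a pure 1D CNN easily outperformed a Transformer, and I eventually reached a public LB score of 0.80 with a 1D CNN alone. The Transformer is still useful, however, and can be stacked on top of the 1D CNN (we can think of the 1D CNN as a kind of trainable tokenizer).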
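The "1D CNN as a trainable tokenizer" idea can be sketched in plain NumPy: a strided convolution over time turns windows of frames into tokens, and self-attention then mixes those tokens. This is only an illustration with random weights. The layer sizes, the single attention head, and the function names are assumptions, and it is not the author's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_tokens(frames, kernel, stride):
    """Strided 1D convolution over time: (T, C) frames -> (T', D) tokens."""
    k = kernel.shape[0]
    starts = range(0, frames.shape[0] - k + 1, stride)
    # Each token sees a window of k consecutive frames.
    windows = np.stack([frames[s:s + k] for s in starts])   # (T', k, C)
    tokens = np.einsum("tkc,kcd->td", windows, kernel)
    return np.maximum(tokens, 0.0)                          # ReLU

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the token axis."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)            # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Hypothetical sizes: 32 frames, 6 input channels, 8-dimensional tokens.
T, C, D = 32, 6, 8
frames = rng.normal(size=(T, C))
kernel = rng.normal(size=(5, C, D)) * 0.1
tokens = conv1d_tokens(frames, kernel, stride=2)
w_q, w_k, w_v = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
mixed = self_attention(tokens, w_q, w_k, w_v)
print(tokens.shape, mixed.shape)
```

The convolution captures local, strongly correlated motion between neighbouring frames, while attention only has to relate the resulting shorter token sequence.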

Regularization
- Drop Path (stochastic depth, p=0.2)
- High dropout rate (p=0.8)
- AWP (Adversarial Weight Perturbation, lambda = 0.2)
Augmentation
- Horizontal flip (hflip)
- Random affine (scale, shift, rotate, shear)
- Random cutout
- Random resample (0.5x ~ 1.5x of the original length)
- Random masking
- Temporal augmentation
- Spatial augmentation
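As an illustration of the random resample augmentation listed above, the sketch below stretches or squeezes a landmark sequence along the time axis by a factor drawn from 0.5x to 1.5x. Linear interpolation and the function name are assumptions; the write-up does not say how the resampling is implemented.

```python
import numpy as np

def random_resample(seq, rng, low=0.5, high=1.5):
    """Stretch or squeeze a (T, ...) landmark sequence along time by a random factor."""
    t = seq.shape[0]
    new_t = max(1, int(round(t * rng.uniform(low, high))))
    # Fractional source position for each output frame.
    pos = np.linspace(0.0, t - 1, new_t)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, t - 1)
    frac = (pos - lo).reshape(-1, *([1] * (seq.ndim - 1)))
    # Linear blend between the two neighbouring source frames.
    return (1.0 - frac) * seq[lo] + frac * seq[hi]

rng = np.random.default_rng(0)
seq = rng.normal(size=(20, 5, 2))   # 20 frames, 5 landmarks, (x, y)
out = random_resample(seq, rng)
print(seq.shape, "->", out.shape)
```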

2nd place

https://www.kaggle.com/competitions/asl-signs/discussion/406306

We used an approach similar to audio spectrogram classification with an EfficientNet-B0 model, with heavy augmentation, and used Transformer models (BERT and DeBERTa) as auxiliary models.

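The final solution consists of one EfficientNet-B0 with an input size of 160x80, trained on a single fold out of 8 randomly split folds, plus DeBERTa and BERT models trained on the full dataset. The single-fold EfficientNet model reached a CV score of 0.898 and a leaderboard score of about 0.8.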

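CNN preprocessing
- Extracted 18 lip points, 20 pose points (including arms, shoulders, eyebrows, and nose), and all hand points, for a total of 80 points.
- Applied various augmentations and standard normalization.
- Did not drop NaN values; instead, filled them with zeros after normalization.
- Interpolated the time axis using "nearest" interpolation to […]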
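The normalization, NaN filling, and "nearest" time interpolation steps can be sketched as follows. The per-sequence global statistics, the function names, and the `target_len` value are all assumptions for illustration; the write-up gives no implementation details here.

```python
import numpy as np

def normalize_and_fill(seq):
    """Standardize with NaN-aware statistics, then replace NaNs with zeros."""
    mean = np.nanmean(seq)
    std = np.nanstd(seq)
    normalized = (seq - mean) / (std + 1e-6)
    # After standardization, zero corresponds to the mean position.
    return np.nan_to_num(normalized, nan=0.0)

def resize_time_nearest(seq, target_len):
    """Nearest-neighbour resampling of the time axis to target_len frames."""
    t = seq.shape[0]
    idx = np.round(np.linspace(0, t - 1, target_len)).astype(int)
    return seq[idx]

# Toy sequence: 3 frames, 2 values per frame, one missing value.
seq = np.array([[0.0, 1.0], [np.nan, 3.0], [4.0, 5.0]])
filled = normalize_and_fill(seq)
resized = resize_time_nearest(filled, target_len=6)  # 6 is a hypothetical length
print(resized.shape)
```

Filling after normalization, rather than before, keeps the missing points from distorting the mean and standard deviation.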
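Transformer preprocessing
- Kept 61 points: 40 lip points and 21 hand points. Of the left and right hands, kept the one with fewer NaNs. If the right hand was kept, it was mirrored to the left hand.
- Applied augmentation, normalization, and NaN filling, in that order.
- Sequences longer than 96 frames were interpolated down to 96. Sequences shorter than 96 were left unchanged.
- In addition to the raw positions, used hand-crafted features, including motion, distances, and cosines of angles.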
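A minimal sketch of the length cap and the hand-crafted features might look like this. Linear interpolation, the choice of which distances and angles to compute, and all function names are assumptions; only the limit of 96 frames and the three feature families come from the write-up.

```python
import numpy as np

MAX_LEN = 96  # from the write-up: longer sequences are interpolated down to 96

def cap_length(seq, max_len=MAX_LEN):
    """Linearly interpolate sequences longer than max_len; leave shorter ones as-is."""
    t = seq.shape[0]
    if t <= max_len:
        return seq
    pos = np.linspace(0.0, t - 1, max_len)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, t - 1)
    frac = (pos - lo)[:, None, None]
    return (1.0 - frac) * seq[lo] + frac * seq[hi]

def handcrafted_features(seq):
    """Motion, pairwise distances, and angle cosines for a (T, N, 2) sequence."""
    # Motion: frame-to-frame displacement, zero for the first frame.
    motion = np.diff(seq, axis=0, prepend=seq[:1])
    # Distances between every pair of points within a frame.
    diff = seq[:, :, None, :] - seq[:, None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Cosine of the angle at point i between the segments to points i-1 and i+1.
    a = seq[:, :-2] - seq[:, 1:-1]
    b = seq[:, 2:] - seq[:, 1:-1]
    norms = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    cos = (a * b).sum(axis=-1) / (norms + 1e-6)
    return motion, dist, cos

rng = np.random.default_rng(0)
long_seq = rng.normal(size=(120, 4, 2))   # 120 frames, 4 points, (x, y)
short_seq = rng.normal(size=(50, 4, 2))
capped = cap_length(long_seq)
motion, dist, cos = handcrafted_features(capped)
print(capped.shape, motion.shape, dist.shape, cos.shape)
```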
Augmentations
- Random affine
- Random interpolation
- Flip pose
- Finger tree rotate

3rd place

https://www.kaggle.com/competitions/asl-signs/discussion/406568

We combined six conv1d models and two Transformer models. The key points of the solution are data preprocessing, hard augmentation, and ensembling.

Preprocessing
- 20 lip points, 32 eye points, 42 hand points (left and right hands), and 8 pose points.
- The input sequence is normalized using the shoulder, hip, lip, and eye points.
- NaN values are filled with 0.0.
- A motion embedding is learned from the input sequence.
Augmentation
- Global augmentation (the same transform applied to all frames): rotation (-10, 10), shift (-0.1, 0.1), scale (0.8, 1.2), shear (-1.0, 1.0), and flip (applied to some signs only).
- Time-based augmentation (applied to some frames only): randomly select 1-8 frames for affine augmentation, and randomly drop frames (filled with 0.0).
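The global augmentation can be sketched as a single random affine transform shared by every frame, using the parameter ranges quoted above. Treating the rotation and shear ranges as degrees, and transforming about the point (0.5, 0.5) on the assumption that coordinates are normalized to [0, 1], are guesses that the write-up does not confirm.

```python
import numpy as np

def random_global_affine(seq, rng):
    """Apply one random affine transform to every frame of a (T, N, 2) sequence."""
    angle = np.deg2rad(rng.uniform(-10.0, 10.0))        # assumed to be degrees
    shear = np.tan(np.deg2rad(rng.uniform(-1.0, 1.0)))  # assumed to be degrees
    scale = rng.uniform(0.8, 1.2)
    shift = rng.uniform(-0.1, 0.1, size=2)
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])
    shear_mat = np.array([[1.0, shear],
                          [0.0, 1.0]])
    matrix = scale * (rot @ shear_mat)
    center = np.array([0.5, 0.5])  # assumes coordinates normalized to [0, 1]
    # The same matrix and shift are used for all frames and all points.
    return (seq - center) @ matrix.T + center + shift

rng = np.random.default_rng(0)
points = rng.uniform(0.2, 0.8, size=(1, 5, 2))
seq = np.repeat(points, 10, axis=0)   # 10 identical frames of 5 points
out = random_global_affine(seq, rng)
print(out.shape)
```

Because the parameters are sampled once per sequence, identical input frames stay identical after the transform, which is what distinguishes this from the time-based augmentation.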

4th place

https://www.kaggle.com/competitions/asl-signs/discussion/406673

Modeling
- The first is a model that classifies fixed-length sequences (1DCNN-FixLen).
- The second is a model that classifies variable-length sequences (1DCNN-VariableLen).
Data Augmentation
- Randomly drop frames (p=0.3).
- Augment hand position, size, and angle.
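The frame-dropping augmentation could be sketched as below. It assumes that p=0.3 is the probability of dropping each individual frame; the write-up does not say whether p is per frame or per sequence, and the function name is illustrative.

```python
import numpy as np

def random_drop_frames(seq, rng, p=0.3):
    """Drop each frame independently with probability p, keeping at least one."""
    keep = rng.random(seq.shape[0]) >= p
    if not keep.any():
        # Never return an empty sequence.
        keep[rng.integers(seq.shape[0])] = True
    return seq[keep]

rng = np.random.default_rng(0)
# 50 frames, 3 points, (x, y); every value equals the frame index.
seq = np.arange(50, dtype=float)[:, None, None] * np.ones((1, 3, 2))
out = random_drop_frames(seq, rng)
print(seq.shape, "->", out.shape)
```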
Preprocessing
- Use XY coordinates
- Normalize the coordinates so that the point between the eyebrows is at (0,0).
- Compare the number of frames in which the right and left hands are detected, and flip the sequence if the left hand is detected more often.
- Use XY coordinates of 21 feature points of the right hand (flip left hand) and 40 feature points of the lips.
- Delete frames in which the feature points of the hand have not been detected.
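The hand-related preprocessing steps above can be sketched as follows. The array layout, the function name, and the omission of the lip points are simplifications for illustration; in a full pipeline the lip points would be centered and flipped in the same way.

```python
import numpy as np

def preprocess_hands(left, right, between_eyebrows):
    """left/right: (T, 21, 2) with NaN where undetected; between_eyebrows: (T, 2)."""
    # Move the origin to the point between the eyebrows.
    left = left - between_eyebrows[:, None, :]
    right = right - between_eyebrows[:, None, :]
    left_detected = ~np.isnan(left).any(axis=(1, 2))
    right_detected = ~np.isnan(right).any(axis=(1, 2))
    if left_detected.sum() > right_detected.sum():
        # Mirror the left hand across the vertical axis so it looks like a right hand.
        hand = left * np.array([-1.0, 1.0])
        detected = left_detected
    else:
        hand, detected = right, right_detected
    # Delete frames in which the chosen hand was not detected.
    return hand[detected]

rng = np.random.default_rng(0)
T = 5
left = rng.uniform(size=(T, 21, 2))
left[4] = np.nan                         # left hand missing in the last frame
right = np.full((T, 21, 2), np.nan)
right[1] = rng.uniform(size=(21, 2))     # right hand seen in one frame only
eyebrows = np.full((T, 2), 0.5)
hand = preprocess_hands(left, right, eyebrows)
print(hand.shape)
```

Centering on the point between the eyebrows first means that mirroring reduces to negating the x coordinate.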