NLP for Beginners: BERT Knowledge Representation, Training, and Compression

Contents

Introduction to the BERT Model
What BERT "Knows"
Training BERT
- Pre-training BERT
- Fine-tuning BERT
BERT Model Compression

A Primer in BERTology: What We Know About How BERT Works, 2020, https://arxiv.org/abs/2002.12327

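Transformer-based NLP models are now widely used, but we still know very little about their inner workings.

This paper covers:

How the BERT model works and how it is pre-trained
Scenarios where the BERT model applies
BERT variants, compression, and directions for improvement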

Introduction to the BERT Model

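BERT is a stack of Transformer encoder layers. For each input token in the sequence, every attention head computes key, value, and query vectors, which are used to build a weighted representation. The outputs of all heads in the same layer are combined and passed through a fully connected layer.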
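
The head computation described above can be sketched in plain Python. This is only an illustration under simplifying assumptions: the weights are random, there are no biases, residual connections, or layer normalization, and the helper names (`matmul`, `attention_head`, `multi_head_attention`, `rand_matrix`) are made up for this example rather than taken from any library.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x, w_q, w_k, w_v):
    """One head: project to queries, keys, values, then take a weighted sum of values."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    # Scale by sqrt(d_k) before the softmax, as in scaled dot-product attention.
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_attention(x, heads, w_o):
    """Concatenate the outputs of all heads and merge them with one linear layer."""
    outputs = [attention_head(x, *head)[0] for head in heads]
    concat = [sum((out[i] for out in outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)


def rand_matrix(rows, cols, rng):
    """Hypothetical random initialization, used only to make the sketch runnable."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
x = rand_matrix(seq_len, d_model, rng)
heads = [tuple(rand_matrix(d_model, d_head, rng) for _ in range(3)) for _ in range(n_heads)]
w_o = rand_matrix(d_model, d_model, rng)
out = multi_head_attention(x, heads, w_o)
print(len(out), len(out[0]))  # one d_model-sized vector per input token
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the value vectors. A real implementation would use a tensor library and batched matrix operations instead of nested lists.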

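The original BERT training pipeline has two stages: pre-training and fine-tuning. Pre-training uses two self-supervised tasks: masked language modeling (MLM, predicting randomly masked input tokens) and next sentence prediction (NSP, predicting whether two input sentences are adjacent to each other). When fine-tuning for a downstream task, one or more fully connected layers are usually added on top of the final encoder layer.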
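
To make the MLM objective concrete, here is a minimal sketch of how masked training inputs might be built. The text above only says that inputs are randomly masked; the 15% selection rate and the 80/10/10 split between `[MASK]`, a random token, and the unchanged token are taken from the original BERT paper, and the function name `mask_tokens` and the toy vocabulary are assumptions for illustration.

```python
import random

MASK = "[MASK]"
SPECIAL_TOKENS = ("[CLS]", "[SEP]")


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (inputs, labels); labels are None where no prediction is required."""
    inputs, labels = [], []
    for tok in tokens:
        # Special tokens are never selected; other tokens are selected with mask_prob.
        if tok in SPECIAL_TOKENS or rng.random() >= mask_prob:
            inputs.append(tok)
            labels.append(None)
            continue
        labels.append(tok)  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs.append(MASK)               # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            inputs.append(tok)                # 10%: keep the token unchanged
    return inputs, labels


rng = random.Random(0)
tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
toy_vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # hypothetical vocabulary
inputs, labels = mask_tokens(tokens, toy_vocab, rng)
print(inputs)
print(labels)
```

The loss is computed only at positions whose label is not `None`, which is why the unselected positions carry no target.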

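BERT first tokenizes the input into wordpieces, then combines three embedding layers (token, position, and segment) to obtain a fixed-length vector for each token. The special token [CLS] is used for classification predictions, and [SEP] separates input segments. The original BERT comes in two versions, base and large, which differ in the number of layers, the hidden size, and the number of attention heads.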
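
The input layout and the three embeddings can be sketched as follows. The base and large sizes are the ones reported in the original BERT paper, not stated in the text above; the helper names, the tiny embedding dimension, and the random tables are assumptions made only so the example runs.

```python
import random

# Sizes from the original BERT paper (not from the text above).
BERT_CONFIGS = {
    "base": {"layers": 12, "hidden_size": 768, "attention_heads": 12},
    "large": {"layers": 24, "hidden_size": 1024, "attention_heads": 16},
}


def build_pair_input(tokens_a, tokens_b):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP] with segment and position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def make_table(keys, dim, rng):
    """Hypothetical randomly initialized embedding table."""
    return {key: [rng.gauss(0, 0.02) for _ in range(dim)] for key in keys}


def embed(tokens, segment_ids, position_ids, tok_emb, seg_emb, pos_emb):
    """Sum token, segment, and position embeddings element-wise for each input token."""
    return [
        [t + s + p for t, s, p in zip(tok_emb[tok], seg_emb[seg], pos_emb[pos])]
        for tok, seg, pos in zip(tokens, segment_ids, position_ids)
    ]


tokens, segment_ids, position_ids = build_pair_input(["my", "dog"], ["he", "barks"])
print(tokens)       # ['[CLS]', 'my', 'dog', '[SEP]', 'he', 'barks', '[SEP]']
print(segment_ids)  # [0, 0, 0, 0, 1, 1, 1]

rng = random.Random(0)
dim = 8  # toy size; BERT-base uses 768
tok_emb = make_table(set(tokens), dim, rng)
seg_emb = make_table([0, 1], dim, rng)
pos_emb = make_table(position_ids, dim, rng)
vectors = embed(tokens, segment_ids, position_ids, tok_emb, seg_emb, pos_emb)
print(len(vectors), len(vectors[0]))  # one dim-sized vector per token
```

In both configurations the hidden size divided by the number of heads is 64, so each head works with vectors of the same width.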

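What BERT "Knows"

Syntactic knowledge

Studies show that BERT representations are hierarchical rather than linear: beyond word-order information, they contain knowledge that resembles syntactic tree structure.

Syntactic structure does not appear to be encoded directly in the self-attention weights, but it can be recovered from the token representations.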

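BERT is insensitive to malformed input: its predictions did not change even when word order was shuffled, sentences were truncated, or subjects and objects were removed.

Semantic knowledge

BERT's correct MLM predictions go beyond simply filling in words: BERT can capture information such as entities, relations, and roles.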

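BERT struggles to represent numerical values and generalizes poorly from the training data. One possible reason is that wordpiece tokenization splits numbers into pieces.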
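
The effect of wordpiece splitting on numbers can be illustrated with a toy greedy longest-match tokenizer. This is a simplified sketch of the WordPiece matching step, not BERT's actual tokenizer, and the tiny vocabulary is hypothetical, chosen only to show how numerically close values can end up with very different pieces.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split; continuation pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical vocabulary for illustration only.
toy_vocab = {"1990", "19", "##91", "##8", "##9", "##1"}

print(wordpiece("1990", toy_vocab))  # ['1990']
print(wordpiece("1991", toy_vocab))  # ['19', '##91']
print(wordpiece("1989", toy_vocab))  # ['19', '##8', '##9']
```

Three consecutive years map to one, two, and three pieces respectively, so the model never sees a representation that reflects how close their values are.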

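World knowledge

BERT can make predictions through MLM, and can do so better than conventional methods. However, BERT cannot be used directly for logical reasoning.

Training BERT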

Pre-training BERT

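The original BERT is pre-trained on two tasks: next sentence prediction (NSP) and masked language modeling (MLM). Several studies have proposed improvements to these pre-training tasks: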

How to mask
- vary the corruption rate and the corrupted span length
- diverse masks for training examples within an epoch
- replace the [MASK] token with the [UNK] token
What to mask
- mask full words instead of word-pieces
- mask spans rather than single tokens
- mask phrases and named entities
Where to mask
- arbitrary text streams instead of sentence pairs
- combine MLM with a partially autoregressive LM
Alternatives to masking
- deletion, infilling, sentence permutation and document rotation
- predict whether a token is capitalized and whether it occurs in other segments of the same document
- train on different permutations of word order in the input sequence, maximizing the probability of the original word order
- detect tokens that were replaced by a generator network rather than masked
NSP alternatives
- removing NSP does not hurt performance, or slightly improves it
- replace NSP with the task of predicting both the next and the previous sentences
- replace the negative NSP examples by swapped sentences from positive examples
- sentence reordering and sentence distance prediction
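
Two of the variants listed above are easy to sketch: masking full words instead of individual word-pieces, and building negative sentence-pair examples by swapping the order of a positive pair. The code below is a minimal illustration under stated assumptions: the function names are invented for this example, selected words are always replaced with `[MASK]` (no random or unchanged replacements), and the `##` prefix is assumed to mark continuation pieces.

```python
import random


def group_whole_words(pieces):
    """Group wordpiece indices so that '##' continuations stay with their first piece."""
    groups = []
    for i, piece in enumerate(pieces):
        if piece.startswith("##") and groups:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


def whole_word_mask(pieces, rng, mask_prob=0.15, mask_token="[MASK]"):
    """Select whole words at random and mask every piece of each selected word."""
    masked = list(pieces)
    labels = [None] * len(pieces)
    for group in group_whole_words(pieces):
        if rng.random() < mask_prob:
            for i in group:
                labels[i] = pieces[i]
                masked[i] = mask_token
    return masked, labels


def sentence_order_examples(sentences):
    """Label adjacent pairs in original order as 1 and the same pair swapped as 0."""
    examples = []
    for first, second in zip(sentences, sentences[1:]):
        examples.append((first, second, 1))
        examples.append((second, first, 0))
    return examples


pieces = ["the", "em", "##bed", "##ding", "layer", "works"]
print(group_whole_words(pieces))  # [[0], [1, 2, 3], [4], [5]]

rng = random.Random(0)
masked, labels = whole_word_mask(pieces, rng, mask_prob=0.5)
print(masked)

print(sentence_order_examples(["A.", "B.", "C."]))
```

Because the pieces of a word are masked together, the model cannot recover a masked piece simply by looking at its unmasked neighbors within the same word.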