Transformer: A 3-Minute Overview

Jul 26, 2023

Transformer Attention is everything.

Paper: [1706.03762] Attention Is All You Need (arxiv.org)
Transformer Pytorch: GitHub — hyunwoongko/transformer: PyTorch Implementation of “Attention Is All You Need”

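Let's start with a look at the Transformer architecture diagram. This post covers it in a few parts:

Decoder and Encoder architecture: in the diagram above, the left half is the Encoder and the right half is the Decoder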
Position Encoding
Attention Layer
Normalization

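Decoder and Encoder Architecture

As the diagram above shows, the Encoder on the left half mainly compresses the input into an internal representation, while the Decoder on the right half progressively restores information from it to produce the output.

A single Encoder module and a single Decoder module are shown in the figure below. Each combines a Self-Attention layer with a Feed Forward Neural Network; the Decoder also has an Encoder-Decoder Attention layer between the two.

In the Encoder stack, the input of each Encoder is the output of the previous Encoder.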

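Each Decoder takes the output of the previous Decoder together with the output of the Encoder stack (the output of the last Encoder, which is passed to every Decoder). At generation time the Decoder stack emits one token per step, and the tokens produced so far are fed back in as the Decoder input for the next step.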

This is why the Decoder stack of GPT-2 is built on the Transformer Decoder: it is formed by stacking multiple Decoder layers.
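The data flow described in this section can be sketched in a few lines of plain Python. This only shows the wiring between layers; `encoder_layer` and `decoder_layer` are hypothetical stand-ins that build a string describing what they received, not real attention or feed-forward blocks.

```python
def encoder_layer(x, index):
    # Stand-in for one Encoder (Self-Attention + Feed Forward).
    return f"enc{index}({x})"


def decoder_layer(y, memory, index):
    # Stand-in for one Decoder; it sees the previous Decoder's output (y)
    # and the output of the Encoder stack (memory).
    return f"dec{index}({y}, {memory})"


def run_encoder(src, num_layers):
    x = src
    for i in range(1, num_layers + 1):
        x = encoder_layer(x, i)  # each Encoder consumes the previous Encoder's output
    return x


def run_decoder(tgt, memory, num_layers):
    y = tgt
    for i in range(1, num_layers + 1):
        y = decoder_layer(y, memory, i)  # the same memory is passed to every Decoder
    return y


memory = run_encoder("src", num_layers=2)
output = run_decoder("tgt", memory, num_layers=2)
print(memory)  # enc2(enc1(src))
print(output)
```

Note that every Decoder receives the same `memory`, while `y` changes from layer to layer.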

Position Encoding

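The Transformer introduces positional encoding, a technique that embeds the position of each element of the sequence into its vector representation. Self-attention on its own has no notion of token order, so this is what lets the Transformer make use of where each token sits in the sequence.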

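In the Transformer, positional encoding is implemented by adding position information to the word embedding vectors. The positional encoding vector has the same dimension as the embedding vector and is added to it element-wise, so that the result reflects the word's position in the sequence. Specifically, the paper computes the positional encoding with sine and cosine functions of different frequencies.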
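For reference, the sinusoidal encoding from the paper is PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model)). Below is a minimal sketch of it in plain Python. The function names `positional_encoding` and `add_position` are made up for this illustration, and a real implementation would use tensors rather than nested lists.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal positional encodings."""
    table = []
    for pos in range(max_len):
        row = []
        for j in range(d_model):
            # Dimensions 2i and 2i+1 share the same frequency, so round j down to even.
            angle = pos / (10000 ** ((j - j % 2) / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_position(embeddings):
    """Add the positional encoding element-wise to a list of embedding vectors."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, enc)] for emb, enc in zip(embeddings, pe)]


pe = positional_encoding(max_len=4, d_model=6)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

At position 0 every sine term is 0 and every cosine term is 1; later positions get a different pattern in each row, which is what lets the model tell positions apart.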

Multi-Head Attention

Professor Lee's explanation is recommended: https://www.youtube.com/watch?v=gmsMY5kc-zw&pp=ygUSbXVsdGloZWFkYXR0ZW50aW9u

Single Head Attention: the case of a single query, Query 1, looks like this
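Since the figure may not be in front of you, here is a rough sketch of what a single query does in scaled dot-product attention: score the query against every key, turn the scores into weights with softmax, and take the weighted sum of the values. The helper names `softmax` and `attend` are chosen for this example, and the learned projection matrices that produce the queries, keys, and values are left out.

```python
import math


def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(query)
    # One score per key: dot(query, key) / sqrt(d_k).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # Output is the weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
output, weights = attend([1.0, 0.0], keys, values)
print(weights)  # the first key matches the query, so it gets the larger weight
print(output)
```

The output is pulled toward the value whose key is most similar to the query.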

Multi Head Attention: repeat the Single Head Attention above once for each head i, each with its own Query, Key, and Value, and concatenate the outputs
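The "repeat per head, then concatenate" idea can be sketched as follows. This is a simplified illustration, not the paper's full layer: it assumes each head simply takes its own slice of the query, key, and value vectors, and it omits the learned per-head projections and the final output projection. `split_heads` and `multi_head_attend` are hypothetical names for this example.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector (single head)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]


def split_heads(vector, num_heads):
    # Assumption for this sketch: head h just takes the h-th slice of the vector.
    size = len(vector) // num_heads
    return [vector[h * size:(h + 1) * size] for h in range(num_heads)]


def multi_head_attend(query, keys, values, num_heads):
    q_heads = split_heads(query, num_heads)
    k_heads = [split_heads(k, num_heads) for k in keys]
    v_heads = [split_heads(v, num_heads) for v in values]
    output = []
    for h in range(num_heads):
        # Run single-head attention on head h's slice of every vector.
        head_out = attend(q_heads[h], [k[h] for k in k_heads], [v[h] for v in v_heads])
        output.extend(head_out)  # concatenate the head outputs
    return output


query = [1.0, 0.0, 0.0, 1.0]
keys = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
values = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
print(multi_head_attend(query, keys, values, num_heads=2))
```

Each head computes its own attention weights, so different heads can focus on different keys before their outputs are joined back into one vector.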

Written by Johnny Chang

Linkedin: https://www.linkedin.com/in/cheng-chang-7b592586/
