GPT-2 Transformer Attention Block Analysis
Transformers have revolutionized natural language processing with models like GPT-2. At the core of GPT-2’s architecture lies the attention mechanism, which determines how the model processes and prioritizes different tokens in a sequence.
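For reference, each GPT-2 attention head computes the scaled dot-product attention introduced in "Attention Is All You Need", with an additional causal mask so a token can only attend to earlier positions:

$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, and $V$ are the query, key, and value projections of the token representations and $d_k$ is the key dimension.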
Dependencies
transformers
torch
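A minimal setup sketch using the two dependencies above (the "gpt2" 124M checkpoint is an assumption; the notebook may target a larger variant):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the tokenizer and model; "gpt2" is the smallest (124M) checkpoint.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()  # inference only; no gradients are needed to inspect attention
```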
Notebook Overview
Visualization of Attention Patterns: Includes heatmaps and other visual aids to illustrate how GPT-2 attends to different tokens in input sequences (a heatmap sketch follows this list).
Token Generation Process: GPT-2 predicts each next token from the preceding tokens. At each step, the attention layers recompute their outputs, highlighting how each token's encoding adapts to the evolving context as the sequence is completed (a step-by-step sketch follows the heatmap sketch below).
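The following is a hedged sketch of how such a heatmap could be produced, continuing from the setup sketch under Dependencies; matplotlib is assumed for plotting (it is not listed above), and the block/head indices are arbitrary illustrations rather than the notebook's actual choices:

```python
import matplotlib.pyplot as plt  # assumed plotting library, not listed under Dependencies

text = "The quick brown fox jumps over the lazy dog"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions is a tuple with one tensor per transformer block,
# each shaped (batch, num_heads, seq_len, seq_len).
block_idx, head_idx = 0, 0  # arbitrary choices for illustration
attn = outputs.attentions[block_idx][0, head_idx]  # (seq_len, seq_len)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
plt.imshow(attn, cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.xlabel("Key token")
plt.ylabel("Query token")
plt.title(f"GPT-2 attention, block {block_idx}, head {head_idx}")
plt.colorbar()
plt.show()
```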
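And a sketch of inspecting attention during step-by-step generation, again reusing the model and tokenizer from the setup sketch; greedy decoding and the five-token horizon are assumptions made for illustration, not necessarily what the notebook does:

```python
prompt = "The attention mechanism"
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]

step_attentions = []  # one attention vector per generated token
with torch.no_grad():
    for _ in range(5):  # generate five tokens greedily (decoding strategy assumed)
        outputs = model(input_ids, output_attentions=True)

        # Attention of the newest token over the full prefix, averaged across
        # the heads of the last block; shape (seq_len,).
        last_block = outputs.attentions[-1]  # (1, num_heads, seq_len, seq_len)
        step_attentions.append(last_block[0, :, -1, :].mean(dim=0))

        next_id = outputs.logits[0, -1].argmax()  # greedy pick of the next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
for step, a in enumerate(step_attentions):
    print(f"step {step}: attention distributed over {a.shape[0]} prefix tokens")
```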
Corrections
The subheader “Object class for inferencing attn by each dimensions.” is incorrect. The correct subheader should be: “Object class for visualizing attention blocks and layers.”
Notes to reader
The attention block was analyzed purely out of curiosity and without any prior reading of the implementation details (e.g., function inputs and outputs) in the Hugging Face Transformers repository. This was a double-edged sword: it prolonged the time it took to reach conclusions and kept me from broadening my modeling methods, but it also let me approach the attention mechanism as a Padawan whose only prior knowledge was the mathematical (theoretical) viewpoint gained from papers like "Attention Is All You Need".
View Selection: Block vs. Layer – The model's output was initially quite confusing to interpret. To further my understanding, I visualised the variation along each dimension of the outputs, which led to the following distinctions (sketched in code after this list):
Viewing by block: Extracts attention from a specific transformer block, focusing on how attention is distributed within that block.
Viewing by layer: Extracts attention across all blocks for a specific layer, allowing a comparative analysis of how that layer's attention patterns change with model depth.
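As a rough, self-contained illustration of the two views (my own reading of the terms: "block" indexes the per-block attentions tuple, while "layer" is taken here as a fixed head index traced across every block; the notebook may slice the tensors differently, and the indices 3 and 5 are arbitrary):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# Stack the per-block tuple into one tensor:
# (num_blocks, batch, num_heads, seq_len, seq_len); (12, 1, 12, seq, seq) for "gpt2".
attn = torch.stack(outputs.attentions)

# Viewing by block: every head of one transformer block.
block_view = attn[3, 0]      # (num_heads, seq_len, seq_len)

# Viewing by layer: the same head index sliced across all blocks, so the same
# slice can be compared at different depths of the model.
layer_view = attn[:, 0, 5]   # (num_blocks, seq_len, seq_len)
```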