
Transformer Models & BERT Model: Overview

  • Writer: Archishman Bandyopadhyay
  • Jun 19, 2023
  • 5 min read

Presenter: Sanjana Reddy, Machine Learning Engineer at Google's Advanced Solutions Lab



1. Evolution of Language Modeling


  1. Word2Vec and N-grams (2013): Word2Vec represents each word as a dense vector in a continuous space, so words used in similar contexts end up with similar vectors. N-grams are fixed-size tuples of words that appear consecutively in a document. Both techniques help capture word relationships and local context within a text (a short code sketch follows this list).

  2. RNNs and LSTMs (2014): Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks are designed to handle sequential data and have shown improved performance on NLP tasks like machine translation and text classification. They maintain a hidden state that can capture information from previous time steps, thus capturing the context of the input data.

  3. Attention Mechanisms (2015): Attention mechanisms allow models to focus on specific parts of the input when generating the output. This has led to the development of more advanced models like Transformers and BERT, which have demonstrated state-of-the-art performance on a wide range of NLP tasks.
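To make the first item above concrete, here is a minimal Python sketch of n-gram extraction and of comparing dense word vectors by cosine similarity. The vectors are toy values for illustration only; a real Word2Vec model would learn them from a large corpus.

import numpy as np

def ngrams(tokens, n):
    """Return the consecutive n-word tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "the cat sat on the mat".split()
print(ngrams(sentence, 2))  # bigrams: ('the', 'cat'), ('cat', 'sat'), ...

# Dense vectors: each word maps to a point in a continuous space, and words
# used in similar contexts end up close together (measured here by cosine).
vectors = {
    "king":  np.array([0.90, 0.80, 0.10]),
    "queen": np.array([0.88, 0.82, 0.15]),
    "apple": np.array([0.10, 0.20, 0.90]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["king"], vectors["queen"]))  # close to 1.0
print(cosine(vectors["king"], vectors["apple"]))  # much lower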

2. Transformers

  • Paper: "Attention is All You Need" (2017) by Vaswani et al.

  • Advantages: Context-aware word representations, parallelization, and improved performance in machine translation due to the attention mechanism that allows the model to selectively focus on different parts of the input.

Although the models that came before Transformers could represent words as vectors, those vectors did not capture context, and the meaning of a word changes with its context. For example, "bank" in "river bank" and "bank" in "bank robber" might have had the same vector representation before attention mechanisms came about.

A transformer is an encoder-decoder model that uses the attention mechanism.


Transformer Architecture

  • Encoder-decoder model: The Transformer consists of an encoder, which processes the input sequence and creates a representation, and a decoder, which generates the output based on the encoder's representation.

  • Encoder: The encoder is composed of a stack of identical layers (usually six - not a magical number, just a hyperparameter) with self-attention and feedforward sub-layers. It encodes the input sequence and passes it to the decoder.

  • Decoder: The decoder is also composed of a stack of identical layers (usually six) with self-attention, encoder-decoder attention, and feedforward sub-layers. It decodes the representation from the encoder for a specific task (e.g., translation).
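The stacked encoder-decoder layout described above can be sketched directly with PyTorch's built-in nn.Transformer module. This is a minimal illustration, assuming PyTorch is installed; a real model would also add input embeddings, positional encodings, masks, and an output projection. The sizes below mirror the base model from the paper and are hyperparameters, not requirements.

import torch
import torch.nn as nn

model = nn.Transformer(
    d_model=512,            # embedding width
    nhead=8,                # attention heads per layer
    num_encoder_layers=6,   # stack of identical encoder layers
    num_decoder_layers=6,   # stack of identical decoder layers
    dim_feedforward=2048,   # width of the feedforward sub-layer
)

# Dummy, already-embedded sequences: (sequence length, batch size, d_model).
src = torch.rand(10, 2, 512)   # input sentence fed to the encoder
tgt = torch.rand(7, 2, 512)    # partially generated output fed to the decoder

out = model(src, tgt)
print(out.shape)  # torch.Size([7, 2, 512]) -- one vector per target position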


Operating steps of Transformers

  • Input Embedding: Convert the input sentence (natural language) into word embeddings for each word in the sentence.

  • Self-Attention Layer:

    • Break up the input embeddings into query, key, and value vectors using learned weights.

  • Compute the attention scores by taking the dot product of the query and key vectors (in the original Transformer these scores are also scaled by the square root of the key dimension); the full sequence of steps is sketched in code after this list.

  • Apply the softmax function to the attention scores.

  • Multiply the value vectors by the softmax scores to obtain weighted value vectors.

  • Sum up the weighted value vectors to produce the output of the self-attention layer for each position.


  • Feedforward Layer: Process the output of the self-attention layer through a feedforward neural network separately for each position.

  • Multi-Headed Attention: Repeat the self-attention process eight times (in this case) to perform multi-headed attention.

    • Multiply the input embeddings by their respective weight matrices (one set of query, key, and value weights per head).

    • Calculate the attention using the resulting query, key, and value matrices.

  • Output Matrix: Concatenate the matrices from the multi-headed attention step and produce the output matrix, which has the same dimensions as the initial input matrix.
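The steps above can be condensed into a short NumPy sketch: project the input embeddings into query, key, and value vectors, score, apply softmax, weight the values, and concatenate the heads. The weight matrices here are random stand-ins for learned parameters, and the division by the square root of the head dimension is the scaling used in the original paper.

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 16, 8     # 4 tokens, 8 attention heads
d_head = d_model // n_heads

X = rng.normal(size=(seq_len, d_model))  # input embeddings (one row per token)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv     # query, key, and value vectors
    scores = Q @ K.T / np.sqrt(d_head)   # attention scores
    weights = softmax(scores)            # softmax over each row
    return weights @ V                   # weighted sum of value vectors

# Multi-headed attention: repeat with separate weights per head, then
# concatenate so the output has the same dimensions as the input matrix.
heads = []
for _ in range(n_heads):
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention_head(X, Wq, Wk, Wv))

output = np.concatenate(heads, axis=-1)
print(output.shape)  # (4, 16) -- same shape as the input embeddings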



3. BERT (Bidirectional Encoder Representations from Transformers)

  • Developed by Google in 2018: BERT is a Transformer-based model that has shown state-of-the-art performance on a wide range of NLP tasks. Today, BERT powers Google Search.

  • Encoder-only architecture: BERT is an encoder-only model, which means it does not use the decoder component of the original Transformer architecture. This makes it more suitable for tasks that require understanding the input text rather than generating new text.

  • Variations: There are two main variations of BERT - BERT Base (12 layers, 110 million parameters) and BERT Large (24 layers, 340 million parameters). Both models have been pre-trained on large text corpora (Wikipedia and Books Corpus) and can be fine-tuned for specific tasks.


BERT Training Tasks

  • Masked Language Model: During pre-training, BERT learns to predict masked words in a sentence. A certain percentage of the words (usually 15%) are masked, and the model is trained to predict these masked words based on the context provided by the unmasked words (see the sketch after this list).

  • Next Sentence Prediction: BERT is also trained to predict whether a given sentence is the next sentence following a given input sentence. This helps BERT learn relationships between sentences and improves its understanding of sentence-level context.
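The masked-word objective is easy to see in action with the Hugging Face transformers library (assuming it is installed and a pre-trained checkpoint can be downloaded); this fill-mask sketch is an illustration, not part of the lecture itself.

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the token hidden behind [MASK] from the surrounding context.
for prediction in unmasker("The man went to the [MASK] to buy milk."):
    print(prediction["token_str"], round(prediction["score"], 3))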


BERT Embeddings

  • Token embeddings: Token embeddings are dense vector representations of each token (word or subword) in the input sentence. These embeddings capture semantic information about the tokens and are used as input to the BERT model.

  • Segment embeddings: Segment embeddings are used to distinguish between different input sequences when BERT processes pairs of sentences (e.g., for sentence pair classification). The [SEP] token is used to separate the two input sequences.

  • Position embeddings: Position embeddings encode the position of each token in the input sequence. This allows BERT to learn the order of the input tokens and incorporate this information when processing the input (the tokenizer sketch below shows how these inputs look in practice).
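Here is a small sketch of how these inputs appear in practice with the Hugging Face BERT tokenizer (an assumption for illustration; any BERT-compatible tokenizer behaves similarly). Encoding a sentence pair returns token ids for the token embeddings and token_type_ids for the segment embeddings; position embeddings are added inside the model, one per token index.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encoding a sentence pair: [SEP] separates the two sequences, and
# token_type_ids are 0 for the first sentence and 1 for the second.
encoded = tokenizer("The bank was robbed.", "The robber fled.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
print(encoded["token_type_ids"])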


4. BERT for Downstream Tasks

  • BERT can be fine-tuned for various NLP tasks, including:

    1. Single sentence classification: BERT can be used to classify sentences into categories based on the text. For example, sentiment analysis, where a sentence is classified as positive, negative, or neutral.

    2. Sentence pair classification: BERT can be fine-tuned to determine the relationship between two sentences, such as textual entailment or semantic similarity.

    3. Question answering: BERT can be used in question-answering systems to find the most relevant answer to a given question within a given context. For example, the model can be fine-tuned on the SQuAD (Stanford Question Answering Dataset) benchmark to extract answers from passages (see the sketch after this list).

    4. Single sentence tagging tasks: BERT can be used for tasks that involve tagging individual tokens in a sentence, such as named entity recognition (NER) or part-of-speech (POS) tagging.
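As an example of task 3, the transformers library ships a question-answering pipeline that wraps a checkpoint already fine-tuned on SQuAD-style data (the library picks a default model if none is named); this sketch assumes the library is installed.

from transformers import pipeline

qa = pipeline("question-answering")

result = qa(
    question="Who developed BERT?",
    context="BERT is a Transformer-based model developed by Google in 2018.",
)
print(result["answer"], result["score"])  # extracted answer span and confidence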



Fine-tuning BERT

  • To fine-tune BERT for a specific task, a task-specific output layer is added to the pre-trained BERT model. The entire model, including the pre-trained layers and the new output layer, is then trained on the task-specific dataset (a minimal sketch follows below).

  • Fine-tuning typically requires much less data and training time than training a model from scratch, as the pre-trained BERT model already encodes a significant amount of language understanding.
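A compact fine-tuning sketch, assuming the transformers and torch libraries are available: a classification head is added on top of pre-trained BERT, and the whole model is updated on labelled examples. The two-sentence "dataset" and its labels are toy placeholders, not a real benchmark.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2   # the task-specific output layer
)

texts = ["I loved this movie.", "This was a waste of time."]
labels = torch.tensor([1, 0])           # toy labels: 1 = positive, 0 = negative

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):                      # a few gradient steps, just to illustrate
    outputs = model(**inputs, labels=labels)
    outputs.loss.backward()             # loss comes from the new classification head
    optimizer.step()
    optimizer.zero_grad()
    print(float(outputs.loss))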

Pre-trained BERT Models and Tokenizers

  • Pre-trained BERT models and tokenizers are available in popular NLP libraries like Hugging Face's Transformers library. These pre-trained models provide a starting point for fine-tuning on specific tasks and can save considerable time and resources compared to training a model from scratch.
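For instance, a pre-trained checkpoint and its tokenizer can be loaded from the Hugging Face hub in a couple of lines, and the model's outputs already give the context-aware word representations discussed earlier (this sketch assumes transformers and torch are installed):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The boat drifted toward the river bank.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per token, already conditioned on context --
# "bank" here gets a different vector than it would in "bank robber".
print(outputs.last_hidden_state.shape)  # torch.Size([1, num_tokens, 768])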

Transfer Learning in BERT

  • BERT leverages transfer learning, a technique where a model pre-trained on one task is fine-tuned for another task, to achieve state-of-the-art performance on a wide range of NLP tasks. The pre-training phase allows BERT to learn general language understanding, which can then be adapted to specific tasks with fine-tuning. This approach has proven to be more effective than training separate models for each task.


5. Conclusion

  • Transformers and BERT models have revolutionized NLP and have become the foundation for many state-of-the-art models in various language tasks. By leveraging attention mechanisms, parallelization, and transfer learning, these models have greatly improved the performance and capabilities of NLP systems, making them an essential tool for researchers and practitioners alike.


Python Notebook for Text Classification with BERT

Check out the complete video lecture here:


