
Transformer Models & BERT Model: Overview

  • Writer: Archishman Bandyopadhyay
  • Jun 19, 2023
  • 5 min read

Presenter: Sanjana Reddy, Machine Learning Engineer at Google's Advanced Solutions Lab



1. Evolution of Language Modeling


  1. Word2Vec and N-grams (2013): Word2Vec represents each word as a dense vector in a continuous space, so words used in similar contexts end up with similar vectors. N-grams are fixed-size tuples of words that appear consecutively in a document. Both techniques help capture word relationships and local context within a text (a short code sketch follows this list).

  2. RNNs and LSTMs (2014): Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks are designed to handle sequential data and have shown improved performance on NLP tasks like machine translation and text classification. They maintain a hidden state that can capture information from previous time steps, thus capturing the context of the input data.

  3. Attention Mechanisms (2015): Attention mechanisms allow models to focus on specific parts of the input when generating the output. This has led to the development of more advanced models like Transformers and BERT, which have demonstrated state-of-the-art performance on a wide range of NLP tasks.
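To make the first item above concrete, here is a minimal Python sketch of n-gram extraction and of comparing dense word vectors by cosine similarity. The vectors are toy values for illustration only; a real Word2Vec model would learn them from a large corpus.

import numpy as np

def ngrams(tokens, n):
    """Return the consecutive n-word tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "the cat sat on the mat".split()
print(ngrams(sentence, 2))  # bigrams: ('the', 'cat'), ('cat', 'sat'), ...

# Dense vectors: each word maps to a point in a continuous space, and words
# used in similar contexts end up close together (measured here by cosine).
vectors = {
    "king":  np.array([0.90, 0.80, 0.10]),
    "queen": np.array([0.88, 0.82, 0.15]),
    "apple": np.array([0.10, 0.20, 0.90]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["king"], vectors["queen"]))  # close to 1.0
print(cosine(vectors["king"], vectors["apple"]))  # much lower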

2. Transformers

  • Paper: "Attention is All You Need" (2017) by Vaswani et al.

  • Advantages: Context-aware word representations, parallelization, and improved performance in machine translation due to the attention mechanism that allows the model to selectively focus on different parts of the input.

Although the models that came before Transformers could represent words as vectors, those vectors did not capture context, and the meaning of a word changes with its context. For example, "bank" in "river bank" and "bank" in "bank robber" might have had the same vector representation before attention mechanisms came about.

A transformer is an encoder-decoder model that uses the attention mechanism.


Transformer Architecture

  • Encoder-decoder model: The Transformer consists of an encoder, which processes the input sequence and creates a representation, and a decoder, which generates the output based on the encoder's representation.

  • Encoder: The encoder is composed of a stack of identical layers (usually six - not a magical number, just a hyperparameter) with self-attention and feedforward sub-layers. It encodes the input sequence and passes it to the decoder.

  • Decoder: The decoder is also composed of a stack of identical layers (usually six) with self-attention, encoder-decoder attention, and feedforward sub-layers. It decodes the representation from the encoder for a specific task (e.g., translation).
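The stacked encoder-decoder layout described above can be sketched directly with PyTorch's built-in nn.Transformer module. This is a minimal illustration, assuming PyTorch is installed; a real model would also add input embeddings, positional encodings, masks, and an output projection. The sizes below mirror the base model from the paper and are hyperparameters, not requirements.

import torch
import torch.nn as nn

model = nn.Transformer(
    d_model=512,            # embedding width
    nhead=8,                # attention heads per layer
    num_encoder_layers=6,   # stack of identical encoder layers
    num_decoder_layers=6,   # stack of identical decoder layers
    dim_feedforward=2048,   # width of the feedforward sub-layer
)

# Dummy, already-embedded sequences: (sequence length, batch size, d_model).
src = torch.rand(10, 2, 512)   # input sentence fed to the encoder
tgt = torch.rand(7, 2, 512)    # partially generated output fed to the decoder

out = model(src, tgt)
print(out.shape)  # torch.Size([7, 2, 512]) -- one vector per target position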


Operating steps of Transformers

  • Input Embedding: Convert the input sentence (natural language) into word embeddings for each word in the sentence.

  • Self-Attention Layer:

    • Break up the input embeddings into query, key, and value vectors using learned weights.

  • Compute the attention scores by taking the dot product of the query and key vectors (in the original Transformer these scores are also scaled by the square root of the key dimension); the full sequence of steps is sketched in code after this list.

  • Apply the softmax function to the attention scores.

  • Multiply the value vectors by the softmax scores to obtain weighted value vectors.

  • Sum up the weighted value vectors to produce the output of the self-attention layer for each position.


  • Feedforward Layer: Process the output of the self-attention layer through a feedforward neural network separately for each position.

  • Multi-Headed Attention: Repeat the self-attention process eight times (in this case) to perform multi-headed attention.

    • Multiply the input embeddings by their respective weight matrices (one set of query, key, and value weights per head).

    • Calculate the attention using the resulting query, key, and value matrices.

  • Output Matrix: Concatenate the matrices from the multi-headed attention step and produce the output matrix, which has the same dimensions as the initial input matrix.
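The steps above can be condensed into a short NumPy sketch: project the input embeddings into query, key, and value vectors, score, apply softmax, weight the values, and concatenate the heads. The weight matrices here are random stand-ins for learned parameters, and the division by the square root of the head dimension is the scaling used in the original paper.

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 16, 8     # 4 tokens, 8 attention heads
d_head = d_model // n_heads

X = rng.normal(size=(seq_len, d_model))  # input embeddings (one row per token)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv     # query, key, and value vectors
    scores = Q @ K.T / np.sqrt(d_head)   # attention scores
    weights = softmax(scores)            # softmax over each row
    return weights @ V                   # weighted sum of value vectors

# Multi-headed attention: repeat with separate weights per head, then
# concatenate so the output has the same dimensions as the input matrix.
heads = []
for _ in range(n_heads):
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention_head(X, Wq, Wk, Wv))

output = np.concatenate(heads, axis=-1)
print(output.shape)  # (4, 16) -- same shape as the input embeddings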



3. BERT (Bidirectional Encoder Representations from Transformers)

  • Developed by Google in 2018: BERT is a Transformer-based model that has shown state-of-the-art performance on a wide range of NLP tasks. Today, BERT powers Google Search.

  • Encoder-only architecture: BERT is an encoder-only model, which means it does not use the decoder component of the original Transformer architecture. This makes it more suitable for tasks that require understanding the input text rather than generating new text.

  • Variations: There are two main variations of BERT - BERT Base (12 layers, 110 million parameters) and BERT Large (24 layers, 340 million parameters). Both models have been pre-trained on large text corpora (Wikipedia and Books Corpus) and can be fine-tuned for specific tasks.


BERT Training Tasks

  • Masked Language Model: During pre-training, BERT learns to predict masked words in a sentence. A certain percentage of the words (usually 15%) are masked, and the model is trained to predict these masked words based on the context provided by the unmasked words (see the sketch after this list).

  • Next Sentence Prediction: BERT is also trained to predict whether a given sentence is the next sentence following a given input sentence. This helps BERT learn relationships between sentences and improves its understanding of sentence-level context.
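The masked-word objective is easy to see in action with the Hugging Face transformers library (assuming it is installed and a pre-trained checkpoint can be downloaded); this fill-mask sketch is an illustration, not part of the lecture itself.

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the token hidden behind [MASK] from the surrounding context.
for prediction in unmasker("The man went to the [MASK] to buy milk."):
    print(prediction["token_str"], round(prediction["score"], 3))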


BERT Embeddings

  • Token embeddings: Token embeddings are dense vector representations of each token (word or subword) in the input sentence. These embeddings capture semantic information about the tokens and are used as input to the BERT model.

  • Segment embeddings: Segment embeddings are used to distinguish between different input sequences when BERT processes pairs of sentences (e.g., for sentence pair classification). The [SEP] token is used to separate the two input sequences.

  • Position embeddings: Position embeddings encode the position of each token in the input sequence. This allows BERT to learn the order of the input tokens and incorporate this information when processing the input (the tokenizer sketch below shows how these inputs look in practice).
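Here is a small sketch of how these inputs appear in practice with the Hugging Face BERT tokenizer (an assumption for illustration; any BERT-compatible tokenizer behaves similarly). Encoding a sentence pair returns token ids for the token embeddings and token_type_ids for the segment embeddings; position embeddings are added inside the model, one per token index.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encoding a sentence pair: [SEP] separates the two sequences, and
# token_type_ids are 0 for the first sentence and 1 for the second.
encoded = tokenizer("The bank was robbed.", "The robber fled.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
print(encoded["token_type_ids"])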


4. BERT for Downstream Tasks

  • BERT can be fine-tuned for various NLP tasks, including:

    1. Single sentence classification: BERT can be used to classify sentences into categories based on the text. For example, sentiment analysis, where a sentence is classified as positive, negative, or neutral.

    2. Sentence pair classification: BERT can be fine-tuned to determine the relationship between two sentences, such as textual entailment or semantic similarity.

    3. Question answering: BERT can be used in question-answering systems to find the most relevant answer to a given question within a given context. For example, the model can be fine-tuned on the SQuAD (Stanford Question Answering Dataset) benchmark to extract answers from passages (see the sketch after this list).

    4. Single sentence tagging tasks: BERT can be used for tasks that involve tagging individual tokens in a sentence, such as named entity recognition (NER) or part-of-speech (POS) tagging.
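As an example of task 3, the transformers library ships a question-answering pipeline that wraps a checkpoint already fine-tuned on SQuAD-style data (the library picks a default model if none is named); this sketch assumes the library is installed.

from transformers import pipeline

qa = pipeline("question-answering")

result = qa(
    question="Who developed BERT?",
    context="BERT is a Transformer-based model developed by Google in 2018.",
)
print(result["answer"], result["score"])  # extracted answer span and confidence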



Fine-tuning BERT

  • To fine-tune BERT for a specific task, a task-specific output layer is added to the pre-trained BERT model. The entire model, including the pre-trained layers and the new output layer, is then trained on the task-specific dataset (a minimal sketch follows below).

  • Fine-tuning typically requires much less data and training time than training a model from scratch, as the pre-trained BERT model already encodes a significant amount of language understanding.
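A compact fine-tuning sketch, assuming the transformers and torch libraries are available: a classification head is added on top of pre-trained BERT, and the whole model is updated on labelled examples. The two-sentence "dataset" and its labels are toy placeholders, not a real benchmark.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2   # the task-specific output layer
)

texts = ["I loved this movie.", "This was a waste of time."]
labels = torch.tensor([1, 0])           # toy labels: 1 = positive, 0 = negative

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):                      # a few gradient steps, just to illustrate
    outputs = model(**inputs, labels=labels)
    outputs.loss.backward()             # loss comes from the new classification head
    optimizer.step()
    optimizer.zero_grad()
    print(float(outputs.loss))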

Pre-trained BERT Models and Tokenizers

  • Pre-trained BERT models and tokenizers are available in popular NLP libraries like Hugging Face's Transformers library. These pre-trained models provide a starting point for fine-tuning on specific tasks and can save considerable time and resources compared to training a model from scratch.
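For instance, a pre-trained checkpoint and its tokenizer can be loaded from the Hugging Face hub in a couple of lines, and the model's outputs already give the context-aware word representations discussed earlier (this sketch assumes transformers and torch are installed):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The boat drifted toward the river bank.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per token, already conditioned on context --
# "bank" here gets a different vector than it would in "bank robber".
print(outputs.last_hidden_state.shape)  # torch.Size([1, num_tokens, 768])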

Transfer Learning in BERT

  • BERT leverages transfer learning, a technique where a model pre-trained on one task is fine-tuned for another task, to achieve state-of-the-art performance on a wide range of NLP tasks. The pre-training phase allows BERT to learn general language understanding, which can then be adapted to specific tasks with fine-tuning. This approach has proven to be more effective than training separate models for each task.


5. Conclusion

  • Transformers and BERT models have revolutionized NLP and have become the foundation for many state-of-the-art models in various language tasks. By leveraging attention mechanisms, parallelization, and transfer learning, these models have greatly improved the performance and capabilities of NLP systems, making them an essential tool for researchers and practitioners alike.


Python Notebook for Text Classification with BERT

Check out the complete video lecture here:


