BERT Transformer

Bidirectional Encoder Representations from Transformers (BERT)

Consider an issue with Word2Vec: in the two sentences below, the word ‘fair’ has different meanings.

  • It was not a fair game.

  • The town fair was really fun.

But Word2Vec generates a single vector for ‘fair’ in both scenarios; the embedding carries no context.

BERT, by contrast, produces a contextualized embedding: the vector for a word depends on the sentence it appears in.
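A tiny sketch of the static-embedding limitation (the lookup table and vectors below are invented purely for illustration):

```python
# Toy static embedding table (made-up vectors, for illustration only).
static_embeddings = {
    "fair": [0.2, -0.1, 0.7],
    "game": [0.5, 0.3, -0.2],
    "town": [-0.4, 0.6, 0.1],
}

def embed(sentence):
    # A static model like Word2Vec looks each word up in the same table,
    # ignoring the rest of the sentence.
    return [static_embeddings.get(w, [0.0, 0.0, 0.0]) for w in sentence.split()]

v1 = embed("not a fair game")[2]  # 'fair' as in 'just'
v2 = embed("the town fair")[2]    # 'fair' as in 'festival'
print(v1 == v2)  # both senses get the identical vector
```

A contextual model such as BERT would instead produce two different vectors for ‘fair’ here, because each vector is computed from the whole sentence.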

The original English-language BERT has two models:

  1. BERTBASE: 12 encoder layers with 12 bidirectional self-attention heads (hidden size 768, ~110M parameters)

  2. BERTLARGE: 24 encoder layers with 16 bidirectional self-attention heads (hidden size 1024, ~340M parameters)
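As a rough sanity check on those model sizes, the parameter counts can be approximated from the layer count and hidden size. The hidden sizes (768 and 1024) and the ~30K WordPiece vocabulary come from the BERT paper and are assumptions here; the formula ignores biases and LayerNorm weights, so it slightly undercounts:

```python
def approx_params(layers, hidden, vocab=30522, max_pos=512, segments=2):
    # Embedding tables: token + position + segment embeddings.
    emb = (vocab + max_pos + segments) * hidden
    # Per encoder layer: attention projections (4 * H^2 for Q, K, V, output)
    # plus the feed-forward block (2 * H * 4H = 8 * H^2), biases ignored.
    per_layer = 12 * hidden * hidden
    return emb + layers * per_layer

print(approx_params(12, 768) / 1e6)   # roughly 110M, matching BERTBASE
print(approx_params(24, 1024) / 1e6)  # roughly 334M, near BERTLARGE's reported 340M
```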

Both models are pre-trained on unlabeled data extracted from the BooksCorpus (800M words) and English Wikipedia (2,500M words).

BERT was pretrained on two tasks:

  1. masked language modelling: 15% of input tokens were masked and BERT was trained to predict them from the surrounding context.

  2. next sentence prediction: BERT was trained to predict whether a chosen second sentence was a probable continuation of the first sentence.
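The masked-language-modelling input can be sketched in plain Python. The 15% rate comes from the text above; the whitespace tokenizer and the all-[MASK] replacement are simplifications (the actual procedure replaces only 80% of selected tokens with [MASK], leaving the rest random or unchanged):

```python
import random

MASK_RATE = 0.15  # fraction of tokens hidden from the model, per the pretraining setup

def mask_tokens(tokens, rng):
    """Mask ~15% of tokens; return masked tokens and the labels to predict."""
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < MASK_RATE:
            masked.append("[MASK]")
            labels[i] = tok  # the model must recover the original token here
        else:
            masked.append(tok)
    return masked, labels

rng = random.Random(1)
tokens = "the town fair was really fun".split()
masked, labels = mask_tokens(tokens, rng)
print(masked)  # with this seed, only the first token is masked
print(labels)
```

During pretraining, BERT sees the masked sequence and is trained to predict the original tokens recorded in `labels`.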

As a result of the training process, BERT learns contextual embeddings for words. After pretraining, which is computationally expensive, BERT can be fine-tuned with fewer resources on smaller datasets to optimize its performance on specific tasks.
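A minimal sketch of the fine-tuning idea, using an invented toy setup rather than a real Transformer: treat some fixed vectors as stand-ins for pretrained [CLS] embeddings, and train only a small classification head on top of them.

```python
import math

# Toy stand-ins for pretrained [CLS] embeddings (invented for illustration).
examples = [
    ([0.9, 0.1], 1),  # (embedding, label)
    ([0.8, 0.2], 1),
    ([0.1, 0.9], 0),
    ([0.2, 0.7], 0),
]

# Small task head: logistic regression on top of the frozen embeddings.
w, b = [0.0, 0.0], 0.0
lr = 0.5

def predict(x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 / (1 + math.exp(-z))

def loss():
    return -sum(
        y * math.log(predict(x)) + (1 - y) * math.log(1 - predict(x))
        for x, y in examples
    ) / len(examples)

before = loss()
for _ in range(200):  # gradient descent updates the head only
    for x, y in examples:
        err = predict(x) - y
        for i in range(len(w)):
            w[i] -= lr * err * x[i]
        b -= lr * err
after = loss()
print(before, after)  # loss drops as the head adapts to the task
```

Real fine-tuning typically updates all of BERT's weights (usually via a library such as Hugging Face Transformers), but the cheap part is the same: only a small new head is trained from scratch.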