BERT Demo#

Loading Libraries#

[1]:
import tensorflow_hub as hub
import tensorflow_text as text  # registers the custom TF ops the preprocessing model needs

Loading URLs#

The TF Hub URLs for the encoder and the matching preprocessing model:

[2]:
encoder_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"
preprocessing_url = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"

Loading BERT Preprocessor#

[3]:
preprocessor = hub.KerasLayer(preprocessing_url)

Testing the Preprocessor#

[5]:
test_text = [
    "this is first document",
    "I love pasta"
]

test_dict = preprocessor(test_text)
test_dict.keys()
[5]:
dict_keys(['input_word_ids', 'input_mask', 'input_type_ids'])

The preprocessor returns a dictionary with three entries:

  • input_word_ids : has the token ids of the input sequences.

  • input_mask : has value 1 at the position of all input tokens present before padding and value 0 for the padding tokens.

  • input_type_ids : has the index of the input segment that gave rise to the input token at the respective position. The first input segment (index 0) includes the start-of-sequence token and its end-of-segment token. The second segment (index 1, if present) includes its end-of-segment token. Padding tokens get index 0 again.

Every sentence/row is wrapped with a [CLS] token at the start and a [SEP] token at the end.

Each row is then padded to a fixed length of 128 tokens.
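A minimal pure-Python sketch of how these three outputs relate. The token IDs for the words are taken from the notebook output further below; 101 ([CLS]), 102 ([SEP]) and 0 ([PAD]) match the real BERT vocabulary, but the real preprocessor uses a WordPiece tokenizer rather than this hand-rolled wrapping:

```python
# Sketch of how the preprocessor builds its three outputs for one sentence.
SEQ_LEN = 128
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def preprocess_one(token_ids, seq_len=SEQ_LEN):
    ids = [CLS_ID] + token_ids + [SEP_ID]   # wrap with [CLS] ... [SEP]
    mask = [1] * len(ids)                   # 1 for real tokens
    pad = seq_len - len(ids)
    ids += [PAD_ID] * pad                   # zero-pad to the fixed length
    mask += [0] * pad                       # 0 for padding positions
    type_ids = [0] * seq_len                # single segment -> all zeros
    return ids, mask, type_ids

# "this is first document" as token IDs (from the notebook output below)
ids, mask, type_ids = preprocess_one([2023, 2003, 2034, 6254])
print(ids[:7])    # [101, 2023, 2003, 2034, 6254, 102, 0]
print(sum(mask))  # 6 real tokens before padding
```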

[9]:
test_dict['input_word_ids'].shape
[9]:
TensorShape([2, 128])

The first row corresponds to [CLS] this is first document [SEP], followed by zero padding (101 = [CLS], 102 = [SEP]):

[13]:
test_dict['input_word_ids'][0]
[13]:
<tf.Tensor: shape=(128,), dtype=int32, numpy=
array([ 101, 2023, 2003, 2034, 6254,  102,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0], dtype=int32)>
[14]:
test_dict['input_mask'].shape

[14]:
TensorShape([2, 128])
[15]:
test_dict['input_mask'][0]

[15]:
<tf.Tensor: shape=(128,), dtype=int32, numpy=
array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)>
[16]:
test_dict['input_type_ids'].shape
[16]:
TensorShape([2, 128])
[20]:
test_dict['input_type_ids'][0]

[20]:
<tf.Tensor: shape=(128,), dtype=int32, numpy=
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)>

Loading Encoder#

[26]:
encoder_model = hub.KerasLayer(encoder_url)

Testing the Encoder#

[28]:
encoded_dict = encoder_model(test_dict)
[29]:
encoded_dict.keys()
[29]:
dict_keys(['pooled_output', 'sequence_output', 'encoder_outputs', 'default'])
[30]:
encoded_dict['pooled_output']
[30]:
<tf.Tensor: shape=(2, 768), dtype=float32, numpy=
array([[-0.8564542 , -0.20264098,  0.44491994, ...,  0.18142326,
        -0.52242935,  0.839577  ],
       [-0.8231588 , -0.19543926,  0.4399575 , ...,  0.3558046 ,
        -0.5919596 ,  0.8541847 ]], dtype=float32)>

pooled_output gives one fixed-size embedding per sentence/row. We passed two sentences as input, so we get two 768-dimensional vectors representing their embeddings.
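Because pooled_output is one vector per sentence, the rows can be compared directly, for example with cosine similarity. A sketch using toy 4-dimensional stand-ins for the real 768-dimensional vectors (the numbers are just the first entries of the pooled outputs above, not a meaningful similarity):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for pooled_output[0] and pooled_output[1]
emb_a = [-0.85, -0.20, 0.44, 0.18]
emb_b = [-0.82, -0.19, 0.43, 0.35]
print(round(cosine_similarity(emb_a, emb_b), 3))  # 0.985
```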

[ ]:
encoded_dict['sequence_output'].shape
<tf.Tensor: shape=(2, 128, 768), dtype=float32, numpy=
array([[[-2.23829597e-01,  2.58546323e-01,  1.60378978e-01, ...,
         -1.75076678e-01,  2.57635832e-01,  3.58420372e-01],
        [-5.81663609e-01, -2.24008322e-01,  6.07548207e-02, ...,
         -4.35946345e-01,  7.22116292e-01,  2.05105364e-01],
        [-5.88114560e-01, -3.27617466e-01,  6.20402157e-01, ...,
         -4.78888750e-01,  5.28291821e-01,  6.42452359e-01],
        ...,
        [-2.36085236e-01,  1.05683051e-01,  5.52026212e-01, ...,
          1.31120771e-01,  4.79894549e-01,  3.75098825e-01],
        [-2.66874582e-01,  8.43454301e-02,  5.48812032e-01, ...,
          1.94243729e-01,  4.29042369e-01,  3.45293581e-01],
        [-3.05021435e-01,  4.44966406e-02,  5.43690860e-01, ...,
          2.48683482e-01,  4.10975337e-01,  2.66701370e-01]],

       [[ 8.02702308e-02,  2.39341095e-01,  6.48294538e-02, ...,
         -2.08112523e-01,  1.62344888e-01,  2.63515770e-01],
        [ 1.01238646e-01,  2.56397724e-01,  1.76448345e-01, ...,
         -3.21156919e-01,  7.11108208e-01,  3.40716913e-02],
        [ 1.00183558e+00,  7.70643294e-01,  6.90153003e-01, ...,
         -2.05520004e-01,  4.99139577e-01, -4.82482314e-02],
        ...,
        [ 1.50605410e-01,  2.65507936e-01,  6.01852715e-01, ...,
          2.56534293e-03, -7.26580620e-04,  2.05264628e-01],
        [ 7.44732320e-02,  1.70520827e-01,  4.65080142e-01, ...,
          6.46677464e-02, -2.52081901e-02,  1.48678273e-01],
        [-3.44444752e-01, -7.48260766e-02,  3.19036841e-01, ...,
          2.54593313e-01, -5.94146326e-02,  1.00927249e-01]]],
      dtype=float32)>
[35]:
enc_output = encoded_dict['encoder_outputs']
[36]:
len(enc_output)
[36]:
12

We are using BERT Base, which has 12 encoder layers (the L-12 in the model name), and encoder_outputs contains the output of each individual encoder layer.
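The architecture is encoded in the TF Hub model name itself: L is the number of layers, H the hidden size, and A the number of attention heads. A small sketch that pulls these out of the encoder URL used above:

```python
import re

encoder_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"

def parse_bert_config(url):
    """Extract layers (L), hidden size (H) and attention heads (A) from the model name."""
    m = re.search(r"L-(\d+)_H-(\d+)_A-(\d+)", url)
    layers, hidden, heads = (int(g) for g in m.groups())
    return {"layers": layers, "hidden": hidden, "heads": heads}

config = parse_bert_config(encoder_url)
print(config)  # {'layers': 12, 'hidden': 768, 'heads': 12}
```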

[38]:
enc_output[0].shape

[38]:
TensorShape([2, 128, 768])

We have 2 sentences, so the first dimension is 2; 128 is the padded sequence length, and 768 is the embedding size.

[39]:
enc_output[-1] == encoded_dict['sequence_output']

[39]:
<tf.Tensor: shape=(2, 128, 768), dtype=bool, numpy=
array([[[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]],

       [[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]]])>

So the output of the last encoder layer is identical to sequence_output.

[41]:
encoded_dict['default'] == encoded_dict['pooled_output']
[41]:
<tf.Tensor: shape=(2, 768), dtype=bool, numpy=
array([[ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True]])>