BERT Demo#
References#
https://www.tensorflow.org/hub
https://tfhub.dev/google/collections/bert/1
https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4
Loading Libraries#
[1]:
import tensorflow_hub as hub
import tensorflow_text as text  # registers the TF ops required by the BERT preprocessing model
Loading URLs#
The TF Hub URLs for the BERT encoder and its matching preprocessing model.
[2]:
encoder_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"
preprocessing_url = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
Loading BERT Preprocessor#
[3]:
preprocessor = hub.KerasLayer(preprocessing_url)
Testing Preprocessor#
[5]:
test_text = [
"this is first document",
"I love pasta"
]
test_dict = preprocessor(test_text)
test_dict.keys()
[5]:
dict_keys(['input_word_ids', 'input_mask', 'input_type_ids'])
The preprocessor returns a dictionary with 3 keys.
input_word_ids : the token ids of the input sequences.
input_mask : has value 1 at the position of every real input token (before padding) and value 0 at every padding position.
input_type_ids : the index of the input segment that gave rise to the token at each position. The first input segment (index 0) includes the start-of-sequence token and its end-of-segment token. The second segment (index 1, if present) includes its end-of-segment token. Padding tokens get index 0 again.
Every sentence/row is prefixed with a [CLS] token and suffixed with a [SEP] token, and each row is padded to a fixed length of 128 tokens.
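The layout of the three arrays can be sketched in plain Python. This is a toy stand-in for the preprocessor (not the real tokenizer), using the token ids shown in the outputs below and a padded length of 12 instead of 128:

```python
# Toy stand-in for the preprocessor output for "this is first document",
# padded to length 12 instead of 128. 101 = [CLS], 102 = [SEP], 0 = padding.
SEQ_LEN = 12
tokens = [101, 2023, 2003, 2034, 6254, 102]  # [CLS] this is first document [SEP]
input_word_ids = tokens + [0] * (SEQ_LEN - len(tokens))
input_mask = [1] * len(tokens) + [0] * (SEQ_LEN - len(tokens))
input_type_ids = [0] * SEQ_LEN               # single segment -> all zeros

# Summing the mask recovers the real (unpadded) length, incl. [CLS] and [SEP].
real_length = sum(input_mask)
print(real_length)  # -> 6
```

The same `sum` trick applied to the real `input_mask` tensor (with `tf.reduce_sum`) gives the unpadded length of each row.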
[9]:
test_dict['input_word_ids'].shape
[9]:
TensorShape([2, 128])
CLS this is first document SEP
[13]:
test_dict['input_word_ids'][0]
[13]:
<tf.Tensor: shape=(128,), dtype=int32, numpy=
array([ 101, 2023, 2003, 2034, 6254, 102, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0], dtype=int32)>
[14]:
test_dict['input_mask'].shape
[14]:
TensorShape([2, 128])
[15]:
test_dict['input_mask'][0]
[15]:
<tf.Tensor: shape=(128,), dtype=int32, numpy=
array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)>
[16]:
test_dict['input_type_ids'].shape
[16]:
TensorShape([2, 128])
[20]:
test_dict['input_type_ids'][0]
[20]:
<tf.Tensor: shape=(128,), dtype=int32, numpy=
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)>
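Both test sentences are single-segment inputs, which is why input_type_ids is all zeros here. For a hypothetical two-segment input (e.g. a question/answer pair), the segment indices would look like this sketch (padded to length 10 instead of 128):

```python
# Hypothetical two-segment layout:
# [CLS] q1 q2 [SEP] a1 a2 a3 [SEP] pad pad
SEQ_LEN = 10
segment_a = 4  # [CLS] + 2 tokens + [SEP] -> segment index 0
segment_b = 4  # 3 tokens + [SEP]         -> segment index 1
input_type_ids = [0] * segment_a + [1] * segment_b + [0] * (SEQ_LEN - segment_a - segment_b)
print(input_type_ids)  # -> [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
```

Note that the padding positions get index 0 again, matching the description above.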
Loading Encoder#
[26]:
encoder_model = hub.KerasLayer(encoder_url)
Testing Encoder#
[28]:
encoded_dict = encoder_model(test_dict)
[29]:
encoded_dict.keys()
[29]:
dict_keys(['pooled_output', 'sequence_output', 'encoder_outputs', 'default'])
[30]:
encoded_dict['pooled_output']
[30]:
<tf.Tensor: shape=(2, 768), dtype=float32, numpy=
array([[-0.8564542 , -0.20264098, 0.44491994, ..., 0.18142326,
-0.52242935, 0.839577 ],
[-0.8231588 , -0.19543926, 0.4399575 , ..., 0.3558046 ,
-0.5919596 , 0.8541847 ]], dtype=float32)>
pooled_output gives one embedding per sentence/row: we passed in 2 sentences, so we get two 768-dimensional pooled vectors. (This is the [CLS] token's final hidden state passed through a dense layer with tanh activation.)
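A typical use of pooled_output is as a fixed-size sentence embedding, e.g. for comparing sentences by cosine similarity. A minimal pure-Python sketch, using toy 4-dimensional vectors as stand-ins for the 768-dimensional pooled outputs:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-ins for pooled_output[0] and pooled_output[1].
emb_doc = [-0.86, -0.20, 0.44, 0.84]
emb_pasta = [-0.82, -0.20, 0.44, 0.85]
print(cosine_similarity(emb_doc, emb_pasta))
```

With the real tensors you would apply the same formula row-wise (e.g. via `tf.keras.losses.cosine_similarity` or normalized dot products).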
[ ]:
encoded_dict['sequence_output']
<tf.Tensor: shape=(2, 128, 768), dtype=float32, numpy=
array([[[-2.23829597e-01, 2.58546323e-01, 1.60378978e-01, ...,
-1.75076678e-01, 2.57635832e-01, 3.58420372e-01],
[-5.81663609e-01, -2.24008322e-01, 6.07548207e-02, ...,
-4.35946345e-01, 7.22116292e-01, 2.05105364e-01],
[-5.88114560e-01, -3.27617466e-01, 6.20402157e-01, ...,
-4.78888750e-01, 5.28291821e-01, 6.42452359e-01],
...,
[-2.36085236e-01, 1.05683051e-01, 5.52026212e-01, ...,
1.31120771e-01, 4.79894549e-01, 3.75098825e-01],
[-2.66874582e-01, 8.43454301e-02, 5.48812032e-01, ...,
1.94243729e-01, 4.29042369e-01, 3.45293581e-01],
[-3.05021435e-01, 4.44966406e-02, 5.43690860e-01, ...,
2.48683482e-01, 4.10975337e-01, 2.66701370e-01]],
[[ 8.02702308e-02, 2.39341095e-01, 6.48294538e-02, ...,
-2.08112523e-01, 1.62344888e-01, 2.63515770e-01],
[ 1.01238646e-01, 2.56397724e-01, 1.76448345e-01, ...,
-3.21156919e-01, 7.11108208e-01, 3.40716913e-02],
[ 1.00183558e+00, 7.70643294e-01, 6.90153003e-01, ...,
-2.05520004e-01, 4.99139577e-01, -4.82482314e-02],
...,
[ 1.50605410e-01, 2.65507936e-01, 6.01852715e-01, ...,
2.56534293e-03, -7.26580620e-04, 2.05264628e-01],
[ 7.44732320e-02, 1.70520827e-01, 4.65080142e-01, ...,
6.46677464e-02, -2.52081901e-02, 1.48678273e-01],
[-3.44444752e-01, -7.48260766e-02, 3.19036841e-01, ...,
2.54593313e-01, -5.94146326e-02, 1.00927249e-01]]],
dtype=float32)>
[35]:
enc_output = encoded_dict['encoder_outputs']
[36]:
len(enc_output)
[36]:
12
We are using BERT Base, which has 12 encoder layers (the L-12 in the model name); encoder_outputs is a list containing the output of each individual encoder layer.
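A common trick with these per-layer outputs is to average the last few encoder layers instead of taking only the final one. A pure-Python sketch with toy nested lists standing in for the twelve (2, 128, 768) tensors (here shrunk to batch=1, seq=2, dim=3, with layer l filled with the value l so the averages are easy to check):

```python
# Toy encoder_outputs: 12 layers, each shaped (batch=1, seq=2, dim=3),
# where layer l's entries are all equal to l.
encoder_outputs = [[[[float(l)] * 3 for _ in range(2)]] for l in range(12)]

def average_last_n(layers, n):
    """Element-wise mean of the last n layer outputs."""
    last = layers[-n:]
    return [
        [
            [sum(layer[b][t][d] for layer in last) / n
             for d in range(len(last[0][b][t]))]
            for t in range(len(last[0][b]))
        ]
        for b in range(len(last[0]))
    ]

avg = average_last_n(encoder_outputs, 4)  # averages layers 8..11
print(avg[0][0][0])  # -> 9.5  ((8 + 9 + 10 + 11) / 4)
```

With the real tensors the same thing is one line: `tf.reduce_mean(tf.stack(enc_output[-4:]), axis=0)`.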
[38]:
enc_output[0].shape
[38]:
TensorShape([2, 128, 768])
We have 2 sentences, so the first dimension is 2; 128 is the padded sequence length, and 768 is the per-token embedding size.
[39]:
enc_output[-1] == encoded_dict['sequence_output']
[39]:
<tf.Tensor: shape=(2, 128, 768), dtype=bool, numpy=
array([[[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
...,
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True]],
[[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
...,
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True]]])>
So the last entry of encoder_outputs is exactly sequence_output.
[41]:
encoded_dict['default'] == encoded_dict['pooled_output']
[41]:
<tf.Tensor: shape=(2, 768), dtype=bool, numpy=
array([[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True]])>
And default is simply an alias for pooled_output.