Count Based Vectorizer#

[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import (
    TfidfVectorizer, CountVectorizer, TfidfTransformer
)

Count Vectorizer#

CountVectorizer builds a vocabulary from the set of words in the corpus and counts how many times each word occurs in each document.

[2]:
corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one'
]
[3]:
count_vec = CountVectorizer()
count_vec.fit(corpus)
[3]:
CountVectorizer()
[4]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
count_vec.get_feature_names_out().tolist()
[4]:
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[5]:
count_data = count_vec.transform(corpus).toarray()
count_data
[5]:
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1]])
[6]:
pd.DataFrame(
    data=count_data,
    columns=count_vec.get_feature_names_out()
)
[6]:
   and  document  first  is  one  second  the  third  this
0    0         1      1   1    0       0    1      0     1
1    0         2      0   1    0       1    1      0     1
2    1         0      0   1    1       0    1      1     1

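The first row corresponds to the first document, "this is the first document": the words document, first, is, the, and this each have a count of 1, and every other word in the vocabulary has a count of 0. In the second row, document has a count of 2 because the word appears twice in the second document.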
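The counting step can be reproduced by hand, which makes it clearer what the matrix above contains. The following is a minimal sketch, not scikit-learn's implementation: it assumes plain whitespace splitting is enough for this corpus, whereas CountVectorizer by default lowercases the text and keeps only tokens of two or more word characters. The names `tokenized`, `vocabulary`, and `count_rows` are illustrative.

```python
from collections import Counter

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
]

# Assumption: whitespace splitting is sufficient for this small corpus.
tokenized = [doc.split() for doc in corpus]

# Vocabulary: every distinct token, sorted alphabetically.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one column per vocabulary word.
count_rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    count_rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in count_rows:
    print(row)
```

The rows printed here should match the array returned by `count_vec.transform(corpus).toarray()` for this corpus.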

Term Frequency Transformer#

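Term frequency is the number of times term t appears in document d, divided by the total number of terms in d:

\begin{align}
tf_{t,d} &= \frac{n_{t,d}}{\sum_k n_{k,d}}
\end{align}

Note that TfidfTransformer(use_idf=False) uses norm='l2' by default, so it divides each row of counts by its Euclidean norm rather than by its sum. In the first document, for example, each of the five words scores \(1/\sqrt{5} \approx 0.447\) instead of \(1/5 = 0.2\).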

[7]:
tf_transformer = TfidfTransformer(use_idf=False)
tf_transformer.fit(count_data)
[7]:
TfidfTransformer(use_idf=False)
[8]:
tf_data = tf_transformer.transform(count_data).toarray()
tf_data
[8]:
array([[0.        , 0.4472136 , 0.4472136 , 0.4472136 , 0.        ,
        0.        , 0.4472136 , 0.        , 0.4472136 ],
       [0.        , 0.70710678, 0.        , 0.35355339, 0.        ,
        0.35355339, 0.35355339, 0.        , 0.35355339],
       [0.40824829, 0.        , 0.        , 0.40824829, 0.40824829,
        0.        , 0.40824829, 0.40824829, 0.40824829]])
[9]:
pd.DataFrame(
    data=tf_data,
    columns=count_vec.get_feature_names_out()
)
[9]:
        and  document     first        is       one    second       the     third      this
0  0.000000  0.447214  0.447214  0.447214  0.000000  0.000000  0.447214  0.000000  0.447214
1  0.000000  0.707107  0.000000  0.353553  0.000000  0.353553  0.353553  0.000000  0.353553
2  0.408248  0.000000  0.000000  0.408248  0.408248  0.000000  0.408248  0.408248  0.408248
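The values in the table above are counts divided by the Euclidean length of each row, not by the row sum that the term frequency formula describes. The sketch below computes both normalizations from the count matrix so the difference is visible; the helper names `sum_normalize` and `l2_normalize` are illustrative and not part of scikit-learn.

```python
import math

count_data = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
]

def sum_normalize(row):
    """Divide each count by the total number of terms in the document."""
    total = sum(row)
    return [n / total for n in row]

def l2_normalize(row):
    """Divide each count by the Euclidean norm of the row."""
    norm = math.sqrt(sum(n * n for n in row))
    return [n / norm for n in row]

tf_sum = [sum_normalize(row) for row in count_data]
tf_l2 = [l2_normalize(row) for row in count_data]

print([round(v, 4) for v in tf_sum[0]])  # formula: n / sum of counts
print([round(v, 4) for v in tf_l2[0]])   # what norm='l2' produces
```

If the sum-normalized version is what you want from scikit-learn, passing `norm='l1'` to TfidfTransformer should give it, since the L1 norm of a row of non-negative counts is its sum.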

Inverse Document Frequency Transformer#

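Document frequency: the number of documents in which the term appears, divided by the total number of documents. It can be read as the probability p that a randomly chosen document contains the term.

\begin{align}
p = df_{t,d,D} &= \frac{|\{d \in D : t \in d\}|}{|D|}
\end{align}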

Inverse document frequency is the log of the total number of documents |D| divided by the number of documents that contain the term t. It gives rare terms a higher weight and terms that appear in most documents a lower weight.

\begin{align}
idf &= -\log(p) \\
idf &= \log\left(\frac{1}{p}\right) \\
idf_{t,d,D} &= \log\left(\frac{|D|}{|\{d \in D : t \in d\}|}\right)
\end{align}

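Here D is the set of all documents, d is a single document in D, and t is a term in d. scikit-learn's TfidfTransformer does not use this formula exactly: with the default smooth_idf=True it computes \(idf = \ln\frac{1 + |D|}{1 + |\{d \in D : t \in d\}|} + 1\), and with the default norm='l2' it then scales each row to unit Euclidean length.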
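To make the document frequency and idf definitions concrete, the sketch below computes them from the count matrix of the example corpus. It shows the plain log formula next to the smoothed variant that scikit-learn documents for `smooth_idf=True`; the variable names are illustrative.

```python
import math

vocabulary = ['and', 'document', 'first', 'is', 'one',
              'second', 'the', 'third', 'this']
count_data = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
]
n_docs = len(count_data)

# Number of documents in which each term occurs at least once.
doc_freq = [
    sum(1 for row in count_data if row[j] > 0)
    for j in range(len(vocabulary))
]

# Plain formula: log(|D| / number of documents containing t).
idf_plain = [math.log(n_docs / df) for df in doc_freq]

# Smoothed variant: ln((1 + |D|) / (1 + df)) + 1.
idf_smooth = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

for word, df, plain, smooth in zip(vocabulary, doc_freq, idf_plain, idf_smooth):
    print(f"{word:>8}  df={df}  idf={plain:.4f}  smoothed={smooth:.4f}")
```

With the plain formula, words that occur in every document (is, the, this) get an idf of 0 and would vanish from the result. The smoothed variant gives them a weight of 1 instead, so they stay in the matrix with a lower weight than rarer words.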

[10]:
tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(count_data)
[10]:
TfidfTransformer()
[11]:
tfidf_data = tfidf_transformer.transform(count_data).toarray()
tfidf_data
[11]:
array([[0.        , 0.46941728, 0.61722732, 0.3645444 , 0.        ,
        0.        , 0.3645444 , 0.        , 0.3645444 ],
       [0.        , 0.7284449 , 0.        , 0.28285122, 0.        ,
        0.47890875, 0.28285122, 0.        , 0.28285122],
       [0.49711994, 0.        , 0.        , 0.29360705, 0.49711994,
        0.        , 0.29360705, 0.49711994, 0.29360705]])
[12]:
pd.DataFrame(
    data=tfidf_data,
    columns=count_vec.get_feature_names_out()
)
[12]:
       and  document     first        is      one    second       the    third      this
0  0.00000  0.469417  0.617227  0.364544  0.00000  0.000000  0.364544  0.00000  0.364544
1  0.00000  0.728445  0.000000  0.282851  0.00000  0.478909  0.282851  0.00000  0.282851
2  0.49712  0.000000  0.000000  0.293607  0.49712  0.000000  0.293607  0.49712  0.293607
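The numbers in this table can be recomputed without scikit-learn, which helps confirm how they arise. The sketch below assumes the transformer's documented defaults (raw counts as term frequency, smoothed idf, then L2 normalization of each row); it is a hand calculation for this corpus, not the library's implementation.

```python
import math

count_data = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
]
n_docs = len(count_data)
n_terms = len(count_data[0])

# Document frequency of each column, then the smoothed idf weight.
doc_freq = [sum(1 for row in count_data if row[j] > 0) for j in range(n_terms)]
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

tfidf_rows = []
for row in count_data:
    # Weight each raw count by the idf of its term.
    weighted = [n * w for n, w in zip(row, idf)]
    # Scale the row to unit Euclidean length.
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf_rows.append([v / norm for v in weighted])

for row in tfidf_rows:
    print([round(v, 6) for v in row])
```

In the first row, first scores higher than document even though both occur once, because first appears in only one document of the corpus while document appears in two.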

TFIDF Vectorizer#

TfidfVectorizer combines both steps: it is equivalent to a CountVectorizer followed by a TfidfTransformer.

\(tfidf_{t,d,D} = tf_{t,d} \cdot idf_{t,d,D}\)

[13]:
tfidf_vec = TfidfVectorizer()
tfidf_vec.fit(corpus)
[13]:
TfidfVectorizer()
[14]:
tfidf_data = tfidf_vec.transform(corpus).toarray()
[15]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tfidf_vec.get_feature_names_out().tolist()
[15]:
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[16]:
pd.DataFrame(
    data=tfidf_data,
    columns=tfidf_vec.get_feature_names_out()
)
[16]:
       and  document     first        is      one    second       the    third      this
0  0.00000  0.469417  0.617227  0.364544  0.00000  0.000000  0.364544  0.00000  0.364544
1  0.00000  0.728445  0.000000  0.282851  0.00000  0.478909  0.282851  0.00000  0.282851
2  0.49712  0.000000  0.000000  0.293607  0.49712  0.000000  0.293607  0.49712  0.293607
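The equivalence between the two-step route and TfidfVectorizer can be checked directly. The sketch below assumes default parameters on every estimator and uses `make_pipeline` from `sklearn.pipeline` to chain the two steps; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer
)
from sklearn.pipeline import make_pipeline

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
]

# Two steps: count the words, then reweight the counts.
two_step = make_pipeline(CountVectorizer(), TfidfTransformer())
two_step_data = two_step.fit_transform(corpus).toarray()

# One step: TfidfVectorizer does both internally.
one_step_data = TfidfVectorizer().fit_transform(corpus).toarray()

# Compare the two matrices element by element.
print(np.allclose(two_step_data, one_step_data))
```

Keeping the steps separate can still be useful when the raw counts are needed elsewhere, for example to inspect them or to feed them to a model that expects integer counts.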