Count Based Vectorizer

References

https://en.wikipedia.org/wiki/Tf%E2%80%93idf

Count Vectorizer

A count vectorizer represents each document by how often each word of the corpus vocabulary occurs in it.

[2]:

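import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one'
]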

[3]:

count_vec = CountVectorizer()
count_vec.fit(corpus)

[3]:

CountVectorizer()

[4]:

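# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() returns an array
count_vec.get_feature_names_out().tolist()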

[4]:

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

[5]:

count_data = count_vec.transform(corpus).toarray()
count_data

[5]:

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1]])

[6]:

pd.DataFrame(
    data=count_data,
    columns=count_vec.get_feature_names_out()
)

[6]:

	and	document	first	is	one	second	the	third	this
0	0	1	1	1	0	0	1	0	1
1	0	2	0	1	0	1	1	0	1
2	1	0	0	1	1	0	1	1	1

The first document is 'this is the first document', so the first row has a count of 1 for document, first, is, the and this, and 0 for every other term.
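The counting step can be sketched without scikit-learn. This is a minimal illustration, assuming the vectorizer's default behavior of lowercasing the text and keeping tokens of two or more word characters; the helper name `tokenize` is hypothetical.

```python
import re
from collections import Counter

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
]


def tokenize(text):
    # Assumption: lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


# The vocabulary is the sorted set of all tokens; a token's position in
# this list is its column index in the count matrix.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

# One row per document, one column per vocabulary term.
count_rows = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    count_rows.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in count_rows:
    print(row)
```

Sorting the vocabulary is what puts the columns in alphabetical order, which is why 'and' comes first even though it only appears in the last document.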

Term Frequency Transformer

Term frequency is the number of times term t appears in document d, divided by the total number of terms in d.

\begin{align}
tf_{t,d} &= \frac{n_{t,d}}{\sum_k n_{k,d}}
\end{align}

[7]:

tf_transformer = TfidfTransformer(use_idf=False)
tf_transformer.fit(count_data)

[7]:

TfidfTransformer(use_idf=False)

[8]:

tf_data = tf_transformer.transform(count_data).toarray()
tf_data

[8]:

array([[0.        , 0.4472136 , 0.4472136 , 0.4472136 , 0.        ,
        0.        , 0.4472136 , 0.        , 0.4472136 ],
       [0.        , 0.70710678, 0.        , 0.35355339, 0.        ,
        0.35355339, 0.35355339, 0.        , 0.35355339],
       [0.40824829, 0.        , 0.        , 0.40824829, 0.40824829,
        0.        , 0.40824829, 0.40824829, 0.40824829]])

[9]:

pd.DataFrame(
    data=tf_data,
    columns=count_vec.get_feature_names_out()
)

[9]:

	and	document	first	is	one	second	the	third	this
0	0.000000	0.447214	0.447214	0.447214	0.000000	0.000000	0.447214	0.000000	0.447214
1	0.000000	0.707107	0.000000	0.353553	0.000000	0.353553	0.353553	0.000000	0.353553
2	0.408248	0.000000	0.000000	0.408248	0.408248	0.000000	0.408248	0.408248	0.408248
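Note that the values in the table above are not the plain count ratios of the formula. TfidfTransformer normalizes each row to unit Euclidean (L2) length by default, so a document with five distinct terms gets 1/sqrt(5) ≈ 0.447 per term rather than 1/5. A minimal sketch of both normalizations for the first document, using only the standard library (the variable names are illustrative):

```python
import math

# Count vector of the first document, in vocabulary order:
# and, document, first, is, one, second, the, third, this
counts = [0, 1, 1, 1, 0, 0, 1, 0, 1]

# Term frequency as in the formula: each count divided by the total count.
total = sum(counts)
tf_ratio = [n / total for n in counts]

# L2 normalization: each count divided by the Euclidean length of the row.
length = math.sqrt(sum(n * n for n in counts))
tf_l2 = [n / length for n in counts]

print([round(x, 4) for x in tf_ratio])
print([round(x, 4) for x in tf_l2])
```

Passing `norm='l1'` to the transformer should reproduce the ratio form, since L1 normalization divides each row by the sum of its counts.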

Inverse Document Frequency Transformer

Document frequency: the number of documents in which the term appears, divided by the total number of documents.

\begin{align}
\text{probability } p = df_{t,d,D} &= \frac{|\{d \in D : t \in d\}|}{|D|}
\end{align}

Inverse document frequency is the log of the number of documents in the corpus D divided by the number of documents that contain the term t. It gives rare terms a higher weight than terms that appear in most documents.

\begin{align}
idf &= -\log(p) \\
idf &= \log\left(\frac{1}{p}\right) \\
idf_{t,d,D} &= \log\left(\frac{|D|}{|\{d \in D : t \in d\}|}\right)
\end{align}

Here d ranges over the documents in the corpus D, and t is a term that appears in d.
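As a worked example, the idf of each term in the three-document corpus can be computed directly from its document frequency. The sketch below puts the textbook formula next to a smoothed variant. The smoothed form, log((1 + D) / (1 + df)) + 1, is assumed here to be what scikit-learn applies by default (`smooth_idf=True`), which would explain why terms that occur in every document still get a non-zero weight in the output that follows.

```python
import math

# Document frequencies read off the count matrix: the number of documents
# (out of 3) in which each term appears at least once.
n_documents = 3
doc_freq = {
    'and': 1, 'document': 2, 'first': 1, 'is': 3, 'one': 1,
    'second': 1, 'the': 3, 'third': 1, 'this': 3,
}

# Textbook idf: log(D / df). Terms present in every document get 0.
idf_plain = {t: math.log(n_documents / df) for t, df in doc_freq.items()}

# Smoothed variant (assumption: matches the smooth_idf=True default):
# log((1 + D) / (1 + df)) + 1, so no term is weighted down to zero.
idf_smooth = {
    t: math.log((1 + n_documents) / (1 + df)) + 1
    for t, df in doc_freq.items()
}

for term in doc_freq:
    print(f"{term:10s} {idf_plain[term]:.4f} {idf_smooth[term]:.4f}")
```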

[10]:

tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(count_data)

[10]:

TfidfTransformer()

[11]:

tfidf_data = tfidf_transformer.transform(count_data).toarray()
tfidf_data

[11]:

array([[0.        , 0.46941728, 0.61722732, 0.3645444 , 0.        ,
        0.        , 0.3645444 , 0.        , 0.3645444 ],
       [0.        , 0.7284449 , 0.        , 0.28285122, 0.        ,
        0.47890875, 0.28285122, 0.        , 0.28285122],
       [0.49711994, 0.        , 0.        , 0.29360705, 0.49711994,
        0.        , 0.29360705, 0.49711994, 0.29360705]])

[12]:

pd.DataFrame(
    data=tfidf_data,
    columns=count_vec.get_feature_names_out()
)

[12]:

	and	document	first	is	one	second	the	third	this
0	0.00000	0.469417	0.617227	0.364544	0.00000	0.000000	0.364544	0.00000	0.364544
1	0.00000	0.728445	0.000000	0.282851	0.00000	0.478909	0.282851	0.00000	0.282851
2	0.49712	0.000000	0.000000	0.293607	0.49712	0.000000	0.293607	0.49712	0.293607

TFIDF Vectorizer

A TfidfVectorizer is equivalent to a CountVectorizer followed by a TfidfTransformer.

\(tfidf_{t,d,D} = tf_{t,d} \cdot idf_{t,d,D}\)
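The product above can be sketched end to end in plain Python: count the terms, multiply each count by its idf, and normalize each row. This is an illustration rather than scikit-learn's implementation. The helper names `tokenize` and `tfidf_rows` are hypothetical, and the smoothed idf and L2 normalization are assumed to match the library defaults.

```python
import math
import re
from collections import Counter

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
]


def tokenize(text):
    # Assumption: lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_rows(documents):
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_documents = len(documents)

    # Document frequency: count each term once per document it appears in.
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))

    # Smoothed idf (assumed default): log((1 + D) / (1 + df)) + 1.
    idf = {
        t: math.log((1 + n_documents) / (1 + doc_freq[t])) + 1
        for t in vocabulary
    }

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocabulary]
        # L2-normalize the row; guard against documents with no tokens.
        length = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / length for w in weights])
    return vocabulary, rows


vocabulary, rows = tfidf_rows(corpus)
for row in rows:
    print([round(x, 6) for x in row])
```

Under those assumptions the first row comes out as 0.469417 for document, 0.617227 for first and 0.364544 for is, the and this, which agrees with the TfidfTransformer table above.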

[13]:

tfidf_vec = TfidfVectorizer()
tfidf_vec.fit(corpus)

[13]:

TfidfVectorizer()

[14]:

tfidf_data = tfidf_vec.transform(corpus).toarray()

[15]:

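# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() returns an array
tfidf_vec.get_feature_names_out().tolist()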

[15]:

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

[16]:

pd.DataFrame(
    data=tfidf_data,
    columns=tfidf_vec.get_feature_names_out()
)

[16]:

	and	document	first	is	one	second	the	third	this
0	0.00000	0.469417	0.617227	0.364544	0.00000	0.000000	0.364544	0.00000	0.364544
1	0.00000	0.728445	0.000000	0.282851	0.00000	0.478909	0.282851	0.00000	0.282851
2	0.49712	0.000000	0.000000	0.293607	0.49712	0.000000	0.293607	0.49712	0.293607
