Corpus#
[1]:
import nltk
Corpus is Latin for 'body'; the plural is corpora.
Text Corpora / Text Corpus#
Gutenberg Corpus
Web and Chat Text
Brown Corpus
Reuters Corpus
Inaugural Address Corpus
Annotated Text Corpora
>>> nltk.download('punkt')
>>> nltk.download('book')
The downloaded data is stored in the user's home directory (by default, in a folder named nltk_data).
Gutenberg Corpus#
NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books.
Brown#
The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University.
Inaugural#
The Inaugural Address Corpus is a collection of US presidential inaugural addresses.
Popular Text Corpora#
stopwords : Collection of stop words.
reuters : Collection of news articles.
cmudict : Collection of CMU Pronouncing Dictionary entries.
movie_reviews : Collection of movie reviews.
nps_chat : Collection of chat text.
names : Collection of names associated with males and females.
state_union : Collection of State of the Union addresses.
wordnet : Collection of all lexical entries.
words : Collection of words in the Wordlist corpus.
Text Corpus Structure#
A text corpus is typically organized in one of the following four structures:
Isolated - Holds individual text collections.
Categorized - Each text collection is tagged with a single category.
Overlapping - Each text collection is tagged with one or more categories.
Temporal - Each text collection is tagged with a period, date, time, etc.
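To make the four structures concrete, here is a minimal sketch that models each one with plain Python containers instead of NLTK corpus readers. All file names, categories, and years below are hypothetical and chosen only for illustration, as is the helper fileids_for.

```python
# Hypothetical file ids and categories, for illustration only.

# Isolated: a flat collection of independent texts.
isolated = ["novel_a.txt", "novel_b.txt"]

# Categorized: each text belongs to exactly one category.
categorized = {"doc1.txt": "news", "doc2.txt": "fiction"}

# Overlapping: each text may belong to several categories.
overlapping = {"wire1.txt": ["grain", "trade"], "wire2.txt": ["trade"]}

# Temporal: each text is tagged with a period (here, a year).
temporal = {"1789-speech.txt": 1789, "1793-speech.txt": 1793}


def fileids_for(category, mapping):
    """Return the sorted file ids tagged with `category` in an overlapping corpus."""
    return sorted(f for f, cats in mapping.items() if category in cats)


print(fileids_for("trade", overlapping))   # texts sharing one category
print(sorted(temporal, key=temporal.get))  # texts in chronological order
```

Among the corpora listed earlier, Gutenberg is an example of an isolated collection, Brown is categorized, Reuters has overlapping categories, and the Inaugural Address Corpus is temporal.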
[2]:
from nltk.corpus import genesis
genesis.fileids()
[2]:
['english-kjv.txt',
'english-web.txt',
'finnish.txt',
'french.txt',
'german.txt',
'lolcat.txt',
'portuguese.txt',
'swedish.txt']
[3]:
from prettytable import PrettyTable

# One row per Genesis translation: average word and sentence lengths.
x = PrettyTable()
x.field_names = ["average word length", "average sentence length", "fileids"]
for fileid in genesis.fileids():
    n_chars = len(genesis.raw(fileid))    # characters, including whitespace
    n_words = len(genesis.words(fileid))  # tokens, including punctuation
    n_sents = len(genesis.sents(fileid))  # sentences
    x.add_row([int(n_chars / n_words), int(n_words / n_sents), fileid])
print(x)
+---------------------+-------------------------+-----------------+
| average word length | average sentence length |     fileids     |
+---------------------+-------------------------+-----------------+
|          4          |            30           | english-kjv.txt |
|          4          |            19           | english-web.txt |
|          5          |            15           |   finnish.txt   |
|          4          |            23           |    french.txt   |
|          4          |            23           |    german.txt   |
|          4          |            20           |    lolcat.txt   |
|          4          |            27           | portuguese.txt  |
|          4          |            30           |   swedish.txt   |
+---------------------+-------------------------+-----------------+
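The two statistics in the table are simple ratios of counts. The sketch below reproduces the same arithmetic on a tiny hand-tokenized text so that it runs without any corpus data. The function name corpus_stats and the sample text are assumptions made for illustration. Because the character count comes from the raw string, it includes spaces and punctuation, so the "average word length" is only a rough estimate.

```python
def corpus_stats(raw, words, sents):
    """Return (average word length, average sentence length), truncated to ints."""
    avg_word_len = int(len(raw) / len(words))    # characters per token
    avg_sent_len = int(len(words) / len(sents))  # tokens per sentence
    return avg_word_len, avg_sent_len


# Hypothetical sample text, tokenized by hand the way a corpus reader would.
raw = "The cat sat. The dog barked loudly."
sents = [["The", "cat", "sat", "."],
         ["The", "dog", "barked", "loudly", "."]]
words = [w for s in sents for w in s]

print(corpus_stats(raw, words, sents))  # (3, 4)
```

Here the raw text has 35 characters, 9 tokens, and 2 sentences, so the truncated ratios are 3 and 4.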
[4]:
from nltk.corpus import inaugural
words = inaugural.words('1789-Washington.txt')
# Average number of times each distinct token is used in the speech.
int(len(words) / len(set(words)))
[4]:
2
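The computation in cell [4] divides the total number of tokens by the number of distinct tokens, which gives the average number of times each distinct token is used. A minimal sketch of the same idea as a reusable function follows. The name tokens_per_type and the ignore_case option are assumptions added for illustration, not part of NLTK.

```python
def tokens_per_type(tokens, ignore_case=False):
    """Average number of occurrences per distinct token."""
    if ignore_case:
        # Treat "The" and "the" as the same token.
        tokens = [t.lower() for t in tokens]
    return len(tokens) / len(set(tokens))


# Hypothetical token list, for illustration only.
tokens = ["the", "cat", "and", "the", "dog", "and", "The", "bird"]

print(tokens_per_type(tokens))                    # 8 tokens / 6 distinct
print(tokens_per_type(tokens, ignore_case=True))  # 8 tokens / 5 distinct
```

A result of 2 for Washington's 1789 address therefore means that each distinct token appears about twice on average, after truncation to an integer.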