Corpus#

[1]:
import nltk

Corpus is Latin for ‘body’; the plural is corpora.

Text Corpora /Text Corpus#

  • Gutenberg Corpus

  • Web and Chat Text

  • Brown Corpus

  • Reuters Corpus

  • Inaugural Address Corpus

  • Annotated Text Corpora

>>> nltk.download('punkt')
>>> nltk.download('book')

The data is downloaded to an nltk_data directory, typically in the user's home directory.

Gutenberg Corpus#

NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books, hosted at http://www.gutenberg.org/.

Brown#

The Brown Corpus was the first million-word electronic corpus of English.

Inaugural#

The Inaugural Address Corpus is a collection of US presidential inaugural speeches.

Text Corpus Structure#

A text corpus is organized into one of the following four structures:

  • Isolated - holds individual text collections with no further organization.

  • Categorized - each text collection is tagged with a single category.

  • Overlapping - each text collection is tagged with one or more categories.

  • Temporal - each text collection is tagged with a period, date, time, etc.
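To make the four structures concrete, here is a minimal sketch that models each one with plain Python containers. All file names, labels, and helper functions (`files_in_category`, `files_in_period`) are hypothetical and invented for illustration; they are not part of NLTK. Among the corpora listed above, Gutenberg is roughly an isolated collection, Brown is categorized by genre, Reuters has overlapping topic categories, and the Inaugural Address Corpus is temporal.

```python
# Toy, hypothetical data: file names and labels are invented for illustration.
isolated = ["novel_a.txt", "novel_b.txt"]  # no labels at all

categorized = {"news_01.txt": "news", "story_01.txt": "fiction"}  # one category each

overlapping = {  # one or more categories each
    "report_01.txt": ["grain", "trade"],
    "report_02.txt": ["trade"],
}

temporal = {"1789-speech.txt": 1789, "1861-speech.txt": 1861}  # tagged with a year


def files_in_category(mapping, category):
    """Return the sorted file names tagged with `category` in an overlapping corpus."""
    return sorted(f for f, cats in mapping.items() if category in cats)


def files_in_period(mapping, start, end):
    """Return the sorted file names whose year falls in [start, end]."""
    return sorted(f for f, year in mapping.items() if start <= year <= end)


print(files_in_category(overlapping, "trade"))
print(files_in_period(temporal, 1800, 1900))
```

In the overlapping case a single file can appear under several categories, which is why the lookup checks membership in a list rather than comparing against one label.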

[2]:
from nltk.corpus import genesis

genesis.fileids()
[2]:
['english-kjv.txt',
 'english-web.txt',
 'finnish.txt',
 'french.txt',
 'german.txt',
 'lolcat.txt',
 'portuguese.txt',
 'swedish.txt']
[3]:
from prettytable import PrettyTable

x = PrettyTable()
x.field_names = ["average word length","average sentence length","fileids"]
for fileid in genesis.fileids():
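    n_chars = len(genesis.raw(fileid))    # characters, including whitespace
    n_words = len(genesis.words(fileid))  # tokens, including punctuation
    n_sents = len(genesis.sents(fileid))  # sentences

    # int() truncates, so both averages are rounded down to whole numbers
    x.add_row([int(n_chars/n_words), int(n_words/n_sents), fileid])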

print(x)
+---------------------+-------------------------+-----------------+
| average word length | average sentence length |     fileids     |
+---------------------+-------------------------+-----------------+
|          4          |            30           | english-kjv.txt |
|          4          |            19           | english-web.txt |
|          5          |            15           |   finnish.txt   |
|          4          |            23           |    french.txt   |
|          4          |            23           |    german.txt   |
|          4          |            20           |    lolcat.txt   |
|          4          |            27           |  portuguese.txt |
|          4          |            30           |   swedish.txt   |
+---------------------+-------------------------+-----------------+
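Because the table truncates both ratios to integers, most rows look identical. As a rough sketch, the same two ratios can be computed on plain Python lists and kept to one decimal place. The function name `text_stats` and the hand-tokenized sample are assumptions made for illustration; they do not come from NLTK or from the genesis corpus.

```python
def text_stats(raw, words, sents):
    """Return (average word length, average sentence length), rounded to 1 decimal.

    raw   -- the full text as a string (whitespace included, as in the loop above)
    words -- list of tokens
    sents -- list of sentences, each a list of tokens
    """
    if not words or not sents:
        raise ValueError("need at least one word and one sentence")
    return round(len(raw) / len(words), 1), round(len(words) / len(sents), 1)


# Hypothetical sample, tokenized by hand.
raw = "In the beginning. God created the heaven."
sents = [["In", "the", "beginning", "."], ["God", "created", "the", "heaven", "."]]
words = [w for s in sents for w in s]

print(text_stats(raw, words, sents))
```

With the corpus loaded, the same function could be called as `text_stats(genesis.raw(fileid), genesis.words(fileid), genesis.sents(fileid))`. Note that the character count includes spaces, so the first figure overstates the true average word length.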
[4]:
from nltk.corpus import inaugural
words = inaugural.words('1789-Washington.txt')
# Tokens per distinct token: how often each distinct token is used on average
int(len(words) / len(set(words)))
[4]:
2
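The ratio above divides the number of tokens by the number of distinct tokens, so it measures how many times each distinct token is used on average. It counts punctuation and treats "The" and "the" as different tokens. The sketch below shows one possible refinement that lowercases tokens and drops non-alphabetic ones. The helper name `tokens_per_type` and the sample list are hypothetical, not part of NLTK.

```python
def tokens_per_type(words, normalize=True):
    """Average number of times each distinct word is used.

    With normalize=True, tokens are lowercased and non-alphabetic tokens
    (punctuation, numbers) are dropped before counting.
    """
    if normalize:
        words = [w.lower() for w in words if w.isalpha()]
    if not words:
        raise ValueError("no words to count")
    return len(words) / len(set(words))


# Hypothetical hand-made token list.
sample = ["The", "people", "and", "the", "government", "of", "the", "people", "."]
print(tokens_per_type(sample))                   # 8 tokens / 5 distinct words
print(tokens_per_type(sample, normalize=False))  # 9 tokens / 7 distinct tokens
```

On the notebook's data this could be called as `tokens_per_type(inaugural.words('1789-Washington.txt'))`. Returning a float rather than a truncated integer makes differences between speeches easier to see.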