The Library Basement
Reading under ground


Fun with LXXMorph-Corpus

Once I have a text available for natural language processing, there are a few basic tasks I like to perform to kick the tires. First, I like to run the collocations method of NLTK, which gives common word pairs from the text. For the LXXMorph-Corpus, here are the results:

  • ἐν τῇ
  • ἐν τῷ
  • ὁ θεὸς
  • τῆς γῆς
  • καὶ εἶπεν
  • λέγει κύριος
  • ἀνὰ μέσον
  • τὴν γῆν
  • τοῦ θεοῦ
  • ὁ θεός
  • τάδε λέγει
  • πρός με
  • πάντα τὰ
  • ὁ βασιλεὺς
  • οὐ μὴ
  • οὐκ ἔστιν
  • τῇ ἡμέρᾳ
  • οἱ υἱοὶ
  • τῷ κυρίῳ
  • τοῦ βασιλέως

If you disregard the stop words, you can get a decent idea of the fundamental thematic content of the text.
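Under the hood, collocation-finding starts with counting adjacent word pairs. Here is a toy sketch of that counting step, using a few hand-entered tokens rather than the real corpus:

```python
from collections import Counter

# a few tokens in the spirit of the corpus, hand-entered for illustration
tokens = ['καὶ', 'εἶπεν', 'ὁ', 'θεός', 'καὶ', 'εἶπεν', 'κύριος']

# pair each token with its successor and count the pairs
bigrams = Counter(zip(tokens, tokens[1:]))
print(bigrams.most_common(1))  # [(('καὶ', 'εἶπεν'), 2)]
```

NLTK's collocations method additionally filters stop words and scores pairs with association measures, which is why its output is more interesting than raw bigram counts.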

Now for the silliness, using the n-gram random text generator:

ἐν ἀρχῇ ὁδοῦ πόλεως ἐπ' ὀνόμασιν φυλῶν τοῦ Ισραηλ παρώξυναν οὐκ ἐμνήσθησαν διαθήκης ἀδελφῶν καὶ ἐξαποστελῶ πῦρ ἐπὶ Μωαβ ἐν τῷ ἐξαγαγεῖν σε τὸν ἱματισμόν

A categorized, tagged Septuagint corpus

Last year I created a version of the SBLGNT for use as a categorized, tagged corpus for natural language processing. Now I have done the same with a Septuagint text. I am calling it LXXMorph-Corpus. The source for the text and tags is my unicode conversion of the CATSS LXXMorph text. There is at least one category for each file.

The text is arranged with one book per file. Certain books in the source LXXMorph text are split where there is significant textual divergence (manuscript B and A, or the Old Greek and Theodotion). Each file has one or more categories (e.g. pentateuch and writings).

Since there is no punctuation in the source text, the files are laid out with one verse per line. A better arrangement from an NLP perspective would be one line per sentence (thereby preserving the semantic structure). Maybe someday we'll have a freely-licensed LXX text which will include sentence breaks.

Each word is accompanied by the morphological tag in the word/tag format (NLTK will automatically split word and tag on the slash). The part of speech tag is separated from the parsing information with a hyphen, which enables the use of the simplify tags function in NLTK.

Here follows an example of how to load this corpus into NLTK:

from nltk.corpus.reader import CategorizedTaggedCorpusReader

def simplify_tag(tag):
    if '-' in tag:
        tag = tag.split('-')[0]
    return tag

lxx = CategorizedTaggedCorpusReader('lxxmorph-corpus/',
    r'\d{2}\..*', encoding=u'utf8',
    cat_file='cats.txt',  # category file name assumed; see the corpus README
    tag_mapping_function=simplify_tag)

Now through the lxx object you have access to tagged words - lxx.tagged_words(), simplified tags - lxx.tagged_words(simplify_tags=True), tagged sentences - lxx.tagged_sents(), and textual categories - lxx.words(categories='former-prophets').

This is a derivative work of the original CATSS LXXMorph text, and so your use of it is subject to the terms of that license. See the README file for more details.

A categorized, tagged Greek New Testament corpus

I have published a categorized, tagged Greek New Testament useful for natural language processing. I am calling it sblgnt-corpus. The text comes from the SBLGNT and the morphological tags come from the MorphGNT project.

The text is broken up with one book per file. Each file has one or more categories (e.g. gospel and pauline). In the files there is one sentence (not verse) per line. Sentences are demarcated by punctuation . ; and ·. This makes it easy to tokenize sentences by splitting on newlines. Each word is accompanied by the morphological tag in the word/tag format (NLTK will automatically split word and tag on the slash). The part of speech tag is separated from the parsing information with a hyphen, which enables the use of the simplify tags function in NLTK.

Here is an example:

εὐθυμεῖ/V-3PAIS τις/RI-NSM ;/;
ψαλλέτω/V-3PADS ./.
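NLTK ships a small helper for splitting this word/tag format; here is a quick sketch using one of the tokens above:

```python
from nltk.tag import str2tuple

# split the token on the slash separator
word, tag = str2tuple('ψαλλέτω/V-3PADS', sep='/')
print(word, tag)  # ψαλλέτω V-3PADS

# the part-of-speech prefix sits before the hyphen
print(tag.split('-')[0])  # V
```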

Here follows an example of how to load this corpus into NLTK:

from nltk.corpus.reader import CategorizedTaggedCorpusReader

def simplify_tag(tag):
    if '-' in tag:
        tag = tag.split('-')[0]
    return tag

sblgnt = CategorizedTaggedCorpusReader('sblgnt-corpus/',
    r'\d{2}-.*', encoding=u'utf8',
    cat_file='cats.txt',  # category file name assumed; see the corpus README
    tag_mapping_function=simplify_tag)

Now through the sblgnt object you have access to tagged words - sblgnt.tagged_words(), simplified tags - sblgnt.tagged_words(simplify_tags=True), tagged sentences - sblgnt.tagged_sents(), and textual categories - sblgnt.words(categories='gospel').

That should be enough to kickstart the exploration of the Greek New Testament with natural language processing.

Better tokenization of the SBLGNT

In my previous post on this topic I mentioned that the default NLTK tokenizer was erroneously treating elisions as separate tokens. They should be grouped with the word to which they are attached in my opinion. I decided today to look into this and fix the problem.

The SBLGNT uses unicode character 0x2019 ("right single quotation mark") for elisions. The default tokenizer for the NLTK PlaintextCorpus is apparently the wordpunct_tokenize function. This uses the following regular expression for matching tokens:

\w+|[^\w\s]+

That essentially means: match any sequence of alphanumeric characters (\w+), or (|) any sequence comprised of neither alphanumeric characters nor whitespace ([^\w\s]+) - e.g. punctuation. The problem is that in Python's implementation of unicode, 0x2019 is not considered an alphanumeric character, so it is getting tokenized on its own by the latter expression meant to catch punctuation.

So I crafted a new regular expression to alter this behavior:

\w+\u2019?|[^\w\s\u2019]+

So now for each sequence of alphanumeric characters, there can optionally be a 0x2019 at the end to catch elisions (I also explicitly exclude 0x2019 from the latter expression, though I am not entirely sure this is necessary). So now to actually use this:

tokens = nltk.tokenize.regexp.regexp_tokenize(text, u'\w+\u2019?|[^\w\s\u2019]+')
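Here is a small before-and-after on an elided word (the particular token is just an illustration):

```python
from nltk.tokenize import wordpunct_tokenize, regexp_tokenize

text = 'ἀλλ\u2019 ἐγώ'  # an elided ἀλλά followed by another word

# the default tokenizer splits off the elision mark as its own token
print(wordpunct_tokenize(text))  # ['ἀλλ', '’', 'ἐγώ']

# the custom pattern keeps it attached to its word
print(regexp_tokenize(text, r'\w+\u2019?|[^\w\s\u2019]+'))  # ['ἀλλ’', 'ἐγώ']
```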

Using the custom regexp_tokenize function we can tokenize a text using any old regular expression our heart desires. I put a full example of this in the same repo. It should be run after the prep script has downloaded and prepped the data. The load script provides an example workflow for getting an NLTK text object and then running collocations() and generate(). Enjoy!

Prep the SBLGNT for use as an NLTK corpus

The SBLGNT is available as a plain-text download, which is my personal favorite format for text processing. I have been wanting to put the SBLGNT into a Natural Language Toolkit corpus for ease in text processing for quite some time, and decided to get around to it yesterday.

The plain text of the SBLGNT has a few undesirable features for this task. First, each verse is prefixed with the verse number and a tab character, which is great for many applications but not for corpus linguistics. Second, the text contains Windows-style linebreaks and other extraneous whitespace. Third, the text contains text-critical signs.

So I wrote a script to download the plaintext archive, extract the text, and normalize it for use in NLTK.
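The script itself lives in the repo; here is a minimal sketch of the per-line normalization, assuming the verse-number-plus-tab layout described above (the set of text-critical signs shown is an assumption, not the actual script's list):

```python
def normalize_verse(line):
    # drop the leading verse reference and tab
    line = line.split('\t', 1)[-1]
    # strip text-critical signs (this sign list is an assumption)
    for sign in '⸀⸁⸂⸃⸄⸅⟦⟧':
        line = line.replace(sign, '')
    # collapse Windows linebreaks and extraneous whitespace
    return ' '.join(line.split())

print(normalize_verse('1:1\tἘν ἀρχῇ ⸀ἦν ὁ λόγος\r\n'))  # Ἐν ἀρχῇ ἦν ὁ λόγος
```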

First, download and extract the archive, or check out the repo. To install requirements:

$ pip install -r requirements.txt

Next, run the script:

$ python

Now you have a collection of text files, one for each book of the New Testament, in a directory called "out". You can now use these with NLTK. For example:

>>> import nltk
>>> sblgnt = nltk.corpus.PlaintextCorpusReader('out', '.*', encoding='utf-8')
>>> sblgnt_text = nltk.text.Text([w.encode('utf-8') for w in sblgnt.words()])

You end up with sblgnt as an NLTK corpus object and sblgnt_text as an NLTK text object. You can refer to the NLTK documentation for the various uses of these. Please take note of the encodings. If you don't pay attention, you'll get lots of encoding errors when working with a unicode text and NLTK.

One thing you can do is run the collocations method on sblgnt_text:

>>> sblgnt_text.collocations()
Building collocations list
τοῦ θεοῦ; ἐν τῷ; ἀλλ ’; ἐν τῇ; ὁ Ἰησοῦς; δι ’; ἐπ ’; ὁ θεὸς; μετ ’; εἰς τὴν; ἀπ ’; τῆς γῆς; λέγω ὑμῖν; Ἰησοῦ Χριστοῦ; ἐκ τοῦ; τῷ θεῷ; τοῦ κυρίου; κατ ’; εἰς τὸ; οὐκ ἔστιν

I'll have to look into tweaking the NLTK tokenizer, because, as you can see, it is treating elisions as tokens, which may or may not be grammatically correct (I'll have to think about that and ask around). Another cool trick, the generate method:

>>> sblgnt_text.generate(50)
Building ngram index...
ΠΡΟΣ ΚΟΡΙΝΘΙΟΥΣ Α Παῦλος ἀπόστολος Χριστοῦ Ἰησοῦ καὶ τοῖς βουνοῖς · Καλύψατε ἡμᾶς · πολλοὶ ἐλεύσονται ἐπὶ τῷ λόγῳ διὰ τῆς στενῆς θύρας , ὅτι τὸ μωρὸν τοῦ θεοῦ . Καὶ ἐγένετο ἐν τῷ βυθῷ πεποίηκα · ὁδοιπορίαις πολλάκις , ἐν κόποις , ἐλπίδα δὲ ἔχοντες αὐξανομένης τῆς πίστεως ,

So that's that. At some point I'll attempt to make a tagged text based on the MorphGNT (which is being re-based off SBLGNT).

Jesus' vocabulary

A friend of mine asked if I had a list of all of Jesus' words, sorted by frequency, with common words like "the" removed. I did not have such a list at hand, but I took it as a challenge.

Thanks to software, most of the work to create a sorted list of Jesus' vocabulary is trivial. I can easily make a frequency list of his words and remove common stopwords. The most challenging part for me was finding a source of the gospels from which it was easy to extract just Jesus' words. I asked around, and found that the World English Bible XML contains a <wj> (i.e. "words of Jesus") tag which delimits exactly what I need. So after a bit of processing, and thanks to NLTK, I was able to provide a basic list of Jesus' most common words:

  1. one - 221
  2. father - 211
  3. tell - 210
  4. man - 196
  5. God - 163
  6. things - 163
  7. come - 158
  8. son - 149
  9. go - 123
  10. also - 113
  11. know - 111
  12. may - 111
  13. kingdom - 104
  14. see - 102
  15. lord - 97
  16. said - 96
  17. therefore - 94
  18. give - 93
  19. heaven - 86

Based on the top of the list, I'd say Jesus was primarily talking about the good news.

I've shared the code.
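For the curious, the extraction step can be sketched with a toy fragment in the style of the WEB XML (the real files use a fuller schema, and the stopword list here is abbreviated):

```python
import re
from collections import Counter

sample = ('<verse>Jesus said, <wj>Follow me, and I will make you '
          'fishers of men.</wj></verse>')

STOPWORDS = {'the', 'and', 'i', 'you', 'of', 'me', 'will', 'said'}  # abbreviated

# collect the words inside every <wj> element, lowercased
words = []
for wj in re.findall(r'<wj>(.*?)</wj>', sample, re.S):
    words += re.findall(r"[a-zA-Z']+", wj.lower())

# count everything that is not a stopword
freq = Counter(w for w in words if w not in STOPWORDS)
print(freq.most_common())
```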

Fun with Subtitles

The Natural Language Processing course by Dan Jurafsky and Chris Manning has commenced. The class is comprised of readings, video lectures, problems, and code examples. I have been working through the video lectures and was pleased to see that each one has English subtitles. This being a natural language processing course, I decided it was obligatory that I process these subtitles.

So I decided to download each subtitle file, concatenate them, normalize the text, tokenize the text, and remove stopwords. I take the resulting list and sort by the frequency of occurrence. Here's the top 50.

word - 113
we're - 107
one - 86
two - 78
like - 76
words - 75
distance - 65
let's - 57
gonna - 56
it's - 56
we'll - 54
that's - 49
string - 48
there's - 47
look - 46
end - 40
sentence - 40
example - 39
things - 39
inaudible - 38
might - 36
use - 36
going - 35
cost - 34
here's - 34
capital - 31
kind - 31
match - 31
algorithm - 30
see - 30
alignment - 29
could - 29
get - 29
text - 28
three - 28
e - 27
n - 27
regular - 27
different - 26
processing - 26
strings - 26
period - 25
case - 24
character - 24
language - 24
little - 24
characters - 23
means - 23
sound - 23
us - 23

At first I was thrown off by the presence of "e" and "n" in the list, thinking I had a bug in my tokenizer. But it turns out that the instructors say many individual letters in the course of their discussions. Comments or email with feedback are appreciated.
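A minimal sketch of that pipeline, assuming the subtitles arrive in SRT format (cue numbers and timestamp lines interleaved with the text) and using an abbreviated stopword list:

```python
import re
from collections import Counter

STOPWORDS = {'so', 'this', 'is', 'a', 'the', 'to', 'of', 'and'}  # abbreviated

def srt_words(srt_text):
    words = []
    for line in srt_text.splitlines():
        # skip cue numbers and timestamp lines
        if line.strip().isdigit() or '-->' in line:
            continue
        words += re.findall(r"[a-z']+", line.lower())
    return words

sample = "1\n00:00:01,000 --> 00:00:03,500\nSo this is a regular expression.\n"
freq = Counter(w for w in srt_words(sample) if w not in STOPWORDS)
print(freq.most_common())  # [('regular', 1), ('expression', 1)]
```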


Koine Greek wordlists?

I have mostly had the Greek New Testament and related works in mind as I have been working through Natural Language Processing with Python. One useful resource discussed in the book is the word-list, which is just a simple collection of words gathered from a particular language or corpus. Text auto-completion is an example of the usefulness of such lists. The only ancient Greek wordlist included with NLTK is the Swadesh list, which is fairly limited (only 207 entries).

Before I invest the effort in creating one, does anyone know of any existing unicode Koine Greek word-lists? If not, it should not be too difficult to programmatically compile one from the Greek NT. Either way I will share what I find (and any code which happens to be written for this end).
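The compilation step could be sketched like this, using a few hand-entered tokens in place of a real corpus reader's .words() output:

```python
# stand-in tokens; in practice these would come from an NLTK corpus reader
tokens = ['Ἐν', 'ἀρχῇ', 'ἦν', 'ὁ', 'λόγος', ',', 'καὶ', 'ὁ', 'λόγος']

# keep alphabetic tokens only, lowercase, deduplicate, and sort
wordlist = sorted({w.lower() for w in tokens if w.isalpha()})
print(len(wordlist))  # 6 distinct words
```

Writing the list out one word per line would then make it loadable with NLTK's WordListCorpusReader.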

Random Sanderson

Brandon Sanderson's Warbreaker is available online under a Creative Commons license, so I decided it would be perfect for use playing around with the Natural Language Toolkit. So here is some basic info based on the examples in the Natural Language Processing with Python book:

  • Total words: 201,388
  • Vocabulary size: 9,412
  • Lexical diversity: 21.4
  • Words longer than 7 letters appearing more than 100 times: anything, Bluefingers, Blushweaver, Hallandren, Lifeless, Lightsong, Llarimar, Nightblood, princess, Returned, servants, something, Susebron
  • Collocations: God King; Tonk Fah; [character] said (9x); Pahn Kahl; fell silent; serving women; high priest; could see; God Kings; Either way; Iridescent Tones; someone else

And last but not least, some Random Sanderson:

Prologue It' s find out. "No," Vivenna said . "Oh, dear," Lightsong said as soon as they walked toward the royal family. Service for a few snips later, he was left chilled by how often she got used to moving with terrible speed. Vivenna turned as Denth. He took the evening off and do something? Was he chosen to sleep with the sword free of all those colors, even with awkward Commands. Blushweaver smiled. Vivenna didn't run from us."You saved."

Random Genesis

I have just begun working through Natural Language Processing with Python. One of the first features highlighted in the first chapter is the ability of nltk (the Natural Language ToolKit - a Python module) to generate random text from a corpus.

Without further ado, here is what my system generated based on the book of Genesis in the KJV:

In the selfsame day , neither do thou any thing that creepeth upon the bank of the east wind , sprung up after th And I will send thee a covering of the Philistines unto Gerar . And he commanded the steward of my master greatly ; and she bare unto Jacob , went forth to go down and buy thee fo But if thou be in the second , and fall upon Adam , in the land is good : and his two womenservants , and begat sons and his eleven sons , and put every man ' s

Sound realistic? ;-)