The Library Basement
Reading under ground

Category language

Better tokenization of the SBLGNT

In my previous post on this topic I mentioned that the default NLTK tokenizer was erroneously treating elisions as separate tokens. In my opinion they should be grouped with the word to which they are attached. I decided today to look into this and fix the problem.

The SBLGNT uses unicode character 0x2019 ("right single quotation mark") for elisions. The default tokenizer for the NLTK PlaintextCorpusReader is apparently the wordpunct_tokenize function. This uses the following regular expression for matching tokens:

\w+|[^\w\s]+

That essentially means: match any sequence of alphanumeric characters (\w+), or (|) any sequence composed of neither alphanumeric characters nor whitespace ([^\w\s]+), e.g. punctuation. The problem is that in Python's implementation of unicode, 0x2019 is not considered an alphanumeric character, so it is getting tokenized on its own by the latter expression meant to catch punctuation.
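To see the behavior in isolation, here is a minimal sketch that uses Python's built-in re module in place of NLTK. It assumes Python 3, where \w is unicode-aware for text strings, and the sample phrase is made up for illustration.

```python
import re
import unicodedata

# The pattern described above: runs of word characters, or runs of
# characters that are neither word characters nor whitespace.
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

# NFC keeps each accented Greek letter as a single precomposed code point,
# so \w+ is not interrupted by combining marks.
sample = unicodedata.normalize("NFC", "ἀλλ\u2019 ἐγώ")

tokens = WORDPUNCT.findall(sample)
print(tokens)  # the elision mark comes out as a token of its own

print("\u2019".isalnum())              # False
print(unicodedata.category("\u2019"))  # Pf (final quote punctuation)
```

The last two lines show why: 0x2019 is classified as punctuation, so only the second branch of the pattern can match it.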

So I crafted a new regular expression to alter this behavior:

\w+\u2019?|[^\w\s\u2019]+

So now for each sequence of alphanumeric characters, there can optionally be a 0x2019 at the end to catch elisions (I also explicitly exclude 0x2019 from the latter expression, though I am not entirely sure this is necessary). So now to actually use this:

tokens = nltk.tokenize.regexp.regexp_tokenize(text, u'\w+\u2019?|[^\w\s\u2019]+')
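For a quick check of the pattern without NLTK installed, the same expression can be run through re.findall. This is only a sketch under the assumption of Python 3; the tokenize helper and the sample phrase are mine and not part of the repo.

```python
import re

ELISION = "\u2019"  # right single quotation mark, used for elisions

# A word may carry one trailing elision mark; the punctuation branch
# explicitly excludes that mark.
ELISION_AWARE = re.compile(rf"\w+{ELISION}?|[^\w\s{ELISION}]+")


def tokenize(text):
    """Return the tokens of text, keeping elisions attached to their word."""
    return ELISION_AWARE.findall(text)


print(tokenize("ἀλλ\u2019 ἐγώ, δι\u2019 αὐτοῦ."))
```

One consequence of excluding 0x2019 from the second branch: a mark that does not directly follow a word is matched by neither branch, so it is silently dropped instead of becoming a token.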

Using the custom regexp_tokenize function we can tokenize a text using any old regular expression our heart desires. I put a full example of this in the same repo as a load script. It should be run after the prep script has run to download and prep the data. The load script provides an example workflow for getting an NLTK text object and then running collocations() and generate() on it. Enjoy!

Prep the SBLGNT for use as an NLTK corpus

The SBLGNT is available as a plain-text download, which is my personal favorite format for text processing. I have been wanting to put the SBLGNT into a Natural Language Toolkit corpus for ease in text processing for quite some time, and decided to get around to it yesterday.

First of all, the plain text of the SBLGNT has a few undesirable features for this task. First, each verse is prefixed with the verse number and the tab character, which is great for many applications but not for corpus linguistics. Second, the text contains Windows-style linebreaks and other extraneous whitespace. Third, the text contains text-critical signs.

So I wrote a script to download the plaintext archive, extract the text, and normalize it for use in NLTK.
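The normalization step can be sketched roughly as follows. This is not the script from the repo: the download and extraction are left out, the normalize function is a hypothetical name, and the set of text-critical signs (U+2E00 through U+2E05 plus square brackets) is my assumption and should be checked against the actual download.

```python
import re

# Assumed set of text-critical signs: U+2E00..U+2E05 and square brackets.
CRITICAL_SIGNS = re.compile("[\u2e00-\u2e05\\[\\]]")


def normalize(raw):
    """Strip verse prefixes, Windows linebreaks, extra whitespace and signs."""
    lines = []
    for line in raw.replace("\r\n", "\n").split("\n"):
        # Drop everything up to and including the first tab (the verse
        # reference); lines without a tab are kept whole.
        _, tab, rest = line.partition("\t")
        text = rest if tab else line
        text = CRITICAL_SIGNS.sub("", text)
        # Collapse runs of whitespace and trim the ends.
        text = " ".join(text.split())
        if text:
            lines.append(text)
    return "\n".join(lines)
```

Each verse ends up on its own line with nothing but the running text left, which is what a plain-text corpus reader wants.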

First, download and extract or check out the repo. To install requirements:

$ pip install -r requirements.txt

Next, run the script:

$ python

Now you have a collection of text files, one for each book of the New Testament, in a directory called "out". You can now use these with NLTK. For example:

>>> import nltk
>>> sblgnt = nltk.corpus.PlaintextCorpusReader('out','.*',encoding='utf-8')
>>> sblgnt_text = nltk.text.Text([w.encode('utf-8') for w in sblgnt.words()])

You end up with sblgnt as an NLTK corpus object and sblgnt_text as an NLTK text object. You can refer to the NLTK documentation for the various uses of these. Please take note of the encodings. If you don't pay attention, you'll get lots of encoding errors when working with a unicode text and NLTK.

One thing you can do is run the collocations method on sblgnt_text:

>>> sblgnt_text.collocations()
Building collocations list
τοῦ θεοῦ; ἐν τῷ; ἀλλ ’; ἐν τῇ; ὁ Ἰησοῦς; δι ’; ἐπ ’; ὁ θεὸς; μετ ’; εἰς τὴν; ἀπ ’; τῆς γῆς; λέγω ὑμῖν; Ἰησοῦ Χριστοῦ; ἐκ τοῦ; τῷ θεῷ; τοῦ κυρίου; κατ ’; εἰς τὸ; οὐκ ἔστιν

I'll have to look into tweaking the NLTK tokenizer, because, as you can see, it is treating elisions as tokens, which may or may not be grammatically correct (I'll have to think about that and ask around). Another cool trick, the generate method:

>>> sblgnt_text.generate(50)
Building ngram index...
ΠΡΟΣ ΚΟΡΙΝΘΙΟΥΣ Α Παῦλος ἀπόστολος Χριστοῦ Ἰησοῦ καὶ τοῖς βουνοῖς · Καλύψατε ἡμᾶς · πολλοὶ ἐλεύσονται ἐπὶ τῷ λόγῳ διὰ τῆς στενῆς θύρας , ὅτι τὸ μωρὸν τοῦ θεοῦ . Καὶ ἐγένετο ἐν τῷ βυθῷ πεποίηκα · ὁδοιπορίαις πολλάκις , ἐν κόποις , ἐλπίδα δὲ ἔχοντες αὐξανομένης τῆς πίστεως ,

So that's that. At some point I'll attempt to make a tagged text based on the MorphGNT (which is being re-based off SBLGNT).

A reader's Christmas

I was given the Loeb volume including Sophocles' Antigone, The Women of Trachis, Philoctetes, and Oedipus at Colonus. I am very excited to have some new Greek literature to read in the new year. In addition, this is my first Loeb volume, which feels pretty nice. I remember looking longingly at the many shelves of green volumes in the library basement in college. Wish fulfilled.

I'll be sure to share insights and curiosities from these texts as I come across them in the reading process. Soon I'll develop a reading schedule to get through them in a reasonable amount of time.
Category: language Tags: Greek

Greek grammar in Greek

William Annis' twin sites Aoidoi and Scholiastae are great resources for annotated Greek poetry and prose, respectively. I have not delved too much into the offerings, but what I have seen is impressive.

My favorite resource on Scholiastae is Greek Grammar in Greek (PDF), which provides vocabulary and some introduction for the discussion of the Greek language in Greek. I appreciated this so much that I made a Quisition flashcard deck of these terms.

Category: language Tags: Greek

The lack of adoption of Unicode in Biblical studies

It is 2012, people. It is unacceptable to not use Unicode Greek and Hebrew in your publications. If you are still resorting to transliteration (PDF warning), you need to get that fixed. There are many resources on the web regarding how to get started with Unicode Greek and Hebrew. I am personally willing to help.

(The linked article is a great review of interpretations of "baptism for the dead" in 1 Corinthians 15 by Joel R. White. I recommend it. I am not sure whether or not it was the author or editor who was the source of the transliteration.)

Bundled wordplay

I ran across this enchanting line in Sirach 21:9 in the LXX:

στιππύον συνηγμένον συναγωγὴ ἀνόμων

The NETS renders this as "a gathering of the lawless is bundled flax." (Followed by "and a flame of fire is their end.") I was struck by the use of the same lexical root συνάγω for the subject and the adjective of the object, which makes for a great wordplay. To bring that out in English, you would have to do something more like "a gathering of the lawless is gathered flax", which does not quite sound right, unfortunately.

The history of Greek grammars

Mike Aubrey is working on a series of posts detailing the history of Greek grammars. The long outline already has five posts in addition to an ever-expanding bibliography. Despite having taken a lot of Greek courses, I never learned much about the ancient and subsequent views on Greek grammar. I am enjoying the posts so far.

Category: language Tags: Greek

Fun with Subtitles

The Natural Language Processing course by Dan Jurafsky and Chris Manning has commenced. The class consists of readings, video lectures, problems, and code examples. I have been working through the video lectures and was pleased to see that each one has English subtitles. This being a natural language processing course, I decided it was obligatory that I process these subtitles.

So I decided to download each subtitle file, concatenate them, normalize the text, tokenize the text, and remove stopwords. I take the resulting list and sort by the frequency of occurrence. Here's the top 50.

word - 113
we're - 107
one - 86
two - 78
like - 76
words - 75
distance - 65
let's - 57
gonna - 56
it's - 56
we'll - 54
that's - 49
string - 48
there's - 47
look - 46
end - 40
sentence - 40
example - 39
things - 39
inaudible - 38
might - 36
use - 36
going - 35
cost - 34
here's - 34
capital - 31
kind - 31
match - 31
algorithm - 30
see - 30
alignment - 29
could - 29
get - 29
text - 28
three - 28
e - 27
n - 27
regular - 27
different - 26
processing - 26
strings - 26
period - 25
case - 24
character - 24
language - 24
little - 24
characters - 23
means - 23
sound - 23
us - 23

At first I was thrown off by the presence of "e" and "n" in the list, thinking I had a bug in my tokenizer. But it turns out that the instructors say many individual letters in the course of their discussions. Comments or email with feedback are appreciated.
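The processing described above (concatenate, normalize, tokenize, remove stopwords, sort by frequency) can be sketched like this. It is not the code I ran: the tiny STOPWORDS set is an illustrative stand-in for a real stopword list, top_words is a made-up name, and downloading the subtitle files is left out.

```python
import re
from collections import Counter

# Illustrative stand-in only; a real run would use a fuller stopword list.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "that", "this"}

# Letters with an optional apostrophe part, so "we're" stays one token.
# Digits and subtitle timestamps never match and are dropped.
TOKEN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def top_words(subtitle_texts, n=50):
    """Return the n most frequent non-stopword tokens with their counts."""
    text = " ".join(subtitle_texts).lower()  # concatenate and normalize
    tokens = TOKEN.findall(text)             # tokenize
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)             # sort by frequency


print(top_words(["The word is a word", "we're here: the WORD"], n=3))
```

A tokenizer like this also keeps single letters, which is how entries such as "e" and "n" can end up in the list.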

Category: language Tags: nltk

Biblical languages reception debrief

This past week my alma mater Multnomah University celebrated its 75th anniversary. As part of the festivities several receptions (or reunions) were planned. Most of these meetings were arranged by class year, but the big exception was the one for Biblical languages. I decided that the prospect of a meet-up with faculty, classmates, and current students was a can't-miss event, so I even delayed the start of my vacation to attend.

I arrived at the reception room a bit before the scheduled start time. The room was empty and the lights were off. However, people started streaming in shortly thereafter. I was pleased when I recognized many of the people coming in the door. Overall there were four faculty members (under three of whom I studied), three alumni including myself, and perhaps a dozen current students.

It felt good to be recognized by old faculty, and apparently my thesis Short Goliath is still remembered. I was saddened that the alumni turnout was so low, but not particularly surprised. The faculty asked us to share a bit about the post-college experience of Biblical languages with the students. I recommended that folks get involved with online communities centered on Greek and Hebrew.

After the brief introductions, we split up to socialize. I think our group had the most fun of all the reunions. The Biblical Languages room was the most full, and certainly the loudest (with laughter ringing out almost non-stop). It was fun; it felt like family. The shared experience of learning languages leads to a lot of laughter and bonding. I am glad I went.

Category: language

Polytonic Greek in Dvorak layout for Linux

I type with the Dvorak keymap. I also type polytonic Greek, which in every operating system I use is based on the Qwerty keymap. So it gets very confusing and annoying to switch back and forth.

So I modified the Linux Xorg Greek keymap to correspond to the English Dvorak layout. And I'd like to share.

  1. Download the modified keymap.
  2. For steps 3-5 you'll need root privileges, so use sudo or su to get them.
  3. Find your xkb symbols directory - On Debian-based systems it is in /usr/share/X11/xkb/symbols, but it has also been placed in /etc/X11/...
  4. Back up your existing Greek layout by copying "gr" from that folder to a safe place.
  5. Remove the ".txt" extension from the modified keymap you downloaded and place the file in your xkb symbols directory.
  6. From a terminal issue this command to "reset" your keymap to normal:
    setxkbmap -layout us -variant dvorak
    This doesn't do anything, but you'll want that in your terminal history so you can get back to regular English Dvorak.
  7. Issue the following command to make the keymap active:
    setxkbmap -layout gr -variant dvpoly

Now you are typing polytonic Greek with a Dvorak layout. As others have noted, though, this does not have the spirit of the Dvorak keymap. It is not constructed based on actual usage of Greek, and it will not be any more efficient than the Qwerty-based layouts (and given the placement of the accent keys in my layout, it may actually be worse). The purpose of this keymap is purely to make it easier for English Dvorak typists to transition to polytonic Greek.
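If you would rather script the backup and install (steps 4 and 5 above), something along these lines should do. It is only a sketch: install_keymap and the backup name "gr.orig" are made up for illustration, every path is passed in by you, and on a real system it needs root privileges to write to the symbols directory.

```python
import shutil
from pathlib import Path


def install_keymap(downloaded, symbols_dir, backup_dir):
    """Back up the stock "gr" symbols file, then install the modified one."""
    downloaded = Path(downloaded)
    backup_dir = Path(backup_dir)
    target = Path(symbols_dir) / "gr"

    # Step 4: keep a copy of the existing Greek layout somewhere safe.
    # An existing backup is never overwritten.
    backup_dir.mkdir(parents=True, exist_ok=True)
    backup = backup_dir / "gr.orig"
    if target.exists() and not backup.exists():
        shutil.copy2(target, backup)

    # Step 5: the download carries a ".txt" extension; installing it under
    # the name "gr" drops that extension.
    shutil.copy2(downloaded, target)
    return backup
```

To undo the change, copy the backup file over "gr" in the symbols directory again.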

The primary departure I made from the English Dvorak paradigm was moving the semi-colon so that the Greek accent dead keys could be closer to one another. In this keymap, the "semi-colon" (actually Greek question mark) can be typed using the "Q" key on the Qwerty layout, and the acute and grave accents can be typed using the "Z" and "X" keys respectively (which are ";" and "q" in the Dvorak layout). Here is a screenshot of the layout for a more explicit reference.

The xkb keymaps are under the MIT license. I used this non-polytonic layout as a guide for my work. Please post with comments, questions, bugs, etc. For more info on how to type polytonic Greek in Linux, see this excellent post on B-Greek. It references the Qwerty-based layout, but the same principles apply.

Koine Greek wordlists?

I have mostly had the Greek New Testament and related works in mind as I have been working through Natural Language Processing with Python. One useful resource discussed in the book is a wordlist, which is just a simple collection of words gathered from a particular language or corpus. Text auto-completion is an example of the usefulness of such lists. The only ancient Greek wordlist included with NLTK is the Swadesh list, which is fairly limited (only 207 entries).

Before I invest the effort in creating one, does anyone know of any existing unicode Koine Greek word-lists? If not, it should not be too difficult to programmatically compile one from the Greek NT. Either way I will share what I find (and any code which happens to be written for this end).
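As a rough idea of what compiling one programmatically could look like, here is a sketch that collects the distinct word forms from a directory of plain-text files. The directory layout (one UTF-8 file per book) and the name build_wordlist are assumptions for illustration, and the pattern keeps a trailing 0x2019 attached so elided forms stay whole.

```python
import re
from pathlib import Path

# A run of word characters with an optional trailing elision mark.
TOKEN = re.compile("\\w+\u2019?")


def build_wordlist(corpus_dir):
    """Return the sorted list of distinct word forms found in corpus_dir."""
    words = set()
    for path in sorted(Path(corpus_dir).glob("*")):
        if path.is_file():
            text = path.read_text(encoding="utf-8")
            words.update(TOKEN.findall(text))
    return sorted(words)
```

This keeps every inflected and capitalized form as a separate entry. Whether to fold case or strip accents before collecting is a separate decision that the sketch leaves alone.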

Sidgwick's First Greek Writer

First Greek Writer by Arthur Sidgwick is available for download on the Textkit Greek resources page. There is also an answer key for the exercises. I have decided to (finally) work my way through a Greek composition text. At the moment I am reading through Sidgwick's notes, and then I'll begin the actual exercises. The notes themselves are quite helpful - sometimes Sidgwick has a way of stating a rule which is much simpler and easier to remember than other presentations I have read. I'll post any thoughts on the book here, if they arise.

Category: language Tags: Greek

Greek for "blog"

I was reading a comment composed in Greek on a blog the other day and found that it used το βλογ for "blog." That is essentially a transliteration. But I started wondering what a good translation would be.

"Blog" is the result of "web" and "log." I think we can take λογος as a sufficient analog for "log." But what about the "web" part?

I often think of the "web" analogically as a spider web. The Greek for this is αραχνη, but that seems too specific to spiders. The other possibility would be "net," which made me think of δικτυον (commonly used in the context of fishing, but not exclusively). This seems to be the better match, since the modern Greek word for "internet" is Διαδικτυο.

So I propose δικτυλογος as a Greek translation of blog. Google has no results for it, so I could be a trend-setter (or a complete dork). Please comment with alternative suggestions or corrections.

Category: language Tags: Greek