The Library Basement
Reading under ground

Tag Greek

Unicode conversion of the LXX Morph text

I had been looking for a morphologically-tagged LXX for research and
came across the CATSS LXXM text. The one thing lacking for my use of
this text was that it was in betacode and not in unicode.

By searching I have found that many people have taken this text and
converted it to unicode for embedding in web sites, but to my knowledge
nobody is publishing the equivalent plain text files. The Unbound Bible
comes closest, but it publishes the text and the morphological analysis
in two separate files, which is suboptimal. So I decided to embark on
converting the LXXM to unicode.

Luckily James Tauber has shared a Greek betacode to unicode conversion
script which took care of most of the hard work for me. Using this,
I was able to convert all of the texts from betacode to unicode. I am
sharing the result as a git archive: lxxmorph-unicode.

The texts differ from the originals in the following ways:

  1. Several corrections have been applied.
  2. The betacode text has been converted to unicode.
  3. The files are now whitespace-separated rather than fixed-width.
  4. The second column, containing the POS and parsing information, has had its whitespace replaced with hyphens in accordance with the above.
  5. The split files of Genesis, Psalms, Isaiah, Jeremiah, and Ezekiel have been combined, and all the files have been renumbered.
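
As a rough illustration of items 3 and 4, here is a minimal Python 3 sketch of turning one fixed-width record into whitespace-separated fields. The function name `convert_line` and the column widths (`word_width`, `parse_width`) are assumptions made up for this example, not values taken from the CATSS files or from the actual conversion, and the betacode-to-unicode step is left out entirely.

```python
def convert_line(line, word_width=25, parse_width=11):
    """Turn one fixed-width record into whitespace-separated fields.

    The default widths are placeholders for illustration only; check them
    against the real files before relying on this.
    """
    word = line[:word_width].strip()
    parse = line[word_width:word_width + parse_width].strip()
    rest = line[word_width + parse_width:].split()

    # The parse column can contain internal spaces (e.g. "N1  DSF"), so join
    # its pieces with hyphens to keep it a single whitespace-delimited field.
    parse = "-".join(parse.split())

    return " ".join([field for field in (word, parse) if field] + rest)


if __name__ == "__main__":
    # A made-up record laid out with the assumed column widths.
    sample = "A)RXH=|".ljust(25) + "N1  DSF".ljust(11) + "A)RXH/"
    print(convert_line(sample))
```

Hyphenating the parse column is what makes a plain `split()` on each output line safe: every line then has the same number of fields regardless of how much parsing information the word carries.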

Please note that this resource has a rather novel license which
requires users to fill out a user declaration and send it in to the
CCAT program at the University of Pennsylvania (see
0-user-declaration.txt in the repo). As far as I can tell, my redistribution of the unicode version complies with the license. I have contacted Robert Kraft (the former steward) and Bernard Taylor (the current steward) with the corrections I've found.

(link to the original announcement on the Open Scriptures mailing list)

Better tokenization of the SBLGNT

In my previous post on this topic I mentioned that the default NLTK tokenizer was erroneously treating elisions as separate tokens. In my opinion they should be grouped with the word to which they are attached. I decided today to look into this and fix the problem.

The SBLGNT uses unicode character 0x2019 ("right single quotation mark") for elisions. The default tokenizer for the NLTK PlaintextCorpus is apparently the wordpunct_tokenize function. This uses the following regular expression for matching tokens:

\w+|[^\w\s]+

That essentially means: match any sequence of alphanumeric characters (\w+), or (|) any sequence composed of neither alphanumeric characters nor whitespace ([^\w\s]+) - e.g. punctuation. The problem is that in Python's implementation of unicode, 0x2019 is not considered an alphanumeric character, so it is getting tokenized on its own by the latter expression meant to catch punctuation.

So I crafted a new regular expression to alter this behavior:

\w+\u2019?|[^\w\s\u2019]+

So now for each sequence of alphanumeric characters, there can optionally be a 0x2019 at the end to catch elisions (I also explicitly exclude 0x2019 from the latter expression, though I am not entirely sure this is necessary). So now to actually use this:

tokens = nltk.tokenize.regexp.regexp_tokenize(text, u'\w+\u2019?|[^\w\s\u2019]+')

Using the regexp_tokenize function we can tokenize a text using any old regular expression our heart desires. I put a full example of this in the same repo. It should be run after the script that downloads and preps the data has run. The load script provides an example workflow for getting an NLTK text object and then running collocations() and generate(). Enjoy!
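
To see the difference between the two patterns without installing anything, here is a small Python 3 sketch that uses the standard library's `re.findall` as a stand-in for NLTK's regexp tokenizer. The names `DEFAULT_PATTERN`, `ELISION_PATTERN`, and `tokenize` are mine, and the NFC normalization step is an extra precaution I am assuming is harmless for this text, not something the original script does.

```python
import re
import unicodedata

# The default word/punctuation pattern described above.
DEFAULT_PATTERN = r"\w+|[^\w\s]+"
# Variant that keeps a trailing U+2019 (the elision mark) attached to its word.
ELISION_PATTERN = r"\w+\u2019?|[^\w\s\u2019]+"


def tokenize(text, pattern=ELISION_PATTERN):
    """Split text into tokens with re.findall."""
    # Normalize to NFC so accented letters are single precomposed characters
    # that \w can match.
    return re.findall(pattern, unicodedata.normalize("NFC", text))


if __name__ == "__main__":
    sample = "\u1f00\u03bb\u03bb\u2019 \u1f10\u03bd"  # an elided word plus a preposition
    print(tokenize(sample, DEFAULT_PATTERN))  # elision mark becomes its own token
    print(tokenize(sample))                   # elision mark stays with its word
```

With the default pattern the sample yields three tokens, and with the elision-aware pattern it yields two.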

Prep the SBLGNT for use as an NLTK corpus

The SBLGNT is available as a plain-text download, which is my personal favorite format for text processing. I have been wanting to put the SBLGNT into a Natural Language Toolkit corpus for ease in text processing for quite some time, and decided to get around to it yesterday.

First of all, the plain text of the SBLGNT has a few undesirable features for this task. First, each verse is prefixed with the verse number and the tab character, which is great for many applications but not for corpus linguistics. Second, the text contains Windows-style linebreaks and other extraneous whitespace. Third, the text contains text-critical signs.

So I wrote a script to download the plaintext archive, extract the text, and normalize it for use in NLTK.
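
The three cleanup steps can be sketched in a few lines of Python 3. This is not the script from the repo, just a minimal illustration under two assumptions: that each verse line has the form "reference, tab, text", and that the text-critical signs fall in the unicode range U+2E00 to U+2E05. The names `TEXT_CRITICAL_SIGNS` and `normalize_line` are hypothetical.

```python
import re

# Assumed range of text-critical signs (U+2E00 to U+2E05); adjust as needed.
TEXT_CRITICAL_SIGNS = re.compile("[\u2e00-\u2e05]")


def normalize_line(line):
    """Strip the verse reference, text-critical signs, and stray whitespace."""
    # Drop everything up to and including the first tab (the verse prefix).
    # Lines without a tab, such as book titles, are kept whole.
    if "\t" in line:
        line = line.split("\t", 1)[1]
    line = TEXT_CRITICAL_SIGNS.sub("", line)
    # split() with no arguments also discards "\r" from Windows line endings.
    return " ".join(line.split())


if __name__ == "__main__":
    # A made-up line in the assumed "reference<TAB>text" layout.
    print(normalize_line("Matt 1:1\t\u2e00abc  def \r\n"))
```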

First, download and extract or check out the repo. To install requirements:

$ pip install -r requirements.txt

Next, run the script:

$ python

Now you have a collection of text files, one for each book of the New Testament, in a directory called "out". You can now use these with NLTK. For example:

>>> import nltk
>>> sblgnt = nltk.corpus.PlaintextCorpusReader('out','.*',encoding='utf-8')
>>> sblgnt_text = nltk.text.Text([w.encode('utf-8') for w in sblgnt.words()])

You end up with sblgnt as an NLTK corpus object and sblgnt_text as an NLTK text object. You can refer to the NLTK documentation for the various uses of these. Please take note of the encodings. If you don't pay attention, you'll get lots of encoding errors when working with a unicode text and NLTK.

One thing you can do is run the collocations method on sblgnt_text:

>>> sblgnt_text.collocations()
Building collocations list
τοῦ θεοῦ; ἐν τῷ; ἀλλ ’; ἐν τῇ; ὁ Ἰησοῦς; δι ’; ἐπ ’; ὁ θεὸς; μετ ’; εἰς τὴν; ἀπ ’; τῆς γῆς; λέγω ὑμῖν; Ἰησοῦ Χριστοῦ; ἐκ τοῦ; τῷ θεῷ; τοῦ κυρίου; κατ ’; εἰς τὸ; οὐκ ἔστιν

I'll have to look into tweaking the NLTK tokenizer, because, as you can see, it is treating elisions as tokens, which may or may not be grammatically correct (I'll have to think about that and ask around). Another cool trick, the generate method:

>>> sblgnt_text.generate(50)
Building ngram index...
ΠΡΟΣ ΚΟΡΙΝΘΙΟΥΣ Α Παῦλος ἀπόστολος Χριστοῦ Ἰησοῦ καὶ τοῖς βουνοῖς · Καλύψατε ἡμᾶς · πολλοὶ ἐλεύσονται ἐπὶ τῷ λόγῳ διὰ τῆς στενῆς θύρας , ὅτι τὸ μωρὸν τοῦ θεοῦ . Καὶ ἐγένετο ἐν τῷ βυθῷ πεποίηκα · ὁδοιπορίαις πολλάκις , ἐν κόποις , ἐλπίδα δὲ ἔχοντες αὐξανομένης τῆς πίστεως ,

So that's that. At some point I'll attempt to make a tagged text based on the MorphGNT (which is being re-based off SBLGNT).

A reader's Christmas

I was given the Loeb volume including Sophocles' Antigone, The Women of Trachis, Philoctetes, and Oedipus at Colonus. I am very excited to have some new Greek literature to read in the new year. In addition, this is my first Loeb volume, which feels pretty nice. I remember looking longingly at the many shelves of green volumes in the library basement in college. Wish fulfilled.

I'll be sure to share insights and curiosities from these texts as I come across them in the reading process. Soon I'll develop a reading schedule to get through them in a reasonable amount of time.
Category: language Tags: Greek

Greek grammar in Greek

William Annis' twin sites Aoidoi and Scholiastae are great resources for annotated Greek poetry and prose, respectively. I have not delved too much into the offerings, but what I have seen is impressive.

My favorite resource on Scholiastae is Greek Grammar in Greek (PDF), which provides vocabulary and some introduction for the discussion of the Greek language in Greek. I appreciated this so much that I made a Quisition flashcard deck of these terms.

Category: language Tags: Greek

Bundled wordplay

I ran across this enchanting line in Sirach 21:9 in the LXX:

στιππύον συνηγμένον συναγωγὴ ἀνόμων

The NETS renders this as "a gathering of the lawless is bundled flax." (Followed by "and a flame of fire is their end.") I was struck by the use of the same lexical root συνάγω for the subject and the adjective of the object, which makes for a great wordplay. To bring that out in English, you would have to do something more like "a gathering of the lawless is gathered flax", which does not quite sound right, unfortunately.

The history of Greek grammars

Mike Aubrey is working on a series of posts detailing the history of Greek grammars. The long outline already has five posts in addition to an ever-expanding bibliography. Although I took a lot of Greek courses, I never learned much about the ancient and subsequent views on Greek grammar. I am enjoying the posts so far.

Category: language Tags: Greek

Polytonic Greek in Dvorak layout for Linux

I type with the Dvorak keymap. I also type polytonic Greek, which in every operating system I use is based on the Qwerty keymap. So it gets very confusing and annoying to switch back and forth.

So I modified the Linux Xorg Greek keymap to correspond to the English Dvorak layout. And I'd like to share.

  1. Download the modified keymap.
  2. For steps 3-5 you'll need root privileges, so use sudo or su to get them.
  3. Find your xkb symbols directory - On Debian-based systems it is in /usr/share/X11/xkb/symbols, but it has also been placed in /etc/X11/...
  4. Back up your existing Greek layout by copying "gr" from that folder to a safe place.
  5. Remove the ".txt" extension from the modified keymap you downloaded and place the file in your xkb symbols directory.
  6. From a terminal issue this command to "reset" your keymap to normal:
    setxkbmap -layout us -variant dvorak
    This doesn't do anything, but you'll want that in your terminal history so you can get back to regular English Dvorak.
  7. Issue the following command to make the keymap active:
    setxkbmap -layout gr -variant dvpoly

Now you are typing polytonic Greek with a Dvorak layout. Now, as others have noted, this does not have the spirit of the Dvorak keymap. It is not constructed based on actual usage of Greek, and it will not be any more efficient than the Qwerty-based layouts (and given the placement of the accent keys in my layout, it may actually be worse). The purpose of this keymap is purely to make it easier for English Dvorak typists to transition to polytonic Greek.

The primary departure I made from the English Dvorak paradigm was moving the semi-colon so that the Greek accent dead keys could be closer to one another. In this keymap, the "semi-colon" (actually Greek question mark) can be typed using the "Q" key on the Qwerty layout, and the acute and grave accents can be typed using the "Z" and "X" keys respectively (which are ";" and "q" in the Dvorak layout). Here is a screenshot of the layout for a more explicit reference.

The xkb keymaps are under the MIT license. I used this non-polytonic layout as a guide for my work. Please post with comments, questions, bugs, etc. For more info on how to type polytonic Greek in Linux, see this excellent post on B-Greek. It references the Qwerty-based layout, but the same principles apply.

Koine Greek wordlists?

I have mostly had the Greek New Testament and related works in mind as I have been working through Natural Language Processing with Python. One useful resource discussed in the book is a word-list, which is just a simple collection of words gathered from a particular language or corpus. Text auto-completion is an example of the usefulness of such lists. The only ancient Greek wordlist included with NLTK is the Swadesh list, which is fairly limited (only 207 entries).

Before I invest the effort in creating one, does anyone know of any existing unicode Koine Greek word-lists? If not, it should not be too difficult to programmatically compile one from the Greek NT. Either way I will share what I find (and any code which happens to be written for this end).
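
For what it's worth, compiling such a list from plain-text files could look something like the Python 3 sketch below. It assumes a directory of UTF-8 text files (such as the "out" directory from the SBLGNT prep script), treats U+2019 as part of an elided word, and uses names of my own invention (`WORD_PATTERN`, `build_wordlist`, `wordlist_from_directory`). It collects inflected forms as they appear, not lemmas.

```python
import re
import unicodedata
from pathlib import Path

# Runs of letters, optionally followed by U+2019 so elided forms stay whole.
WORD_PATTERN = re.compile(r"[^\W\d_]+\u2019?")


def build_wordlist(texts):
    """Return the sorted list of distinct word forms found in texts."""
    words = set()
    for text in texts:
        # NFC keeps precomposed and decomposed spellings from being
        # counted as different words.
        words.update(WORD_PATTERN.findall(unicodedata.normalize("NFC", text)))
    return sorted(words)


def wordlist_from_directory(directory="out"):
    """Build a wordlist from every file in directory (assumed UTF-8)."""
    paths = sorted(p for p in Path(directory).iterdir() if p.is_file())
    return build_wordlist(p.read_text(encoding="utf-8") for p in paths)
```

Note that `sorted` orders by code point, which is not the same as Greek alphabetical order once accents and breathings are involved, and that capitalized and lowercase forms are kept as separate entries.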

Sidgwick's First Greek Writer

First Greek Writer by Arthur Sidgwick is available for download on the Textkit Greek resources page. There is also an answer key for the exercises. I have decided to (finally) work my way through a Greek composition text. At the moment I am reading through Sidgwick's notes, and then I'll begin the actual exercises. The notes themselves are quite helpful - sometimes Sidgwick has a way of stating a rule which is much simpler and easier to remember than other presentations I have read. I'll post any thoughts on the book here, if they arise.

Category: language Tags: Greek

Greek for "blog"

I was reading a comment composed in Greek on a blog the other day and found that it used το βλογ for "blog." That is essentially a transliteration. But I started wondering what a good translation would be.

"Blog" is the result of "web" and "log." I think we can take λογος as a sufficient analog for "log." But what about the "web" part?

I often think of the "web" analogically as a spider web. The Greek for this is αραχνη, but that seems too specific to spiders. The other possibility would be "net," which made me think of δικτυον (commonly used in the context of fishing, but not exclusively). This seems to be the better match, since the modern Greek word for "internet" is Διαδικτυο.

So I propose δικτυλογος as a Greek translation of blog. Google has no results for it, so I could be a trend-setter (or a complete dork). Please comment with alternative suggestions or corrections.

Category: language Tags: Greek

Actually reading Greek

Daniel R. Street has an excellent post on "reading" Greek as it relates to the teaching of the language. My favorite part, describing what is wrong with "grinding" through Greek:

The student is focused on form, grammar and syntax, and often on analyzing, labeling or parsing various elements, not on comprehension. Indeed, comprehension is delayed until the translation into English is completed and can be read by the student.

The struggle is to transform Greek pedagogy so that it encourages reading. However I suspect that there is also the hurdle that some pupils will not really want to learn how to fluently read Greek.

Category: language Tags: Greek

Greek on the train

This morning I decided to bring my Greek New Testament on the train. The copy I decided to bring is the Utriusque Testamenti, which is the Greek New Testament (Nestle-Aland 27) bound with the Hebrew Bible (BHS). Note to self: "it's too big; bring a smaller one next time." I'll probably bring the Zondervan Reader's Greek for ease of vocabulary lookups.

I am setting out to read the Gospel of Luke. Being a bit rusty, I found the reading slow going at first, but I am already picking up the pace. It's been a while since I read Greek on the train, so I will have to get used to the sideways glances of my neighbors (and occasional outright stares). Wish me luck.

Category: language Tags: Greek

Whom do scholars work for?

Mike Aubrey posed this question:

Should placing a priority on exegetical payoff in linguistic & grammatical research be viewed as a strength or a weakness?

Upon reading this I was conflicted. I ultimately agree with Mike's point, that the primary goal of linguistic research should be to understand the language in and of itself. But I am reluctant to agree to it outright, since I think there is already enough bashing of "pragmatic" Greek instruction going on. So I pointed out that researchers should have a further goal in mind for their work.

I'll appropriate the term "primary" (as used by Aubrey in his reply to me) to quite literally mean "first." Yes, the first thing a linguistic researcher needs to do is to understand the structure of the language in question. Further tasks such as exegesis and translation should depend on this foundation. But I do not want to confuse "first" with "most important." Very few people are going to be interested in the structure of Hellenistic Greek for its own sake. They will want to use this knowledge to gain some other benefit. When it comes to the biblical languages and corpus, most people are going to want interpretive insight.

It is good for scholars to assist seminarians and pastors with their ultimate goal of better understanding the scriptures. As I understand giftedness in the church, scholars work to serve the body of Christ with knowledge. I believe that "pragmatic" Greek instruction is an absolutely appropriate ultimate goal for scholarly research. This is because Christians are called to study and preach the scriptures, not to learn Greek.

I think this same basic argument applies in secular contexts as well: know your audience.

Category: language Tags: Greek