The Library Basement
Reading under ground


How many unique words in the Greek NT?

How many unique words are there in the Greek New Testament? Well, that depends on how you count.

I am doing some research and experimentation on indexing the Greek NT (or Koine Greek in general). One crucial aspect of indexing is to normalize the text so that potential search matches are not missed by the presence of punctuation, capitalization, contextual accentuation, etc.
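As a rough illustration of the kind of normalization meant here, the sketch below lowercases a token, strips punctuation, and turns a grave accent back into an acute. The function name `normalize_token` and the exact rules are assumptions made for illustration, not the scheme MorphGNT uses, and other contextual accent changes (such as the extra accent before an enclitic) are not handled.

```python
import unicodedata

GRAVE = "\u0300"    # combining grave accent
ACUTE = "\u0301"    # combining acute accent
ELISION = "\u2019"  # right single quotation mark, kept because it marks elision


def normalize_token(token):
    """Return a search-friendly form of a single Greek token (illustrative rules)."""
    # Decompose so that accents become separate combining characters.
    decomposed = unicodedata.normalize("NFD", token.lower())
    # A grave accent is only a contextual variant of the acute.
    decomposed = decomposed.replace(GRAVE, ACUTE)
    # Drop punctuation (Unicode categories starting with "P"), except the elision mark.
    kept = [
        ch for ch in decomposed
        if ch == ELISION or not unicodedata.category(ch).startswith("P")
    ]
    return unicodedata.normalize("NFC", "".join(kept))


print(normalize_token("Καὶ,"))
print(normalize_token("θεὸς") == normalize_token("θεός"))
```

Working on the decomposed (NFD) form keeps the accent handling to a single string replacement; recomposing to NFC at the end gives one consistent form for an index key.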

At the same time there are some words which have the same normalized form which we should nonetheless count as different words, such as when morphology overlaps or different lemmas get inflected to the same forms.

So I set out to analyze the Greek NT and find how many unique instances of words there are. Namely, words are grouped if they share the same lemma, normal form, and parsing. To begin I used MorphGNT, which is based on SBLGNT. MorphGNT contains a column for the normal form of each word, as well as the parsing information, so it is just the ticket.

I used Python to find all unique instances of lemma, normal form, and parsing info. Then I used James Tauber's pyuca module to sort the results. You can find them in a compressed file here, sorted by lemma.

Using this methodology, I found 18,873 unique words in the Greek New Testament.

Here is a sample of the output:

ἅγιος ἁγίων A- ----GPM-
ἅγιος ἁγίων A- ----GPN-
ἅγιος ἁγιωτάτῃ A- ----DSFS
ἁγιότης ἁγιότητι N- ----DSF-
ἁγιότης ἁγιότητος N- ----GSF-
ἁγιωσύνη ἁγιωσύνῃ N- ----DSF-
ἁγιωσύνη ἁγιωσύνην N- ----ASF-
ἁγιωσύνη ἁγιωσύνης N- ----GSF-
ἀγκάλη ἀγκάλας N- ----APF-
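The counting step described above can be sketched roughly as follows. This is not the original script: the seven-column layout (reference, part of speech, parsing, text, word, normalized word, lemma) is an assumption about the MorphGNT files, the function name `unique_forms` is hypothetical, and the sample rows are made up for illustration.

```python
def unique_forms(lines):
    """Collect unique (lemma, normalized form, part of speech, parsing) tuples.

    Assumes seven whitespace-separated columns per line:
    reference, part of speech, parsing, text, word, normalized word, lemma.
    """
    seen = set()
    for line in lines:
        fields = line.split()
        if len(fields) != 7:
            continue  # skip blank or malformed lines
        _ref, pos, parsing, _text, _word, normalized, lemma = fields
        seen.add((lemma, normalized, pos, parsing))
    # Plain code point order; a UCA collator would sort Greek properly.
    return sorted(seen)


# Made-up rows in the assumed layout, for illustration only.
sample = [
    "000000 A- ----GPM- ἁγίων, ἁγίων ἁγίων ἅγιος",
    "000000 A- ----GPN- ἁγίων ἁγίων ἁγίων ἅγιος",
    "000000 A- ----GPM- Ἁγίων ἁγίων ἁγίων ἅγιος",
]
for lemma, normalized, pos, parsing in unique_forms(sample):
    print(lemma, normalized, pos, parsing)
```

The first and third rows differ only in the surface text, so they collapse into one entry, while the neuter parsing stays separate. For proper Greek ordering, the `sorted` call could take a key built from pyuca's `Collator().sort_key`.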

Anyway, I hope to have more to share on this front later, but this just tickled my fancy.

A categorized, tagged Greek New Testament corpus

I have published a categorized, tagged Greek New Testament useful for natural language processing. I am calling it sblgnt-corpus. The text comes from the SBLGNT and the morphological tags come from the MorphGNT project.

The text is broken up with one book per file. Each file has one or more categories (e.g. gospel and pauline). In the files there is one sentence (not verse) per line. Sentences are demarcated by punctuation . ; and ·. This makes it easy to tokenize sentences by splitting on newlines. Each word is accompanied by the morphological tag in the word/tag format (NLTK will automatically split word and tag on the slash). The part of speech tag is separated from the parsing information with a hyphen, which enables the use of the simplify tags function in NLTK.

Here is an example:

εὐθυμεῖ/V-3PAIS τις/RI-NSM ;/;
ψαλλέτω/V-3PADS ./.
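To make the format concrete, here is a small sketch that reads one such line without NLTK. The helper names `parse_sentence` and `simplify` are hypothetical; the line itself is the first sentence of the example above.

```python
def parse_sentence(line):
    """Split one corpus line into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the last slash so the word itself is left intact.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def simplify(tag):
    """Keep only the part of speech, i.e. the text before the first hyphen."""
    return tag.split("-")[0]


line = "εὐθυμεῖ/V-3PAIS τις/RI-NSM ;/;"
tagged = parse_sentence(line)
print(tagged)
print([(word, simplify(tag)) for word, tag in tagged])
```

Punctuation is tagged with itself and contains no hyphen, so `simplify` leaves it unchanged.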

Here follows an example of how to load this corpus into NLTK:

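from nltk.corpus.reader import CategorizedTaggedCorpusReader

def simplify_tag(tag):
    # Keep only the part of speech (the text before the first hyphen).
    if '-' in tag:
        tag = tag.split('-')[0]
    return tag

sblgnt = CategorizedTaggedCorpusReader('sblgnt-corpus/',
    r'\d{2}-.*', encoding=u'utf8',
    tag_mapping_function=simplify_tag,  # assumed: the NLTK 2.x keyword behind simplify_tags=True
    cat_file='cats.txt')  # assumed name of the category file in the corpus directory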

Now through the sblgnt object you have access to tagged words - sblgnt.tagged_words(), simplified tags - sblgnt.tagged_words(simplify_tags=True), tagged sentences - sblgnt.tagged_sents(), and textual categories - sblgnt.words(categories='gospel').

That should be enough to kickstart the exploration of the Greek New Testament with natural language processing.

Better tokenization of the SBLGNT

In my previous post on this topic I mentioned that the default NLTK tokenizer was erroneously treating elisions as separate tokens. They should be grouped with the word to which they are attached in my opinion. I decided today to look into this and fix the problem.

The SBLGNT uses Unicode character 0x2019 ("right single quotation mark") for elisions. The default tokenizer for the NLTK PlaintextCorpusReader is apparently the wordpunct_tokenize function. This uses the following regular expression for matching tokens:

\w+|[^\w\s]+

That essentially means: match any sequence of alphanumeric characters (\w+), or (|) any sequence made up of neither alphanumeric characters nor whitespace ([^\w\s]+), e.g. punctuation. The problem is that in Python's implementation of Unicode, 0x2019 is not considered an alphanumeric character, so it is getting tokenized on its own by the latter expression meant to catch punctuation.

So I crafted a new regular expression to alter this behavior:

\w+\u2019?|[^\w\s\u2019]+

So now for each sequence of alphanumeric characters, there can optionally be a 0x2019 at the end to catch elisions (I also explicitly exclude 0x2019 from the latter expression, though I am not entirely sure this is necessary). So now to actually use this:

import nltk
tokens = nltk.tokenize.regexp.regexp_tokenize(text, u'\w+\u2019?|[^\w\s\u2019]+')
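The difference between the two patterns can be checked without NLTK. The sketch below uses Python 3's `re` module directly on a short made-up phrase; the variable names are illustrative, and it assumes the text is in precomposed (NFC) form, since `\w` does not match bare combining accents.

```python
import re
import unicodedata

DEFAULT_PATTERN = r"\w+|[^\w\s]+"
ELISION_PATTERN = r"\w+\u2019?|[^\w\s\u2019]+"

# NFC keeps accented letters precomposed, so \w matches them as single letters.
text = unicodedata.normalize("NFC", "ἀλλ\u2019 ἐγώ, δι\u2019 αὐτοῦ.")

default_tokens = re.findall(DEFAULT_PATTERN, text)
elision_tokens = re.findall(ELISION_PATTERN, text)

print(default_tokens)  # the elision mark appears as a token of its own
print(elision_tokens)  # the elision mark stays attached to its word
```

With the default pattern the sample yields eight tokens, two of them bare elision marks; with the custom pattern it yields six, and ordinary punctuation is still split off.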

Using the regexp_tokenize function we can tokenize a text using any old regular expression our heart desires. I put a full example of this in the same repo as a load script. It should be run after the prep script has downloaded and prepared the data. The load script provides an example workflow for getting an NLTK text object and then running collocations() and generate(). Enjoy!

Prep the SBLGNT for use as an NLTK corpus

The SBLGNT is available as a plain-text download, which is my personal favorite format for text processing. I have been wanting to put the SBLGNT into a Natural Language Toolkit corpus for ease in text processing for quite some time, and decided to get around to it yesterday.

The plain text of the SBLGNT has a few undesirable features for this task. First, each verse is prefixed with the verse number and the tab character, which is great for many applications but not for corpus linguistics. Second, the text contains Windows-style linebreaks and other extraneous whitespace. Third, the text contains text-critical signs.

So I wrote a script to download the plaintext archive, extract the text, and normalize it for use in NLTK.
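The normalization part can be sketched along these lines. This is not the script from the repo: the function name `clean_verse` is hypothetical, the set of text-critical signs (U+2E00 through U+2E05) is an assumption about what the download contains, and the sample line is made up.

```python
import re

# Assumed set of text-critical signs; adjust to what the download actually contains.
CRITICAL_SIGNS = "\u2e00\u2e01\u2e02\u2e03\u2e04\u2e05"
SIGNS_TABLE = {ord(ch): None for ch in CRITICAL_SIGNS}


def clean_verse(line):
    """Strip the verse reference, critical signs, and stray whitespace from one line."""
    # Drop everything up to and including the first tab (the verse reference).
    _ref, tab, rest = line.partition("\t")
    text = rest if tab else line
    # Remove the text-critical signs.
    text = text.translate(SIGNS_TABLE)
    # Collapse runs of whitespace, which also removes the trailing "\r\n".
    return re.sub(r"\s+", " ", text).strip()


raw = "Matt 1:1\t\u2e00Βίβλος  γενέσεως\r\n"  # made-up line in the assumed layout
print(clean_verse(raw))
```

Lines without a tab, such as a book title, pass through with only the whitespace cleanup applied.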

First, download and extract or check out the repo. To install requirements:

$ pip install -r requirements.txt

Next, run the script:

$ python

Now you have a collection of text files, one for each book of the New Testament, in a directory called "out". You can now use these with NLTK. For example:

>>> import nltk
>>> sblgnt = nltk.corpus.PlaintextCorpusReader('out','.*',encoding='utf-8')
>>> sblgnt_text = nltk.text.Text([w.encode('utf-8') for w in sblgnt.words()])

You end up with sblgnt as an NLTK corpus object and sblgnt_text as an NLTK text object. You can refer to the NLTK documentation for the various uses of these. Please take note of the encodings. If you don't pay attention, you'll get lots of encoding errors when working with a unicode text and NLTK.

One thing you can do is run the collocations method on sblgnt_text:

>>> sblgnt_text.collocations()
Building collocations list
τοῦ θεοῦ; ἐν τῷ; ἀλλ ’; ἐν τῇ; ὁ Ἰησοῦς; δι ’; ἐπ ’; ὁ θεὸς; μετ ’; εἰς τὴν; ἀπ ’; τῆς γῆς; λέγω ὑμῖν; Ἰησοῦ Χριστοῦ; ἐκ τοῦ; τῷ θεῷ; τοῦ κυρίου; κατ ’; εἰς τὸ; οὐκ ἔστιν

I'll have to look into tweaking the NLTK tokenizer, because, as you can see, it is treating elisions as tokens, which may or may not be grammatically correct (I'll have to think about that and ask around). Another cool trick, the generate method:

>>> sblgnt_text.generate(50)
Building ngram index...
ΠΡΟΣ ΚΟΡΙΝΘΙΟΥΣ Α Παῦλος ἀπόστολος Χριστοῦ Ἰησοῦ καὶ τοῖς βουνοῖς · Καλύψατε ἡμᾶς · πολλοὶ ἐλεύσονται ἐπὶ τῷ λόγῳ διὰ τῆς στενῆς θύρας , ὅτι τὸ μωρὸν τοῦ θεοῦ . Καὶ ἐγένετο ἐν τῷ βυθῷ πεποίηκα · ὁδοιπορίαις πολλάκις , ἐν κόποις , ἐλπίδα δὲ ἔχοντες αὐξανομένης τῆς πίστεως ,

So that's that. At some point I'll attempt to make a tagged text based on the MorphGNT (which is being re-based off SBLGNT).

Christianity and Copyright (4): The future is bright

There are practical and ethical problems with using copyright law to restrict the copying and distribution of the Bible and other Christian texts, study materials, and music. This mode of financing is still predominant among Bible publishers. However, I am much more optimistic now than I was when I first started writing about this issue in 2008. There have been a lot of positive developments, and I believe that the momentum in Bible publishing is behind permissively-licensed texts. This is good for everyone.

Our story begins with the venerable NET Bible, which has actually been around for quite a while. The NET's licensing terms are somewhat restrictive and confusing, but were nonetheless a far sight more permissive than anything else available at the time of its release. The NET Bible was a bold and deliberate step in the right direction, and it is no accident that it coincided with the rise of the world wide web. From the preface:

We don’t like the copyright notice on the second page of the NET Bible, but we don’t yet know the best way to fix it. The reason for this dilemma is that we stand at the beginning of a new era made possible by the Internet. New approaches to ministry, publishing, distribution, and collaboration are made possible by the Internet. When the first Bibles and books began to be printed rather than copied by hand, new issues emerged (plagiarism, author’s rights, freedom of the press versus censorship, copyright laws, etc.). It is now time to recognize that the copyright and permissions conventions carried over from printed books must now be upgraded for the Internet age.

I believe it is fair to say that since the time that preface was written, the new era of online copyright licenses it pined for has come about. There is a lot more infrastructure to support freely-licensed works in the internet age.

While the NET Bible got the ball rolling, other Bible resources have been building momentum of late. Logos Bible Software has published the Lexham English Bible and the Society of Biblical Literature Greek New Testament (SBLGNT) under fairly permissive licenses. I do have some minor gripes about these licenses, but I find the SBLGNT development particularly encouraging given the frustrating history of copyright issues surrounding modern eclectic editions of the Greek New Testament.

Even in cases where texts have more traditional licenses (and by traditional I mean "similar to the NIV") there have been positive developments. Many publishers are permitting or creating their own excellent web interfaces to their texts, so that anyone with an internet connection can read them on demand. Furthermore some have created public APIs, making it so their texts can be integrated into third-party applications. The ESV is a pioneer in this regard.

Yes, now is an exciting time for Christianity and Copyright, but there is still more progress to be made. On the whole, I'd like to see these efforts toward more permissive licensing and accessibility become more unified and standardized. To that end I have previously suggested that content publishers ought to use the already-established Creative Commons suite of licenses. There is no point in reinventing the wheel, and I am skeptical that the terms of all these custom licenses are truly necessary. Using a standard group of easily-understood licenses makes it easier for people to understand how they can make use of the work.

In the realm of online applications and interoperability, I'd like to see more moves toward a standard data interchange format as well as a common API for accessing Bible data on the web. This will make newly-liberated texts easier to make use of in modern technologies. Work is already being done on this front, and I believe it will come to fruition.

Additionally I would like to see this spirit of openness spread to other works of Christian scholarship and culture. Worship music, academic texts, journal articles, devotionals, and the like should be freely licensed just like the Bible. It is not a big stretch to apply the same arguments I put forth concerning the scriptures to other matters of Christian culture.

I am quite optimistic about Christianity and Copyright. The current legacy of restrictive copyrights results mostly from publisher-financed translation and fear of change. Christians intrinsically agree that the Bible should be freely-licensed. The only reason restriction has been tolerated is to raise funds, but that is no longer necessary. We are at a temporary impasse, and I think the age of limited quotation to 250 verses will be only a brief memory in the history of the church.

There is yet a lot of work to do. Anyone can help out, by petitioning copyright holders to change their licenses, by licensing their own work permissively, and by participating in projects which make sharing of the scriptures easier. By creating new works and pressuring the rights holders of existing works to change their stance, it will not be long until there is a very fertile ecosystem of freely-licensed Bibles and other Christian works. That is a time to which I am looking forward.

XML and the Bible

While working on an importer to bring the SBL Greek New Testament into Open Scriptures, I noticed some interesting features of the SBLGNT XML file. (I promised that I would try to exclude posts of a technical nature from this blog, but I am breaking that promise, because I think this technical discussion is interesting and applicable to Biblical studies.)

The SBLGNT's XML representation of the Biblical text makes an interesting distinction between tags which have child elements and childless tags. That is, normal XML tags encompass the actual Greek text and its structures (such as paragraphs and books), while childless tags represent insertions which are not original to the text. Here is a truncated Matthew 1:1 in the SBLGNT XML as an example:

<book id="Mt">
<title>ΚΑΤΑ ΜΑΘΘΑΙΟΝ</title>
<verse-number id="Matthew 1:1">1:1</verse-number>
<suffix> </suffix>
<suffix>. </suffix>

Notice how there is no "verse" tag which encompasses all of the included text. Instead "verse-number" is a tag which is inserted wherever the verse breaks are located. This is opposed to the "p" (paragraph) tag, which encompasses all of the child "w" (word) and "suffix" (spaces and punctuation) tags. Paragraphs are of course present in the original biblical text.
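One practical consequence of this milestone style is that verses have to be reassembled by walking each paragraph in order. Here is a minimal sketch using Python's standard ElementTree. The inline fragment is abbreviated and illustrative (it only follows the structure described above and is not copied from the SBLGNT file), and the function name `verses` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated, illustrative fragment following the structure described above.
SAMPLE = """
<book id="Mt">
<title>ΚΑΤΑ ΜΑΘΘΑΙΟΝ</title>
<p>
<verse-number id="Matthew 1:1">1:1</verse-number>
<w>Βίβλος</w><suffix> </suffix><w>γενέσεως</w><suffix>. </suffix>
<verse-number id="Matthew 1:2">1:2</verse-number>
<w>Ἀβραὰμ</w><suffix> </suffix><w>ἐγέννησεν</w><suffix>. </suffix>
</p>
</book>
"""


def verses(book):
    """Group the text of each paragraph by the verse-number milestones inside it."""
    result = {}
    current = None  # kept across paragraphs, so a verse may span a paragraph break
    for paragraph in book.iter("p"):
        for child in paragraph:
            if child.tag == "verse-number":
                current = child.get("id")
                result.setdefault(current, "")
            elif child.tag in ("w", "suffix") and current is not None:
                result[current] += child.text or ""
    return {ref: text.strip() for ref, text in result.items()}


book = ET.fromstring(SAMPLE.strip())
for ref, text in verses(book).items():
    print(ref, text)
```

Because the verse marker is a sibling of the words rather than their parent, the paragraph structure survives untouched and the verse grouping is layered on top of it.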

One thing I might have done to take this principle even further would be to put the book titles where they appear in the Greek manuscripts. In SBLGNT XML, the title is always the first child element of the "book" tag. However, that is not always where the title was in the manuscripts. Sometimes it was written at the end of the book.

I like the distinction between textual forms and externally imposed structures as reflected in this XML document. I'm not sure what Logos' exact thinking was behind these design choices, but I think I've identified it.

SBL Greek New Testament Licensing

A lot has been said already about the SBL Greek New Testament text (SBLGNT) which has just been released. It is not every day that a new eclectic text is released. I think that what's just as noteworthy as the release itself is the license under which the text is distributed.

As has been discussed on the Open Scriptures mailing list, the license is fairly permissive, and allows for gratis distribution. In Creative Commons terms, it is essentially an Attribution/Non-commercial license. Actually the non-commercial part is qualified, because the text is allowed in sold works under certain circumstances. There are a few curious features of the license I'd like to address.

Derivative Works?

The text of the license is silent on whether it is permissible to distribute modified copies of the SBLGNT. For example, am I allowed to "fork" the text and distribute it with my own changes? It's not clear to me from the license. The license does specifically disallow one type of derivative work (I'll discuss that below), so the implication is that derivative works are otherwise OK.

Reporting Requirement

One of the requirements for qualified commercial distribution is reporting:

If you give away the SBLGNT for use with a commercial product or sell a print or electronic work containing more than 500 verses from the SBLGNT, you must annually report the number of units sold, distributed, and/or downloaded to the Society of Biblical Literature’s Rights and Permissions Office.

This is not an overly harsh measure, in my opinion, and I can certainly understand why they require it. But ideally a license would have no reporting requirement. It's just a little extra burden.

English Diglots

As I mentioned above, the SBLGNT license has one prominent exception for redistribution: if you are going to distribute an English diglot, you need to obtain a separate license from SBL. Presumably this is because they are considering publishing their own English-Greek diglot. You can publish a diglot with languages other than English. I also wonder whether an edition with three or more languages including the SBLGNT and English is permissible. It is a bit of an odd stipulation for a license given that it is so particular, but it is not the worst thing in the world.


As I have written before, I think that the Christian scriptures should not have any copy restrictions placed on them. So I think it would be ideal for the SBLGNT text and apparatus to be released into the Public Domain. Barring that, I think I would recommend re-licensing under something like a Creative Commons Attribution/Non-commercial license which is a bit more established. That would iron out some of the ambiguities of the current license.

Here I've gone and looked the gift horse in the mouth. I am quite pleased by the release of the SBLGNT. Moreover, I think its license is one of the best available for a scriptural text. The permissive license instantly makes this text one of the most important available. It is definitely a step in the right direction. I believe we'll be working on getting the text imported into Open Scriptures before too long.