The SBLGNT is available as a plain-text download, which is my
personal favorite format for text processing. I have been wanting to put
the SBLGNT into a Natural Language Toolkit corpus for ease in text
processing for quite some time, and decided to get around to it.
The plain text of the SBLGNT has a few undesirable
features for this task. First, each verse is prefixed with the verse
number and the tab character, which is great for many applications but
not for corpus linguistics. Second, the text contains Windows-style
linebreaks and other extraneous whitespace. Third, the text contains
So I wrote a script to download the plaintext archive, extract the
text, and normalize it for use in NLTK.
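The first two clean-up steps described above can be sketched roughly as follows. This is only an illustration, not the actual script: the function name normalize_sblgnt_text, the sample reference "Matt 1:1", and the assumption that lines without a tab (such as a title line) should be kept whole are all hypothetical.

```python
import re


def normalize_sblgnt_text(raw):
    """Return only the verse text, one verse per line, with tidy whitespace."""
    # Convert Windows (CRLF) and bare CR line breaks to LF.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    cleaned = []
    for line in text.split("\n"):
        # Drop everything up to and including the first tab (the verse
        # reference). Assumption: lines without a tab are kept whole.
        _, tab, rest = line.partition("\t")
        body = rest if tab else line
        # Collapse runs of whitespace and trim both ends.
        body = re.sub(r"\s+", " ", body).strip()
        if body:
            cleaned.append(body)
    return "\n".join(cleaned)


# Made-up sample input in the shape described above.
sample = "Matt 1:1\tΒίβλος γενέσεως  Ἰησοῦ χριστοῦ \r\n\r\n"
print(normalize_sblgnt_text(sample))
```

Working on one whole string keeps the sketch short; a real script would also need the download and extraction steps, which are left out here.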
First, download and extract, or check out, the repo. To install the requirements:
$ pip install -r requirements.txt
Next, run the script:
$ python sblgnt-nltk.py
Now you have a collection of text files, one for each book of the New
Testament, in a directory called "out". You can now use these with
NLTK. For example:
>>> import nltk
>>> sblgnt = nltk.corpus.PlaintextCorpusReader('out', '.*', encoding='utf-8')
>>> sblgnt_text = nltk.text.Text([w.encode('utf-8') for w in sblgnt.words()])
You end up with sblgnt as an NLTK corpus object and sblgnt_text as an
NLTK text object. You can refer to the NLTK documentation for the
various uses of these. Please take note of the encodings. If you don't
pay attention, you'll get lots of encoding errors when working with
Unicode text in NLTK.
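To make the encoding point concrete, here is a small standalone sketch. It assumes Python 3, where text (str) and encoded data (bytes) are distinct types; the sample word is arbitrary.

```python
import unicodedata

# NFC so that each accented letter is a single code point.
word = unicodedata.normalize("NFC", "θεοῦ")

data = word.encode("utf-8")       # str -> bytes, as stored in the files in "out"
roundtrip = data.decode("utf-8")  # bytes -> str, using the matching codec
garbled = data.decode("latin-1")  # wrong codec: no error is raised, but the text is mangled

print(len(word), len(data))       # code points vs. bytes
print(roundtrip == word, garbled == word)
```

The wrong codec fails silently here, which is why it pays to state the encoding explicitly wherever text is read or written.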
One thing you can do is run the collocations method on sblgnt_text:
>>> sblgnt_text.collocations()
Building collocations list
τοῦ θεοῦ; ἐν τῷ; ἀλλ ’; ἐν τῇ; ὁ Ἰησοῦς; δι ’; ἐπ ’; ὁ θεὸς; μετ ’;
εἰς τὴν; ἀπ ’; τῆς γῆς; λέγω ὑμῖν; Ἰησοῦ Χριστοῦ; ἐκ τοῦ; τῷ θεῷ;
τοῦ κυρίου; κατ ’; εἰς τὸ; οὐκ ἔστιν
I'll have to look into tweaking the NLTK tokenizer because, as you can
see, it treats the elision mark (’) as a separate token, which may or
may not be grammatically correct (I'll have to think about that and ask around).
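One possible tweak, sketched here with only the standard library, is a regular expression that keeps the elision mark attached to the word before it. The function name tokenize_greek is hypothetical, and the sketch assumes elision is written with U+2019 (’), as in the output above. It is not a settled answer to the grammatical question.

```python
import re
import unicodedata

# Assumption: elision is marked with U+2019; a plain apostrophe is accepted too.
ELISION_MARKS = "\u2019'"

# A word plus an optional trailing elision mark, or one punctuation character.
TOKEN_RE = re.compile(r"\w+[" + ELISION_MARKS + r"]?|[^\w\s]")


def tokenize_greek(text):
    """Split text into tokens, keeping an elision mark with its word."""
    # NFC keeps accented Greek letters as single code points that \w matches.
    return TOKEN_RE.findall(unicodedata.normalize("NFC", text))


print(tokenize_greek("ἀλλ’ ἐν τῷ θεῷ."))
```

A side effect is that a closing single quotation mark directly after a word would also be glued to it. The same pattern could presumably be given to nltk.tokenize.RegexpTokenizer and passed to PlaintextCorpusReader through its word_tokenizer argument, though that combination is untested here.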
Another cool trick is the generate method:
>>> sblgnt_text.generate(50)
Building ngram index...
ΠΡΟΣ ΚΟΡΙΝΘΙΟΥΣ Α Παῦλος ἀπόστολος Χριστοῦ Ἰησοῦ καὶ τοῖς βουνοῖς ·
Καλύψατε ἡμᾶς · πολλοὶ ἐλεύσονται ἐπὶ τῷ λόγῳ διὰ τῆς στενῆς θύρας ,
ὅτι τὸ μωρὸν τοῦ θεοῦ . Καὶ ἐγένετο ἐν τῷ βυθῷ πεποίηκα · ὁδοιπορίαις
πολλάκις , ἐν κόποις , ἐλπίδα δὲ ἔχοντες αὐξανομένης τῆς πίστεως ,
So that's that. At some point I'll attempt to make a tagged text based
on the MorphGNT (which is being re-based off SBLGNT).