The Library Basement
Reading under ground


Moved some git repositories

I have become enamored of Gogs, a self-hosting solution for git repositories, so I've moved most of my personal repositories from a certain large centralized git service provider to my own instance. Check it out:

I understand this may require collaborators to actually use git in the manner in which it was designed - namely as decentralized version control. If you'd like to submit a patch to one of my projects, you'll need to craft a git pull request and email me.

Updates to lxxmorph-unicode and lxxmorph-corpus

The conversion of the original CATSS betacode into unicode relies on a script published by James Tauber. James has just done an early release of greek-utils, which includes an updated version of the betacode converter. I tested the new version with the CATSS conversion process and found that it produces the same output, with one exception.

Thanks to a bug-fix in the new version, a fault in the existing unicode texts has been discovered and corrected. Specifically, initial ρ was marked with smooth breathing marks instead of rough breathing. This changed 1,017 lines in 58 of the books.
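To make the symptom concrete, here is a minimal Python sketch of that kind of repair. It is not the fix that landed in greek-utils: the function name fix_initial_rho is made up for illustration, and all it does is swap a smooth breathing on a word-initial rho for a rough one.

```python
import unicodedata

PSILI = "\u0313"  # combining smooth breathing
DASIA = "\u0314"  # combining rough breathing


def fix_initial_rho(word):
    """Return word with a smooth breathing on initial rho made rough.

    Hypothetical helper for illustration only; not part of greek-utils.
    """
    # Decompose so the breathing mark is a separate combining character.
    decomposed = unicodedata.normalize("NFD", word)
    if (len(decomposed) >= 2
            and decomposed[0] in "ρΡ"
            and decomposed[1] == PSILI):
        decomposed = decomposed[0] + DASIA + decomposed[2:]
    # Recompose into precomposed characters where they exist.
    return unicodedata.normalize("NFC", decomposed)
```

Words that do not start with rho plus smooth breathing come back unchanged (apart from NFC normalization).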

Check the lxxmorph-unicode and lxxmorph-corpus repositories for updates.

LXX Mean Log Word Frequency

In another recent post, James Tauber covers the topic of mean log word frequency as a measure of difficulty of reading. I adapted James' code to address the CATSS LXX data (new format still forthcoming, I promise!).
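For anyone who wants the general idea without reading James' code, here is a rough sketch of the measure. It is neither his code nor my adaptation of it: the function name, the input shape (a dict of chapter label to word list), and the sign convention (negative log of relative frequency across the whole corpus, so that rarer words score higher) are assumptions. The numbers reported below are clearly on a different scale.

```python
import math
from collections import Counter


def mean_log_frequency(chapters):
    """Map each chapter label to the mean of -log(relative frequency).

    chapters: dict of label -> list of word tokens (assumed input shape).
    Higher scores mean the chapter leans on rarer words.
    """
    counts = Counter(w for words in chapters.values() for w in words)
    total = sum(counts.values())
    scores = {}
    for label, words in chapters.items():
        if not words:
            continue  # skip empty chapters rather than divide by zero
        log_sum = sum(-math.log(counts[w] / total) for w in words)
        scores[label] = log_sum / len(words)
    return scores
```

Sorting the resulting dict by value gives hardest-to-easiest rankings like the ones in this post.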

Output columns are mean log word frequency, and then book+chapter (insert caveat about LXX chapter numbering). The higher the first column, the harder to read. Here are the top ten hardest chapters by this measure:

7983 Wisdom 17
7708 1 Chronicles 1
7281 2 Esdras 2
7217 Proverbs 10
7128 3 Maccabees 4
7114 Wisdom 11
7110 Wisdom 13
7092 Proverbs 14
7076 Proverbs 11
7065 Wisdom 14

Wisdom 17 has the unique distinction of having a broader vocabulary than a list of names, which is impressive. And now the easiest:

4752 Psalm 133
4752 Psalm 135
4792 Deuteronomy 30
4809 Deuteronomy 26
4836 1 Kings 13
4837 Psalm 116
4851 Ezekiel 20
4852 2 Chronicles 7
4856 Ezekiel 37
4881 Ezekiel 33

Top five hardest books:

6777 Wisdom
6677 4 Maccabees
6664 3 Maccabees
6634 Proverbs
6484 Joshua A
6402 2 Maccabees

Books composed in Greek appear to have the edge. I included a sixth because "Joshua A" in this corpus is not a full-length work. Now the five easiest:

5189 Deuteronomy
5205 2 Chronicles
5239 1 Samuel
5242 Judges A
5256 Baruch

That's all for now.

LXX Vocabulary Coverage

James Tauber is blogging daily until SBL, and several of his posts have piqued my interest, so expect to see a few derivative posts here.

The first post covers vocabulary coverage statistics for the SBLGNT. The concept can take a moment to wrap your mind around: given a count of vocabulary learned (vertical axis), assuming you want to be able to read a certain percentage of words in a verse (horizontal axis), in what percentage of verses will you be successful (intersection of the two).

My usual instinct when reading posts about New Testament Greek is to try the same thing with the Septuagint. Here is the data for the LXXM using the methodology outlined in James' post:

             ANY    50.00%    75.00%    90.00%    95.00%   100.00% 
   100    99.78%    88.63%    27.16%     1.99%     0.74%     0.62% 
   200    99.80%    94.19%    51.25%     8.65%     2.58%     1.56% 
   500    99.84%    98.38%    78.01%    33.00%    13.95%     8.30% 
  1000    99.89%    99.35%    89.86%    58.46%    34.27%    23.08% 
  2000    99.92%    99.61%    95.93%    79.25%    59.45%    46.20% 
  5000    99.99%    99.87%    98.67%    93.72%    85.12%    77.44% 
 10000   100.00%    99.99%    99.78%    98.31%    95.33%    92.15% 
   ALL   100.00%   100.00%   100.00%   100.00%   100.00%   100.00%

(In order to obtain the necessary input data, I had to restructure the lxxmorph-unicode dataset - after proofing I'd like to release the new format soon.)

Say you had learned 500 words and wanted to look up only about one word per verse (90%): you would be successful in 33.00% of verses (the 500 row, 90.00% column). Another way of looking at it: if you wanted to know 75% of words in 90% of verses, how big would your vocabulary need to be? About 1000 words.
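Each cell of the table can be computed along these lines. This is a minimal sketch rather than the code from James' post: the function name and the input shape (one list of lemmas per verse) are assumptions, and the "ANY" column, which only requires one known word, is left out.

```python
from collections import Counter


def coverage(verses, vocab_size, threshold):
    """Fraction of verses in which at least `threshold` of words are known.

    verses: list of lists of lemmas (assumed input shape).
    The learner is assumed to know the `vocab_size` most frequent lemmas.
    """
    verses = [v for v in verses if v]  # ignore empty verses
    counts = Counter(w for v in verses for w in v)
    known = {w for w, _ in counts.most_common(vocab_size)}
    readable = 0
    for verse in verses:
        share = sum(1 for w in verse if w in known) / len(verse)
        if share >= threshold:
            readable += 1
    return readable / len(verses)
```

Calling coverage(verses, 500, 0.90) would correspond to the 500 row, 90.00% column.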

I have been convinced by smart and experienced educators that vocabulary mastery really is the key to mastery of reading Greek. Just imagine the frustration of having to look up words that often even after learning so many. Wait, you probably don't have to imagine it - we've all been there! Vocab is king.

The LXX is a much bigger corpus than the New Testament (and maybe has more lexical diversity - perhaps the subject of a forthcoming post). By way of comparison with the above, a vocab of 500 targeting 90% coverage would be successful in 36.57% of verses.

I wonder if maybe the number of proper nouns in the LXX may significantly skew these numbers. Proper nouns are not vocabulary words per se - the knowledge and memory of them works differently than vocab words. So what if I remove them from consideration (in this case just filtering out words which start with a capital letter from the input file). This decreased the word count from 623,685 to 589,731. Here is the updated coverage:

             ANY    50.00%    75.00%    90.00%    95.00%   100.00% 
   100    99.91%    91.57%    40.48%     6.54%     3.36%     3.02% 
   200    99.92%    95.92%    63.30%    18.17%     7.77%     5.79% 
   500    99.97%    99.21%    85.79%    47.92%    26.03%    18.13% 
  1000    99.99%    99.84%    94.92%    72.53%    49.98%    38.05% 
  2000    99.99%    99.97%    98.84%    89.13%    74.26%    63.61% 
  5000   100.00%   100.00%    99.92%    98.40%    93.92%    89.70% 
 10000   100.00%   100.00%   100.00%   100.00%   100.00%   100.00% 
   ALL   100.00%   100.00%   100.00%   100.00%   100.00%   100.00%

That change raised the 500/90% result to 47.92% from 33.00%. Still pretty daunting, but less intimidating when you think of it that way.
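The proper-noun filter described above is crude but easy to reproduce. A minimal sketch, with a made-up function name, assuming the words are already unicode strings:

```python
def drop_capitalized(words):
    """Drop every word whose first character is an uppercase letter.

    This is only a rough proxy for proper nouns: it relies on the text
    capitalizing names and nothing else.
    """
    return [w for w in words if not w[:1].isupper()]
```

Python's str.isupper works on Greek capitals, including those with breathing marks, so no special casing is needed.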

Fun with LXXM-Corpus

Once I have a text available for natural language processing, there are a few basic tasks I like to perform to kick the tires. First, I like to run the collocations method of NLTK, which gives common word pairs from the text. For the LXXM, here are the results:

  • ἐν τῇ
  • ἐν τῷ
  • ὁ θεὸς
  • τῆς γῆς
  • καὶ εἶπεν
  • λέγει κύριος
  • ἀνὰ μέσον
  • τὴν γῆν
  • τοῦ θεοῦ
  • ὁ θεός
  • τάδε λέγει
  • πρός με
  • πάντα τὰ
  • ὁ βασιλεὺς
  • οὐ μὴ
  • οὐκ ἔστιν
  • τῇ ἡμέρᾳ
  • οἱ υἱοὶ
  • τῷ κυρίῳ
  • τοῦ βασιλέως

If you disregard the stop words, you can get a decent idea of the fundamental thematic content of the text.

Now for the silliness, using the n-gram random text generator:

ἐν ἀρχῇ ὁδοῦ πόλεως ἐπ' ὀνόμασιν φυλῶν τοῦ Ισραηλ παρώξυναν οὐκ ἐμνήσθησαν διαθήκης ἀδελφῶν καὶ ἐξαποστελῶ πῦρ ἐπὶ Μωαβ ἐν τῷ ἐξαγαγεῖν σε τὸν ἱματισμόν
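If you want to play with the idea without NLTK, a bare-bones bigram generator is only a few lines. This is a simpler stand-in and not NLTK's generator; the function names are made up for illustration.

```python
import random
from collections import defaultdict


def build_bigram_model(words):
    """Map each word to the list of words that follow it in the text."""
    model = defaultdict(list)
    for first, second in zip(words, words[1:]):
        model[first].append(second)
    return model


def generate(model, start, length, seed=None):
    """Random walk over the bigram model, beginning at `start`."""
    rng = random.Random(seed)
    output = [start]
    while len(output) < length:
        followers = model.get(output[-1])
        if not followers:
            break  # dead end: this word never has a successor
        output.append(rng.choice(followers))
    return output
```

Because followers are stored with repeats, frequent word pairs are proportionally more likely to be chosen.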

A categorized, tagged Septuagint corpus

Last year I created a version of the SBLGNT for use as a categorized, tagged corpus for natural language processing. Now I have done the same with a Septuagint text. I am calling it LXXMorph-Corpus. The source for text and tags is my unicode conversion of the CATSS LXXMorph text. There is at least one category for each file.

The text is arranged with one book per file. Certain books in the source LXXMorph text are split where there is significant textual divergence (manuscript B and A, or the Old Greek and Theodotion). Each file has one or more categories (e.g. pentateuch and writings).

Since there is no punctuation in the source text, the files are laid out with one verse per line. A better arrangement from an NLP perspective would be one line per sentence (thereby preserving the semantic structure). Maybe someday we'll have a freely-licensed LXX text which will include sentence breaks.

Each word is accompanied by the morphological tag in the word/tag format (NLTK will automatically split word and tag on the slash). The part of speech tag is separated from the parsing information with a hyphen, which enables the use of the simplify tags function in NLTK.
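To show what that layout means in practice, here is a small sketch that parses one line by hand, without NLTK. The function name and the sample line in the usage note are illustrative assumptions, not copied from the corpus files.

```python
def parse_line(line, simplify=False):
    """Split a line of word/tag tokens into (word, tag) tuples.

    With simplify=True only the part of speech (the text before the
    first hyphen) is kept.
    """
    pairs = []
    for token in line.split():
        # Split on the last slash so the word itself is left intact.
        word, _, tag = token.rpartition("/")
        if simplify:
            tag = tag.split("-")[0]
        pairs.append((word, tag))
    return pairs
```

For example, parse_line("ἐν/P ἀρχῇ/N1-DSF", simplify=True) keeps "P" and "N1" as the tags.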

Here follows an example of how to load this corpus into NLTK:

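from nltk.corpus.reader import CategorizedTaggedCorpusReader

def simplify_tag(tag):
    # Keep only the part of speech: everything before the first hyphen.
    if '-' in tag:
        tag = tag.split('-')[0]
    return tag

# Written against the NLTK 2.x API. The last two arguments are assumptions
# (the original listing is cut off at this point): tag_mapping_function is
# the NLTK 2.x hook for simplified tags, and 'cats.txt' is a placeholder
# name for the category file.
lxx = CategorizedTaggedCorpusReader('lxxmorph-corpus/',
    r'\d{2}\..*', encoding=u'utf8',
    tag_mapping_function=simplify_tag,
    cat_file='cats.txt')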

Now through the lxx object you have access to tagged words - lxx.tagged_words(), simplified tags - lxx.tagged_words(simplify_tags=True), tagged sentences - lxx.tagged_sents(), and textual categories - lxx.words(categories='former-prophets').

This is a derivative work of the original CATSS LXXMorph text, and so your use of it is subject to the terms of that license. See the README file for more details.

Unicode conversion of the LXX Morph text

I had been looking for a morphologically-tagged LXX for research and
came across the CATSS LXXM text. The one thing lacking for my use of
this text was that it was in betacode and not in unicode.

By searching I have found that many people have taken this text and
converted it to unicode for embedding in web sites, but to my knowledge
nobody is publishing the equivalent plain text files. The Unbound Bible
comes closest, but it publishes the text and the morphological analysis
in two separate files, which is suboptimal. So I decided to embark on
converting the LXXM to unicode.

Luckily James Tauber has shared a Greek betacode to unicode conversion
script which took care of most of the hard work for me. Using this,
I was able to convert all of the texts from betacode to unicode. I am
sharing the result as a git archive: lxxmorph-unicode.

The texts differ from the originals in the following ways:

  1. Several corrections have been applied.
  2. The betacode text has been converted to unicode.
  3. The files are now whitespace-separated rather than fixed-width.
  4. The second column, containing the POS and parsing information, has had its whitespace replaced with hyphens in accordance with the above.
  5. The split files of Genesis, Psalms, Isaiah, Jeremiah, and Ezekiel have been combined, and all the files have been renumbered.
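Items 3 and 4 amount to a small line-by-line transformation. The sketch below illustrates the idea and is not the script I actually ran: the function name and the column widths (25 and 11) are placeholders I have not checked against the CATSS files, and the betacode is left unconverted.

```python
def convert_line(line, word_width=25, tag_width=11):
    """Turn one fixed-width line into a whitespace-separated line.

    word_width and tag_width are hypothetical column widths.
    Whitespace inside the POS/parsing column is replaced with hyphens
    so that the column survives whitespace splitting as one field.
    """
    word = line[:word_width].strip()
    tag_field = line[word_width:word_width + tag_width]
    tag = "-".join(tag_field.split())
    rest = line[word_width + tag_width:].split()
    return " ".join([word, tag] + rest)
```

With these widths, a tag column containing "N1  DSF" comes out as "N1-DSF".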

Please note that this resource has a rather novel license which
requires users to fill out a user declaration and send it in to the
CCAT program at the University of Pennsylvania (see
0-user-declaration.txt in the repo). As far as I can tell, my redistribution of the unicode version complies with the license. I have contacted Robert Kraft (the former steward) and Bernard Taylor (the current steward) with the corrections I've found.

(link to the original announcement on the Open Scriptures mailing list)

Swete LXX Downloader

Edit March 8, 2013: The Internet Archive has a better set of scans of Swete. It includes options for direct download of the files in various formats, including images.

The Christian Classics Ethereal Library hosts scans of H.B. Swete's "Old Testament in Greek According to the Septuagint." It is a public domain LXX, including introduction and textual apparatus. If you find yourself desiring to store the images on your computer instead of viewing them through CCEL's website, you can use this Python script which I created. It will grab all of the PNG files of the text and apparatus and arrange them in order, by volume.

The script itself is not much, but I decided to license it under the GNU General Public License, version 3. This script uses some Python 2 syntax, so I might convert it for Python 3 at some point (though it's not so great a task). If CCEL decides to change the structure of their site, it may break this script.

I considered hosting the finished product here, and may do so in the future, but for now I am going to preserve the bandwidth.



Today I discovered a very interesting project: The Hexapla Institute.

The purpose of the Hexapla Institute is to publish a new critical edition of the fragments of Origen's Hexapla, an endeavor which might be described as, "A Field for the 21st Century" to be available in a print edition and as an online database.

In other words, it's about the coolest project I've ever heard of. Sadly the website seems a bit out of date, so I'm not sure how/if the project is progressing at this point.

Thesis First Draft Done

I have completed the first draft of my thesis Short Goliath, which is an examination of the textual discrepancy between the Hebrew Masoretic Text and the Old Greek (a.k.a. Septuagint) accounts of the story of David and Goliath.  This feels really good, because it has been a long time coming, including a break from working on it for most of last semester.  Now I will start the revision process, which will be fairly involved.  The final draft is due the first week of April.  Once I have the paper in a form I like, I will be posting it here under a Creative Commons license. Here are some gory details:

  • 83 pages
  • 23,656 words
  • File saved: 477 times
  • Total edit time: over 45 hours (just on this one document - doesn't even come close to total time committed)
  • File created: April 22, 2008