The Library Basement
Reading under ground


Updates to lxxmorph-unicode and lxxmorph-corpus

The conversion of the original CATSS betacode into unicode relies on beta2unicode.py, a script published by James Tauber. James has just made an early release of greek-utils, which includes an updated version of the betacode converter. I tested the new version with the CATSS conversion process and discovered that it produces the same output, with one exception.

Thanks to a bug-fix in the new version, a fault in the existing unicode texts has been discovered and corrected. Specifically, initial ρ was marked with smooth breathing marks instead of rough breathing. This changed 1,017 lines in 58 of the books.
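
For anyone who wants to verify the fix on their own copy, here is a quick spot-check of my own - a minimal sketch, not part of the conversion scripts - which flags any remaining word-initial rho with smooth breathing, assuming the text uses precomposed (NFC) characters:

    import re
    import sys

    # Word-initial rho with smooth breathing (U+1FE4) should normally be
    # rho with rough breathing (U+1FE5) instead.
    pattern = re.compile(r"(?:^|\s)\u1fe4")

    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as f:
            for number, line in enumerate(f, 1):
                if pattern.search(line):
                    print(f"{path}:{number}: {line.rstrip()}")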

Check the lxxmorph-unicode and lxxmorph-corpus repositories for updates.

LXX Mean Log Word Frequency

In another recent post, James Tauber covers mean log word frequency as a measure of reading difficulty. I adapted James' code to the CATSS LXX data (new format still forthcoming, I promise!).
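
For the curious, here is roughly how I think of the metric. This is a sketch of my own, not James' code, and the details (log base, scaling, counting lemmas rather than surface forms) are assumptions on my part; it simply produces numbers of the same general shape as those below.

    import math
    from collections import Counter, defaultdict

    def mean_log_word_frequency(tokens):
        """tokens: list of (book, chapter, lemma) tuples for the whole corpus.

        Returns a score per (book, chapter): the mean over tokens of the
        negative log2 of each lemma's relative corpus frequency, scaled by
        1000. Rarer vocabulary pushes the score up, so higher = harder.
        """
        counts = Counter(lemma for _, _, lemma in tokens)
        total = sum(counts.values())

        chapters = defaultdict(list)
        for book, chapter, lemma in tokens:
            chapters[(book, chapter)].append(lemma)

        return {
            key: round(1000 * sum(-math.log2(counts[lem] / total) for lem in lemmas)
                       / len(lemmas))
            for key, lemmas in chapters.items()
        }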

Output columns are mean log word frequency, and then book+chapter (insert caveat about LXX chapter numbering). The higher the first column, the harder to read. Here are the top ten hardest chapters by this measure:

7983 Wisdom 17
7708 1 Chronicles 1
7281 2 Esdras 2
7217 Proverbs 10
7128 3 Maccabees 4
7114 Wisdom 11
7110 Wisdom 13
7092 Proverbs 14
7076 Proverbs 11
7065 Wisdom 14

Wisdom 17 has the unique distinction of scoring harder than a chapter that is essentially a list of names (1 Chronicles 1), which is impressive. And now the easiest:

4752 Psalm 133
4752 Psalm 135
4792 Deuteronomy 30
4809 Deuteronomy 26
4836 1 Kings 13
4837 Psalm 116
4851 Ezekiel 20
4852 2 Chronicles 7
4856 Ezekiel 37
4881 Ezekiel 33

Top five hardest books:

6777 Wisdom
6677 4 Maccabees
6664 3 Maccabees
6634 Proverbs
6484 Joshua A
6402 2 Maccabees

Books composed in Greek appear to have the edge. I included a sixth because "Joshua A" in this corpus is not a full-length work. Now the five easiest:

5189 Deuteronomy
5205 2 Chronicles
5239 1 Samuel
5242 Judges A
5256 Baruch

That's all for now.

Ancient Greek 80% Core Vocab Flashcards

Imagine you wanted to study vocabulary in advance of reading not just a work, but an entire corpus, or perhaps the super-corpus of the Greek Classics. Wilfred Major advocated for learning a core vocabulary in his paper It's Not the Size, It's the Frequency. In it he provides some good news:

Ancient Greek has a high-frequency core vocabulary scarcely half the size of many other languages.

He provides lists of lemmata which comprise 50% or 80% of the words in the ancient Greek corpus. The 80% list has only 1,100 entries, so it is definitely achievable for advanced students.

I have made a flashcard pack at Quisition which contains this vocab list. If you think it would be useful to your studies, you can add it to a deck and start memorizing.


LXX Vocabulary Coverage

James Tauber is blogging daily until SBL, and several of his posts have piqued my interest, so expect to see a few derivative posts here.

The first post covers vocabulary coverage statistics for the SBLGNT. The concept can take a moment to wrap your mind around: given a count of vocabulary learned (vertical axis), assuming you want to be able to read a certain percentage of words in a verse (horizontal axis), in what percentage of verses will you be successful (intersection of the two).
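
Here is a minimal sketch of how such a table can be computed. The names (verses as a list of lemma lists, freq_order as lemmas sorted by descending corpus frequency) are mine, and this is my reconstruction of the methodology rather than James' code:

    def coverage_table(verses, freq_order,
                       sizes=(100, 200, 500, 1000, 2000, 5000, 10000),
                       thresholds=(0.50, 0.75, 0.90, 0.95, 1.00)):
        """For each vocabulary size, report the percentage of verses in which
        at least one word is known ("ANY") and in which each target coverage
        threshold is met. Assumes no verse is empty."""
        table = {}
        for size in sizes:
            known = set(freq_order[:size])
            coverages = [sum(w in known for w in verse) / len(verse)
                         for verse in verses]
            row = [100 * sum(c > 0 for c in coverages) / len(verses)]  # "ANY"
            row += [100 * sum(c >= t for c in coverages) / len(verses)
                    for t in thresholds]
            table[size] = row
        return table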

My usual instinct when reading posts about New Testament Greek is to try the same thing with the Septuagint. Here is the data for the LXXM using the methodology outlined in James' post:

             ANY    50.00%    75.00%    90.00%    95.00%   100.00% 
------------------------------------------------------------------
   100    99.78%    88.63%    27.16%     1.99%     0.74%     0.62% 
   200    99.80%    94.19%    51.25%     8.65%     2.58%     1.56% 
   500    99.84%    98.38%    78.01%    33.00%    13.95%     8.30% 
  1000    99.89%    99.35%    89.86%    58.46%    34.27%    23.08% 
  2000    99.92%    99.61%    95.93%    79.25%    59.45%    46.20% 
  5000    99.99%    99.87%    98.67%    93.72%    85.12%    77.44% 
 10000   100.00%    99.99%    99.78%    98.31%    95.33%    92.15% 
   ALL   100.00%   100.00%   100.00%   100.00%   100.00%   100.00%

(In order to obtain the necessary input data, I had to restructure the lxxmorph-unicode dataset - after proofing I'd like to release the new format soon.)

Say you had learned 500 words and only wanted to look up about one word per verse (90% coverage): you would be successful in 13.95% of verses. Another way of looking at it: if you wanted to know 75% of words in 90% of verses, how big would your vocabulary need to be? About 1000 words.

I have been convinced by smart and experienced educators that vocabulary mastery really is the key to mastery of reading Greek. Just imagine the frustration of having to look up words that often even after learning so many. Wait, you probably don't have to imagine it - we've all been there! Vocab is king.

The LXX is a much bigger corpus than the New Testament (and maybe has more lexical diversity - perhaps the subject of a forthcoming post). By way of comparison with the above, a vocab of 500 targeting 90% coverage in the New Testament would be successful in 36.57% of verses.

I wonder whether the number of proper nouns in the LXX significantly skews these numbers. Proper nouns are not vocabulary words per se - knowing and remembering them works differently than for ordinary vocab. So what if I remove them from consideration (in this case just filtering out words which start with a capital letter from the input file - see the sketch after the table)? This decreased the word count from 623,685 to 589,731. Here is the updated coverage:

             ANY    50.00%    75.00%    90.00%    95.00%   100.00% 
------------------------------------------------------------------
   100    99.91%    91.57%    40.48%     6.54%     3.36%     3.02% 
   200    99.92%    95.92%    63.30%    18.17%     7.77%     5.79% 
   500    99.97%    99.21%    85.79%    47.92%    26.03%    18.13% 
  1000    99.99%    99.84%    94.92%    72.53%    49.98%    38.05% 
  2000    99.99%    99.97%    98.84%    89.13%    74.26%    63.61% 
  5000   100.00%   100.00%    99.92%    98.40%    93.92%    89.70% 
 10000   100.00%   100.00%   100.00%   100.00%   100.00%   100.00% 
   ALL   100.00%   100.00%   100.00%   100.00%   100.00%   100.00%
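
The proper-noun filter mentioned before the table is nothing sophisticated - a sketch of the idea, assuming the whitespace-separated lxxmorph-unicode format with the Greek word in the first column:

    def drop_proper_nouns(lines):
        """Keep only lines whose first (Greek word) column does not start
        with a capital letter; crude, but catches most proper nouns."""
        kept = []
        for line in lines:
            columns = line.split()
            if columns and not columns[0][0].isupper():
                kept.append(line)
        return kept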

That change upped the 500/90% result to 26.03% from 13.95%. Still pretty daunting, but less intimidating when you think of it that way.

How many unique words in the Greek NT?

How many unique words are there in the Greek New Testament? Well, that depends on how you count.

I am doing some research and experimentation on indexing the Greek NT (or Koine Greek in general). One crucial aspect of indexing is to normalize the text so that potential search matches are not missed by the presence of punctuation, capitalization, contextual accentuation, etc.

At the same time there are some words which have the same normalized form which we should nonetheless count as different words, such as when morphology overlaps or different lemmas get inflected to the same forms.

So I set out to analyze the Greek NT and find how many unique instances of words there are. Namely, words are grouped if they share the same lemma, normal form, and parsing. To begin I used MorphGNT, which is based on the SBLGNT. MorphGNT contains a column for the normal form of each word, as well as the parsing information, so it is just the ticket.

I used Python to find all unique combinations of lemma, normal form, and parsing info. Then I used James Tauber's pyuca module to sort the results. You can find them in a compressed file here, sorted by lemma.
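
In outline the script looks something like the following sketch. The filename pattern and column indices are assumptions about the MorphGNT file layout, not a description of my exact script:

    import glob
    from pyuca import Collator

    unique = set()
    for path in sorted(glob.glob("*-morphgnt.txt")):  # one file per book (assumed)
        with open(path, encoding="utf-8") as f:
            for line in f:
                cols = line.split()
                # assumed columns: ref, POS, parsing, text, word, normalized, lemma
                pos, parsing, norm, lemma = cols[1], cols[2], cols[5], cols[6]
                unique.add((lemma, norm, pos, parsing))

    collator = Collator()
    for lemma, norm, pos, parsing in sorted(
            unique,
            key=lambda w: (collator.sort_key(w[0]), collator.sort_key(w[1]))):
        print(lemma, norm, pos, parsing)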

Using this methodology, I found 18,873 unique words in the Greek New Testament.

Here is a sample of the output:

ἅγιος ἁγίων A- ----GPM-
ἅγιος ἁγίων A- ----GPN-
ἅγιος ἁγιωτάτῃ A- ----DSFS
ἁγιότης ἁγιότητι N- ----DSF-
ἁγιότης ἁγιότητος N- ----GSF-
ἁγιωσύνη ἁγιωσύνῃ N- ----DSF-
ἁγιωσύνη ἁγιωσύνην N- ----ASF-
ἁγιωσύνη ἁγιωσύνης N- ----GSF-
ἀγκάλη ἀγκάλας N- ----APF-

Anyway, I hope to have more to share on this front later, but this just tickled my fancy.

Fun with LXXM-Corpus

Once I have a text available for natural language processing, there are a few basic tasks I like to perform to kick the tires. First, I like to run the collocations method of NLTK, which gives common word pairs from the text. For the LXXM, here are the results:

  • ἐν τῇ
  • ἐν τῷ
  • ὁ θεὸς
  • τῆς γῆς
  • καὶ εἶπεν
  • λέγει κύριος
  • ἀνὰ μέσον
  • τὴν γῆν
  • τοῦ θεοῦ
  • ὁ θεός
  • τάδε λέγει
  • πρός με
  • πάντα τὰ
  • ὁ βασιλεὺς
  • οὐ μὴ
  • οὐκ ἔστιν
  • τῇ ἡμέρᾳ
  • οἱ υἱοὶ
  • τῷ κυρίῳ
  • τοῦ βασιλέως

If you disregard the stop words, you can get a decent idea of the fundamental thematic content of the text.

Now for the silliness, using the n-gram random text generator:

ἐν ἀρχῇ ὁδοῦ πόλεως ἐπ' ὀνόμασιν φυλῶν τοῦ Ισραηλ παρώξυναν οὐκ ἐμνήσθησαν διαθήκης ἀδελφῶν καὶ ἐξαποστελῶ πῦρ ἐπὶ Μωαβ ἐν τῷ ἐξαγαγεῖν σε τὸν ἱματισμόν
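
Both toys come straight from NLTK. A minimal sketch, assuming the unicode LXX text has been gathered into plain-text files under a corpus/ directory (the path and file pattern are mine); note that the behavior of generate() has changed across NLTK releases:

    import nltk
    from nltk.corpus import PlaintextCorpusReader

    corpus = PlaintextCorpusReader("corpus/", r".*\.txt")
    lxx = nltk.Text(corpus.words())

    lxx.collocations(num=20)  # prints frequently co-occurring word pairs
    lxx.generate()            # n-gram based random text (version-dependent)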

The τέλος of Greek natural language processing

I dream that someday we'll have a full stack of Greek natural language processing tools to facilitate research. These tools will range from transcribing the text to advanced NLP tasks like text classification or sentiment analysis. These tools will of course be open source.

Here is an overview of the components I have imagined (with notes where the tools are already in development):

  • Optical Character Recognition to transcribe the text to a digital form (Rigaudon Polytonic Greek OCR)
  • A user interface for editing the output of the OCR system (a "collaborative corpus linguistics" suite could be used for this and other editing tasks)
  • Collation of related texts for textual criticism
  • Morphological analysis of the text (Tauber's greek-inflection is a start)
  • Tagging of the text based on above morphological analysis
  • Indexing the text
  • Use of a context-free grammar or other means to produce syntactical analysis of the text (e.g. syntax trees)
  • A database to store all of this information
  • An API to make this information accessible (towards which Open Scriptures has worked)

We're actually pretty close. And once the full stack is in place, it will greatly increase the speed at which new texts enter the research corpus. This influx of data will improve the results of research and lead to new applications.

Am I missing anything?

Greek WOTD - ὑπέρογκος

ὑπέρογκος

Meaning "extremely large" or "rather difficult." Spotted in Lamentations 1:9:

ἀκαθαρσία αὐτῆς πρὸς ποδῶν αὐτῆς οὐκ ἐμνήσθη ἔσχατα αὐτῆς καὶ κατεβίβασεν ὑπέρογκα οὐκ ἔστιν ὁ παρακαλῶν αὐτήν ἰδέ κύριε τὴν ταπείνωσίν μου ὅτι ἐμεγαλύνθη ἐχθρός

I pledge allegiance to the metonym and to its referent

I was contemplating the US Pledge of Allegiance today and was struck by something odd. The beginning:

I pledge allegiance to the Flag of the United States of America . . .

It is silly to pledge allegiance to a flag, so this is an obvious metonymy for the United States government. But then it continues:

. . . and to the Republic for which it stands . . .

Wait, what? I'm pledging allegiance to the metonym, and to its referent? This makes no sense. I checked the revision history of the pledge, and this twofold distinction has been present since the beginning. Why doesn't this redundancy sound strange to the hundreds of millions who have recited the pledge? Or is there really a distinction in meaning between the flag and the republic for which it stands that I am missing?


Greek WOTD - στραγγαλόω

στραγγαλόω

Meaning what it sounds like, to choke or strangle. Found this in Ignatius' letter to the Trallians, 5.1.

This is the first in a possible series of Greek Word of the Day posts.

Ignatius and the improperly declined adjective

Update: I was inattentive to the fact that ἀξιόθεος is a two-termination adjective, and therefore masculine and feminine forms are morphologically fused. Thanks to Jack for pointing this out in the comments.

I just started reading Ignatius' letter to the Trallians and came across an oddity. In the opening sentence, Ignatius notes that the church to which he is writing (ἐκκλησίᾳ) is "worthy of God" (ἀξιοθέῳ). Since this adjective is modifying a feminine noun, I would expect the feminine form here. It happens to stick out like a sore thumb because Ignatius is using many adjectives to describe the church and they are all in feminine form (ἠγαπημένῃ, ἁγίᾳ, κτλ), save this one.

At first I remarked to my wife: "there's a grammatical mistake in this sentence." But after considering it for a minute, it seems most likely that Ignatius put this adjective in the masculine form due to θεός being statically male in the Christian tradition. It probably "sounded wrong" to use the expected feminine ending.

I checked BDAG and found that there are attestations of ἀξιόθεος being declined for a ~~feminine~~ noun (Update: actually neuter in this case), including in Ignatius' letter to the Romans, 1:1. I am not sure if this is due to Ignatius' own idiosyncrasies, or due to transmission or spelling normalization by an editor. At the very least it was not a hard-and-fast rule, since we see both forms by the same author.

Additionally in the salutation to Romans, Ignatius does this same switching back and forth between feminine and masculine forms in describing the church. Perhaps someone has already done some research on this phenomenon, or perhaps there is a paper waiting to be written on this topic.


Rigaudon: Polytonic Greek OCR

I came across a very exciting project recently: Rigaudon. This is a polytonic Greek OCR system which has already been used on 532 texts of antiquity. The results are CC-BY-SA licensed, and the code is GPL v2 and available in a git repo. Bruce Robertson, one of the collaborators behind the project, also has other repositories, including one for a web-based interactive syntax tree editor. Check them out.

Transcription is the great boundary between the source texts and boundless application in the digital realm. A good polytonic Greek OCR system will unlock many texts which have never been digitized. This has a dual benefit: a "clean" transcription process can lead to permissive licensing for public domain works, and as a result, we'll all have a lot more texts for research.

The system is not perfect, but it is a work in progress and improvements can be made; some manual editing will still be required. Even so, these OCR results are the best I have seen for polytonic Greek, and the potential reward is so vast that I cannot help but get excited and get involved. There is already some correspondence circulating about collaborating around a particular text, which could then lead to morphological tagging and syntactic analysis, and maybe more.

In 2011 I wrote that the future is bright for copyright issues in Christianity. This is just one example of how that is so. Free software licenses for code and permissive licenses for content are becoming the norm in the cutting edge of the field. This is good for everyone, but there is still a lot of work to do (and maybe more than ever).

Metaphorical Hygiene

From a recent Language Log post:

If someone tells you that one or more languages have no word for X, or if you find yourself using this figure of speech to make a social or cultural point, you should seek metaphysical treatment immediately.

Share your favorite language's word for diarrhea there.


Unicode conversion of the LXX Morph text

I had been looking for a morphologically-tagged LXX for research and
came across the CATSS LXXM text. The one thing lacking for my use of
this text was that it was in betacode and not in unicode.

By searching I have found that many people have taken this text and
converted it to unicode for embedding in web sites, but to my knowledge
nobody is publishing the equivalent plain text files. The Unbound Bible
comes closest, but it publishes the text and the morphological analysis
in two separate files, which is suboptimal. So I decided to embark on
converting the LXXM to unicode.

Luckily James Tauber has shared a Greek betacode to unicode conversion
script which took care of most of the hard work for me. Using this,
I was able to convert all of the texts from betacode to unicode. I am
sharing the result as a git archive: lxxmorph-unicode.

The texts differ from the originals in the following ways (a rough sketch of the reformatting appears after the list):

  1. Several corrections have been applied.
  2. The betacode text has been converted to unicode.
  3. The files are now whitespace-separated rather than fixed-width.
  4. The second column, containing the POS and parsing information, has had its whitespace replaced with hyphens in accordance with the above.
  5. The split files of Genesis, Psalms, Isaiah, Jeremiah, and Ezekiel have been combined, and all of the files renumbered.
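
As a rough illustration of items 2 through 4, each line is reformatted along these lines. This is a sketch only: the fixed-width column offsets are assumptions about the CATSS layout, and beta_to_unicode() is a placeholder for whichever betacode converter is used (in my case James Tauber's beta2unicode.py):

    def reformat_line(line, beta_to_unicode):
        """Turn one fixed-width CATSS line into a whitespace-separated
        unicode line: word, hyphen-joined parse code, lemma."""
        word = line[:25].strip()     # betacode word (offset is an assumption)
        parse = line[25:35].strip()  # POS + parsing; internal spaces -> hyphens
        lemma = line[35:].strip()    # betacode lemma
        return " ".join([
            beta_to_unicode(word),
            "-".join(parse.split()),
            beta_to_unicode(lemma),
        ])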

Please note that this resource has a rather novel license which
requires users to fill out a user declaration and send it in to the
CCAT program at the University of Pennsylvania (see
0-user-declaration.txt in the repo). As far as I can tell, my redistribution of the unicode version complies with the license. I have contacted Robert Kraft (the former steward) and Bernard Taylor (the current steward) with the corrections I've found.

(link to the original announcement on the Open Scriptures mailing list)

Better tokenization of the SBLGNT

In my previous post on this topic I mentioned that the default NLTK tokenizer was erroneously treating elisions as separate tokens. They should be grouped with the word to which they are attached in my opinion. I decided today to look into this and fix the problem.

The SBLGNT uses unicode character 0x2019 ("right single quotation mark") for elisions. The default word tokenizer for the NLTK PlaintextCorpusReader is apparently the wordpunct_tokenize function. This uses the following regular expression for matching tokens:
\w+|[^\w\s]+

That essentially means: match any sequence of alphanumeric characters (\w+), or (|) any sequence comprised of neither alphanumeric characters nor whitespace ([^\w\s]+) - e.g. punctuation. The problem is that in Python's implementation of unicode, 0x2019 is not considered an alphanumeric character, so it is getting tokenized on its own by the latter expression meant to catch punctuation.

So I crafted a new regular expression to alter this behavior:
\w+\u2019?|[^\w\s\u2019]+

Now for each sequence of alphanumeric characters, there can optionally be a 0x2019 at the end to catch elisions (I also explicitly exclude 0x2019 from the latter expression, though I am not entirely sure this is necessary). To actually use this:

tokens = nltk.tokenize.regexp.regexp_tokenize(text, u'\w+\u2019?|[^\w\s\u2019]+')

Using regexp_tokenize we can tokenize a text with any old regular expression our heart desires. I put a full example of this in the same repo with the name load-sblgnt.py. It should be run after the sblgnt-nltk.py script has downloaded and prepped the data. The load script provides an example workflow for getting an NLTK text object and then running collocations() and generate(). Enjoy!
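
The same pattern can also be plugged into the corpus reader itself, so that every NLTK method sees the elision-aware tokenization. A minimal sketch (the directory name is mine):

    from nltk.corpus import PlaintextCorpusReader
    from nltk.tokenize import RegexpTokenizer

    # Same token pattern as above: a word may end in 0x2019 (elision), and
    # 0x2019 is excluded from the punctuation alternative.
    tokenizer = RegexpTokenizer(r'\w+\u2019?|[^\w\s\u2019]+')
    corpus = PlaintextCorpusReader("sblgnt/", r".*\.txt",
                                   word_tokenizer=tokenizer)
    print(corpus.words()[:20])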
