The Library Basement
Reading under ground

Tag Greek

Moved some git repositories

I have become enamored of Gogs, a self-hosting solution for git repositories, so I've moved most of my personal repositories from a certain large centralized git service provider to my own instance. Check it out:

https://scm.smithfam.info/nathan

I understand this may require collaborators to actually use git in the manner in which it was designed - namely as decentralized version control. If you'd like to submit a patch to one of my projects, you'll need to craft a git pull request and email it to me.

koine-nlp release

Today I am formally releasing koine-nlp 0.2, a Python library for common NLP-related tasks for Koine Greek. I decided to make a fancy koine-nlp homepage with the help of Sphinx. It includes info on installation, a tutorial, and an API reference for the koinenlp module. You can find the source repository on my Gogs instance.

In the most basic mode of operation, koine-nlp is used to prepare polytonic Greek text for indexing by normalizing it. This is done by means of the omnibus normalize() function:

>>> import koinenlp
>>> koinenlp.normalize("καὶ ἡ σκοτία αὐτὸ οὐ κατέλαβεν.")
'και η σκοτια αυτο ου κατελαβεν'

There's plenty more to it - see the documentation for details.
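
For a sense of what this kind of normalization involves, here is a rough standard-library sketch. It is not koine-nlp's actual implementation: the function name strip_for_index is made up for illustration, and it only deals with diacritics, case, and punctuation.

```python
import unicodedata


def strip_for_index(text):
    """Hypothetical helper: reduce polytonic Greek to bare lowercase letters."""
    # Decompose so that accents and breathings become separate combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = []
    for ch in decomposed:
        category = unicodedata.category(ch)
        if category == "Mn":
            # Combining marks: accents, breathings, iota subscript.
            continue
        if category.startswith("P"):
            # Punctuation of any kind.
            continue
        kept.append(ch)
    # Lowercase and collapse any runs of whitespace left behind.
    return " ".join("".join(kept).lower().split())


print(strip_for_index("καὶ ἡ σκοτία αὐτὸ οὐ κατέλαβεν."))
```

On the sentence from the example above, this sketch should give the same bare string that normalize() returns, though the real library may well treat edge cases differently.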

I do plan on adding some features in the future, so watch this space.

Updates to lxxmorph-unicode and lxxmorph-corpus

The conversion of the original CATSS betacode into unicode relies on beta2unicode.py, a script published by James Tauber. James has just done an early release of greek-utils, which includes an updated version of the betacode converter. I tested the new version with the CATSS conversion process, and discovered that it produces the same output, with an exception.

Thanks to a bug-fix in the new version, a fault in the existing unicode texts has been discovered and corrected. Specifically, initial ρ was marked with smooth breathing marks instead of rough breathing. This changed 1,017 lines in 58 of the books.
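
To illustrate the kind of correction involved, here is a small sketch that swaps a smooth breathing on an initial rho for a rough one. This is not the code from greek-utils or from my conversion process, just an after-the-fact illustration, and the name fix_initial_rho is hypothetical.

```python
import unicodedata

SMOOTH = "\u0313"  # combining comma above (smooth breathing)
ROUGH = "\u0314"   # combining reversed comma above (rough breathing)
RHOS = ("\u03c1", "\u03a1")  # Greek small and capital rho


def fix_initial_rho(word):
    """Hypothetical helper: replace a smooth breathing on initial rho with a rough one."""
    # Work on the decomposed form so the breathing is a separate code point.
    decomposed = unicodedata.normalize("NFD", word)
    if decomposed[:1] in RHOS and decomposed[1:2] == SMOOTH:
        decomposed = decomposed[0] + ROUGH + decomposed[2:]
    # Recompose; note this also NFC-normalizes words that were left unchanged.
    return unicodedata.normalize("NFC", decomposed)
```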

Check the lxxmorph-unicode and lxxmorph-corpus repositories for updates.

LXX Mean Log Word Frequency

In another recent post, James Tauber covers the topic of mean log word frequency as a measure of difficulty of reading. I adapted James' code to address the CATSS LXX data (new format still forthcoming, I promise!).

Output columns are mean log word frequency, and then book+chapter (insert caveat about LXX chapter numbering). The higher the first column, the harder to read. Here are the top ten hardest chapters by this measure:

7983 Wisdom 17
7708 1 Chronicles 1
7281 2 Esdras 2
7217 Proverbs 10
7128 3 Maccabees 4
7114 Wisdom 11
7110 Wisdom 13
7092 Proverbs 14
7076 Proverbs 11
7065 Wisdom 14

Wisdom 17 has the unique distinction of having a broader vocabulary than a list of names, which is impressive. And now the easiest:

4752 Psalm 133
4752 Psalm 135
4792 Deuteronomy 30
4809 Deuteronomy 26
4836 1 Kings 13
4837 Psalm 116
4851 Ezekiel 20
4852 2 Chronicles 7
4856 Ezekiel 37
4881 Ezekiel 33

Top five hardest books:

6777 Wisdom
6677 4 Maccabees
6664 3 Maccabees
6634 Proverbs
6484 Joshua A
6402 2 Maccabees

Books composed in Greek appear to have the edge. I included a sixth because "Joshua A" in this corpus is not a full-length work. Now the five easiest:

5189 Deuteronomy
5205 2 Chronicles
5239 1 Samuel
5242 Judges A
5256 Baruch

That's all for now.
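
For anyone who wants to play along, the measure can be sketched roughly as follows. This is my own minimal reconstruction, not James' code: I am assuming the score for a chapter is the mean of the negative log relative frequency of each of its tokens (so rarer words push the score up), and the function name and the (unit, lemma) input shape are made up for illustration. Any scaling applied to the numbers in the lists above is not reproduced here.

```python
import math
from collections import Counter, defaultdict


def mean_log_frequency(tokens):
    """Score each unit (e.g. a chapter) by the mean -log relative frequency of its tokens.

    tokens: iterable of (unit, lemma) pairs covering the whole corpus.
    Assumption: frequency is counted per lemma over the whole corpus.
    """
    tokens = list(tokens)
    counts = Counter(lemma for _, lemma in tokens)
    total = len(tokens)

    scores_by_unit = defaultdict(list)
    for unit, lemma in tokens:
        # Rare lemmas have a small relative frequency and so a large -log value.
        scores_by_unit[unit].append(-math.log(counts[lemma] / total))

    return {unit: sum(vals) / len(vals) for unit, vals in scores_by_unit.items()}
```

Sorting the resulting dictionary by value then gives "hardest" and "easiest" lists like the ones above.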

Ancient Greek 80% Core Vocab Flashcards

Imagine you wanted to study vocabulary in advance of reading not just a work, but an entire corpus, or perhaps the super-corpus of the Greek Classics. Wilfred Major advocated for learning a core vocabulary in his paper It's Not the Size, It's the Frequency. In it he provides some good news:

Ancient Greek has a high-frequency core vocabulary scarcely half the size of many other languages.

He provides lists of lemmata which comprise 50% or 80% of the words in the ancient Greek corpus. The 80% has only 1,100 entries, so it is definitely achievable for advanced students.

I have made a flashcard pack at Quisition which contains this vocab list. If you think it would be useful to your studies, you can add it to a deck and start memorizing.

Category: language Tags: Greek

LXX Vocabulary Coverage

James Tauber is blogging daily until SBL, and several of his posts have piqued my interest, so expect to see a few derivative posts here.

The first post covers vocabulary coverage statistics for the SBLGNT. The concept can take a moment to wrap your mind around: given a count of vocabulary learned (vertical axis), assuming you want to be able to read a certain percentage of words in a verse (horizontal axis), in what percentage of verses will you be successful (intersection of the two).

My usual instinct when reading posts about New Testament Greek is to try the same thing with the Septuagint. Here is the data for the LXXM using the methodology outlined in James' post:

             ANY    50.00%    75.00%    90.00%    95.00%   100.00% 
------------------------------------------------------------------
   100    99.78%    88.63%    27.16%     1.99%     0.74%     0.62% 
   200    99.80%    94.19%    51.25%     8.65%     2.58%     1.56% 
   500    99.84%    98.38%    78.01%    33.00%    13.95%     8.30% 
  1000    99.89%    99.35%    89.86%    58.46%    34.27%    23.08% 
  2000    99.92%    99.61%    95.93%    79.25%    59.45%    46.20% 
  5000    99.99%    99.87%    98.67%    93.72%    85.12%    77.44% 
 10000   100.00%    99.99%    99.78%    98.31%    95.33%    92.15% 
   ALL   100.00%   100.00%   100.00%   100.00%   100.00%   100.00%

(In order to obtain the necessary input data, I had to restructure the lxxmorph-unicode dataset - after proofing I'd like to release the new format soon.)

Say you had learned 500 words and only wanted to look up about one word per verse (90%): you would be successful in 33.00% of verses. Another way of looking at it: if you wanted to know 75% of the words in 90% of verses, how big would your vocabulary need to be? About 1000 words.
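
A single cell of a table like this can be computed along these lines. This is a minimal sketch of the idea rather than the code from James' post: the function name is hypothetical, I am assuming that "vocabulary learned" means the N most frequent lemmas in the corpus, and the ANY column would need a separate "at least one known word" check.

```python
from collections import Counter


def coverage(verses, vocab_size, threshold):
    """Fraction of verses in which at least `threshold` of the words are known.

    verses: list of verses, each a list of lemmas.
    Assumption: the learner knows the `vocab_size` most frequent lemmas.
    """
    counts = Counter(lemma for verse in verses for lemma in verse)
    known = {lemma for lemma, _ in counts.most_common(vocab_size)}

    verses = [v for v in verses if v]  # ignore empty verses
    if not verses:
        return 0.0

    successes = 0
    for verse in verses:
        known_fraction = sum(lemma in known for lemma in verse) / len(verse)
        if known_fraction >= threshold:
            successes += 1
    return successes / len(verses)
```

Looping this over the vocabulary sizes (rows) and thresholds (columns) produces the whole grid.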

I have been convinced by smart and experienced educators that vocabulary mastery really is the key to mastery of reading Greek. Just imagine the frustration of having to look up words that often even after learning so many. Wait, you probably don't have to imagine it - we've all been there! Vocab is king.

The LXX is a much bigger corpus than the New Testament (and maybe has more lexical diversity - perhaps the subject of a forthcoming post). By way of comparison with the above, in the SBLGNT a vocab of 500 targeting 90% coverage would be successful in 36.57% of verses.

I wonder whether the number of proper nouns in the LXX significantly skews these numbers. Proper nouns are not vocabulary words per se - the knowledge and memory of them works differently than vocab words. So what if I remove them from consideration? (In this case I just filtered out words which start with a capital letter from the input file.) This decreased the word count from 623,685 to 589,731. Here is the updated coverage:

             ANY    50.00%    75.00%    90.00%    95.00%   100.00% 
------------------------------------------------------------------
   100    99.91%    91.57%    40.48%     6.54%     3.36%     3.02% 
   200    99.92%    95.92%    63.30%    18.17%     7.77%     5.79% 
   500    99.97%    99.21%    85.79%    47.92%    26.03%    18.13% 
  1000    99.99%    99.84%    94.92%    72.53%    49.98%    38.05% 
  2000    99.99%    99.97%    98.84%    89.13%    74.26%    63.61% 
  5000   100.00%   100.00%    99.92%    98.40%    93.92%    89.70% 
 10000   100.00%   100.00%   100.00%   100.00%   100.00%   100.00% 
   ALL   100.00%   100.00%   100.00%   100.00%   100.00%   100.00%

That change upped the 500/90% result to 47.92% from 33.00%. Still pretty daunting, but less intimidating when you think of it that way.
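
The proper-noun filter really is as blunt as it sounds. Something like the following sketch captures it (the function name is made up; note that it would also throw out any ordinary word that happens to be capitalized in the input):

```python
def drop_capitalized(words):
    """Keep only words whose first character is not an uppercase letter."""
    return [w for w in words if not w[:1].isupper()]


print(drop_capitalized(["καὶ", "Ισραηλ", "ἐν", "Μωαβ"]))
```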

How many unique words in the Greek NT?

How many unique words are there in the Greek New Testament? Well, that depends on how you count.

I am doing some research and experimentation on indexing the Greek NT (or Koine Greek in general). One crucial aspect of indexing is to normalize the text so that potential search matches are not missed by the presence of punctuation, capitalization, contextual accentuation, etc.

At the same time there are some words which have the same normalized form which we should nonetheless count as different words, such as when morphology overlaps or different lemmas get inflected to the same forms.

So I set out to analyze the Greek NT and find how many unique instances of words there are. Namely, words are grouped if they share the same lemma, normal form, and parsing. To begin I used MorphGNT, which is based on SBLGNT. MorphGNT contains a column for the normal form of each word, as well as the parsing information, so it is just the ticket.

I used Python to find all unique instances of lemma, normal form, and parsing info. Then I used James Tauber's pyuca module to sort the results. You can find them in a compressed file here, sorted by lemma.

Using this methodology, I found 18,873 unique words in the Greek New Testament.

Here is a sample of the output:

ἅγιος ἁγίων A- ----GPM-
ἅγιος ἁγίων A- ----GPN-
ἅγιος ἁγιωτάτῃ A- ----DSFS
ἁγιότης ἁγιότητι N- ----DSF-
ἁγιότης ἁγιότητος N- ----GSF-
ἁγιωσύνη ἁγιωσύνῃ N- ----DSF-
ἁγιωσύνη ἁγιωσύνην N- ----ASF-
ἁγιωσύνη ἁγιωσύνης N- ----GSF-
ἀγκάλη ἀγκάλας N- ----APF-
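
The grouping described above boils down to collecting a set of tuples. Here is a minimal sketch, not the exact script I ran: the function name is hypothetical, and I am assuming each MorphGNT line has seven whitespace-separated columns (reference, part of speech, parsing, text, word, normalized form, lemma).

```python
def unique_forms(lines):
    """Collect the distinct (lemma, normal form, part of speech, parsing) tuples."""
    seen = set()
    for line in lines:
        fields = line.split()
        if len(fields) != 7:
            # Skip blank or malformed lines rather than guessing at their layout.
            continue
        _ref, pos, parse, _text, _word, normalized, lemma = fields
        seen.add((lemma, normalized, pos, parse))
    # Plain code-point order; see the note below about proper Greek collation.
    return sorted(seen)
```

Plain sorted() orders by code point, which is not real Greek alphabetical order. With pyuca installed, passing a key such as `lambda row: collator.sort_key(row[0])`, where `collator = pyuca.Collator()`, would be the analogous sort-by-lemma step.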

Anyway, I hope to have more to share on this front later, but this just tickled my fancy.

Fun with LXXM-Corpus

Once I have a text available for natural language processing, there are a few basic tasks I like to perform to kick the tires. First, I like to run the collocations method of NLTK, which gives common word pairs from the text. For the LXXM, here are the results:

  • ἐν τῇ
  • ἐν τῷ
  • ὁ θεὸς
  • τῆς γῆς
  • καὶ εἶπεν
  • λέγει κύριος
  • ἀνὰ μέσον
  • τὴν γῆν
  • τοῦ θεοῦ
  • ὁ θεός
  • τάδε λέγει
  • πρός με
  • πάντα τὰ
  • ὁ βασιλεὺς
  • οὐ μὴ
  • οὐκ ἔστιν
  • τῇ ἡμέρᾳ
  • οἱ υἱοὶ
  • τῷ κυρίῳ
  • τοῦ βασιλέως

If you disregard the stop words, you can get a decent idea of the fundamental thematic content of the text.
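
If you don't have NLTK handy, a much cruder stand-in is to count adjacent word pairs directly. To be clear, this sketch is a simplification and not what NLTK's collocations method does (NLTK scores pairs with an association measure rather than using raw counts); the function name is made up.

```python
from collections import Counter


def top_bigrams(tokens, n=20):
    """Return the n most frequent adjacent word pairs, joined with a space."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return [" ".join(pair) for pair, _count in pairs.most_common(n)]


print(top_bigrams("ἐν τῇ γῇ ἐν τῇ ἡμέρᾳ".split(), n=1))
```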

Now for the silliness, using the n-gram random text generator:

ἐν ἀρχῇ ὁδοῦ πόλεως ἐπ' ὀνόμασιν φυλῶν τοῦ Ισραηλ παρώξυναν οὐκ ἐμνήσθησαν διαθήκης ἀδελφῶν καὶ ἐξαποστελῶ πῦρ ἐπὶ Μωαβ ἐν τῷ ἐξαγαγεῖν σε τὸν ἱματισμόν
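
The idea behind this sort of generator is simple enough to sketch in a few lines. This is a bare-bones bigram version written for illustration, not the generator that produced the text above; the function name and parameters are hypothetical.

```python
import random
from collections import defaultdict


def generate(tokens, start, length=15, seed=None):
    """Random walk over bigrams: each next word is drawn from the words
    that followed the current word somewhere in the source text."""
    rng = random.Random(seed)

    followers = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        followers[current].append(following)

    out = [start]
    # Stop at the requested length, or earlier if we hit a dead end.
    while len(out) < length and followers[out[-1]]:
        out.append(rng.choice(followers[out[-1]]))
    return out
```

Because followers are stored with repeats, common continuations are picked proportionally more often, which is what gives the output its vaguely biblical cadence.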

The τελος of Greek natural language processing

I dream that someday we'll have a full stack of Greek natural language processing tools to facilitate research. These tools will range from transcribing the text to advanced NLP tasks like text classification or sentiment analysis. These tools will of course be open source.

Here is an overview of the components I have imagined (with notes where the tools are already in development):

  • Optical Character Recognition to transcribe the text to a digital form (Rigaudon Polytonic Greek OCR)
  • A user interface for editing the output of the OCR system (a "collaborative corpus linguistics" suite could be used for this and other editing tasks)
  • Collation of related texts for textual criticism.
  • Morphological analysis of the text (Tauber's greek-inflection is a start)
  • Tagging of the text based on above morphological analysis
  • Indexing the text
  • Use of a context-free grammar or other means to produce syntactical analysis of the text (e.g. syntax trees)
  • A database to store all of this information
  • An API to make this information accessible (towards which Open Scriptures has worked)

We're actually pretty close. And once the full stack is in place, it will greatly increase the speed at which new texts enter the research corpus. This influx of data will improve the results of research and lead to new applications.

Am I missing anything?

Greek WOTD - ὑπέρογκος

ὑπέρογκος

Meaning "extremely large" or "rather difficult." Spotted in Lamentations 1:9:

ἀκαθαρσία αὐτῆς πρὸς ποδῶν αὐτῆς οὐκ ἐμνήσθη ἔσχατα αὐτῆς καὶ κατεβίβασεν ὑπέρογκα οὐκ ἔστιν ὁ παρακαλῶν αὐτήν ἰδέ κύριε τὴν ταπείνωσίν μου ὅτι ἐμεγαλύνθη ἐχθρός

Early Christian Writings

Early Christian Writings is an index of pre-Nicene Christian texts. It includes links to texts and translations (where available), as well as commentaries, and all of the works are tied up in a chronology.

Category: links Tags: Greek

Greek WOTD - στραγγαλόω

στραγγαλόω

Meaning what it sounds like, to choke or strangle. Found this in Ignatius' letter to the Trallians, 5.1.

This is the first in a possible series of Greek Word of the Day posts.

Ignatius and the improperly declined adjective

Update: I was inattentive to the fact that ἀξιόθεος is a two-termination adjective, and therefore masculine and feminine forms are morphologically fused. Thanks to Jack for pointing this out in the comments.

I just started reading Ignatius' letter to the Trallians and came across an oddity. In the opening sentence, Ignatius notes that the church to which he is writing (ἐκκλησίᾳ) is "worthy of God" (ἀξιοθέῳ). Since this adjective is modifying a feminine noun, I would expect the feminine form here. It sticks out like a sore thumb because Ignatius uses many adjectives to describe the church and they are all in the feminine form (ἠγαπημένῃ, ἁγίᾳ, κτλ), save this one.

At first I remarked to my wife: "there's a grammatical mistake in this sentence." But after considering it for a minute, it seems most likely that Ignatius put this adjective in the masculine form due to θεός being statically male in the Christian tradition. It probably "sounded wrong" to use the expected feminine ending.

I checked BDAG and found that there are attestations of ἀξιόθεος being declined for a ~~feminine~~ noun (Update: actually neuter in this case), including in Ignatius' letter to the Romans, 1:1. I am not sure if this is due to Ignatius' own idiosyncrasies, or due to transmission or spelling normalization by an editor. At the very least it was not a hard-and-fast rule, since we see both forms by the same author.

Additionally in the salutation to Romans, Ignatius does this same switching back and forth between feminine and masculine forms in describing the church. Perhaps someone has already done some research on this phenomenon, or perhaps there is a paper waiting to be written on this topic.

Category: language Tags: Greek

Rigaudon: Polytonic Greek OCR

I came across a very exciting project recently: Rigaudon. This is a polytonic Greek OCR system which has already been used on 532 texts of antiquity. The results are CC-BY-SA licensed, and the code is GPL v2 and available in a git repo. Bruce Robertson, one of the collaborators behind the project, also has other repositories, including one for a web-based interactive syntax tree editor. Check them out.

Transcription is the great boundary between the source texts and boundless application in the digital realm. A good polytonic Greek OCR system will unlock many texts which have never been digitized. This has a dual benefit: a "clean" transcription process can lead to permissive licensing for public domain works, and as a result, we'll all have a lot more texts for research.

The system is not perfect, but it is a work in progress and improvements can be made. Nonetheless, some manual editing will be required. However, these OCR results are the best I have seen for polytonic Greek. And the potential reward is so vast, I cannot help but get excited and get involved. There is already some correspondence circulating about collaborating around a particular text, which could then lead to morphological tagging and syntactic analysis, and maybe more.

In 2011 I wrote that the future is bright for copyright issues in Christianity. This is just one example of how that is so. Free software licenses for code and permissive licenses for content are becoming the norm at the cutting edge of the field. This is good for everyone, but there is still a lot of work to do (and maybe more than ever).

Unicode conversion of the LXX Morph text

I had been looking for a morphologically-tagged LXX for research and
came across the CATSS LXXM text. The one thing lacking for my use of
this text was that it was in betacode and not in unicode.

By searching I have found that many people have taken this text and
converted it to unicode for embedding in web sites, but to my knowledge
nobody is publishing the equivalent plain text files. The Unbound Bible
comes closest, but it publishes the text and the morphological analysis
in two separate files, which is suboptimal. So I decided to embark on
converting the LXXM to unicode.

Luckily James Tauber has shared a Greek betacode to unicode conversion
script which took care of most of the hard work for me. Using this,
I was able to convert all of the texts from betacode to unicode. I am
sharing the result as a git archive: lxxmorph-unicode.

The texts differ from the originals in the following ways:

  1. Several corrections have been applied.
  2. The betacode text has been converted to unicode.
  3. The files are now whitespace-separated rather than fixed-width.
  4. The second column, containing the POS and parsing information, has had its whitespace replaced with hyphens in accordance with the above.
  5. Combined the split files of Genesis, Psalms, Isaiah, Jeremiah, and Ezekiel, and renumbered all the files.
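
Items 2 through 4 can be pictured with a sketch like the one below. It is an illustration only, not my conversion script: the column widths are placeholder guesses rather than the real CATSS layout, the function name is hypothetical, the betacode converter is passed in as a plain callable so that no particular converter API is assumed, and treating the rest of the line as one lemma field is a simplification.

```python
def convert_line(line, to_unicode, word_width=25, parse_width=11):
    """Turn one fixed-width line into a whitespace-separated one.

    word_width and parse_width are hypothetical values; adjust them to the
    actual file layout. to_unicode is any betacode-to-unicode function.
    """
    word = line[:word_width].strip()
    parse = line[word_width:word_width + parse_width].strip()
    lemma = line[word_width + parse_width:].strip()

    # Internal whitespace in the parsing column would break a
    # whitespace-separated format, so replace it with hyphens.
    parse = "-".join(parse.split())

    return " ".join([to_unicode(word), parse, to_unicode(lemma)])
```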

Please note that this resource has a rather novel license which
requires users to fill out a user declaration and send it in to the
CCAT program at the University of Pennsylvania (see
0-user-declaration.txt in the repo). As far as I can tell, my redistribution of the unicode version complies with the license. I have contacted Robert Kraft (the former steward) and Bernard Taylor (the current steward) with the corrections I've found.

(link to the original announcement on the Open Scriptures mailing list)
