The Library Basement
Reading under ground


Strong's Dictionary in sqlite3

Someone asked the Open Scriptures mailing list about getting the Strong's Dictionary data into a sqlite3 database. Challenge accepted. And it was quite the challenge.

The Strong's repo for the Open Scriptures project contains an XHTML version of the Strong's dictionary. I would have used that data as a source, but for two problems: (1) it lacked transliterations, and (2) some of the Unicode lemmas for the Hebrew portion were missing. Thankfully the repo also contains the XML sources for the Greek and Hebrew. I decided to unleash Python with xml.sax.
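As a rough sketch of the xml.sax-to-sqlite3 approach (not the actual script), something like the following would work. The element and attribute names (`entry`, `number`, `lemma`, `translit`), the `strongs` table, and the `load_entries` helper are all made up for illustration; the real Strong's XML sources are laid out differently.

```python
import sqlite3
import xml.sax

# Hypothetical sample data; the real Strong's XML uses a different layout.
SAMPLE = """<entries>
  <entry number="G3056" lemma="λόγος" translit="lógos">word, saying</entry>
  <entry number="G26" lemma="ἀγάπη" translit="agápē">love</entry>
</entries>"""


class EntryHandler(xml.sax.ContentHandler):
    """Collects one row per <entry> element (assumed element name)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._attrs = None   # attributes of the entry being read, if any
        self._text = []      # text chunks seen inside the current entry

    def startElement(self, name, attrs):
        if name == "entry":
            self._attrs = dict(attrs.items())
            self._text = []

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate them.
        if self._attrs is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "entry":
            self.rows.append((
                self._attrs.get("number"),
                self._attrs.get("lemma"),
                self._attrs.get("translit"),
                "".join(self._text).strip(),
            ))
            self._attrs = None


def load_entries(xml_text, conn):
    """Parse xml_text and insert each entry into a 'strongs' table."""
    handler = EntryHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS strongs "
        "(number TEXT PRIMARY KEY, lemma TEXT, translit TEXT, description TEXT)"
    )
    conn.executemany("INSERT INTO strongs VALUES (?, ?, ?, ?)", handler.rows)
    conn.commit()
    return len(handler.rows)
```

Calling `load_entries(SAMPLE, sqlite3.connect(":memory:"))` fills an in-memory database. Pass a file path to `sqlite3.connect` instead to keep the result on disk.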

Unfortunately those XML sources were in two different formats, so I had to write two different parsers. Also, the Greek portion contains self-references with just the number, not the Unicode string, so I had to write a second-pass parser to fill in the missing lemmas. It also turns out that some of these self-references point to Strong's numbers which are not a part of the dataset, which has me a bit perplexed (I'll be following up on that soon).
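The second pass can be pictured roughly like this. It is only a sketch under assumptions: the `entries` dictionary shape, the `resolve_references` name, and the sample data (including the out-of-range number `G9999`) are hypothetical and do not come from the script itself.

```python
def resolve_references(entries):
    """Second pass: attach lemmas to number-only cross references.

    `entries` maps a Strong's number to a dict with a "lemma" string and
    a "refs" list of referenced numbers. Returns (resolved, missing),
    where `missing` holds referenced numbers that are not in the dataset.
    """
    resolved = {}
    missing = set()
    for number, entry in entries.items():
        pairs = []
        for ref in entry["refs"]:
            target = entries.get(ref)
            if target is None:
                # The reference points outside the dataset; keep the
                # number but record that no lemma could be found.
                missing.add(ref)
                pairs.append((ref, None))
            else:
                pairs.append((ref, target["lemma"]))
        resolved[number] = pairs
    return resolved, missing


# Illustrative data only.
entries = {
    "G3004": {"lemma": "λέγω", "refs": []},
    "G3056": {"lemma": "λόγος", "refs": ["G3004"]},
    "G0001": {"lemma": "Α", "refs": ["G9999"]},
}
resolved, missing = resolve_references(entries)
```

Collecting the unresolved numbers in a separate set makes it easy to list the references that fall outside the dataset for later follow-up.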

After changing my mind a few times about how I wanted to approach the "description" part of each entry (and some accompanying refactors of the code), I finally got a working product. You can find it in my Biblical Studies git repo. I put it under the MIT license so people can do whatever they need to do with it.

I am not really sure if there are any other open source Strong's->sql importers out there. Maybe someone can take my script and give it support for other databases (or even frameworks, like Django).

Edit: And Darrell Smith provided code for doing it with regex in PHP. Technology can provide many paths. Glad to see there are so many helpers on the Open Scriptures mailing list.

Update: I've updated the script to use version 1.5 of the Strong's Greek XML, and it also downloads the source files automatically, so you don't have to check out the Open Scriptures git repo if you don't need it otherwise.

Update 2 (March 10, 2012): The MorphGNT site was moved to GitHub, so I've updated the link to the Strong's Greek database in the script. Also, here is a compressed copy of the sqlite3 database which results from the script.

XML and the Bible

While working on an importer to bring the SBL Greek New Testament into Open Scriptures, I noticed some interesting features of the SBLGNT XML file. (I promised that I would try to exclude posts of a technical nature from this blog, but I am breaking that promise, because I think this technical discussion is interesting and applicable to Biblical studies.)

The SBLGNT's XML representation of the Biblical text makes an interesting distinction between tags which have child elements and childless tags. That is, normal XML tags encompass the actual Greek text and its structures (such as paragraphs and books), while childless tags represent insertions which are not original to the text. Here is a truncated Matthew 1:1 in the SBLGNT XML as an example:

<book id="Mt">
<title>ΚΑΤΑ ΜΑΘΘΑΙΟΝ</title>
<verse-number id="Matthew 1:1">1:1</verse-number>
<suffix> </suffix>
<suffix>. </suffix>


Notice how there is no "verse" tag which encompasses all of the included text. Instead, "verse-number" is a tag which is inserted wherever the verse breaks are located. This is opposed to the "p" (paragraph) tag, which encompasses all of the child "w" (word) and "suffix" (spaces and punctuation) tags. Paragraphs are of course present in the original biblical text.
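To show what this design means for anyone consuming the file, here is a minimal sketch that rebuilds verse text from the "verse-number" markers. The sample below is an abbreviated stand-in modeled on the description above, not the actual SBLGNT text, and the `verses_from_milestones` function name is my own.

```python
import xml.etree.ElementTree as ET

# Abbreviated, illustrative sample; not the full SBLGNT text.
SAMPLE = """<book id="Mt">
<title>ΚΑΤΑ ΜΑΘΘΑΙΟΝ</title>
<p>
<verse-number id="Matthew 1:1">1:1</verse-number>
<w>Βίβλος</w><suffix> </suffix>
<w>γενέσεως</w><suffix>. </suffix>
<verse-number id="Matthew 1:2">1:2</verse-number>
<w>Ἀβραὰμ</w><suffix> </suffix>
<w>ἐγέννησεν</w><suffix>. </suffix>
</p>
</book>"""


def verses_from_milestones(xml_text):
    """Group word and suffix text under the most recent verse-number."""
    root = ET.fromstring(xml_text)
    verses = {}
    current = None  # carried across paragraphs, since verses may span them
    for para in root.iter("p"):
        for child in para:
            if child.tag == "verse-number":
                current = child.get("id")
                verses.setdefault(current, "")
            elif child.tag in ("w", "suffix") and current is not None:
                verses[current] += child.text or ""
    return {ref: text.strip() for ref, text in verses.items()}


verses = verses_from_milestones(SAMPLE)
```

Because the verse markers are empty milestones rather than containers, the reader has to track the "current verse" itself while walking the paragraph's children.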

One thing I might have done to take this principle even further would be to put the book titles where they appear in the Greek manuscripts. In SBLGNT XML, the title is always the first child element of the "book" tag. However, that is not always where the title was in the manuscripts. Sometimes it appeared at the end of the book.

I like the distinction between textual forms and externally imposed structures as reflected in this XML document. I can't be sure of Logos' exact thinking behind these design choices, but I think I've identified the principle.

SBL Greek New Testament Licensing

A lot has been said already about the SBL Greek New Testament text (SBLGNT) which has just been released. It is not every day that a new eclectic text is released. I think that what's just as noteworthy as the release itself is the license under which the text is distributed.

As has been discussed on the Open Scriptures mailing list, the license is fairly permissive, and allows for gratis distribution. In Creative Commons terms, it is essentially an Attribution/Non-commercial license. Actually the non-commercial part is qualified, because the text may be included in works that are sold under certain circumstances. There are a few curious features of the license I'd like to address.

Derivative Works?

The text of the license is silent on whether it is permissible to distribute modified copies of the SBLGNT. For example, am I allowed to "fork" the text and distribute it with my own changes? It's not clear to me from the license. The license does specifically disallow one type of derivative work (I'll discuss that below), so the implication is that derivative works are otherwise OK.

Reporting Requirement

One of the requirements for qualified commercial distribution is reporting:

If you give away the SBLGNT for use with a commercial product or sell a print or electronic work containing more than 500 verses from the SBLGNT, you must annually report the number of units sold, distributed, and/or downloaded to the Society of Biblical Literature’s Rights and Permissions Office.

This is not an overly harsh measure, in my opinion, and I can certainly understand why they require it. But ideally a license would have no reporting requirement. It's just a little extra burden.

English Diglots

As I mentioned above, the SBLGNT license has one prominent exception for redistribution: if you are going to distribute an English diglot, you need to obtain a separate license from SBL. Presumably this is because they are considering publishing their own English-Greek diglot. You can publish a diglot with languages other than English. I also wonder whether an edition with three or more languages including the SBLGNT and English is permissible. It is a bit of an odd stipulation for a license given that it is so particular, but it is not the worst thing in the world.


As I have written before, I think that the Christian scriptures should not have any copy restrictions placed on them. So I think it would be ideal for the SBLGNT text and apparatus to be released into the Public Domain. Barring that, I think I would recommend re-licensing under something like a Creative Commons Attribution/Non-commercial license which is a bit more established. That would iron out some of the ambiguities of the current license.

Here I've gone and looked the gift horse in the mouth. I am quite pleased by the release of the SBLGNT. Moreover, I think its license is one of the best available for a scriptural text. The permissive license instantly makes this text one of the most important available. It is definitely a step in the right direction. I believe we'll be working on getting the text imported into Open Scriptures before too long.