The Library Basement
Reading under ground


Separate php-fpm pools for great victory

Let's say you use a pretty standard Nginx/PHP-FPM/Linux/MariaDB ("nephilim"?) stack for hosting web applications. On most distributions you'll have a single php-fpm pool which spawns workers to execute tasks handed to it by the web server, either via a UNIX or TCP socket. That's great for simplicity's sake.

But what if you have some web app you want to run but don't really trust? (Hint: you shouldn't really trust any internet-facing application.) If there's a remote code execution flaw in the code for webapp foo, an attacker then assumes the security persona of the entire php-fpm pool, including access to other applications' memory, file-system space, and databases. Yikes!

Nothing in the below is particularly novel, but it may be useful nonetheless. There's also the container approach to solving this, which is probably more secure overall, but is not available to everyone. The context of the examples below is running GNU Social on CentOS 7. (On CentOS 7, nginx runs as the "nginx" user, and php-fpm runs by default as the "apache" user, the same user httpd normally runs as.)

Separate Databases

Each application should have its own database with its own unique username and password. I think most people know this, but I am stating it here for good measure.
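
For MariaDB, that boils down to something like this (a sketch; the database name, username, and password are illustrative):

mysql -u root -p <<'SQL'
CREATE DATABASE social CHARACTER SET utf8mb4;
CREATE USER 'social'@'localhost' IDENTIFIED BY 'a-long-random-password';
GRANT ALL PRIVILEGES ON social.* TO 'social'@'localhost';
FLUSH PRIVILEGES;
SQL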

Separate User

Each web application should have its own local unprivileged user account. If that account never needs a shell environment, it is best to not give it a login shell either. GNU Social requires a shell to run its queue daemon scripts, so here is how I did it:

useradd -m -s /bin/bash social

Assuming you have "PermitEmptyPasswords no" in your sshd_config, you don't have to set a password. Otherwise set a very strong one. It will never be used in normal operation.

A note specific to GNU Social: the queue daemons should run as this user as well. We're in a systemd world now on Linux, so below is an example of the relevant settings for a queue daemon unit file. You'll want to set the user and group:

User=social
Group=social
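
Here is a minimal sketch of a complete unit file (the install path and GNU Social's startdaemons.sh/stopdaemons.sh scripts are assumptions based on my setup; adjust to match yours):

# e.g. /etc/systemd/system/gnusocial-queue.service
[Unit]
Description=GNU Social queue daemons
After=network.target mariadb.service

[Service]
Type=forking
User=social
Group=social
ExecStart=/var/www/social.example.com/scripts/startdaemons.sh
ExecStop=/var/www/social.example.com/scripts/stopdaemons.sh

[Install]
WantedBy=multi-user.target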

Separate File-system Path

Take note of the group your webserver (nginx in my example) runs as. On CentOS it is "nginx"; on Debian derivatives it appears to be "www-data".

Say you've extracted your web application's files into /var/www/social.example.com/ . You'll want to lock this down so that only the application pool user and the webserver have access:

chown -R social:nginx /var/www/social.example.com/
chmod -R o-rwx /var/www/social.example.com/
# Also follow GNU Social's install instructions for setting
# write permissions on avatar/, file/, and the base directory so
# config.php can be written by the installer

This way the web server can read the application's root contents (e.g. php scripts and static files), and the php-fpm pool for your application will have write access (for writing the configuration at install time and uploading files). Other users should have no access to this location (go ahead, test it).

When you create a separate php-fpm pool below, you'll need to provide a session and cache path which are writable by the social user:

mkdir -p /var/lib/social/{cache,session}
chown root:social /var/lib/social/{cache,session}
chmod 770 /var/lib/social/{cache,session}

Separate php-fpm pools

Pool definitions for php-fpm are typically found in /etc/php-fpm.d/. Your mileage may vary based on distribution, etc. Take a look at the default pool to see how it is configured.

Depending on the resources of your system, you may want to reduce the value of pm.max_children (and related settings) to make room for your new pool. This can be tuned depending on the relative resource demands of your pools.

Now copy the default pool to a new file in the same directory called social.conf and edit it. Below are the required edits (a sketch of the resulting pool file follows the list):

  1. Give the pool a unique socket: either a different path for a UNIX socket, or a different port number for a TCP socket. Assuming everything is on a single server, I recommend the UNIX socket, e.g. "listen = /var/run/php-fpm-social.sock"
  2. Set "user = social"
  3. Set "group = social"
  4. Set "php_value[session.save_path] = /var/lib/social/session"
  5. Set "php_value[soap.wsdl_cache_dir] = /var/lib/social/cache"

Point the nginx configuration file for the site at the unique socket listed above:

fastcgi_pass unix:/var/run/php-fpm-social.sock;
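
In context, the relevant location block might look something like this (a sketch; the exact fastcgi_param lines depend on your existing site config):

location ~ \.php$ {
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_pass unix:/var/run/php-fpm-social.sock;
}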

Now you are ready to restart php-fpm, nginx, and your queue daemons. If you run the following, you should see some php-fpm workers running as social:

ps aux | grep php-fpm

If there is trouble, there are a few places you'll want to look:

  • nginx error log
  • nginx access log
  • php-fpm error.log
  • php-fpm pool error log (e.g. www-error.log, or the equivalent for your new pool)

Assuming that worked, you've got a separate, more-secure install of GNU Social. I did the foolish thing and changed the configuration after installing the site. I don't recommend it, unless you want an exercise in rapid troubleshooting. ;-)


koine-nlp release

Today I am formally releasing koine-nlp 0.2, a Python library for common NLP-related tasks for Koine Greek. I decided to make a fancy koine-nlp homepage with the help of Sphinx. It includes info on installation, a tutorial, and an API reference for the koinenlp module. You can find the source repository on my Gogs instance.

In the most basic mode of operation, koine-nlp is used to prepare polytonic Greek text for indexing by normalizing it. This is done by means of the omnibus normalize() function:

>>> import koinenlp
>>> koinenlp.normalize("καὶ ἡ σκοτία αὐτὸ οὐ κατέλαβεν.")
'και η σκοτια αυτο ου κατελαβεν'

There's plenty more to it; see the documentation for details.

I do plan on adding some features in the future, so watch this space.

A categorized, tagged Septuagint corpus

Last year I created a version of the SBLGNT for use as a categorized, tagged corpus for natural language processing. Now I have done the same with a Septuagint text. I am calling it LXXMorph-Corpus. The source for text and tags is my Unicode conversion of the CATSS LXXMorph text.

The text is arranged with one book per file. Certain books in the source LXXMorph text are split where there is significant textual divergence (manuscripts B and A, or the Old Greek and Theodotion). Each file has one or more categories (e.g. pentateuch and writings).

Since there is no punctuation in the source text, the files are laid out with one verse per line. A better arrangement from an NLP perspective would be one line per sentence (thereby preserving the semantic structure). Maybe someday we'll have a freely-licensed LXX text which will include sentence breaks.

Each word is accompanied by the morphological tag in the word/tag format (NLTK will automatically split word and tag on the slash). The part of speech tag is separated from the parsing information with a hyphen, which enables the use of the simplify tags function in NLTK.

Here follows an example of how to load this corpus into NLTK:

from nltk.corpus.reader import CategorizedTaggedCorpusReader

def simplify_tag(tag):
    # Keep only the part-of-speech portion, dropping the parsing
    # details after the hyphen; untagged (None) tokens pass through.
    try:
        if '-' in tag:
            tag = tag.split('-')[0]
        return tag
    except TypeError:
        return tag

lxx = CategorizedTaggedCorpusReader('lxxmorph-corpus/',
    r'\d{2}\..*', encoding='utf8',
    tag_mapping_function=simplify_tag,
    cat_file='cats.txt')

Now through the lxx object you have access to tagged words - lxx.tagged_words(), simplified tags - lxx.tagged_words(simplify_tags=True), tagged sentences - lxx.tagged_sents(), and textual categories - lxx.words(categories='former-prophets').
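
As a quick example of where to go from there, here is a sketch which counts the simplified part-of-speech tags across the whole corpus using NLTK's FreqDist (assuming an NLTK version that supports the simplify_tags keyword, as used above):

from nltk import FreqDist

# Frequency distribution over the simplified part-of-speech tags
tag_fd = FreqDist(tag for (word, tag)
                  in lxx.tagged_words(simplify_tags=True))
tag_fd.tabulate()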

This is a derivative work of the original CATSS LXXMorph text, and so your use of it is subject to the terms of that license. See the README file for more details.

Spigot 2.2

I have released Spigot 2.2. The primary purpose of this release is to support the use of any arbitrary field from the incoming feed in the format of the outgoing message. Previously Spigot limited you to the title or link, but now you have more options, including the author and others.

This update requires a database schema change as well as an update to your configuration file. The new version will prompt you to upgrade these if necessary. I have provided a script in the git repo to handle this upgrade for you. New users have nothing to worry about.

Rigaudon: Polytonic Greek OCR

I came across a very exciting project recently: Rigaudon. This is a polytonic Greek OCR system which has already been used on 532 texts of antiquity. The results are CC-BY-SA licensed, and the code is GPL v2 and available in a git repo. Bruce Robertson, one of the collaborators behind the project, also has other repositories, including one for a web-based interactive syntax tree editor. Check them out.

Transcription is the great boundary between source texts and their boundless application in the digital realm. A good polytonic Greek OCR system will unlock many texts which have never been digitized. This has a dual benefit: a "clean" transcription process can lead to permissive licensing for public domain works, and as a result, we'll all have a lot more texts for research.

The system is not perfect, but it is a work in progress and improvements can be made. In the meantime, some manual editing will be required. Still, these OCR results are the best I have seen for polytonic Greek. And the potential reward is so vast, I cannot help but get excited and get involved. There is already some correspondence circulating about collaborating around a particular text, which could then lead to morphological tagging and syntactic analysis, and maybe more.

In 2011 I wrote that the future is bright for copyright issues in Christianity. This is just one example of how that is so. Free software licenses for code and permissive licenses for content are becoming the norm at the cutting edge of the field. This is good for everyone, but there is still a lot of work to do (and maybe more than ever).

Spigot 2.0

I updated Spigot to work with pump.io! In addition, the package is now included in PyPI, so you can install it with pip:
pip install spigot

Then you'll have spigot.py on your path. Running spigot.py for the first time will prompt you to configure one account and one feed. Use spigot.py --help for all options, and check out the project page for more info. Git repo is here.

A categorized, tagged Greek New Testament corpus

I have published a categorized, tagged Greek New Testament useful for natural language processing. I am calling it sblgnt-corpus. The text comes from the SBLGNT and the morphological tags come from the MorphGNT project.

The text is broken up with one book per file. Each file has one or more categories (e.g. gospel and pauline). In the files there is one sentence (not verse) per line. Sentences are demarcated by the punctuation marks . ; and ·. This makes it easy to tokenize sentences by splitting on newlines.

Each word is accompanied by the morphological tag in the word/tag format (NLTK will automatically split word and tag on the slash). The part of speech tag is separated from the parsing information with a hyphen, which enables the use of the simplify tags function in NLTK.

Here is an example:

εὐθυμεῖ/V-3PAIS τις/RI-NSM ;/;
ψαλλέτω/V-3PADS ./.

Here follows an example of how to load this corpus into NLTK:

from nltk.corpus.reader import CategorizedTaggedCorpusReader

def simplify_tag(tag):
    # Keep only the part-of-speech portion, dropping the parsing
    # details after the hyphen; untagged (None) tokens pass through.
    try:
        if '-' in tag:
            tag = tag.split('-')[0]
        return tag
    except TypeError:
        return tag

sblgnt = CategorizedTaggedCorpusReader('sblgnt-corpus/',
    r'\d{2}-.*', encoding='utf8',
    tag_mapping_function=simplify_tag,
    cat_file='cats.txt')

Now through the sblgnt object you have access to tagged words - sblgnt.tagged_words(), simplified tags - sblgnt.tagged_words(simplify_tags=True), tagged sentences - sblgnt.tagged_sents(), and textual categories - sblgnt.words(categories='gospel').

That should be enough to kickstart the exploration of the Greek New Testament with natural language processing.

Announcing Spigot

I am a StatusNet user (find me @nds on Identi.ca), and from time to time I like to automatically share links to blog posts and other syndicated content on my account. There are some excellent tools to do this, but I was left wanting. Tricklepost has the great feature of being able to limit the rate at which posts come through, but it has the rather onerous requirement of a full-blown MySQL install and is not very flexible. Brdcst.it is a service which can post to many types of accounts (not just StatusNet), but it is also a full web application and does not have rate-limiting. I wanted something fairly light which I could run on my local system as a cron job.

So I wrote my own. Well, sort of: I stood on the shoulders of the Identicurse project, which had already implemented Python bindings to the StatusNet API.

Spigot is a rate-limiting feed aggregator which posts to StatusNet accounts. It requires only Python 2.6 or greater, standard library modules, and a couple of third-party modules which are easily accessible via the Python Package Index. It can poll an arbitrary number of feeds and post to an arbitrary number of accounts. You can set a minimum interval between posts for each feed. See the project page for information on obtaining and running it.

This is the first major Python project I've seen through from start to finish, so I am fairly proud. Please contact me with any bugs, requests, or patches.

Polytonic Greek in Dvorak layout for Linux

I type with the Dvorak keymap. I also type polytonic Greek, which in every operating system I use is based on the Qwerty keymap. So it gets very confusing and annoying to switch back and forth.

So I modified the Linux Xorg Greek keymap to correspond to the English Dvorak layout. And I'd like to share.

  1. Download the modified keymap.
  2. For steps 3-5 you'll need root privileges, so use sudo or su to get them.
  3. Find your xkb symbols directory - on Debian-based systems it is in /usr/share/X11/xkb/symbols, but it has also been placed in /etc/X11/...
  4. Backup your existing Greek layout by copying "gr" from that folder to a safe place.
  5. Remove the ".txt" extension from the modified keymap you downloaded and place the file in your xkb symbols directory.
  6. From a terminal issue this command to "reset" your keymap to normal:
    setxkbmap -layout us -variant dvorak
    This doesn't change anything yet, but you'll want it in your terminal history so you can get back to regular English Dvorak.
  7. Issue the following command to make the keymap active:
    setxkbmap -layout gr -variant dvpoly

Now you are typing polytonic Greek with a Dvorak layout. As others have noted, though, this does not have the spirit of the Dvorak keymap. It is not constructed based on actual usage of Greek, and it will not be any more efficient than the Qwerty-based layouts (and given the placement of the accent keys in my layout, it may actually be worse). The purpose of this keymap is purely to make it easier for English Dvorak typists to transition to polytonic Greek.

The primary departure I made from the English Dvorak paradigm was moving the semi-colon so that the Greek accent dead keys could be closer to one another. In this keymap, the "semi-colon" (actually Greek question mark) can be typed using the "Q" key on the Qwerty layout, and the acute and grave accents can be typed using the "Z" and "X" keys respectively (which are ";" and "q" in the Dvorak layout). Here is a screenshot of the layout for a more explicit reference.

The xkb keymaps are under the MIT license. I used this non-polytonic layout as a guide for my work. Please post with comments, questions, bugs, etc. For more info on how to type polytonic Greek in Linux, see this excellent post on B-Greek. It references the Qwerty-based layout, but the same principles apply.

Random Genesis

I have just begun working through Natural Language Processing with Python. One of the first features highlighted in the first chapter is the ability of NLTK (the Natural Language Toolkit, a Python module) to generate random text from a corpus.
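
If you want to try it yourself, here is a minimal sketch following the book's first chapter (assuming you have downloaded the book corpora, e.g. via nltk.download(); text3 is the book of Genesis):

>>> from nltk.book import text3  # loads the sample texts, including Genesis
>>> text3.generate()             # prints random text modeled on the corpus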

Without further ado, here is what my system generated based on the book of Genesis in the KJV:

In the selfsame day , neither do thou any thing that creepeth upon the bank of the east wind , sprung up after th And I will send thee a covering of the Philistines unto Gerar . And he commanded the steward of my master greatly ; and she bare unto Jacob , went forth to go down and buy thee fo But if thou be in the second , and fall upon Adam , in the land is good : and his two womenservants , and begat sons and his eleven sons , and put every man ' s

Sound realistic? ;-)

Natural Language Processing with Python

I was browsing through a local bookshop's computer section recently and saw a title which instantly grabbed my attention: Natural Language Processing with Python. It was a bit more expensive than I wanted to pay at that moment, but I thought I might save up.

As it happily turns out, the entire book is available online under a Creative Commons license (BY-NC-ND). This is the sort of thing which makes me really happy. I am going to be checking it out, and if it is useful enough, I may buy the paper copy to thank the authors and O'Reilly for publishing such a great book.

The book is focused mostly on the Natural Language Toolkit (NLTK) Python module, which is available under an Apache license. I had never used it before, but it looks fairly capable. I must admit I was somewhat surprised that Google finds relatively few pertinent results when searching for "nltk new testament greek" or "nltk biblical studies." The library seems quite suited to the field, so I am surprised it is not more popular among Bible scholars. If NLTK is any good, I intend to change that.

The Best Laid Plans (BibleTech 2011)

I had planned to attend the BibleTech conference this weekend. However Elias came down with his first-ever illness, so I thought it best to cancel my trip and help take care of him. As it happens, Weston's trip to the conference was also cancelled due to family exigencies, and so he was not able to give his talk about Open Scriptures.

I am pretty sad, because I have really been looking forward to going to this conference for a few years now. Oh well, better luck next year!


Memory Verse Rank update

I've updated the Memory Verse Rank site to what will probably be its final form. Here are the basic changes I made:

  • Changed the color scheme and fonts, and added an icon.
  • Changed some instances of "rank" to "rate" to make a semantic distinction between the act of rating a verse and the overall rankings.
  • Now drawing from among all the Bible books, instead of just Ephesians.
  • Instead of trying to choose a random Bible verse, I will be using the ESV API's built-in "random" verse function. It returns a passage from a pre-picked set. This has two advantages for the site: 1) it solves some technical difficulties in implementing random verse selection from such a large dataset, and 2) it will keep the pool of memorable verses from becoming too diluted.

Someday I may add support for user accounts, but not for now. I'll just let the site run and see what happens. Tell your friends.

Memory Verse Rank

I have just launched Memory Verse Rank. The premise is simple: I show you a Bible verse, you tell me if it is a memory verse. What's a memory verse? Use whatever criteria you like, but start with whether or not you've ever actually memorized it. I am not entirely sure where the data will lead, but it will be interesting to analyze the properties of memorable and non-memorable verses. I also hope to have all kinds of stats like "most memorable book," etc. Right now I am testing with just Ephesians. I still need to work out a better way to pull random verses from the ESV API, which I use for the scripture backend. Please keep an eye out for bugs, and send me any improvements you would like to see.

Huck Finn: Python edition

Someone proposed a Kickstarter project to replace the "n-word" with "robot" in Huck Finn. The project is in the vein of other recent humorous edits of literature in the public domain, though they claim to have an altruistic goal - to get the redacted version of the story back into the hands of kids everywhere. You can even get your name added as a minor character to the book if you donate enough to the project! Let's call it "benevolent censorship". Or maybe "the rape of the public domain." Actually, best not to describe it at all.

The scope of the project includes commissioning an introduction, altering illustrations, and editing the text. Well, I can help with one part of that. The full text of The Adventures of Huckleberry Finn is available on Project Gutenberg. I've written [a short Python script to replace every instance of the n-word with "robot."][]

Just run it and then you'll have a nice txt file of the robot edition. There, now the editing part is done. Can I get my name added as a townsperson?

[a short Python script to replace every instance of the n-word with "robot."]: http://thelibrarybasement.com/images/2011/02/robot_huck_finn.py_.txt
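
For the curious, here is a minimal sketch of what such a script boils down to (the linked script above is the real one; this version assumes the Gutenberg plain text is saved locally as huckfinn.txt, and leaves the word itself as a placeholder):

import re

# Placeholder for the word being replaced; fill it in yourself.
THE_WORD = '...'

with open('huckfinn.txt') as f:
    text = f.read()

# Replace every instance, regardless of capitalization
edited = re.sub(re.escape(THE_WORD), 'robot', text, flags=re.IGNORECASE)

with open('robot_huck_finn.txt', 'w') as f:
    f.write(edited)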
