I dream that someday we'll have a full stack of Greek natural language processing tools to facilitate research. These tools will cover everything from transcribing the text to advanced NLP tasks like text classification or sentiment analysis, and they will of course be open source.
Here is an overview of the components I have imagined (with notes where the tools are already in development):
- Optical Character Recognition to transcribe the text to a digital form (Rigaudon Polytonic Greek OCR)
- A user interface for editing the output of the OCR system (a "collaborative corpus linguistics" suite could be used for this and other editing tasks)
- Collation of related texts for textual criticism
- Morphological analysis of the text (Tauber's greek-inflection is a start)
- Tagging of the text based on the morphological analysis above
- Indexing the text
- Use of a context-free grammar or other means to produce a syntactic analysis of the text (e.g. syntax trees)
- A database to store all of this information
- An API to make this information accessible (towards which Open Scriptures has worked)
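To make the middle of that list a little more concrete, here is a minimal sketch of how the morphological analysis, tagging, and indexing steps might hand data to one another. Everything in it is an assumption made for illustration: the names `Token`, `analyze`, and `build_index` are hypothetical, the tag labels are invented, and the tiny hand-written lexicon stands in for a real analyzer (it is not the API of greek-inflection or of any other project mentioned above). The one real-world detail it does show is Unicode normalization, since polytonic Greek can be encoded in several equivalent ways and OCR output may not agree with a lexicon on which one to use.

```python
import unicodedata
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


def nfc(s: str) -> str:
    """Normalize to NFC so precomposed and decomposed Greek compare equal."""
    return unicodedata.normalize("NFC", s)


# Toy lexicon (hypothetical): surface form -> (lemma, tag).
# A real morphological analyzer would derive these instead of looking them up.
_RAW_LEXICON = {
    "ἐν": ("ἐν", "PREP"),
    "ἀρχῇ": ("ἀρχή", "NOUN-DAT-SG-F"),
    "ἦν": ("εἰμί", "VERB-IMPF-IND-3-SG"),
    "ὁ": ("ὁ", "ART-NOM-SG-M"),
    "λόγος": ("λόγος", "NOUN-NOM-SG-M"),
}
LEXICON = {nfc(form): (nfc(lemma), tag) for form, (lemma, tag) in _RAW_LEXICON.items()}

PUNCTUATION = ".,;\u00b7"  # U+00B7 is the NFC form of the Greek ano teleia


@dataclass(frozen=True)
class Token:
    position: int
    form: str
    lemma: Optional[str]  # None when the toy lexicon has no entry
    tag: str


def analyze(text: str) -> list:
    """Normalize, split on whitespace, and tag each token from the toy lexicon."""
    tokens = []
    for raw in nfc(text.lower()).split():
        form = raw.strip(PUNCTUATION)
        if not form:
            continue
        lemma, tag = LEXICON.get(form, (None, "UNKNOWN"))
        tokens.append(Token(len(tokens), form, lemma, tag))
    return tokens


def build_index(tokens: list) -> dict:
    """Map each lemma (or the surface form, if the lemma is unknown) to its positions."""
    index = defaultdict(list)
    for token in tokens:
        index[token.lemma or token.form].append(token.position)
    return dict(index)


if __name__ == "__main__":
    tokens = analyze("ἐν ἀρχῇ ἦν ὁ λόγος, καὶ ὁ λόγος")
    for token in tokens:
        print(token)
    print(build_index(tokens))
```

The point of the sketch is only the shape of the hand-off: each stage consumes the previous stage's output, so any one of them could be swapped for a better tool without touching the others. Words the lexicon does not know are tagged `UNKNOWN` rather than dropped, which keeps them visible for later correction.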
We're actually pretty close. And once the full stack is in place, it will greatly increase the speed at which new texts enter the research corpus. This influx of data will improve the results of research and lead to new applications.
Am I missing anything?