MW Greek Info

Coding of Greek in MW

This document describes how references to Greek words have been encoded in the Cologne edition of the Monier-Williams Sanskrit-English dictionary. The result of this work may be seen in two places:

A list of the records of MW with corresponding Greek words. A few omissions are indicated by a '?'.
In MW itself, by looking up an entry that contains Greek words. The Greek words appear in Unicode Greek, and each contains a link to the Perseus Word Study Tool.

The starting point of this work was an earlier (2007) edition of the xml form of MW in which Greek was embedded in the Beta encoding. This work was done by Wendy Teo under the supervision of Peter Scharf and Malcolm Hyman. In this earlier edition, the position of text in the Greek alphabet had been indicated in Malten's initial coding by a placeholder character, '$'; text in other alphabets, such as Arabic, had been similarly indicated. The task was thus reduced to examining all such positions in relation to the scanned image of the page; for those instances identified as Greek, the placeholder character was replaced by a standard transliteration of the Greek. This transliteration was delimited by an identifiable character string, so that subsequent work could identify the instances of Greek transliteration.

The next step was to integrate this encoding of the earlier MW edition into the current database representing MW. I chose to replace the still-present placeholder character with an abstract element <gk>n</gk>. The content of this element is an index number for the MW record. For instance, <gk>1</gk> represents the 1st Greek word in a given record, <gk>2</gk> represents the 2nd Greek word in a given record, and so forth.
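
The indexing scheme can be illustrated with a short Python sketch. It assumes the Greek words for a record are already available as an ordered list; the function name `fill_greek`, the sample record text, and the '?' fallback for an out-of-range index are hypothetical and are not taken from the MW display code.

```python
import re

def fill_greek(record_text, greek_words):
    """Replace each <gk>n</gk> with the n-th (1-based) Greek word of the record."""
    def substitute(match):
        index = int(match.group(1))
        if 1 <= index <= len(greek_words):
            return greek_words[index - 1]
        return "?"  # assumption: mark an index with no matching word

    return re.sub(r"<gk>(\d+)</gk>", substitute, record_text)

# Hypothetical record text containing two placeholders.
sample = "cf. Gk. <gk>1</gk>, <gk>2</gk>"
print(fill_greek(sample, ["ἀ", "ἀν"]))
```

Because the index is local to the record, the same `<gk>1</gk>` element can appear in many records without ambiguity.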

Then, a separate database table, mwgreek, was created. It contains two fields:

lnum: the unique MW record number (the contents of the <L> element)
data: A text string with an idiosyncratic encoding of the sequence of Greek words found in the given MW record.
If there are multiple Greek words for a given L-number, the occurrences are separated in the mwgreek data element by the string "<gk>". For each occurrence of Greek, three pieces of information are stored:
1. The beta code of the Greek word
2. The Unicode representation of the Greek word, based on the beta code
3. An appropriate reference to the Greek word using a web service provided by the Perseus Digital Library. For some words, this reference is the same as the beta code. It is known that for some words a different spelling, in beta code, is required to access the desired Perseus data.

These three pieces are separated in the mwgreek data element by the string "<e>". For instance, with L=4, there are two instances of Greek. The mwgreek table entry for this is coded as
```
A)<e>A<e>ἀ<gk>A)N<e>A%29N<e>ἀν
```
In this example the second piece of each occurrence is the Perseus reference ('A', 'A%29N') and the third is the Unicode form.
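
A minimal Python sketch of how such a data string could be split is given here. It assumes the field order seen in the L=4 example (beta code, then Perseus reference, then Unicode form); the function name `parse_mwgreek` and the dictionary keys are hypothetical.

```python
def parse_mwgreek(data):
    """Split one mwgreek 'data' string into a list of per-word dictionaries."""
    entries = []
    for occurrence in data.split("<gk>"):
        parts = occurrence.split("<e>")
        if len(parts) != 3:
            raise ValueError(f"expected 3 pieces, got {len(parts)}: {occurrence!r}")
        # Assumed order, inferred from the L=4 example.
        beta, perseus, unicode_form = parts
        entries.append({"beta": beta, "perseus": perseus, "unicode": unicode_form})
    return entries

row = "A)<e>A<e>ἀ<gk>A)N<e>A%29N<e>ἀν"
for position, entry in enumerate(parse_mwgreek(row), start=1):
    # 'position' corresponds to the n in <gk>n</gk>.
    print(position, entry)
```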

The transcoding of beta code into Unicode for the Greek was accomplished by creating a data table, 'beta_greek.xml'. This table is processed in the normal way by the transcoding routine developed in Java by Ralph Bunker, and recoded in PHP. A full account can be seen by pressing the 'alphabet' button. The table was developed on the basis of the EpiDoc open source project, which was identified by Gregory Crane, director of the Perseus project, in a communication to Scharf. Specifically, the beta_greek.xml file was based on an examination of the files BetaCodeConvert.properties and UnicodeCConverter.properties. The first file maps beta codes (letters and diacritics) to property names (such as 'A = alpha', '*A = Alpha'), and the second file maps these property names to Unicode code points (such as 'alpha = \u03B1' and 'Alpha = \u0391'). Another online reference describes how diacritics are added to letters (they follow the letter). Putting these pieces together seems to give a fairly good representation of Greek in Unicode from the Beta transliteration.
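
The two-step mapping can be sketched in Python as follows. This is an illustration only, not the Java or PHP transcoder: the entries for 'A' and '*A' come from the examples above, while the entries for nu and the smooth breathing (including the property name 'psili') are assumptions added so that the words from the L=4 example can be converted. Diacritics are emitted as combining characters after the letter and then composed with Unicode NFC normalization.

```python
import unicodedata

# Step 1: beta code -> property name (entries beyond A and *A are assumptions).
BETA_TO_NAME = {"A": "alpha", "*A": "Alpha", "N": "nu", ")": "psili"}

# Step 2: property name -> Unicode character.
NAME_TO_CHAR = {
    "alpha": "\u03B1",
    "Alpha": "\u0391",
    "nu": "\u03BD",
    "psili": "\u0313",  # combining smooth breathing, follows its letter
}

def beta_to_unicode(beta):
    """Convert a beta code string using the two tables above."""
    out = []
    i = 0
    while i < len(beta):
        # Try the two-character key (such as "*A") before the one-character key.
        for width in (2, 1):
            key = beta[i:i + width]
            if len(key) == width and key in BETA_TO_NAME:
                out.append(NAME_TO_CHAR[BETA_TO_NAME[key]])
                i += width
                break
        else:
            raise ValueError(f"unmapped beta code at position {i}: {beta[i]!r}")
    # Compose letter + combining diacritic into a single code point where possible.
    return unicodedata.normalize("NFC", "".join(out))

print(beta_to_unicode("A)N"))
```

The sketch does not handle diacritics on capital letters, which would need extra logic.
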
I am aware of two blemishes introduced by this implementation:

A final sigma is represented in the same way as a non-final sigma, rather than by a special character.
Addendum: The display of Greek words in MW was changed so that the special final-sigma character is presented. However, this logic is implemented by a 'kluge', rather than internally by the transcoder.
A 'macron' is not represented. Teo coded this as '%26' (the URL-encoded form of '&'). This was not mentioned in the EpiDoc files that I examined.
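
One way a display-time final-sigma fix could work is sketched below in Python. This is an assumption about the general approach, not the actual 'kluge' used in the MW display: it treats a sigma as final whenever it is not followed by another word character.

```python
import re

SIGMA = "\u03C3"        # σ, medial sigma
FINAL_SIGMA = "\u03C2"  # ς, final sigma

def fix_final_sigma(text):
    """Replace a sigma that is not followed by a word character with final sigma."""
    return re.sub(SIGMA + r"(?!\w)", FINAL_SIGMA, text)

# The first sigma is followed by a letter and is left alone.
print(fix_final_sigma("σοφόσ"))
```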

Although the results as now constituted are acceptable, there are a few places where further attention could lead to improvements:

The few 'missing' Greek words need to be added by someone who knows both Greek and Beta Code.
The whole thing should be proofread by someone similarly knowledgeable. This might be done using the list of MW records with Greek words mentioned above.
The 'Perseus' codes should be altered where needed from their current default value, as there are many cases where doing a Perseus look-up on the default value yields no results. For instance, the beta code for the first Greek word of the first MW record containing a Greek word is 'A)' (alpha-lenis). Using 'A)' with the Perseus Word Study Tool yields "Sorry, no information was found...". However, there is information using 'A'. I added this Perseus key to the record of 'mwgreek'.
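
The relation between the default Perseus key and a corrected one can be sketched in Python. The sketch assumes, from the 'A)N' and 'A%29N' pair in the mwgreek example, that the default key is the percent-encoded beta code; the names `default_perseus_key`, `perseus_key`, and `OVERRIDES` are hypothetical.

```python
from urllib.parse import quote

def default_perseus_key(beta):
    """Percent-encode the beta code (assumed default form of the Perseus key)."""
    return quote(beta, safe="")

# Manually corrected keys, for words where the default yields no Perseus result.
OVERRIDES = {"A)": "A"}

def perseus_key(beta):
    """Return the corrected key if one is recorded, else the default."""
    return OVERRIDES.get(beta, default_perseus_key(beta))

print(perseus_key("A)N"))
print(perseus_key("A)"))
```

Keeping the corrections in a separate table leaves the default rule unchanged for the many words where it already works.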

Jim Funderburk

Last modified: January 9, 2010