Coding of Greek in MW

This document describes how references to Greek words have been encoded in the Cologne edition of the Monier-Williams Sanskrit-English dictionary. The result of this work may be seen in two places:

The starting point of this work was an earlier (2007) edition of the XML form of MW in which Greek was embedded in Beta Code. That encoding was done by Wendy Teo under the supervision of Peter Scharf and Malcolm Hyman. In Malten's initial coding, the position of text in the Greek alphabet had been indicated by a placeholder character, '$'; text in other alphabets, such as Arabic, had been similarly indicated. The task was thus reduced to examining all such positions against the scanned image of the page; for those instances identified as Greek, the placeholder character was replaced by a standard transliteration of the Greek. This transliteration was delimited by an identifiable character string, so that subsequent work could locate the instances of Greek transliteration.

The next step was to integrate this encoding of the earlier MW edition into the current database representing MW. I chose to replace the still-present placeholder character with an abstract element <gk>n</gk>. The content of this element is an index number within the MW record. For instance, <gk>1</gk> represents the 1st Greek word in a given record, <gk>2</gk> represents the 2nd Greek word in that record, and so forth.

Then, a separate database table, mwgreek, was created. It contains two fields:

  1. lnum: the unique MW record number (the contents of the <L> element)
  2. data: A text string with an idiosyncratic encoding of the sequence of Greek words found in the given MW record.
    If there are multiple Greek words for a given L-number, the occurrences are separated in the mwgreek data field by the string "<gk>". For each occurrence of Greek, three pieces of information are stored, separated by the string "<e>". In the order in which they appear in the example that follows, these are:
    1. The beta code of the Greek word
    2. An appropriate reference to the Greek word using a web service provided by the Perseus Digital Library, in URL-encoded form. For some words, this reference is the same as the beta code. It is known that for some words a different spelling, in beta code, is required to access the desired Perseus data.
    3. The Unicode representation of the Greek word, based on the beta code
    For instance, with L=4, there are two instances of Greek. The mwgreek table entry for this is coded as
    A)<e>A<e>ἀ<gk>A)N<e>A%29N<e>ἀν
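To make the layout of the data field concrete, here is a minimal Python sketch of how such a string might be split and how a <gk>n</gk> element in a record could be resolved against it. It is an illustration only, not the code used for the Cologne edition. The names GreekEntry, parse_mwgreek, reference_key and resolve_gk are hypothetical, and the field order (beta code, Perseus reference, Unicode) is an assumption read off the sample row for L=4.

```python
import re
from typing import List, NamedTuple
from urllib.parse import unquote


class GreekEntry(NamedTuple):
    """One Greek occurrence; field order is assumed from the L=4 sample row."""
    beta: str       # beta code as stored, e.g. "A)N"
    reference: str  # URL-encoded lookup key, e.g. "A%29N"
    unicode: str    # Unicode rendering of the word


def parse_mwgreek(data: str) -> List[GreekEntry]:
    """Split a data string on "<gk>" (occurrences) and "<e>" (fields)."""
    entries = []
    for chunk in data.split("<gk>"):
        parts = chunk.split("<e>")
        if len(parts) != 3:
            raise ValueError(f"expected 3 fields, got {len(parts)}: {chunk!r}")
        entries.append(GreekEntry(*parts))
    return entries


def reference_key(entry: GreekEntry) -> str:
    """Undo the percent-encoding of the reference field ("%29" is ")")."""
    return unquote(entry.reference)


def resolve_gk(record_text: str, entries: List[GreekEntry]) -> str:
    """Replace each <gk>n</gk> with the Unicode form of the n-th entry."""
    def substitute(match):
        index = int(match.group(1))  # indices in the record are 1-based
        if not 1 <= index <= len(entries):
            raise IndexError(f"no Greek entry number {index}")
        return entries[index - 1].unicode
    return re.sub(r"<gk>(\d+)</gk>", substitute, record_text)


if __name__ == "__main__":
    row = "A)<e>A<e>ἀ<gk>A)N<e>A%29N<e>ἀν"
    entries = parse_mwgreek(row)
    for number, entry in enumerate(entries, start=1):
        print(number, entry.beta, reference_key(entry), entry.unicode)
    print(resolve_gk("compare <gk>1</gk> and <gk>2</gk>", entries))
```

The parser raises an error on any occurrence that does not have exactly three fields, which seems a reasonable way to surface malformed rows early.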

The transcoding of beta code into Unicode for the Greek was accomplished by creating a data table, 'beta_greek.xml'. This table is processed in the normal way by the transcoding routine developed in Java by Ralph Bunker and recoded in PHP. A full account can be seen by pressing the 'alphabet' button. The table was developed on the basis of the EpiDoc open source project, which was identified by Gregory Crane, director of the Perseus project, in a communication to Scharf. Specifically, the beta_greek.xml file was based on an examination of the files BetaCodeConvert.properties and UnicodeCConverter.properties. The first file maps beta codes (letters and diacritics) to property names (such as 'A = alpha', '*A = Alpha'), and the second file maps these property names to Unicode code points (such as 'alpha = \u03B1' and 'Alpha = \u0391'). Another online reference describes how diacritics are added to letters (they follow the letter). Putting these pieces together seems to give a fairly good representation of Greek in Unicode from the Beta transliteration.
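The two-stage mapping described above (beta code to property name, property name to code point) can be sketched in a few lines of Python. This is a simplified illustration, not the contents of beta_greek.xml or of the EpiDoc files. The tables cover only a handful of symbols, and the property names for diacritics ('psili', 'dasia', 'oxia') are assumptions. Composing letter and diacritic through combining characters followed by NFC normalization is also a choice made for this sketch. The actual table may list precomposed characters directly.

```python
import unicodedata

# Stage 1: beta code symbol -> property name (tiny illustrative subset).
BETA_TO_NAME = {
    "A": "alpha", "*A": "Alpha",
    "N": "nu", "*N": "Nu",
    ")": "psili",   # smooth breathing (name assumed)
    "(": "dasia",   # rough breathing (name assumed)
    "/": "oxia",    # acute accent (name assumed)
}

# Stage 2: property name -> Unicode character.
NAME_TO_UNICODE = {
    "alpha": "\u03B1", "Alpha": "\u0391",
    "nu": "\u03BD", "Nu": "\u039D",
    "psili": "\u0313",  # COMBINING COMMA ABOVE
    "dasia": "\u0314",  # COMBINING REVERSED COMMA ABOVE
    "oxia": "\u0301",   # COMBINING ACUTE ACCENT
}


def beta_to_unicode(beta: str) -> str:
    """Transcode a beta code string using the two tables above.

    Diacritics follow their letter in the input, so each one is emitted
    as a combining character right after the letter; NFC normalization
    then merges the pair into a precomposed character where one exists.
    """
    pieces = []
    i = 0
    while i < len(beta):
        # An asterisk marks the following letter as a capital.
        token = beta[i:i + 2] if beta[i] == "*" else beta[i]
        if token not in BETA_TO_NAME:
            raise ValueError(f"unsupported beta code symbol: {token!r}")
        pieces.append(NAME_TO_UNICODE[BETA_TO_NAME[token]])
        i += len(token)
    return unicodedata.normalize("NFC", "".join(pieces))


if __name__ == "__main__":
    for word in ("A)", "A)N", "*A"):
        print(word, beta_to_unicode(word))
```

Keeping the two stages as separate tables mirrors the pair of property files, so either side could be extended without touching the other.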

I am aware of two blemishes introduced by this implementation:

Although the results as now constituted are acceptable, there are a few places where further attention could lead to improvements:

Jim Funderburk
Last modified: January 9, 2010