Marking Monier:
Current state of digitized Monier-Williams Dictionary
Jim Funderburk, Honesdale, Pennsylvania
Thomas Malten, University of Cologne
(May, 2008)
IITS web site
These remarks were prepared for a presentation at the
Second International Sanskrit Computational Linguistics Symposium
.
This document, with its associated references, describes work on digitization of Sanskrit dictionaries. The work to provide an XML encoding of the Monier Williams Sanskrit-English Dictionary was done during the last two years in a collaboration between the Institute of Indology and Tamil Studies (IITS) of the University of Cologne and Brown University's Sanskrit Department. Jim Funderburk worked on the XML encoding of MW with the extensive collaboration and guidance of Thomas Malten at IITS and Peter Scharf at Brown. Malcolm Hyman, Susan Rosenfield, and Ramaswamy Chandrashekar have also contributed to this work.
History
Malten's 1997 review of the Cologne Digital Sanskrit Lexicon project
(CDSL) provides a succinct description of an ambitious project:
The Cologne Digital Sanskrit Lexicon (CDSL) project undertakes to digitize and merge
the major bilingual Sanskrit dictionaries compiled in the 19th century. Its aim is to
provide a basic lexical corpus to provide an easy access to all available meanings of
Sanskrit words and to allow the creation of a number of computer programs that will
help to analyze Sanskrit texts.
In the first stage Monier-William's Sanskrit-English dictionary (MW) has been digitized
to be followed at a second stage by three other dictionaries (Cap, PW2 and Sch). All
these will be structured and unified to allow access to the meanings as developed by the
different lexicographers.
In a 2005 document pertaining to the NSF funded project 'International Digital Sanskrit Library Integration', Scharf describes the role such digital lexica as those of Malten might play in a system of textual analysis:
In order to analyze forms in Sanskrit texts a parser must be combined with a database
of lexical stems. The lexical sources described above in
section III should be sufficient for producing the lexical component of a basic
morphological analyzer/generator for Sanskrit.
With a completed lexical database and morphological generator, it is possible to
produce a full-form lexicon of Sanskrit, which maps every surface form onto a tuple (L,
M), where L is a lexical base and M is a set of morphosyntactic features. Morphosyntactic
features are indicated in accordance with the morphological tagging scheme published by
Scharf.
Phases of the coding of Monier Williams Sanskrit-English dictionary
For this discussion, the coding of the dictionary may be thought of as occurring in four phases: initial digitization, refinement into MONIER.ALL, conversion of MONIER.ALL into an equivalent XML form (MONIERhBU), correction and refinement into 'mwtab'.
initial digitization
The initial digitization of MW was accomplished by Malten and his
staff in Azhivaikkal in Thanjavur district, Tamilnadu, South India, and is described in the CDSL document. An
image shows a comparison between the
printed page (page 288, column 1) and an early digitized form of MW. There is a version of the entire dictionary coded in a manner very similar to that seen in the sample image. It was used in a PC dictionary lookup program designed by Claude Setzer about 2001, and is in the IITS download archive 'MonW2001'.
MONIER.ALL
Refinement by Malten of the initial digitization of MW resulted in a form that may be referred to by its file name 'MONIER.ALL' (see IITS download archive) This form is the starting point of the recent work. It is the basis of the MW display at the web site
Sanskrit, Tamil and Pahlavi Dictionaries and available for download via the IITS web site. There are many extended ascii codes used intentionally in this encoding, which when viewed with the appropriate settings in some text viewers make the entries fairly easy to read; for comparision with the above image, one may examine the coding of 'kuJjara' in MONIER.ALL.
MONIERhBU.txt
Several working principles guided the transformation from the markup present in MONIER.ALL to that present in MONIERhBU.
- Retain equivalence between MONIER.ALL and MONIERhBU. This was accomplished by EMACS lisp programs converting from one form to the other and back, and comparing the original with the converted and reconverted form.
- Have the target markup conform to XML coding standards. In particular, the markup should be well-formed XML. In some cases this was simple to accomplish, such as putting </H3> at the end of a line that started with <H3>. In some cases this was difficult, such as converting the extended ascii characters representing '<' and '>' to '<c>' and '</c>' and similarly for parentheses and brackets, because the result was not well-formed XML. For such situations some programmatic surgery on MONIER.ALL was required. So, primarily for this reason, the first design goal had to be relaxed.
- Implement corrections to a backlog of known typographical mistakes in MONIER.ALL. For this purpose, and the more general purpose of having a systematic procedure for implementing and documenting other desired changes, a simple set of update programs based on 'change' files was developed.
- Represent Sanskrit with the SLP1 transliteration, where MONIER.ALL used the Harvard-Kyoto transliteration. The choice of SLP was useful for compatibility with some Devanagari display software developed by Malcolm Hyman and used on Scharf's web site. Conversion between SLP1 and HK as well as other transliterations such as ITRANS is relatively easy to program (see software below).
- Develop a web-based display of the marked data.
Current markup overview
The current markup is maintained as a mySQL table, 'mwtab', at the IITS web site. It extends the markup of MONIERhBU in several ways. In all of this work, it was valuable to have the technical ability to compare the scanned image of an entry in a dictionary page with the coding of that entry.
- Various kinds of errors were visible in MONIERhBU.
- data entry errors : Such errors may be defined as instances where the digitized text differs 'materially' from the original, such as when a word, Sanskrit or English or otherwise, is misspelled. Some such misspellings were found by comparison to standard word lists, such as of English or of Sanskrit (the MW dictionary list of entries). Many have been found by close readers of the web display, either we who work on the dictionary or users of the dictionary who communicate some error they notice. A web update program allows small changes to be made interactively by authorized users.
- factual errors Occasionally there are found instances where the dictionary itself appears in error. We have chosen to implement these changes, but have tried to maintain separate files noting such changes.
- transliteration errors These are also a type of spelling error, and may occur in text marked as Sanskrit, and in Anglicized Sanskrit text not (previously) marked as Sanskrit.
- coding errors The most prominent category of such errors involves the 'scope' of an XML element. For example the <ls> element encloses the text of a reference to a literary source; when such a reference includes chapter and verse from a particular work, all of this, and no more, should be enclosed by the marking element. Here the sequence of text is not typographically distinguished from other text in any way universally applicable throughout the text, but only in certain micro-contexts. Other examples include the scope of the <lex> element.
- Various incomplete markup in MONIER.ALL was still present in MONIERhBU.
- unsplit lines For nominal words, the decision was made to have a record correspond to a sense of the word; senses are separated typographically in the dictionary by a semicolon. Instances of records incompletely split had been identified in MONIER.ALL by certain coding conventions. Completing the splits was carried out in 'mwtab'.
- Monier's 4th line Monier delineates four 'lines' of words; the head words of the third and fourth lines, compounds and sub-compounds, are expressed incompletely in the dictionary. Whereas MONIER.ALL had resolved the headwords for the third line, this resolution for the fourth line was known to be incomplete. The coding of mwtab completed the list of head words to include this fourth line of words.
- Anglicized Sanskrit Many Sanskrit words, predominantly proper names, appear in the text as capitalized words in an 'indological' font. MONIER.ALL represents such words in a specially devised transliteration, using upper and lower case letters with numbers used to distinguish vowel length, and consonant type. The presence of a digit in a capitalized word identifies it as likely Anglicized Sanskrit, but fails to identify as Sanskrit words such as 'Arjuna' (which can also be part of a plant name) or 'Allaha1ba1d' which is not a Sanskrit word. Since it was felt useful to identify those words which were Sanskrit (so a user could know this fact and look them up in the dictionary, too), they have been marked in a certain way in mwtab. Similarly, botanical and biological scientific names have been marked. Both such markings involve 10s of thousands of instances.
- Due to its importance for the development of a full-form lexicon, the coding of inflectional information has been notably refined. This refinement has proceeded to a usable stage for substantives and indeclineables, and to an intermediate stage for verbs. The coding conventions are describe in the detailed markup reference under the '<lex>' and '<vlex>' elements. For nominals and indeclineables, a lexical grammar XML file has been prepared from the dictionary, and used for inflection generation.
Current markup reference
A detailed discussion of the current markup (as of 2008) of mwtab is available for reference.
The mwtags document contains the latest markup revisions.
Coding yet to be done
While it is felt that the current state of markup of the Monier Williams Sanskrit-English dictionary is adequate for many purposes, there are several areas which have been identified as candidates for useful improvements.
- Supplement A supplement of 'Additions and Corrections' of about 26 pages appears in the dictionary. This has been coded (as in MONIER.ALL) and some work done relating entries between it and the main part of the dictionary. This could be coded in a way compatible to mwtab.
- transliteration of Greek Many etymological relations between Sanskrit and Greek are mentioned throughout the dictionary. Currently, the Greek has not be transliterated and appears as a '$' symbol. A separate file with a transliteration of the Greek exists which could be incorporated into mwtab.
- botanical terms Over 15000 words are tagged as parts of scientific names of plants, and another 500 of animals. Linking these words to a currently accepted authoritative taxonomy database would be useful.
- verbal forms There are abundant examples of verbal inflections presented in the dictionary for roots. With the current coding, these examples are inaccessible; but there is enough apparent regularity to the examples to suggest the feasibility of a useful coding.
- literary sources Currently, one can get easy cross references from an abbreviated literary source in the text to the name of the work or author. However, it would be nice to have links to specific text from a work. This task probably requires, in general, as yet undigitized information. However, some sub-tasks, such as links to the Paninian references, might be doable.
Other Cologne digitizations
Several other dictionaries have been digitized. The digitized Cappeller Sanskrit-English dictionary has been integrated with MONIER.ALL in the Sanskrit, Tamil and Pahlavi Dictionaries web site. A preliminary XML coding of the Apte English-Sanskrit dictionary has been done and is available for word look-up via the IITS web site.
The list below provides links to samples of the various digitizations and accompanying scanned images. A goal of the CDSL project is to convert these into forms compatible with mwtab, so all dictionaries are available in a consistent encoding.
- Apte's Practical Sanskrit-English Dictionary, 1957 (protected under copyright)
(digitization, scan)
- Apte English-Sanskrit Dictionary, 1920
(digitization,
scan,
experimental TEI encoding
)
- H.H. Wilson Sanskrit and English Dictionary, 1832.
(digitization, scan)
- Boehtlingk Sanskrit-German Dictionary, 1879
(digitization, scan)
- Boehtlingk & Roth Sanskrit-German Dictionary, 1855
(digitization, scan)
- Cappeller's Sanskrit-English Dictionary, 1891
(digitization, scan)
- Cappeller Sanskrit Wörterbuch, 1887
(digitization, scan)
- Grassman Rig-Veda Dictionary, 1873
(digitization,
scan)
- Pune Dictionary (protected under copyright),1976-current
(digitization,
scan)
- Schmidt's additions to Boehtlingk Sanskrit-German Dictionary, 1929
(digitization,
scan)
- Burnouf Sanskrit-French Dictionary, 1866
(digitization1,
digitization2,
scan)
- Stchoupak, Nitti, Renou Sanskrit-French Dictionary, 1932 (protected under copyright),
(digitization,
scan)
- MacDonell Sanskrit-English Dictionary, 1893
(digitization,
scan)
- Borooah English-Sanskrit Dictionary, 1877
(digitization,
scan)
Cologne scanned editions
Eight Sanskrit dictionaries are currently available via the IITS web site in a form we refer to as 'scanned images'. This just means that the individual pages of the dictionaries have been scanned into images named in a certain consistent manner, and indexed by the first word on a page. For a user with fast internet access, a digitized edition provides access similar to that provided by a physical book. For the MW Sanskrit-English dictionary and the Apte English-Sanskrit dictionary, the developed web displays have links between individual words and the scanned editions; this link has proved so useful for MW that it is viewed as a desideratum for future displays of other digitized dictionaries.
Software tools and downloads
Several versions of Monier Williams Sanskrit-English dictionary are available in the download section of the IITS web site. If there is a need by users for parts of the software we have developed for maintaining, displaying, and otherwise using the digitized lexica of the IITS web site, we can make such software available (contact Thomas Malten).