Marking Monier:

Current state of digitized Monier-Williams Dictionary

Jim Funderburk, Honesdale, Pennsylvania
Thomas Malten, University of Cologne
(May, 2008)

IITS web site

These remarks were prepared for a presentation at the Second International Sanskrit Computational Linguistics Symposium.

This document, with its associated references, describes work on the digitization of Sanskrit dictionaries. The work to provide an XML encoding of the Monier-Williams Sanskrit-English Dictionary (MW) was carried out over the last two years in a collaboration between the Institute of Indology and Tamil Studies (IITS) of the University of Cologne and Brown University's Sanskrit Department. Jim Funderburk worked on the XML encoding of MW with the extensive collaboration and guidance of Thomas Malten at IITS and Peter Scharf at Brown. Malcolm Hyman, Susan Rosenfield, and Ramaswamy Chandrashekar have also contributed to this work.

History

Malten's 1997 review of the Cologne Digital Sanskrit Lexicon project (CDSL) provides a succinct description of an ambitious project:

The Cologne Digital Sanskrit Lexicon (CDSL) project undertakes to digitize and merge the major bilingual Sanskrit dictionaries compiled in the 19th century. Its aim is to provide a basic lexical corpus to provide an easy access to all available meanings of Sanskrit words and to allow the creation of a number of computer programs that will help to analyze Sanskrit texts.

In the first stage Monier-William's Sanskrit-English dictionary (MW) has been digitized to be followed at a second stage by three other dictionaries (Cap, PW2 and Sch). All these will be structured and unified to allow access to the meanings as developed by the different lexicographers.

In a 2005 document pertaining to the NSF funded project 'International Digital Sanskrit Library Integration', Scharf describes the role such digital lexica as those of Malten might play in a system of textual analysis:

In order to analyze forms in Sanskrit texts a parser must be combined with a database of lexical stems. The lexical sources described above in section III should be sufficient for producing the lexical component of a basic morphological analyzer/generator for Sanskrit.

With a completed lexical database and morphological generator, it is possible to produce a full-form lexicon of Sanskrit, which maps every surface form onto a tuple (L, M), where L is a lexical base and M is a set of morphosyntactic features. Morphosyntactic features are indicated in accordance with the morphological tagging scheme published by Scharf.
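The full-form lexicon described in the passage above can be pictured with a small sketch. It is not taken from the project's software: the generated rows, the SLP1-style spellings, and the feature labels are hypothetical placeholders (in particular, the labels are not Scharf's published tagging scheme), and storing a list of (L, M) pairs per surface form is an assumption made here so that ambiguous forms keep all their analyses.

```python
from collections import defaultdict

# Hypothetical output of a morphological generator:
# (surface form, lexical base L, morphosyntactic features M).
# The feature labels are illustrative only.
GENERATED = [
    ("devaH", "deva", frozenset({"noun", "masc", "nom", "sg"})),
    ("devam", "deva", frozenset({"noun", "masc", "acc", "sg"})),
    ("devAn", "deva", frozenset({"noun", "masc", "acc", "pl"})),
    ("vanam", "vana", frozenset({"noun", "neut", "nom", "sg"})),
    ("vanam", "vana", frozenset({"noun", "neut", "acc", "sg"})),
]


def build_full_form_lexicon(rows):
    """Group generated rows by surface form, keeping every (L, M) analysis."""
    lexicon = defaultdict(list)
    for surface, base, features in rows:
        lexicon[surface].append((base, features))
    return dict(lexicon)


def analyze(lexicon, surface):
    """Return all (L, M) analyses of a surface form, or an empty list."""
    return lexicon.get(surface, [])


LEXICON = build_full_form_lexicon(GENERATED)

for base, features in analyze(LEXICON, "vanam"):
    print(base, sorted(features))
```

Looking up `vanam` returns two analyses (nominative and accusative singular), which is why the sketch maps a form to a list rather than to a single tuple.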

Phases of the coding of Monier Williams Sanskrit-English dictionary

For this discussion, the coding of the dictionary may be thought of as occurring in four phases: initial digitization, refinement into MONIER.ALL, conversion of MONIER.ALL into an equivalent XML form (MONIERhBU), and correction and refinement into 'mwtab'.

Initial digitization

The initial digitization of MW was accomplished by Malten and his staff in Azhivaikkal in Thanjavur district, Tamilnadu, South India, and is described in the CDSL document. An image shows a comparison between the printed page (page 288, column 1) and an early digitized form of MW. There is a version of the entire dictionary coded in a manner very similar to that seen in the sample image. It was used in a PC dictionary lookup program designed by Claude Setzer about 2001, and is in the IITS download archive 'MonW2001'.

MONIER.ALL

Refinement by Malten of the initial digitization of MW resulted in a form that may be referred to by its file name 'MONIER.ALL' (see IITS download archive). This form is the starting point of the recent work. It is the basis of the MW display at the web site Sanskrit, Tamil and Pahlavi Dictionaries and is available for download via the IITS web site. Many extended ASCII codes are used intentionally in this encoding; when viewed with the appropriate settings in some text viewers, they make the entries fairly easy to read. For comparison with the above image, one may examine the coding of 'kuJjara' in MONIER.ALL.

MONIERhBU.txt

Several working principles guided the transformation from the markup present in MONIER.ALL to that present in MONIERhBU.

Current markup overview

The current markup is maintained as a MySQL table, 'mwtab', at the IITS web site. It extends the markup of MONIERhBU in several ways. In all of this work, it was valuable to have the technical ability to compare the scanned image of an entry in a dictionary page with the coding of that entry.

Current markup reference

A detailed discussion of the current markup (as of 2008) of mwtab is available for reference.

The mwtags document contains the latest markup revisions.

Coding yet to be done

While the current state of markup of the Monier-Williams Sanskrit-English dictionary is felt to be adequate for many purposes, several areas have been identified as candidates for useful improvement.

Other Cologne digitizations

Several other dictionaries have been digitized. The digitized Cappeller Sanskrit-English dictionary has been integrated with MONIER.ALL in the Sanskrit, Tamil and Pahlavi Dictionaries web site. A preliminary XML coding of the Apte English-Sanskrit dictionary has been done and is available for word look-up via the IITS web site. The list below provides links to samples of the various digitizations and accompanying scanned images. A goal of the CDSL project is to convert these into forms compatible with mwtab, so all dictionaries are available in a consistent encoding.

Cologne scanned editions

Eight Sanskrit dictionaries are currently available via the IITS web site in a form we refer to as 'scanned images'. This means that the individual pages of the dictionaries have been scanned into images named in a consistent manner and indexed by the first word on each page. For a user with fast internet access, such a scanned edition provides access similar to that provided by a physical book. For the MW Sanskrit-English dictionary and the Apte English-Sanskrit dictionary, the developed web displays have links between individual words and the scanned editions; this link has proved so useful for MW that it is viewed as a desideratum for future displays of other digitized dictionaries.
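Indexing pages by their first word makes it possible to find the page for any headword with a binary search. The following is a minimal sketch of that idea, not the code used at the IITS web site: the page numbers and index words are invented, the headwords are assumed to be in SLP1 transliteration, and the collation string `ORDER` is a simplified assumption about Sanskrit alphabetical order (the dictionaries treat anusvara and visarga with more care).

```python
import bisect

# Assumed simplified Sanskrit alphabetical order for SLP1 characters.
ORDER = "aAiIuUfFxXeEoOMHkKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"
RANK = {ch: i for i, ch in enumerate(ORDER)}


def sort_key(word):
    """Map an SLP1 word to a tuple of ranks so tuples compare in ORDER."""
    return tuple(RANK[ch] for ch in word)


# Hypothetical index: (page number, first headword on that page),
# listed in dictionary order.
PAGE_INDEX = [(1, "a"), (2, "ka"), (3, "kuw"), (4, "gaja"), (5, "deva")]
PAGE_KEYS = [sort_key(first) for _, first in PAGE_INDEX]


def page_for(headword):
    """Return the page whose first word is the last one <= headword."""
    pos = bisect.bisect_right(PAGE_KEYS, sort_key(headword)) - 1
    if pos < 0:
        return None  # headword sorts before the first indexed page
    return PAGE_INDEX[pos][0]


print(page_for("kuYjara"))
print(page_for("kumAra"))
```

A custom sort key is needed because plain ASCII comparison would place 'gaja' before 'ka', whereas in Sanskrit order k precedes g.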

Software tools and downloads

Several versions of the Monier-Williams Sanskrit-English dictionary are available in the download section of the IITS web site. If users need parts of the software we have developed for maintaining, displaying, and otherwise using the digitized lexica of the IITS web site, we can make such software available (contact Thomas Malten).