Current markup of Monier Williams Sanskrit-English Dictionary

This document describes the current markup of the Monier Williams Sanskrit-English Dictionary at the University of Cologne, as of April 2008.

Outline

General Structure
Structure common to each record
  root elements <H1> <H1A> <H1B> <H2> <H2A> <H2B> <H3> <H3A> <H3B> <H4> <H4A> <H4B> <HPW>
  root children <h> <body> <tail>
The head section
  <hc3>   <key1>   <hc1>
  <key2>    ( special chars   <TWOWORDS/>   <sr/> <sr1/>   <srs/> <srs1/>   <shortlong/>   <root> )
  <hom>
The body section
  marking special characters ( <b> <b1>   <p> <p1>   <c> <c1> <c2><c3>   <quote>   <sr/> <sr1/>   <fcom/>   <abE/>   <srs/> <srs1/>   <shortlong/> <shc/>   <auml/> <euml/> <ouml/> <uuml/>   <etc/> <etc1/> <etcetc/>   <amp/>   <eq/>   <fs/>   <msc/>   <ccom/> )
  marking special text ( <ab>   <etym>   <s>   <as0> <asp0> <as1>   <ns>   <bot> <bio>   <root> and <root/>   <to/>   <ls>   <lex>   <vlex>   <hom>   <pc> <pcol>   <phw>   <ORSL> )
  experimental marking ( <usage> <idiom> <sense> <ellipsis/>   <loan/> )
  marking references to other records ( <cf/>   <qv/>   <see/>   <pL> <dL>   <AND/> <OR/> )
The tail section ( <L>   <pc>   <MW>   <mul/>   <mat/>   <mscverb/> )


General Structure

The dictionary appears as a sequence of 280081 records. Each record is coded as a text string (using the standard ASCII 7-bit character set) which has a well-formed XML structure. A Perl program (mwtab_wf.pl) is used periodically to check that each record is well-formed. Dictionary entries for verbs correspond to records. Dictionary entries for non-verbs correspond to record sequences, with each record corresponding to a sense of the entry. The ordering of records corresponds to the dictionary ordering and is maintained by an element <L> whose content is a numerical identifier.
From an examination of global statistics on the frequency of element occurrence, one notes that 103 element-attribute variations occur; there are on average about 17 elements per record.
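The well-formedness check and the element statistics described above can be sketched in Python. This is only an illustration, not the mwtab_wf.pl program itself: the function names and the two-record sample list are assumptions, and the second sample record is deliberately broken.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def is_well_formed(record: str) -> bool:
    """Return True if the record string parses as well-formed XML."""
    try:
        ET.fromstring(record)
        return True
    except ET.ParseError:
        return False

def element_stats(records):
    """Count each (element name, attribute names) variation over all records."""
    counts = Counter()
    for rec in records:
        for el in ET.fromstring(rec).iter():
            counts[(el.tag, tuple(sorted(el.attrib)))] += 1
    return counts

# Sample input: the first record of the dictionary, plus a hypothetical
# record whose <c> element is never closed.
records = [
    "<H1><h><hc3>000</hc3><key1>a</key1><hc1>1</hc1><key2>a</key2><hom>1</hom></h>"
    "<body> <c>the_first_letter_of_the_alphabet</c> </body>"
    "<tail><pc>Page1,1</pc> <L>1</L></tail></H1>",
    "<H1><h><key1>a</key1></h><body> <c>unclosed </body></H1>",
]

good = [r for r in records if is_well_formed(r)]
stats = element_stats(good)
print(len(good), "well-formed of", len(records))
print(sum(stats.values()) / len(good), "elements per record on average")
```

Applied to the full sequence of records, the length of the counter would give the number of element-attribute variations and the average would give the elements-per-record figure.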

There is no DTD against which to determine the validity of the XML structure. However, there are regularities to the markup. Most of the following remarks are aimed at describing these regularities.

Structure common to each record

Each record has an overall form using one of 13 'H' root elements and a set of 9 other elements.
The 13 root elements are

<H1> <H1A> <H1B> <H2> <H2A> <H2B> <H3> <H3A> <H3B> <H4> <H4A> <H4B> <HPW> .
The 9 common elements appearing in each record are
root children: <h> <body> <tail>
children of <h> : <hc1> <hc3> <key1> <key2>
children of <tail> : <L> <pc>

As a basis of discussion, consider the first record of the coded dictionary:
<H1><h><hc3>000</hc3><key1>a</key1><hc1>1</hc1><key2>a</key2><hom>1</hom></h><body> <c>the_first_letter_of_the_alphabet</c> </body><tail><pc>Page1,1</pc> <L>1</L></tail></H1>
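As a rough illustration of how the root element and the 9 common elements fit together, the following Python sketch reads them from the first record. The function name common_fields is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# The first record of the coded dictionary, as quoted above.
RECORD = (
    "<H1><h><hc3>000</hc3><key1>a</key1><hc1>1</hc1><key2>a</key2><hom>1</hom></h>"
    "<body> <c>the_first_letter_of_the_alphabet</c> </body>"
    "<tail><pc>Page1,1</pc> <L>1</L></tail></H1>"
)

def common_fields(record: str) -> dict:
    """Collect the root element name and the text of the common leaf elements."""
    root = ET.fromstring(record)
    fields = {"root": root.tag}
    for path in ("h/hc3", "h/key1", "h/hc1", "h/key2", "tail/L", "tail/pc"):
        # The key is the element name, e.g. "h/key1" -> "key1".
        fields[path.split("/")[1]] = root.findtext(path)
    return fields

fields = common_fields(RECORD)
print(fields)
```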

root elements

One of the 'H' elements forms the root element. There are four basic H elements, named H1,H2,H3,H4; these are intended to correspond to the 'four mutually correlated lines of Sanskrit words' described at Page xiv: Section II of the dictionary preface. These were coded in MONIER.ALL based upon the delineating typographic features of the text.
For non-verbs with multiple senses, these four basic elements are further refined by a suffix 'A' or 'B'. 'A' means that the specific lexical information (e.g., masculine, feminine, neuter) is the same as that of the preceding 'parent' entry; 'B' means that the specific lexical information differs from that of the 'parent'. Currently there is no pointer from an 'A' or 'B' child to its parent; the relation is implicit in the ordering of the records, but an explicit coding of this relation should probably be done.
For instance, the second record of the dictionary provides an alternate sense to the first record shown above, and is coded with root element 'H1A':
<H1A><h><hc3>000</hc3><key1>a</key1><hc1>1</hc1><key2>a</key2><hom>1</hom></h><body> <c>the_first_short_vowel_inherent_in_consonants.</c> </body><tail><pc>Page1,1</pc> <MW>000001</MW> <L>1.1</L></tail></H1A>
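One way the implicit parent relation might be made explicit is sketched below in Python. It rests on an assumption not stated in the markup: that the parent of an 'A' or 'B' record is the nearest preceding record whose root element is the same name without the suffix (for example, H1 for H1A). The function name link_children is hypothetical.

```python
import xml.etree.ElementTree as ET

def link_children(records):
    """Map the <L> value of each 'A'/'B' record to the <L> value of its parent.

    Assumption: the parent is the most recent record whose root element is
    the unsuffixed form of the child's root element.
    """
    parents = {}
    last_base = {}  # root element name -> <L> of the latest such record
    for rec in records:
        root = ET.fromstring(rec)
        ident = root.findtext("tail/L")
        tag = root.tag
        if tag != "HPW" and tag[-1] in "AB":
            parents[ident] = last_base.get(tag[:-1])
        else:
            last_base[tag] = ident
    return parents

# The first two records of the dictionary, as quoted above.
records = [
    "<H1><h><hc3>000</hc3><key1>a</key1><hc1>1</hc1><key2>a</key2><hom>1</hom></h>"
    "<body> <c>the_first_letter_of_the_alphabet</c> </body>"
    "<tail><pc>Page1,1</pc> <L>1</L></tail></H1>",
    "<H1A><h><hc3>000</hc3><key1>a</key1><hc1>1</hc1><key2>a</key2><hom>1</hom></h>"
    "<body> <c>the_first_short_vowel_inherent_in_consonants.</c> </body>"
    "<tail><pc>Page1,1</pc> <MW>000001</MW> <L>1.1</L></tail></H1A>",
]

parents = link_children(records)
print(parents)
```

A child with no preceding base record maps to None, which would flag a break in the expected ordering.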

Currently, the different 'parts' of a record for a verb are not separated into separate records, but are distinguished by an empty element <msc/> (coded in MONIER.ALL as '{;}') because the dictionary generally uses a semicolon for this separation (but semicolons also appear elsewhere, so this is a special semicolon).
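A minimal Python sketch of how the parts of a verb record might be separated at the <msc/> markers follows. The sample record is hypothetical (it is not taken from the dictionary), and the sketch assumes that <msc/> occurs directly inside <body> rather than nested within another element; if it were nested, the resulting fragments would not be well-formed on their own.

```python
import re

def split_verb_parts(record: str) -> list:
    """Split the content of <body> at each <msc/> marker."""
    match = re.search(r"<body>(.*)</body>", record, re.S)
    if match is None:
        return []
    return [part.strip() for part in match.group(1).split("<msc/>")]

# Hypothetical verb record with two parts; the element content is invented.
SAMPLE = (
    "<H1><h><hc3>000</hc3><key1>x</key1><hc1>1</hc1><key2>x</key2></h>"
    "<body> <c>first_part</c> <msc/> <c>second_part</c> </body>"
    "<tail><pc>Page1,1</pc> <L>0</L></tail></H1>"
)

parts = split_verb_parts(SAMPLE)
print(parts)
```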
There is a fifth form, 'HPW', which the root element name may take. This was devised to allow separate records for 'parenthetical head words', discussed further below.
One could imagine a different coding of the information of the 13 'H' elements in which there would be a single 'H' element with an attribute having 13 different values; e.g., <H type='1'> instead of <H1>.
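The alternative coding could be produced mechanically. The Python sketch below is one possible conversion, not part of the current markup; the function name to_single_h and the choice of attribute values ('1', '1A', ..., 'PW') are assumptions.

```python
import xml.etree.ElementTree as ET

ROOT_TAGS = {
    "H1", "H1A", "H1B", "H2", "H2A", "H2B",
    "H3", "H3A", "H3B", "H4", "H4A", "H4B", "HPW",
}

def to_single_h(record: str) -> str:
    """Rewrite the root element as <H type="..."> and return the new record."""
    root = ET.fromstring(record)
    if root.tag not in ROOT_TAGS:
        raise ValueError("unexpected root element: " + root.tag)
    root.set("type", root.tag[1:])  # e.g. "H1A" -> "1A", "HPW" -> "PW"
    root.tag = "H"
    return ET.tostring(root, encoding="unicode")

# The second record of the dictionary, as quoted above.
RECORD = (
    "<H1A><h><hc3>000</hc3><key1>a</key1><hc1>1</hc1><key2>a</key2><hom>1</hom></h>"
    "<body> <c>the_first_short_vowel_inherent_in_consonants.</c> </body>"
    "<tail><pc>Page1,1</pc> <MW>000001</MW> <L>1.1</L></tail></H1A>"
)

converted = to_single_h(RECORD)
print(converted)
```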

root children

The children of the root element are the <h> <body> <tail> elements, in that order. The <h> (or 'head') element contains information about the head word. The <body> contains the definitional material of the record. The <tail> contains record sequencing information and a pointer to the scanned image of the pages of the dictionary.

The head section

The <h> element contains the elements <hc3> <key1> <hc1> <key2> and optionally <hom> , in that order.
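In the absence of a DTD, the ordering rules for the root children and for the children of <h> can be checked with a short script. The following Python sketch is one way to do so; the function name check_shape and the wording of the messages are assumptions, and the second sample record is a hypothetical one with its root children out of order.

```python
import xml.etree.ElementTree as ET

HEAD_BASE = ["hc3", "key1", "hc1", "key2"]

def check_shape(record: str) -> list:
    """Return a list of problems found in the order of the fixed elements."""
    problems = []
    root = ET.fromstring(record)
    if [child.tag for child in root] != ["h", "body", "tail"]:
        problems.append("root children are not <h> <body> <tail> in that order")
    head = root.find("h")
    if head is not None:
        tags = [child.tag for child in head]
        # <hom> is optional, but must come last when present.
        if tags not in (HEAD_BASE, HEAD_BASE + ["hom"]):
            problems.append("unexpected children of <h>: " + " ".join(tags))
    return problems

GOOD = (
    "<H1><h><hc3>000</hc3><key1>a</key1><hc1>1</hc1><key2>a</key2><hom>1</hom></h>"
    "<body> <c>the_first_letter_of_the_alphabet</c> </body>"
    "<tail><pc>Page1,1</pc> <L>1</L></tail></H1>"
)
BAD = (
    "<H1><body></body>"
    "<h><hc3>000</hc3><key1>a</key1><hc1>1</hc1><key2>a</key2></h>"
    "<tail><pc>Page1,1</pc> <L>1</L></tail></H1>"
)

print(check_shape(GOOD))
print(check_shape(BAD))
```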

The body section

Except for the headword information of the <h> element and the sequencing information of the <tail> element, the dictionary text appears within the <body> element of the records. The multiplicity of forms of this text has thus far resisted efforts at fitting into a useful DTD, although one could describe a complex XML structure to which the markup would conform. In general, any element can be a child of any other within the <body>, and the elements can appear in any order.

marking special characters

Some elements mark individual characters.

marking special text

Some text is special in one way or another. This includes Sanskrit text, abbreviated text, textual representations of words in other languages, technical text, grammatical text, referential text. This section describes the various schemes devised to mark some of these.

experimental marking

The following are experimental markings; there is only one instance of each in the current markup of the dictionary.

marking references to other records

A small number of elements indicate references to other records or words.

The tail section

The <tail> element of every record contains the <L> and <pc> elements; some records also variously contain <MW> , <mul/> , <mat/> and <mscverb/> elements. The ordering of these elements varies.
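Because the ordering of the elements within <tail> varies, a reader of the records is better off looking them up by name than by position. A minimal Python sketch follows; the function name read_tail and the shape of the returned dictionary are assumptions.

```python
import xml.etree.ElementTree as ET

OPTIONAL_EMPTY = {"mul", "mat", "mscverb"}

def read_tail(record: str) -> dict:
    """Read the <tail> elements by name, regardless of their order."""
    tail = ET.fromstring(record).find("tail")
    return {
        "L": tail.findtext("L"),
        "pc": tail.findtext("pc"),
        "MW": tail.findtext("MW"),  # None when the record has no <MW>
        "flags": sorted(c.tag for c in tail if c.tag in OPTIONAL_EMPTY),
    }

# The second record of the dictionary, as quoted earlier.
RECORD = (
    "<H1A><h><hc3>000</hc3><key1>a</key1><hc1>1</hc1><key2>a</key2><hom>1</hom></h>"
    "<body> <c>the_first_short_vowel_inherent_in_consonants.</c> </body>"
    "<tail><pc>Page1,1</pc> <MW>000001</MW> <L>1.1</L></tail></H1A>"
)

tail_info = read_tail(RECORD)
print(tail_info)
```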