
The preTeX preprocessor

This file documents the purpose, usage, and technical details of preTeX, a package for preprocessing TeX documents to allow sophisticated typesetting based on natural-language rules (and particularly useful for typesetting Indian language documents written using an English transliteration).

This document applies to version 1.00.

1. Overview  What is preTeX?
2. Indian languages  Using preTeX to typeset Indian languages
3. More details on using preTeX in general  
4. Defining map files  How to control preTeX's behavior through its map files
5. How preTeX translates input to output  

A. What is a context-free grammar?  
B. Other packages that typeset Indian languages  
Concept Index  Index



1. Overview

PreTeX was originally designed to make it easier to typeset various Indian languages using a standard Roman alphabet, but it can have much broader application.

At its heart, preTeX is a simple preprocessor for TeX and LaTeX documents. It is not a typesetting program; it relies on TeX to actually lay words out on the page. preTeX is a document translator. It converts strings of letters and symbols into different strings of letters and symbols, according to a set of rules laid out in one or more map files. The map files define the conversion using what is called a "context-free grammar," (see section A. What is a context-free grammar?) which allows a lot of fairly smart translations.

This is particularly useful for typesetting Indian languages. All the written Indian languages are alphabetic, like English, with consonant and vowel symbols, but unlike English there are many sets of symbols that are written as a single symbol when they appear together. This happens to a lesser extent in typeset English: for instance, when the letters `f' and `i' appear in sequence, they are often typeset as a single glyph in which the letters run together, `fi'. This is called a ligature.

In most Indian languages, there are a vast number of ligatures and complicated rules for combining letters to form them. In general, in each syllable there are one or more consonants followed by a single vowel; all of these letters are typically written together as one or two symbols. We would like to be able to write Indian language text using English letters, so that we could write `kyaa', for instance, in Devanagari, and expect to see the glyph for the consonants `ky' followed by the glyph for the vowel `aa'. However, if the consonant `r' appears before the vowel, then an accent mark should be written beneath the symbol, unless the leading consonants were any of a number of special consonants that have their own glyph when they blend with an `r'. And so on.

It is possible to describe all of these complex rules using a context-free grammar, such that certain letters are converted to a TeX sequence to print a particular glyph when they appear alone, but to a different TeX sequence to print a different glyph when they appear in conjunction with certain other letters. In fact, that is exactly what the map files supplied with preTeX do.

Because each transliteration scheme is completely defined by a map file, it is possible--even easy--for the user to modify the transliteration behavior to suit his or her personal tastes. No programming skill is necessary; it requires only a bit of clearheaded intuition to understand exactly how the map file works.

Furthermore, it is easy to extend preTeX to handle additional languages, or even to adapt it to tasks which are not related to typesetting. For instance, the map file `dnmeter.map' supplied with preTeX scans a bit of Devanagari verse and identifies the short and long syllables--an important characteristic of the verse--according to rules based on the proximity of certain vowels to a certain number of consonants.



2. Indian languages

PreTeX was designed to typeset Indian languages from English transliterated text, by preprocessing the original, transliterated document to produce a new document that can then be processed directly by TeX to produce the desired output.

It is not the only package that works in this way; several other Indian language typesetting programs take this approach (see section B. Other packages that typeset Indian languages). Most of them are specific to one or two Indian languages. One notable exception is itrans, which, like preTeX, aims to handle almost all Indian languages through a common interface (see B.1 Avinash Chopde's itrans). PreTeX owes a lot to some of these other packages; in addition to the basic philosophy of design, it depends on various fonts and TeX macros that were originally built for them.

As currently shipped, preTeX can be used to typeset two Indian languages: Tamil and Devanagari.

2.1 General rules for typesetting Indian languages  
2.2 Tamil  Typesetting Tamil documents
2.3 Devanagari  Typesetting Devanagari documents



2.1 General rules for typesetting Indian languages

Each Indian language transliteration scheme is defined in a different map file, supplied with preTeX. To use a particular scheme, you'll need to include the appropriate \pretex command at the head of your document (see section 3.1 Referencing map files).

Subsequently, the text in your document that is Indian language text should be preceded by a keyword to indicate its language. For instance, if you include \pretex{tamil}, then text following a \tml keyword will be typeset in Tamil; if you include \pretex{devnag}, then text following a \dn keyword will be typeset in Devanagari. (The scoping rules for the keywords are actually a little more complicated than that. See section 3.2 Scoping rules for preTeX conversion.)



2.2 Tamil

Typesetting Tamil requires the wntamil font, a Metafont font designed at the University of Washington. See section B.2 University of Washington's wntamil.

To typeset in Tamil, you must include the sequence \pretex{tamil} at the beginning of your document (see section 3.1 Referencing map files). Subsequently, the keyword \tml may be used to mark text that should be typeset in Tamil.



2.3 Devanagari

Typesetting Devanagari script requires the Devanagari font originally developed for Frans Velthuis' Devanagari package. See section B.3 Frans Velthuis' Devanagari package.

To typeset in Devanagari, you must include the sequence \pretex{devnag} at the beginning of your document (see section 3.1 Referencing map files). Subsequently, the keyword \dn may be used to mark text that should be typeset in Devanagari.



3. More details on using preTeX in general

Although it is highly extensible through its use of user-defined map files, there are some behaviors that are specifically built into preTeX and cannot be changed without modification of the source code.

3.1 Referencing map files  Referencing map files in your TeX document
3.2 Scoping rules for preTeX conversion  Which text will be converted and which won't
3.3 Precompiled map (`.mpc') files  The implications of precompiled map files
3.4 Running preTeX  Command-line options and environment variables



3.1 Referencing map files

A map file cannot be processed by preTeX unless it is explicitly included in the document, by the placement of an appropriate \pretex command in the beginning of a TeX document (or in the preamble of a LaTeX document). Although it resembles a TeX macro, \pretex is actually a command to preTeX to load in and process the indicated map file. The argument to \pretex, which must be enclosed in braces, is the name of the map file to load, without the `.map' extension. For instance, to load the map file `tamil.map', which includes the definitions to typeset Tamil, you would begin your document with \pretex{tamil}.

You can include several \pretex commands in a single document if you need to reference more than one map file within that document (for instance, if your document is written in several Indian languages). Including a map file not only enables the given language for transliteration; it may also implicitly include a number of other TeX commands necessary for typesetting the language, for instance to load and define fonts. This is all defined within the map file.
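
As a rough illustration of the naming rule described above, the following Python sketch lists the map files that a document would reference. It is not part of preTeX; the function name `referenced_map_files` is hypothetical, and the sketch makes the simplifying assumption that \pretex commands never appear inside comments.

```python
import re

# Matches \pretex{name}, capturing the name between the braces.
PRETEX_COMMAND = re.compile(r"\\pretex\{([^{}]+)\}")


def referenced_map_files(source):
    """Return the map file names referenced by \\pretex commands, in order.

    The argument of \\pretex is the map file name without its `.map'
    extension, so the extension is appended here.
    """
    return [name + ".map" for name in PRETEX_COMMAND.findall(source)]


if __name__ == "__main__":
    document = r"\pretex{tamil}" + "\n" + r"\pretex{devnag}" + "\n"
    print(referenced_map_files(document))
```

Each name found this way corresponds to one map file that preTeX would load for the document.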



3.2 Scoping rules for preTeX conversion

The appearance of a map keyword in the text generally indicates that all subsequent text until the next closing brace should be translated according to the rules given in the map file. However, there are a few exceptions.



3.2.1 Math mode

Math mode, as marked by the appearance of a dollar sign (`$') or a double dollar sign (`$$'), temporarily ends text conversion. Text will not be converted until the following matching dollar sign or double dollar sign, ending math mode.

However, the LaTeX convention of `\(' and `\)' to delimit math mode is not respected.



3.2.2 Comments

Text following a percent sign (`%'), until the end of line, is generally treated as a comment by TeX and is thus not converted by preTeX. In fact, the default behavior of preTeX is to strip comments out from the file during the conversion process.

Note that there may be some subtle cases in TeX in which a dollar sign does not indicate math mode, and a percent sign does not begin a comment. PreTeX cannot know about these peculiar cases; it always treats these characters as special (except when they are escaped by `\').



3.2.3 TeX macros

TeX macros, as identified by a backslash (`\') followed by either a single non-alphabetic character or any number of alphabetic characters, are generally not converted (although the map file may explicitly indicate otherwise). Arguments to macros will also not be converted, as long as they appear within braces (this is actually due to the `Intervening braces' rule, below).



3.2.4 Intervening braces

If an opening brace appears following a keyword, text following that brace until its matching closing brace is not converted (unless another keyword appears within the braces). This rule allows TeX macros that take arguments within braces to be handled properly. (There are some exceptions--it is possible for a particular map file to specify that it should be `persistent' through opening braces. See section 4.5 The persistent map file command. It is also possible to specify this on the command line; see 3.4 Running preTeX.)

The following example demonstrates this:

 
\pretex{tamil}

This text will be typeset, unchanged, in English.
{As will this text.
\tml ka kaa ki kii ku kuu  % This text is interpreted as Tamil.
{ This text is untranslated, but the font may be incorrect. }
ke kee kai ko koo kau k  % After the brace closes, we return to Tamil.
} And here we return once more to English.

It is instructive to compare this example with its actual output after processing by preTeX:

 
\font\tmlfnt=wntml10
\def\c#1c{\char"#1{}}
\def\V#1{{\accent241 #1\discretionary{}{}{}}}
\def\tml{\tmlfnt}

This text will be typeset, unchanged, in English.
{As will this text.
\tml \c08c \c08ca \c0Ac \c0Bc \c0Cc \c0Dc
{ This text is untranslated, but the font may be incorrect. }
\c16c\c08c \c17c\c08c \c11c\c08c \c16c\c08ca \c17c\c08ca \c16c\c08c\c80c 
\V{\c08c}
} And here we return once more to English.

Note that the \pretex{tamil} command is replaced by a sequence of commands that define the Tamil font, and declare a pair of macros that are used to reference the Tamil glyphs. They also define the \tml keyword itself as a TeX macro that will activate the Tamil font.

In the subsequent document, all letters following the appearance of the \tml keyword are replaced, as instructed by the map file, with TeX commands to generate the appropriate glyphs in the Tamil font. Since the \tml keyword itself is not removed, TeX will switch to the Tamil font to properly typeset the Tamil glyphs.

Note the scoping rule for the nested passage: text within the nested braces after the \tml keyword is not converted. However, TeX itself does not automatically switch the font back to the Roman font for text within the nested curly braces (remember, the \tml keyword switched to the Tamil font), so the text in this example will almost certainly be typeset incorrectly.
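
The scoping rules of this section can be summarized in a short Python sketch. It is a simplified model written for this manual, not preTeX's actual implementation: the function name `converted_segments` is hypothetical, it only reports which stretches of text would be handed to the map file, and it does not perform any translation.

```python
import re

# A TeX macro: a backslash followed by letters, or by one non-letter.
MACRO = re.compile(r"\\(?:[A-Za-z]+|[^A-Za-z])")

SAMPLE = r"""\pretex{tamil}

This text will be typeset, unchanged, in English.
{As will this text.
\tml ka kaa ki kii ku kuu  % This text is interpreted as Tamil.
{ This text is untranslated, but the font may be incorrect. }
ke kee kai ko koo kau k  % After the brace closes, we return to Tamil.
} And here we return once more to English.
"""


def converted_segments(source, keyword="tml", persistent=False):
    """Return the pieces of `source` that fall inside the keyword's scope."""
    active = [False]      # one flag per open brace level
    in_math = False
    segments, current = [], []

    def flush():
        text = "".join(current).strip()
        if text:
            segments.append(text)
        current.clear()

    i = 0
    while i < len(source):
        macro = MACRO.match(source, i)
        ch = source[i]
        if macro:                       # macros are never converted
            flush()
            if macro.group(0) == "\\" + keyword:
                active[-1] = True       # keyword: convert until closing brace
            i = macro.end()
        elif ch == "%":                 # comment: skip to end of line
            flush()
            newline = source.find("\n", i)
            i = len(source) if newline == -1 else newline
        elif ch == "$":                 # `$' or `$$' toggles math mode
            flush()
            i += 2 if source.startswith("$$", i) else 1
            in_math = not in_math
        elif ch == "{":                 # intervening brace suspends conversion
            flush()
            active.append(active[-1] and persistent)
            i += 1
        elif ch == "}":
            flush()
            if len(active) > 1:
                active.pop()
            i += 1
        else:
            if active[-1] and not in_math:
                current.append(ch)
            i += 1
    flush()
    return segments


if __name__ == "__main__":
    for segment in converted_segments(SAMPLE):
        print(segment)
```

Run on the example document above, the sketch reports the two Tamil passages and skips the nested braces; passing `persistent=True` models a persistent map, in which case the text inside the nested braces is reported as well.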



3.3 Precompiled map (`.mpc') files

Some map files are quite large and complex (particularly `devnag.map', for instance), and it may take preTeX several seconds to read the file and prepare it for processing. To avoid having to face this delay each time you convert a document, preTeX will save out a `precompiled' image that contains all of the same information in the map file, in a predigested form, which preTeX can read much more quickly for future sessions instead of the actual map file.

This file is by default given the same name as the map file with the extension `.mpc', for instance, `devnag.mpc', and is typically written to a local temporary directory like `/usr/tmp'. It is possible to change this directory by setting the environment variable PRETEX_MPC to the directory you would prefer the `.mpc' files to be written to.

When preTeX attempts to load a map file, it first looks for a previously stored `.mpc' file. If an `.mpc' file exists (and the corresponding `.map' file hasn't recently been changed), the `.mpc' file is loaded instead.

Occasionally an `.mpc' file may get corrupted or damaged somehow, and it may be necessary to remove it. In general, it is always safe to remove the `.mpc' files, since they can be regenerated at any time from the source `.map' files.
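
The lookup described in this section might be modeled as in the following Python sketch. It is only an illustration: the function names are hypothetical, and the test for whether the `.map' file has changed is assumed here to be a comparison of file modification times, which this manual does not actually specify.

```python
import os
from pathlib import Path


def precompiled_path(map_path, environ=os.environ, default_dir="/usr/tmp"):
    """Return where the `.mpc' image for `map_path` would be stored.

    PRETEX_MPC, if set, overrides the default temporary directory.
    """
    directory = environ.get("PRETEX_MPC", default_dir)
    return Path(directory) / (Path(map_path).stem + ".mpc")


def can_use_precompiled(map_path, mpc_path):
    """Decide whether the `.mpc' file may be loaded instead of the map file.

    Assumption: the image is considered current when it is at least as
    new as the map file it was generated from.
    """
    map_path, mpc_path = Path(map_path), Path(mpc_path)
    if not mpc_path.exists():
        return False
    return mpc_path.stat().st_mtime >= map_path.stat().st_mtime


if __name__ == "__main__":
    print(precompiled_path("devnag.map", environ={}))
```

Under this model, deleting an `.mpc' file simply makes `can_use_precompiled` return false, so the map file is read again and a fresh image can be written.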



3.4 Running preTeX

In normal operation, you will create a document with the extension `.ptx' (don't use the extension `.tex', since that is the extension of the file that preTeX will try to generate). Then you may simply run preTeX on the document to generate the associated TeX input file.

For instance, suppose you had created the document `mydoc.ptx'. You would typeset it using the command sequence:

`pretex mydoc'
This reads the file `mydoc.ptx', processes it, and writes the file `mydoc.tex'.

`tex mydoc'
As in any TeX session, this reads `mydoc.tex' and writes `mydoc.dvi'.

There are also a number of options that you may use to control preTeX from the command line. If included, they should appear after the command `pretex' but before the input document filename. Many of them are primarily useful for designing map files or for debugging preTeX itself. They are:

`-v'
Enables verbose output. This causes preTeX to be more informative about what it is doing. You can increase the information further by repeating this option, but unless you are debugging preTeX itself you probably don't want to use more than two. When this option appears alone on the command line, with no other options, it causes preTeX to simply print its version number and exit.

`-e limit'
Sets the number of errors that will be tolerated before giving up on the input file. Here `limit' is a nonnegative integer. When preTeX processes the input document, it will report each error it finds and then continue processing, until the document is finished or `limit' errors have been encountered. (The most common kind of error is the appearance of a character in a document that has no translation in the map file--this is an invalid character.)

`-t'
Show only the translated text. If this option appears, preTeX will not output a complete TeX document; instead, any character that is not one of the letters following a map keyword (and thus translated by the map file) will be replaced by whitespace. This is useful for eliminating clutter if you're only interested in seeing exactly what preTeX's conversion process is generating.

`-i'
Interlace the translated output line-for-line with the original source input. If this option appears, preTeX will copy each line from the input document, unchanged, to the output document, and then write the corresponding converted line below it. It is useful for doing a line-by-line comparison of the original and new documents. This switch is often used in conjunction with `-t', and is particularly useful with specialty map files like `dnmeter.map'.

`-m mapfile'
Preload the given mapfile before loading the input document. This has the same effect as putting the command `\pretex{mapfile}' at the beginning of the document. It's useful if you want to use a particular map file to process a document, without having to edit the document to make it reference the map file.

`-f'
Following `-m mapfile', activate (`force') the given map file translation for the overall document. This has the same effect as placing the map's keyword at the very beginning of the document. It's useful in conjunction with `-m mapfile' if you want to use the map file to process a document, but the document does not already reference the map file or make use of its keywords.

`-p'
Toggle the `persistent' flag of all subsequently loaded maps. This causes each such map to apply to text even within a nested pair of braces. This is often used in conjunction with `-f' and `-m mapfile', in this order: `-p -m mapfile -f'. This has the effect of processing the entire document, both inside and outside braces, with the indicated map file.

`-l'
Show each complete map file description as it is read. This is primarily useful when designing map files; it allows the map file designer to see precisely how the file is being interpreted. It may help to clarify typographical errors in the map file.

`-L'
Like `-l', but show a space between each distinct letter. This is an additional debugging tool when designing map files; it is useful to clarify precisely how the letters are being read in the map file. (PreTeX supports multiple-character `letters.' Since a space is written between the individual letters, this differentiates a single two-character letter from two one-character letters.)

`-V'
Show the decisions being made as possible translations are considered and rejected. This generates a lot of output. It's particularly useful when designing complex map files, to illustrate how preTeX is arriving at its translation of the input text. You probably only want to use this option on documents that only consist of one or two words; also see the `-c' option, below.

`-g'
Do not expand any nonterminals inline. Normally, when preTeX reads the map file, it tries to optimize matching time by collapsing simple nonterminal definitions into `inline' definitions. Using this option prevents this from happening. This is most useful in conjunction with `-V', because it makes it easier to relate the output of `-V' with the map file. Using this option also prevents the generation of `.mpc' files (see section 3.3 Precompiled map (`.mpc') files).

`-c'
Take the input text from the command line, instead of reading it from a source .ptx file. If this option appears, then the rest of the command line (after all of the options) is interpreted as actual text to convert, instead of as the name of an input document. In this case, the converted output is written to the standard output. This is useful for converting simple test words and phrases in conjunction with `-V' and `-g', as well as with `-m mapfile' and `-f'.
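
The ordering rule for options (after `pretex', before the document name) and the two-step workflow can be sketched in Python as follows. The helper names are hypothetical, and the sketch only builds the argument lists; it does not run either program.

```python
def pretex_command(document, options=()):
    """Build the preTeX command line: options come before the document name."""
    return ["pretex", *options, document]


def build_commands(document, options=()):
    """Return the two commands of the usual workflow, in order.

    First preTeX turns `document.ptx' into `document.tex', then TeX
    turns `document.tex' into `document.dvi'.
    """
    return [pretex_command(document, options), ["tex", document]]


if __name__ == "__main__":
    # Process the whole document with one map file, inside and outside braces.
    for command in build_commands("mydoc", ["-p", "-m", "tamil", "-f"]):
        print(" ".join(command))
```

The lists could be passed to a process launcher such as Python's `subprocess.run` on a system where both programs are installed.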



4. Defining map files

A preTeX map file is capable of defining one or more `languages,' where a language is considered to be a single set of rules for converting text, associated with a particular keyword. By convention, only one language is defined in each map file, although this need not be the case.

The map file consists of a sequence of lines. Each line is either empty, a command, an argument for a previous command, or a command and argument pair in a single line. It is also valid to include comments throughout the map file; as in TeX, a comment is marked by a percent sign (`%'), and extends to the end of the line.

In general, commands always begin flush with the left margin in the file: there is no whitespace before a command. On the other hand, arguments are always indented; there must be at least one space before each argument to differentiate it from a command.

The following commands are supported:

4.1 The language map file command  Begins a new language definition
4.2 The keyword map file command  Defines one or more keywords that activate this language
4.3 The top map file command  Specify a sequence of TeX commands to insert
4.4 The font map file command  Define a TeX attribute to use for certain expansions
4.5 The persistent map file command  Specify that this language is applied through braces
4.6 The alphabet map file command  Define the lexical alphabet that will be used
4.7 The map map file command  The context-free grammar that defines the translation

In the following descriptions, many of the examples are taken from the `tamil.map', distributed with preTeX.



4.1 The language map file command

This command begins a new language definition. It must be the first command to appear in a map file. All commands following this point in the map file, up until the next language command, will relate to this language.

The argument is the name of the language, and is usually specified on the same line with the language command. This name is presently not used by preTeX; it is strictly for user information.

Example:

 
language Tamil

This appears at the top of `tamil.map', to begin the definition for the Tamil language rules.



4.2 The keyword map file command

This command identifies the keyword or keywords that will activate the use of the conversion rules for this language. The argument, which is usually specified on the same line with the keyword command, consists of one or more words, each of which will be treated as an equivalent keyword. When the keyword appears in the document as a TeX macro (i.e. preceded by a backslash), then subsequent letters in the document up until the next closing brace will be converted according to the rules in this map file (see section 3.2 Scoping rules for preTeX conversion).

Note that the keyword itself is not normally removed from the document; thus, the top command (below) should generally also define the keyword as a TeX macro to enable the appropriate font, or do whatever other setup is necessary.

It is legal, but of questionable value, to omit the keyword command, since a language without keywords cannot ever be used (except via the `-f' command-line option; see section 3.4 Running preTeX).

Example:

 
keyword tml

This sets up the keyword `\tml' to activate Tamil text. Thus, preTeX would convert the following text:

 
ka kaa \tml ki kii ku kuu

To the following:

 
ka kaa \tml \c0Ac \c0Bc \c0Cc \c0Dc

The `\tml' keyword marks the beginning of Tamil conversion, but the `\tml' keyword itself is not removed (or converted in any way).



4.3 The top map file command

The arguments to this command (which are typically given in the following lines) are a number of TeX commands that will be written to the output document in place of the `\pretex' command that activated this language. These TeX commands are not interpreted; they are simply copied verbatim to the output file.

This is generally used to do any necessary TeX setup, and to define a TeX macro for each keyword defined above. This command is optional.

Example:

 
top
  \font\tmlfnt=wntml10
  \def\c#1c{\char"#1{}}
  \def\V#1{{\accent241 #1\discretionary{}{}{}}}
  \def\tml{\tmlfnt}

The first line, `\font\tmlfnt=wntml10', tells TeX to prepare the Tamil font, `wntml10', for use in this document. The next two lines define some TeX macros that will be used in the map expansion rules for Tamil. The last line, `\def\tml{\tmlfnt}', sets up the `\tml' keyword as a TeX macro that enables the Tamil font.



4.4 The font map file command

Sometimes it is necessary to switch TeX fonts in order to typeset certain symbols in a language (which may not be defined in the main font). This optional command is provided as a convenient way to facilitate this.

The arguments to this command consist of one or more font name / definition pairs, one per line. The font name may be any one-word name. The definition gives the TeX sequence to typeset a particular passage of text in the indicated font (or, for that matter, to apply any special formatting properties to the indicated passage of text); the hash mark (`#') symbol stands for the text passage.

The preTeX `font' defined by this command is not necessarily related to any TeX font, nor indeed is it necessarily a font at all; it just becomes a name to refer to some particular TeX sequence that changes the typesetting properties.

Example:

The wntamil font includes glyphs for all of the Tamil letters, but it does not include glyphs for the digits or punctuation marks. Thus, in order to typeset any of these, it is necessary to switch to the Roman font. We thus define a preTeX font, which we will call `roman':

 
font roman {\rm #}

This declaration makes the keyword `roman' available when defining conversions within the map section (below). In `tamil.map', all of the digits and punctuation symbols are declared using this `roman' keyword, which tells preTeX to write punctuation symbols and digits between `{\rm' and `}'.

For instance, the following text:

 
\tml ka 123 kaa

Would be translated to:

 
\tml \c08c {\rm 123} \c08ca
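
The role of the hash mark in a font definition can be pictured with a one-line Python sketch. The function name `apply_font` is hypothetical, and the sketch assumes the definition contains the hash mark only as the placeholder for the passage.

```python
def apply_font(definition, passage):
    """Substitute `passage` for the hash mark in a font definition."""
    return definition.replace("#", passage)


if __name__ == "__main__":
    # The `roman' font from tamil.map wraps the passage in {\rm ...}.
    print(apply_font(r"{\rm #}", "123"))
```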



4.5 The persistent map file command

Normally, preTeX will stop text translation when it encounters an opening brace, and resume again when the matching closing brace is encountered (see section 3.2 Scoping rules for preTeX conversion). When the persistent command appears in a map file, however, that language is deemed to be `persistent' past an opening brace, and all text following the keyword (until the next closing brace) will be translated, even text within a nested pair of braces.

This command is optional (in fact, unusual), and takes no arguments.



4.6 The alphabet map file command

This command defines the `letters' that are used in the language. Each individual letter may consist of any number of ASCII characters, but is typically only one or two.

The arguments to this command are the letters of the alphabet, generally one per line. If any two or more letters appear on the same line together, they are considered by preTeX to be alternate ways to write the same letter.

All of the individual ASCII characters are already defined as letters in their own right, so this command is never necessary unless you want to define multiple-character letters, or you need to define two or more letters to be equivalent.

Example:

 
alphabet
        k g kh gh
        "n
        c ch
        ~n
        .d .t .dh .th
        .n
        m
        d t dh th
        p b ph bh
        m
        y
        r
        l
        v
        zh
        .l
        =r
        'n
        "s
        .s
        s
        j jh
        h
        k.s
        a
        aa
        i
        ii
        u
        uu
        e
        ee
        ai
        o
        oo
        au
        H
        .h
        M
        srii
        .r
        .a

With this alphabet in effect, the Tamil word `ko.n.da' would be interpreted as a five-letter word: `k', `o', `.n', `.d', and `a'. Furthermore, it would still be the same five-letter word if it were written as `go.n.tha'.

Note that all of the one-character letters defined above on lines by themselves are unnecessary, because all of the one-character letters are already defined. They are included in `tamil.map' strictly for completeness.

Multiple-character letters are always interpreted as the longest possible letter, even when alternate interpretations are possible, and regardless of the relative priorities given in the map section (below). For example, the Tamil sequence `ai' is always interpreted as the single vowel `ai', and never as the vowel `a' followed by the vowel `i'. If you need to allow `a' and `i' to be recognized individually, you will either need to change the alphabet and remove `ai', or require the user to type `a{}i', or provide a `do-nothing' letter, such as an underscore (`_'), which does not produce any output, and will thus allow the user to type `a_i' to mean `a' followed by `i'.
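
The longest-letter rule and the treatment of equivalent spellings can be illustrated with a short Python sketch. It is not preTeX's own code: the function names are hypothetical, only a few lines of the alphabet above are reproduced, and each group of equivalent spellings is represented by its first member.

```python
# An excerpt of the alphabet shown above; spellings on one line are equivalent.
ALPHABET_LINES = ["k g kh gh", ".d .t .dh .th", ".n", "aa", "ai", "au"]


def build_alphabet(lines):
    """Map every spelling to the first spelling on its line."""
    canonical = {}
    for line in lines:
        spellings = line.split()
        for spelling in spellings:
            canonical[spelling] = spellings[0]
    return canonical


def split_letters(text, canonical):
    """Split `text` into letters, always taking the longest possible letter.

    Any single character not listed in the alphabet is a letter in its
    own right.
    """
    longest = max(map(len, canonical), default=1)
    letters = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in canonical or size == 1:
                letters.append(canonical.get(chunk, chunk))
                i += size
                break
    return letters


if __name__ == "__main__":
    alphabet = build_alphabet(ALPHABET_LINES)
    print(split_letters("ko.n.da", alphabet))
    print(split_letters("go.n.tha", alphabet))
    print(split_letters("kai", alphabet))
```

Both spellings of the example word yield the same five letters, and `kai' is split as `k' followed by the single vowel `ai'.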



4.7 The map map file command

This command is the main body of the map file. This describes the context-free grammar (see section A. What is a context-free grammar?) that defines the translation performed for this language. The grammar consists of a number of nonterminal declarations, one of which must be named `<root>'. See section 5. How preTeX translates input to output.

4.7.1 Basic nonterminal declaration  The basic structure of a nonterminal declaration
4.7.2 Modifying the nonterminal declaration  
4.7.3 Specifying a priority  
4.7.4 Special predefined nonterminals  Some special predefined nonterminals
4.7.5 Limitations of the grammar  Some things you can't represent
4.7.6 Simple examples of nonterminals  Some simple examples



4.7.1 Basic nonterminal declaration

A nonterminal declaration consists of the nonterminal name, indented at least one space (to differentiate it from the map command), followed by the string `::=', and optionally followed by one or more keywords and/or a relative priority that apply to the nonterminal definition, all on one line.

Each subsequent line defines a match/replacement string pair for the nonterminal. Each line must be indented at least one space, and contain a string that the nonterminal might match in the input document, followed by whitespace and the corresponding replacement string, all on one line.

There are two different kinds of match strings. The first kind consists entirely of literal text. This is called a terminal reference. In this case, when the matched text appears in the input document, it is simply replaced on the output document by the corresponding replacement text.

The second kind consists of some combination of literal text and one or more nonterminal names (possibly the name of the same nonterminal). This is called a nonterminal reference. In this case, a match is made when the text in the input document matches all of the referenced nonterminals, in order, including any literal text in the match string. When the input document text is so matched, it is again replaced by the corresponding replacement text. The replacement text may optionally also include any of the same nonterminal names that were referenced in the match string; in this case, the output string produced by the referenced nonterminal is inserted at that point.
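
To make the two kinds of match strings concrete, here is a small Python sketch of a matcher for such declarations. It is a toy model under several assumptions, not preTeX's algorithm: the grammar and its output strings are invented for illustration, the longest match always wins (relative priorities are ignored), a nonterminal name is assumed to appear at most once in a match string, and a nonterminal that refers to itself at the start of its own match string would recurse forever.

```python
import re

# Splits a match string into nonterminal names and runs of literal text.
TOKEN = re.compile(r"<[^<>]+>|[^<]+|<")
REFERENCE = re.compile(r"<[^<>]+>")

# A hypothetical grammar: each nonterminal maps to (match, replacement) pairs.
GRAMMAR = {
    "<consonant>": [("k", "K"), ("t", "T")],          # terminal references
    "<vowel>": [("a", "A"), ("aa", "AA")],            # terminal references
    "<syllable>": [                                   # nonterminal references
        ("<consonant><vowel>", "[<consonant>+<vowel>]"),
        ("<vowel>", "[<vowel>]"),
    ],
}


def match(grammar, name, text, pos=0):
    """Return (end, output) for the longest match of `name` at `pos`, or None."""
    best = None
    for pattern, replacement in grammar[name]:
        result = match_sequence(grammar, TOKEN.findall(pattern), text, pos)
        if result is None:
            continue
        end, outputs = result
        # Nonterminal names in the replacement stand for their own output.
        output = REFERENCE.sub(
            lambda m: outputs.get(m.group(0), m.group(0)), replacement)
        if best is None or end > best[0]:
            best = (end, output)
    return best


def match_sequence(grammar, tokens, text, pos):
    """Match literal text and nonterminals in order, starting at `pos`."""
    outputs = {}
    for token in tokens:
        if token in grammar:
            sub = match(grammar, token, text, pos)
            if sub is None:
                return None
            pos, outputs[token] = sub
        elif text.startswith(token, pos):
            pos += len(token)
        else:
            return None
    return pos, outputs


if __name__ == "__main__":
    print(match(GRAMMAR, "<syllable>", "kaa"))
```

In this toy grammar, `<consonant>' and `<vowel>' contain only terminal references, while `<syllable>' matches a consonant followed by a vowel and reuses their outputs inside its own replacement text.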



4.7.2 Modifying the nonterminal declaration

There are a number of optional keywords that may be included on the line that defines a nonterminal, following the `::=' operator. These keywords are not to be confused with the keyword command in the map file (see section 4.2 The keyword map file command); they are special words that modify the nonterminal definition.

The use of most of these keywords is actually discouraged. Generally, they are unnecessary, and only add complexity to a language definition. However, there do exist rare occasions when it is appropriate to use each of them.

Although the keywords are specified on the same line with the `::=' operator, in some cases it makes sense for a keyword to be applied to some but not all of a nonterminal's expansions. To do this, simply repeat the nonterminal definition line with the new keywords listed, e.g.:

 
<map> ::=
  abc
  def

<map> ::= literal
  ghi



4.7.2.1 inline

The inline keyword is a hint to preTeX that this nonterminal should always be evaluated inline instead of in the normal way (see section 5.5 Inline expansions). In general, you should not need to use this keyword, because preTeX can do a pretty good job of figuring out by itself which nonterminals should be made inline.

The inline keyword always applies to the entire nonterminal, even if it is only defined for some of the nonterminal's expansions.



4.7.2.2 greedy

A greedy nonterminal is handled in a special way: it always matches the longest possible string in the input document it can, regardless of its priority (see section 5.4 The effect of relative priority).

For example:

 
<number> ::= greedy
  <digit>                         <digit>
  <digit><number>                 <digit><number>

In this example, the nonterminal `<number>', if it matches anything at all, will always match an entire string of consecutive digits--even if some other nonterminal definition with a higher priority might have matched one of the digits.
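
The effect of greedy on `<number>' can be modeled in a few lines of Python. This is a minimal sketch assuming that `<digit>' matches the ASCII digits 0 through 9; the function name is hypothetical and the code is not taken from preTeX itself.

```python
DIGITS = "0123456789"

def match_number_greedy(text, pos=0):
    """Return the end index of the longest run of digits at `pos`, or None."""
    end = pos
    # Keep consuming digits; a greedy nonterminal never stops early.
    while end < len(text) and text[end] in DIGITS:
        end += 1
    return end if end > pos else None
```

Because the loop only stops at the first non-digit, no other rule gets a chance to claim a digit from the middle of the run, which is the behavior the greedy keyword asks for.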

The use of the greedy keyword is discouraged. It is usually unnecessary, because preTeX generally prefers the longest possible match anyway (see section 5.3 How preTeX chooses the best match of several possible choices), and its use interferes with the proper execution of relative priorities.

The primary advantage of greedy is performance. Since the rules for matching a greedy nonterminal are a little simpler, it may be slightly faster for preTeX to evaluate a greedy nonterminal than a normal one.

The greedy keyword always applies to the entire nonterminal, even if it is only defined for some of the nonterminal's expansions.



4.7.2.3 literal

The literal keyword is used to mark one or more expansions of a nonterminal that are to be interpreted literally, i.e. they are not to be scanned for other nonterminal references. This keyword is only necessary if you need your nonterminal to match a string that happens to contain the name of another nonterminal! The use of this keyword is therefore somewhat limited. It's probably a better idea just to rename the nonterminal in question so it doesn't look like any string that you might expect to read from the input.

The literal keyword only applies to those expansions of the nonterminal for which it is explicitly specified.



4.7.2.4 except

The except keyword is an extremely powerful keyword, and it is subject to overuse by a beginning map file designer. It marks one or more expansions of a nonterminal that are exceptions to the normal rule: things which the nonterminal should not match.

For example:

 
<lucky-number> ::=
  <digit>                         <digit>
  <digit><digit>                  <digit><digit>

<lucky-number> ::= except
  13

In this example, the nonterminal `<lucky-number>' will match any one- or two-digit number except the number 13.

The use of the except keyword is discouraged. Generally, if you need a particular string not to be matched by a certain nonterminal, you will be better off defining the correct match for that string using a different nonterminal, and giving it a higher priority (see section 4.7.3 Specifying a priority).

The except keyword only applies to those expansions of the nonterminal for which it is explicitly specified. In fact, it never makes sense to apply this keyword to an entire nonterminal's definition, because the nonterminal could then never match anything.
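
The `<lucky-number>' example amounts to a normal rule with a set of excluded strings. The following Python sketch models that idea under the assumption that `<digit>' matches the ASCII digits; the names `EXCEPTIONS` and `matches_lucky_number` are hypothetical.

```python
# Strings listed under the "except" form of the definition.
EXCEPTIONS = {"13"}

def matches_lucky_number(s):
    """True if s is a one- or two-digit number that is not an exception."""
    is_candidate = s.isascii() and s.isdigit() and 1 <= len(s) <= 2
    return is_candidate and s not in EXCEPTIONS
```

If every expansion were placed in the exception set, the function could never return a true value, which mirrors the remark that except never makes sense for an entire definition.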



4.7.3 Specifying a priority

At times it is necessary to explicitly specify that certain expansions should be preferred over others. This can be done by specifying a relative priority for the expansion(s) in question (see section 5.4 The effect of relative priority).

In general, when a number appears following the `::=' symbol for a nonterminal definition, that number is added to the priority for any string that includes that match. If the priority number is positive, the match will be preferred over other matches; if it is negative, the other matches will be preferred instead.

For example:

 
<lucky-number> ::=
  <digit>                         <digit>
  <digit><digit>                  <digit><digit>

<unlucky-number> ::= 10
  13                              (13)

This is similar to the example for the except keyword (see section 4.7.2 Modifying the nonterminal declaration), demonstrating how priority can be used to achieve a similar effect. In this example, the nonterminal `<lucky-number>' will match any one- or two-digit number. However, the nonterminal `<unlucky-number>' will specifically match the number 13, which is a two-digit number. By assigning `<unlucky-number>' the relative priority of 10, we guarantee that preTeX will always use `<unlucky-number>' to match the number 13, rather than `<lucky-number>'.
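
The choice between `<lucky-number>' and `<unlucky-number>' can be sketched in Python as follows. This is a simplified model, not preTeX's real selection algorithm: it assumes that the candidate with the highest priority wins and that ties are broken in favor of the longer match. The `RULES` table and `best_match` function are hypothetical.

```python
import re

# (name, relative priority, pattern, replacement function); all hypothetical.
RULES = [
    ("<lucky-number>", 0, re.compile(r"[0-9]{1,2}"), lambda s: s),
    ("<unlucky-number>", 10, re.compile(r"13"), lambda s: "(13)"),
]

def best_match(text, pos=0):
    """Return (name, end, output) for the preferred rule at `pos`, or None."""
    candidates = []
    for name, priority, pattern, replace in RULES:
        m = pattern.match(text, pos)
        if m:
            candidates.append((priority, m.end(), name, replace(m.group())))
    if not candidates:
        return None
    # Assumed ordering: higher priority first, then the longer match.
    priority, end, name, output = max(candidates, key=lambda c: (c[0], c[1]))
    return name, end, output
```

Under these assumptions the input `13` is claimed by `<unlucky-number>' and written as `(13)`, while any other one- or two-digit number falls through to `<lucky-number>' and is copied unchanged.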



4.7.4 Special predefined nonterminals

Certain nonterminals are implicitly already defined in every map file. These are provided for the convenience of the map file designer. Some of them provide the only way to reference a particular special character in the map file (like a space character, for instance); others are simply conveniences.

The predefined nonterminals are:

`<empty-string>'
This stands for the empty string, or nothing at all. This is quite useful for designing sophisticated grammars. If a nonterminal matches `<empty-string>', it can be matched without using any characters from the input document (most nonterminals also define some nonempty matches). There are certain limitations regarding a nonterminal that matches nothing; see section 4.7.5 Limitations of the grammar.

`<space>'
This stands for a space character, and only a space character. This is the character that corresponds to a single press of the space bar.

`<tab>'
This stands for a tab character, and only a tab character. This is usually the character that corresponds to a single press of the tab key.

`<spaces>'
This stands for any number (including zero) of spaces and/or tab characters. However, an intervening TeX macro will block this match.

`<newline>'
This stands for a newline character, the invisible character that marks the end of a line.

`<whitespace>'
This stands for any number of spaces, tabs, newlines, and/or TeX macros. However, there must be at least some whitespace.

`<empty-whitespace>'
This stands for any number of spaces, tabs, newlines, and/or TeX macros, as well as for nothing at all.

`<percent>'
This stands for the percent sign, `%'. It requires a special nonterminal because the percent sign is the preTeX command character, and if it appears in a map file it is taken as the beginning of a comment. Thus, the only way to match (or output) a percent symbol is to use this nonterminal.

`<any>'
This stands for any non-whitespace character: any letter, digit, or punctuation mark.

`<texmacro>'
This stands for a single TeX macro. It is rare that you will need to match this explicitly.

Note that the special nonterminals for matching whitespace and TeX macros are fairly special-purpose. You don't normally need to try to provide definitions for these explicitly in the map file; preTeX's normal behavior is to transmit them to the output document unchanged if you do not mention them. However, it is occasionally useful to build a grammar that can change the whitespace and/or intervening TeX macros as well as the normal text.



4.7.5 Limitations of the grammar

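Nonterminal definitions must not be recursive without consuming characters. In other words, an expansion must not lead back to the same nonterminal, whether directly or through other nonterminals, unless at least one character of the input document is matched first.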
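
One reading of this restriction is that a nonterminal may not reach itself again before any literal text has been matched, either directly or by way of nonterminals that can match `<empty-string>'. The Python sketch below checks a grammar for that condition. It is an illustration of this interpretation only, not a description of what preTeX itself checks; the grammar representation (lists of symbols, with an empty list standing for `<empty-string>') and both function names are hypothetical.

```python
def nullable_set(grammar):
    """Return the nonterminals that can match without consuming any input."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for name, expansions in grammar.items():
            if name in nullable:
                continue
            # An expansion is empty-matching if every symbol in it is.
            if any(all(sym in nullable for sym in exp) for exp in expansions):
                nullable.add(name)
                changed = True
    return nullable

def recursive_without_consuming(grammar):
    """Return the nonterminals that can reach themselves before any literal."""
    nullable = nullable_set(grammar)
    # edges[a] holds the nonterminals a may enter before consuming a character.
    edges = {name: set() for name in grammar}
    for name, expansions in grammar.items():
        for exp in expansions:
            for sym in exp:
                if sym not in grammar:
                    break              # literal text consumes characters
                edges[name].add(sym)
                if sym not in nullable:
                    break              # this symbol must consume something
    bad = set()
    for start in grammar:
        seen, stack = set(), list(edges[start])
        while stack:
            cur = stack.pop()
            if cur == start:
                bad.add(start)
                break
            if cur not in seen:
                seen.add(cur)
                stack.extend(edges[cur])
    return bad
```

For instance, a definition such as `<digit><number>` inside `<number>' passes the check, because `<digit>' always consumes a character before `<number>' is entered again, whereas an expansion of `<list>' that begins with `<list>' itself is reported.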



4.7.6 Simple examples of nonterminals

 
<root> ::=
    <digit>                         <digit>
    a                               \c00c
    aa                              \c01c
    i                               \c02c
    ii                              \c03c
    <consonant>                     \V{<consonant>}
    <consonant>a                    <consonant>
    <consonant>e                    \c16c<consonant>
    <consonant>u                    <consonant>\c0F2c

In this abbreviated example, the nonterminal `<root>' might match whatever `<digit>' matches, in which case the corresponding replacement is exactly whatever `<digit>' indicated it should be. Or it might match any of the vowels `a', `aa', `i', or `ii', in which case the replacement is one of four different glyphs, which presumably represent the corresponding vowels in Tamil. Finally, it might match anything that `<consonant>' matches, either alone or followed by one of the vowels `a', `e', or `u', in which case the replacement string is whatever `<consonant>' produces, along with some other glyph, either before, after, or around it, according to the vowel.
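
To make the example concrete, here is a Python sketch that applies the vowel and consonant rules above to a short input string. Several things in it are assumptions rather than facts about preTeX: the `CONSONANTS` table and its placeholder outputs `K` and `M` are invented for illustration, the `<digit>' rule is omitted, unmatched characters are simply copied, and the longest match is taken at each position.

```python
VOWELS = {"a": r"\c00c", "aa": r"\c01c", "i": r"\c02c", "ii": r"\c03c"}
CONSONANTS = {"k": "K", "m": "M"}   # hypothetical placeholder outputs

def root_matches(text, pos):
    """Return (end, output) for every <root> expansion that matches at pos."""
    found = []
    for vowel, glyph in VOWELS.items():
        if text.startswith(vowel, pos):
            found.append((pos + len(vowel), glyph))
    for cons, out in CONSONANTS.items():
        if text.startswith(cons, pos):
            end = pos + len(cons)
            found.append((end, r"\V{" + out + "}"))            # <consonant>
            if text.startswith("a", end):
                found.append((end + 1, out))                   # <consonant>a
            if text.startswith("e", end):
                found.append((end + 1, r"\c16c" + out))        # <consonant>e
            if text.startswith("u", end):
                found.append((end + 1, out + r"\c0F2c"))       # <consonant>u
    return found

def translate(text):
    """Translate text by repeatedly taking the longest <root> match."""
    pos, pieces = 0, []
    while pos < len(text):
        found = root_matches(text, pos)
        if not found:
            pieces.append(text[pos])   # assumed: copy unmatched characters
            pos += 1
            continue
        end, out = max(found, key=lambda m: m[0])
        pieces.append(out)
        pos = end
    return "".join(pieces)
```

In this model the glyph for `e' is written before the consonant's output and the glyph for `u' after it, while a bare consonant is wrapped in `\V{...}`, following the replacement strings in the example.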



5. How preTeX translates input to output

Once preTeX has started translating text in the input document (i.e. for text in the document that appears after the map file keyword--see 3.2 Scoping rules for preTeX conversion), it follows some fairly complex, but predictable, rules to convert the text, as defined by the alphabet (see section 4.6 The alphabet map file command) and map (see section 4.7 The map map file command) sections of the map file.

5.1 Grouping the input into letters  
5.2 How preTeX uses the `<root>' nonterminal  
5.3 How preTeX chooses the best match of several possible choices  
5.4 The effect of relative priority  
5.5 Inline expansions  How preTeX optimizes the grammar


5.1 Grouping the input into letters


5.2 How preTeX uses the `<root>' nonterminal


5.3 How preTeX chooses the best match of several possible choices


5.4 The effect of relative priority


5.5 Inline expansions

A. What is a context-free grammar?

A context-free grammar is a mathematical way to represent a language. In general, it starts with a symbol that may represent any one of a handful of letters, or strings of letters:

 
<N> ::= 
   a
   b
   c

We say that `<N>' is a nonterminal that may stand for any of `a', `b', or `c'. In this example, `a', `b', and `c' are called terminals, because they don't stand for anything else--just themselves.

The grammar becomes interesting when we let `<N>' stand for other nonterminals as well:

 
<N> ::=
   a
   b<E>
   <E><E>

<E> ::=
   cd
   ef

In this example, `<E>' could stand for either `cd' or `ef', and `<N>' could stand for either the letter `a', or the letter `b' followed by whatever `<E>' stands for, or two occurrences of whatever `<E>' stands for. To be explicit, then, `<N>' could be any of `a', `bcd', `bef', `cdcd', `cdef', `efcd', or `efef'.
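
The list of strings above can be reproduced mechanically. The following Python sketch enumerates everything a nonterminal stands for in this small grammar. The `GRAMMAR` table and `expand` function are illustrative names only, and the approach works only for grammars without recursion, since a recursive nonterminal stands for infinitely many strings.

```python
from itertools import product

# Each expansion is a list of symbols: nonterminal names or literal text.
GRAMMAR = {
    "<N>": [["a"], ["b", "<E>"], ["<E>", "<E>"]],
    "<E>": [["cd"], ["ef"]],
}

def expand(symbol):
    """Return every string that `symbol` can stand for (finite grammars only)."""
    if symbol not in GRAMMAR:
        return [symbol]          # a terminal stands only for itself
    results = []
    for expansion in GRAMMAR[symbol]:
        # Combine every choice for each symbol in the expansion, in order.
        for parts in product(*(expand(s) for s in expansion)):
            results.append("".join(parts))
    return results

print(expand("<N>"))
# prints ['a', 'bcd', 'bef', 'cdcd', 'cdef', 'efcd', 'efef']
```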

Finally, a nonterminal may even stand for itself, which leads to quite a lot of power. Here is a nonterminal that stands for all of the odd-length palindromes of `a', `b', and `c':

 
<P> ::=
   a
   b
   c
   a<P>a
   b<P>b
   c<P>c

If you trace this through, you should be able to see that `<P>' stands for `cac', `bab', `bccbabccb', and `abbccbccbba' (for instance), but not `abc' or `abbbc'.
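
The tracing can also be done by a short recursive function that follows the definition of `<P>' line by line. This Python sketch is an illustration only; the function name `matches_P` is hypothetical.

```python
def matches_P(s):
    """True if s is one of the strings that <P> stands for."""
    if len(s) == 1:
        return s in ("a", "b", "c")          # the expansions a, b, c
    if len(s) >= 3 and s[0] == s[-1] and s[0] in "abc":
        return matches_P(s[1:-1])            # the expansions a<P>a, b<P>b, c<P>c
    return False

for word in ["cac", "bab", "bccbabccb", "abbccbccbba", "abc", "abbbc"]:
    print(word, matches_P(word))
```

The first four words are accepted and the last two are rejected, in agreement with the examples given above.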

PreTeX extends this concept of a context-free grammar by adding an arbitrary replacement string to correspond to each string a nonterminal stands for. In general, preTeX works by repeatedly matching a sequence of letters from the input document against something the root nonterminal stands for, and writing the corresponding replacement string to the output document. See section 5. How preTeX translates input to output.



B. Other packages that typeset Indian languages

All of the following packages can be found on CTAN, the Comprehensive TeX Archive Network. You can browse this archive at http://tug2.cs.umb.edu/ctan/.

B.1 Avinash Chopde's itrans  A general Indian language solution by Avinash Chopde
B.2 University of Washington's wntamil  A Tamil conversion program developed at UW
B.3 Frans Velthuis' Devanagari package  Frans Velthuis' program to typeset Devanagari



B.1 Avinash Chopde's itrans

This package shares many of preTeX's goals. In particular, it seeks to provide a common interface to typeset as many different Indian languages as possible. Its transliteration scheme is also to a certain degree user-definable, although it is not quite as flexible as a context-free grammar, and it cannot easily be extended to other applications. But what it does, it does well.

As of this writing, it supports Devanagari, Tamil, Telugu, and Bengali. You can download itrans from the CTAN directory `language/indian/itrans'.



B.2 University of Washington's wntamil

This package is strictly for typesetting Tamil documents. It is quite speedy although the transliteration rules are fixed. The Tamil font used by preTeX was originally developed for this package, and may be found here.

You can download wntamil from the CTAN directory `language/tamil/wntamil'.



B.3 Frans Velthuis' Devanagari package

This package is strictly for typesetting Devanagari (and Hindi) documents. Its transliteration rules are quite nice (in fact, preTeX's `devnag.map' file is designed to emulate Devanagari's transliteration rules), but it is not user-extensible. The Devanagari font used by preTeX was developed by Frans Velthuis for use with this package, and may be found here.

Devanagari may be found in the CTAN directory `language/devanagari/distrib'.



Concept Index

Index Entry         Section

I
Indian languages    2. Indian languages

O
Overview            1. Overview



Short Table of Contents

1. Overview
2. Indian languages
3. More details on using preTeX in general
4. Defining map files
5. How preTeX translates input to output
A. What is a context-free grammar?
B. Other packages that typeset Indian languages
Concept Index


About this document

This document was generated by David Rose on March 31, 2004 using texi2html.
