Section 1: Introduction
This is , a document compiler intended to produce typesetting of high quality. The Pascal program that follows is the definition of , a standard version of that is designed to be highly portable so that identical output will be obtainable on a great variety of computers.
The main purpose of the following program is to explain the algorithms of as clearly as possible.
As a result, the program will not necessarily be very efficient when a particular Pascal compiler has translated it into a particular machine language.
However, the program has been written so that it can be tuned to run efficiently in a wide variety of operating environments by making comparatively few changes.
Such flexibility is possible because the documentation that follows is written in the WEB
language, which is at a higher level than Pascal;
the preprocessing step that converts WEB
to Pascal is able to introduce most of the necessary refinements.
Semi-automatic translation to other languages is also feasible, because the program below does not make extensive use of features that are peculiar to Pascal.
A large piece of software like has inherent complexity that cannot be reduced below a certain level of difficulty, although each individual part is fairly simple by itself.
The WEB
language is intended to make the algorithms as readable as possible, by reflecting the way the individual program pieces fit together and by providing the cross-references that connect different parts.
Detailed comments about what is going on, and about why things were done in certain ways, have been liberally sprinkled throughout the program.
These comments explain features of the implementation, but they rarely attempt to explain the language itself, since the reader is supposed to be familiar with The TeXbook.
Donald Knuth is the author of . The commentaries outside of these notes are the original from : The Program (with minor modifications such as constants written in uppercases, but the original text stays unchanged).
The code is a conversion from Pascal into C mostly based on the automatic conversion tool WEB to cweb by Martin Ruckert, with custom adjustments (most of them is the format, and a lot of braces added for conditions followed by a single instruction for example).
It uses Markdown instead of
WEB
to write the documentation and the code in fenced blocks. Thanks to mdBook, it is rendered as web pages, and a Python script is used to convert this in compilable code. More details are given in the README.
Section 2
The present implementation has a long ancestry, beginning in the summer of 1977, when Michael F. Plass and Frank M. Liang designed and coded a prototype based on some specifications that the author had made in May of that year.
This original proto included macro definitions and elementary manipulations on boxes and glue, but it did not have line-breaking, page-breaking, mathematical formulas, alignment routines, error recovery, or the present semantic nest; furthermore, it used character lists instead of token lists, so that a control sequence like \halign
was represented by a list of seven characters.
A complete version of was designed and coded by the author in late 1977 and early 1978; that program, like its prototype, was written in the SAIL language, for which an excellent debugging system was available.
Preliminary plans to convert the SAIL code into a form somewhat like the present “web” were developed by Luis Trabb Pardo and the author at the beginning of 1979, and a complete implementation was created by Ignacio A. Zabala in 1979 and 1980.
The program, which was written by the author during the latter part of 1981 and the early part of 1982, also incorporates ideas from the 1979 implementation of in MESA that was written by Leonidas Guibas, Robert Sedgewick, and Douglas Wyatt at the Xerox Palo Alto Research Center.
Several hundred refinements were introduced into based on the experiences gained with the original implementations, so that essentially every part of the system has been substantially improved.
After the appearance of “Version 0” in September 1982, this program benefited greatly from the comments of many other people, notably David R. Fuchs and Howard W. Trickey.
A final revision in September 1989 extended the input character set to eight-bit codes and introduced the ability to hyphenate words from different languages, based on some ideas of Michael J. Ferguson.
No doubt there still is plenty of room for improvement, but the author is firmly committed to keeping “frozen” from now on; stability and reliability are to be its main virtues.
On the other hand, the WEB
description can be extended without changing the core of itself, and the program has been designed so that such extensions are not extremely difficult to make.
The BANNER string defined here should be changed whenever undergoes any modifications, so that it will be clear which version of might be the guilty party when a problem arises.
If this program is changed, the resulting system should not be called ‘’; the official name ‘’ by itself is reserved for software systems that are fully compatible with each other.
A special test suite called the “TRIP
test” is available for helping to determine whether a particular implementation deserves to be known as ‘’ [cf. Stanford Computer Science report CS1027, November 1984].
Constants are written in capital letters.
This C implementation passes the
TRIP
test, and did not change anything of the core of so it can be called .
// << Start file |constants.h|, 1381 >>
#define BANNER "This is TeX, Version 3.141592653" // printed when TeX starts
Section 3
Different Pascal s have slightly different conventions, and the present program expresses in terms of the Pascal that was available to the author in 1982. Constructions that apply to this particular compiler, which we shall call Pascal-H, should help the reader see how to make an appropriate interface for other systems if necessary. (Pascal-H is Charles Hedrick’s modification of a compiler for the DECsystem-10 that was originally developed at the University of Hamburg; cf. Software—Practice and Experience 6 (1976), 29–42. The program below is intended to be adaptable, without extensive changes, to most other versions of Pascal, so it does not fully use the admirable features of Pascal-H. Indeed, a conscious effort has been made here to avoid using several idiosyncratic features of standard Pascal itself, so that most of the code can be translated mechanically into other high-level languages. For example, the ‘with’ and ‘new’ features are not used, nor are pointer types, set types, or enumerated scalar types; there are no ‘var’ parameters, except in the case of files; there are no tag fields on variant records; there are no assignments real ← integer; no procedures are declared local to other procedures.)
The portions of this program that involve system-dependent code, where changes might be necessary because of differences between Pascal compilers and/or differences between operating systems, can be identified by looking at the sections whose numbers are listed under ‘system dependencies’ in the index. Furthermore, the index entries for ‘dirty Pascal’ list all places where the restrictions of Pascal have not been followed perfectly, for one reason or another.
Incidentally, Pascal’s standard round function can be problematical, because it disagrees with the IEEE floating-point standard. Many implementors have therefore chosen to substitute their own home-grown rounding procedure.
Section 4
The program begins with a normal Pascal program heading, whose components will be filled in later, using the conventions of WEB
.
For example, the portion of the program called ‘⟨ Global variables, 13 ⟩’ below will be replaced by a sequence of variable declarations that starts in Section 13 of this documentation.
In this way, we are able to define each individual global variable when we are prepared to understand what it means; we do not have to define all of the globals at once.
Cross references in section 13, where it says “See also
sections 26, 30, ”, also make it possible to look at the set of all global variables, if desired.
Similar remarks apply to the other portions of the program heading.
Actually the heading shown here is not quite normal: The program line does not mention any output file, because Pascal-H would ask the user to specify a file name if output were specified here.
A few named blocks are not needed:
- ⟨ Compiler directives 9 ⟩;
- ⟨ Label in the outer block 6 ⟩.
Also, other named blocks are removed and replaced by a filename. Here, for example:
- ⟨ Constants in the outer block 11 ⟩: in header file
constants.h
;- ⟨ Types in the outer block 18 ⟩: in
tex.h
;- ⟨ Global variables, 13 ⟩: in
global.c
(variables are declaredextern
intex.h
, see section 1380);- ⟨ Basic printing procedures 57 ⟩: in
basic_printing.c
;- ⟨ Error handling procedures 78 ⟩: in
error.c
.The same happens for others later.
// << Start file |tex.h|, 1381 >>
// << Types in the outer block, 18 >>
// << Start file |global.c|, 1382 >>
// << Global variables, 13 >>
// << Start file |init_cleanup.c|, 1382 >>
void initialize() {
// << Local variables for initialization, 19 >>
// << Initialize whatever TeX might access, 8 >>
}
Section 5
The overall program begins with the heading just shown, after which comes a bunch of procedure declarations and function declarations.
Finally we will get to the main program, which begins with the comment ‘start_here’.
If you want to skip down to the main program now, you can look up ‘start_here’ in the index.
But the author suggests that the best way to understand this program is to follow pretty much the order of ’s components as they appear in the WEB
description you are now reading, since the present ordering is intended to combine the advantages of the “bottom up” and “top down” approaches to the problem of understanding a somewhat complicated system.
Section 6
Three labels must be declared in the main program, so we give them symbolic names.
No need to declare labels, only local labels (such as
found
will be declared when needed).
Section 7
Some of the code below is intended to be used only when diagnosing the strange behavior that sometimes occurs when is being installed or when system wizards are fooling around with without quite knowing what they are doing. Such code will not normally be compiled; it is delimited by the codewords ‘debug gubed’, with apologies to people who wish to preserve the purity of English.
Similarly, there is some conditional code delimited by ‘stat tats’ that is intended for use when statistics are to be kept about ’s memory usage.
The stat tats code also implements diagnostic information for \tracingparagraphs
, \tracingpages
, and \tracingrestores
.
Macro preprocessor
#ifdef
is used to define those blocks:
#ifdef DEBUG
for debug;#ifdef STAT
for stat;- The closing is only
#endif
. The flags-DDEBUG
and-DSTAT
can be used to activate them at compilation time (see the Makefile).
Section 8
This program has two important variations:
(1) There is a long and slow version called INITEX
, which does the extra calculations needed to initialize ’s internal tables;
and (2) there is a shorter and faster production version, which cuts the initialization to a bare minimum.
Parts of the program that are needed in (1) but not in (2) are delimited by the codewords ‘init tini’.
The specific code for
INITEX
is inside block#ifdef INIT
.
⟨ Initialize whatever TeX might access 8 ⟩≡
// << Set initial values of key variables, 21 >>
#ifdef INIT
// << Initialize table entries (done by INITEX only), 164 >>
#endif
Section 9
If the first character of a Pascal comment is a dollar sign, Pascal-H treats the comment as a list of “compiler directives” that will affect the translation of this program into machine language. The directives shown below specify full checking and inclusion of the Pascal debugger when is being debugged, but they cause range checking and other redundant code to be eliminated when the production system is being generated. Arithmetic overflow will be detected in all cases.
No ⟨ Compiler directives 9 ⟩.
Section 10
This implementation conforms to the rules of the Pascal User Manual published by Jensen and Wirth in 1975, except where system-dependent code is necessary to make a useful system program, and except in another respect where such conformity would unnecessarily obscure the meaning and clutter up the code: We assume that case statements may include a default case that applies if no matching label is found. Thus, we shall use constructions like
case x of
1: ⟨ code for x = 1 ⟩;
3: ⟨ code for x = 3 ⟩;
othercases ⟨ code for x 1 and x 3 ⟩
endcases
since most Pascal compilers have plugged this hole in the language by incorporating some sort of default mechanism. For example, the Pascal-H compiler allows ‘others:’ as a default label, and other Pascals allow syntaxes like ‘else’ or ‘otherwise’ or ‘otherwise:’, etc. The definitions of othercases and endcases should be changed to agree with local conventions. Note that no semicolon appears before endcases in this program, so the definition of endcases should include a semicolon if the compiler wants one. (Of course, if no default mechanism is available, the case statements of will have to be laboriously extended by listing all remaining cases. People who are stuck with such Pascals have, in fact, done this, successfully but not happily!)
In C, we use
switch
, with the break keyword to end a case. Multiple cases can be combined with subsequentscase something:
. The default case isdefault:
.
Section 11
The following parameters can be changed at compile time to extend or reduce ’s capacity.
They may have different values in INITEX
and in production versions of .
Specific values for the
TRIP
test have been added in blocks#ifdef TRIP
.
#ifdef TRIP
#define MEM_MAX 3000
#define MEM_MIN 1
#else
#define MEM_MAX 30000 // greatest index in TeX's internal |mem| array; must be strictly less than |MAX_HALFWORD|; must be equal to |MEM_TOP| in INITEX, otherwise |>= MEM_TOP|
#define MEM_MIN 0 // smallest index in TeX's internal |mem| array; must be |MIN_HALFWORD| or more; must be equal to |MEM_BOT| in INITEX, otherwise |<= MEM_BOT|
#endif
#define BUF_SIZE 200000// maximum number of characters simultaneously present in current lines of open files and in control sequences between \csname and \endcsname; must not exceed |MAX_HALFWORD|
#ifdef TRIP
#define ERROR_LINE 64
#define HALF_ERROR_LINE 32
#define MAX_PRINT_LINE 72
#else
#define ERROR_LINE 72 // width of context lines on terminal error messages
#define HALF_ERROR_LINE 42 // width of first lines of contexts in terminal error messages; should be between 30 and |ERROR_LINE - 15|
#define MAX_PRINT_LINE 79 // width of longest text lines output; should be at least 60
#endif
#define STACK_SIZE 200 // maximum number of simultaneous input sources
#define MAX_IN_OPEN 6 // maximum number of input files and error insertions that can be going on simultaneously
#define FONT_MAX 75 // maximum internal font number; must not exceed |MAX_QUARTERWORD| and must be at most |FONT_BASE + 256|
#define FONT_MEM_SIZE 20000 // number of words of |font_info| for all fonts
#define PARAM_SIZE 60 // maximum number of simultaneous macro parameters
#define NEST_SIZE 40 // maximum number of semantic levels simultaneously active
#define MAX_STRINGS 3000 // maximum number of strings; must not exceed |MAX_HALFWORD|
#define POOL_SIZE 32000 // maximum number of characters in strings, including all error messages and help texts, and the names of all fonts and control sequences;
#define SAVE_SIZE 600 // space for saving values outside of current group; must be at most |MAX_HALFWORD|
#define TRIE_SIZE 8000 // space for hyphenation patterns; should be larger for INITEX than it is in production versions of TeX
#define TRIE_OP_SIZE 500 // space for "opcodes" in the hyphenation patterns
#define DVI_BUF_SIZE 800 // size of the output buffer; must be a multiple of 8
#define FILE_NAME_SIZE 40 // file names shouldn't be longer than this
Section 12
Like the preceding parameters, the following quantities can be changed at compile time to extend or reduce ’s capacity.
But if they are changed, it is necessary to rerun the initialization program INITEX
to generate new tables for the production program.
One can’t simply make helter-skelter changes to the following constants, since certain rather complex initialization numbers are computed from them.
They are defined here using WEB
macros, instead of being put into Pascal’s const list, in order to emphasize this distinction.
#ifdef TRIP
#define MEM_BOT 1
#define MEM_TOP 3000
#else
#define MEM_BOT 0 // smallest index in the |mem| array dumped by INITEX must not be less than |MEM_MIN|
#define MEM_TOP 30000 // largest index in the |mem| array dumped by INITEX; must be substantially larger than |MEM_BOT| and not greater than |MEM_MAX|
#endif
#define FONT_BASE 0 // smallest internal font number; must not be less than |MIN_QUARTERWORD|
#define HASH_SIZE 2100 // maximum number of control sequences; it should be at most about |(MEM_MAX - MEM_MIN)/10|
#define HASH_PRIME 1777 // a prime number equal to about 85% of |HASH_SIZE|
#define HYPH_SIZE 307 // another prime; the number of \hyphenation exceptions
Section 13
In case somebody has inadvertently made bad settings of the “constants”, checks them using a global variable called bad.
This is the first of many sections of where global variables are defined.
⟨ Global variables 13 ⟩≡
int bad; // is some "constant" wrong?
Section 14
Later on we will say ‘if MEM_MAX MAX_HALFWORD then bad ← 14’, or something similar. (We can’t do that until MAX_HALFWORD has been defined.)
⟨ Check the “constant” values for consistency 14 ⟩≡
bad = 0;
if (HALF_ERROR_LINE < 30 || HALF_ERROR_LINE > ERROR_LINE - 15) {
bad = 1;
}
if (MAX_PRINT_LINE < 60) {
bad = 2;
}
if (DVI_BUF_SIZE % 8 != 0) {
bad = 3;
}
if (MEM_BOT + 1100 > MEM_TOP) {
bad = 4;
}
if (HASH_PRIME > HASH_SIZE) {
bad = 5;
}
if (MAX_IN_OPEN >= 128) {
bad = 6;
}
if (MEM_TOP < 256 + 11) {
bad = 7; // we will want |NULL_LIST > 255|
}
Section 15
Labels are given symbolic names by the following definitions, so that occasional goto statements will be meaningful. We insert the label ‘exit’ just before the ‘end’ of a procedure in which we have used the ‘return’ statement defined below; the label ‘restart’ is occasionally used at the very beginning of a procedure; and the label ‘reswitch’ is occasionally used just prior to a case statement in which some cases change the conditions and we wish to branch to the newly applicable case. Loops that are set up with the loop construction defined below are commonly exited by going to ‘done’ or to ‘found’ or to ‘not_found’, and they are sometimes repeated by going to ‘continue’. If two or more parts of a subroutine start differently but end up the same, the shared code may be gathered together at ‘common_ending’.
Incidentally, this program never declares a label that isn’t actually used, because some fussy Pascal compilers will complain about redundant labels.
The use of label ‘exit’ is replaced directly by a
return
, except in at least one case.Sometimes, the label ‘done’ is replaced by a
break
when possible. The position of the label is still given in a comment for information.The ‘loop’ is replaced by
while(true)
.Finally, the label ‘continue’ is called ‘continue_lbl’ because it is a C keyword, except when it can be used directly as the keyword
continue
.
Section 16
Here are some macros for common programming idioms.
odd and abs definitions are added here.
#define EMPTY 0 // symbolic name for a null constant
#define incr(X) (X) += 1 // increase a variable by unity
#define decr(X) (X) -= 1 // decrease a variable by unity
#define negate(X) (X) = -(X) // change the sign of a variable
#define do_nothing
#define odd(X) ((X) & 1)
#define abs(X) (((X) >= 0) ? (X) : -(X))