Section 289: Token lists

A $T E X$ token is either a character or a control sequence, and it is represented internally in one of two ways: (1) A character whose ASCII code number is c and whose command code is m is represented as the number $2^{8}$ m + c; the command code is in the range 1 $\leq$ m $\leq$ 14. (2) A control sequence whose eqtb address is p is represented as the number CS_TOKEN_FLAG + p. Here CS_TOKEN_FLAG = $2^{12} - 1$ is larger than $2^{8}$ m + c, yet it is small enough that CS_TOKEN_FLAG + p $<$ MAX_HALFWORD; thus, a token fits comfortably in a halfword.

A token t represents a LEFT_BRACE command if and only if t $<$ LEFT_BRACE_LIMIT; it represents a RIGHT_BRACE command if and only if we have LEFT_BRACE_LIMIT $\leq$ t $<$ RIGHT_BRACE_LIMIT; and it represents a MATCH or END_MATCH command if and only if MATCH_TOKEN $\leq$ t $\leq$ END_MATCH_TOKEN. The following definitions take care of these token-oriented constants and a few others.

constants.h

#define CS_TOKEN_FLAG      0xfff  // amount added to the |eqtb| location in a token that stands for a control sequence; is a multiple of 256, less 1
#define LEFT_BRACE_TOKEN   0x100  // 2^8 * |LEFT_BRACE|
#define LEFT_BRACE_LIMIT   0x200  // 2^8 * (|LEFT_BRACE| + 1)
#define RIGHT_BRACE_TOKEN  0x200  // 2^8 * |RIGHT_BRACE|
#define RIGHT_BRACE_LIMIT  0x300  // 2^8 * (|RIGHT_BRACE| + 1)
#define MATH_SHIFT_TOKEN   0x300  // 2^8 * |MATH_SHIFT|
#define TAB_TOKEN          0x400  // 2^8 * |TAB_MARK|
#define OUT_PARAM_TOKEN    0x500  // 2^8 * |OUT_PARAM|
#define SPACE_TOKEN        0xa20  // 2^8 * |SPACER| + |' '|
#define LETTER_TOKEN       0xb00  // 2^8 * |LETTER|
#define OTHER_TOKEN        0xc00  // 2^8 * |OTHER_CHAR|
#define MATCH_TOKEN        0xd00  // 2^8 * |MATCH|
#define END_MATCH_TOKEN    0xe00  // 2^8 * |END_MATCH|

Section 290

⟨ Check the “constant” values for consistency 14 ⟩+≡

if (CS_TOKEN_FLAG + UNDEFINED_CONTROL_SEQUENCE > MAX_HALFWORD) {
    bad = 21;
}

Section 291

A token list is a singly linked list of one-word nodes in mem, where each word contains a token and a link. Macro definitions, output-routine definitions, marks, \write texts, and a few other things are remembered by $T E X$ in the form of token lists, usually preceded by a node with a reference count in its token_ref_count field. The token stored in location p is called info(p).

Three special commands appear in the token lists of macro definitions. When m = MATCH, it means that $T E X$ should scan a parameter for the current macro; when m = END_MATCH, it means that parameter matching should end and $T E X$ should start reading the macro text; and when m = OUT_PARAM, it means that $T E X$ should insert parameter number c into the text at this point.

The enclosing { and } characters of a macro definition are omitted, but an output routine will be enclosed in braces.

Here is an example macro definition that illustrates these conventions. After $T E X$ processes the text

\def\mac a#1#2 \b {#1\-a ##1#2 #2}

the definition of \mac is represented as a token list containing

(reference count), LETTER a, MATCH #, MATCH #, SPACER ␣, \b, END_MATCH,
OUT_PARAM 1, \-, LETTER a, SPACER ␣, MAC_PARAM #, OTHER_CHAR 1,
OUT_PARAM 2, SPACER ␣, OUT_PARAM 2.

The procedure scan_toks builds such token lists, and macro_call does the parameter matching.

Examples such as

\def\m{\def\m{a} b}

explain why reference counts would be needed even if $T E X$ had no \let operation: When the token list for \m is being read, the redefinition of \m changes the eqtb entry before the token list has been fully consumed, so we dare not simply destroy a token list when its control sequence is being redefined.

If the parameter-matching part of a definition ends with ‘#{’, the corresponding token list will have ‘{’ just before the ‘END_MATCH’ and also at the very end. The first ‘{’ is used to delimit the parameter; the second one keeps the first from disappearing.

Section 292

The procedure show_token_list, which prints a symbolic form of the token list that starts at a given node p, illustrates these conventions. The token list being displayed should not begin with a reference count. However, the procedure is intended to be robust, so that if the memory links are awry or if p is not really a pointer to a token list, nothing catastrophic will happen.

An additional parameter q is also given; this parameter is either null or it points to a node in the token list where a certain magic computation takes place that will be explained later. (Basically, q is non-null when we are printing the two-line context information at the time of an error message; q marks the place corresponding to where the second line should begin.)

For example, if p points to the node containing the first a in the token list above, then show_token_list will print the string

‘a#1#2 \b ->#1\-a ##1#2 #2’;

and if q points to the node containing the second a, the magic computation will be performed just before the second a is printed.

The generation will stop, and ‘\ETC.’ will be printed, if the length of printing exceeds a given limit l. Anomalous entries are printed in the form of control sequences that are not followed by a blank space, e.g., ‘\BAD.’; this cannot be confused with actual control sequences because a real control sequence named BAD would come out ‘\BAD ’.

display_tokens.c

// << Start file |display_tokens.c|, 1382 >>

void show_token_list(int p, int q, int l) {
    int m, c;             // pieces of a token
    ASCII_code match_chr; // character used in a '|MATCH|'
    ASCII_code n;         // the highest parameter number, as an ASCII digit}
    match_chr = '#';
    n = '0';
    tally = 0;
    while (p != null && tally < l) {
        if (p == q) {
            // << Do magic computation, 320 >>
        }
        // << Display token |p|, and |return| if there are problems, 293 >>
        p = link(p);
    }
    if (p != null) {
        print_esc("ETC.");
    }
}

Section 293

⟨ Display token p, and return if there are problems 293 ⟩≡

if ( p < hi_mem_min || p > mem_end) {
    print_esc("CLOBBERED.");
    return;
}
if (info(p) >= CS_TOKEN_FLAG) {
    print_cs(info(p) - CS_TOKEN_FLAG);
}
else {
    m = info(p) / 256;
    c = info(p) % 256;
    if (info(p) < 0) {
        print_esc("BAD.");
    }
    else {
        // << Display the token (|m|, |c|), 294 >>
    }
}

Section 294

The procedure usually “learns” the character code used for macro parameters by seeing one in a MATCH command before it runs into any OUT_PARAM commands.

⟨ Display the token (m, c) 294 ⟩≡

switch(m) {
case LEFT_BRACE:
case RIGHT_BRACE:
case MATH_SHIFT:
case TAB_MARK:
case SUP_MARK:
case SUB_MARK:
case SPACER:
case LETTER:
case OTHER_CHAR:
    print_strnumber(c);
    break;

case MAC_PARAM:
    print_strnumber(c);
    print_strnumber(c);
    break;

case OUT_PARAM:
    print_strnumber(match_chr);
    if (c <= 9) {
        print_char(c + '0');
    }
    else {
        print_char('!');
        return;
    }
    break;

case MATCH:
    match_chr = c;
    print_strnumber(c);
    incr(n);
    print_char(n);
    if (n > '9') {
        return;
    }
    break;

case END_MATCH:
    print("->");
    break;

default:
    print_esc("BAD.");
}

Section 295

Here’s the way we sometimes want to display a token list, given a pointer to its reference count; the pointer may be null.

display_tokens.c

void token_show(pointer p) {
    if (p != null) {
        show_token_list(link(p), null, 10000000);
    }
}

Section 296

The print_meaning subroutine displays cur_cmd and cur_chr in symbolic form, including the expansion of a macro or mark.

display_tokens.c

void print_meaning() {
    print_cmd_chr(cur_cmd, cur_chr);
    if (cur_cmd >= CALL) {
        print_char(':');
        print_ln();
        token_show(cur_chr);
    }
    else if (cur_cmd == TOP_BOT_MARK) {
        print_char(':');
        print_ln();
        token_show(cur_mark[cur_chr]);
    }
}

TeX in C