Section 300: Input stacks and states

This implementation of $T E X$ uses two different conventions for representing sequential stacks.

If there is frequent access to the top entry, and if the stack is essentially never empty, then the top entry is kept in a global variable (even better would be a machine register), and the other entries appear in the array stack[0 .. (ptr − 1)]. For example, the semantic stack described above is handled this way, and so is the input stack that we are about to study.
If there is infrequent top access, the entire stack contents are in the array stack[0 ..(ptr − 1)]. For example, the save_stack is treated this way, as we have seen.

The state of $T E X$ ’s input mechanism appears in the input stack, whose entries are records with six fields, called state, index, start, loc, limit, and name. This stack is maintained with convention (1), so it is declared in the following way:

⟨ Types in the outer block 18 ⟩+≡

typedef struct {
    quarterword state_field;
    quarterword index_field;
    halfword start_field;
    halfword loc_field;
    halfword limit_field;
    halfword name_field;
} in_state_record;

Section 301

⟨ Global variables 13 ⟩+≡

in_state_record input_stack[STACK_SIZE + 1];
int input_ptr;             // first unused location of |input_stack|
int max_in_stack;          // largest value of |input_ptr| when pushing
in_state_record cur_input; // the "top" input state, according to convention (1)

Section 302

We’ve already defined the special variable loc = cur_input.loc_field in our discussion of basic input-output routines. The other components of cur_input are defined in the same way:

datastructures.h

#define state cur_input.state_field // current scanner state
#define index cur_input.index_field // reference for buffer information
#define start cur_input.start_field // starting position in |buffer|
#define limit cur_input.limit_field // end of current line in |buffer|
#define name  cur_input.name_field  // name of the current file

Section 303

Let’s look more closely now at the control variables (state, index, start, loc, limit, name), assuming that $T E X$ is reading a line of characters that have been input from some file or from the user’s terminal. There is an array called buffer that acts as a stack of all lines of characters that are currently being read from files, including all lines on subsidiary levels of the input stack that are not yet completed. $T E X$ will return to the other lines when it is finished with the present input file.

(Incidentally, on a machine with byte-oriented addressing, it might be appropriate to combine buffer with the str_pool array, letting the buffer entries grow downward from the top of the string pool and checking that these two tables don’t bump into each other.)

The line we are currently working on begins in position start of the buffer; the next character we are about to read is buffer[loc]; and limit is the location of the last character present. If loc $>$ limit, the line has been completely read. Usually buffer[limit] is the end_line_char, denoting the end of a line, but this is not true if the current line is an insertion that was entered on the user’s terminal in response to an error message.

The name variable is a string number that designates the name of the current file, if we are reading a text file. It is zero if we are reading from the terminal; it is n + 1 if we are reading from input stream n, where 0 $\leq$ n $\leq$ 16. (Input stream 16 stands for an invalid stream number; in such cases the input is actually from the terminal, under control of the procedure read_toks.)

The state variable has one of three values, when we are scanning such files:

state = MID_LINE is the normal state.
state = SKIP_BLANKS is like MID_LINE, but blanks are ignored.
state = NEW_LINE is the state at the beginning of a line.

These state values are assigned numeric codes so that if we add the state code to the next character’s command code, we get distinct values. For example, ‘MID_LINE + SPACER’ stands for the case that a blank space character occurs in the middle of a line when it is not being ignored; after this case is processed, the next value of state will be SKIP_BLANKS.

constants.h

#define MID_LINE    1                                   // |state| code when scanning a line of characters
#define SKIP_BLANKS (2 + MAX_CHAR_CODE)                 // |state| code when ignoring blanks
#define NEW_LINE    (3 + MAX_CHAR_CODE + MAX_CHAR_CODE) // |state| code at start of line

Section 304

Additional information about the current line is available via the index variable, which counts how many lines of characters are present in the buffer below the current level. We have index = 0 when reading from the terminal and prompting the user for each line; then if the user types, e.g., ‘\input paper’, we will have index = 1 while reading the file paper.tex. However, it does not follow that index is the same as the input stack pointer, since many of the levels on the input stack may come from token lists. For example, the instruction ‘\input paper’ might occur in a token list.

The global variable in_open is equal to the index value of the highest non-token-list level. Thus, the number of partially read lines in the buffer is in_open + 1, and we have in_open = index when we are not reading a token list.

If we are not currently reading from the terminal, or from an input stream, we are reading from the file variable input_file[index]. We use the notation terminal_input as a convenient abbreviation for name = 0, and cur_file as an abbreviation for input_file[index].

The global variable line contains the line number in the topmost open file, for use in error messages. If we are not reading from the terminal, line_stack[index] holds the line number for the enclosing level, so that line can be restored when the current file has been read. Line numbers should never be negative, since the negative of the current line number is used to identify the user’s output routine in the mode_line field of the semantic nest entries.

If more information about the input state is needed, it can be included in small arrays like those shown here. For example, the current page or segment number in the input file might be put into a variable page, maintained for enclosing levels in ‘page_stack: array [1 .. MAX_IN_OPEN] of integer’ by analogy with line_stack.

NOTE

input_file and line_stack are indexed from 1, so pointer arithmetic again.

datastructures.h

#define terminal_input (name == 0)       // are we reading from the terminal?
#define cur_file       input_file[index] // the current |alpha_file| variable

⟨ Global variables 13 ⟩+≡

int in_open;     // the number of lines in the buffer, less one
int open_parens; // the number of open text files
alpha_file input_file0[MAX_IN_OPEN];
alpha_file *input_file = input_file0 - 1;
int line; // current line number in the current source file
int line_stack0[MAX_IN_OPEN];
int *line_stack = line_stack0 - 1;

Section 305

Users of $T E X$ sometimes forget to balance left and right braces properly, and one of the ways $T E X$ tries to spot such errors is by considering an input file as broken into subfiles by control sequences that are declared to be \outer.

A variable called scanner_status tells $T E X$ whether or not to complain when a subfile ends. This variable has six possible values:

NORMAL, means that a subfile can safely end here without incident.
SKIPPING, means that a subfile can safely end here, but not a file, because we’re reading past some conditional text that was not selected.
DEFINING, means that a subfile shouldn’t end now because a macro is being defined.
MATCHING, means that a subfile shouldn’t end now because a macro is being used and we are searching for the end of its arguments.
ALIGNING, means that a subfile shouldn’t end now because we are not finished with the preamble of an \halign or \valign.
ABSORBING, means that a subfile shouldn’t end now because we are reading a balanced token list for \message, \write, etc.

If the scanner_status is not NORMAL, the variable warning_index points to the eqtb location for the relevant control sequence name to print in an error message.

constants.h

#define SKIPPING  1 // |scanner_status| when passing conditional text
#define DEFINING  2 // |scanner_status| when reading a macro definition
#define MATCHING  3 // |scanner_status| when reading macro arguments
#define ALIGNING  4 // |scanner_status| when reading an alignment preamble
#define ABSORBING 5 // |scanner_status| when reading a balanced text

⟨ Global variables 13 ⟩+≡

int scanner_status;    // can a subfile end now?
pointer warning_index; // identifier relevant to non-|NORMAL| scanner status
pointer def_ref;       // reference count of token list being defined

Section 306

Here is a procedure that uses scanner_status to print a warning message when a subfile has ended, and at certain other crucial times:

stack.c

void runaway() {
    pointer p = null; // head of runaway list
    if (scanner_status > SKIPPING) {
        print_nl("Runaway ");
        switch (scanner_status) {
        case DEFINING:
            print("definition");
            p = def_ref;
            break;
        
        case MATCHING:
            print("argument");
            p = TEMP_HEAD;
            break;
        
        case ALIGNING:
            print("preamble");
            p = HOLD_HEAD;
            break;
        
        case ABSORBING:
            print("text");
            p = def_ref;
        } // there are no other cases
        print_char('?');
        print_ln();
        show_token_list(link(p), null, ERROR_LINE - 10);
    }
}

Section 307

However, all this discussion about input state really applies only to the case that we are inputting from a file. There is another important case, namely when we are currently getting input from a token list. In this case state = TOKEN_LIST, and the conventions about the other state variables are different:

loc is a pointer to the current node in the token list, i.e., the node that will be read next. If loc = null, the token list has been fully read.
start points to the first node of the token list; this node may or may not contain a reference count, depending on the type of token list involved.
token_type, which takes the place of index in the discussion above, is a code number that explains what kind of token list is being scanned.
name points to the eqtb address of the control sequence being expanded, if the current token list is a macro.
param_start, which takes the place of limit, tells where the parameters of the current macro begin in the param_stack, if the current token list is a macro.

The token_type can take several values, depending on where the current token list came from:

PARAMETER, if a parameter is being scanned;
U_TEMPLATE, if the $⟨ u_{j} ⟩$ part of an alignment template is being scanned;
V_TEMPLATE, if the $⟨ v_{j} ⟩$ part of an alignment template is being scanned;
BACKED_UP, if the token list being scanned has been inserted as ‘to be read again’;
INSERTED, if the token list being scanned has been inserted as the text expansion of a \count or similar variable;
MACRO, if a user-defined control sequence is being scanned;
OUTPUT_TEXT, if an \output routine is being scanned;
EVERY_PAR_TEXT, if the text of \everypar is being scanned;
EVERY_MATH_TEXT, if the text of \everymath is being scanned;
EVERY_DISPLAY_TEXT, if the text of \everydisplay is being scanned;
EVERY_HBOX_TEXT, if the text of \everyhbox is being scanned;
EVERY_VBOX_TEXT, if the text of \everyvbox is being scanned;
EVERY_JOB_TEXT, if the text of \everyjob is being scanned;
EVERY_CR_TEXT, if the text of \everycr is being scanned;
MARK_TEXT, if the text of a \mark is being scanned;
WRITE_TEXT, if the text of a \write is being scanned.

The codes for OUTPUT_TEXT, EVERY_PAR_TEXT, etc., are equal to a constant plus the corresponding codes for token list parameters OUTPUT_ROUTINE_LOC, EVERY_PAR_LOC, etc. The token list begins with a reference count if and only if token_type $\geq$ MACRO.

datastructures.h

#define token_type  index // type of current token list
#define param_start limit // base of macro parameters in |param_stack|

constants.h

#define TOKEN_LIST         0  // |state| code when scanning a token list
#define PARAMETER          0  // |token_type| code for parameter
#define U_TEMPLATE         1  // |token_type| code for <u_j> template
#define V_TEMPLATE         2  // |token_type| code for <v_j> template
#define BACKED_UP          3  // |token_type| code for text to be reread
#define INSERTED           4  // |token_type| code for inserted texts
#define MACRO              5  // |token_type| code for defined control sequences
#define OUTPUT_TEXT        6  // |token_type| code for output routines
#define EVERY_PAR_TEXT     7  // |token_type| code for \everypar
#define EVERY_MATH_TEXT    8  // |token_type| code for \everymath
#define EVERY_DISPLAY_TEXT 9  // |token_type| code for \everydisplay
#define EVERY_HBOX_TEXT    10 // |token_type| code for \everyhbox
#define EVERY_VBOX_TEXT    11 // |token_type| code for \everyvbox
#define EVERY_JOB_TEXT     12 // |token_type| code for \everyjob
#define EVERY_CR_TEXT      13 // |token_type| code for \everycr
#define MARK_TEXT          14 // |token_type| code for \topmark, etc.
#define WRITE_TEXT         15 // |token_type| code for \write

Section 308

The param_stack is an auxiliary array used to hold pointers to the token lists for parameters at the current level and subsidiary levels of input. This stack is maintained with convention (2), and it grows at a different rate from the others.

⟨ Global variables 13 ⟩+≡

pointer param_stack[PARAM_SIZE + 1]; // token list pointers for parameters
int param_ptr;                       // first unused entry in |param_stack|
int max_param_stack;                 // largest value of |param_ptr|, will be |<= PARAM_SIZE + 9|

Section 309

The input routines must also interact with the processing of halign and \valign, since the appearance of tab marks and \cr in certain places is supposed to trigger the beginning of special $⟨ v_{j} ⟩$ template text in the scanner. This magic is accomplished by an align_state variable that is increased by 1 when a ‘{’ is scanned and decreased by 1 when a ‘}’ is scanned. The align_state is nonzero during the $⟨ u_{j} ⟩$ template, after which it is set to zero; the $⟨ v_{j} ⟩$ template begins when a tab mark or \cr occurs at a time that align_state = 0.

⟨ Global variables 13 ⟩+≡

int align_state; // group level with respect to current alignment

Section 310

Thus, the “current input state” can be very complicated indeed; there can be many levels and each level can arise in a variety of ways. The show_context procedure, which is used by $T E X$ ’s error-reporting routine to print out the current input state on all levels down to the most recent line of characters from an input file, illustrates most of these conventions. The global variable base_ptr contains the lowest level that was displayed by this procedure.

⟨ Global variables 13 ⟩+≡

int base_ptr; // shallowest level shown by |show_context|

Section 311

The status at each level is indicated by printing two lines, where the first line indicates what was read so far and the second line shows what remains to be read. The context is cropped, if necessary, so that the first line contains at most HALF_ERROR_LINE characters, and the second contains at most ERROR_LINE. Non-current input levels whose token_type is ‘BACKED_UP’ are shown only if they have not been fully read.

display_context.c

// << Start file |display_context.c|, 1382 >>

// prints where the scanner is
void show_context() {
    int old_setting;  // saved |selector| setting
    int nn;           // number of contexts shown so far, less one
    bool bottom_line; // have we reached the final context to be shown?
    // << Local variables for formatting calculations, 315 >>
    
    base_ptr = input_ptr;
    input_stack[base_ptr] = cur_input; // store current state
    nn = -1;
    bottom_line = false;
    while(true) {
        cur_input = input_stack[base_ptr]; // enter into the context
        if (state != TOKEN_LIST
            && (name > 17 || base_ptr == 0))
        {
            bottom_line = true;
        }
        if (base_ptr == input_ptr
            || bottom_line
            || nn < error_context_lines)
        {
            // << Display the current context, 312 >>
        }
        else if (nn == error_context_lines) {
            print_nl("...");
            incr(nn); // omitted if |error_context_lines < 0|
        }
        if (bottom_line) {
            break; // Goto done
        }
        decr(base_ptr);
    }
    // done:
    cur_input = input_stack[input_ptr]; // restore original state
}

Section 312

⟨ Display the current context 312 ⟩≡

if (base_ptr == input_ptr
    || state != TOKEN_LIST
    || token_type != BACKED_UP
    || loc != null)
{
    // we omit backed-up token lists that have already been read
    tally = 0; // get ready to count characters
    old_setting = selector;
    if (state != TOKEN_LIST) {
        // << Print location of current line, 313 >>
        // << Pseudoprint the line, 318 >>
    }
    else {
        // << Print type of token list, 314 >>
        // << Pseudoprint the token list, 319 >>
    }
    selector = old_setting; // stop pseudoprinting
    // << Print two lines using the tricky pseudoprinted information, 317 >>
    incr(nn);
}

Section 313

This routine should be changed, if necessary, to give the best possible indication of where the current line resides in the input file. For example, on some systems it is best to print both a page and line number.

⟨ Print location of current line 313 ⟩≡

if (name <= 17) {
    if (terminal_input) {
        if (base_ptr == 0) {
            print_nl("<*>");
        }
        else {
            print_nl("<insert> ");
        }
    }
    else {
        print_nl("<read ");
        if (name == 17) {
            print_char('*');
        }
        else {
            print_int(name - 1);
        }
        print_char('>');
    }
}
else {
    print_nl("l.");
    print_int(line);
}
print_char(' ');

Section 314

⟨ Print type of token list 314 ⟩≡

switch (token_type) {
case PARAMETER:
    print_nl("<argument> ");
    break;

case U_TEMPLATE:
case V_TEMPLATE:
    print_nl("<template> ");
    break;

case BACKED_UP:
    if (loc == null) {
        print_nl("<recently read> ");
    }
    else {
        print_nl("<to be read again> ");
    }
    break;

case INSERTED:
    print_nl("<inserted text> ");
    break;

case MACRO:
    print_ln();
    print_cs(name);
    break;

case OUTPUT_TEXT:
    print_nl("<output> ");
    break;

case EVERY_PAR_TEXT:
    print_nl("<everypar> ");
    break;

case EVERY_MATH_TEXT:
    print_nl("<everymath> ");
    break;

case EVERY_DISPLAY_TEXT:
    print_nl("<everydisplay> ");
    break;

case EVERY_HBOX_TEXT:
    print_nl("<everyhbox> ");
    break;

case EVERY_VBOX_TEXT:
    print_nl("<everyvbox> ");
    break;

case EVERY_JOB_TEXT:
    print_nl("<everyjob> ");
    break;

case EVERY_CR_TEXT:
    print_nl("<everycr> ");
    break;

case MARK_TEXT:
    print_nl("<mark> ");
    break;

case WRITE_TEXT:
    print_nl("<write> ");
    break;

default:
    print_nl("?"); // this should never happen
}

Section 315

Here it is necessary to explain a little trick. We don’t want to store a long string that corresponds to a token list, because that string might take up lots of memory; and we are printing during a time when an error message is being given, so we dare not do anything that might overflow one of $T E X$ ’s tables. So ‘pseudoprinting’ is the answer: We enter a mode of printing that stores characters into a buffer of length ERROR_LINE, where character k + 1 is placed into trick_buf[k mod ERROR_LINE] if k $<$ trick_count, otherwise character k is dropped. Initially we set tally ← 0 and trick_count ← 1000000; then when we reach the point where transition from line 1 to line 2 should occur, we set first_count ← tally and trick_count ← max(ERROR_LINE, tally + 1 + ERROR_LINE − HALF_ERROR_LINE). At the end of the pseudoprinting, the values of first_count, tally, and trick_count give us all the information we need to print the two lines, and all of the necessary text is in trick_buf.

Namely, let l be the length of the descriptive information that appears on the first line. The length of the context information gathered for that line is k = first_count, and the length of the context information gathered for line 2 is m = min(tally, trick_count) − k. If l + k $\leq$ h, where h = HALF_ERROR_LINE, we print trick_buf[0 .. k − 1] after the descriptive information on line 1, and set n ← l + k; here n is the length of line 1. If l + k $>$ h, some cropping is necessary, so we set n ← h and print ‘...’ followed by

trick_buf[(l + k − h + 3) .. k − 1]

where subscripts of trick_buf are circular modulo ERROR_LINE. The second line consists of n spaces followed by trick_buf[k .. (k + m − 1)], unless n + m $>$ ERROR_LINE; in the latter case, further cropping is done. This is easier to program than to explain.

⟨ Local variables for formatting calculations 315 ⟩≡

int i;  // index into |buffer|
int j;  // end of current line in |buffer|
int l;  // length of descriptive information on line 1
int m;  // context information gathered for line 2
int n;  // length of line 1
int p;  // starting or ending place in |trick_buf|
int q;  // temporary index

Section 316

The following code sets up the print routines so that they will gather the desired information.

io.h

#define begin_pseudoprint  \
    l = tally;             \
    tally = 0;             \
    selector = PSEUDO;     \
    trick_count = 1000000

#define set_trick_count                                         \
    do {                                                        \
        first_count = tally;                                    \
        trick_count = tally + 1 + ERROR_LINE - HALF_ERROR_LINE; \
        if (trick_count < ERROR_LINE) {                         \
            trick_count = ERROR_LINE;                           \
        }                                                       \
    } while (0)

Section 317

And the following code uses the information after it has been gathered.

⟨ Print two lines using the tricky pseudoprinted information 317 ⟩≡

if (trick_count == 1000000) {
    // |set_trick_count| must be performed
    set_trick_count;
}
if (tally < trick_count) {
    m = tally - first_count;
}
else {
    // context on line 2
    m = trick_count - first_count;
}
if (l + first_count <= HALF_ERROR_LINE) {
    p = 0;
    n = l + first_count;
}
else {
    print("...");
    p = l + first_count - HALF_ERROR_LINE + 3;
    n = HALF_ERROR_LINE;
}
for(q = p; q < first_count; q++) {
    print_char(trick_buf[q % ERROR_LINE]);
}
print_ln();
for(q = 1; q <= n; q++) {
    print_char(' '); // print |n| spaces to begin line 2
}
if (m + n <= ERROR_LINE) {
    p = first_count + m;
}
else {
    p = first_count + (ERROR_LINE - n - 3);
}
for(q = first_count; q < p; q++) {
    print_char(trick_buf[q % ERROR_LINE]);
}
if (m + n > ERROR_LINE) {
    print("...");
}

Section 318

But the trick is distracting us from our current goal, which is to understand the input state. So let’s concentrate on the data structures that are being pseudoprinted as we finish up the show_context procedure.

⟨ Pseudoprint the line 318 ⟩≡

begin_pseudoprint;
if (buffer[limit] == end_line_char) {
    j = limit;
}
else {
    j = limit + 1; // determine the effective end of the line
}
if (j > 0) {
    for(i = start; i < j; i++) {
        if (i == loc) {
            set_trick_count;
        }
        print_strnumber(buffer[i]);
    }
}

Section 319

⟨ Pseudoprint the token list 319 ⟩≡

begin_pseudoprint;
if (token_type < MACRO) {
    show_token_list(start, loc, 100000);
}
else {
    show_token_list(link(start), loc, 100000); // avoid reference count
}

Section 320

Here is the missing piece of show_token_list that is activated when the token beginning line 2 is about to be shown:

⟨ Do magic computation 320 ⟩≡

set_trick_count;

TeX in C