Section 332: Getting the next token

The heart of TeX’s input mechanism is the get_next procedure, which we shall develop in the next few sections of the program. Perhaps we shouldn’t actually call it the “heart”, however, because it really acts as TeX’s eyes and mouth, reading the source files and gobbling them up. And it also helps TeX to regurgitate stored token lists that are to be processed again.

The main duty of get_next is to input one token and to set cur_cmd and cur_chr to that token’s command code and modifier. Furthermore, if the input token is a control sequence, the eqtb location of that control sequence is stored in cur_cs; otherwise cur_cs is set to zero.

Underlying this simple description is a certain amount of complexity because of all the cases that need to be handled. However, the inner loop of get_next is reasonably short and fast.

When get_next is asked to get the next token of a \read line, it sets cur_cmd = cur_chr = cur_cs = 0 in the case that no more tokens appear on that line. (There might not be any tokens at all, if the end_line_char has ignore as its catcode.)

Section 333

The value of par_loc is the eqtb address of ‘\par’. This quantity is needed because a blank line of input is supposed to be exactly equivalent to the appearance of \par; we must set cur_cs ← par_loc when detecting a blank line.

⟨ Global variables 13 ⟩+≡

pointer par_loc;    // location of \par in eqtb
halfword par_token; // token representing \par

Section 334

⟨ Put each of TeX’s primitives into the hash table 226 ⟩+≡

primitive("par", PAR_END, 256); // cf. |scan_file_name|
par_loc = cur_val;
par_token = CS_TOKEN_FLAG + par_loc;

Section 335

⟨ Cases of print_cmd_chr for symbolic printing of primitives 227 ⟩+≡

case PAR_END:
    print_esc("par");
    break;

Section 336

Before getting into get_next, let’s consider the subroutine that is called when an ‘\outer’ control sequence has been scanned or when the end of a file has been reached. These two cases are distinguished by cur_cs, which is zero at the end of a file.

get_next_token.c

// << Start file |get_next_token.c|, 1382 >>

void check_outer_validity() {
    pointer p; // points to inserted token list
    pointer q; // auxiliary pointer
    if (scanner_status != NORMAL) {
        deletions_allowed = false;
        // << Back up an outer control sequence so that it can be reread, 337 >>
        if (scanner_status > SKIPPING) {
            // << Tell the user what has run away and try to recover, 338 >>
        }
        else {
            print_err("Incomplete ");
            print_cmd_chr(IF_TEST, cur_if);
            print("; all text was ignored after line ");
            print_int(skip_line);
            help3("A forbidden control sequence occurred in skipped text.")
                ("This kind of error happens when you say `\\if...' and forget")
                ("the matching `\\fi'. I've inserted a `\\fi'; this might work.");
            if (cur_cs != 0) {
                cur_cs = 0;
            }
            else {
                help_line[2] = "The file ended while I was skipping conditional text.";
            }
            cur_tok = CS_TOKEN_FLAG + FROZEN_FI;
            ins_error();
        }
        deletions_allowed = true;
    }
}

Section 337

An outer control sequence that occurs in a \read will not be reread, since the error recovery for \read is not very powerful.

⟨ Back up an outer control sequence so that it can be reread 337 ⟩≡

if (cur_cs != 0) {
    if (state == TOKEN_LIST || name < 1 || name > 17) {
        p = get_avail();
        info(p) = CS_TOKEN_FLAG + cur_cs;
        back_list(p); // prepare to read the control sequence again
    }
    cur_cmd = SPACER;
    cur_chr = ' '; // replace it by a space
}

Section 338

⟨ Tell the user what has run away and try to recover 338 ⟩≡

runaway(); // print a definition, argument, or preamble
if (cur_cs == 0) {
    print_err("File ended");
}
else {
    cur_cs = 0;
    print_err("Forbidden control sequence found");
}
print(" while scanning ");
// << Print either 'definition' or 'use' or 'preamble' or 'text', and insert tokens that should lead to recovery, 339 >>
print(" of ");
sprint_cs(warning_index);
help4("I suspect you have forgotten a `}', causing me")
    ("to read past where you wanted me to stop.")
    ("I'll try to recover; but if the error is serious,")
    ("you'd better type `E' or `X' now and fix your file.");
error();

Section 339

The recovery procedure can’t be fully understood without knowing more about the TeX routines that should be aborted, but we can sketch the ideas here: For a runaway definition or a runaway balanced text we will insert a right brace; for a runaway preamble, we will insert a special \cr token and a right brace; and for a runaway argument, we will set long_state to OUTER_CALL and insert \par.

⟨ Print either ‘definition’ or ‘use’ or ‘preamble’ or ‘text’, and insert tokens that should lead to recovery 339 ⟩≡

p = get_avail();
switch (scanner_status) {
case DEFINING:
    print("definition");
    info(p) = RIGHT_BRACE_TOKEN + '}';
    break;

case MATCHING:
    print("use");
    info(p) = par_token;
    long_state = OUTER_CALL;
    break;

case ALIGNING:
    print("preamble");
    info(p) = RIGHT_BRACE_TOKEN + '}';
    q = p;
    p = get_avail();
    link(p) = q;
    info(p) = CS_TOKEN_FLAG + FROZEN_CR;
    align_state = -1000000;
    break;

case ABSORBING:
    print("text");
    info(p) = RIGHT_BRACE_TOKEN + '}';
} // there are no other cases
ins_list(p);

Section 340

We need to mention a procedure here that may be called by get_next.

NOTE

This procedure is firm_up_the_line (declared in a header file).

Section 341

Now we’re ready to take the plunge into get_next itself. Parts of this routine are executed more often than any other instructions of TeX.

get_next_token.c

// sets |cur_cmd|, |cur_chr|, |cur_cs| to next token
void get_next() {
    // restart:    go here to get the next input token
    // switch_lbl: go here to eat the next character from a file
    // reswitch:   go here to digest it again
    // start_cs:   go here to start looking for a control sequence
    // found:      go here when a control sequence has been found
    // return:     go here when the next input token has been got
    int k;            // an index into |buffer|
    halfword t;       // a token
    int cat;          // |cat_code(cur_chr)|, usually
    ASCII_code c, cc; // constituents of a possible expanded code
    int d;            // number of excess characters in an expanded code
restart:
    cur_cs = 0;
    if (state != TOKEN_LIST) {
        // << Input from external file, |goto restart| if no input found, 343 >>
    }
    else {
        // << Input from token list, |goto restart| if end of list or if a parameter needs to be expanded, 357 >>
    }
    // << If an alignment entry has just ended, take appropriate action, 342 >>
}

Section 342

An alignment entry ends when a tab or \cr occurs, provided that the current level of braces is the same as the level that was present at the beginning of that alignment entry; i.e., provided that align_state has returned to the value it had after the ⟨u_j⟩ template for that entry.

⟨ If an alignment entry has just ended, take appropriate action 342 ⟩≡

if (cur_cmd <= CAR_RET
    && cur_cmd >= TAB_MARK
    && align_state == 0)
{
    // << Insert the <v_j> template and |goto restart|, 789 >>
}

Section 343

⟨ Input from external file, goto restart if no input found 343 ⟩≡

switch_lbl:
if (loc <= limit) {
    // current line not yet finished
    cur_chr = buffer[loc];
    incr(loc);
reswitch:
    cur_cmd = cat_code(cur_chr);
    // << Change state if necessary, and |goto switch| if the current character should be ignored, or |goto reswitch| if the current character changes to another, 344 >>
}
else {
    state = NEW_LINE;
    // << Move to next line of file, or |goto restart| if there is no next line, or |return| if a \read line has finished, 360 >>
    check_interrupt;
    goto switch_lbl;
}

Section 344

The following 48-way switch accomplishes the scanning quickly, assuming that a decent Pascal compiler has translated the code. Note that the numeric values for MID_LINE, SKIP_BLANKS, and NEW_LINE are spaced apart from each other by MAX_CHAR_CODE + 1, so we can add a character’s command code to the state to get a single number that characterizes both.

parser.h

// << Start file |parser.h|, 1381 >>

#define any_state_plus(X)   \
    case MID_LINE + (X):    \
    case SKIP_BLANKS + (X): \
    case NEW_LINE + (X)

⟨ Change state if necessary, and goto switch if the current character should be ignored, or goto reswitch if the current character changes to another 344 ⟩≡

switch (state + cur_cmd) {
// << Cases where character is ignored, 345 >>
    goto switch_lbl;

any_state_plus(ESCAPE):
    // << Scan a control sequence and set |state = SKIP_BLANKS| or |MID_LINE|, 354 >>
    break;

any_state_plus(ACTIVE_CHAR):
    // << Process an active-character control sequence and set |state = MID_LINE|, 353 >>
    break;

any_state_plus(SUP_MARK):
    // << If this |SUP_MARK| starts an expanded character like ^^A or ^^df, then |goto reswitch|, otherwise set |state = MID_LINE|, 352 >>
    break;

any_state_plus(INVALID_CHAR):
    // << Decry the invalid character and |goto restart|, 346 >>

// << Handle situations involving spaces, braces, changes of state, 347 >>

default:
    do_nothing;
}
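
To see why the addition of state and command code gives an unambiguous key, here is a minimal sketch of the arithmetic. The numeric values are assumptions that follow the conventions of the original program (with MAX_CHAR_CODE = 15); this port defines its own constants elsewhere, and state_key and key_state are hypothetical helpers that do not appear in the program.

```c
#include <assert.h>

/* Assumed values; the real constants are defined elsewhere in this port. */
enum { MAX_CHAR_CODE = 15 };
enum {
    MID_LINE    = 1,
    SKIP_BLANKS = 2 + MAX_CHAR_CODE,
    NEW_LINE    = 3 + MAX_CHAR_CODE + MAX_CHAR_CODE
};

/* Combine a scanner state with a category code into one switch key. */
static int state_key(int state, int cat) {
    assert(cat >= 0 && cat <= MAX_CHAR_CODE);
    return state + cat;
}

/* Recover the state from a key; the three key ranges do not overlap. */
static int key_state(int key) {
    if (key >= NEW_LINE) {
        return NEW_LINE;
    }
    if (key >= SKIP_BLANKS) {
        return SKIP_BLANKS;
    }
    return MID_LINE;
}
```

Under these assumed values the keys run from 1 to 48, one for each of the three states combined with each of the sixteen category codes.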

Section 345

⟨ Cases where character is ignored 345 ⟩≡

any_state_plus(IGNORE):
case SKIP_BLANKS + SPACER:
case NEW_LINE + SPACER:

Section 346

We go to restart instead of to switch, because state might equal TOKEN_LIST after the error has been dealt with (cf. clear_for_error_prompt).

⟨ Decry the invalid character and goto restart 346 ⟩≡

print_err("Text line contains an invalid character");
help2("A funny symbol that I can't read has just been input.")
    ("Continue, and I'll forget that it ever happened.");
deletions_allowed = false;
error();
deletions_allowed = true;
goto restart;

Section 347

parser.h

#define add_delims_to(X)   \
    case (X) + MATH_SHIFT: \
    case (X) + TAB_MARK:   \
    case (X) + MAC_PARAM:  \
    case (X) + SUB_MARK:   \
    case (X) + LETTER:     \
    case (X) + OTHER_CHAR

⟨ Handle situations involving spaces, braces, changes of state 347 ⟩≡

case MID_LINE + SPACER:
    // << Enter |SKIP_BLANKS| state, emit a space, 349 >>
    break;

case MID_LINE + CAR_RET:
    // << Finish line, emit a space, 348 >>
    break;

case SKIP_BLANKS + CAR_RET:
any_state_plus(COMMENT):
    // << Finish line, |goto switch_lbl|, 350 >>
    
case NEW_LINE + CAR_RET:
    // << Finish line, emit a \par, 351 >>
    break;

case MID_LINE + LEFT_BRACE:
    incr(align_state);
    break;

case SKIP_BLANKS + LEFT_BRACE:
case NEW_LINE + LEFT_BRACE:
    state = MID_LINE;
    incr(align_state);
    break;

case MID_LINE + RIGHT_BRACE:
    decr(align_state);
    break;

case SKIP_BLANKS + RIGHT_BRACE:
case NEW_LINE + RIGHT_BRACE:
    state = MID_LINE;
    decr(align_state);
    break;

add_delims_to(SKIP_BLANKS):
add_delims_to(NEW_LINE):
    state = MID_LINE;
    break;

Section 348

When a character of type SPACER gets through, its character code is changed to ‘␣’ = 32. This means that the ASCII codes for tab and space, and for the space inserted at the end of a line, will be treated alike when macro parameters are being matched. We do this since such characters are indistinguishable on most computer terminal displays.

⟨ Finish line, emit a space 348 ⟩≡

loc = limit + 1;
cur_cmd = SPACER;
cur_chr = ' ';

Section 349

The following code is performed only when cur_cmd = SPACER.

⟨ Enter SKIP_BLANKS state, emit a space 349 ⟩≡

state = SKIP_BLANKS;
cur_chr = ' ';

Section 350

⟨ Finish line, goto switch_lbl 350 ⟩≡

loc = limit + 1;
goto switch_lbl;

Section 351

⟨ Finish line, emit a \par 351 ⟩≡

loc = limit + 1;
cur_cs = par_loc;
cur_cmd = eq_type(cur_cs);
cur_chr = equiv(cur_cs);
if (cur_cmd >= OUTER_CALL) {
    check_outer_validity();
}
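
The treatment of spaces and line ends in the preceding sections can be tried out on a toy scanner. The following is only a sketch under simplifying assumptions: the end of the C string plays the role of CAR_RET, ‘%’ is the only comment character, space and tab are the only SPACER characters, and everything else is emitted unchanged. The function scan_line and the marker ‘P’ for \par are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum scan_state { MID_LINE, SKIP_BLANKS, NEW_LINE };

/* Scan one line (without its terminator) and append the resulting
   "tokens" to out: ordinary characters as themselves, a space token
   as ' ', and a \par token as the marker 'P'.
   Returns the new length of out. */
static size_t scan_line(const char *line, char *out, size_t n) {
    enum scan_state state = NEW_LINE;   /* every line starts in this state */
    assert(line != NULL && out != NULL);
    for (const char *p = line; ; p++) {
        if (*p == '\0') {               /* plays the role of CAR_RET */
            if (state == NEW_LINE) {
                out[n++] = 'P';         /* blank line: emit a \par */
            } else if (state == MID_LINE) {
                out[n++] = ' ';         /* finish line, emit a space */
            }
            return n;                   /* in SKIP_BLANKS nothing is emitted */
        }
        if (*p == '%') {
            return n;                   /* comment: finish line silently */
        }
        if (*p == ' ' || *p == '\t') {
            if (state == MID_LINE) {
                out[n++] = ' ';         /* tab and space both become ' ' */
                state = SKIP_BLANKS;
            }                           /* otherwise the blank is ignored */
        } else {
            out[n++] = *p;
            state = MID_LINE;
        }
    }
}
```

Feeding the four lines "  a \t b  ", "", "x%y", and "z" through scan_line in turn should leave "a b Pxz " in out: leading and repeated blanks vanish, the empty line becomes a paragraph marker, and the comment swallows its line end.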

Section 352

Notice that a code like ^^8 becomes x if not followed by a hex digit.

parser.h

#define is_hex(X) (((X) >= '0' && (X) <= '9' ) || ((X) >= 'a' && (X) <= 'f'))
#define hex_to_cur_chr                            \
    do {                                          \
        if (c <= '9') {                           \
            cur_chr = c - '0';                    \
        }                                         \
        else {                                    \
            cur_chr = c - 'a' + 10;               \
        }                                         \
        if (cc <= '9') {                          \
            cur_chr = 16*cur_chr + cc - '0';      \
        }                                         \
        else {                                    \
            cur_chr = 16*cur_chr + cc - 'a' + 10; \
        }                                         \
    } while (0)

⟨ If this SUP_MARK starts an expanded character like ^^A or ^^df, then goto reswitch, otherwise set state = MID_LINE 352 ⟩≡

if (cur_chr == buffer[loc] && loc < limit) {
    c = buffer[loc + 1];
    if (c < 128) {
        // yes we have an expanded char
        loc += 2;
        if (is_hex(c) && loc <= limit) {
            cc = buffer[loc];
            if (is_hex(cc)) {
                incr(loc);
                hex_to_cur_chr;
                goto reswitch;
            }
        }
        if (c < 64) {
            cur_chr = c + 64;
        }
        else {
            cur_chr = c - 64;
        }
        goto reswitch;
    }
}
state = MID_LINE;
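
The decoding rule can be isolated from the buffer bookkeeping. The sketch below assumes that the two superscript marks have already been recognized; decode_expanded, is_hex_digit, and hex_value are hypothetical names that mirror is_hex and hex_to_cur_chr but are not part of the program.

```c
#include <assert.h>

/* Lowercase hex digit test, as in is_hex. */
static int is_hex_digit(int x) {
    return (x >= '0' && x <= '9') || (x >= 'a' && x <= 'f');
}

static int hex_value(int x) {
    return x <= '9' ? x - '0' : x - 'a' + 10;
}

/* s points at the character after the two superscript marks and n is
   the number of characters that remain on the line (n >= 1).
   Stores the number of characters consumed (1 or 2) in *used and
   returns the resulting character code. */
static int decode_expanded(const unsigned char *s, int n, int *used) {
    assert(n >= 1 && s[0] < 128);
    if (n >= 2 && is_hex_digit(s[0]) && is_hex_digit(s[1])) {
        *used = 2;                       /* two hex digits, like ^^df */
        return 16 * hex_value(s[0]) + hex_value(s[1]);
    }
    *used = 1;                           /* single character, like ^^A */
    return s[0] < 64 ? s[0] + 64 : s[0] - 64;
}
```

For instance, "8" followed by a non-hex character gives 56 + 64 = 120, which is the code of ‘x’, while "df" gives 223.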

Section 353

⟨ Process an active-character control sequence and set state = MID_LINE 353 ⟩≡

cur_cs = cur_chr + ACTIVE_BASE;
cur_cmd = eq_type(cur_cs);
cur_chr = equiv(cur_cs);
state = MID_LINE;
if (cur_cmd >= OUTER_CALL) {
    check_outer_validity();
}

Section 354

Control sequence names are scanned only when they appear in some line of a file; once they have been scanned the first time, their eqtb location serves as a unique identification, so TeX doesn’t need to refer to the original name any more except when it prints the equivalent in symbolic form.

The program that scans a control sequence has been written carefully in order to avoid the blowups that might otherwise occur if a malicious user tried something like ‘\catcode'15=0’. The algorithm might look at buffer[limit + 1], but it never looks at buffer[limit + 2].

If expanded characters like ‘^^A’ or ‘^^df’ appear in or just following a control sequence name, they are converted to single characters in the buffer and the process is repeated, slowly but surely.

⟨ Scan a control sequence and set state = SKIP_BLANKS or MID_LINE 354 ⟩≡

if (loc > limit) {
    // |state| is irrelevant in this case
    cur_cs = NULL_CS;
}
else {
start_cs:
    k = loc;
    cur_chr = buffer[k];
    cat = cat_code(cur_chr);
    incr(k);
    if (cat == LETTER) {
        state = SKIP_BLANKS;
    }
    else if (cat == SPACER) {
        state = SKIP_BLANKS;
    }
    else {
        state = MID_LINE;
    }
    if (cat == LETTER && k <= limit) {
        // << Scan ahead in the buffer until finding a nonletter; if an expanded code is encountered, reduce it and |goto start_cs|; otherwise if a multiletter control sequence is found, adjust |cur_cs| and |loc|, and |goto found|, 356 >>
    }
    else {
        // << If an expanded code is present, reduce it and |goto start_cs|, 355 >>
    }
    cur_cs = SINGLE_BASE + buffer[loc];
    incr(loc);
}
found:
cur_cmd = eq_type(cur_cs);
cur_chr = equiv(cur_cs);
if (cur_cmd >= OUTER_CALL) {
    check_outer_validity();
}
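
If we ignore expanded codes for a moment, the rule for delimiting a control sequence name is short. The sketch below is a simplification under stated assumptions: isalpha() stands in for the test cat_code(c) == LETTER, a plain space stands in for SPACER, and cs_name_length is a hypothetical helper rather than a routine of the program.

```c
#include <assert.h>
#include <ctype.h>

/* Length of the control sequence name that starts at buf[loc], just
   after the escape character, on a line whose last position is limit.
   Returns 0 for the empty name at the end of a line.  *skip is set to
   1 when the scanner would enter the SKIP_BLANKS state afterwards. */
static int cs_name_length(const unsigned char *buf, int loc, int limit, int *skip) {
    assert(loc >= 0);
    if (loc > limit) {
        *skip = 0;                       /* state is irrelevant here */
        return 0;                        /* corresponds to NULL_CS */
    }
    if (!isalpha(buf[loc])) {
        *skip = (buf[loc] == ' ');       /* e.g. a control space */
        return 1;                        /* single-character name */
    }
    int k = loc;
    while (k <= limit && isalpha(buf[k])) {
        k++;                             /* scan ahead to the first nonletter */
    }
    *skip = 1;                           /* names made of letters skip blanks */
    return k - loc;
}
```

A name of length 1 corresponds to the SINGLE_BASE case and a longer name to the id_lookup case.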

Section 355

Whenever we reach the following piece of code, we will have cur_chr = buffer[k − 1] and k ≤ limit + 1 and cat = cat_code(cur_chr). If an expanded code like ^^A or ^^df appears in buffer[(k − 1) .. (k + 1)] or buffer[(k − 1) .. (k + 2)], we will store the corresponding code in buffer[k − 1] and shift the rest of the buffer left two or three places.

⟨ If an expanded code is present, reduce it and goto start_cs 355 ⟩≡

if (buffer[k] == cur_chr
    && cat == SUP_MARK
    && k < limit)
{
    c = buffer[k + 1];
    if (c < 128) {
         // yes, one is indeed present
        d = 2;
        if (is_hex(c) && k + 2 <= limit) {
            cc = buffer[k + 2];
            if (is_hex(cc)) {
                incr(d);
            }
        }
        if (d > 2) {
            hex_to_cur_chr;
            buffer[k - 1] = cur_chr;
        }
        else if (c < 64) {
            buffer[k - 1] = c + 64;
        }
        else {
            buffer[k - 1] = c - 64;
        }
        limit -= d;
        first -= d;
        while (k <= limit) {
            buffer[k] = buffer[k + d];
            incr(k);
        }
        goto start_cs;
    }
}
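
The shifting step is easy to get wrong by one, so it may help to run it on a small buffer. The following sketch takes the decoded character as a parameter and performs only the replacement and the left shift; close_gap is a hypothetical helper, and the adjustment of first is omitted.

```c
#include <assert.h>

/* Replace the expanded code that starts at buf[k - 1] by the single
   character code, then close the gap of d characters (2 or 3).
   Returns the new limit. */
static int close_gap(unsigned char *buf, int k, int limit, int d, int code) {
    assert(k >= 1 && (d == 2 || d == 3) && k - 1 + d <= limit);
    buf[k - 1] = (unsigned char)code;
    limit -= d;
    while (k <= limit) {
        buf[k] = buf[k + d];             /* shift the rest of the line left */
        k++;
    }
    return limit;
}
```

Starting from the six characters a^^41b with limit = 5 and k = 2, a call with d = 3 and code 0x41 should leave aAb in positions 0 through 2 and return the new limit 2.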

Section 356

⟨ Scan ahead in the buffer until finding a nonletter; if an expanded code is encountered, reduce it and goto start_cs; otherwise if a multiletter control sequence is found, adjust cur_cs and loc, and goto found 356 ⟩≡

do {
    cur_chr = buffer[k];
    cat = cat_code(cur_chr);
    incr(k);
} while (cat == LETTER && k <= limit);
// << If an expanded code is present, reduce it and |goto start_cs|, 355 >>
if (cat != LETTER) {
    decr(k);
} // now |k| points to first nonletter
if (k > loc + 1) {
    // multiletter control sequence has been scanned
    cur_cs = id_lookup(loc, k - loc);
    loc = k;
    goto found;
}

Section 357

Let’s consider now what happens when get_next is looking at a token list.

⟨ Input from token list, goto restart if end of list or if a parameter needs to be expanded 357 ⟩≡

if (loc != null) {
    // list not exhausted
    t = info(loc);
    loc = link(loc); // move to next
    if (t >= CS_TOKEN_FLAG) {
        // a control sequence token
        cur_cs = t - CS_TOKEN_FLAG;
        cur_cmd = eq_type(cur_cs);
        cur_chr = equiv(cur_cs);
        if (cur_cmd >= OUTER_CALL) {
            if (cur_cmd == DONT_EXPAND) {
                // << Get the next token, suppressing expansion, 358 >>
            }
            else {
                check_outer_validity();
            }
        }
    }
    else {
        cur_cmd = t / 256;
        cur_chr = t % 256;
        switch (cur_cmd) {
        case LEFT_BRACE:
            incr(align_state);
            break;
        
        case RIGHT_BRACE:
            decr(align_state);
            break;
        
        case OUT_PARAM:
            // << Insert macro parameter and |goto restart|, 359 >>
        
        default:
            do_nothing;
        }
    }
}
else {
    // we are done with this token list
    end_token_list();
    goto restart; // resume previous level
}
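
The effect of this loop on align_state can be illustrated with a simplified list. In the sketch below the node layout, the value NIL, and the numeric command codes are all assumptions made for the example; the real program keeps tokens in mem and uses the info and link macros.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values in the spirit of the original program. */
enum { LEFT_BRACE = 1, RIGHT_BRACE = 2, CS_TOKEN_FLAG = 07777 };
enum { NIL = -1 };                       /* stands in for null */

struct node { int info; int link; };     /* simplified one-word node */

/* Read the list that starts at loc and return the net brace level,
   i.e. the total change that align_state would see. */
static int net_brace_level(const struct node *mem, int loc) {
    assert(mem != NULL);
    int level = 0;
    while (loc != NIL) {
        int t = mem[loc].info;
        loc = mem[loc].link;             /* move to next */
        if (t >= CS_TOKEN_FLAG) {
            continue;                    /* a control sequence token */
        }
        switch (t / 256) {               /* command code of a character token */
        case LEFT_BRACE:
            level++;
            break;
        case RIGHT_BRACE:
            level--;
            break;
        default:
            break;
        }
    }
    return level;
}
```

A balanced list yields 0; starting in the middle of a group yields a negative value, just as align_state drops below its starting value when a right brace is read without its partner.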

Section 358

The present point in the program is reached only when the expand routine has inserted a special marker into the input. In this special case, info(loc) is known to be a control sequence token, and link(loc) = null.

constants.h

#define NO_EXPAND_FLAG 257 // this characterizes a special variant of |RELAX|

⟨ Get the next token, suppressing expansion 358 ⟩≡

cur_cs = info(loc) - CS_TOKEN_FLAG;
loc = null;
cur_cmd = eq_type(cur_cs);
cur_chr = equiv(cur_cs);
if (cur_cmd > MAX_COMMAND) {
    cur_cmd = RELAX;
    cur_chr = NO_EXPAND_FLAG;
}

Section 359

⟨ Insert macro parameter and goto restart 359 ⟩≡

begin_token_list(param_stack[param_start + cur_chr - 1], PARAMETER);
goto restart;

Section 360

All of the easy branches of get_next have now been taken care of. There is one more branch.

parser.h

#define end_line_char_inactive (end_line_char < 0 || end_line_char > 255)

⟨ Move to next line of file, or goto restart if there is no next line, or return if a \read line has finished 360 ⟩≡

if (name > 17) {
    // << Read next line of file into |buffer|, or |goto restart| if the file has ended, 362 >>
}
else {
    if (!terminal_input) {
        // \read line has ended
        cur_cmd = 0;
        cur_chr = 0;
        return;
    }
    if (input_ptr > 0) {
        // text was inserted during error recovery
        end_file_reading();
        goto restart; // resume previous level
    }
    if (selector < LOG_ONLY) {
        open_log_file();
    }
    if (interaction > NONSTOP_MODE) {
        if (end_line_char_inactive) {
            incr(limit);
        }
        if (limit == start) {
            // previous line was empty
            print_nl("(Please type a command or say `\\end')");
        }
        print_ln();
        first = start;
        prompt_input("*"); // input on-line into |buffer|
        limit = last;
        if (end_line_char_inactive) {
            decr(limit);
        }
        else {
            buffer[limit] = end_line_char;
        }
        first = limit + 1;
        loc = start;
    }
    else {
        // nonstop mode, which is intended for overnight batch processing, never waits for on-line input
        fatal_error("*** (job aborted, no legal \\end found)");
    }
}

Section 361

The global variable force_eof is normally false; it is set true by an \endinput command.

⟨ Global variables 13 ⟩+≡

bool force_eof; // should the next \input be aborted early?

Section 362

⟨ Read next line of file into buffer, or goto restart if the file has ended 362 ⟩≡

incr(line);
first = start;
if (!force_eof) {
    if (input_ln(cur_file)) {
        // not end of file
        firm_up_the_line(); // this sets |limit|
    }
    else {
        force_eof = true;
    }
}
if (force_eof) {
    print_char(')');
    decr(open_parens);
    update_terminal; // show user that file has been read
    force_eof = false;
    end_file_reading(); // resume previous level
    check_outer_validity();
    goto restart;
}
if (end_line_char_inactive) {
    decr(limit);
}
else {
    buffer[limit] = end_line_char;
}
first = limit + 1;
loc = start; // ready to read
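
The last few statements decide whether an end-of-line character is appended to the line. A minimal sketch of just that decision follows; terminate_line is a hypothetical helper, and it assumes that the line occupies the positions below last and that the buffer has room at index last.

```c
#include <assert.h>

/* Decide where a freshly read line ends.  Returns the new limit; when
   end_line_char is a valid character code it is stored at buf[limit]. */
static int terminate_line(unsigned char *buf, int last, int end_line_char) {
    assert(last >= 0);
    int limit = last;
    if (end_line_char < 0 || end_line_char > 255) {
        limit--;                         /* inactive: nothing is appended */
    } else {
        buf[limit] = (unsigned char)end_line_char;
    }
    return limit;
}
```

With a three-character line and end_line_char = 13 the limit is 3 and buf[3] holds the return character; with end_line_char = −1 the limit is 2 and the buffer is left untouched.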

Section 363

If the user has set the pausing parameter to some positive value, and if nonstop mode has not been selected, each line of input is displayed on the terminal and the transcript file, followed by ‘=>’. TeX waits for a response. If the response is simply CARRIAGE_RETURN, the line is accepted as it stands; otherwise the line typed is used instead of the line in the file.

get_next_token.c

void firm_up_the_line() {
    int k; // an index into |buffer|
    limit = last;
    if (pausing > 0 && interaction > NONSTOP_MODE) {
        print_ln();
        if (start < limit) {
            for (k = start; k < limit; k++) {
                print_strnumber(buffer[k]);
            }
        }
        first = limit;
        prompt_input("=>"); // wait for user response
        if (last > first) {
            for (k = first; k < last; k++) {
                // move line down in buffer
                buffer[k + start - first] = buffer[k];
            }
            limit = start + last - first;
        }
    }
}
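
The buffer movement in firm_up_the_line can be checked without any terminal interaction. In this sketch the user's reply is assumed to be already present in the buffer, directly after the original line; accept_or_replace is a hypothetical helper.

```c
#include <assert.h>

/* The original line sits in buf[start .. first - 1] and the reply in
   buf[first .. last - 1].  A nonempty reply is moved down over the
   original line.  Returns the new limit. */
static int accept_or_replace(unsigned char *buf, int start, int first, int last) {
    assert(0 <= start && start <= first && first <= last);
    int limit = first;                   /* empty reply: keep the line */
    if (last > first) {
        for (int k = first; k < last; k++) {
            buf[k + start - first] = buf[k];   /* move line down in buffer */
        }
        limit = start + last - first;
    }
    return limit;
}
```

If the buffer holds old followed by the reply XY, the call with start = 0, first = 3, last = 5 should return 2 and leave XY at the front of the buffer.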

Section 364

Since get_next is used so frequently in TeX, it is convenient to define three related procedures that do a little more:

get_token not only sets cur_cmd and cur_chr, it also sets cur_tok, a packed halfword version of the current token.
get_x_token, meaning “get an expanded token”, is like get_token, but if the current token turns out to be a user-defined control sequence (i.e., a macro call), or a conditional, or something like \topmark or \expandafter or \csname, it is eliminated from the input by beginning the expansion of the macro or the evaluation of the conditional.
x_token is like get_x_token except that it assumes that get_next has already been called.

In fact, these three procedures account for almost every use of get_next.

Section 365

No new control sequences will be defined except during a call of get_token, or when \csname compresses a token list, because no_new_control_sequence is always true at other times.

get_next_token.c

// sets |cur_cmd|, |cur_chr|, |cur_tok|
void get_token() {
    no_new_control_sequence = false;
    get_next();
    no_new_control_sequence = true;
    if (cur_cs == 0) {
        cur_tok = (cur_cmd * 256) + cur_chr;
    }
    else {
        cur_tok = CS_TOKEN_FLAG + cur_cs;
    }
}
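
The packing performed by get_token is the inverse of the unpacking done when a token list is read (Section 357). Here is a small sketch of the round trip; the value of CS_TOKEN_FLAG is an assumption taken from the original program (octal 7777) and may differ in this port, and the four helper functions are hypothetical.

```c
#include <assert.h>

/* Assumed value; this port may define CS_TOKEN_FLAG differently. */
enum { CS_TOKEN_FLAG = 07777 };

/* Pack the current token the way get_token does. */
static int pack_token(int cmd, int chr, int cs) {
    assert(cs >= 0 && chr >= 0 && chr < 256);
    /* character tokens must stay below the control sequence range */
    assert(cs != 0 || cmd * 256 + chr < CS_TOKEN_FLAG);
    return cs == 0 ? cmd * 256 + chr : CS_TOKEN_FLAG + cs;
}

/* Nonzero when t denotes a control sequence. */
static int is_cs_token(int t) {
    return t >= CS_TOKEN_FLAG;
}

/* The next two functions are meaningful for character tokens only. */
static int token_cmd(int t) {
    return t / 256;
}

static int token_chr(int t) {
    return t % 256;
}
```

A character token thus carries its command code in the high part and its character code in the low eight bits, while a control sequence token is simply its eqtb location offset by the flag.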
