Section 256: The hash table

Control sequences are stored and retrieved by means of a fairly standard hash table algorithm called the method of “coalescing lists” (cf. Algorithm 6.4C in The Art of Computer Programming). Once a control sequence enters the table, it is never removed, because there are complicated situations involving \gdef where the removal of a control sequence at the end of a group would be a mistake preventable only by the introduction of a complicated reference-count mechanism.

The actual sequence of letters forming a control sequence identifier is stored in the str_pool array together with all the other strings. An auxiliary array hash consists of items with two halfword fields per word. The first of these, called next(p), points to the next identifier belonging to the same coalesced list as the identifier corresponding to p; and the other, called text(p), points to the str_start entry for p’s identifier. If position p of the hash table is empty, we have text(p) = 0; if position p is either empty or the end of a coalesced hash list, we have next(p) = 0. An auxiliary pointer variable called hash_used is maintained in such a way that all locations p $\geq$ hash_used are nonempty. The global variable cs_count tells how many multiletter control sequences have been defined, if statistics are being kept.

A global boolean variable called no_new_control_sequence is set to true during the time that new hash table entries are forbidden.

datastructures.h

#define next(X)         hh_lh(hash[(X)]) // link for coalesced lists
#define text(X)         hh_rh(hash[(X)]) // string number for control sequence name
#define hash_is_full    (hash_used == HASH_BASE) // test if all positions are occupied
#define font_id_text(X) text(FONT_ID_BASE + (X)) // a frozen font identifier's name

⟨ Global variables 13 ⟩+≡

memory_word hash0[UNDEFINED_CONTROL_SEQUENCE - HASH_BASE]; // the hash table
memory_word *hash = hash0 - HASH_BASE;
pointer hash_used;            // allocation pointer for |hash|
bool no_new_control_sequence; // are new identifiers legal?
int cs_count;                 // total number of known identifiers

Section 257

⟨ Set initial values of key variables 21 ⟩+≡

no_new_control_sequence = true; // new identifiers are usually forbidden
next(HASH_BASE) = 0;
text(HASH_BASE) = 0;
for(k = HASH_BASE + 1;  k < UNDEFINED_CONTROL_SEQUENCE; k++) {
    hash[k] = hash[HASH_BASE];
}

Section 258

NOTE

The "notexpanded:" string must be added in the string pool.

⟨ Read the other strings 51 ⟩+≡

put_string("notexpanded:"); // NOTEXPANDED_STRING: 257

⟨ Internal strings numbers in the pool 51 ⟩+≡

#define NOTEXPANDED_STRING 257

⟨ Initialize table entries (done by INITEX only) 164 ⟩+≡

hash_used = FROZEN_CONTROL_SEQUENCE; // nothing is used
cs_count = 0;
eq_type(FROZEN_DONT_EXPAND) = DONT_EXPAND;
text(FROZEN_DONT_EXPAND) = NOTEXPANDED_STRING;

Section 259

Here is the subroutine that searches the hash table for an identifier that matches a given string of length l $>$ 1 appearing in buffer[j .. (j + l − 1)]. If the identifier is found, the corresponding hash table address is returned. Otherwise, if the global variable no_new_control_sequence is true, the dummy address UNDEFINED_CONTROL_SEQUENCE is returned. Otherwise the identifier is inserted into the hash table and its location is returned.

hash.c

// << Start file |hash.c|, 1382 >>

// search the hash table
pointer id_lookup(int j, int l) {
    int h;     // hash code
    int d;     // number of characters in incomplete current string
    pointer p; // index in |hash| array
    pointer k; // index in |buffer| array

    // << Compute the hash code |h|, 261 >>
    p = h + HASH_BASE; // we start searching here; note that |0 <= h < HASH_PRIME|
    while (true) {
        if (text(p) > 0
            && length(text(p)) == l
            && str_eq_buf(text(p), j))
        {
            break; // Goto found
        }
        if (next(p) == 0) {
            if (no_new_control_sequence) {
                p = UNDEFINED_CONTROL_SEQUENCE;
            }
            else {
                // << Insert a new control sequence after |p|, then make |p| point to it, 260 >>
            }
            break; // Goto found
        }
        p = next(p);
    }
    // found:
    return p;
}

Section 260

⟨ Insert a new control sequence after p, then make p point to it 260 ⟩≡

if (text(p) > 0) {
    do {
        if (hash_is_full) {
            overflow("hash size", HASH_SIZE);
        }
        decr(hash_used);
    } while (text(hash_used) != 0); // search for an empty location in |hash|
    next(p) = hash_used;
    p = hash_used;
}
str_room(l);
d = cur_length;
while (pool_ptr > str_start[str_ptr]) {
    decr(pool_ptr);
    str_pool[pool_ptr + l] = str_pool[pool_ptr];
} // move current string up to make room for another
for(k = j; k < j + l; k++) {
    append_char(buffer[k]);
}
text(p) = make_string();
pool_ptr += d;
#ifdef STAT
incr(cs_count);
#endif

Section 261

The value of HASH_PRIME should be roughly 85% of HASH_SIZE, and it should be a prime number. The theory of hashing tells us to expect fewer than two table probes, on the average, when the search is successful. [See J. S. Vitter, Journal of the ACM 30 (1983), 231–258.]

⟨ Compute the hash code h 261 ⟩≡

h = buffer[j];
for(k = j + 1; k < j + l; k++) {
  h = (h + h + buffer[k]) % HASH_PRIME;
}

Section 262

Single-character control sequences do not need to be looked up in a hash table, since we can use the character code itself as a direct address. The procedure print_cs prints the name of a control sequence, given a pointer to its address in eqtb. A space is printed after the name unless it is a single nonletter or an active character. This procedure might be invoked with invalid data, so it is “extra robust”. The individual characters must be printed one at a time using print, since they may be unprintable.

basic_printing.c

// prints a purported control sequence
void print_cs(int p) {
    if (p < HASH_BASE) {
        // single character
        if (p >= SINGLE_BASE) {
            if (p == NULL_CS) {
                print_esc("csname");
                print_esc("endcsname");
                print_char(' ');
            }
            else {
                print_esc_strnumber(p - SINGLE_BASE);
                if (cat_code(p - SINGLE_BASE) == LETTER) {
                    print_char(' ');
                }
            }
        }
        else if (p < ACTIVE_BASE) {
            print_esc("IMPOSSIBLE.");
        }
        else {
            print_strnumber(p - ACTIVE_BASE);
        }
    }
    else if (p >= UNDEFINED_CONTROL_SEQUENCE) {
        print_esc("IMPOSSIBLE.");
    }
    else if (text(p) < 0 || text(p) >= str_ptr) {
        print_esc("NONEXISTENT.");
    }
    else {
        print_esc_strnumber(text(p));
        print_char(' ');
    }
}

Section 263

Here is a similar procedure; it avoids the error checks, and it never prints a space after the control sequence.

basic_printing.c

// prints a control sequence
void sprint_cs(pointer p) {
    if (p < HASH_BASE) {
        if (p < SINGLE_BASE) {
            print_strnumber(p - ACTIVE_BASE);
        }
        else if (p < NULL_CS) {
            print_esc_strnumber(p - SINGLE_BASE);
        }
        else {
            print_esc("csname");
            print_esc("endcsname");
        }
    }
    else {
        print_esc_strnumber(text(p));
    }
}

Section 264

We need to put $T E X$ ’s “primitive” control sequences into the hash table, together with their command code (which will be the eq_type) and an operand (which will be the equiv). The primitive procedure does this, in a way that no $T E X$ user can. The global value cur_val contains the new eqtb pointer after primitive has acted.

NOTE

id_lookup inserts in str_pool the control sequence. Since TEX.POOL file is not used, it is the first and only insertion in the pool, so flush_string is not needed here.

hash.c

// We take a `char *` as input instead of a pool number
#ifdef INIT
void primitive(char *s, quarterword c, halfword o) {
    int l = strlen(s);
    if (l == 1) {
        cur_val = s[0] + SINGLE_BASE;
    }
    else {
        memcpy(buffer, s, l); // we move |s| into the (empty) |buffer|
        cur_val = id_lookup(0, l); // |no_new_control_sequence| is |false|
    }
    eq_level(cur_val) = LEVEL_ONE;
    eq_type(cur_val) = c;
    equiv(cur_val) = o;
}
#endif

Section 265

Many of $T E X$ ’s primitives need no equiv, since they are identifiable by their eq_type alone. These primitives are loaded into the hash table as follows:

⟨ Put each of TeX’s primitives into the hash table 226 ⟩+≡

primitive(" ", EX_SPACE, 0);
primitive("/", ITAL_CORR, 0);
primitive("accent", ACCENT, 0);
primitive("advance", ADVANCE, 0);
primitive("afterassignment", AFTER_ASSIGNMENT, 0);
primitive("aftergroup", AFTER_GROUP, 0);
primitive("begingroup", BEGIN_GROUP, 0);
primitive("char", CHAR_NUM, 0);
primitive("csname", CS_NAME, 0);
primitive("delimiter", DELIM_NUM, 0);
primitive("divide", DIVIDE, 0);
primitive("endcsname", END_CS_NAME, 0);
primitive("endgroup", END_GROUP, 0);
text(FROZEN_END_GROUP) = str_ptr - 1; // "endgroup"
eqtb[FROZEN_END_GROUP] = eqtb[cur_val];

primitive("expandafter", EXPAND_AFTER, 0);
primitive("font", DEF_FONT, 0);
primitive("fontdimen", ASSIGN_FONT_DIMEN, 0);
primitive("halign", HALIGN, 0);
primitive("hrule", HRULE, 0);
primitive("ignorespaces", IGNORE_SPACES, 0);
primitive("insert", INSERT, 0);
primitive("mark", MARK, 0);
primitive("mathaccent", MATH_ACCENT, 0);
primitive("mathchar", MATH_CHAR_NUM, 0);
primitive("mathchoice", MATH_CHOICE, 0);
primitive("multiply", MULTIPLY, 0);
primitive("noalign", NO_ALIGN, 0);
primitive("noboundary", NO_BOUNDARY, 0);
primitive("noexpand", NO_EXPAND, 0);
primitive("nonscript", NON_SCRIPT, 0);
primitive("omit", OMIT, 0);
primitive("parshape", SET_SHAPE, 0);
primitive("penalty", BREAK_PENALTY, 0);
primitive("prevgraf", SET_PREV_GRAF, 0);
primitive("radical", RADICAL, 0);
primitive("read", READ_TO_CS, 0);
primitive("relax", RELAX, 256); // cf. |scan_file_name|
text(FROZEN_RELAX) = str_ptr - 1; // "relax"
eqtb[FROZEN_RELAX] = eqtb[cur_val];

primitive("setbox", SET_BOX, 0);
primitive("the", THE, 0);
primitive("toks", TOKS_REGISTER, 0);
primitive("vadjust", VADJUST, 0);
primitive("valign", VALIGN, 0);
primitive("vcenter", VCENTER, 0);
primitive("vrule", VRULE, 0);

Section 266

Each primitive has a corresponding inverse, so that it is possible to display the cryptic numeric contents of eqtb in symbolic form. Every call of primitive in this program is therefore accompanied by some straightforward code that forms part of the print_cmd_chr routine below.

⟨ Cases of print_cmd_chr for symbolic printing of primitives 227 ⟩+≡

case ACCENT:
    print_esc("accent");
    break;

case ADVANCE:
    print_esc("advance");
    break;

case AFTER_ASSIGNMENT:
    print_esc("afterassignment");
    break;

case AFTER_GROUP:
    print_esc("aftergroup");
    break;

case ASSIGN_FONT_DIMEN:
    print_esc("fontdimen");
    break;

case BEGIN_GROUP:
    print_esc("begingroup");
    break;

case BREAK_PENALTY:
    print_esc("penalty");
    break;

case CHAR_NUM:
    print_esc("char");
    break;

case CS_NAME:
    print_esc("csname");
    break;

case DEF_FONT:
    print_esc("font");
    break;

case DELIM_NUM:
    print_esc("delimiter");
    break;

case DIVIDE:
    print_esc("divide");
    break;

case END_CS_NAME:
    print_esc("endcsname");
    break;

case END_GROUP:
    print_esc("endgroup");
    break;

case EX_SPACE:
    print_esc(" ");
    break;

case EXPAND_AFTER:
    print_esc("expandafter");
    break;

case HALIGN:
    print_esc("halign");
    break;

case HRULE:
    print_esc("hrule");
    break;

case IGNORE_SPACES:
    print_esc("ignorespaces");
    break;

case INSERT:
    print_esc("insert");
    break;

case ITAL_CORR:
    print_esc("/");
    break;

case MARK:
    print_esc("mark");
    break;

case MATH_ACCENT:
    print_esc("mathaccent");
    break;

case MATH_CHAR_NUM:
    print_esc("mathchar");
    break;

case MATH_CHOICE:
    print_esc("mathchoice");
    break;

case MULTIPLY:
    print_esc("multiply");
    break;

case NO_ALIGN:
    print_esc("noalign");
    break;

case NO_BOUNDARY:
    print_esc("noboundary");
    break;

case NO_EXPAND:
    print_esc("noexpand");
    break;

case NON_SCRIPT:
    print_esc("nonscript");
    break;

case OMIT:
    print_esc("omit");
    break;

case RADICAL:
    print_esc("radical");
    break;

case READ_TO_CS:
    print_esc("read");
    break;

case RELAX:
    print_esc("relax");
    break;

case SET_BOX:
    print_esc("setbox");
    break;

case SET_PREV_GRAF:
    print_esc("prevgraf");
    break;

case SET_SHAPE:
    print_esc("parshape");
    break;

case THE:
    print_esc("the");
    break;

case TOKS_REGISTER:
    print_esc("toks");
    break;

case VADJUST:
    print_esc("vadjust");
    break;

case VALIGN:
    print_esc("valign");
    break;

case VCENTER:
    print_esc("vcenter");
    break;

case VRULE:
    print_esc("vrule");
    break;

Section 267

We will deal with the other primitives later, at some point in the program where their eq_type and equiv values are more meaningful. For example, the primitives for math mode will be loaded when we consider the routines that deal with formulas. It is easy to find where each particular primitive was treated by looking in the index at the end; for example, the section where "radical" entered eqtb is listed under ‘\radical primitive’. (Primitives consisting of a single nonalphabetic character, like ‘\/’, are listed under ‘Single-character primitives’.)

Meanwhile, this is a convenient place to catch up on something we were unable to do before the hash table was defined:

⟨ Print the font identifier for font(p) 267 ⟩≡

print_esc_strnumber(font_id_text(font(p)));

TeX in C