Section 942: Initializing the hyphenation tables.

The trie for ’s hyphenation algorithm is built from a sequence of patterns following a \patterns specification. Such a specification is allowed only in INITEX, since the extra memory for auxiliary tables and for the initialization program itself would only clutter up the production version of with a lot of deadwood.

The first step is to build a trie that is linked, instead of packed into sequential storage, so that insertions are readily made. After all patterns have been processed, INITEX compresses the linked trie by identifying common subtries. Finally the trie is packed into the efficient sequential form that the hyphenation algorithm actually uses.

hyph_scan.c
#ifdef INIT
// << Declare procedures for preprocessing hyphenation patterns, 944 >>
#endif

Section 943

Before we discuss trie building in detail, let’s consider the simpler problem of creating the hyf_distance, hyf_num, and hyf_next arrays.

Suppose, for example, that reads the pattern ‘ab2cde1’. This is a pattern of length 5, with in the notation above. We want the corresponding trie_op code v to have hyf_distance[v] = 3, hyf_num[v] = 2, and hyf_next[v] = v’, where the auxiliary trie_op code v’ has hyf_distance[v’] = 0, hyf_num[v’] = 1, and hyf_next[v’] = MIN_QUARTERWORD.

computes an appropriate value v with the new_trie_op subroutine below, by setting

v’new_trie_op(0, 1, MIN_QUARTERWORD), vnew_trie_op(3, 2,v’).

This subroutine looks up its three parameters in a special hash table, assigning a new value only if these three have not appeared before for the current language.

The hash table is called trie_op_hash, and the number of entries it contains is trie_op_ptr.

⟨ Global variables 13 ⟩+≡

#ifdef INIT
// trie op codes for quadruples
int trie_op_hash0[2*TRIE_OP_SIZE + 1];
int *trie_op_hash = trie_op_hash0 + TRIE_OP_SIZE;

// largest opcode used so far for this language
quarterword trie_used[256];

// language part of a hashed quadruple
ASCII_code trie_op_lang0[TRIE_OP_SIZE];
ASCII_code *trie_op_lang = trie_op_lang0 - 1;

// opcode corresponding to a hashed quadruple
quarterword trie_op_val0[TRIE_OP_SIZE];
quarterword *trie_op_val = trie_op_val0 - 1;

// number of stored ops so far
int trie_op_ptr;
#endif

Section 944

It’s tempting to remove the overflow stops in the following procedure; new_trie_op could return MIN_QUARTERWORD (thereby simply ignoring part of a hyphenation pattern) instead of aborting the job. However, that would lead to different hyphenation results on different installations of using the same patterns. The overflow stops are necessary for portability of patterns.

⟨ Declare procedures for preprocessing hyphenation patterns 944 ⟩≡

quarterword new_trie_op(small_number d, small_number n, quarterword v) {
    int h;         // trial hash location
    quarterword u; // trial op code
    int l;         // pointer to stored data
    h = abs(n + 313*d + 361*v + 1009*cur_lang) % (TRIE_OP_SIZE + TRIE_OP_SIZE) - TRIE_OP_SIZE;
    while(true) {
        l = trie_op_hash[h];
        if (l == 0) {
            // empty position found for a new op
            if (trie_op_ptr == TRIE_OP_SIZE) {
                overflow("pattern memory ops", TRIE_OP_SIZE);
            }
            u = trie_used[cur_lang];
            if (u == MAX_QUARTERWORD) {
                overflow("pattern memory ops per language", MAX_QUARTERWORD - MIN_QUARTERWORD);
            }
            incr(trie_op_ptr);
            incr(u);
            trie_used[cur_lang] = u;
            hyf_distance[trie_op_ptr] = d;
            hyf_num[trie_op_ptr] = n;
            hyf_next[trie_op_ptr] = v;
            trie_op_lang[trie_op_ptr] = cur_lang;
            trie_op_hash[h] = trie_op_ptr;
            trie_op_val[trie_op_ptr] = u;
            return u;
        }
        if (hyf_distance[l] == d
            && hyf_num[l] == n
            && hyf_next[l] == v
            && trie_op_lang[l] == cur_lang)
        {
            return trie_op_val[l];
        }
        if (h > -TRIE_OP_SIZE) {
            decr(h);
        }
        else {
            h = TRIE_OP_SIZE;
        }
    }
}

Section 945

After new_trie_op has compressed the necessary opcode information, plenty of information is available to unscramble the data into the final form needed by our hyphenation algorithm.

⟨ Sort the hyphenation op tables into proper order 945 ⟩≡

op_start[0] = -MIN_QUARTERWORD;
for(j = 1; j <= 255; j++) {
    op_start[j] = op_start[j - 1] + trie_used[j - 1];
}
for(j = 1; j <= trie_op_ptr; j++) {
    // destination
    trie_op_hash[j] = op_start[trie_op_lang[j]] + trie_op_val[j];
}
for(j = 1; j <= trie_op_ptr; j++) {
    while (trie_op_hash[j] > j) {
        k = trie_op_hash[j];
        t = hyf_distance[k];
        hyf_distance[k] = hyf_distance[j];
        hyf_distance[j] = t;
        t = hyf_num[k];
        hyf_num[k] = hyf_num[j];
        hyf_num[j] = t;
        t = hyf_next[k];
        hyf_next[k] = hyf_next[j];
        hyf_next[j] = t;
        trie_op_hash[j] = trie_op_hash[k];
        trie_op_hash[k] = k;
    }
}

Section 946

Before we forget how to initialize the data structures that have been mentioned so far, let’s write down the code that gets them started.

⟨ Initialize table entries (done by INITEX only) 164 ⟩+≡

for(k = -TRIE_OP_SIZE; k <= TRIE_OP_SIZE; k++) {
    trie_op_hash[k] = 0;
}
for(k = 0; k <= 255; k++) {
    trie_used[k] = MIN_QUARTERWORD;
}
trie_op_ptr = 0;

Section 947

The linked trie that is used to preprocess hyphenation patterns appears in several global arrays. Each node represents an instruction of the form “if you see character c, then perform operation o, move to the next character, and go to node l; otherwise go to node r”. The four quantities c, o, l, and r are stored in four arrays trie_c, trie_o, trie_l, and trie_r. The root of the trie is trie_l[0], and the number of nodes is trie_ptr. Null trie pointers are represented by zero. To initialize the trie, we simply set trie_l[0] and trie_ptr to zero. We also set trie_c[0] to some arbitrary value, since the algorithm may access it.

The algorithms maintain the condition

trie_c[trie_r[z]] trie_c[z] whenever z 0 and trie_r[z] 0;

in other words, sibling nodes are ordered by their c fields.

parser.h
#define trie_root trie_l[0] // root of the linked trie

⟨ Global variables 13 ⟩+≡

#ifdef INIT
// characters to match
unsigned char trie_c[TRIE_SIZE + 1];

// operations to perform
quarterword trie_o[TRIE_SIZE + 1];

// left subtrie links
int trie_l[TRIE_SIZE + 1];

// right subtrie links
int trie_r[TRIE_SIZE + 1];

// the number of nodes in the trie
int trie_ptr; 

// used to identify equivalent subtries
int trie_hash[TRIE_SIZE + 1];
#endif

Section 948

Let us suppose that a linked trie has already been constructed. Experience shows that we can often reduce its size by recognizing common subtries; therefore another hash table is introduced for this purpose, somewhat similar to trie_op_hash. The new hash table will be initialized to zero.

The function trie_node(p) returns p if p is distinct from other that it has seen, otherwise it returns the number of the first equivalent node that it has seen.

Notice that we might make subtries equivalent even if they correspond to patterns for different languages, in which the trie ops might mean quite different things. That’s perfectly all right.

⟨ Declare procedures for preprocessing hyphenation patterns 944 ⟩+≡

// converts to a canonical form
int trie_node(trie_pointer p) {
    int h; // trial hash location
    int q; // trial trie node
    h = abs(trie_c[p] + 1009*trie_o[p] + 2718*trie_l[p] + 3142*trie_r[p]) % TRIE_SIZE;
    while(true) {
        q = trie_hash[h];
        if (q == 0) {
            trie_hash[h] = p;
            return p;
        }
        if (trie_c[q] == trie_c[p]
            && trie_o[q] == trie_o[p]
            && trie_l[q] == trie_l[p]
            && trie_r[q] == trie_r[p])
        {
            return q;
        }
        if (h > 0) {
            decr(h);
        }
        else {
            h = TRIE_SIZE;
        }
    }
}

Section 949

A neat recursive procedure is now able to compress a trie by traversing it and applying trie_node to its nodes in “bottom up” fashion. We will compress the entire trie by clearing trie_hash to zero and then saying ‘trie_rootcompress_trie(trie_root)’.

⟨ Declare procedures for preprocessing hyphenation patterns 944 ⟩+≡

int compress_trie(trie_pointer p) {
    if (p == 0) {
        return 0;
    }
    else {
        trie_l[p] = compress_trie(trie_l[p]);
        trie_r[p] = compress_trie(trie_r[p]);
        return trie_node(p);
    }
}

Section 950

The compressed trie will be packed into the trie array using a “top-down first-fit” procedure. This is a little tricky, so the reader should pay close attention: The trie_hash array is cleared to zero again and renamed trie_ref for this phase of the operation; later on, trie_ref[p] will be nonzero only if the linked trie node p is the smallest character in a family and if the characters c of that family have been allocated to locations trie_ref[p] + c in the trie array. Locations of trie that are in use will have trie_link = 0, while the unused holes in trie will be doubly linked with trie_link pointing to the next larger vacant location and trie_back pointing to the next smaller one. This double linking will have been carried out only as far as trie_max, where trie_max is the largest index of trie that will be needed. To save time at the low end of the trie, we maintain array entries trie_min[c] pointing to the smallest hole that is greater than c. Another array trie_taken tells whether or not a given location is equal to trie_ref[p] for some p; this array is used to ensure that distinct nodes in the compressed trie will have distinct trie_ref entries.

parser.h
#define trie_ref     trie_hash      // where linked trie families go into |trie|
#define trie_back(X) hh_lh(trie[X]) // backward links in |trie| holes

⟨ Global variables 13 ⟩+≡

#ifdef INIT
// does a family start here?
bool trie_taken0[TRIE_SIZE];
bool *trie_taken = trie_taken0 - 1;
int trie_min[256];   // the first possible slot for each character
int trie_max;        // largest location used in |trie|
bool trie_not_ready; // is the trie still in linked form?
#endif

Section 951

Each time \patterns appears, it contributes further patterns to the future trie, which will be built only when hyphenation is attempted or when a format file is dumped. The boolean variable trie_not_ready will change to false when the trie is compressed; this will disable further patterns.

⟨ Initialize table entries (done by INITEX only) 164 ⟩+≡

trie_not_ready = true;
trie_root = 0;
trie_c[0] = 0;
trie_ptr = 0;

Section 952

Here is how the trie-compression data structures are initialized. If storage is tight, it would be possible to overlap trie_op_hash, trie_op_lang, and trie_op_val with trie, trie_hash, and trie_taken, because we finish with the former just before we need the latter.

⟨ Get ready to compress the trie 952 ⟩≡

// << Sort the hyphenation op tables into proper order, 945 >>
for(p = 0; p <= TRIE_SIZE; p++) {
    trie_hash[p] = 0;
}
trie_root = compress_trie(trie_root); // identify equivalent subtries
for(p = 0; p <= trie_ptr; p++) {
    trie_ref[p] = 0;
}
for(p = 0; p <= 255; p++) {
    trie_min[p] = p + 1;
}
trie_link(0) = 1;
trie_max = 0;

Section 953

The first_fit procedure finds the smallest hole z in trie such that a trie family starting at a given node p will fit into vacant positions starting at z. If c = trie_c[p], this means that location z - c must not already be taken by some other family, and that z − c + c’ must be vacant for all characters c’ in the family. The procedure sets trie_ref[p] to z − c when the first fit has been found.

⟨ Declare procedures for preprocessing hyphenation patterns 944 ⟩+≡

// packs a family into |trie|
void first_fit(trie_pointer p) {
    int h;        // candidate for |trie_ref[p]|
    int z;        // runs through holes
    int q;        // runs through the family starting at |p|
    ASCII_code c; // smallest character in the family
    int l, r;     // left and right neighbors
    int ll;       // upper limit of |trie_min| updating
    c = trie_c[p];
    z = trie_min[c]; // get the first conceivably good hole
    while(true) {
        h = z - c;
        // << Ensure that |trie_max >= h + 256|, 954 >>
        if (trie_taken[h]) {
            goto not_found;
        }
        // << If all characters of the family fit relative to |h|, then |goto found|, otherwise |goto not_found|, 955 >>
not_found:
        z = trie_link(z); // move to the next hole
    }
found:
    // << Pack the family into |trie| relative to |h|, 956 >>
}

Section 954

By making sure that trie_max is at least h + 256, we can be sure that trie_max z, since h = z − c. It follows that location trie_max will never be occupied in trie, and we will have trie_max trie_link(z).

⟨ Ensure that trie_max >= h + 256 954 ⟩≡

if (trie_max < h + 256) {
    if (TRIE_SIZE <= h + 256) {
        overflow("pattern memory", TRIE_SIZE);
    }
    do {
        incr(trie_max);
        trie_taken[trie_max] = false;
        trie_link(trie_max) = trie_max + 1;
        trie_back(trie_max) = trie_max - 1;
    } while (trie_max != h + 256);
}

Section 955

⟨ If all characters of the family fit relative to h, then goto found, otherwise goto not_found 955 ⟩≡

q = trie_r[p];
while (q > 0) {
    if (trie_link(h + trie_c[q]) == 0) {
        goto not_found;
    }
    q = trie_r[q];
}
goto found;

Section 956

⟨ Pack the family into trie relative to h 956 ⟩≡

trie_taken[h] = true;
trie_ref[p] = h;
q = p;
do {
    z = h + trie_c[q];
    l = trie_back(z);
    r = trie_link(z);
    trie_back(r) = l;
    trie_link(l) = r;
    trie_link(z) = 0;
    if (l < 256) {
        if (z < 256) {
            ll = z;
        }
        else {
            ll = 256;
        }
        do {
            trie_min[l] = r;
            incr(l);
        } while (l != ll);
    }
    q = trie_r[q];
} while (q != 0);

Section 957

To pack the entire linked trie, we use the following recursive procedure.

⟨ Declare procedures for preprocessing hyphenation patterns 944 ⟩+≡

// pack subtries of a family
void trie_pack(trie_pointer p) {
    int q; // a local variable that need not be saved on recursive calls
    do {
        q = trie_l[p];
        if (q > 0 && trie_ref[q] == 0) {
            first_fit(q);
            trie_pack(q);
        }
        p = trie_r[p];
    } while (p != 0);
}

Section 958

When the whole trie has been allocated into the sequential table, we must go through it once again so that trie contains the correct information. Null pointers in the linked trie will be represented by the value 0, which properly implements an “empty” family.

⟨ Move the data into trie 958 ⟩≡

hh_rh(h) = 0;
hh_b0(h) = MIN_QUARTERWORD;
hh_b1(h) = MIN_QUARTERWORD; // |trie_link = 0|, |trie_op = MIN_QUARTERWORD|, |trie_char = 0|
if (trie_root == 0) {
    // no patterns were given
    for(r = 0; r <= 256; r++) {
        trie[r] = h;
    }
    trie_max = 256;
}
else {
    trie_fix(trie_root); // this fixes the non-holes in |trie|
    r = 0; // now we will zero out all the holes
    do {
        s = trie_link(r);
        trie[r] = h;
        r = s;
    } while (r <= trie_max);
}
trie_char(0) = '?'; // make |trie_char(c) != c| for all |c|

Section 959

The fixing-up procedure is, of course, recursive. Since the linked trie usually has overlapping subtries, the same data may be moved several times; but that causes no harm, and at most as much work is done as it took to build the uncompressed trie.

⟨ Declare procedures for preprocessing hyphenation patterns 944 ⟩+≡

// moves |p| and its siblings into |trie|
void trie_fix(trie_pointer p) {
    int q;        // a local variable that need not be saved on recursive calls
    ASCII_code c; // another one that need not be saved
    int z;        // |trie| reference; this local variable must be saved
    z = trie_ref[p];
    do {
        q = trie_l[p];
        c = trie_c[p];
        trie_link(z + c) = trie_ref[q];
        trie_char(z + c) = c;
        trie_op(z + c) = trie_o[p];
        if (q > 0) {
            trie_fix(q);
        }
        p = trie_r[p];
    } while (p != 0);
}

Section 960

Now let’s go back to the easier problem, of building the linked trie. When INITEX has scanned the ‘\patterns’ control sequence, it calls on new_patterns to do the right thing.

⟨ Declare procedures for preprocessing hyphenation patterns 944 ⟩+≡

// initializes the hyphenation pattern data
void new_patterns() {
    int k, l;          // indices into |hc| and |hyf|; not always in |small_number| range
    bool digit_sensed; // should the next digit be treated as a letter?
    quarterword v;     // trie op code
    int p, q;          // nodes of trie traversed during insertion
    bool first_child;  // is |p == trie_l[q]|?
    ASCII_code c;      // character being inserted
    
    if (trie_not_ready) {
        set_cur_lang;
        scan_left_brace(); // a left brace must follow \patterns
        // << Enter all of the patterns into a linked trie, until coming to a right brace, 961 >>
    }
    else {
        print_err("Too late for ");
        print_esc("patterns");
        help1("All patterns must be given before typesetting begins.");
        error();
        link(GARBAGE) = scan_toks(false, false);
        flush_list(def_ref);
    }
}

Section 961

Novices are not supposed to be using \patterns, so the error messages are terse. (Note that all error messages appear in ’s string pool, even if they are used only by INITEX.)

⟨ Enter all of the patterns into a linked trie, until coming to a right brace 961 ⟩≡

k = 0;
hyf[0] = 0;
digit_sensed = false;
while(true) {
    get_x_token();
    switch (cur_cmd) {
    case LETTER:
    case OTHER_CHAR:
        // << Append a new letter or a hyphen level, 962 >>
        break;
    
    case SPACER:
    case RIGHT_BRACE:
        if (k > 0) {
            // << Insert a new pattern into the linked trie, 963 >>
        }
        if (cur_cmd == RIGHT_BRACE) {
            goto done;
        }
        k = 0;
        hyf[0] = 0;
        digit_sensed = false;
        break;
  
    default:
        print_err("Bad ");
        print_esc("patterns");
        help1("(See Appendix H.)");
        error();
    }
}
done:

Section 962

⟨ Append a new letter or a hyphen level 962 ⟩≡

if (digit_sensed || cur_chr < '0' || cur_chr > '9') {
    if (cur_chr == '.') {
        cur_chr = 0; // edge-of-word delimiter
    }
    else {
        cur_chr = lc_code(cur_chr);
        if (cur_chr == 0) {
            print_err("Nonletter");
            help1("(See Appendix H.)");
            error();
        }
    }
    if (k < 63) {
        incr(k);
        hc[k] = cur_chr;
        hyf[k] = 0;
        digit_sensed = false;
    }
}
else if (k < 63) {
    hyf[k] = cur_chr - '0';
    digit_sensed = true;
}

Section 963

When the following code comes into play, the pattern appears in hc[1 .. k], and the corresponding sequence of numbers appears in hyf[0 .. k].

⟨ Insert a new pattern into the linked trie 963 ⟩≡

// << Compute the trie op code, |v|, and set |l = 0|, 965 >>
q = 0;
hc[0] = cur_lang;
while (l <= k) {
    c = hc[l];
    incr(l);
    p = trie_l[q];
    first_child = true;
    while (p > 0 && c > trie_c[p]) {
        q = p;
        p = trie_r[q];
        first_child = false;
    }
    if (p == 0 || c < trie_c[p]) {
        // << Insert a new trie node between |q| and |p|, and make |p| point to it, 964 >>
    }
    q = p; // now node |q| represents $p_1\ldots p_{l - 1}$
}
if (trie_o[q] != MIN_QUARTERWORD) {
    print_err("Duplicate pattern");
    help1("(See Appendix H.)");
    error();
}
trie_o[q] = v;

Section 964

⟨ Insert a new trie node between q and p, and make p point to it 964 ⟩≡

if (trie_ptr == TRIE_SIZE) {
    overflow("pattern memory", TRIE_SIZE);
}
incr(trie_ptr);
trie_r[trie_ptr] = p;
p = trie_ptr;
trie_l[p] = 0;
if (first_child) {
    trie_l[q] = p;
}
else {
    trie_r[q] = p;
}
trie_c[p] = c;
trie_o[p] = MIN_QUARTERWORD;

Section 965

⟨ Compute the trie op code, v, and set l = 0 965 ⟩≡

if (hc[1] == 0) {
    hyf[0] = 0;
}
if (hc[k] == 0) {
    hyf[k] = 0;
}
l = k;
v = MIN_QUARTERWORD;
while(true) {
    if (hyf[l] != 0) {
        v = new_trie_op(k - l, hyf[l], v);
    }
    if (l > 0) {
        decr(l);
    }
    else {
        break; // goto done1
    }
}
// done1:

Section 966

Finally we put everything together: Here is how the trie gets to its final, efficient form. The following packing routine is rigged so that the root of the linked tree gets mapped into location 1 of trie, as required by the hyphenation algorithm. This happens because the first call of first_fit will “take” location 1.

⟨ Declare procedures for preprocessing hyphenation patterns 944 ⟩+≡

void init_trie() {
    int p;         // pointer for initialization
    int j, k, t;   // all-purpose registers for initialization
    int r, s;      // used to clean up the packed |trie|
    memory_word h; // template used to zero out |trie|'s holes
    
    // << Get ready to compress the trie, 952 >>
    if (trie_root != 0) {
        first_fit(trie_root);
        trie_pack(trie_root);
    }
    // << Move the data into |trie|, 958 >>
    trie_not_ready = false;
}