Section 891: Pre-hyphenation.

When the line-breaking routine is unable to find a feasible sequence of breakpoints, it makes a second pass over the paragraph, attempting to hyphenate the hyphenatable words. The goal of hyphenation is to insert discretionary material into the paragraph so that there are more potential places to break.

The general rules for hyphenation are somewhat complex and technical, because we want to be able to hyphenate words that are preceded or followed by punctuation marks, and because we want the rules to work for languages other than English. We also must contend with the fact that hyphens might radically alter the ligature and kerning structure of a word.

A sequence of characters will be considered for hyphenation only if it belongs to a “potentially hyphenatable part” of the current paragraph. This is a sequence of nodes $p_{0} p_{1} \dots p_{m}$ where $p_{0}$ is a glue node, $p_{1} \dots p_{m - 1}$ are either character or ligature or whatsit or implicit kern nodes, and $p_{m}$ is a glue or penalty or insertion or adjust or mark or whatsit or explicit kern node. (Therefore hyphenation is disabled by boxes, math formulas, and discretionary nodes already inserted by the user.) The ligature nodes among $p_{1} \dots p_{m - 1}$ are effectively expanded into the original non-ligature characters; the kern nodes and whatsits are ignored.

Each character c is now classified as either a nonletter (if lc_code(c) = 0), a lowercase letter (if lc_code(c) = c), or an uppercase letter (otherwise); an uppercase letter is treated as if it were lc_code(c) for purposes of hyphenation.

The characters generated by $p_{1} \dots p_{m - 1}$ may begin with nonletters; let $c_{1}$ be the first letter that is not in the middle of a ligature. Whatsit nodes preceding $c_{1}$ are ignored; a whatsit found after $c_{1}$ will be the terminating node $p_{m}$. All characters that do not have the same font as $c_{1}$ will be treated as nonletters. The hyphen_char for that font must be between 0 and 255, otherwise hyphenation will not be attempted. TeX looks ahead for as many consecutive letters $c_{1} \dots c_{n}$ as possible; however, n must be less than 64, so a character that would otherwise be $c_{64}$ is effectively not a letter. Furthermore $c_{n}$ must not be in the middle of a ligature.

In this way we obtain a string of letters $c_{1} \dots c_{n}$ that are generated by nodes $p_{a} \dots p_{b}$, where $1 \leq a \leq b + 1 \leq m$. If n $\geq$ l_hyf + r_hyf, this string qualifies for hyphenation; however, uc_hyph must be positive, if $c_{1}$ is uppercase.
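
The classification and qualification rules above can be sketched in a few lines. This is only an illustration under stated assumptions: `ascii_lc_code` is a hypothetical stand-in for the real lc_code table that covers ASCII letters only, and `char_class`, `classify`, and `qualifies` are names invented for this sketch, not part of the program.

```c
#include <assert.h>

/* Hypothetical names for the three classes described in the text. */
typedef enum { NONLETTER, LOWERCASE, UPPERCASE } char_class;

/* Assumption: a stand-in for lc_code that knows ASCII letters only. */
static int ascii_lc_code(unsigned char c) {
    if (c >= 'a' && c <= 'z') return c;
    if (c >= 'A' && c <= 'Z') return c - 'A' + 'a';
    return 0;
}

/* Nonletter if the code is 0, lowercase if it maps to itself, else uppercase. */
static char_class classify(unsigned char c) {
    int lc = ascii_lc_code(c);
    if (lc == 0) return NONLETTER;
    if (lc == c) return LOWERCASE;
    return UPPERCASE;
}

/* word[0..n-1] holds the letters c_1 .. c_n that were found.
   Returns 1 if the string qualifies for hyphenation, 0 otherwise. */
static int qualifies(const unsigned char *word, int n,
                     int l_hyf, int r_hyf, int uc_hyph) {
    assert(word != 0 && n >= 1);
    if (n < l_hyf + r_hyf) return 0;              /* too few letters */
    if (classify(word[0]) == UPPERCASE && uc_hyph <= 0) return 0;
    return 1;
}
```

With l_hyf = 2 and r_hyf = 3, for instance, the five letters of "Knuth" would qualify only when uc_hyph is positive, while a three-letter word would never qualify.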

The hyphenation process takes place in three stages. First, the candidate sequence $c_{1} \dots c_{n}$ is found; then potential positions for hyphens are determined by referring to hyphenation tables; and finally, the nodes $p_{a} \dots p_{b}$ are replaced by a new sequence of nodes that includes the discretionary breaks found.

Fortunately, we do not have to do all this calculation very often, because of the way it has been taken out of TeX’s inner loop. For example, when the second edition of the author’s 700-page book Seminumerical Algorithms was typeset by TeX, only about 1.2 hyphenations needed to be tried per paragraph, since the line breaking algorithm needed to use two passes on only about 5 per cent of the paragraphs.

⟨ Initialize for hyphenating a paragraph 891 ⟩≡

#ifdef INIT
if (trie_not_ready) {
    init_trie();
}
#endif
cur_lang = init_cur_lang;
l_hyf = init_l_hyf;
r_hyf = init_r_hyf;

Section 892

The letters $c_{1} \dots c_{n}$ that are candidates for hyphenation are placed into an array called hc; the number n is placed into hn; pointers to nodes $p_{a - 1}$ and $p_{b}$ in the description above are placed into variables ha and hb; and the font number is placed into hf.

⟨ Global variables 13 ⟩+≡

quarterword hc[66];                       // word to be hyphenated
small_number hn;                          // the number of positions occupied in |hc|; not always a |small_number|
pointer ha, hb;                           // nodes |ha .. hb| should be replaced by the hyphenated result
internal_font_number hf;                  // font number of the letters in |hc|
quarterword hu[64];                       // like |hc|, before conversion to lowercase
int hyf_char;                             // hyphen character of the relevant font
ASCII_code cur_lang, init_cur_lang;       // current hyphenation table of interest
int l_hyf, r_hyf, init_l_hyf, init_r_hyf; // limits on fragment sizes
halfword hyf_bchar;                       // boundary character after c_n
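
To make the roles of hc, hu, and hn concrete, here is a small sketch of how the arrays might look after a word has been collected. It is an assumption-laden simplification: it reads letters from a plain C string rather than from a node list, it uses `isalpha` and `tolower` from `<ctype.h>` (in the default "C" locale) in place of lc_code, and the names `hyph_word`, `collect_letters`, and `MAX_HYPH_LETTERS` are hypothetical.

```c
#include <assert.h>
#include <ctype.h>

enum { MAX_HYPH_LETTERS = 63 };   /* n must be less than 64 */

/* Hypothetical bundle of the three globals, using the same array sizes. */
typedef struct {
    unsigned short hc[66];   /* lowercase codes in hc[1..hn] */
    unsigned short hu[64];   /* original codes in hu[1..hn]  */
    int hn;                  /* number of letters stored     */
} hyph_word;

/* Copy leading letters of text into w, stopping at the first
   nonletter or after 63 letters. Index 0 is left unused here. */
static void collect_letters(const char *text, hyph_word *w) {
    assert(text != 0 && w != 0);
    w->hn = 0;
    while (*text != '\0' && isalpha((unsigned char)*text)
           && w->hn < MAX_HYPH_LETTERS) {
        unsigned char c = (unsigned char)*text++;
        w->hn++;
        w->hu[w->hn] = c;
        w->hc[w->hn] = (unsigned short)tolower(c);
    }
}
```

For the input "TeXbook," this sketch stores seven letters: hu keeps the original case (T, e, X, ...) while hc holds the lowercase forms (t, e, x, ...), and the comma ends the word.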

Section 893

Hyphenation routines need a few more local variables.

⟨ Local variables for line breaking 862 ⟩+≡

small_number j; // an index into |hc| or |hu|
unsigned char c; // character being considered for hyphenation

Section 894

When the following code is activated, the line_break procedure is in its second pass, and cur_p points to a glue node.

⟨ Try to hyphenate the following word 894 ⟩≡

prev_s = cur_p;
s = link(prev_s);
if (s != null) {
    // << Skip to node |ha|, or |goto done1| if no hyphenation should be attempted, 896 >>
    if (l_hyf + r_hyf > 63) {
        goto done1;
    }
    // << Skip to node |hb|, putting letters into |hu| and |hc|, 897 >>
    // << Check that the nodes following |hb| permit hyphenation and that at least |l_hyf + r_hyf| letters have been found, otherwise |goto done1|, 899 >>
    hyphenate();
}
done1:; // empty statement, since a C label must be followed by a statement

Section 895

hyphenation.c

// << Start file |hyphenation.c|, 1382 >>

// << Declare the function called |reconstitute|, 906 >>

void hyphenate() {
    // << Local variables for hyphenation, 901 >>
    // << Find hyphen locations for the word in |hc|, or |return|, 923 >>
    // << If no hyphens were found, |return|, 902 >>
    // << Replace nodes |ha..hb| by a sequence of nodes that includes the discretionary hyphens, 903 >>
}

Section 896

The first thing we need to do is find the node ha just before the first letter.

⟨ Skip to node ha, or goto done1 if no hyphenation should be attempted 896 ⟩≡

while(true) {
    if (is_char_node(s)) {
        c = character(s);
        hf = font(s);
    }
    else if (type(s) == LIGATURE_NODE) {
        if (lig_ptr(s) == null) {
            goto continue_lbl;
        }
        else {
            q = lig_ptr(s);
            c = character(q);
            hf = font(q);
        }
    }
    else if (type(s) == KERN_NODE && subtype(s) == NORMAL) {
        goto continue_lbl;
    }
    else if (type(s) == WHATSIT_NODE) {
        // << Advance past a whatsit node in the pre-hyphenation loop, 1363 >>
        goto continue_lbl;
    }
    else {
        goto done1;
    }
    if (lc_code(c) != 0) {
        if (lc_code(c) == c || uc_hyph > 0) {
            goto done2;
        }
        else {
            goto done1;
        }
    }
continue_lbl:
    prev_s = s;
    s = link(prev_s);
}
done2:
hyf_char = hyphen_char[hf];
if (hyf_char < 0 || hyf_char > 255) {
    goto done1;
}
ha = prev_s;
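
The control flow of this search is easier to see on a stripped-down node list. The following sketch makes several assumptions: the `node` type, the `node_kind` values, `lc_of`, and `find_ha` are all hypothetical; ligature nodes and the hyphen_char test are omitted; and running off the end of the list is treated as "do not hyphenate", which the real loop never needs to check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much simplified node representation. */
typedef enum { N_CHAR, N_IMPLICIT_KERN, N_WHATSIT, N_OTHER } node_kind;

typedef struct node {
    node_kind kind;
    unsigned char ch;     /* meaningful only for N_CHAR */
    int font;             /* meaningful only for N_CHAR */
    struct node *next;
} node;

/* Assumption: ASCII-only stand-in for lc_code. */
static int lc_of(unsigned char c) {
    if (c >= 'a' && c <= 'z') return c;
    if (c >= 'A' && c <= 'Z') return c - 'A' + 'a';
    return 0;
}

/* Returns the node just before the first letter (the analogue of ha)
   and stores that letter's font in *font_out, or returns NULL when
   hyphenation should not be attempted (the analogue of goto done1). */
static node *find_ha(node *glue, int uc_hyph, int *font_out) {
    assert(glue != NULL && font_out != NULL);
    node *prev = glue;
    node *s = glue->next;
    while (s != NULL) {
        if (s->kind == N_CHAR) {
            int lc = lc_of(s->ch);
            if (lc != 0) {                       /* first letter found */
                if (lc == s->ch || uc_hyph > 0) {
                    *font_out = s->font;
                    return prev;
                }
                return NULL;                     /* uppercase, not allowed */
            }
            /* a nonletter character: keep scanning */
        } else if (s->kind != N_IMPLICIT_KERN && s->kind != N_WHATSIT) {
            return NULL;                         /* any other node stops us */
        }
        prev = s;
        s = s->next;
    }
    return NULL;   /* assumption of this sketch: list ended, no letter */
}
```

Given glue followed by "(" and then "w", the sketch returns the "(" node, since leading nonletters are skipped and ha is the node immediately before the first letter.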

Section 897

The word to be hyphenated is now moved to the hu and hc arrays.

⟨ Skip to node hb, putting letters into hu and hc 897 ⟩≡

hn = 0;
while(true) {
    if (is_char_node(s)) {
        if (font(s) != hf) {
            goto done3;
        }
        hyf_bchar = character(s);
        c = hyf_bchar;
        if (lc_code(c) == 0 || hn == 63) {
            goto done3;
        }
        hb = s;
        incr(hn);
        hu[hn] = c;
        hc[hn] = lc_code(c);
        hyf_bchar = NON_CHAR;
    }
    else if (type(s) == LIGATURE_NODE) {
        // << Move the characters of a ligature node to |hu| and |hc|; but |goto done3| if they are not all letters, 898 >>
    }
    else if (type(s) == KERN_NODE && subtype(s) == NORMAL) {
        hb = s;
        hyf_bchar = font_bchar[hf];
    }
    else {
        goto done3;
    }
    s = link(s);
}
done3:

Section 898

We let j be the index of the character being stored when a ligature node is being expanded, since we do not want to advance hn until we are sure that the entire ligature consists of letters. Note that it is possible to get to done3 with hn = 0 and hb not set to any value.

⟨ Move the characters of a ligature node to hu and hc; but goto done3 if they are not all letters 898 ⟩≡

if (font(lig_char(s)) != hf) {
    goto done3;
}
j = hn;
q = lig_ptr(s);
if (q > null) {
    hyf_bchar = character(q);
}
while (q > null) {
    c = character(q);
    if (lc_code(c) == 0 || j == 63) {
        goto done3;
    }
    incr(j);
    hu[j] = c;
    hc[j] = lc_code(c);
    q = link(q);
}
hb = s;
hn = j;
if (odd(subtype(s))) {
    hyf_bchar = font_bchar[hf];
}
else {
    hyf_bchar = NON_CHAR;
}
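
The reason for the separate index j can be shown in isolation. The sketch below is a simplification under stated assumptions: the ligature's original characters are passed as a plain byte array instead of a linked list, `lower_or_zero` is an ASCII-only stand-in for lc_code, and `append_ligature` and `HYPH_LIMIT` are hypothetical names. Where the real code jumps to done3, the sketch simply returns the unchanged count so that the caller can stop scanning.

```c
#include <assert.h>

enum { HYPH_LIMIT = 63 };   /* at most 63 letters are collected */

/* Assumption: ASCII-only stand-in for lc_code. */
static int lower_or_zero(unsigned char c) {
    if (c >= 'a' && c <= 'z') return c;
    if (c >= 'A' && c <= 'Z') return c - 'A' + 'a';
    return 0;
}

/* Tentatively store lig[0..len-1] after position hn of hu and hc
   (both 1-indexed). The new count is returned only if every
   character is a letter and fits; otherwise hn comes back unchanged,
   although entries above hn may already have been overwritten. */
static int append_ligature(const unsigned char *lig, int len,
                           unsigned short *hu, unsigned short *hc,
                           int hn) {
    assert(lig != 0 && hu != 0 && hc != 0);
    assert(hn >= 0 && hn <= HYPH_LIMIT);
    int j = hn;                       /* tentative index */
    for (int i = 0; i < len; i++) {
        int lc = lower_or_zero(lig[i]);
        if (lc == 0 || j == HYPH_LIMIT) {
            return hn;                /* abandon: hn is not advanced */
        }
        j++;
        hu[j] = lig[i];
        hc[j] = (unsigned short)lc;
    }
    return j;                         /* commit all characters at once */
}
```

Appending the characters of an "ffi" ligature to an empty word yields a count of 3, whereas a ligature that contains a nonletter, or one that would push the count past 63, leaves the count where it was.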

Section 899

⟨ Check that the nodes following hb permit hyphenation and that at least l_hyf + r_hyf letters have been found, otherwise goto done1 899 ⟩≡

if (hn < l_hyf + r_hyf) {
    goto done1; // |l_hyf| and |r_hyf| are >= 1
}
while(true) {
    if (!(is_char_node(s))) {
        switch (type(s)) {
        case LIGATURE_NODE:
            do_nothing;
            break;
        
        case KERN_NODE:
            if (subtype(s) != NORMAL) {
                goto done4;
            }
            break;

        case WHATSIT_NODE:
        case GLUE_NODE:
        case PENALTY_NODE:
        case INS_NODE:
        case ADJUST_NODE:
        case MARK_NODE:
            goto done4;
        
        default:
            goto done1;
        }
    }
    s = link(s);
}
done4:
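
The decision made by this final check can be summarized as a function over the kinds of nodes that follow hb. This is a sketch under assumptions, not the program's own code: the `tail_kind` enumeration, the array-based interface, and `tail_permits_hyphenation` are hypothetical, and a list that ends without any terminating node is treated as a failure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node kinds; T_BOX stands for any kind not listed
   in the switch below (boxes, math, discretionaries, and so on). */
typedef enum {
    T_CHAR, T_LIGATURE, T_IMPLICIT_KERN, T_EXPLICIT_KERN,
    T_WHATSIT, T_GLUE, T_PENALTY, T_INS, T_ADJUST, T_MARK, T_BOX
} tail_kind;

/* tail[0..len-1] are the kinds of the nodes starting at s.
   Returns 1 if hyphenation may proceed, 0 otherwise. */
static int tail_permits_hyphenation(const tail_kind *tail, size_t len,
                                    int hn, int l_hyf, int r_hyf) {
    assert(tail != NULL || len == 0);
    if (hn < l_hyf + r_hyf) return 0;        /* too few letters */
    for (size_t i = 0; i < len; i++) {
        switch (tail[i]) {
        case T_CHAR:
        case T_LIGATURE:
        case T_IMPLICIT_KERN:
            break;                           /* skipped over */
        case T_EXPLICIT_KERN:
        case T_WHATSIT:
        case T_GLUE:
        case T_PENALTY:
        case T_INS:
        case T_ADJUST:
        case T_MARK:
            return 1;                        /* proper terminating node */
        default:
            return 0;                        /* anything else forbids it */
        }
    }
    return 0;   /* assumption of this sketch: no terminating node found */
}
```

A word followed by a punctuation character and then glue passes the check, while a word followed by a box does not, even if glue comes later.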

TeX in C