Section 919: Hyphenation
When a word hc[1 .. hn] has been set up to contain a candidate for hyphenation, first looks to see if it is in the user’s exception dictionary. If not, hyphens are inserted based on patterns that appear within the given word, using an algorithm due to Frank M. Liang.
Let’s consider Liang’s method first, since it is much more interesting than the exception-lookup routine. The algorithm begins by setting hyf[j] to zero for all j, and invalid characters are inserted into hc[0] and hc[hn + 1] to serve as delimiters. Then a reasonably fast method is used to see which of a given set of patterns occurs in the word hc[0 .. (hn + 1)]. Each pattern of length k has an associated sequence of k + 1 numbers ; and if the pattern occurs in hc[(j + 1) .. (j + k)], will set hyf[j + i] = max(hyf[j + i], ) for 0 i k. After this has been done for each pattern that occurs, a discretionary hyphen will be inserted between hc[j] and hc[j + 1] when hyf[j] is odd, as we have already seen.
The set of patterns and associated numbers depends, of course, on the language whose words are being hyphenated, and on the degree of hyphenation that is desired. A method for finding appropriate p’s and n’s, from a given dictionary of words and acceptable hyphenations, is discussed in Liang’s Ph.D. thesis (Stanford University, 1983); simply starts with the patterns and works from there.
Section 920
The patterns are stored in a compact table that is also efficient for retrieval, using a variant of “trie memory” [cf. The Art of Computer Programming 3 (1973), 481–505]. We can find each pattern by letting be one greater than the relevant language index and then, for 1 i k, setting ← trie_link() + ; the pattern will be identified by the number . Since all the pattern information is packed together into a single trie_link array, it is necessary to prevent confusion between the data from inequivalent patterns, so another table is provided such that trie_char() = for all i. There is also a table trie_op() to identify the numbers associated with .
Comparatively few different number sequences actually occur, since most of the n’s are generally zero. Therefore the number sequences are encoded in such a way that trie_op() is only one byte long. If trie_op() MIN_QUARTERWORD, when has matched the letters in hc[(l − k + 1) .. l] of language t, we perform all of the required operations for this pattern by carrying out the following little program: Set v ← trie_op(). Then set v ← v + op_start[t], hyf[l − hyf_distance[v]] ← max(hyf[l − hyf_distance[v]], hyf_num[v]), and v ← hyf_next[v]; repeat, if necessary, until v = MIN_QUARTERWORD.
⟨ Types in the outer block 18 ⟩+≡
typedef int trie_pointer;
Section 921
#define trie_link(X) hh_rh(trie[(X)]) // "downward" link in a trie
#define trie_char(X) hh_b1(trie[(X)]) // character matched at this trie location
#define trie_op(X) hh_b0(trie[(X)]) // program for hyphenation at this trie location
⟨ Global variables 13 ⟩+≡
// |trie_link|, |trie_char|, |trie_op|
memory_word trie[TRIE_SIZE + 1];
// position |k - j| of $n_j$
small_number hyf_distance0[TRIE_OP_SIZE];
small_number *hyf_distance = hyf_distance0 - 1;
// value of $n_j$
small_number hyf_num0[TRIE_OP_SIZE];
small_number *hyf_num = hyf_num0 - 1;
// continuation code
quarterword hyf_next0[TRIE_OP_SIZE];
quarterword *hyf_next = hyf_next0 - 1;
// offset for current language
int op_start[256];
Section 922
⟨ Local variables for hyphenation 901 ⟩+≡
int z; // an index into |trie|
int v; // an index into |hyf_distance|, etc.
Section 923
Assuming that these auxiliary tables have been set up properly, the hyphenation algorithm is quite short. In the following code we set hc[hn + 2] to the impossible value 256, in order to guarantee that hc[hn + 3] will never be fetched.
⟨ Find hyphen locations for the word in hc, or return 923 ⟩≡
for(j = 0; j <= hn; j++) {
hyf[j] = 0;
}
// << Look for the word |hc[1..hn]| in the exception table, and |goto found| (with |hyf| containing the hyphens) if an entry is found, 930 >>
if (trie_char(cur_lang + 1) != cur_lang) {
return; // no patterns for |cur_lang|
}
hc[0] = 0;
hc[hn + 1] = 0;
hc[hn + 2] = 256; // insert delimiters
for(j = 0; j <= hn - r_hyf + 1; j++) {
z = trie_link(cur_lang + 1) + hc[j];
l = j;
while (hc[l] == trie_char(z)) {
if (trie_op(z) != MIN_QUARTERWORD) {
// << Store maximum values in the |hyf| table, 924 >>
}
incr(l);
z = trie_link(z) + hc[l];
}
}
found:
for(j = 0; j <= l_hyf - 1; j++) {
hyf[j] = 0;
}
for(j = 0; j <= r_hyf - 1; j++) {
hyf[hn - j] = 0;
}
Section 924
⟨ Store maximum values in the hyf table 924 ⟩≡
v = trie_op(z);
do {
v += op_start[cur_lang];
i = l - hyf_distance[v];
if (hyf_num[v] > hyf[i]) {
hyf[i] = hyf_num[v];
}
v = hyf_next[v];
} while (v != MIN_QUARTERWORD);
Section 925
The exception table that is built by ’s \hyphenation
primitive is organized as an ordered hash table [cf. Amble and Knuth, The Computer Journal 17 (1974), 135–142] using linear probing.
If and are words, we will say that if or if and is lexicographically smaller than .
(The notation stands for the length of .)
The idea of ordered hashing is to arrange the table so that a given word can be sought by computing a hash address and then looking in table positions , , , until encountering the first word .
If this word is different from , we can conclude that is not in the table.
The words in the table point to lists in mem that specify hyphen positions in their info fields. The list for contains the number k if the word has a discretionary hyphen between and .
Section 926
⟨ Global variables 13 ⟩+≡
str_number hyph_word[HYPH_SIZE + 1]; // exception words
pointer hyph_list[HYPH_SIZE + 1]; // lists of hyphen positions
int hyph_count; // the number of words in the exception dictionary
Section 927
⟨ Local variables for initialization 19 ⟩+≡
int z; // runs through the exception dictionary
Section 928
⟨ Set initial values of key variables 21 ⟩+≡
for(z = 0; z <= HYPH_SIZE; z++) {
hyph_word[z] = 0;
hyph_list[z] = null;
}
hyph_count = 0;
Section 929
The algorithm for exception lookup is quite simple, as soon as we have a few more local variables to work with.
⟨ Local variables for hyphenation 901 ⟩+≡
int h; // an index into |hyph_word| and |hyph_list|
str_number k; // an index into |str_start|
pool_pointer u; // an index into |str_pool|
Section 930
First we compute the hash code h, then we search until we either find the word or we don’t. Words from different languages are kept separate by appending the language code to the string.
⟨ Look for the word hc[1..hn] in the exception table, and goto found (with hyf containing the hyphens) if an entry is found 930 ⟩≡
h = hc[1];
incr(hn);
hc[hn] = cur_lang;
for(j = 2; j <= hn; j++) {
h = (h + h + hc[j]) % HYPH_SIZE;
}
while(true) {
// << If the string |hyph_word[h]| is less than |hc[1..hn]|, |goto not_found|; but if the two strings are equal, set |hyf| to the hyphen positions and |goto found|, 931 >>
if (h > 0) {
decr(h);
}
else {
h = HYPH_SIZE;
}
}
not_found:
decr(hn);
Section 931
⟨ If the string hyph_word[h] is less than hc[1..hn], goto not_found; but if the two strings are equal, set hyf to the hyphen positions and goto found 931 ⟩≡
k = hyph_word[h];
if (k == 0 || length(k) < hn) {
goto not_found;
}
if (length(k) == hn) {
j = 1;
u = str_start[k];
do {
if (str_pool[u] < hc[j]) {
goto not_found;
}
if (str_pool[u] > hc[j]) {
goto done;
}
incr(j);
incr(u);
} while (j <= hn);
// << Insert hyphens as specified in |hyph_list[h]|, 932 >>
decr(hn);
goto found;
}
done:
Section 932
⟨ Insert hyphens as specified in hyph_list[h] 932 ⟩≡
s = hyph_list[h];
while (s != null) {
hyf[info(s)] = 1;
s = link(s);
}
Section 933
⟨ Search hyph_list for pointers to p 933 ⟩≡
for(q = 0; q <= HYPH_SIZE; q++) {
if (hyph_list[q] == p) {
print_nl("HYPH(");
print_int(q);
print_char(')');
}
}
Section 934
We have now completed the hyphenation routine, so the line_break procedure is finished at last. Since the hyphenation exception table is fresh in our minds, it’s a good time to deal with the routine that adds new entries to it.
When has scanned ‘\hyphenation
’, it calls on a procedure named new_hyph_exceptions to do the right thing.
#define set_cur_lang \
do { \
if (language <= 0) { \
cur_lang = 0; \
} \
else if (language > 255) { \
cur_lang = 0; \
} \
else { \
cur_lang = language; \
} \
} while(0)
void new_hyph_exceptions();
// << Start file |hyph_scan.c|, 1382 >>
// enters new exceptions
void new_hyph_exceptions() {
int n; // length of current word; not always a |small_number|
int j; // an index into |hc|
int h; // an index into |hyph_word| and |hyph_list|
str_number k; // an index into |str_start|
pointer p; // head of a list of hyphen positions
pointer q; // used when creating a new node for list |p|
str_number s, t; // strings being compared or stored
pool_pointer u, v; // indices into |str_pool|
scan_left_brace(); // a left brace must follow \\hyphenation}
set_cur_lang;
// << Enter as many hyphenation exceptions as are listed, until coming to a right brace; then |return|, 935 >>
}
Section 935
⟨ Enter as many hyphenation exceptions as are listed, until coming to a right brace; then return 935 ⟩≡
n = 0;
p = null;
while(true) {
get_x_token();
reswitch:
switch (cur_cmd) {
case LETTER:
case OTHER_CHAR:
case CHAR_GIVEN:
// << Append a new letter or hyphen, 937 >>
break;
case CHAR_NUM:
scan_char_num();
cur_chr = cur_val;
cur_cmd = CHAR_GIVEN;
goto reswitch;
case SPACER:
case RIGHT_BRACE:
if (n > 1) {
// << Enter a hyphenation exception, 939 >>
}
if (cur_cmd == RIGHT_BRACE) {
return;
}
n = 0;
p = null;
break;
default:
// << Give improper \hyphenation error, 936 >>
}
}
Section 936
⟨ Give improper \hyphenation error 936 ⟩≡
print_err("Improper ");
print_esc("hyphenation");
print(" will be flushed");
help2("Hyphenation exceptions must contain only letters")
("and hyphens. But continue; I'll forgive and forget.");
error();
Section 937
⟨ Append a new letter or hyphen 937 ⟩≡
if (cur_chr == '-') {
// << Append the value |n| to list |p|, 938 >>
}
else {
if (lc_code(cur_chr) == 0) {
print_err("Not a letter");
help2("Letters in \\hyphenation words must have \\lccode>0.")
("Proceed; I'll ignore the character I just read.");
error();
}
else if (n < 63) {
incr(n);
hc[n] = lc_code(cur_chr);
}
}
Section 938
⟨ Append the value n to list p 938 ⟩≡
if (n < 63) {
q = get_avail();
link(q) = p;
info(q) = n;
p = q;
}
Section 939
⟨ Enter a hyphenation exception 939 ⟩≡
incr(n);
hc[n] = cur_lang;
str_room(n);
h = 0;
for(j = 1; j <= n; j++) {
h = (h + h + hc[j]) % HYPH_SIZE;
append_char(hc[j]);
}
s = make_string();
// << Insert the pair |(s,p)| into the exception table, 940 >>
Section 940
⟨ Insert the pair (s,p) into the exception table 940 ⟩≡
if (hyph_count == HYPH_SIZE) {
overflow("exception dictionary", HYPH_SIZE);
}
incr(hyph_count);
while (hyph_word[h] != 0) {
// << If the string |hyph_word[h]| is less than or equal to |s|, interchange |(hyph_word[h],hyph_list[h])| with |(s,p)|, 941 >>
if (h > 0) {
decr(h);
}
else {
h = HYPH_SIZE;
}
}
hyph_word[h] = s;
hyph_list[h] = p;
Section 941
⟨ If the string hyph_word[h] is less than or equal to s, interchange (hyph_word[h],hyph_list[h]) with (s,p) 941 ⟩≡
k = hyph_word[h];
if (length(k) < length(s)) {
goto found;
}
if (length(k) > length(s)) {
goto not_found;
}
u = str_start[k];
v = str_start[s];
do {
if (str_pool[u] < str_pool[v]) {
goto found;
}
if (str_pool[u] > str_pool[v]) {
goto not_found;
}
incr(u);
incr(v);
} while (u != str_start[k + 1]);
found:
q = hyph_list[h];
hyph_list[h] = p;
p = q;
t = hyph_word[h];
hyph_word[h] = s;
s = t;
not_found: