Section 133: Data structures for boxes and their friends

From the computer’s standpoint, $T E X$ ’s chief mission is to create horizontal and vertical lists. We shall now investigate how the elements of these lists are represented internally as nodes in the dynamic memory.

A horizontal or vertical list is linked together by link fields in the first word of each node. Individual nodes represent boxes, glue, penalties, or special things like discretionary hyphens; because of this variety, some nodes are longer than others, and we must distinguish different kinds of nodes. We do this by putting a ‘type’ field in the first word, together with the link and an optional ‘subtype’.

datastructures.h

#define type(X)    hh_b0(mem[(X)]) // identifies what kind of node this is
#define subtype(X) hh_b1(mem[(X)]) // secondary identification in some cases

Section 134

A CHAR_NODE, which represents a single character, is the most important kind of node because it accounts for the vast majority of all boxes. Special precautions are therefore taken to ensure that a CHAR_NODE does not take up much memory space. Every such node is one word long, and in fact it is identifiable by this property, since other kinds of nodes have at least two words, and they appear in mem locations less than hi_mem_min. This makes it possible to omit the type field in a CHAR_NODE, leaving us room for two bytes that identify a font and a character within that font.

Note that the format of a CHAR_NODE allows for up to 256 different fonts and up to 256 characters per font; but most implementations will probably limit the total number of fonts to fewer than 75 per job, and most fonts will stick to characters whose codes are less than 128 (since higher codes are more difficult to access on most keyboards).

Extensions of $T E X$ intended for oriental languages will need even more than 256 $\times$ 256 possible characters, when we consider different sizes and styles of type. It is suggested that Chinese and Japanese fonts be handled by representing such characters in two consecutive CHAR_NODE entries: The first of these has font = FONT_BASE, and its link points to the second; the second identifies the font and the character dimensions. The saving feature about oriental characters is that most of them have the same box dimensions. The character field of the first CHAR_NODE is a “charext” that distinguishes between graphic symbols whose dimensions are identical for typesetting purposes. (See the METAFONT manual.) Such an extension of $T E X$ would not be difficult; further details are left to the reader.

In order to make sure that the character code fits in a quarterword, $T E X$ adds the quantity MIN_QUARTERWORD to the actual code.

Character nodes appear only in horizontal lists, never in vertical lists.

datastructures.h

#define is_char_node(X) ((X) >= hi_mem_min) // does the argument point to a |CHAR_NODE|?
#define font            type                // the font code in a |CHAR_NODE|
#define character       subtype             // the character code in a |CHAR_NODE|

Section 135

An HLIST_NODE stands for a box that was made from a horizontal list. Each HLIST_NODE is seven words long, and contains the following fields (in addition to the mandatory type and link, which we shall not mention explicitly when discussing the other node types): The height and width and depth are scaled integers denoting the dimensions of the box. There is also a shift_amount field, a scaled integer indicating how much this box should be lowered (if it appears in a horizontal list), or how much it should be moved to the right (if it appears in a vertical list). There is a list_ptr field, which points to the beginning of the list from which this box was fabricated; if list_ptr is null, the box is empty. Finally, there are three fields that represent the setting of the glue: glue_set(p) is a word of type glue_ratio that represents the proportionality constant for glue setting; glue_sign(p) is STRETCHING or SHRINKING or NORMAL depending on whether or not the glue should stretch or shrink or remain rigid; and glue_order(p) specifies the order of infinity to which glue setting applies (NORMAL, FIL, FILL, or FILLL). The subtype field is not used.

constants.h

#define HLIST_NODE    0 // |type| of hlist nodes
#define BOX_NODE_SIZE 7 // number of words to allocate for a box node
#define WIDTH_OFFSET  1 // position of |width| field in a box node
#define DEPTH_OFFSET  2 // position of |depth| field in a box node
#define HEIGHT_OFFSET 3 // position of |height| field in a box node
#define LIST_OFFSET   5 // position of |list_ptr| field in a box node
#define NORMAL        0 // the most common case when several cases are named
#define STRETCHING    1 // glue setting applies to the stretch components
#define SHRINKING     2 // glue setting applies to the shrink components
#define GLUE_OFFSET   6 // position of |glue_set| in a box node

datastructures.h

#define width(X)        mem[(X) + WIDTH_OFFSET].sc  // width of the box, in sp
#define depth(X)        mem[(X) + DEPTH_OFFSET].sc  // depth of the box, in sp
#define height(X)       mem[(X) + HEIGHT_OFFSET].sc // height of the box, in sp
#define shift_amount(X) mem[(X) + 4].sc             // repositioning distance, in sp
#define list_ptr(X)     link((X) + LIST_OFFSET)     // beginning of the list inside the box
#define glue_order(X)   subtype((X) + LIST_OFFSET)  // applicable order of infinity
#define glue_sign(X)    type((X) + LIST_OFFSET)     // stretching or shrinking
#define glue_set(X)     mem[(X) + GLUE_OFFSET].gr   // a word of type |glue_ratio| for glue setting

Section 136

The new_null_box function returns a pointer to an HLIST_NODE in which all subfields have the values corresponding to ‘\hbox{}’. (The subtype field is set to MIN_QUARTERWORD, for historic reasons that are no longer relevant.)

nodes.c

// << Start file |nodes.c|, 1382 >>

// creates a new box node
pointer new_null_box() {
    pointer p = get_node(BOX_NODE_SIZE); // the new node
    type(p) = HLIST_NODE;
    subtype(p) = MIN_QUARTERWORD;
    width(p) = 0;
    depth(p) = 0;
    height(p) = 0;
    shift_amount(p) = 0;
    list_ptr(p) = null;
    glue_sign(p) = NORMAL;
    glue_order(p) = NORMAL;
    set_glue_ratio_zero(glue_set(p));
    return p;
}

Section 137

A VLIST_NODE is like an HLIST_NODE in all respects except that it contains a vertical list.

constants.h

#define VLIST_NODE 1 // |type| of vlist nodes

Section 138

A RULE_NODE stands for a solid black rectangle; it has width, depth, and height fields just as in an HLIST_NODE. However, if any of these dimensions is $- 2^{30}$ , the actual value will be determined by running the rule up to the boundary of the innermost enclosing box. This is called a “running dimension”. The width is never running in an hlist; the height and depth are never running in a vlist.

constants.h

#define RULE_NODE 2           // |type| of rule nodes
#define RULE_NODE_SIZE 4      // number of words to allocate for a rule node
#define NULL_FLAG -0x40000000 // -2^{30}, signifies a missing item

datastructures.h

#define is_running(X) ((X) == NULL_FLAG) // tests for a running dimension

Section 139

A new rule node is delivered by the new_rule function. It makes all the dimensions “running”, so you have to change the ones that are not allowed to run.

nodes.c

pointer new_rule() {
    pointer p = get_node(RULE_NODE_SIZE); // the new node
    type(p) = RULE_NODE;
    subtype(p) = 0; // the |subtype| is not used
    width(p) = NULL_FLAG;
    depth(p) = NULL_FLAG;
    height(p) = NULL_FLAG;
    return p;
}

Section 140

Insertions are represented by INS_NODE records, where the subtype indicates the corresponding box number. For example, ‘\insert 250’ leads to an INS_NODE whose subtype is 250 + MIN_QUARTERWORD. The height field of an INS_NODE is slightly misnamed; it actually holds the natural height plus depth of the vertical list being inserted. The depth field holds the split_max_depth to be used in case this insertion is split, and the split_top_ptr points to the corresponding split_top_skip. The float_cost field holds the floating_penalty that will be used if this insertion floats to a subsequent page after a split insertion of the same class. There is one more field, the ins_ptr, which points to the beginning of the vlist for the insertion.

constants.h

#define INS_NODE      3 // |type| of insertion nodes
#define INS_NODE_SIZE 5 // number of words to allocate for an insertion

datastructures.h

#define float_cost(X)    mem[(X) + 1].integer // the |floating_penalty| to be used
#define ins_ptr(X)       info((X) + 4)        // the vertical list to be inserted
#define split_top_ptr(X) link((X) + 4)        // the |split_top_skip| to be used

Section 141

A MARK_NODE has a mark_ptr field that points to the reference count of a token list that contains the user’s \mark text. This field occupies a full word instead of a halfword, because there’s nothing to put in the other halfword; it is easier in Pascal to use the full word than to risk leaving garbage in the unused half.

constants.h

#define MARK_NODE       4 // |type| of a mark node
#define SMALL_NODE_SIZE 2 // number of words to allocate for most node types

datastructures.h

#define mark_ptr(X) mem[(X) + 1].integer // head of the token list for a mark

Section 142

An ADJUST_NODE, which occurs only in horizontal lists, specifies material that will be moved out into the surrounding vertical list; i.e., it is used to implement $T E X$ ’s ‘\vadjust’ operation. The adjust_ptr field points to the vlist containing this material.

constants.h

#define ADJUST_NODE 5 // |type| of an adjust node

datastructures.h

#define adjust_ptr mark_ptr // vertical list to be moved out of horizontal list

Section 143

A LIGATURE_NODE, which occurs only in horizontal lists, specifies a character that was fabricated from the interaction of two or more actual characters. The second word of the node, which is called the lig_char word, contains font and character fields just as in a CHAR_NODE. The characters that generated the ligature have not been forgotten, since they are needed for diagnostic messages and for hyphenation; the lig_ptr field points to a linked list of character nodes for all original characters that have been deleted. (This list might be empty if the characters that generated the ligature were retained in other nodes.)

The subtype field is 0, plus 2 and/or 1 if the original source of the ligature included implicit left and/or right boundaries.

constants.h

#define LIGATURE_NODE 6 // |type| of a ligature node

datastructures.h

#define lig_char(X) ((X) + 1)           // the word where the ligature is to be found
#define lig_ptr(X)  link(lig_char((X))) // the list of characters

Section 144

The new_ligature function creates a ligature node having given contents of the font, character, and lig_ptr fields. We also have a new_lig_item function, which returns a two-word node having a given character field. Such nodes are used for temporary processing as ligatures are being created.

nodes.c

pointer new_ligature(quarterword f, quarterword c, pointer q) {
    pointer p = get_node(SMALL_NODE_SIZE); // the new node
    type(p) = LIGATURE_NODE;
    font(lig_char(p)) = f;
    character(lig_char(p)) = c;
    lig_ptr(p) = q;
    subtype(p) = 0;
    return p;
}

pointer new_lig_item(quarterword c) {
    pointer p = get_node(SMALL_NODE_SIZE); // the new node
    character(p) = c;
    lig_ptr(p) = null;
    return p;
}

Section 145

A DISC_NODE, which occurs only in horizontal lists, specifies a “discretionary” line break. If such a break occurs at node p, the text that starts at pre_break(p) will precede the break, the text that starts at post_break(p) will follow the break, and text that appears in the next replace_count(p) nodes will be ignored. For example, an ordinary discretionary hyphen, indicated by ‘\-’, yields a DISC_NODE with pre_break pointing to a CHAR_NODE containing a hyphen, post_break = null, and replace_count = 0. All three of the discretionary texts must be lists that consist entirely of character, kern, box, rule, and ligature nodes.

If pre_break(p) = null, the ex_hyphen_penalty will be charged for this break. Otherwise the hyphen_penalty will be charged. The texts will actually be substituted into the list by the line-breaking algorithm if it decides to make the break, and the discretionary node will disappear at that time; thus, the output routine sees only discretionaries that were not chosen.

constants.h

#define DISC_NODE 7 // |type| of a discretionary node

datastructures.h

#define replace_count subtype // how many subsequent nodes to replace
#define pre_break     llink   // text that precedes a discretionary break
#define post_break    rlink   // text that follows a discretionary break

nodes.c

pointer new_disc() {
    pointer p = get_node(SMALL_NODE_SIZE); // the new node
    type(p) = DISC_NODE;
    replace_count(p) = 0;
    pre_break(p) = null;
    post_break(p) = null;
    return p;
}

Section 146

A WHATSIT_NODE is a wild card reserved for extensions to $T E X$ . The subtype field in its first word says what ‘whatsit’ it is, and implicitly determines the node size (which must be 2 or more) and the format of the remaining words. When a WHATSIT_NODE is encountered in a list, special actions are invoked; knowledgeable people who are careful not to mess up the rest of $T E X$ are able to make $T E X$ do new things by adding code at the end of the program. For example, there might be a ‘ $T E X$ nicolor’ extension to specify different colors of ink, and the whatsit node might contain the desired parameters.

The present implementation of $T E X$ treats the features associated with ‘\write’ and ‘\special’ as if they were extensions, in order to illustrate how such routines might be coded. We shall defer further discussion of extensions until the end of this program.

constants.h

#define WHATSIT_NODE 8 // |type| of special extension nodes

Section 147

A MATH_NODE, which occurs only in horizontal lists, appears before and after mathematical formulas. The subtype field is before before the formula and after after it. There is a width field, which represents the amount of surrounding space inserted by \mathsurround.

constants.h

#define MATH_NODE 9 // |type| of a math node
#define BEFORE    0 // |subtype| for math node that introduces a formula
#define AFTER     1 // |subtype| for math node that winds up a formula

nodes.c

pointer new_math(scaled w, small_number s) {
    pointer p = get_node(SMALL_NODE_SIZE); // the new node
    type(p) = MATH_NODE;
    subtype(p) = s;
    width(p) = w;
    return p;
}

Section 148

$T E X$ makes use of the fact that HLIST_NODE, VLIST_NODE, RULE_NODE, INS_NODE, MARK_NODE, ADJUST_NODE, LIGATURE_NODE, DISC_NODE, WHATSIT_NODE, and MATH_NODE are at the low end of the type codes, by permitting a break at glue in a list if and only if the type of the previous node is less than MATH_NODE. Furthermore, a node is discarded after a break if its type is MATH_NODE or more.

datastructures.h

#define precedes_break(X)  (type((X)) < MATH_NODE)
#define non_discardable(X) (type((X)) < MATH_NODE)

Section 149

A GLUE_NODE represents glue in a list. However, it is really only a pointer to a separate glue specification, since $T E X$ makes use of the fact that many essentially identical nodes of glue are usually present. If p points to a GLUE_NODE, glue_ptr(p) points to another packet of words that specify the stretch and shrink components, etc.

Glue nodes also serve to represent leaders; the subtype is used to distinguish between ordinary glue (which is called NORMAL) and the three kinds of leaders (which are called A_LEADERS, C_LEADERS, and X_LEADERS). The leader_ptr field points to a rule node or to a box node containing the leaders; it is set to null in ordinary glue nodes.

Many kinds of glue are computed from $T E X$ ’s “skip” parameters, and it is helpful to know which parameter has led to a particular glue node. Therefore the subtype is set to indicate the source of glue, whenever it originated as a parameter. We will be defining symbolic names for the parameter numbers later (e.g., LINE_SKIP_CODE = 0, BASELINE_SKIP_CODE = 1, etc.); it suffices for now to say that the subtype of parametric glue will be the same as the parameter number, plus one.

In math formulas there are two more possibilities for the subtype in a glue node: MU_GLUE denotes an \mskip (where the units are scaled mu instead of scaled pt); and COND_MATH_GLUE denotes the ‘\nonscript’ feature that cancels the glue node immediately following if it appears in a subscript.

constants.h

#define GLUE_NODE      10  // |type| of node that points to a glue specification
#define COND_MATH_GLUE 98  // special |subtype| to suppress glue in the next node
#define MU_GLUE        99  // |subtype| for math glue
#define A_LEADERS      100 // |subtype| for aligned leaders
#define C_LEADERS      101 // |subtype| for centered leaders
#define X_LEADERS      102 // |subtype| for expanded leaders

datastructures.h

#define glue_ptr   llink // pointer to a glue specification
#define leader_ptr rlink // pointer to box or rule node for leaders

Section 150

A glue specification has a halfword reference count in its first word, representing null plus the number of glue nodes that point to it (less one). Note that the reference count appears in the same position as the link field in list nodes; this is the field that is initialized to null when a node is allocated, and it is also the field that is flagged by EMPTY_FLAG in empty nodes.

Glue specifications also contain three scaled fields, for the width, stretch, and shrink dimensions. Finally, there are two one-byte fields called stretch_order and shrink_order; these contain the orders of infinity (NORMAL, FIL, FILL, or FILLL) corresponding to the stretch and shrink values.

constants.h

#define GLUE_SPEC_SIZE 4 // number of words to allocate for a glue specification
#define FIL            1 // first-order infinity
#define FILL           2 // second-order infinity
#define FILLL          3 // third-order infinity

datastructures.h

#define glue_ref_count(X) link((X))       // reference count of a glue specification
#define stretch(X)        mem[(X) + 2].sc // the stretchability of this glob of glue
#define shrink(X)         mem[(X) + 3].sc // the shrinkability of this glob of glue
#define stretch_order     type            // order of infinity for stretching
#define shrink_order      subtype         // order of infinity for shrinking

Section 151

Here is a function that returns a pointer to a copy of a glue spec. The reference count in the copy is null, because there is assumed to be exactly one reference to the new specification.

nodes.c

// duplicates a glue specification
pointer new_spec(pointer p) {
    pointer q = get_node(GLUE_SPEC_SIZE); // the new spec
    mem[q] = mem[p];
    glue_ref_count(q) = null;
    width(q) = width(p);
    stretch(q) = stretch(p);
    shrink(q) = shrink(p);
    return q;
}

Section 152

And here’s a function that creates a glue node for a given parameter identified by its code number; for example, new_param_glue(LINE_SKIP_CODE) returns a pointer to a glue node for the current \lineskip.

nodes.c

pointer new_param_glue(small_number n) {
    pointer p; // the new node
    pointer q; // the glue specification
    p = get_node(SMALL_NODE_SIZE);
    type(p) = GLUE_NODE;
    subtype(p) = n + 1;
    leader_ptr(p) = null;
    q = glue_par(n);
    glue_ptr(p) = q;
    incr(glue_ref_count(q));
    return p;
}

Section 153

Glue nodes that are more or less anonymous are created by new_glue, whose argument points to a glue specification.

nodes.c

pointer new_glue(halfword q) {
    pointer p = get_node(SMALL_NODE_SIZE); // the new node
    type(p) = GLUE_NODE;
    subtype(p) = NORMAL;
    leader_ptr(p) = null;
    glue_ptr(p) = q;
    incr(glue_ref_count(q));
    return p;
}

Section 154

Still another subroutine is needed: This one is sort of a combination of new_param_glue and new_glue. It creates a glue node for one of the current glue parameters, but it makes a fresh copy of the glue specification, since that specification will probably be subject to change, while the parameter will stay put. The global variable temp_ptr is set to the address of the new spec.

nodes.c

pointer new_skip_param(small_number n) {
    pointer p; // the new node
    temp_ptr = new_spec(glue_par(n));
    p = new_glue(temp_ptr);
    glue_ref_count(temp_ptr) = null;
    subtype(p) = n + 1;
    return p;
}

Section 155

A KERN_NODE has a width field to specify a (normally negative) amount of spacing. This spacing correction appears in horizontal lists between letters like A and V when the font designer said that it looks better to move them closer together or further apart. A kern node can also appear in a vertical list, when its ‘width’ denotes additional spacing in the vertical direction. The subtype is either NORMAL (for kerns inserted from font information or math mode calculations) or EXPLICIT (for kerns inserted from \kern and \/ commands) or ACC_KERN (for kerns inserted from non-math accents) or MU_GLUE (for kerns inserted from \mkern specifications in math formulas).

constants.h

#define KERN_NODE 11 // |type| of a kern node
#define EXPLICIT  1  // |subtype| of kern nodes from \kern and \/
#define ACC_KERN  2  // |subtype| of kern nodes from accents

Section 156

The new_kern function creates a kern node having a given width.

nodes.c

pointer new_kern(scaled w) {
    pointer p = get_node(SMALL_NODE_SIZE); // the new node
    type(p) = KERN_NODE;
    subtype(p) = NORMAL;
    width(p) = w;
    return p;
}

Section 157

A PENALTY_NODE specifies the penalty associated with line or page breaking, in its penalty field. This field is a fullword integer, but the full range of integer values is not used: Any penalty $\geq$ 10000 is treated as infinity, and no break will be allowed for such high values. Similarly, any penalty $\leq$ −10000 is treated as negative infinity, and a break will be forced.

constants.h

#define PENALTY_NODE  12             // |type| of a penalty node
#define INF_PENALTY   INF_BAD        // "infinite" penalty value
#define EJECT_PENALTY (-INF_PENALTY) // "negatively infinite" penalty value

datastructures.h

#define penalty(X) mem[(X) + 1].integer // the added cost of breaking a list here

Section 158

Anyone who has been reading the last few sections of the program will be able to guess what comes next.

nodes.c

pointer new_penalty(int m) {
    pointer p = get_node(SMALL_NODE_SIZE); // the new node
    type(p) = PENALTY_NODE;
    subtype(p) = 0; // the |subtype| is not used
    penalty(p) = m;
    return p;
}

Section 159

You might think that we have introduced enough node types by now. Well, almost, but there is one more: An UNSET_NODE has nearly the same format as an HLIST_NODE or VLIST_NODE; it is used for entries in \halign or \valign that are not yet in their final form, since the box dimensions are their “natural” sizes before any glue adjustment has been made. The glue_set word is not present; instead, we have a glue_stretch field, which contains the total stretch of order glue_order that is present in the hlist or vlist being boxed. Similarly, the shift_amount field is replaced by a glue_shrink field, containing the total shrink of order glue_sign that is present. The subtype field is called span_count; an unset box typically contains the data for span_count + 1 columns. Unset nodes will be changed to box nodes when alignment is completed.

constants.h

#define UNSET_NODE 13 // |type| for an unset node

datastructures.h

#define glue_stretch(X) mem[(X) + GLUE_OFFSET].sc // total stretch in an unset node
#define glue_shrink     shift_amount              // total shrink in an unset node
#define span_count      subtype                   // indicates the number of spanned columns

Section 160

In fact, there are still more types coming. When we get to math formula processing we will see that a STYLE_NODE has type = 14; and a number of larger type codes will also be defined, for use in math mode only.

Section 161

Warning: If any changes are made to these data structure layouts, such as changing any of the node sizes or even reordering the words of nodes, the copy_node_list procedure and the memory initialization code below may have to be changed. Such potentially dangerous parts of the program are listed in the index under ‘data structure assumptions’. However, other references to the nodes are made symbolically in terms of the WEB macro definitions above, so that format changes will leave $T E X$ ’s other algorithms intact.

TeX in C