Section 38: String handling
Control sequence names and diagnostic messages are variable-length strings of eight-bit characters. Since Pascal does not have a well-developed string mechanism, does all of its string processing by homegrown methods.
Elaborate facilities for dynamic strings are not needed, so all of the necessary operations can be handled with a simple data structure. The array str_pool contains all of the (eight-bit) ASCII codes in all of the strings, and the array str_start contains indices of the starting points of each string. Strings are referred to by integer numbers, so that string number s comprises the characters str_pool[j] for str_start[s] j str_start[s + 1]. Additional integer variables pool_ptr and str_ptr indicate the number of entries used so far in str_pool and str_start, respectively; locations str_pool[pool_ptr] and str_start[str_ptr] are ready for the next string to be allocated.
String numbers 0 to 255 are reserved for strings that correspond to single ASCII characters.
This is in accordance with the conventions of WEB
, which converts single-character strings into the ASCII code number of the single character involved, while it converts other strings into integers and builds a string pool file.
Thus, when the string constant "."
appears in the program below, WEB
converts it into the integer 46, which is the ASCII code for a period, while WEB
will convert a string like "hello"
into some integer greater than 255.
String number 46 will presumably be the single character ‘.
’;
but some ASCII codes have no standard visible representation, and sometimes needs to be able to print an arbitrary ASCII character, so the first 256 strings are used to specify exactly what should be printed for each of the 256 possibilities.
Elements of the str_pool array must be ASCII codes that can actually be printed; i.e., they must have an xchr equivalent in the local character set. (This restriction applies only to preloaded strings, not to those generated dynamically by the user.)
Some Pascal compilers won’t pack integers into a single byte unless the integers lie in the range −128 .. 127. To accommodate such systems we access the string pool only via macros that can easily be redefined.
Strings in the source code are mostly handled as
char *
and not howWEB
used to do (using an external file to copy during preprocessing, and then loaded them in the string pool).The si and so macros are not needed, and packed_ASCII_code is not introduced.
⟨ Types in the outer block 18 ⟩+≡
typedef int pool_pointer; // for variables that point into |str_pool|
typedef int str_number; // for variables that point into |str_start|
Section 39
⟨ Global variables 13 ⟩+≡
ASCII_code str_pool[POOL_SIZE + 1]; // the characters
pool_pointer str_start[MAX_STRINGS + 1]; // the starting pointers
pool_pointer pool_ptr; // first unused position in |str_pool|
str_number str_ptr; // number of the current string being created
pool_pointer init_pool_ptr; // the starting value of |pool_ptr|
str_number init_str_ptr; // the starting value of |str_ptr|
Section 40
Several of the elementary string operations are performed using WEB
macros instead of Pascal procedures, because many of the operations are done quite frequently and we want to avoid the overhead of procedure calls.
For example, here is a simple macro that computes the length of a string.
// << Start file |strings.h|, 1381 >>
#define length(X) (str_start[(X) + 1] - str_start[(X)]) // the number of characters in string number X
Section 41
The length of the current string is called cur_length:
#define cur_length (pool_ptr - str_start[str_ptr])
Section 42
Strings are created by appending character codes to str_pool. The append_char macro, defined here, does not check to see if the value of pool_ptr has gotten too high; this test is supposed to be made before append_char is used. There is also a flush_char macro, which erases the last character appended.
To test if there is room to append l more characters to str_pool, we shall write str_room(l), which aborts and gives an apologetic error message if there isn’t enough room.
// put |ASCII_code| X at the end of |str_pool|
#define append_char(X) str_pool[pool_ptr] = (X); incr(pool_ptr)
// forget the last character in the pool
#define flush_char decr(pool_ptr)
// make sure that the pool hasn't overflowed
#define str_room(X) \
do { \
if (pool_ptr + (X) > POOL_SIZE) { \
overflow("pool size", POOL_SIZE - init_pool_ptr); \
} \
} while (0)
Section 43
Once a sequence of characters has been appended to str_pool, it officially becomes a string when the function make_string is called. This function returns the identification number of the new string as its value.
// current string enters the pool
str_number make_string() {
if (str_ptr == MAX_STRINGS) {
overflow("number of strings", MAX_STRINGS - init_str_ptr);
}
incr(str_ptr);
str_start[str_ptr] = pool_ptr;
return str_ptr - 1;
}
Section 44
To destroy the most recently made string, we say flush_string.
#define flush_string decr(str_ptr); pool_ptr = str_start[str_ptr]
Section 45
The following subroutine compares string s with another string of the same length that appears in buffer starting at position k; the result is true if and only if the strings are equal. Empirical tests indicate that str_eq_buf is used in such a way that it tends to return true about 80 percent of the time.
// test equality of strings
int str_eq_buf(str_number s, int k) {
pool_pointer j; // running index
bool result = true; // result of comparison;
j = str_start[s];
while (j < str_start[s + 1]) {
if (str_pool[j] != buffer[k]) {
result = false;
break; // Goto not_found
}
incr(j);
incr(k);
}
// not_found:
return result;
}
Section 46
Here is a similar routine, but it compares two strings in the string pool, and it does not assume that they have the same length.
// test equality of strings
int str_eq_str(str_number s, str_number t) {
pool_pointer j, k; // running indices
bool result = false; // result of comparison
if (length(s) != length(t)) {
goto not_found;
}
j = str_start[s];
k = str_start[t];
while (j < str_start[s + 1]) {
if (str_pool[j] != str_pool[k]) {
goto not_found;
}
incr(j);
incr(k);
}
result = true;
not_found:
return result;
}
Section 47
The initial values of str_pool, str_start, pool_ptr, and str_ptr are computed by the INITEX
program, based in part on the information that WEB
has output while processing .
Most of the internal strings in the source code are treated as
char *
instead of using theTEX.POOL
file thatWEB
creates while tangling. Though, some of them are still needed in str_pool which is taken care of in ⟨ Read the other strings 51 ⟩.
#ifdef INIT
// initializes the string pool
void get_strings_started() {
int k, l; // small indices or counters
pool_ptr = 0;
str_ptr = 0;
str_start[0] = 0;
// << Make the first 256 strings, 48 >>
// << Read the other strings, 51 >>
}
#endif
Section 48
#define app_lc_hex(X) \
do { \
l = (X); \
if (l < 10) { \
append_char(l + '0'); \
} \
else { \
append_char(l - 10 + 'a'); \
} \
} while (0)
⟨ Make the first 256 strings 48 ⟩≡
for (k = 0; k < 256; k++) {
if (k < ' ' || k > '~') {
append_char('^');
append_char('^');
if (k < 64) {
append_char(k + 64);
}
else if (k < 128) {
append_char(k - 64);
}
else {
app_lc_hex(k / 16);
app_lc_hex(k % 16);
}
}
else {
append_char(k);
}
make_string();
}
Section 49
The first 128 strings will contain 95 standard ASCII characters, and the other 33 characters will be printed in three-symbol form like ‘^^A
’ unless a system-dependent change is made here. Installations that have an extended character set, where for example XCHR[26] = ‘≠
’, would like string 26 to be the single character 26 instead of the three characters 94, 94, 90 (^^Z
).
On the other hand, even people with an extended character set will want to represent string 13 by ^^M
, since 13 is CARRIAGE_RETURN;
the idea is to produce visible strings instead of tabs or line-feeds or carriage-returns or bell-rings or characters that are treated anomalously in text files.
Unprintable characters of codes 128–255 are, similarly, rendered ^^80
–^^ff
.
The boolean expression defined here should be true unless internal code number k corresponds to a non-troublesome visible symbol in the local character set. An appropriate formula for the extended character set recommended in The TeXbook would, for example, be k [0, 8 .. 10, 12, 13, 27, 127 .. 255]. If character k cannot be printed, and k 128, then character k + 64 or k − 64 must be printable; moreover, ASCII codes [33 .. 38, 48 .. 57, 94, 97 .. 102, 112 .. 121] must be printable. Thus, at least 80 printable characters are needed.
The condition ⟨ Character |k| cannot be printed, 49 ⟩ is already included in section 48.
Section 50
When the WEB
system program called TANGLE
processes the TEX.WEB
description that you are now reading, it outputs the Pascal program TEX.PAS
and also a string pool file called TEX.POOL
.
The INITEX
program reads the latter file, where each string appears as a two-digit decimal length followed by the string itself, and the information is recorded in ’s string memory.
The pool file is not used.
Section 51
We don’t read from a pool file, but some strings are still needed to be added in the string pool.
This is a procedure that adds a
char *
string in the pool, with an example to add the empty string.
str_number put_string(char *s) {
int n = strlen(s);
str_room(n);
memcpy(&str_pool[pool_ptr], s, n);
pool_ptr += n;
return make_string();
}
⟨ Read the other strings 51 ⟩≡
put_string(""); // Empty string, its number should be 256
// << Internal strings numbers in the pool, 51 >>
⟨ Internal strings numbers in the pool 51 ⟩≡
#define EMPTY_STRING 256 // ""
Section 52
The pool file is not used.
Section 53
The WEB
operation @$
denotes the value that should be at the end of this TEX.POOL
file; any other value means that the wrong pool file has been loaded.
The pool file is not used.