Section 17: The character set

In order to make TeX readily portable to a wide variety of computers, all of its input text is converted to an internal eight-bit code that includes standard ASCII, the “American Standard Code for Information Interchange”. This conversion is done immediately when each character is read in. Conversely, characters are converted from ASCII to the user’s external representation just before they are output to a text file.

Such an internal code is relevant to users of TeX primarily because it governs the positions of characters in the fonts. For example, the character ‘A’ has ASCII code 65, and when TeX typesets this letter it specifies character number 65 in the current font. If that font actually has ‘A’ in a different position, TeX doesn’t know what the real position is; the program that does the actual printing from TeX’s device-independent files is responsible for converting from ASCII to a particular font encoding.

TeX’s internal code also defines the value of constants that begin with a reverse apostrophe; and it provides an index to the \catcode, \mathcode, \uccode, \lccode, and \delcode tables.

Section 18

Characters of text that have been converted to TeX’s internal form are said to be of type ASCII_code, which is a subrange of the integers.

⟨ Types in the outer block 18 ⟩≡

typedef unsigned char ASCII_code; // eight-bit numbers

Section 19

The original Pascal compiler was designed in the late 60s, when six-bit character sets were common, so it did not make provision for lowercase letters. Nowadays, of course, we need to deal with both capital and small letters in a convenient way, especially in a program for typesetting; so the present specification of TeX has been written under the assumption that the Pascal compiler and run-time system permit the use of text files with more than 64 distinguishable characters. More precisely, we assume that the character set contains at least the letters and symbols associated with ASCII codes 32 through 126; all of these characters are now available on most computer terminals.

Since we are dealing with more characters than were present in the first Pascal compilers, we have to decide what to call the associated data type. Some Pascals use the original name char for the characters in text files, even though there now are more than 64 such characters, while other Pascals consider char to be a 64-element subrange of a larger data type that has some other name.

In order to accommodate this difference, we shall use the name text_char to stand for the data type of the characters that are converted to and from ASCII_code when they are input and output. We shall also assume that text_char consists of the elements chr(FIRST_TEXT_CHAR) through chr(LAST_TEXT_CHAR), inclusive. The following definitions should be adjusted if necessary.

NOTE

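The type text_char is not kept in the C version; external characters are stored as unsigned char (see the declaration of XCHR in section 20).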

constants.h

#define FIRST_TEXT_CHAR 0   // ordinal number of the smallest element of |text_char|
#define LAST_TEXT_CHAR  255 // ordinal number of the largest element of |text_char|

⟨ Local variables for initialization 19 ⟩≡

// int i; not used

Section 20

The TeX processor converts between ASCII code and the user’s external character set by means of arrays XORD and XCHR that are analogous to Pascal’s ord and chr functions.

NOTE

The two arrays are defined in file strings.c, with extern declarations in tex.h, rather than alongside the other global variables.

tex.h

extern const ASCII_code XORD[256];
extern const unsigned char XCHR[256];

strings.c

// << Start file |strings.c|, 1382 >>

// specifies conversion of input characters
const ASCII_code XORD[256] = {
    0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f,
    0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f,
    0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f,
    0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f,
    0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27,
    0x28, 0x29, 0x2a, 0x2b, 0x2c, 0x2d, 0x2e, 0x2f,
    0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37,
    0x38, 0x39, 0x3a, 0x3b, 0x3c, 0x3d, 0x3e, 0x3f,
    0x40, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47,
    0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f,
    0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57,
    0x58, 0x59, 0x5a, 0x5b, 0x5c, 0x5d, 0x5e, 0x5f,
    0x60, 0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67,
    0x68, 0x69, 0x6a, 0x6b, 0x6c, 0x6d, 0x6e, 0x6f,
    0x70, 0x71, 0x72, 0x73, 0x74, 0x75, 0x76, 0x77,
    0x78, 0x79, 0x7a, 0x7b, 0x7c, 0x7d, 0x7e, 0x7f,
    0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f,
    0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f,
    0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f,
    0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f,
    0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f,
    0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f,
    0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f,
    0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f,
    0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f,
    0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f,
    0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f,
    0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f,
    0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f,
    0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f,
    0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f,
    0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f,
};

// specifies conversion of output characters
const unsigned char XCHR[256] = {
    ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
    ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
    ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
    ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
    ' ', '!', '"', '#', '$', '%', '&', '\'',
    '(', ')', '*', '+', ',', '-', '.', '/',
    '0', '1', '2', '3', '4', '5', '6', '7',
    '8', '9', ':', ';', '<', '=', '>', '?',
    '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G',
    'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O',
    'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W',
    'X', 'Y', 'Z', '[', '\\', ']', '^', '_',
    '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g',
    'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o',
    'p', 'q', 'r', 's', 't', 'u', 'v', 'w',
    'x', 'y', 'z', '{', '|', '}', '~', ' ',
    ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
    ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
    ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
    ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
    ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
    ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
    ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
    ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
    ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
    ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
    ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
    ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
    ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
    ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
    ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
    ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '
};
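To illustrate how such tables are used, here is a minimal sketch of converting a buffer on input and on output. The function names `line_to_internal` and `line_to_external` are hypothetical and do not come from this port; the tables are passed as parameters so that the fragment stands on its own.

```c
#include <stddef.h>

typedef unsigned char ASCII_code; // eight-bit numbers, as in section 18

// Convert n external characters to internal codes, one table lookup each
// (hypothetical helper, not part of the port).
static void line_to_internal(const ASCII_code xord[256],
                             const unsigned char *ext, ASCII_code *buf, size_t n)
{
    for (size_t k = 0; k < n; k++)
        buf[k] = xord[ext[k]];
}

// Convert n internal codes back to external characters just before output
// (hypothetical helper, not part of the port).
static void line_to_external(const unsigned char xchr[256],
                             const ASCII_code *buf, unsigned char *ext, size_t n)
{
    for (size_t k = 0; k < n; k++)
        ext[k] = xchr[buf[k]];
}
```

With the tables declared above, a call such as `line_to_internal(XORD, ext, buf, n)` would map every character outside the visible ASCII range to code 127, and `line_to_external(XCHR, buf, ext, n)` would write such codes out as blanks.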

Section 21

Since we are assuming that our Pascal system is able to read and write the visible characters of standard ASCII (although not necessarily using the ASCII codes to represent them), the following assignment statements initialize the standard part of the XCHR array properly, without needing any system-dependent changes. On the other hand, it is possible to implement TeX with less complete character sets, and in such cases it will be necessary to change something here.

NOTE

This initialization is done by the declarations of XORD and XCHR in the previous section. An empty fenced block is kept here so that the block ⟨ Set initial values of key variables 21 ⟩ retains its section number from the original code.
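For comparison, the assignment statements described above could be written at run time roughly as follows. This is only a sketch: the names `VISIBLE` and `set_standard_xchr` are hypothetical, and the port itself uses the static table of section 20. Because the characters are written as literals, the compiler supplies whatever external codes the host uses for them.

```c
// The 95 visible ASCII characters in code order (32 through 126),
// written as character literals of the host system.
static const char VISIBLE[] =
    " !\"#$%&'()*+,-./0123456789:;<=>?"
    "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_"
    "`abcdefghijklmnopqrstuvwxyz{|}~";

// Fill the standard part of an xchr table; entries outside 32..126
// are left untouched (hypothetical helper, not part of the port).
static void set_standard_xchr(unsigned char xchr[256])
{
    for (int k = 0; k < 95; k++)
        xchr[32 + k] = (unsigned char)VISIBLE[k];
}
```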

⟨ Set initial values of key variables 21 ⟩≡

Section 22

Some of the ASCII codes without visible characters have been given symbolic names in this program because they are used with a special meaning.

constants.h

#define NULL_CODE       0   // ASCII code that might disappear
#define CARRIAGE_RETURN 13  // ASCII code used at end of line
#define INVALID_CODE    127 // ASCII code that many systems prohibit in text files

Section 23

The ASCII code is “standard” only to a certain extent, since many computer installations have found it advantageous to have ready access to more than 94 printing characters. Appendix C of The TeXbook gives a complete specification of the intended correspondence between characters and TeX’s internal representation.

If TeX is being used on a garden-variety Pascal for which only standard ASCII codes will appear in the input and output files, it doesn’t really matter what codes are specified in XCHR[0 .. 31], but the safest policy is to blank everything out by using the code shown below.

However, other settings of XCHR will make TeX more friendly on computers that have an extended character set, so that users can type things like ‘≠’ instead of ‘\ne’. People with extended character sets can assign codes arbitrarily, giving an XCHR equivalent to whatever characters the users of TeX are allowed to have in their input files. It is best to make the codes correspond to the intended interpretations as shown in Appendix C whenever possible; but this is not necessary. For example, in countries with an alphabet of more than 26 letters, it is usually best to map the additional letters into codes less than 32. To get the most “permissive” character set, change ‘␣’ on the right of these assignment statements to chr(i).

NOTE

Spaces are included in the table XCHR at declaration in section 20.
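The “permissive” variant mentioned above can be sketched as follows. The function name `make_permissive` is hypothetical; since XCHR is declared const in this port, the same effect would in practice be obtained by editing the initializer in strings.c or by working on a writable copy.

```c
// Replace the blanks outside the visible range by the identity mapping,
// i.e. the C counterpart of writing chr(i) instead of a space
// (hypothetical helper, not part of the port).
static void make_permissive(unsigned char xchr[256])
{
    int i;
    for (i = 0; i < 32; i++)        // codes below 32
        xchr[i] = (unsigned char)i;
    for (i = 127; i < 256; i++)     // code 127 and the upper half
        xchr[i] = (unsigned char)i;
}
```

Entries 32 through 126 are not touched, so the standard part of the table set up in section 21 stays as it is.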

Section 24

The following system-independent code makes the XORD array contain a suitable inverse to the information in XCHR. Note that if XCHR[i] = XCHR[j] where i < j < 127, the value of XORD[XCHR[i]] will turn out to be j or more; hence, standard ASCII code numbers will be used instead of codes below 32 in case there is a coincidence.

NOTE

Done at declaration in section 20.
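A run-time construction of such an inverse might look like the sketch below, which is modelled on the loops of the original Pascal program rather than taken from this port. The function name `build_xord` is hypothetical. Filling codes 0 through 126 last, in increasing order, is what makes the larger code win when two entries of xchr coincide.

```c
typedef unsigned char ASCII_code; // eight-bit numbers, as in section 18

#define INVALID_CODE 127 // as in section 22

// Build xord as an inverse of xchr in three passes
// (hypothetical helper, not part of the port).
static void build_xord(const unsigned char xchr[256], ASCII_code xord[256])
{
    int i;
    for (i = 0; i < 256; i++)       // start with every input character invalid
        xord[i] = INVALID_CODE;
    for (i = 128; i < 256; i++)     // upper half first, so it has lowest priority
        xord[xchr[i]] = (ASCII_code)i;
    for (i = 0; i < 127; i++)       // then 0..126; later (larger) codes overwrite
        xord[xchr[i]] = (ASCII_code)i;
}
```

Applied to an xchr that is blank outside codes 32 through 126, this gives the blank character the internal code 32 and leaves every other non-visible input character at 127, which matches the XORD table declared in section 20.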

TeX in C