Character/Glyph Index Mapping

cmap - Character To Glyph Index Mapping Table

This table defines the mapping of character codes to the glyph index values used in the font. It may contain more than one subtable, in order to support more than one character encoding scheme. Character codes that do not correspond to any glyph in the font should be mapped to glyph index 0. The glyph at this location must be a special glyph representing a missing character.

The table header indicates the character encodings for which subtables are present. Each subtable is in one of four possible formats and begins with a format code indicating the format used.

The platform ID and platform-specific encoding ID together identify a subtable; this means that each platform ID/platform-specific encoding ID pair may appear only once in the cmap table. Each subtable can specify a different character encoding. (See the 'name' table section.) The entries must be sorted first by platform ID and then by platform-specific encoding ID.

When building a Unicode font for Windows, the platform ID should be 3 and the encoding ID should be 1. When building a symbol font for Windows, the platform ID should be 3 and the encoding ID should be 0. When building a font that will be used on the Macintosh, the platform ID should be 1 and the encoding ID should be 0.

All Microsoft Unicode encodings (Platform ID = 3, Encoding ID = 1) must use Format 4 for their 'cmap' subtable. Microsoft strongly recommends using a Unicode 'cmap' for all fonts. However, some other encodings that appear in current fonts follow:

Platform ID	Encoding ID	Description
3	0	Symbol
3	1	Unicode
3	2	ShiftJIS
3	3	Big5
3	4	PRC
3	5	Wansung
3	6	Johab

The Character To Glyph Index Mapping Table is organized as follows:

Type	Description
USHORT	Table version number (0).
USHORT	Number of encoding tables, n.

This is followed by an entry for each of the n encoding tables, specifying the particular encoding and the offset to the actual subtable:

Type	Description
USHORT	Platform ID.
USHORT	Platform-specific encoding ID.
ULONG	Byte offset from beginning of table to the subtable for this encoding.
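
The header and encoding entries described above can be walked with a few lines of C. The following is a minimal sketch, not a reference implementation: the names read_u16, read_u32, and cmap_find_subtable are hypothetical, the table is assumed to be already loaded into memory, and values are read as big-endian, as in all TrueType data. It uses a simple linear scan, although the sorted entries would also allow a binary search.

```c
#include <stddef.h>
#include <stdint.h>

/* Read big-endian 16-bit and 32-bit values from a byte buffer. */
static uint16_t read_u16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t read_u32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns the byte offset (from the beginning of the cmap table) of the
 * subtable for the given platform ID / encoding ID pair, or 0 if no such
 * entry exists. 0 is safe as a "not found" value because a real subtable
 * always lies beyond the 4-byte header. */
uint32_t cmap_find_subtable(const uint8_t *cmap, size_t cmap_len,
                            uint16_t platform_id, uint16_t encoding_id)
{
    if (cmap_len < 4)
        return 0;
    uint16_t n = read_u16(cmap + 2);          /* number of encoding tables */
    if (cmap_len < 4 + (size_t)n * 8)
        return 0;
    for (uint16_t i = 0; i < n; i++) {
        const uint8_t *entry = cmap + 4 + (size_t)i * 8;
        if (read_u16(entry) == platform_id &&
            read_u16(entry + 2) == encoding_id) {
            uint32_t offset = read_u32(entry + 4);
            return offset < cmap_len ? offset : 0;
        }
    }
    return 0;
}
```

Under these assumptions, a caller looking for the Windows Unicode subtable would pass platform ID 3 and encoding ID 1.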

Format 0: Byte encoding table

This is the Apple standard character to glyph index mapping table.

Type	Name	Description
USHORT	format	Format number is set to 0.
USHORT	length	This is the length in bytes of the subtable.
USHORT	version	Version number (starts at 0).
BYTE	glyphIdArray[256]	An array that maps character codes to glyph index values.

This is a simple 1 to 1 mapping of character codes to glyph indices. The glyph set is limited to 256. Note that if this format is used to index into a larger glyph set, only the first 256 glyphs will be accessible.
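
As an illustration, a format 0 lookup reduces to a single array access. This sketch assumes the subtable bytes are already in memory; the function name cmap_format0_lookup is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* sub points at the start of a format 0 subtable:
 * format (2 bytes), length (2), version (2), glyphIdArray[256]. */
uint16_t cmap_format0_lookup(const uint8_t *sub, size_t sub_len, uint32_t code)
{
    const size_t header = 6;   /* three USHORT fields precede the array */
    if (sub_len < header + 256 || code > 255)
        return 0;              /* glyph index 0 is the missing glyph */
    return sub[header + code];
}
```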

Format 2: High-byte mapping through table

This subtable is useful for the national character code standards used for Japanese, Chinese, and Korean characters. These code standards use a mixed 8/16-bit encoding, in which certain byte values signal the first byte of a 2-byte character (but these values are also legal as the second byte of a 2-byte character).

In addition, even for the 2-byte characters, the mapping of character codes to glyph index values depends heavily on the first byte. Consequently, the table begins with an array that maps the first byte to a 4-word subHeader. For 2-byte character codes, the subHeader is used to map the second byte's value through a subArray, as described below. When processing mixed 8/16-bit text, subHeader 0 is special: it is used for single-byte character codes. When subHeader zero is used, a second byte is not needed; the single byte value is mapped through the subArray.

Type	Name	Description
USHORT	format	Format number is set to 2.
USHORT	length	Length in bytes.
USHORT	version	Version number (starts at 0)
USHORT	subHeaderKeys[256]	Array that maps high bytes to subHeaders: value is subHeader index * 8.
4 words struct	subHeaders[ ]	Variable-length array of subHeader structures.
USHORT	glyphIndexArray[ ]	Variable-length array containing subarrays used for mapping the low byte of 2-byte characters.

A subHeader is structured as follows:

Type	Name	Description
USHORT	firstCode	First valid low byte for this subHeader.
USHORT	entryCount	Number of valid low bytes for this subHeader.
SHORT	idDelta	See text below.
USHORT	idRangeOffset	See text below.

The firstCode and entryCount values specify a subrange that begins at firstCode and has a length equal to the value of entryCount. This subrange stays within the 0-255 range of the byte being mapped. Bytes outside of this subrange are mapped to glyph index 0 (missing glyph). The offset of the byte within this subrange is then used as an index into a corresponding subarray of glyphIndexArray. This subarray is also of length entryCount. The value of the idRangeOffset is the number of bytes past the actual location of the idRangeOffset word where the glyphIndexArray element corresponding to firstCode appears.

Finally, if the value obtained from the subarray is not 0 (which indicates the missing glyph), you should add idDelta to it in order to get the glyphIndex. The value idDelta permits the same subarray to be used for several different subheaders. The idDelta arithmetic is modulo 65536.
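
The format 2 procedure can be sketched in C as follows. This is one possible reading of the description above, not a definitive implementation. The names f2_u16 and cmap_format2_lookup and the bytes_used output parameter are hypothetical. The caller is assumed to supply the next two bytes of text; the second byte is ignored when subHeader 0 applies.

```c
#include <stddef.h>
#include <stdint.h>

static uint16_t f2_u16(const uint8_t *p)   /* big-endian 16-bit read */
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Maps one character from mixed 8/16-bit text. *bytes_used (if not NULL)
 * receives 1 for a single-byte code and 2 for a 2-byte code. */
uint16_t cmap_format2_lookup(const uint8_t *sub, size_t sub_len,
                             uint8_t first, uint8_t second, int *bytes_used)
{
    const size_t keys = 6;              /* subHeaderKeys[256] follows 3 USHORTs */
    const size_t headers = keys + 512;  /* subHeaders[] follows the keys */
    if (bytes_used)
        *bytes_used = 1;
    if (sub_len < headers + 8)
        return 0;

    /* The key is already subHeader index * 8, i.e. a byte offset. */
    uint16_t key = f2_u16(sub + keys + 2 * (size_t)first);
    uint8_t b = first;                  /* subHeader 0: map the single byte */
    if (key != 0) {
        b = second;                     /* otherwise map the low byte */
        if (bytes_used)
            *bytes_used = 2;
    }
    size_t sh = headers + key;
    if (sh + 8 > sub_len)
        return 0;

    uint16_t first_code = f2_u16(sub + sh);
    uint16_t entry_count = f2_u16(sub + sh + 2);
    uint16_t id_delta = f2_u16(sub + sh + 4);  /* SHORT; modulo 65536 math */
    uint16_t id_range_offset = f2_u16(sub + sh + 6);

    if (b < first_code || (uint32_t)(b - first_code) >= entry_count)
        return 0;                       /* outside the subrange */

    /* idRangeOffset counts bytes from the idRangeOffset word itself. */
    size_t pos = sh + 6 + id_range_offset + 2 * (size_t)(b - first_code);
    if (pos + 2 > sub_len)
        return 0;
    uint16_t g = f2_u16(sub + pos);
    return g == 0 ? 0 : (uint16_t)(g + id_delta);
}
```

Storing idDelta as an unsigned 16-bit value and truncating the sum gives the modulo 65536 arithmetic the text calls for.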

Format 4: Segment mapping to delta values

This is the Microsoft standard character to glyph index mapping table.

This format is used when the character codes for the characters represented by a font fall into several contiguous ranges, possibly with holes in some or all of the ranges (that is, some of the codes in a range may not have a representation in the font). The format-dependent data is divided into three parts, which must occur in the following order:

A four-word header gives parameters for an optimized search of the segment list;
Four parallel arrays describe the segments (one segment for each contiguous range of codes);
A variable-length array of glyph IDs (unsigned words).

Type	Name	Description
USHORT	format	Format number is set to 4.
USHORT	length	Length in bytes.
USHORT	version	Version number (starts at 0).
USHORT	segCountX2	2 x segCount.
USHORT	searchRange	2 x (2**floor(log2(segCount)))
USHORT	entrySelector	log2(searchRange/2)
USHORT	rangeShift	2 x segCount - searchRange
USHORT	endCount[segCount]	End characterCode for each segment, last =0xFFFF.
USHORT	reservedPad	Set to 0.
USHORT	startCount[segCount]	Start character code for each segment.
SHORT	idDelta[segCount]	Delta for all character codes in segment.
USHORT	idRangeOffset[segCount]	Offsets into glyphIdArray or 0
USHORT	glyphIdArray[ ]	Glyph index array (arbitrary length)

The number of segments is specified by segCount, which is not explicitly in the header; however, all of the header parameters are derived from it. The searchRange value is twice the largest power of 2 that is less than or equal to segCount. For example, if segCount=39, we have the following:

segCountX2	78
searchRange	64	(2 * largest power of 2 <= 39)
entrySelector	5	log₂(32)
rangeShift	14	2 x 39 - 64
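
The derivation of the four header parameters from segCount can be written out as a short C function. This is a sketch only; the struct and function names are hypothetical, and segCount is assumed to be small enough that 2 x segCount fits in a USHORT.

```c
#include <stdint.h>

struct cmap4_search_params {
    uint16_t seg_count_x2;
    uint16_t search_range;
    uint16_t entry_selector;
    uint16_t range_shift;
};

struct cmap4_search_params cmap4_compute_search_params(uint16_t seg_count)
{
    struct cmap4_search_params p = {0, 0, 0, 0};
    uint16_t pow2 = 1;       /* largest power of 2 <= seg_count */
    uint16_t exponent = 0;   /* log2 of that power */
    if (seg_count == 0)
        return p;
    while ((uint32_t)pow2 * 2 <= seg_count) {
        pow2 = (uint16_t)(pow2 * 2);
        exponent++;
    }
    p.seg_count_x2 = (uint16_t)(2 * seg_count);
    p.search_range = (uint16_t)(2 * pow2);
    p.entry_selector = exponent;           /* log2(searchRange / 2) */
    p.range_shift = (uint16_t)(p.seg_count_x2 - p.search_range);
    return p;
}
```

For segCount = 39 this yields 78, 64, 5, and 14, matching the values listed above.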

Each segment is described by a startCode and endCode (stored in the startCount and endCount arrays of the table above), along with an idDelta and an idRangeOffset, which are used for mapping the character codes in the segment. The segments are sorted in order of increasing endCode values, and the segment values are specified in four parallel arrays. You search for the first endCode that is greater than or equal to the character code you want to map. If the corresponding startCode is less than or equal to the character code, then you use the corresponding idDelta and idRangeOffset to map the character code to a glyph index (otherwise, the missingGlyph is returned). For the search to terminate, the final endCode value must be 0xFFFF. This segment need not contain any valid mappings. (It can just map the single character code 0xFFFF to missingGlyph.) However, the segment must be present.

If the idRangeOffset value for the segment is not 0, the mapping of character codes relies on glyphIdArray. The character code offset from startCode is added to the idRangeOffset value. This sum is used as an offset from the current location within idRangeOffset itself to index out the correct glyphIdArray value. This obscure indexing trick works because glyphIdArray immediately follows idRangeOffset in the font file. The C expression that yields the glyph index is:

*(idRangeOffset[i]/2 + (c - startCount[i]) + &idRangeOffset[i])

The value c is the character code in question, and i is the segment index in which c appears. If the value obtained from the indexing operation is not 0 (which indicates missingGlyph), idDelta[i] is added to it to get the glyph index. The idDelta arithmetic is modulo 65536.

If the idRangeOffset is 0, the idDelta value is added directly to the character code offset (i.e. idDelta[i] + c) to get the corresponding glyph index. Again, the idDelta arithmetic is modulo 65536.
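
Putting the segment search and both mapping cases together, a format 4 lookup might look like the sketch below. The names f4_u16 and cmap_format4_lookup are hypothetical. For brevity it scans the segments linearly rather than using the binary search that searchRange, entrySelector, and rangeShift are meant to support. It works on byte offsets into the subtable, so the pointer arithmetic of the C expression above becomes an explicit offset from the position of idRangeOffset[i].

```c
#include <stddef.h>
#include <stdint.h>

static uint16_t f4_u16(const uint8_t *p)   /* big-endian 16-bit read */
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

uint16_t cmap_format4_lookup(const uint8_t *sub, size_t sub_len, uint16_t c)
{
    if (sub_len < 16)
        return 0;
    size_t seg_x2 = f4_u16(sub + 6);               /* segCountX2 */
    const size_t end_count = 14;                   /* after 7 header words */
    size_t start_count = end_count + seg_x2 + 2;   /* skip reservedPad */
    size_t id_delta = start_count + seg_x2;
    size_t id_range_offset = id_delta + seg_x2;
    if (sub_len < id_range_offset + seg_x2)
        return 0;

    /* Find the first segment whose end code is >= c. */
    for (size_t i = 0; i < seg_x2; i += 2) {
        if (f4_u16(sub + end_count + i) < c)
            continue;
        uint16_t start = f4_u16(sub + start_count + i);
        if (start > c)
            return 0;                              /* c falls in a gap */
        uint16_t delta = f4_u16(sub + id_delta + i);
        uint16_t range = f4_u16(sub + id_range_offset + i);
        if (range == 0)
            return (uint16_t)(c + delta);          /* modulo 65536 */
        /* Offset is relative to the idRangeOffset[i] word itself. */
        size_t pos = id_range_offset + i + range + 2 * (size_t)(c - start);
        if (pos + 2 > sub_len)
            return 0;
        uint16_t g = f4_u16(sub + pos);
        return g == 0 ? 0 : (uint16_t)(g + delta);
    }
    return 0;
}
```

Keeping idDelta in an unsigned 16-bit variable and truncating the sum is one way to get the modulo 65536 behavior without relying on signed overflow.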

As an example, the variant part of the table to map characters 10-20, 30-90, and 100-153 onto a contiguous range of glyph indices may look like this:

segCountX2:	8
searchRange:	8
entrySelector:	2
rangeShift:	0
endCode:	20	90	153	0xFFFF
reservedPad:	0
startCode:	10	30	100	0xFFFF
idDelta:	-9	-18	-27	1
idRangeOffset:	0	0	0	0

This table performs the following mappings:

10 -> 10 - 9 = 1
20 -> 20 - 9 = 11
30 -> 30 - 18 = 12
90 -> 90 - 18 = 72
...and so on.

Note that the delta values could be reworked so as to reorder the segments.

Format 6: Trimmed table mapping

Type	Name	Description
USHORT	format	Format number is set to 6.
USHORT	length	Length in bytes.
USHORT	version	Version number (starts at 0)
USHORT	firstCode	First character code of subrange.
USHORT	entryCount	Number of character codes in subrange.
USHORT	glyphIdArray [entryCount]	Array of glyph index values for character codes in the range.

The firstCode and entryCount values specify a subrange (beginning at firstCode, length = entryCount) within the range of possible character codes. Codes outside of this subrange are mapped to glyph index 0. The offset of the code (from the first code) within this subrange is used as an index into the glyphIdArray, which provides the glyph index value.
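
A format 6 lookup is a bounds check followed by an array access, as in this sketch. The names f6_u16 and cmap_format6_lookup are hypothetical, and the subtable is assumed to be already in memory.

```c
#include <stddef.h>
#include <stdint.h>

static uint16_t f6_u16(const uint8_t *p)   /* big-endian 16-bit read */
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* sub points at the start of a format 6 subtable: format, length, version,
 * firstCode, entryCount (2 bytes each), then glyphIdArray[entryCount]. */
uint16_t cmap_format6_lookup(const uint8_t *sub, size_t sub_len, uint32_t code)
{
    if (sub_len < 10)
        return 0;
    uint16_t first_code = f6_u16(sub + 6);
    uint16_t entry_count = f6_u16(sub + 8);
    if (code < first_code || code - first_code >= entry_count)
        return 0;                       /* outside the subrange */
    size_t pos = 10 + 2 * (size_t)(code - first_code);
    if (pos + 2 > sub_len)
        return 0;
    return f6_u16(sub + pos);
}
```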
