Based Pointers: Combining Far Pointer Addressability and the Small Size of Near Pointers

Richard Hale Shaw

Based pointers, a new data type introduced in Microsoft C Version 6.0, are a practical alternative to traditional pointers for some applications. Based pointers combine the size of near pointers with the ability of far pointers to address objects in segments outside your program's default data segment; that is, far segments. Their name comes from the way they are used: you specify the segment on which they are based. When you use a based pointer, the compiler usually generates code to load the ES register with a specific base (the segment upon which the pointer is based). You probably won't use based pointers to address objects in your program's default data segment, but they are useful when working on objects in parallel. They offer some distinct advantages over far pointers, including a flexibility that far pointers lack. A brief review of near, far, and huge pointers follows.

Near Pointers

Near pointers are native to memory models that use a single 64Kb segment for code or data. For data pointers, this includes the tiny, small, and medium memory models, in which data segments don't exceed 64Kb. Near code pointers (function pointers) are found in models that limit code to 64Kb, like the tiny, small, and compact models. For the purposes of this article, however, further discussion focuses on pointers to data.

Near pointers are simple 16-bit pointers that contain an offset to an object in a program's data segment. The compiler combines the program's data segment with the offset to create a 32-bit address from the near pointer. Since the DS register contains this segment value most of the time, there is usually no need to reload this register. Besides, a number of assembly language operations assume that DS contains a data object's segment address.

While near pointers are the smallest and fastest pointers to use, they are also the most limited. You can access only 64Kb of data with them, the 64Kb in the program's data segment. If you compile a small model program and attempt to increment a near pointer past the 65,536th byte, the pointer will be reset to 0. For example, the following code leaves the pointer set to the base of the data segment, byte 0.

// set pointer to last byte

char _near *p = (char _near *)0xffff;

p++; // increment pointer

(Note that the _near keyword is unnecessary if you are compiling with the small or tiny model, just as _far can be omitted when compiling with the large model.)
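The wraparound can be reproduced on any compiler by modeling a near pointer as a bare 16-bit offset. The following is only a sketch: the names near_off, near_add, and near_wrap_example are hypothetical and are not part of Microsoft C.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a near pointer: only a 16-bit offset into the data segment. */
typedef uint16_t near_off;

/* Advance the offset by count bytes. The sum is truncated to 16 bits,
   so stepping past 0xFFFF wraps to the start of the segment. */
near_off near_add(near_off p, uint16_t count)
{
    return (near_off)(p + count);
}

/* Usage: the last byte of the segment, incremented once, becomes byte 0. */
void near_wrap_example(void)
{
    near_off p = 0xFFFF;
    p = near_add(p, 1);
    assert(p == 0);
}
```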

Since near pointers don't usually require segment register reloads and use only 2 bytes, they have an advantage of size and speed over far pointers. However, their ability to address only the current data segment's 64Kb limits their usefulness in large or complex applications.

Far Pointers

A far pointer is a variation of a near pointer. Rather than let the compiler use the program's data segment as the segment portion of the address (as near pointers do), the segment value is stored in the pointer. Thus, far pointers use 4 bytes to store a complete 32-bit address. Since this address contains both the segment and offset of the object being addressed, far pointers can be used by any memory model. However, far data pointers are native to the compact and large models.

Because far pointers contain both a segment and an offset, the segment portion must be loaded into a segment register each time the pointer is used. Thus, they are slower than near pointers. On the other hand, far pointers can reference any data object, anywhere in memory.

Far pointers cannot address more than 64Kb at a time, or access all of a data object more than 64Kb in size. Far pointers behave like near pointers as far as address arithmetic is concerned. When you increment a far pointer, only the offset is affected: if you increment it past the end of the segment, the offset will wrap around and begin incrementing from 0 while the segment portion remains the same:

// set to B800:FFFFH

char _far *fp = (char _far *)0xb800ffff;

fp++; // increment to B800:0000H

In this example, the far pointer is set to the last byte of the segment at B8000H. Once the pointer is incremented, the pointer is set to the first byte of the segment, at B800:0000H. Thus, like near pointers, far pointers can only contain offset values from 0 to 65535. While they occupy more space and are slower to manipulate, they can reference any object in any segment, as long as the object itself does not extend beyond the end of the specified segment.
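As a rough model of this behavior, a far pointer can be represented as a segment and offset pair whose arithmetic changes only the offset. The sketch below is portable C, not Microsoft C syntax; far_ptr, far_linear, far_add, and far_wrap_example are hypothetical names. It relies on the real-mode rule that the physical address is the segment times 16 plus the offset.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a far pointer: 16-bit segment plus 16-bit offset. */
typedef struct {
    uint16_t seg;
    uint16_t off;
} far_ptr;

/* Real-mode physical address: segment * 16 + offset. */
uint32_t far_linear(far_ptr p)
{
    return ((uint32_t)p.seg << 4) + p.off;
}

/* Far pointer arithmetic touches only the offset; seg never changes. */
far_ptr far_add(far_ptr p, uint16_t count)
{
    p.off = (uint16_t)(p.off + count);
    return p;
}

/* Usage: B800:FFFFH incremented once wraps to B800:0000H. */
void far_wrap_example(void)
{
    far_ptr fp = { 0xB800, 0xFFFF };
    fp = far_add(fp, 1);
    assert(fp.seg == 0xB800 && fp.off == 0x0000);
    assert(far_linear(fp) == 0xB8000UL);
}
```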

Huge Pointers

Huge pointers are the only true pointers in the K&R sense of the word. While physically the same as far pointers, storing a 32-bit address in a 4-byte space, huge pointers behave differently. When you increment a huge pointer beyond the end of a segment, it increments into the next segment. Therefore, huge pointers can address any object of any size, even objects that extend over several segments. For instance, the following code causes the huge pointer to address the next physical byte, from B800:FFFFH to C800:0000H.

// set to B800:FFFFH

char _huge *hp = (char _huge *)0xb800ffff;

hp++; // increment to C800:0000H

It does not wrap around like near and far pointers.
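The step from B800:FFFFH to C800:0000H is consistent with carrying 1000H into the segment each time the offset overflows, since 1000H paragraphs equal 64Kb. The sketch below models that carry in portable C. It assumes real-mode addressing, and huge_ptr, huge_linear, huge_add, and huge_carry_example are hypothetical names, not the compiler's actual run-time helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a huge pointer: same layout as a far pointer. */
typedef struct {
    uint16_t seg;
    uint16_t off;
} huge_ptr;

/* Real-mode physical address: segment * 16 + offset. */
uint32_t huge_linear(huge_ptr p)
{
    return ((uint32_t)p.seg << 4) + p.off;
}

/* Add count bytes; every 64Kb crossed adds 0x1000 to the segment. */
huge_ptr huge_add(huge_ptr p, uint32_t count)
{
    uint32_t off = (uint32_t)p.off + (count & 0xFFFFu);
    uint32_t carries = (count >> 16) + (off >> 16);

    p.off = (uint16_t)off;
    p.seg = (uint16_t)(p.seg + carries * 0x1000u);
    return p;
}

/* Usage: B800:FFFFH incremented once becomes C800:0000H. */
void huge_carry_example(void)
{
    huge_ptr hp = { 0xB800, 0xFFFF };
    hp = huge_add(hp, 1);
    assert(hp.seg == 0xC800 && hp.off == 0x0000);
    assert(huge_linear(hp) == 0xC8000UL);
}
```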

Using huge pointers, you can write a program to access every memory location in your machine.

#include <dos.h> // FP_SEG, FP_OFF

// define 1 megabyte macro

#define MAXBYTES (1024L*1024L)

main(void)

{

long l;

unsigned char _huge *uch;

FP_SEG(uch) = 0;

FP_OFF(uch) = 0;

for(l = 0L; l < MAXBYTES; l++, uch++)

*uch = '\0';

}

While this is a nonsensical program (nonsensical in that it would attempt to overwrite every writable memory location in your PC, including itself and DOS), it illustrates the addressing ability of huge pointers. This is why huge pointers are the only pointers supported by Microsoft C that behave like pointers in a flat-addressing scheme. Huge pointers are always slower, since the compiler must generate code that can perform pointer arithmetic on the entire 32-bit value, unlike near and far pointers where address arithmetic is only performed on the offset component. Near and far pointers are strictly products of the segmented architecture of the 80x86 CPU family on which DOS PCs are based.

Clearly, pointers in Microsoft C have specific attributes. Huge pointers can address any object of any size; near and far pointers cannot address any object outside of their designated segments. Far and huge pointers contain a 32-bit address and occupy 4 bytes; near pointers occupy 2 bytes and contain a 16-bit address.

Based Pointers

As mentioned, based pointers combine the intersegment addressing features of far pointers with the size and possibly the speed of near pointers. Like near pointers, based pointers occupy only 2 bytes and hold a 16-bit offset; like far pointers, they can address up to 64Kb of any segment. However, far addressing still requires a 32-bit address, while based pointers contain only the 16-bit offset.

The base segment of near pointers is implicitly the program's data segment. The "base" of far and huge pointers is contained in the segment portion of the pointer. When using based pointers, you specify the base, but it is not stored in the pointer itself.

Where does the segment portion of a based pointer come from? It's actually a compiler trick. Using the new based pointer notation of C 6.0, you can declare a based pointer and specify its segment base. Then, depending on what type of based pointer you use, the compiler "knows" from the declaration which segment the based pointer is based on. There are six types of based pointers: segment based pointers, variable based pointers, void based pointers, named segment based pointers, pointer based pointers, and pointers based on self.

When it encounters a based pointer in use, the compiler generates code to load a segment register (usually ES) with the pointer's associated segment base. Then the compiler proceeds with the operation entailing the based pointer. Of course, the based pointer itself supplies the offset portion of the address.

Segment Based Pointers

Suppose you wanted to base a pointer on the segment associated with the video display of an EGA or VGA adapter:

_segment videoseg = 0xb800;

unsigned _based(videoseg) *bpvid = 0;

This small piece of code does several things. It creates videoseg, which is a _segment data type. The _segment data type, new in C 6.0, is similar to an unsigned integer. It is used to store a segment value for pointers based on it. The next line of code uses the _based keyword to create a based pointer, bpvid, whose base is the segment stored in videoseg. The _based keyword must be followed by a valid base expression in parentheses. However, you cannot use an unsigned integer constant as below:

// WRONG!

unsigned _based((_segment)0xb800) *bpvid;

For illustrative purposes, here's how to use a far pointer to address the same location:

unsigned _far *fpvid = (unsigned _far *)0xb8000000;

While this far pointer is set to the same location, it's not the same as the based pointer. To use fpvid on another segment, you have to reset the entire pointer (or at least use the FP_SEG macro to reset the segment portion of the pointer). To use bpvid on another segment, you only have to change videoseg. The big advantage here, of course, is that you can create an entire set of pointers based on the same _segment variable, and change their base by making a single change to the _segment variable.
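The effect of sharing one segment variable can be sketched in portable C by keeping the base in a single variable and storing only offsets. This is a model under stated assumptions, not Microsoft C syntax: base_seg stands in for the _segment variable, set_base, based_linear, and resolve_all are hypothetical names, and 0xB000 in the usage note is chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plays the role of the _segment variable that every pointer shares. */
static uint16_t base_seg = 0xB800;

/* Changing the base retargets every offset resolved afterward. */
void set_base(uint16_t seg)
{
    base_seg = seg;
}

/* Physical address of one based pointer: shared base * 16 + its offset. */
uint32_t based_linear(uint16_t off)
{
    return ((uint32_t)base_seg << 4) + off;
}

/* Resolve a whole set of based offsets against the current base. */
void resolve_all(const uint16_t *offs, size_t n, uint32_t *out)
{
    assert(offs != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = based_linear(offs[i]);
}
```

With offsets 0, 2, and 160 and a base of 0xB800, resolve_all yields B8000H, B8002H, and B80A0H. After set_base(0xB000), the same three offsets resolve to B0000H, B0002H, and B00A0H without any of the stored offsets changing.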

In the following code, pspptr is a pointer to the program segment prefix (PSP) structure, but it's based on the segment value stored in psp.

#include <dos.h>

_segment psp;

typedef struct _psp

{

unsigned int20;

unsigned allocblockseg;

char reserved0;

unsigned char dosfunctdispatch[5];

unsigned long int22;

unsigned long int23;

unsigned long int24h;

char reserved1[22];

unsigned envseg;

} PSP;

PSP _based(psp) *pspptr = 0;

If an application knows the segment address of another application's PSP, it can access it by assigning that segment address to psp. Then pspptr can reference members of the other application's PSP. Thus, pspptr->envseg provides the segment address of a program's copy of the environment. For a more explicit example, see the listings of ENVIRON.C (see Figure 1). You can compile this program using EN.BAT (also in Figure 1). ENVIRON.C uses a based pointer to access its own PSP and another based pointer (whose base is a segment variable set to the environment segment address in the PSP) to access the program's copy of the environment strings.
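The structure only works if its fields line up with the DOS PSP, where the environment segment sits at offset 2CH. The portable sketch below checks that layout with fixed-width types and byte packing. PSP_LAYOUT and psp_envseg are hypothetical names, and the #pragma pack directive is an assumption about the compiler in use (it is accepted by the common desktop compilers).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field order as the PSP structure above, with explicit widths.
   Byte packing keeps int22 at offset 0AH, as in the 16-bit original. */
#pragma pack(push, 1)
typedef struct {
    uint16_t int20;
    uint16_t allocblockseg;
    uint8_t  reserved0;
    uint8_t  dosfunctdispatch[5];
    uint32_t int22;
    uint32_t int23;
    uint32_t int24h;
    uint8_t  reserved1[22];
    uint16_t envseg;
} PSP_LAYOUT;
#pragma pack(pop)

/* Read the little-endian environment segment from a raw PSP image. */
uint16_t psp_envseg(const uint8_t *psp_bytes)
{
    const size_t at = offsetof(PSP_LAYOUT, envseg);

    assert(psp_bytes != NULL);
    return (uint16_t)(psp_bytes[at] | (psp_bytes[at + 1] << 8));
}
```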

Variable Based Pointers

You can also use _segment as a cast to extract the segment address of an object. Thus, a based pointer can be based on the storage segment of another object. In the following code, cbpsv is based on the segment in which count is stored, and cbpsp is based on the segment of the address that cp points to.

unsigned count;

char *cp;

char _based((_segment)&count) *cbpsv;

char _based((_segment)cp) *cbpsp;

The _segment cast is read as "segment of" in this context. This usage differs from pointers based on other pointers, which are discussed below.

Using explicit segment addresses is only one way to declare a based pointer's base. You can use the segment's name or derive the segment from the address pointed to by a pointer. Based pointers that are members of a structure can be based on the segment in which the structure resides. Or, as shown in the next section, you can omit the segment declaration altogether.

Void Based Pointers

You don't always have to include a based pointer's segment with the pointer declaration.

unsigned _based(void) *ubpv = 0;

Here, ubpv is a based pointer whose base segment is omitted from the declaration. But a based pointer must be based on something! This method lets you defer naming the segment until the based pointer is used. Whenever you reference a void based pointer, you must specify a segment value.

_segment videoseg = 0xb800;

unsigned _based(void) *ubpv = 0;

*(videoseg:>ubpv) = (0x70 << 8) + 'H';

This piece of code uses videoseg (declared earlier) to provide a base for ubpv. The code displays a reverse-video H at the upper-left corner of an EGA or VGA screen. The new ":>" base operator lets you combine the segment address stored in videoseg with the offset in ubpv. You can only use the base operator on pointers based on void, however. If you attempt to use it with a based pointer not based on void, the code will not compile.

The big advantage of void based pointers is that you can set them to a particular offset and then use them with different base addresses. You can also access the same data in different segments using the same pointer: simply reference a different base each time. (Rather than changing a _segment variable, you can change the base whenever you use the pointer.)
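What the base operator computes can be imitated in portable C with a function that takes the segment at each use. The sketch below simulates a 1Mb real-mode address space in an array. mem, seg_off, and put_cell are hypothetical names, and seg_off is only a stand-in for ":>", not how the compiler implements it.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_SIZE (1UL << 20)

/* Simulated 1Mb real-mode address space. */
static uint8_t mem[MEM_SIZE];

/* Stand-in for seg:>off: combine a segment supplied at the point of
   use with a stored 16-bit offset. */
uint8_t *seg_off(uint16_t seg, uint16_t off)
{
    uint32_t addr = ((uint32_t)seg << 4) + off;

    assert(addr + 1 < MEM_SIZE);   /* room for a 2-byte screen cell */
    return &mem[addr];
}

/* Store one text-mode cell: character in the low byte, attribute in
   the high byte, matching (attr << 8) + ch as a little-endian word. */
void put_cell(uint16_t seg, uint16_t off, uint8_t attr, uint8_t ch)
{
    uint8_t *cell = seg_off(seg, off);

    cell[0] = ch;
    cell[1] = attr;
}
```

Calling put_cell(0xB800, 0, 0x70, 'H') mirrors the assignment in the article. Calling it again with a different segment and the same offset writes to the corresponding location in that segment, which is the reuse that void based pointers allow.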

Most of the time you'll want to base a based pointer on a specific segment address, as discussed above. But you may instead want to base it on the address of a named segment.

Named Segment Based Pointers

Programmers who are writing large or medium model applications often make each program module a named segment, with near calls and near data inside the module. While this is a useful trick, it's nearly impossible to reference the named segment outside of the module. Modules such as these are usually named by the compiler via a compile line switch.

The based pointer support in C 6.0 lets you create new segments on-the-fly. Additionally, it lets you designate which objects and pointers are based in these segments. For instance, the following code creates a pointer whose base is the program's code segment:

void _based(_segname("_CODE")) *vbpc;

The next line of code creates a pointer based on the program's data segment:

void _based(_segname("_DATA")) *vbpd;

This isn't really that useful, since it's the functional equivalent of a near data pointer:

void *vptr;

However, the ability to base pointers on specific named segments can be extremely useful. The _segname operator must be followed by a string that names a segment in parentheses. This can be any of the predefined segment names used in Microsoft C such as _CODE, _DATA, _CONST, or _STACK.

If you use a segment name that differs from these, the compiler will create a new segment. For instance the following creates a pointer based on a new segment, _NEWSEG:

void _based(_segname("_NEWSEG")) *vbpm;

The compiler, upon encountering this line of code, will create the new segment. But because it doesn't do you any good to base a pointer on a segment that doesn't contain anything, you can also embed objects in the new segment. For instance, the following will create a string array that is stored in _NEWSEG and referenced via newseg_message.

char _based(_segname("_NEWSEG")) newseg_message[] =

"This string is stored in the segment, _NEWSEG";

Note that if you make newseg_message a pointer instead of an array, you'll create a based pointer to a near string:

char _based(_segname("_NEWSEG")) *newseg_message =

(char _based(_segname("_NEWSEG")) *)

"This string is stored in the segment, _NEWSEG";

The cast is required to eliminate a compiler warning.

If you've nearly exhausted data space in a small model program, you might be able to avoid moving to the compact model by placing some of the data in another segment. For instance, if your code segment is still well below 64Kb, you can store some of the program's data there. This code stores the string at codeseg_message in the _CODE segment.

char _based(_segname("_CODE")) codeseg_message[] =

"This string is stored in the _CODE segment";

Keep in mind that any data referenced from another segment may require that a segment register be reloaded, so access to this data will be slower than that in the default data segment.

Pointers Based on Pointers

Suppose you want to base a pointer on the address another pointer points to.

unsigned _near *ip;

unsigned _far *fip;

unsigned _based(ip) *ibpip;

unsigned _based(fip) *ibpfip;

Here are two pointers and two based pointers. Each of the based pointers is based on the address that one of the "regular" pointers points to. Therefore, if the far pointer, fip, points to B800:0000H and ibpfip is set to 5, then *ibpfip addresses B800:0005H. Any change to fip will affect ibpfip: ibpfip will always refer to an address calculated by adding its value to the address stored in fip.
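The address calculation described here can be sketched in portable C. The model follows the article's description (the based pointer's value is added to the address held in the base pointer) and assumes the sum stays within 16 bits of offset. far_ptr, based_on_pointer, and based_on_pointer_example are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the base pointer: a segment and offset pair. */
typedef struct {
    uint16_t seg;
    uint16_t off;
} far_ptr;

/* Effective address of a pointer based on `base`: the base's segment,
   and the base's offset plus the based pointer's own 16-bit value. */
far_ptr based_on_pointer(far_ptr base, uint16_t based_off)
{
    far_ptr p = base;

    p.off = (uint16_t)(base.off + based_off);
    return p;
}

/* Usage: fip at B800:0000H with ibpfip set to 5 gives B800:0005H;
   moving fip moves the effective address with it. */
void based_on_pointer_example(void)
{
    far_ptr fip = { 0xB800, 0x0000 };
    uint16_t ibpfip = 5;
    far_ptr a = based_on_pointer(fip, ibpfip);

    assert(a.seg == 0xB800 && a.off == 0x0005);

    fip.off = 0x00A0;              /* change only the base pointer */
    a = based_on_pointer(fip, ibpfip);
    assert(a.seg == 0xB800 && a.off == 0x00A5);
}
```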

This construction can be used in many ways. Suppose an application uses a multidimensional array and makes changes to more than one dimension of the same element. You can use one pointer to access the array's first dimension and based pointers (based on the address the first pointer points to) to access the other dimensions. If the based pointers are each set one dimension apart, you can easily access the other dimensions in parallel. Alternatively, you could allocate a far segment, and create a far pointer to reference the segment base. Then you could use a based pointer (based on the far pointer) to access parts of the same segment. You might be thinking that this would be useful to create a linked list in a far segment. But another type of based pointer would be more useful.

Pointers Based on Self

The remaining type of based pointer is self based; that is, its base is the segment in which it is stored. Functionally speaking, near pointers are self based. But you cannot use a near pointer in a far segment, nor is a far pointer based on the segment in which it is stored.

For instance, consider the following linked list code:

typedef struct _list LIST;

struct _list

{

void *item;

LIST _based((_segment)_self) *left;

LIST _based((_segment)_self) *right;

};

void main(void)

{

LIST _based(_segname("LISTSEG")) list;

}

The LIST object type is defined, including two self based pointers, using the new _self keyword. These pointers will be based on the segment in which the LIST is created. As you can see, list is declared as being stored in the LISTSEG segment. Therefore, the left and right pointers will point to offsets in the LISTSEG segment, since they are self based.

Self based pointers are specifically designed to be used in far segments. They make it easy to define complex data structures, as well as making them portable and easy to declare. Like pointers based on void, self based pointers defer choosing a base. (In a data structure, a self based pointer will be based on the segment in which the data structure is stored.) That is, its segment must be statically known at compile time. That way it will point to an offset within the segment.
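Because self based links are only offsets within their own segment, a structure built from them stays valid wherever the segment's bytes end up. The portable sketch below models this with a byte buffer standing in for the segment. NODE, NIL, SEG_SIZE, and the function names are hypothetical, and the node layout is not the article's LIST structure.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SEG_SIZE 1024u
#define NIL      0xFFFFu   /* hypothetical "no node" offset */

/* Links are 16-bit offsets into whichever buffer holds the node. */
typedef struct {
    uint16_t left;
    uint16_t right;
    int16_t  value;
} NODE;

/* Copy a node out of the buffer at the given offset. */
NODE load_node(const uint8_t *seg, uint16_t off)
{
    NODE n;

    assert((size_t)off + sizeof(NODE) <= SEG_SIZE);
    memcpy(&n, seg + off, sizeof n);
    return n;
}

/* Copy a node into the buffer at the given offset. */
void store_node(uint8_t *seg, uint16_t off, NODE n)
{
    assert((size_t)off + sizeof(NODE) <= SEG_SIZE);
    memcpy(seg + off, &n, sizeof n);
}

/* Build three linked nodes at fixed offsets; return the head offset. */
uint16_t build_demo(uint8_t *seg)
{
    const uint16_t a = 0, b = 16, c = 32;

    store_node(seg, a, (NODE){ NIL, b, 10 });
    store_node(seg, b, (NODE){ a, c, 20 });
    store_node(seg, c, (NODE){ b, NIL, 30 });
    return a;
}

/* Follow the right links from head and add up the values. */
long sum_right(const uint8_t *seg, uint16_t head)
{
    long total = 0;
    uint16_t off = head;

    while (off != NIL) {
        NODE n = load_node(seg, off);
        total += n.value;
        off = n.right;
    }
    return total;
}
```

If the buffer is copied byte for byte into a second buffer, sum_right on the copy returns the same total from the same head offset, because no link ever recorded where the first buffer lived.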

Linked List Manager

To illustrate the features, properties and benefits of based pointers further, I've written a linked list manager (see the sidebar "The List Manager API" and its accompanying code). As most programmers know, linked lists are incredibly useful tools. They are generally simpler and faster to use than disk files. List items can be added, deleted, and sorted quickly and with little overhead. In single-linked lists, each node contains a pointer to the next node in the list, allowing a search from any node forward. In the List Manager, each node contains pointers to the nodes that precede and follow it. You can use these double-linked lists to search in either direction.

I believe that linked lists would be used more frequently if not for the limitations inherent in most implementations. Lists are notorious for fragmenting and exhausting a program's heap through the constant addition and deletion of nodes. It's also quite a chore to save list items to disk: you have to write each item to the file one at a time. It's even more complex to restore the list from the disk file. In addition to reading the data, the list linkages have to be restored.

The functions in the List Manager presented here use based pointers to alleviate these problems. Instead of maintaining the list in the calling program's data segment, the list is stored in a new segment. The List Manager allocates this segment outside the program's data space and uses based pointers to manage suballocation within the segment. Since a separate segment is associated with each list, you can use and maintain multiple lists of up to 64Kb each. Even if you're compiling with the small or tiny memory model, the list will occupy no more than 4 bytes (for a handle) in your program's default data segment. Also, the lists won't wreak havoc on your program by fragmenting the heap.

Placing the list in a single separate segment also facilitates its storage. The entire segment can be written as a single unit, making it a snap to save it and restore it from disk. Since the linkages are retained inside the segment, you can read them back from the disk file without having to rebuild them. You can also extend the List Manager to use EMS memory if you wish.

Limiting a list to a single 64Kb segment might seem constraining, especially if you're using it to juggle large objects such as graphic images. Each node therefore contains a far pointer that can be used to track objects stored outside the list's segment. Using a far pointer, you can track objects up to 64Kb each in size, and even store them in EMS. Finally, each node offers a pointer to the node's name. This can also be used to store string data if the object pointer isn't needed.

Based pointers make all of this possible. While your application uses a far pointer as a list handle, the nodes and linkages are connected via based pointers. Since allocation of the based pointers has to take place inside a far segment, the new _bmalloc functions are used. These functions, discussed in the sidebar, allow an application to suballocate a previously allocated far segment. Therefore, the list functions call _bheapseg to allocate a new far segment, _bmalloc to suballocate a new node within the segment, and _bfree to free the node. Other functions (discussed in the sidebar) are available although not used by the list functions.

How the List Manager Works

Since the List Manager has to be able to create new lists dynamically, it uses a far memory segment that is allocated while an application is running. Because of this, there is no previously declared base for pointers based on this segment. The data structures for the list header and nodes (items) therefore cannot use self based pointers, since a self based object requires a specific base expression when an instance of it is created. Instead, the List Manager maintains a static segment variable, _tempseg, which is used as the base for the List Manager's pointers. Since the List Manager allows the use of multiple lists, _tempseg must always contain the segment of the list being manipulated by a List Manager function. Thus, each function begins with the _listinit macro, which initializes _tempseg with the address of the segment in which the list is stored.

The List Manager makes extensive use of two data structures, both of which are stored in the list's far segment. The first of these is a control block for the entire list:

typedef struct _list LIST;

typedef struct _litem LISTITEM;

struct _list

{

_segment seg; // list segment

unsigned num; // number of nodes

unsigned segsize; // size of segment

char _based(_tempseg) *name; // name of list

LISTITEM _based(_tempseg) *item;

// first node or item

};

The LIST data type contains the list's segment address, segment size, and the number of nodes. It also contains pointers to the list's name and the first node or item in the list. Each list item is maintained in the LISTITEM structure as follows:

struct _litem

{

void far *object; // object pointer

unsigned objectsize; // object's size

char _based(_tempseg) *name; // item name

LISTITEM _based(_tempseg) *prev;

// previous item in list

LISTITEM _based(_tempseg) *next;

// next item in list

};

An instance of this structure is allocated for each item in the list. It lets the List Manager maintain pointers to the previous and next list items, as well as a pointer to a far object. The object pointer can point to any object anywhere in memory, allowing maximum flexibility. You can set this pointer to an object maintained by your application, or it can point to a large object stored in its own segment (for an example, see the animation routines discussed below). The name pointer allows further flexibility. If you want to name a list item or the object it points to, you can use this pointer. Or, you can use it instead of the object pointer to store the actual data item to which the node refers.

All the pointers (with the exception of the object pointer) are based on the _tempseg variable, which is always loaded with a particular list's segment. Every component of a list that is allocated by the List Manager is stored in that list's segment, making it easy to maintain a list, as well as save and restore it to and from a disk file.

List Manager API

The List Manager can be placed in a library, or linked into an application as an object file. In either case, the calling application does not need to know anything about the inner workings of the List Manager. A small API is available to use instead. An application calls ListInit to create a list, and it returns a handle (a void far pointer) to the new list. ListInit requires the list name and the suggested number of nodes that will be used. ListInit then uses this number to "guess" the size of the far segment to be allocated.

Internally, ListInit calls _ListInit to create the new segment and return a LIST pointer to it. _ListInit uses _bheapseg to create the new far segment, and then calls _bmalloc to suballocate space for the LIST header structure. Since _bmalloc returns _NULLOFF instead of NULL upon failure, any uninitialized pointers in the List Manager structures are set to _NULLOFF.

After a program creates a LIST and receives a far pointer to it, the program can call other List Manager routines to build and manipulate the list. For example, ListAdd calls _bmalloc to suballocate a new list item and add the item to the tail of the list. ListDelete will remove an item from the list; if the item has a related object (via the object pointer), the pointer is returned to the calling application to free it. The API also contains two functions, ListDump and ListDumpItem, for debugging list nodes and pointers. Other functions are briefly described in the sidebar "The List Manager API."

Note the two functions, ListSave and ListRestore. ListSave lets you save a list to a specified disk file. It does this by first writing a structure that includes the segment size and the offset portion of the list handle pointer. Next, ListSave writes the entire list segment to the disk file, and then writes each item's object (if it has one) to the file.

ListRestore can restore an entire list from the disk. The function first opens the disk file and reads the structure written at the beginning of the file. It uses the segment size to allocate a segment of the proper size without having to read the list segment from the disk. While ListRestore calls _ListInit to create the new segment, the LIST pointer it returns may not correspond to the one originally saved in the file. Therefore, the offset portion of the LIST handle is set to the offset portion of the list handle pointer that was previously stored at the beginning of the file. After this, ListRestore reads the entire segment into the newly allocated segment, and reads each item's object into a far memory segment that it creates.
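The save and restore sequence can be sketched in portable C, with a memory buffer standing in for the disk file. This is not the article's ListSave or ListRestore: SAVE_HEADER, image_save, and image_restore are hypothetical names, the header layout is assumed, the header is copied in the host's native byte order, and the per-item objects are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header written ahead of the segment image. */
typedef struct {
    uint16_t segsize;      /* bytes in the list segment */
    uint16_t handle_off;   /* offset part of the list handle */
} SAVE_HEADER;

/* Save: header first, then the whole segment as one unit.
   Returns a malloc'd image (caller frees) or NULL on failure. */
uint8_t *image_save(const uint8_t *seg, uint16_t segsize,
                    uint16_t handle_off, size_t *image_len)
{
    SAVE_HEADER h = { segsize, handle_off };
    uint8_t *image;

    assert(seg != NULL && image_len != NULL);
    image = malloc(sizeof h + segsize);
    if (image == NULL)
        return NULL;
    memcpy(image, &h, sizeof h);
    memcpy(image + sizeof h, seg, segsize);
    *image_len = sizeof h + segsize;
    return image;
}

/* Restore: read the header, allocate a segment of the recorded size,
   copy the segment in one piece, and hand back the saved offset. */
uint8_t *image_restore(const uint8_t *image, size_t image_len,
                       uint16_t *handle_off)
{
    SAVE_HEADER h;
    uint8_t *seg;

    assert(image != NULL && handle_off != NULL);
    if (image_len < sizeof h)
        return NULL;
    memcpy(&h, image, sizeof h);
    if (image_len - sizeof h < h.segsize)
        return NULL;
    seg = malloc(h.segsize ? h.segsize : 1);
    if (seg == NULL)
        return NULL;
    memcpy(seg, image + sizeof h, h.segsize);
    *handle_off = h.handle_off;
    return seg;
}
```

The restored buffer lives at a different address than the original, yet any offset-based links inside it remain usable, so only the handle offset needs to be carried across.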

Sample Programs

Several sample programs are presented with this article. TESTLIST.C (see Figure 2) creates a list, adds several items to it, then calls ListDump to display the list items and connections. MAKANIM.C (see Figure 3) and ANIMATE.C (see Figure 4) display a brief, repetitive animation sequence using a List Manager's linked list. I used a screen capture program to create a series of screen shots in which a box is drawn and bounces between the top and bottom of the screen. MAKANIM should be compiled and run first: it creates the list by reading the screen shot files (which are available on all MSJ bulletin boards) and making each an object related to a list node. Since the objects are designed for display on an EGA or VGA monitor, only the first 4Kb of each screen shot file (.ACP) is used. The program then saves the list in a file, ANIMATE.LST. ANIMATE.EXE restores the list from ANIMATE.LST, displaying the screen shots over and over in a loop. A key press terminates the loop.

Once you've compiled MAKANIM.C and ANIMATE.C, run MAKANIM, followed by ANIMATE.

MAKANIM

ANIMATE ANIMATE.LST [/b]

The optional /b switch will cause ANIMATE to call a based pointer screen display routine. You'll find that the difference in speed between a based pointer and a far pointer is almost unnoticeable.

Conclusion

Based pointers are a product of segmented architecture. They further blur the boundaries between code and data by allowing you to store data in a code segment. They offer an alternative that is similar to near pointers, while allowing access to data typically reached via a far pointer. They also make it easier to transfer complex data structures between disk, EMS, and conventional memory. Based pointers are not without limitations. The compiler is sometimes too strict about operations such as assigning one based pointer to another. If both pointers use different base names that refer to the same physical base, the compiler generates a warning. You'll have to use a cast to eliminate this. Also, since the compiler generates code to load the ES register when a based pointer is used, repeated reloads of that register can generate slow-performing code. This can be a problem especially when using a variety of based pointers with different bases. On the other hand, the generated code can be very fast if you're using several pointers with the same base.