Microsoft DirectX 8.1 (C++)

Performance Optimizations

Every developer who creates real-time applications that use three-dimensional (3-D) graphics is concerned about performance optimization. This section provides you with guidelines about getting the best performance from your code.

General Performance Tips

Follow these general guidelines to increase the performance of your application.

Databases and Culling

Building a reliable database of the objects in your world is key to excellent performance in Microsoft® Direct3D®. It is more important than improvements to rasterization or hardware.

You should maintain the lowest polygon count you can possibly manage. Design for a low polygon count by building low-polygon models from the start. Later in the development process, add polygons only if you can do so without sacrificing performance. Remember, the fastest polygons are the ones you don't draw.

Batching Primitives

To get the best rendering performance during execution, try to work with primitives in batches and keep the number of render-state changes as low as possible. For example, if you have an object with two textures, group the triangles that use the first texture and follow them with the necessary render state to change the texture. Then group all the triangles that use the second texture. The simplest hardware support for Direct3D is called with batches of render states and batches of primitives through the hardware abstraction layer (HAL). The more effectively the instructions are batched, the fewer HAL calls are performed during execution.
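The grouping described above can be sketched in plain C++ without any Direct3D calls. This is only an illustration under stated assumptions: DrawItem, textureId, countTextureChanges, and sortByTexture are hypothetical names, and the sketch merely counts texture changes where a real application would issue its render-state and draw calls.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical record for one batch of triangles; not a Direct3D type.
struct DrawItem {
    int textureId;      // Which texture the triangles use.
    int triangleCount;  // How many triangles are in this batch.
};

// Counts how many times the texture would have to be changed if the
// items were submitted in the given order.
int countTextureChanges(const std::vector<DrawItem>& items) {
    int changes = 0;
    int current = -1;  // Assumption: valid texture IDs are non-negative.
    for (const DrawItem& item : items) {
        assert(item.textureId >= 0);
        if (item.textureId != current) {
            ++changes;
            current = item.textureId;
        }
    }
    return changes;
}

// Groups items that share a texture. stable_sort keeps the original
// submission order within each texture group.
void sortByTexture(std::vector<DrawItem>& items) {
    std::stable_sort(items.begin(), items.end(),
                     [](const DrawItem& a, const DrawItem& b) {
                         return a.textureId < b.textureId;
                     });
}
```

With four items that alternate between two textures, the unsorted order needs four texture changes and the sorted order needs two.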

Lighting Tips

Because lights add a per-vertex cost to each rendered frame, you can achieve significant performance improvements by being careful about how you use them in your application. Most of the following tips derive from the maxim "the fastest code is code that is never called."

Texture Size

Texture-mapping performance is heavily dependent on the speed of memory. There are a number of ways to maximize the cache performance of your application's textures.

Using Dynamic Textures

Dynamic textures are a new Microsoft® DirectX® 8.1 feature. To find out if the driver supports dynamic textures, check the D3DCAPS2_DYNAMICTEXTURES flag in the Caps2 member of the D3DCAPS8 structure.

Keep the following things in mind when working with dynamic textures.

It is a good idea to create only one dynamic texture per format and possibly per size. Dynamic mipmaps, cubes, and volumes are not recommended because of the additional overhead in locking every level. For mipmaps, D3DLOCK_DISCARD is allowed only on the top level. All levels are discarded by locking just the top level. This behavior is the same for volumes and cubes. For cubes, the top level and face 0 are locked.

The following pseudocode shows an example of using a dynamic texture.

DrawProceduralTexture(pTex)
{
    // pTex should not be very small; otherwise the overhead of calling the
    // driver on every DISCARD lock outweighs the performance gain.
    // Experimentation is encouraged.
    pTex->Lock(DISCARD);
    <Overwrite *entire* texture>
    pTex->Unlock();
    pDev->SetTexture();
    pDev->DrawPrimitive();
}

Using Dynamic Vertex and Index Buffers

The performance of dynamic vertex and index buffers depends on their size and usage. The usage styles below help to determine whether to use D3DLOCK_DISCARD or D3DLOCK_NOOVERWRITE for the Flags parameter of the Lock method.

Usage Style 1:

for loop()
{
    pBuffer->Lock(...D3DLOCK_DISCARD...); //Ensures that hardware 
                                          //doesn't stall by returning 
                                          //a new pointer.
    Fill data (optimally 1000s of vertices/indices, no fewer) in pBuffer.
    pBuffer->Unlock();
    Change state(s).
    DrawPrimitive() or DrawIndexedPrimitive()
}

Usage Style 2:

for loop()
{
    pBuffer->Lock(...D3DLOCK_DISCARD...); //Ensures that hardware doesn't 
                                          //stall by returning a new 
                                          //pointer.
    Fill data (optimally 1000s of vertices/indices, no fewer) in pBuffer.
    pBuffer->Unlock();
    for loop( 100s of times )
    {
        Change State
        DrawPrimitive() or DrawIndexedPrimitive() //Tens of primitives
    }
    }
}

Usage Style 3:

for loop()
{
    If there is space in the buffer
    {
        // Append vertices/indices.
        pBuffer->Lock(...D3DLOCK_NOOVERWRITE...);
    }
    Else
    {
        // Reset to beginning.
        pBuffer->Lock(...D3DLOCK_DISCARD...);
    }
    Fill few 10s of vertices/indices in pBuffer
    pBuffer->Unlock();
    Change State
    DrawPrimitive() or DrawIndexedPrimitive() // A few primitives
}

Style 1 is faster than either style 2 or 3, but is generally not very practical. Style 2 is usually faster than style 3, provided that the application fills at least a couple of thousand vertices/indices for every Lock, on average. If the application fills fewer than that on average, then style 3 is faster. There is no guaranteed answer as to which lock method is faster, and the best way to find out is to experiment.
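The append-or-reset decision in usage style 3 can be sketched as a small bookkeeping class. This is a minimal illustration, not Direct3D code: LockMode, Reservation, and AppendCursor are hypothetical names, and the two LockMode values only stand in for D3DLOCK_NOOVERWRITE and D3DLOCK_DISCARD.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for D3DLOCK_NOOVERWRITE and D3DLOCK_DISCARD.
enum class LockMode { NoOverwrite, Discard };

// Result of reserving space: where to write and which lock flag to use.
struct Reservation {
    std::size_t offset;  // First element to write, counted in elements.
    LockMode mode;
};

// Tracks the next free position in a fixed-capacity dynamic buffer.
class AppendCursor {
public:
    explicit AppendCursor(std::size_t capacity)
        : capacity_(capacity), next_(0) {}

    // Reserves `count` elements. Appends when they fit after the data
    // already written; otherwise resets to the beginning and requests
    // a discard so that a fresh buffer can be returned.
    Reservation reserve(std::size_t count) {
        assert(count <= capacity_);
        LockMode mode = LockMode::NoOverwrite;
        if (count > capacity_ - next_) {  // Not enough room left.
            next_ = 0;
            mode = LockMode::Discard;
        }
        const Reservation result{next_, mode};
        next_ += count;
        return result;
    }

private:
    std::size_t capacity_;  // Total elements the buffer can hold.
    std::size_t next_;      // Index of the first unused element.
};
```

An application would convert the returned offset and count to bytes (by multiplying by the vertex or index size) before passing them to Lock along with the matching flag.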

Using Meshes

You can optimize meshes by using Direct3D indexed triangles instead of indexed triangle strips. The hardware will discover that 95 percent of successive triangles actually form strips and adjust accordingly. Many drivers do this for legacy hardware also.

Direct3DX mesh objects can have each triangle, or face, tagged with a DWORD, called the attribute of that face. The semantics of the DWORD are user-defined. They are simply used by Direct3DX to classify the mesh into subsets. The application sets per-face attributes using the LockAttributeBuffer call. The Optimize method has an option to group the mesh vertices and faces on attributes using the D3DXMESHOPT_ATTRSORT option. When this is done, the mesh object calculates an attribute table that can be obtained by the application by calling GetAttributeTable. This call returns 0 if the mesh is not sorted by attributes. There is no way for an application to set an attribute table because it is generated by the Optimize method. The attribute sort is data sensitive, so even if the application knows that a mesh is already attribute sorted, it still needs to call Optimize to generate the attribute table.

The following topics describe the different attributes of a mesh.

Attribute ID

An attribute ID is a value that associates a group of faces with an attribute group. This ID describes which subset of faces DrawSubset should draw. Attribute IDs are specified for the faces in the attribute buffer. The actual values of the attribute IDs can be anything that fits in 32 bits, but it is common to use 0 to n, where n is the number of attributes.

Attribute Buffer

The attribute buffer is an array of DWORDs (one per face) that specifies which attribute group each face belongs to. This buffer is initialized to zero when a mesh is created, but it is either filled by the load routines or must be filled by the user if attributes other than the default attribute with ID 0 are desired. This buffer contains the information that is used to sort the mesh based on attributes in Optimize. If no attribute table is present, DrawSubset scans this buffer to select the faces of the given attribute to draw.

Attribute Table

The attribute table is a structure owned and maintained by the mesh. The only way to generate one is to call Optimize with attribute sorting or stronger optimization enabled. DrawSubset uses the attribute table to draw a subset quickly with a single draw primitive call. The only other use is that progressive meshes also maintain this structure, so it is possible to see what faces and vertices are active at the current level of detail.
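To make the relationship between the attribute buffer and the attribute table concrete, the following sketch derives contiguous face ranges from an attribute buffer that is already sorted by ID. It is a simplified model and not the Direct3DX implementation: AttributeRange and buildAttributeTable are hypothetical names, the entry holds only face information, and returning an empty table for unsorted input is an assumption made for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified, hypothetical table entry: one contiguous run of faces
// that share an attribute ID.
struct AttributeRange {
    std::uint32_t attribId;
    std::size_t faceStart;
    std::size_t faceCount;
};

// Builds one range per attribute ID from a per-face attribute buffer.
// Returns an empty table if the buffer is empty or is not sorted by
// ascending attribute ID.
std::vector<AttributeRange> buildAttributeTable(
    const std::vector<std::uint32_t>& attribs) {
    std::vector<AttributeRange> table;
    for (std::size_t face = 0; face < attribs.size(); ++face) {
        if (!table.empty() && attribs[face] == table.back().attribId) {
            ++table.back().faceCount;  // Extend the current run.
        } else if (!table.empty() &&
                   attribs[face] < table.back().attribId) {
            return {};  // Not sorted: no table can be built.
        } else {
            table.push_back({attribs[face], face, 1});  // Start a run.
        }
    }
    // The runs must cover every face exactly once.
    assert(table.empty() ||
           table.back().faceStart + table.back().faceCount ==
               attribs.size());
    return table;
}
```

With such a table, drawing one subset needs only the start and count of a single range, whereas without it every entry of the attribute buffer has to be scanned.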

Z-Buffer Performance

Applications can increase performance when using z-buffering and texturing by ensuring that scenes are rendered from front to back. Textured z-buffered primitives are pretested against the z-buffer on a scan line basis. If a scan line is hidden by a previously rendered polygon, the system rejects it quickly and efficiently. Z-buffering can improve performance, but the technique is most useful when a scene includes a great deal of overdraw. Overdraw is the average number of times that a screen pixel is written to. Overdraw is difficult to calculate exactly, but you can often make a close approximation. If the overdraw averages less than 2, you can achieve the best performance by turning z-buffering off and rendering the scene from back to front.
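The overdraw rule of thumb above can be written down as a short calculation. This sketch assumes the application can obtain, or approximate, a count of writes for each screen pixel; averageOverdraw and shouldUseZBuffer are hypothetical helper names, not Direct3D functions.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Average number of writes per screen pixel, given one write count
// per pixel (for example, gathered by an instrumented debug build).
double averageOverdraw(const std::vector<unsigned>& writesPerPixel) {
    assert(!writesPerPixel.empty());
    const unsigned long long total = std::accumulate(
        writesPerPixel.begin(), writesPerPixel.end(), 0ULL);
    return static_cast<double>(total) /
           static_cast<double>(writesPerPixel.size());
}

// Applies the guideline from the text: below an average overdraw of
// 2, turn z-buffering off and render back to front instead.
bool shouldUseZBuffer(double overdraw) {
    return overdraw >= 2.0;
}
```

The threshold is only a starting point; measuring both configurations on the target hardware remains the more reliable test.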

On faster personal computers, software rendering to system memory is often faster than rendering to video memory, although it has the disadvantage of not being able to use double buffering or hardware-accelerated clear operations. If your application can render to either system or video memory, and if you include a routine that tests which is faster, you can take advantage of the best approach on the current system. The Direct3D sample code in this software development kit (SDK) demonstrates this strategy. It is necessary to implement both methods because there is no other way to test the speed. Speeds can vary enormously from computer to computer, depending on the main-memory architecture and the type of graphics adapter being used.
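A routine that tests which path is faster might look like the following sketch. It is not the SDK sample code: RenderTarget, timeFrames, and pickFasterTarget are hypothetical names, and the two callables are assumed to render one frame each to system memory and to video memory.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical label for the two rendering destinations.
enum class RenderTarget { SystemMemory, VideoMemory };

// Calls renderFrame `frames` times and returns the elapsed seconds.
double timeFrames(const std::function<void()>& renderFrame, int frames) {
    assert(frames > 0);
    using Clock = std::chrono::steady_clock;
    const Clock::time_point start = Clock::now();
    for (int i = 0; i < frames; ++i) {
        renderFrame();
    }
    return std::chrono::duration<double>(Clock::now() - start).count();
}

// Times both paths over the same number of frames and reports which
// one finished sooner on the current computer.
RenderTarget pickFasterTarget(const std::function<void()>& renderToSystem,
                              const std::function<void()>& renderToVideo,
                              int frames) {
    const double systemSeconds = timeFrames(renderToSystem, frames);
    const double videoSeconds = timeFrames(renderToVideo, frames);
    return systemSeconds < videoSeconds ? RenderTarget::SystemMemory
                                        : RenderTarget::VideoMemory;
}
```

Running the test once at startup with a representative scene, and over enough frames to smooth out noise, is one reasonable way to use such a routine.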