Performance Optimizations

Microsoft DirectX 8.1 (Visual Basic)

Every developer who creates real-time applications that use three-dimensional (3-D) graphics is concerned about performance optimization. This section provides guidelines for getting the best performance from your code.

General Performance Tips

Follow these general guidelines to increase the performance of your application.

Clear only when you must.
Minimize state changes and group the remaining state changes.
Use smaller textures, if you can do so.
Draw objects in your scene from front to back.
Use triangle strips instead of lists and fans. For optimal vertex cache performance, arrange strips to reuse triangle vertices sooner, rather than later.
Gracefully degrade special effects that require a disproportionate share of system resources.
Constantly test your application's performance.
Minimize vertex buffer switches.
Use static vertex buffers where possible.
Use one large static vertex buffer per FVF for static objects, rather than one per object.
If your application needs random access into the vertex buffer, choose a vertex format size that is a multiple of 32 bytes. Otherwise, select the smallest appropriate format.
Draw using indexed primitives. This may allow for more efficient vertex caching within hardware.
If the depth buffer format contains a stencil channel, always clear the depth and stencil channels at the same time.
In shaders, do not copy to output registers unless necessary; write the result directly to the output register instead. For example, use:
```
mad oD0, r1, v0, c[3] 
```
rather than:
```
mad r1, r1, v0, c[3]
mov oD0, r1
```
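
The guideline above about using one large static vertex buffer per FVF, rather than one per object, can be sketched on the CPU side as follows. This is a minimal illustration, not Direct3D code: `Vertex`, `VertexRange`, and `PackStaticObjects` are hypothetical names, and the packed array stands in for the data you would copy once into a static vertex buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vertex layout standing in for one FVF
// (position, normal, and one set of texture coordinates).
struct Vertex {
    float x, y, z;
    float nx, ny, nz;
    float u, v;
};

// Where one object's vertices live inside the shared buffer.
struct VertexRange {
    std::size_t firstVertex;
    std::size_t vertexCount;
};

// Packs every object's vertices into one contiguous array and
// records the range that each object occupies.
std::vector<VertexRange> PackStaticObjects(
    const std::vector<std::vector<Vertex>>& objects,
    std::vector<Vertex>& packed)
{
    std::size_t total = 0;
    for (const std::vector<Vertex>& object : objects) {
        total += object.size();
    }

    packed.clear();
    packed.reserve(total);

    std::vector<VertexRange> ranges;
    ranges.reserve(objects.size());
    for (const std::vector<Vertex>& object : objects) {
        ranges.push_back({packed.size(), object.size()});
        packed.insert(packed.end(), object.begin(), object.end());
    }

    assert(packed.size() == total);  // every vertex was copied exactly once
    return ranges;
}
```

Each object is then drawn from its own range of the shared buffer, so the application does not need to switch vertex buffers between static objects that share a format.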

Databases and Culling

Building a reliable database of the objects in your world is key to excellent performance in Microsoft® Direct3D®. It is more important than improvements to rasterization or hardware.

You should maintain the lowest polygon count you can possibly manage. Design for a low polygon count by building low-polygon models from the start. Later in the development process, add polygons only if you can do so without sacrificing performance. Remember, the fastest polygons are the ones you don't draw.

Batching Primitives

To get the best rendering performance during execution, try to work with primitives in batches and keep the number of render-state changes as low as possible. For example, if you have an object with two textures, group the triangles that use the first texture and follow them with the necessary render state to change the texture. Then group all the triangles that use the second texture. The simplest hardware support for Direct3D is called with batches of render states and batches of primitives through the hardware abstraction layer (HAL). The more effectively the instructions are batched, the fewer HAL calls are performed during execution.
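
The grouping described above can be sketched as a sort on the texture that each group of triangles uses. This is only an illustration under simple assumptions: `DrawItem`, `CountTextureChanges`, and `BatchByTexture` are hypothetical names, and a real application would sort on whatever render states are most expensive to change.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical draw record: the texture a group of triangles needs.
struct DrawItem {
    int textureId;
    int firstTriangle;
    int triangleCount;
};

// Counts how many texture changes a given draw order would cause.
// The first item always costs one change because nothing is set yet.
std::size_t CountTextureChanges(const std::vector<DrawItem>& items)
{
    std::size_t changes = 0;
    bool haveTexture = false;
    int current = 0;
    for (const DrawItem& item : items) {
        if (!haveTexture || item.textureId != current) {
            ++changes;
            current = item.textureId;
            haveTexture = true;
        }
    }
    return changes;
}

// Reorders items so that all triangles sharing a texture are adjacent.
// stable_sort keeps the original order within each texture group.
void BatchByTexture(std::vector<DrawItem>& items)
{
    std::stable_sort(items.begin(), items.end(),
                     [](const DrawItem& a, const DrawItem& b) {
                         return a.textureId < b.textureId;
                     });
}
```

For four items that alternate between two textures, the unsorted order causes four texture changes and the batched order causes two.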

Lighting Tips

Because lights add a per-vertex cost to each rendered frame, you can achieve significant performance improvements by being careful about how you use them in your application. Most of the following tips derive from the maxim "the fastest code is code that is never called."

Use as few light sources as possible. If you only need to increase the overall lighting level, use the ambient light instead of adding a new light source. It's much cheaper.
Directional lights are cheaper than point lights or spotlights. For directional lights, the direction to the light is fixed and doesn't need to be calculated on a per-vertex basis.
Spotlights can be cheaper than point lights, because the area outside the cone of light is calculated quickly. Whether spotlights are cheaper depends on how much of your scene is lit by the spotlight.
Use the range parameter to limit your lights to only the parts of the scene you need to illuminate. All the light types exit fairly early when they are out of range.
Specular highlights almost double the cost of a light. Use them only when you must. Set the D3DRS_SPECULARENABLE render state to 0, the default value, whenever possible. When defining materials, you must set the specular power value to zero to turn off specular highlights for that material; simply setting the specular color to 0,0,0 is not enough.

Texture Size

Texture-mapping performance is heavily dependent on the speed of memory. There are a number of ways to maximize the cache performance of your application's textures.

Keep the textures small. The smaller the textures are, the better chance they have of being maintained in the main CPU's secondary cache.
Do not change the textures on a per-primitive basis. Try to keep polygons grouped in order of the textures they use.
Use square textures whenever possible. Textures whose dimensions are 256×256 are the fastest. If your application uses four 128×128 textures, for example, try to ensure that they use the same palette and place them all into one 256×256 texture. This technique also reduces the amount of texture swapping. Of course, you should not use 256×256 textures unless your application requires that much texturing because, as mentioned, textures should be kept as small as possible.
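
When four 128×128 textures are combined into one 256×256 texture, each model's texture coordinates must be remapped into its quarter of the larger texture. The following sketch shows one way to do that. The tile numbering (0 to 3, left to right, then top to bottom) and the names `UV` and `RemapToAtlas` are assumptions made for this example.

```cpp
#include <cassert>

// A texture coordinate pair in the range [0, 1].
struct UV {
    float u;
    float v;
};

// Maps a coordinate from a tile's own [0, 1] space into the 2x2
// combined texture. Each tile covers half of the width and height.
UV RemapToAtlas(int tileIndex, UV local)
{
    assert(tileIndex >= 0 && tileIndex < 4);
    const int column = tileIndex % 2;
    const int row = tileIndex / 2;
    return { (column + local.u) * 0.5f, (row + local.v) * 0.5f };
}
```

For example, the center of tile 3 (0.5, 0.5) maps to (0.75, 0.75) in the combined texture. Note that texture filtering can sample across tile borders, so coordinates that wrap or sit exactly on a tile edge may need extra care.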

Using Dynamic Textures

Dynamic textures are a new Microsoft® DirectX® 8.1 feature. To find out whether the driver supports dynamic textures, check for the D3DCAPS2_DYNAMICTEXTURES flag in the Caps2 member of the D3DCAPS8 structure.

Keep the following things in mind when working with dynamic textures.

They cannot be managed. For example, their pool cannot be D3DPOOL_MANAGED.
Dynamic textures can be locked, even if they are created in D3DPOOL_DEFAULT.
D3DLOCK_DISCARD is a valid lock flag for dynamic textures.

It is a good idea to create only one dynamic texture per format and possibly per size. Dynamic mipmaps, cubes, and volumes are not recommended because of the additional overhead in locking every level. For mipmaps, D3DLOCK_DISCARD is allowed only on the top level. All levels are discarded by locking just the top level. This behavior is the same for volumes and cubes. For cubes, this means locking the top level of face 0.

The following pseudocode shows an example of using a dynamic texture.

DrawProceduralTexture(pTex)
{
    // pTex should not be very small, because the overhead of calling the driver
    // on every DISCARD would outweigh the performance gain. Experimentation is encouraged.
    pTex->Lock(DISCARD);
    <Overwrite *entire* texture>
    pTex->Unlock();
    pDev->SetTexture();
    pDev->DrawPrimitive();
}

Using Dynamic Vertex and Index Buffers

Dynamic vertex and index buffers perform differently depending on their size and usage. The usage styles below help to determine whether to use D3DLOCK_DISCARD or D3DLOCK_NOOVERWRITE for the Flags parameter of the Lock method.

Usage Style 1:

for loop()
{
    pBuffer->Lock(...D3DLOCK_DISCARD...); //Ensures that hardware 
                                          //doesn't stall by returning 
                                          //a new pointer.
    Fill data (optimally 1000s of vertices/indices, no fewer) in pBuffer.
    pBuffer->Unlock();
    Change state(s).
    DrawPrimitive() or DrawIndexedPrimitive()
}

Usage Style 2:

for loop()
{
    pBuffer->Lock(...D3DLOCK_DISCARD...); //Ensures that hardware doesn't 
                                          //stall by returning a new 
                                          //pointer.
    Fill data (optimally 1000s of vertices/indices, no fewer) in pBuffer.
    pBuffer->Unlock();
    for loop( 100s of times )
    {
        Change State
        DrawPrimitive() or DrawIndexedPrimitive() //Tens of primitives
    }
}

Usage Style 3:

for loop()
{
    If there is space in the buffer
    {
        // Append vertices/indices.
        pBuffer->Lock(...D3DLOCK_NOOVERWRITE...);
    }
    Else
    {
        // Reset to beginning.
        pBuffer->Lock(...D3DLOCK_DISCARD...);
    }
    Fill a few 10s of vertices/indices in pBuffer.
    pBuffer->Unlock();
    Change State
    DrawPrimitive() or DrawIndexedPrimitive() // A few primitives
}

Style 1 is faster than either style 2 or 3, but is generally not very practical. Style 2 is usually faster than style 3, provided that the application fills at least a couple thousand vertices/indices for every Lock, on average. If the application fills fewer than that on average, then style 3 is faster. There is no guaranteed answer as to which lock method is faster and the best way to find out is to experiment.
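
The append-or-reset decision in usage style 3 can be sketched as a small cursor that tracks the next free position in the buffer. This is a CPU-side illustration only: `LockMode`, `LockRequest`, and `AppendCursor` are hypothetical names that stand in for choosing between D3DLOCK_NOOVERWRITE and D3DLOCK_DISCARD, and sizes are counted in elements rather than bytes.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the two lock flags discussed above.
enum class LockMode { NoOverwrite, Discard };

struct LockRequest {
    LockMode mode;       // which flag the caller would pass to Lock
    std::size_t offset;  // first element the caller may write
};

// Tracks the write position in a dynamic buffer of fixed capacity.
class AppendCursor {
public:
    explicit AppendCursor(std::size_t capacity)
        : capacity_(capacity), next_(0) {}

    // Reserves room for count elements. Appends if they fit; otherwise
    // restarts at offset 0, which is where a discard lock would be used.
    LockRequest Reserve(std::size_t count)
    {
        assert(count <= capacity_);
        LockRequest request;
        if (count <= capacity_ - next_) {
            request = { LockMode::NoOverwrite, next_ };
        } else {
            request = { LockMode::Discard, 0 };
        }
        next_ = request.offset + count;
        return request;
    }

private:
    std::size_t capacity_;
    std::size_t next_;
};
```

With a capacity of 100 elements, three successive requests for 40 elements yield an append at offset 0, an append at offset 40, and then a reset to offset 0.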

Using Meshes

You can optimize meshes by using Direct3D indexed triangles instead of indexed triangle strips. The hardware will discover that 95 percent of successive triangles actually form strips and adjust accordingly. Many drivers do this for legacy hardware also.

Direct3DX mesh objects can have each triangle, or face, tagged with a DWORD, called the attribute of that face. The semantics of the DWORD are user-defined. Direct3DX uses the attributes only to classify the mesh into subsets. The application sets per-face attributes using the LockAttributeBuffer call. The Optimize method can group the mesh vertices and faces by attribute when you specify the D3DXMESHOPT_ATTRSORT option. When this is done, the mesh object calculates an attribute table that the application can obtain by calling GetAttributeTable. This call returns 0 if the mesh is not sorted by attributes. There is no way for an application to set an attribute table because it is generated by the Optimize method. The attribute sort is data sensitive, so even if the application knows that a mesh is already attribute sorted, it still needs to call Optimize to generate the attribute table.

The following topics describe the different attributes of a mesh.

Attribute ID

An attribute ID is a value that associates a group of faces with an attribute group. This ID describes which subset of faces DrawSubset should draw. Attribute IDs are specified for the faces in the attribute buffer. The actual values of the attribute IDs can be anything that fits in 32 bits, but it is common to use 0 to n - 1, where n is the number of attributes.

Attribute Buffer

The attribute buffer is an array of DWORDs (one per face) that specifies which attribute group each face belongs to. This buffer is initialized to zero when a mesh is created, so every face starts in attribute group 0. It is either filled by the load routines or must be filled by the user if more than the single attribute with ID 0 is desired. This buffer contains the information that Optimize uses to sort the mesh by attribute. If no attribute table is present, DrawSubset scans this buffer to select the faces of the given attribute to draw.

Attribute Table

The attribute table is a structure owned and maintained by the mesh. The only way to generate one is to call Optimize with attribute sorting or stronger optimization enabled. DrawSubset uses the attribute table to issue a single draw-primitive call quickly. The only other use is that progressive meshes also maintain this structure, so it is possible to see which faces and vertices are active at the current level of detail.
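
The relationship between the attribute buffer and the attribute table can be sketched as follows. This is a simplified model, not the Direct3DX implementation: `AttributeRange`, `BuildAttributeTable`, and `CountFacesByScan` are hypothetical names, and the entry shown here tracks only face ranges.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified, hypothetical table entry: a run of faces sharing one ID.
struct AttributeRange {
    std::uint32_t attribId;
    std::size_t faceStart;
    std::size_t faceCount;
};

// Sorts the per-face attribute buffer and returns one range per
// distinct ID. A real mesh would reorder its faces the same way.
std::vector<AttributeRange> BuildAttributeTable(
    std::vector<std::uint32_t>& attributeBuffer)
{
    std::stable_sort(attributeBuffer.begin(), attributeBuffer.end());

    std::vector<AttributeRange> table;
    for (std::size_t face = 0; face < attributeBuffer.size(); ++face) {
        if (table.empty() || table.back().attribId != attributeBuffer[face]) {
            table.push_back({attributeBuffer[face], face, 0});
        }
        ++table.back().faceCount;
    }
    return table;
}

// Without a table, finding a subset means scanning every face.
std::size_t CountFacesByScan(
    const std::vector<std::uint32_t>& attributeBuffer, std::uint32_t id)
{
    return static_cast<std::size_t>(
        std::count(attributeBuffer.begin(), attributeBuffer.end(), id));
}
```

Once the faces are sorted, each subset is a single contiguous range, which is why a table lookup can replace the per-face scan.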

Z-Buffer Performance

Applications can increase performance when using z-buffering and texturing by ensuring that scenes are rendered from front to back. Textured z-buffered primitives are pretested against the z-buffer on a scan line basis. If a scan line is hidden by a previously rendered polygon, the system rejects it quickly and efficiently. Z-buffering can improve performance, but the technique is most useful when a scene includes a great deal of overdraw. Overdraw is the average number of times that a screen pixel is written to. Overdraw is difficult to calculate exactly, but you can often make a close approximation. If the overdraw averages less than 2, you can achieve the best performance by turning z-buffering off and rendering the scene from back to front.
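
As a rough illustration of the overdraw rule of thumb above, the following sketch averages a per-pixel write count and applies the threshold of 2. It assumes that you can obtain such counts, for example from an instrumented test build; `AverageOverdraw` and `ShouldUseZBuffer` are hypothetical names.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// writesPerPixel[i] is how many times pixel i was written in one frame.
double AverageOverdraw(const std::vector<unsigned>& writesPerPixel)
{
    assert(!writesPerPixel.empty());
    const unsigned long long total = std::accumulate(
        writesPerPixel.begin(), writesPerPixel.end(), 0ULL);
    return static_cast<double>(total) /
           static_cast<double>(writesPerPixel.size());
}

// Applies the rule of thumb from the text: below an average of 2,
// turn z-buffering off and render the scene from back to front.
bool ShouldUseZBuffer(double averageOverdraw)
{
    return averageOverdraw >= 2.0;
}
```

Treat the result as a starting point only, and measure both approaches on the hardware you are targeting.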

On faster personal computers, software rendering to system memory is often faster than rendering to video memory although it has the disadvantage of not being able to use double buffering or hardware-accelerated clear operations. If your application can render to either system or video memory, and if you include a routine that tests which is faster, you can take advantage of the best approach on the current system. The Direct3D sample code in this software development kit (SDK) demonstrates this strategy. It is necessary to implement both methods because there is no other way to test the speed. Speeds can vary enormously from computer to computer, depending on the main-memory architecture and the type of graphics adapter being used.
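
A routine that tests which approach is faster might be structured like the following sketch. It is not taken from the SDK samples: `RenderTarget`, `TimeFrames`, and `ChooseFasterTarget` are hypothetical names, and the two callbacks stand in for your application's own render paths.

```cpp
#include <chrono>
#include <functional>

enum class RenderTarget { SystemMemory, VideoMemory };

// Runs one render path for the given number of frames and returns
// the elapsed time.
std::chrono::steady_clock::duration TimeFrames(
    const std::function<void()>& renderFrame, int frames)
{
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < frames; ++i) {
        renderFrame();
    }
    return std::chrono::steady_clock::now() - start;
}

// Times both paths over the same number of frames and reports
// which one finished sooner on the current system.
RenderTarget ChooseFasterTarget(
    const std::function<void()>& renderToSystemMemory,
    const std::function<void()>& renderToVideoMemory,
    int frames)
{
    const auto systemTime = TimeFrames(renderToSystemMemory, frames);
    const auto videoTime = TimeFrames(renderToVideoMemory, frames);
    return systemTime < videoTime ? RenderTarget::SystemMemory
                                  : RenderTarget::VideoMemory;
}
```

Run the test with frames that are representative of your scenes, because a short or trivial test can be dominated by timer resolution and startup costs.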