Applications that use the ramp driver can sometimes improve performance by using the D3DTBLEND_COPY texture-blending mode from the D3DTEXTUREBLEND enumerated type. This mode is an optimization for software rasterization; for applications using a HAL, it is equivalent to the D3DTBLEND_DECAL texture-blending mode.
Copy mode is the simplest form of rasterization and hence the fastest. When copy-mode rasterization is used, no lighting or shading is performed on the texture. The bytes from the texture are copied directly to the screen and mapped onto polygons using the texture coordinates in each vertex. Consequently, when using copy mode, your application's textures must use the same pixel format and the same palette as the primary surface.
If your application uses the monochromatic model with 8-bit color and no lighting, performance can improve if you use copy mode. If your application uses 16-bit color, however, copy mode is not quite as fast as using modulated textures. A 16-bit texture is twice the size of its 8-bit equivalent, and the extra burden on the cache makes performance slightly worse than with an 8-bit lit texture.
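The size difference is simple arithmetic. The following is a minimal sketch; the function name `texture_bytes` and the 256 x 256 dimensions are illustrative assumptions, not part of the Direct3D API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes occupied by an uncompressed texture.
   Assumes whole-byte texel sizes (8, 16, 24, or 32 bits). */
size_t texture_bytes(size_t width, size_t height, unsigned bits_per_texel)
{
    assert(bits_per_texel % 8 == 0);
    return width * height * (bits_per_texel / 8);
}
```

Under these assumptions, a 256 x 256 texture occupies 65,536 bytes at 8 bits per texel and 131,072 bytes at 16 bits per texel, so the same scene touches twice as much texture memory.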
Copy mode implements only two rasterization options: z-buffering and chromakey transparency. The fastest mode is to simply map the texels to the polygons, with no transparency and no z-buffering. Enabling chromakey transparency accelerates the rasterization of invisible pixels because only the texture read is performed, but visible pixels incur a slight performance degradation because of the chromakey test.
Enabling z-buffering incurs the largest performance degradation for 8-bit copy mode. When z-buffering is enabled, a 16-bit value has to be read and conditionally written per pixel. Even so, enabling z-buffering for copy mode can be faster than disabling it if the average overdraw exceeds 2 and the scene is rendered in front-to-back polygon order.
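The per-pixel cost of the two options can be illustrated with a simplified span loop. This is a sketch under stated assumptions, not the ramp driver's actual code: the function `copy_span8`, the `COPY_*` flags, and the convention that a smaller z value is nearer are all hypothetical, and the texels are assumed to have been fetched in span order already rather than stepped through by texture coordinates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical option flags; these are not Direct3D identifiers. */
enum { COPY_CHROMAKEY = 1, COPY_ZBUFFER = 2 };

/* Copies n 8-bit texels to dst and returns the number of pixels written.
   z holds the span's depth values; zbuf is the 16-bit z-buffer.
   Both may be NULL when COPY_ZBUFFER is not set. */
size_t copy_span8(uint8_t *dst, uint16_t *zbuf, const uint8_t *texels,
                  const uint16_t *z, size_t n, unsigned flags, uint8_t key)
{
    size_t written = 0;
    assert(dst != NULL && texels != NULL);
    for (size_t i = 0; i < n; ++i) {
        uint8_t t = texels[i];          /* the texture read always happens */
        if ((flags & COPY_CHROMAKEY) && t == key)
            continue;                   /* invisible pixel: no further work */
        if (flags & COPY_ZBUFFER) {
            if (z[i] >= zbuf[i])        /* 16-bit read per pixel */
                continue;               /* hidden pixel: skip both writes */
            zbuf[i] = z[i];             /* conditional 16-bit write */
        }
        dst[i] = t;                     /* straight copy, no lighting */
        ++written;
    }
    return written;
}
```

With no flags set, the loop reduces to one read and one write per pixel. The chromakey test adds a comparison to every visible pixel but lets invisible pixels exit early, and the z-buffer path adds a 16-bit read to every pixel it reaches and a 16-bit write to every pixel that passes.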
If your scene has an average overdraw of less than 2 (which is very likely), you should not use z-buffering in copy mode. The only exception to this rule is if the scene complexity is very high. For example, if you have more than about 1500 rendered polygons in the scene, the overhead of sorting becomes significant. In that case, it may be worth considering a z-buffer again.
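These rules of thumb can be written as a small decision function. The sketch below is illustrative only: `should_use_zbuffer` and the two constants are hypothetical names, the thresholds are the approximate figures quoted above, and the overdraw case assumes the scene is rendered front to back. Profile your own scenes before relying on it.

```c
#include <assert.h>
#include <stdbool.h>

/* Approximate thresholds from the guidance above; treat as starting points. */
#define OVERDRAW_THRESHOLD 2.0
#define POLYGON_THRESHOLD  1500

/* Hypothetical helper: returns true when a z-buffer is likely worthwhile
   in copy mode, given the average overdraw and rendered polygon count. */
bool should_use_zbuffer(double avg_overdraw, int rendered_polygons)
{
    assert(avg_overdraw >= 0.0 && rendered_polygons >= 0);
    if (avg_overdraw > OVERDRAW_THRESHOLD)
        return true;    /* depth test saves more writes than it costs */
    return rendered_polygons > POLYGON_THRESHOLD;   /* sorting too costly */
}
```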
Direct3D is fastest when all it needs to draw is one long triangle instruction. Render state changes just get in the way of this; the longer the average triangle instruction, the better the triangle throughput. Therefore, peak sorting performance can be achieved when all the textures for a given scene are contained in only one texture map or texture page. Although this adds the restriction that no texture coordinate can be larger than 1.0, it has the performance benefit of completely avoiding texture state changes.
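When several textures share one texture page, each polygon's coordinates must be remapped into the sub-rectangle that its texture occupies. The following is a minimal sketch of that remapping; the `PageRect` and `TexCoord` types and the `to_page_coord` function are hypothetical and not part of Direct3D.

```c
#include <assert.h>

/* Hypothetical types: a texture's sub-rectangle within the page,
   and a texture coordinate pair, both in the range 0.0 to 1.0. */
typedef struct { float u0, v0, u1, v1; } PageRect;
typedef struct { float u, v; } TexCoord;

/* Maps a coordinate local to one texture into page coordinates.
   The inputs must stay within 0.0 to 1.0: a coordinate outside that
   range would sample a neighboring texture instead of tiling. */
TexCoord to_page_coord(PageRect r, float u, float v)
{
    TexCoord out;
    assert(u >= 0.0f && u <= 1.0f && v >= 0.0f && v <= 1.0f);
    out.u = r.u0 + u * (r.u1 - r.u0);
    out.v = r.v0 + v * (r.v1 - r.v0);
    return out;
}
```

For example, a texture stored in the upper-right quadrant of the page, described here as `{0.5f, 0.0f, 1.0f, 0.5f}`, maps the local coordinate (0.5, 1.0) to the page coordinate (0.75, 0.5).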
For normal, simple scenes, use one texture and one material, and sort the triangles. Use z-buffering only when the scene becomes complex.