Models are normally built around a natural local origin. For instance, it makes sense to put the origin of a chair model at floor level, centered under the chair, because that makes the model easier to place in the world. The coordinates that define the model are relative to this origin and are known as model coordinates.
The world transform controls how geometry is transformed from model coordinates into world coordinates. This transform can include translations, rotations, and scalings. You would use the world transform to place your chair model in a room and scale it with respect to the other objects in the room. The world transform applies only to geometry; it does not apply to lights. For an example of working with world transforms, see World Transform.
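As an illustration of the chair placement described above, here is a minimal sketch of a world matrix built from a scale, a rotation about the vertical axis, and a translation. It assumes the row-vector convention (a vertex is transformed as v * M, with the translation in the fourth row). The types and helpers (`Mat4`, `Vec4`, `chairWorld`, and so on) are hypothetical names introduced for this example and are not part of the Direct3D API.

```cpp
#include <array>
#include <cmath>

// Hypothetical helper types: a row-major 4x4 matrix and a homogeneous
// row vector, so a vertex is transformed as v' = v * M.
using Mat4 = std::array<std::array<float, 4>, 4>;
using Vec4 = std::array<float, 4>;

Mat4 identity() {
    Mat4 m{};  // value-initialized to all zeros
    for (int i = 0; i < 4; ++i) m[i][i] = 1.0f;
    return m;
}

Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r[i][j] += a[i][k] * b[k][j];
    return r;
}

// Row vector times matrix.
Vec4 transform(const Vec4& v, const Mat4& m) {
    Vec4 r{};
    for (int j = 0; j < 4; ++j)
        for (int i = 0; i < 4; ++i)
            r[j] += v[i] * m[i][j];
    return r;
}

// Uniform scale about the model origin.
Mat4 scaling(float s) {
    Mat4 m = identity();
    m[0][0] = m[1][1] = m[2][2] = s;
    return m;
}

// Rotation about the y-axis (the vertical axis in this sketch).
Mat4 rotationY(float radians) {
    Mat4 m = identity();
    m[0][0] = std::cos(radians);  m[0][2] = -std::sin(radians);
    m[2][0] = std::sin(radians);  m[2][2] = std::cos(radians);
    return m;
}

// Translation stored in the fourth row (row-vector convention).
Mat4 translation(float x, float y, float z) {
    Mat4 m = identity();
    m[3][0] = x;  m[3][1] = y;  m[3][2] = z;
    return m;
}

// World matrix for the chair: scale first, then rotate, then translate.
Mat4 chairWorld(float scale, float yaw, float x, float y, float z) {
    return multiply(multiply(scaling(scale), rotationY(yaw)),
                    translation(x, y, z));
}
```

With this ordering, the model origin always lands exactly at the translation target, which is why a floor-level origin makes a chair easy to place: calling `chairWorld(2.0f, yaw, 5.0f, 0.0f, 3.0f)` puts the chair's base at (5, 0, 3) in the room regardless of the scale or rotation.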
The view transform controls the transition from world coordinates into "camera space." You can think about this transformation as controlling where the camera appears to be in the world. For an example of working with view transforms, see View Transform.
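One common way to build a view matrix is a "look-at" construction from a camera position, a target point, and an up direction. The sketch below assumes a left-handed coordinate system and the row-vector convention (v * M). The names `lookAtLH`, `Vec3`, `Vec4`, and `Mat4` are hypothetical helpers for this example, not the interface the documentation itself uses.

```cpp
#include <array>
#include <cmath>

using Vec3 = std::array<float, 3>;
using Vec4 = std::array<float, 4>;
using Mat4 = std::array<std::array<float, 4>, 4>;  // row-major, v' = v * M

Vec3 sub(const Vec3& a, const Vec3& b) {
    return Vec3{a[0] - b[0], a[1] - b[1], a[2] - b[2]};
}

float dot(const Vec3& a, const Vec3& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

Vec3 cross(const Vec3& a, const Vec3& b) {
    return Vec3{a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0]};
}

Vec3 normalize(const Vec3& v) {
    const float len = std::sqrt(dot(v, v));
    return Vec3{v[0] / len, v[1] / len, v[2] / len};
}

// Left-handed look-at view matrix: the camera sits at `eye`, looks toward
// `at`, and `up` gives the approximate vertical direction. Assumes that
// eye != at and that up is not parallel to the viewing direction.
Mat4 lookAtLH(const Vec3& eye, const Vec3& at, const Vec3& up) {
    const Vec3 zAxis = normalize(sub(at, eye));      // camera forward
    const Vec3 xAxis = normalize(cross(up, zAxis));  // camera right
    const Vec3 yAxis = cross(zAxis, xAxis);          // camera up

    Mat4 m{};
    for (int i = 0; i < 3; ++i) {
        m[i][0] = xAxis[i];
        m[i][1] = yAxis[i];
        m[i][2] = zAxis[i];
    }
    // The fourth row moves the camera position to the camera-space origin.
    m[3][0] = -dot(xAxis, eye);
    m[3][1] = -dot(yAxis, eye);
    m[3][2] = -dot(zAxis, eye);
    m[3][3] = 1.0f;
    return m;
}

// Row vector times matrix.
Vec4 transform(const Vec4& v, const Mat4& m) {
    Vec4 r{};
    for (int j = 0; j < 4; ++j)
        for (int i = 0; i < 4; ++i)
            r[j] += v[i] * m[i][j];
    return r;
}
```

In camera space the camera sits at the origin looking down the positive z-axis, so a camera placed 5 units from the world origin and aimed at it sees the world origin at (0, 0, 5).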
The projection transform changes the geometry from camera space into "clip space" and applies the perspective distortion. The term "clip space" refers to how the geometry is clipped to the view volume during this transform. For an example of working with projection transforms, see Projection Transform.
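To make the perspective distortion concrete, the following sketch builds one widely used left-handed perspective matrix from a vertical field of view, an aspect ratio, and near and far plane distances. It is only one possible projection matrix, and it assumes the row-vector convention and a clip volume whose depth runs from 0 at the near plane to w at the far plane. The function name `perspectiveFovLH` and the helper types are hypothetical.

```cpp
#include <array>
#include <cmath>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<std::array<float, 4>, 4>;  // row-major, v' = v * M

// Left-handed perspective projection (assumed form for this sketch).
// fovY is the vertical field of view in radians, aspect is width / height,
// and zNear / zFar are the positive distances to the clipping planes.
Mat4 perspectiveFovLH(float fovY, float aspect, float zNear, float zFar) {
    const float yScale = 1.0f / std::tan(fovY * 0.5f);
    const float xScale = yScale / aspect;
    const float q = zFar / (zFar - zNear);

    Mat4 m{};  // all zeros
    m[0][0] = xScale;
    m[1][1] = yScale;
    m[2][2] = q;
    m[2][3] = 1.0f;        // copies camera-space z into the output w
    m[3][2] = -zNear * q;  // shifts depth so the near plane maps to z = 0
    return m;
}

// Row vector times matrix.
Vec4 transform(const Vec4& v, const Mat4& m) {
    Vec4 r{};
    for (int j = 0; j < 4; ++j)
        for (int i = 0; i < 4; ++i)
            r[j] += v[i] * m[i][j];
    return r;
}
```

The output w equals the camera-space depth, so dividing x and y by w later shrinks distant geometry, which is the perspective effect. A point on the near plane comes out with z = 0, and a point on the far plane comes out with z = w.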
Finally, the geometry in clip space is transformed into pixel coordinates (screen space). This final transformation is controlled by the viewport settings.
Clipping and transforming vertices must take place in homogeneous space (simply put, space in which the coordinate system includes a fourth element), but the final result for most applications needs to be non-homogeneous 3-D coordinates defined in "screen space." This means that both the input vertices and the clipping volume must be converted into homogeneous space to perform the clipping and then converted back into non-homogeneous space to be displayed.
The world, view, and projection matrices are multiplied in that order to produce the combined transformation matrix [M]. An input vertex [x y z] is treated as the homogeneous vertex [x y z 1]. This vertex is multiplied by the combined 4×4 transform matrix [M] to obtain the output vertex [x1 y1 z1 w]. Following this multiplication, all input vertices are in "post-perspective homogeneous space." The clipping volume must then be expressed in the same space: it is transformed into post-perspective homogeneous space, and the clipping is performed there. The clipped vertices (all of which lie within the clip volume) are then transformed back into post-perspective non-homogeneous space by dividing x1, y1, and z1 by w. As a final step, the points are scaled so that the clip volume maps to the screen space viewport specified by the dwX, dwY, dwWidth, and dwHeight members of the D3DVIEWPORT2 structure.