Performance

Scalability is not much of a benefit if the initial performance is not satisfactory. It is always good to know that more and better hardware can take an application to its next evolutionary step, but what about the entry-level requirements? Don't all these high-end scalability features come at a price? Doesn't supporting every language from COBOL to Assembler necessarily compromise performance? Doesn't the ability to run a component on the other side of the world preclude running it efficiently in the same process as the client?

In COM and DCOM, the client never sees the server object itself, but the client is never separated from the server by a system component unless it's absolutely necessary. This transparency is achieved by a strikingly simple idea: the only way a client can talk to the component is through method calls. The client obtains the addresses of these methods from a simple table of method addresses (a "vtable"). When the client wants to call a method on a component, it obtains the method's address and calls it. The only overhead incurred by the COM programming model over a traditional C or Assembler function call is the simple lookup of the method's address (indirect function call vs. direct function call). If the component is an in-process component running on the same thread as the client, the method call arrives directly at the component. No COM or system code is involved; COM only defines the standard for laying out the method address table.
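The in-process case can be sketched in portable C++ without any Windows headers. Everything named here (`ICounter`, `ICounterVtbl`, `Counter`, `MakeCounter`, `ClientAddTwice`) is hypothetical and only illustrates the layout idea: the client holds a pointer to a table of method addresses and makes an indirect call through it. It is not a real COM interface, which would also derive from `IUnknown`.

```cpp
#include <cassert>

// Hypothetical interface: the object starts with a pointer to a table of
// method addresses (the "vtable"). This layout is all the client relies on.
struct ICounter;

struct ICounterVtbl {
    int (*Add)(ICounter* self, int amount);
    int (*Get)(ICounter* self);
};

struct ICounter {
    const ICounterVtbl* vtbl;
};

// Component side: the implementation embeds the interface as its first
// member, so an ICounter* can be converted back to a Counter*.
struct Counter {
    ICounter iface;
    int total;
};

static int Counter_Add(ICounter* self, int amount) {
    Counter* c = reinterpret_cast<Counter*>(self);
    c->total += amount;
    return c->total;
}

static int Counter_Get(ICounter* self) {
    return reinterpret_cast<Counter*>(self)->total;
}

static const ICounterVtbl kCounterVtbl = { Counter_Add, Counter_Get };

Counter MakeCounter() {
    return Counter{ { &kCounterVtbl }, 0 };
}

// Client side: look up the method address in the table, then call it.
// In the in-process case, no other code runs between client and component.
int ClientAddTwice(ICounter* counter, int amount) {
    assert(counter != nullptr && counter->vtbl != nullptr);
    counter->vtbl->Add(counter, amount);
    return counter->vtbl->Add(counter, amount);
}
```

A client would create the object with `MakeCounter()` and pass `&counter.iface` to `ClientAddTwice`. The only cost over a direct call is the extra load of the function address from the table.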

What happens when the client and the component are not as close: on another thread, in another process, or on another machine on the other side of the world? COM places its own remote procedure call (RPC) infrastructure code into the vtable. This code packages each method call into a standard buffer representation and sends it to the component's side, where the buffer is unpacked and the original method call is reissued. In effect, COM provides an object-oriented RPC mechanism.
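The pack, send, unpack, and reissue sequence can be illustrated with a minimal sketch. The names `ProxyAdd`, `StubDispatch`, `ServerAdd`, `PutInt32`, and `GetInt32` are invented for this example, the "transport" is a plain function call, and the buffer uses host byte order rather than the actual DCOM wire format. It shows only the shape of the mechanism, not COM's real marshaling code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using Buffer = std::vector<std::uint8_t>;

// The actual component method, living on the "server" side.
std::int32_t ServerAdd(std::int32_t a, std::int32_t b) { return a + b; }

// Append a 32-bit integer to the buffer (host byte order, for illustration).
void PutInt32(Buffer& buf, std::int32_t value) {
    std::uint8_t bytes[sizeof value];
    std::memcpy(bytes, &value, sizeof value);
    buf.insert(buf.end(), bytes, bytes + sizeof value);
}

// Read a 32-bit integer from the buffer at the given byte offset.
std::int32_t GetInt32(const Buffer& buf, std::size_t offset) {
    assert(offset + sizeof(std::int32_t) <= buf.size());
    std::int32_t value;
    std::memcpy(&value, buf.data() + offset, sizeof value);
    return value;
}

// Stub (component side): unpack the parameters, reissue the original
// method call, and pack the result into a reply buffer.
Buffer StubDispatch(const Buffer& request) {
    std::int32_t a = GetInt32(request, 0);
    std::int32_t b = GetInt32(request, sizeof(std::int32_t));
    Buffer reply;
    PutInt32(reply, ServerAdd(a, b));
    return reply;
}

// Proxy (client side): same signature as the real method, so the client
// cannot tell the difference. It packs the parameters and "sends" them.
std::int32_t ProxyAdd(std::int32_t a, std::int32_t b) {
    Buffer request;
    PutInt32(request, a);
    PutInt32(request, b);
    Buffer reply = StubDispatch(request);  // stands in for the network hop
    return GetInt32(reply, 0);
}
```

Calling `ProxyAdd(2, 40)` goes through the buffer and comes back with the same result a direct call to `ServerAdd(2, 40)` would produce. The copying into and out of the buffer is the work that cross-process and remote calls add on top of an in-process call.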

How fast is this RPC mechanism? There are different performance metrics to consider:

How fast is an "empty" method call?
How fast are "real world" method calls that send and return data?
How fast is a network round trip?

The table below shows some real-world performance numbers for COM and DCOM to give an idea of the relative performance of DCOM compared to other protocols.

Parameter size	4 bytes		50 bytes	
	calls / sec	ms / call	calls / sec	ms / call
Pentium®, in-process	3,224,816	0.00031	3,277,973	0.00031
Alpha™, in-process	2,801,630	0.00036	2,834,269	0.00035
Pentium, cross-process	2,377	0.42	2,023	0.49
Alpha, cross-process	1,925	0.52	1,634	0.61
Alpha to Pentium, remote	376	2.7	306	3.27

* These informal numbers were obtained on the author's Dell OptiPlex XM 5120 (120 MHz Pentium, 32 MB RAM) and a small DEC™ Alpha-based RISC machine (200 MHz, 32 MB RAM). Both machines were running the release version of Windows NT 4.0 (Build 1381). DCOM was using UDP over Intel® EtherExpress PRO network cards (10 Mbps) on the Microsoft corporate network under a normal load. The COM Performance Sample - available in the Windows NT 4.0 Win32 SDK - can be used to obtain similar numbers with other configurations.

The first two columns represent an "empty" method call (passing in and returning a 4-byte integer). The last two columns can be considered a "real world" COM method call (50 bytes of parameters).

The table shows how in-process components obtain zero-overhead performance (rows 1 and 2).

Cross-process calls (rows 3 and 4) require the parameters to be stored in a buffer and sent to the other process. A rate of roughly 2,000 calls per second on standard desktop hardware satisfies most performance requirements. All local calls are completely bound by processor speed (and to some extent by available memory) and scale well on multi-processor machines.

Remote calls (row 5) are primarily network bound and indicate approximately 35% overhead of DCOM over raw TCP/IP performance (2.7 ms per call versus a 2 ms round-trip time for TCP/IP).

Microsoft will soon provide formal DCOM performance numbers on a wide range of platforms that show DCOM's ability to scale with the number of clients and with the number of processors on the server.

These informal - but reproducible - performance numbers indicate an overhead of approximately 35% of DCOM over raw TCP/IP for empty calls. This ratio decreases further as the server performs actual processing. If the server requires 1 ms per call - for example to update a database - the ratio decreases to 23%; if the server requires 2 ms, it decreases to 17%.
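One way to reproduce these percentages is to divide the fixed DCOM overhead (2.7 ms minus the 2 ms TCP/IP round trip) by the time the call would take without DCOM, that is, the TCP/IP round trip plus the server's processing time. The function name `OverheadRatio` and this formula are an assumption that happens to match the figures quoted above. The text does not state the exact calculation used.

```cpp
#include <cassert>

// Assumed model: DCOM adds a fixed cost per remote call on top of the raw
// TCP/IP round trip. The ratio compares that fixed cost with the total time
// a call would take without it (network round trip + server work).
double OverheadRatio(double dcomRoundTripMs,
                     double tcpRoundTripMs,
                     double serverWorkMs) {
    assert(tcpRoundTripMs + serverWorkMs > 0.0);
    double overheadMs = dcomRoundTripMs - tcpRoundTripMs;
    return overheadMs / (tcpRoundTripMs + serverWorkMs);
}
```

With the measured values, `OverheadRatio(2.7, 2.0, 0.0)` gives 0.35, `OverheadRatio(2.7, 2.0, 1.0)` gives about 0.233, and `OverheadRatio(2.7, 2.0, 2.0)` gives 0.175, in line with the approximate 35%, 23%, and 17% figures.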

A custom protocol built directly on TCP/IP can reach the overall performance and scalability of DCOM only by implementing its own sophisticated thread-pool managers and pinging protocols. Most distributed applications will not want or need to make this significant investment for minor performance gains while sacrificing the convenience of the standardized DCOM wire protocol and programming model.