Does a device have to be at 100% utilization to be the bottleneck? Unfortunately, no, or our lives would be much simpler.
Two important issues contribute to the relationship between queue lengths and utilization. The first is the arrival patterns of requests for the device service, and the second is the amount of work the device is requested to do on each arrival.
Suppose we had ten threads that each wanted exactly 0.9 seconds of processor time in a continuous block, just once every 10 seconds. Further imagine that exactly 1 second after the first one arrived to ask for its 0.9 seconds, the second one arrived, and in the next second the third one arrived, and so on. The processor would be 90% busy, and there would be no queue. If each thread needed precisely 0.95 seconds of processor time, the processor would be utilized at 95%. If 0.999, then 99.9%. Note there is no queuing in this situation, and no interference between the threads.
In queuing theory this is called a constant arrival and constant service distribution. In a carefully engineered situation, a device can be nearly 100% busy without creating a queue. How delicate the balance between arrivals and service to achieve this state!
It is not hard to see that, if the second thread arrives just a little bit early, it has to wait for the first one to at least complete a time slice before it can run. Likewise, if the first thread requires just a little more than a second of processor time, it will get in the way of the next thread. A processor queue will start to form. Once arrival rates and service requests become unpredictable, a queue can build up, and there can be delays.
Figure 3.23 Response time to a randomized processor load
Figure 3.23 shows how the response time can grow for a number of threads if the arrivals and service requests are less regular. These threads ask for a somewhat random amount of processor time after an irregular delay. The utilization of the processor in the 25 threads on the right was 76%. Yet the delays experienced were almost three times that of the stand-alone thread. This means the length of the processor queue was almost 2.
According to queuing theory, if the arrival pattern is random and the service pattern is random, the length of the queue is 2 when the device utilization is 66%, or two-thirds utilized. We are using the word "random" somewhat loosely to mean unpredictable. For example, in a large telephone exchange, the length of the phone calls is found to be random in this way. (In fact, what we mean is that interarrival and service distributions are exponential, but that is more formal than we need to be here.)
Distributions can be worse than this, in the sense that queues can form at even lower utilizations. The most commonly occurring situation like this is a bimodal distribution of service, when most requests are very short or long with few that are medium length. We don't see these too often in computer systems, but they do occur. If your queue length is large and the utilization is low, you may be experiencing this type of usage pattern. If you want to impress your boss, say that your device is experiencing hyper-exponential service distributions.
So how a device is used determines the length of the queue that will form for a given utilization of the device. When looking for bottlenecks in real systems, you must be aware of this. It won't be hard to remember because it will apply at the bank and the supermarket as well as on your computer. So next time you are standing in line somewhere, you can quote Rule #8: