John Murray
September 1997
High-performance embedded applications must often manage time-critical responses. Examples include manufacturing process controls, high-speed data acquisition devices, telecommunications switching equipment, medical monitoring equipment, aircraft "fly-by-wire" controls, weapons delivery systems, space navigation and guidance, laboratory experiment control, automobile engine control, and robotics systems.
Validating such an application means examining not only its computational accuracy but also the timeliness of its results. The application must deliver its responses within specified time parameters—in real time.
In Real-Time Systems (McGraw-Hill, 1996), C.M. Krishna and Kang G. Shin loosely define a real-time system as "any system where a timely response by the computer to external stimuli is vital." The standard definition, from the Internet newsgroup comp.realtime "Frequently Asked Questions" by Mark Linimon and Jean-Christophe Monfret, states: "A real-time system is one in which the correctness of the computations not only depends upon the logical correctness of the computation but also upon the time at which the result is produced. If the timing constraints of the system are not met, system failure is said to have occurred."
It is important to distinguish between a real-time system and a real-time operating system (RTOS). The real-time system represents the set of all system elements—the hardware, operating system, and applications—that are needed to meet the system requirements. The RTOS is just one element of the complete real-time system and must provide sufficient functionality to enable the overall real-time system to meet its requirements.
It is also important to distinguish between a fast operating system and an RTOS. Speed, although useful for meeting the overall requirements, does not by itself meet the requirements for an RTOS. The Internet newsgroup comp.realtime lists some requirements that an operating system must meet to be considered an RTOS:
In addition, the operating system behavior must be predictable. This means real-time system developers must have detailed information about the system interrupt levels, system calls, and timing:
This article describes how Microsoft® Windows® CE meets each of these requirements for a real-time operating system. Most significantly, Windows CE guarantees an upper bound on the time it takes to start a real-time priority thread after receiving an interrupt. This article describes the interrupt latency times for a specific reference platform, the "Odo" platform with the Hitachi SH3 microprocessor.
Microsoft Windows CE is designed as a general-purpose operating system for small form-factor devices that are typically diskless systems with a limited memory capacity. Windows CE is adapted for a specific hardware platform by creating a thin layer of code that resides between the kernel and the hardware. This is known as the Hardware Abstraction Layer (HAL). (In previous releases, this interface was called the OEM Adaptation Layer, or OAL, and the Kernel Abstraction Layer, or KAL, to avoid confusion with the Windows NT® HAL.)
Unlike other Windows operating systems, Windows CE does not represent one standard, identical piece of software that is common to all platforms. To be flexible enough to meet the needs of a wide range of products, Windows CE is modular. This means that it can be custom-built for a product by selecting from a provided set of software modules. In addition, some of the available modules are componentized. This means these modules can be custom-built by selecting from a set of available components. By selecting the minimum set of modules and components that meets the system requirements, the OEM can minimize the memory footprint of the operating system.
The Microsoft Windows CE Embedded Toolkit for Visual C++® 5.0 provides the system libraries, tools, documentation, and sample code to enable OEMs to customize Windows CE for their specific hardware platforms. For more information about the toolkit, see the Microsoft Windows CE Web site (http://www.microsoft.com/windowsce/developer/prodinfo/vcceembed.htm). The toolkit will also be available to Universal Subscribers of the MSDN Library. The embedded toolkit also includes the Device Driver Kit (DDK) and the Software Development Kit (SDK). The DDK provides additional documentation about writing device drivers. The SDK provides libraries, header files, sample code, and documentation that allow developers to write applications for Windows CE platforms. Windows CE offers subsets of the same programming interfaces used to develop applications for other Windows operating systems. For example, Windows CE version 1.01 supported about 500 of the 1000 Microsoft Win32® application programming interface (API) functions. This means that a wide variety of tools, third-party books, and training courses for Win32 developers are already in place and available for Windows CE system developers.
Real-time systems developers can use the Windows CE Embedded Toolkit for Visual C++ 5.0 to port the operating system to their specific platform, and to develop additional device drivers and real-time applications for that platform.
Windows CE is a preemptive multitasking operating system. It allows multiple applications, or processes, to run within the system at the same time. Windows CE supports a maximum of 32 simultaneous processes. A process consists of one or more threads, where each thread represents an independent portion of that process. One thread is designated as the primary thread for the process; the process can also create an unspecified number of additional threads. The actual number of additional threads is limited only by available system resources.
Windows CE uses a priority-based time-slice algorithm to schedule the execution of threads. Windows CE supports eight discrete priority levels, from 0 through 7, where 0 represents the highest priority. These are defined in the header file Winbase.h.
Table 1. Thread priority levels
Priority level | Constant and description |
0 (highest priority) | THREAD_PRIORITY_TIME_CRITICAL (highest priority) |
1 | THREAD_PRIORITY_HIGHEST |
2 | THREAD_PRIORITY_ABOVE_NORMAL |
3 | THREAD_PRIORITY_NORMAL |
4 | THREAD_PRIORITY_BELOW_NORMAL |
5 | THREAD_PRIORITY_LOWEST |
6 | THREAD_PRIORITY_ABOVE_IDLE |
7 (lowest priority) | THREAD_PRIORITY_IDLE (lowest priority) |
Levels 0 and 1 are typically used for real-time processing and device drivers; levels 2-4 for kernel threads and normal applications; and levels 5-7 for applications that can always be preempted by other applications. Note that level 6, THREAD_PRIORITY_ABOVE_IDLE, is also defined and available for use.
Preemption is based solely on the thread's priority. Threads with a higher priority are scheduled to run first. Threads at the same priority level run in a round-robin fashion with each thread receiving a quantum, or slice of execution time. The quantum has a default value of 25 milliseconds (Windows CE version 2.0 supports changes to the quantum value on MIPS platforms). Threads at a lower priority do not run until all threads with a higher priority have finished, that is, until they either yield or are blocked. An important exception is that threads at the highest priority level (level 0, THREAD_PRIORITY_TIME_CRITICAL) do not share the time slice with other threads at the highest priority level. These threads continue executing until they have finished.
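For example, a real-time application might run its worker at level 0 using the standard Win32 thread calls. The following is a minimal desktop-style sketch (the worker body is a placeholder and error handling is trimmed):

```c
#include <windows.h>

/* Worker thread: placeholder for time-critical processing. */
static DWORD WINAPI TimeCriticalWorker(LPVOID lpParam)
{
    /* ... respond to the external event being monitored ... */
    return 0;
}

int main(void)
{
    DWORD threadId;
    HANDLE hThread = CreateThread(NULL, 0, TimeCriticalWorker,
                                  NULL, CREATE_SUSPENDED, &threadId);
    if (hThread == NULL)
        return 1;

    /* Level 0 (THREAD_PRIORITY_TIME_CRITICAL): not time-sliced;
       the thread runs until it yields or blocks. */
    SetThreadPriority(hThread, THREAD_PRIORITY_TIME_CRITICAL);
    ResumeThread(hThread);

    WaitForSingleObject(hThread, INFINITE);
    CloseHandle(hThread);
    return 0;
}
```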
Unlike other Microsoft Windows operating systems, the Windows CE thread priorities are fixed and do not change. Windows CE does not age priorities and does not mask interrupts based on these levels. They can be temporarily modified, but only by the Windows CE kernel, and only to avoid what is known as "priority inversion."
Priority inversion refers to a situation where the use of a resource by a low-priority thread delays the execution of a high-priority thread, when both are contending for the same resources. To correct this situation and free the higher-priority thread, Windows CE allows the lower-priority thread to inherit the more critical thread's priority and run at the higher priority until it releases its use of the resource.
For example, if a thread running at the highest priority attempts to acquire a mutex that is held by a lower priority thread, the lower priority thread is inverted to the higher priority and runs until it releases the mutex. The priority inversion algorithm applies to all threads in the system—for example, even kernel threads running at priority 1 can be inverted to level 0 should a priority 0 thread run demand-paged code and cause a blocking fault.
The priority-based preemptive multitasking design of the scheduler ensures that a thread running at the highest priority can be guaranteed to execute within a known period of time. This paper will later discuss specific latency figures for the Windows CE reference platform and the formulas to derive these figures for other platforms. Tools in the OEM Adaptation Kit (OAK) and SDK display the thread's state and priority level, and profile the performance of a given real-time system.
Real-time systems must ensure that processes and threads are synchronized. For example, if one part of the real-time application completes before a second part gets the most current data, the process that the application is monitoring may become unstable. Synchronization ensures that interaction between the application threads occurs correctly.
Like other Windows operating systems, Windows CE offers a rich set of "wait objects" for thread synchronization. These include critical section, event, and mutex objects. These wait objects allow a thread to block its own execution and wait until the specified object changes.
Windows CE queues mutex, critical section, and event requests in "FIFO-by-priority" order: a different first-in, first-out (FIFO) queue is defined for each of the eight discrete priority levels. A new request from a thread at a given priority is placed at the end of that priority's list. The scheduler adjusts these queues when priority inversions occur.
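For example, a producer thread can release a blocked consumer through an event object. The following is a minimal Win32 sketch of the pattern (the processing step is a placeholder):

```c
#include <windows.h>

static HANDLE g_hDataReady;   /* auto-reset event: one waiter released per SetEvent */

/* Consumer thread: blocks until the producer signals that data is ready. */
static DWORD WINAPI Consumer(LPVOID lpParam)
{
    for (;;) {
        WaitForSingleObject(g_hDataReady, INFINITE);
        /* ... process the newly available data ... */
    }
    return 0;
}

int main(void)
{
    DWORD id;

    /* Auto-reset (bManualReset = FALSE), initially nonsignaled. */
    g_hDataReady = CreateEvent(NULL, FALSE, FALSE, NULL);
    CreateThread(NULL, 0, Consumer, NULL, 0, &id);

    /* Producer side: signal the event each time new data arrives. */
    SetEvent(g_hDataReady);

    Sleep(100);   /* give the consumer a chance to run in this sketch */
    return 0;
}
```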
In addition to wait objects, Windows CE supports standard Win32 timer API functions. These use software interrupts from the kernel to obtain time intervals for use in managing real-time applications. Threads can use the system's interval timer by calling GetTickCount, which returns a count of milliseconds. For more detailed timing information, the Windows CE kernel also supports the Win32 API functions QueryPerformanceCounter and QueryPerformanceFrequency. The OEM must provide the hardware and software support for these calls with a higher-resolution timer and OAL interfaces to the timer.
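Both timing interfaces can be used together. The sketch below is illustrative only; the resolution of the performance counter is platform-dependent and, on Windows CE, depends on the OEM-supplied timer support:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    LARGE_INTEGER freq, t0, t1;
    DWORD tick;

    /* Millisecond-resolution interval timer. */
    tick = GetTickCount();

    /* Higher-resolution counter; its frequency varies by platform. */
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    /* ... timed work goes here ... */
    QueryPerformanceCounter(&t1);

    printf("%lu ms since boot; measured interval: %.1f microseconds\n",
           (unsigned long)tick,
           (double)(t1.QuadPart - t0.QuadPart) * 1e6 / (double)freq.QuadPart);
    return 0;
}
```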
Windows CE provides a virtual memory system. For example, while some current hardware platforms running Windows CE offer 4 megabytes (MB) of physical RAM, Windows CE supports a virtual address space of 2 gigabytes (GB), with each process accessing its own 32 MB of virtual address space. Paging a thread's code or data into physical memory as it is needed generates paging interrupts that can affect the amount of thread execution time.
Paging input/output (I/O) occurs at a priority level below the real-time process priority levels. This ensures that background virtual memory management does not interfere with processing at real-time priorities, although paging within a real-time process itself is still free to occur.
Real-time threads should be locked into memory to prevent these nondeterministic paging delays that can result from using the virtual memory management system.
Windows CE allows memory mapping which permits multiple processes to share the same physical memory. This results in very fast data transfers between cooperating processes or between a driver and an application. Memory mapping can be used to dramatically enhance real-time performance.
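A desktop-style Win32 sketch of the technique follows; the exact Windows CE mapping calls may differ slightly, and the object name "MySharedBuf" is hypothetical, agreed on by the cooperating processes:

```c
#include <windows.h>
#include <string.h>

int main(void)
{
    /* Page-file-backed named mapping: no on-disk file is involved. */
    HANDLE hMap = CreateFileMapping(INVALID_HANDLE_VALUE, NULL,
                                    PAGE_READWRITE, 0, 4096,
                                    TEXT("MySharedBuf"));
    if (hMap == NULL)
        return 1;

    char *pBuf = (char *)MapViewOfFile(hMap, FILE_MAP_WRITE, 0, 0, 4096);
    if (pBuf != NULL) {
        /* Writes here are immediately visible to any other process that
           maps the same named object: the data is never copied. */
        strcpy(pBuf, "sample data frame");
        UnmapViewOfFile(pBuf);
    }
    CloseHandle(hMap);
    return 0;
}
```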
Real-time applications are designed to respond to external events within a specified time interval. Real-time applications use interrupts as a way of ensuring that external events are quickly noticed by the operating system.
Within Windows CE, the kernel and the OAL are tuned to optimize interrupt delivery and event dispatching to the rest of the system. Windows CE balances performance and ease of implementation by splitting interrupt processing into two steps: an interrupt service routine (ISR) and an interrupt service thread (IST).
Each hardware interrupt request line (IRQ) is associated with one ISR. When interrupts are enabled and an interrupt occurs, the kernel calls the registered ISR for that interrupt. The ISR, the kernel-mode portion of interrupt processing, is kept as short as possible. Its responsibility is primarily to direct the kernel to launch the appropriate IST.
The ISR performs its minimal processing and returns an interrupt ID to the kernel. The kernel examines the returned interrupt ID and sets the associated event. The interrupt service thread is waiting on that event. When the kernel sets the event, the IST stops waiting and starts performing its additional interrupt processing. Most of the interrupt handling actually occurs within the IST. The two highest thread priority levels (levels 0 and 1) are usually assigned to ISTs, ensuring that these threads run as quickly as possible.
As noted previously, ISTs at the highest priority level are not preempted by any other threads. These threads continue execution until they either yield or are blocked.
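The ISR/IST split can be sketched from the IST side. The sketch below assumes the Windows CE InterruptInitialize and InterruptDone kernel interfaces; the interrupt ID SYSINTR_MYDEVICE is hypothetical and in practice is defined by the OEM's platform code:

```c
#include <windows.h>

/* Hypothetical interrupt ID; the ISR returns this value to the kernel
   so the kernel knows which IST event to set. */
#define SYSINTR_MYDEVICE  16

static DWORD WINAPI InterruptServiceThread(LPVOID lpParam)
{
    HANDLE hIntrEvent = CreateEvent(NULL, FALSE, FALSE, NULL);

    /* Associate the event with the interrupt ID returned by the ISR. */
    InterruptInitialize(SYSINTR_MYDEVICE, hIntrEvent, NULL, 0);

    /* ISTs typically run at one of the two highest priority levels. */
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);

    for (;;) {
        /* Block until the kernel sets the event on behalf of the ISR. */
        WaitForSingleObject(hIntrEvent, INFINITE);

        /* ... the bulk of the interrupt handling happens here ... */

        /* Tell the kernel this interrupt has been fully serviced. */
        InterruptDone(SYSINTR_MYDEVICE);
    }
    return 0;
}
```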
Windows CE does not support nested interrupts. This means that an interrupt cannot be serviced while a previous interrupt is being processed. That is, if an interrupt is raised while the kernel is within an ISR, execution continues to the end of that ISR before starting the ISR for the new IRQ. This can introduce latencies, or delays between the time of the hardware interrupt and the start of the ISR.
In this paper, the term interrupt latency refers primarily to the software interrupt handling latencies; that is, the amount of time that elapses from the time that an external interrupt arrives at the processor until the time that the interrupt processing begins.
Windows CE interrupt latency times are bounded for threads locked in memory (when paging does not occur). This makes it possible to calculate the worst-case latencies—the total times to the start of the ISR and to the start of the IST. The total amount of time until the interrupt is handled can then be determined by calculating the amount of time needed within the ISR and IST.
The general formula for ISR latency is defined as follows:
start of ISR = value1 + dISR_Current + sum(dISR_Higher)
value1 = The latency value due to processing within the kernel.
dISR_Current = The duration of an ISR in progress at the time the interrupt arrives. This value can range from 0 to the duration of longest ISR in the system.
sum(dISR_Higher) = The sum of the duration of all higher-priority ISRs that arrive before this ISR starts; that is, interrupts that arrive during the time value1 + dISR_Current.
For example, consider a simple embedded system with a critical-priority ISR. Because the ISR is set to the highest priority, there are no higher-priority ISRs, and the term sum(dISR_Higher) resolves to 0. The lowest latency, when no other ISR is in progress, is value1. The highest latency is value1 plus the duration of the longest ISR in the system, which occurs when the interrupt arrives just as that longest ISR is starting.
The general formula for IST latency is defined as follows:
start of IST = value2 + sum(dIST) + sum(dISR)
value2 = The latency value due to processing within the kernel.
sum(dIST) = The sum of the duration of all higher priority ISTs and thread context switch times that occur between this ISR and its start of IST.
sum(dISR) = The sum of the duration of all other ISRs that run between this interrupt's ISR and its IST.
For the simplest case—the embedded system with one critical-priority ISR and just one critical-priority thread (and no other priority 0 threads)—no other ISTs can intervene between the ISR and its IST. However, it is possible that other ISRs can be processed between the time-critical ISR and the start of its associated IST.
Because ISRs are processed as soon as they are available, it is possible to imagine pathological cases that involve generating a constant stream of ISRs, indefinitely postponing the start of the IST. This is unlikely to occur because the OEM has complete control over the number of interrupts present in the system. Because the OEM is creating a custom version of Windows CE for a specific operating environment, the OEM can take advantage of constraints within that target operating environment to optimize the system design.
To minimize latency times, the OEM can control the processing times of the ISR and IST, interrupt priorities, and thread priorities. The value1 and value2 elements in the formulas represent processing times within the Windows CE kernel that the OEM cannot control.
The current work on timing studies involves identifying the values for these numbers.
Two different approaches are being used to validate Windows CE performance:
The Windows CE Embedded Toolkit for Visual C++ 5.0 will include the following tools:
Note: For more information about the toolkit, see the Microsoft Windows CE Web site (http://www.microsoft.com/windowsce/developer/prodinfo/vcceembed.htm). The toolkit will also be available to Universal Subscribers of the MSDN Library.
Microsoft can also develop other timing-related tools based on customers' needs.
The Windows CE development team has inspected the kernel code to verify that it can be characterized by a worst-case time that is independent of the number of system objects.
For the purposes of this inspection, the kernel is characterized as a set of KCALLs, or system calls. These are kernel routines during which the kernel turns off preemption and does not allow other threads to run. The worst-case time that a real-time thread is held from running can be characterized as the worst-case KCALL time in the kernel. (Note that these times do not affect ISRs, but only threads, such as the ISTs.)
The development team has shown by inspection that there are no nonconstant loops within the KCALLs. This means that all KCALLs can be expressed as code paths with only forward branches, which ensures that the worst-case time through each KCALL can be found independent of its input parameters.
Finding the actual worst-case times involves using the instrumented kernel. This is simply a version of the kernel that is compiled and built after setting a specific environment variable, KCALL_PROFILE=1, to enable additional timing-related functions. The instrumented kernel is not the same as the debug kernel. The instrumented kernel is built as a retail kernel to obtain timing values that are as close as possible to the shipping product. The only difference between the retail kernel and the instrumented kernel is its instrumentation code.
The instrumented kernel records all KCALL times. These values, including minimum, maximum, and average times, can be printed to the debug port by calling the special API function, DumpKCallProfile. The instrumented kernel is typically run under heavy stress and then an application calls DumpKCallProfile to obtain the times.
The interrupt test utility Intrtime.exe collects interrupt-timing-delay data on a standard version of Windows CE. The utility takes over the system timer interrupt during the test, so it is not suitable for use on systems where the timer is required. For example, this utility cannot be used with the instrumented version of the kernel, which also requires the timer for its data.
The Intrtime utility ran a test of 1000 interrupts on the Hitachi D9000 Reference Platform with an SH3 microprocessor, running at 58.98 megahertz (MHz) internal and 14.745 MHz external frequency. (For information about Windows CE runtime characteristics on other microprocessors, please contact the vendors.)
The tests ran on a standard Handheld/PC configuration built to include all modules and components of Windows CE. Only the main operating system processes were running (Nk.exe, Filesys.exe, Gwes.exe, Device.exe, Shell.exe, and Explorer.exe), with no user-initiated interrupts (touch screen, keyboard, or other applications) during the tests. The utility reported the following minimum and maximum times to Start of ISR and Start of IST.
Table 2. Interrupt test results
Latency | Minimum and maximum values (1000 tests) |
Start of ISR | 1.3–7.5 microseconds |
Start of IST | 93–275 microseconds |
Most of the test results were distributed near the reported minimum values. While testing the start of ISR times, the minimum values 1.3 microseconds and 1.6 microseconds occurred 293 and 549 times, respectively, representing 84 percent of the tests. Similarly, over 90 percent (923 of 1000) of the Start-of-IST tests reported maximum latencies of 102 microseconds or less.
The Intrtime utility also creates a user-configurable number of system objects to test the ISR and IST start times with a variety of system objects. Although this work is very preliminary, it demonstrates that the Start of ISR times are independent of the number of system objects. The tests ran 1000 times (unless otherwise noted) with a background thread priority set to 5 or 7.
Table 3. ISR start times
Start of ISR maximum (microseconds) | Number of background threads (one event per thread) | Background thread priority |
8.4 | 0 | 7 |
8.6 | 5 (Note: represents only 100 tests) | 7 |
9.0 | 10 (Note: represents only 100 tests) | 5 |
14.8 | 10 | 5 |
19.2 | 10 | 5 |
17.0 | 10 | 7 |
12.8 | 20 | 5 |
11.0 | 20 (Note: represents only 100 tests) | 7 |
10.0 | 50 | 7 |
15.0 | 100 | 5 |
15.6 | 100 | 7 |
The values are not a function of the number of objects in the system. The different values are likely accounted for by the kernel state at the time the interrupt occurred. The development team is currently working on identifying the worst-case Start-of-ISR time.
Working backward from these results, and assuming that the minimum value to Start of ISR represents the best-case scenario, where dISR_Current and sum(dISR_Higher) = 0, the minimum value1 = start of ISR = 1.3 microseconds. Similarly, assuming the best-case scenario where sum(dIST) and sum(dISR) have the value 0, the minimum value2 = Start of IST = 93 microseconds. From these test results alone, however, it is not possible to determine the maximum values for value1 or value2.
Additional timing information can be gleaned from the instrumented kernel. A worst-case value for the amount of time spent in the kernel before Start of IST, value2, can be calculated based on the following formula:
value2 = dKCall + dNextThread
dKCall = The duration of the kernel call; the amount of time spent in one of the sections of the kernel during which preemption is turned off.
dNextThread = The duration of the NextThread kernel call; the amount of time spent setting up the IST.
In reality, scheduling a thread at priority level 0 is faster than the general NextThread call, so this formula gives a conservative upper bound.
The following table shows the worst-case results during preliminary tests with the instrumented kernel. These tests ran under the same conditions as the Intrtime tests described above.
Table 4. Worst-case results
Kernel call profiled | Maximum values (ad-hoc testing) |
Highest maximum time, all kernel calls | 266 microseconds (LeaveCrit) |
NextThread | 237 microseconds |
Total | 503 microseconds |
The instrumented kernel suggests that the upper bound for value2 under these conditions is about 500 microseconds. Because this figure is the sum of two independent worst-case times, it far exceeds the results reported by the Intrtime utility and overstates the true worst case; for example, scheduling a priority 0 thread does not take the worst-case path through NextThread. The 500-microsecond value therefore represents a conservative upper bound on the actual worst-case time.
The Intrtime utility is useful for reporting observed worst-case times of the system as a whole. The instrumented kernel is useful for reporting possible worst-case times. It profiles all latencies caused by the kernel—those cases where an IST is ready to run but is blocked while execution continues within a nonpreemptible portion of the kernel. The system worst-case latency can be calculated from the sum of the worst-case times of its parts.
Note that in this preliminary version of this article, all test results are based on an internal beta version of Windows CE. The existing operating system and utilities will continue to be modified and additional testing will profile system performance under a variety of operating conditions. These values will continue to be updated and published to reflect the current shipping version of the operating system.
The Microsoft Windows CE kernel design meets the minimum requirements of an RTOS described here, enabling Windows CE to be used as the operating system in many different types of embedded and real-time systems:
The Windows CE kernel design ensures predictable, bounded latencies for interrupts and their associated threads. This makes it suitable for many real-time applications. Future releases of the Windows CE Embedded Toolkit for Visual C++ will include interrupt timing utilities and an instrumented kernel to allow the OEM to examine and calibrate timing performance.
For the Microsoft Windows CE Embedded Toolkit for Visual C++ 5.0, see the Microsoft Windows CE Web site (http://www.microsoft.com/windowsce/developer/prodinfo/vcceembed.htm). The toolkit will also be available to Universal Subscribers of the MSDN Library.
Also, be sure to take a look at the following articles, all available in the MSDN Library:
"Embedded Development with Microsoft Windows CE 2.0," by Franklin Fite Jr. and Randy Kath.
"Introducing the Windows CE Embedded Toolkit for Visual C++ 5.0," by David Pellerin.
"Microsoft Windows CE Display Drivers and Hardware," by Jason Black and Jon Christiansen.
"Microsoft Windows CE Graphics Features," by Jon Christiansen.
"Microsoft Windows CE Memory Use," by John Murray.
"The Microsoft Windows CE Communications Model," by Guy Smith.
"The Win32 Programming Model: A Primer for Embedded Software Developers," by David Pellerin.