Perfmon For The Pentium Pro Design

Perfmon

Perfmon for the Pentium Pro is a tool that allows user-level code to access the performance counters present in the Intel P6 family of microprocessors on systems running the Solaris operating system from Sun Microsystems. This is accomplished by a loadable driver that reprograms devices that have performance counters so that user-level code can access them (normally, access to these counters is restricted to code running in privileged mode). Accessing the performance counters requires special machine instructions. There is a user library component of Perfmon that provides access to these instructions via C function calls. The library also includes a function for accessing the Time-Stamp Counter (TSC), which is incremented on every processor clock tick. Currently, the only devices supported are the Pentium Pro and the Pentium II processors. See the section on Future Work for devices that may be supported in future versions of Perfmon.

Background

There are two parts to collecting performance data on the P6 CPU. The first is to program one or both of the Performance-Event-Select registers (PerfEvtSel0/1), indicating the type of events that you wish to count. Access to PerfEvtSel0/1 is always privileged and requires a call into the Perfmon device driver. The second part is to read one or both of the Performance Counter registers (PerfCtr0/1) to get the current count of the watched events. Access to PerfCtr0/1 is normally privileged, but can be made non-privileged by setting the PCE bit (bit 8) of Control Register 4 (CR4). In addition, user access to the TSC register (which is incremented once per machine clock cycle at all times) is also enabled. This is done by clearing the TSD bit of the CR4 register. See the Perfmon For the Pentium Pro User's Guide and the Intel Architecture Software Developer's Manual, Volume 3: System Programming Guide for more information.
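The register values involved can be illustrated with a small C sketch. It only computes the values; the privileged writes themselves are left to the driver. The PCE position (bit 8) comes from the text above. The TSD position (bit 2) and the PerfEvtSel field layout follow the Intel System Programming Guide. The function and macro names are illustrative and are not taken from the Perfmon sources.

```c
#include <stdint.h>

/* CR4 bits that control user access to the counters. */
#define CR4_TSD (1u << 2)   /* Time Stamp Disable: when set, RDTSC is privileged */
#define CR4_PCE (1u << 8)   /* Perf-Counter Enable: when set, RDPMC works in user mode */

/* A few PerfEvtSel flag bits (P6 layout). */
#define EVTSEL_USR (1u << 16)   /* count while running in user mode */
#define EVTSEL_OS  (1u << 17)   /* count while running in kernel mode */
#define EVTSEL_EN  (1u << 22)   /* enable counting (present in PerfEvtSel0) */

/* Return the CR4 value that opens RDPMC and RDTSC to user code.
 * Hypothetical helper: a driver would write the result back to CR4
 * from kernel mode. */
uint32_t cr4_enable_user_counters(uint32_t cr4)
{
    return (cr4 | CR4_PCE) & ~CR4_TSD;
}

/* Build a PerfEvtSel value: event code in bits 0-7, unit mask in bits 8-15,
 * plus any of the flag bits above. */
uint32_t perfevtsel_value(uint8_t event, uint8_t umask, uint32_t flags)
{
    return (uint32_t)event | ((uint32_t)umask << 8) | flags;
}
```

For example, perfevtsel_value(0x79, 0x00, EVTSEL_USR | EVTSEL_EN) selects event 0x79 (CPU_CLK_UNHALTED in the P6 event list) for user-mode code only.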

Design

There were two basic requirements in designing Perfmon. The first was to allow user programs to access the performance counters, which is a privileged operation. The second was to have lightweight access to the accumulated data to minimize the amount of error introduced by the act of reading the performance registers as well as the TSC register.

Kernel Component

Since access to the PerfEvtSel0/1 is always privileged, and access to the PerfCtr0/1 and TSC is by default privileged, it was necessary to write some code that runs in the kernel context. For maximum flexibility and ease of installation, it was decided to write a loadable device driver rather than have a specially modified kernel.

The loadable driver is a standard, autoconfiguring SVR4 character device driver. In addition to the static structures and functions needed to support a device driver (see Writing Device Drivers in AnswerBook for more details), there are two functions in the device driver that needed Perfmon-specific code to be written.

When the device driver is initially loaded, the following sequence of events takes place:

At this point, the driver is loaded and ready to accept requests from the user. The communication channel that is used between a user program and the device driver is the ioctl() system call. The sequence of events from the driver's point of view is:

If the driver is ever unloaded, there is a sequence of events that take place:

User Component

The library functions are all written in assembly since they need to use special machine instructions to do their work. The read_perfctr0() and read_perfctr1() calls, as well as the read_tsc() call, each return a 64-bit value. These functions can be called only after issuing the ioctl() call with the TSC_PERFCTRS_EN command. The read_perfctr0/1() calls simply issue an RDPMC instruction to read the appropriate counter. The read_tsc() function issues an RDTSC instruction.
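The following is a rough C equivalent of these routines, written with GCC-style inline assembly rather than the standalone assembly the library actually uses. It is a sketch only: the names carry a _sketch suffix to mark them as illustrative, and the real routines additionally serialize with CPUID, which is omitted here. Both instructions return their result in the EDX:EAX register pair, which is combined into one 64-bit value.

```c
#include <stdint.h>

/* Combine the EDX:EAX halves returned by RDTSC or RDPMC. */
uint64_t combine_edx_eax(uint32_t edx, uint32_t eax)
{
    return ((uint64_t)edx << 32) | eax;
}

/* Read the Time-Stamp Counter. Faults in user mode if CR4.TSD is set. */
uint64_t read_tsc_sketch(void)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t eax, edx;
    __asm__ __volatile__("rdtsc" : "=a"(eax), "=d"(edx));
    return combine_edx_eax(edx, eax);
#else
    return 0;   /* not an x86 target: there is no TSC to read */
#endif
}

/* Read performance counter `index` (0 or 1 on the P6). RDPMC takes the
 * counter number in ECX and faults in user mode unless CR4.PCE is set,
 * which is what the TSC_PERFCTRS_EN ioctl() arranges. */
uint64_t read_perfctr_sketch(uint32_t index)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t eax, edx;
    __asm__ __volatile__("rdpmc" : "=a"(eax), "=d"(edx) : "c"(index));
    return combine_edx_eax(edx, eax);
#else
    (void)index;
    return 0;
#endif
}
```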

A cpu_serialize() call has also been implemented. This call issues a serializing instruction, specifically CPUID, so that all modifications to flags, registers, and memory by previous instructions are completed, and all buffered writes are drained to memory, before the next instruction is fetched and executed. This call is usually not necessary, however, since the read_perfctr0/1() calls already serialize by issuing a CPUID instruction. All the other calls made to the driver via ioctl() are serialized by nature, since the privileged instructions they use (RDMSR, WRMSR) are serializing instructions. The user therefore need not call cpu_serialize() except in possible special cases.
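One such special case might be timing a short code region with the TSC, assuming read_tsc() does not serialize on its own (the text does not say whether it does). The sketch below shows the idea in C with GCC-style inline assembly: issue CPUID before each RDTSC so that earlier instructions have completed before the timestamp is taken. All names here are illustrative and are not part of the Perfmon library.

```c
#include <stdint.h>

static volatile uint32_t work_sink;   /* keeps the sample loop from being optimized away */

/* Tiny sample workload used only to demonstrate the timing helper. */
void sample_work(void)
{
    for (uint32_t i = 0; i < 1000; i++)
        work_sink = i + 1;
}

/* Issue CPUID (leaf 0) purely for its serializing side effect. */
void cpu_serialize_sketch(void)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t eax = 0, ebx, ecx = 0, edx;
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    (void)ebx;
    (void)edx;
#endif
}

/* Read the TSC; returns 0 on non-x86 targets. */
uint64_t tsc_now(void)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t eax, edx;
    __asm__ __volatile__("rdtsc" : "=a"(eax), "=d"(edx));
    return ((uint64_t)edx << 32) | eax;
#else
    return 0;
#endif
}

/* Clock ticks spent in work(), with a serializing CPUID before each read. */
uint64_t cycles_for(void (*work)(void))
{
    cpu_serialize_sketch();
    uint64_t start = tsc_now();
    work();
    cpu_serialize_sketch();
    return tsc_now() - start;
}
```

The measured interval includes the cost of the second CPUID, so for very short regions that overhead should be measured separately and subtracted.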

Another issue for writing the user-level code was the fact that the performance counters are kept on a per-processor basis rather than a per-process basis. This means that if you run your program on an MP machine, and it migrates between CPUs during its run-time (which is pretty likely given Solaris' work-grabbing scheduler), the data read from the CPU performance counters is useless. Fortunately, there is a non-privileged system call named processor_bind() that will let you bind your process (or a single LWP) to a particular CPU.
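A minimal sketch of such a binding in C is shown below. The processor_bind() call and its P_LWPID and P_MYID arguments are the standard Solaris interface; the wrapper name bind_current_lwp() is illustrative. On platforms other than Solaris the wrapper simply reports failure so that the sketch still compiles.

```c
#include <errno.h>
#if defined(__sun)
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>
#endif

/* Bind the calling LWP to the given CPU so that every counter read
 * comes from the same processor. Returns 0 on success, or -1 with
 * errno set on failure. */
int bind_current_lwp(int cpu)
{
#if defined(__sun)
    /* The last argument could receive the previous binding; it is not needed here. */
    return processor_bind(P_LWPID, P_MYID, (processorid_t)cpu, NULL);
#else
    (void)cpu;
    errno = ENOSYS;   /* processor_bind() is Solaris-specific */
    return -1;
#endif
}
```

Binding should be done before the counters are programmed, since the event selection made through the driver applies to the CPU on which the ioctl() happens to execute.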

Also, since most programs using Perfmon require some setup, a skeleton program was provided to minimize the effort of developing and testing such programs. The basic outline of the skeleton program is:

Interaction

The interaction between user code and driver code of a typical user program using Perfmon would go something like this:

            User Code             |             Kernel Code
       C             assembly     |      C                     assembly
----------------------------------+------------------------------------------------
open("/dev/perfmon")              | perfmon_open()                   
ioctl(TSC_PERFCTRS_EN)            | perfmon_ioctl()                  
                                  |                       pm_tsc_perfctrs_enable()
ioctl(PERFEVTSEL0_W)              | perfmon_ioctl()                   
                                  | copyin()                         
                                  |                       pm_set_perfevtsel0()
                                  |                                  
ioctl(WBINVD_CACHES)              | perfmon_ioctl()                  
                                  |                       pm_wbinvd()               
ioctl(PERFCTR0_W)                 | perfmon_ioctl()                                 
gethrtime()                       |                       pm_set_perfctr0()         
ioctl(STARTPERFCTRS)              | perfmon_ioctl()                     
                                  |                       pm_start_perfctrs()      
                                  |                                  
Run code to be analyzed           |                                  
                                  |                                  
                                  |                                  
ioctl(STOPPERFCTRS)               | perfmon_ioctl()                  
                                  |                       pm_stop_perfctrs()
                  read_perfctr0() |                                  
                                  |                                  
gethrtime()                       |                                  
Analyze results                   |                                  
close("/dev/perfmon")             | perfmon_close()                  
                                  |                                  
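The user-code column of the diagram could be sketched in C roughly as follows. Several details are assumptions rather than facts from the Perfmon sources: the numeric command codes are placeholders (the real ones come from the Perfmon header), the ioctl() commands that take data are assumed to take a pointer to the value, the device is assumed to open with O_RDWR, and the counter-read routine is passed in as a function pointer so that the sketch does not depend on the assembly library. The gethrtime() calls are omitted because they are Solaris-specific.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Placeholder command codes (hypothetical values). Replace this enum
 * with the definitions from the real Perfmon header. */
enum {
    TSC_PERFCTRS_EN = 1,
    PERFEVTSEL0_W,
    WBINVD_CACHES,
    PERFCTR0_W,
    STARTPERFCTRS,
    STOPPERFCTRS
};

typedef uint64_t (*read_counter_fn)(void);

/* Count one event type around work(). Returns 0 and stores the count
 * on success, or -1 if the device cannot be opened or an ioctl() fails. */
int count_events(const char *dev, uint32_t evtsel, void (*work)(void),
                 read_counter_fn read_ctr, uint64_t *count)
{
    uint64_t zero = 0;
    int fd = open(dev, O_RDWR);
    if (fd < 0)
        return -1;

    int ok = ioctl(fd, TSC_PERFCTRS_EN, 0) == 0       /* allow RDPMC/RDTSC in user mode */
          && ioctl(fd, PERFEVTSEL0_W, &evtsel) == 0   /* choose the event to count */
          && ioctl(fd, WBINVD_CACHES, 0) == 0         /* start from flushed caches */
          && ioctl(fd, PERFCTR0_W, &zero) == 0        /* clear counter 0 */
          && ioctl(fd, STARTPERFCTRS, 0) == 0;
    if (ok) {
        work();                                       /* code to be analyzed */
        ok = ioctl(fd, STOPPERFCTRS, 0) == 0;
    }
    if (ok)
        *count = read_ctr();                          /* e.g. read_perfctr0 */

    close(fd);
    return ok ? 0 : -1;
}
```

A caller would pass "/dev/perfmon" as the device path and the library's read_perfctr0 as the read routine, after binding itself to a single CPU.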

Installation

Since Perfmon requires the installation of a device driver on every machine that is to run it, the installation procedure was scripted so it can be used by the system administrator. Running the script does the following:

At this point, the Perfmon driver is installed, loaded, and ready for use.

Future Work