Perfmon Design

Perfmon

Perfmon is a tool that allows user-level code to access the performance counters present in the Ultra-series workstations and servers produced by Sun Microsystems. It does this with a loadable driver that reprograms devices with performance counters so that user-level code can read them (normally, access to these counters is restricted to code running in privileged mode). For some devices, like the UltraSPARC CPU, accessing the performance counters requires special machine instructions. A user library component of Perfmon provides access to these instructions via C function calls, along with other useful functions such as memory/instruction barriers. Currently, the only devices supported are the UltraSPARC-I and UltraSPARC-II CPUs, and they are the only devices discussed in the remainder of this document. See the section on Future Work for devices that may be supported in future versions of Perfmon.

Background

There are two parts to collecting performance data on UltraSPARC CPUs. The first is to program the Performance Control Register (PCR) to indicate the type of events that you wish to count. Access to the PCR is always privileged and requires a call into the Perfmon device driver. The second part is to read the Performance Instrumentation Counter (PIC) register to get the current count of the watched events. Access to the PIC is normally privileged, but can be made non-privileged by turning off the lowest bit of the PCR. In addition to the PCR and PIC, Perfmon enables user access to the UltraSPARC's TICK register, which is incremented once per machine clock cycle at all times. This is done by turning off the upper bit (TICK.npt) of the TICK register. See the Perfmon User's Guide and the UltraSPARC-I User's Manual for more information.
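The two enabling steps described above each amount to clearing a single bit. The following is a minimal sketch of that bit arithmetic in portable C. It only models register values and does not touch real hardware; the macro and function names are hypothetical, and the bit positions are taken from the description above (lowest bit of the PCR, highest bit of TICK).

```c
#include <stdint.h>

/* Hypothetical names; bit positions follow the text above. */
#define PCR_PRIV_BIT  (UINT64_C(1) << 0)   /* lowest bit of the PCR         */
#define TICK_NPT_BIT  (UINT64_C(1) << 63)  /* highest bit of TICK (TICK.npt) */

/* Return the PCR value with the bit that makes PIC access privileged cleared. */
uint64_t pcr_allow_user(uint64_t pcr)
{
    return pcr & ~PCR_PRIV_BIT;
}

/* Return the TICK value with the bit that blocks user-mode reads cleared. */
uint64_t tick_allow_user(uint64_t tick)
{
    return tick & ~TICK_NPT_BIT;
}
```

In the real driver these writes have to happen in privileged mode on each CPU; the sketch only shows which bits change.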

Design

There were two basic requirements in designing Perfmon. The first was to allow user programs to access the performance counters, which is a privileged operation. The second was to have lightweight access to the accumulated data to minimize the amount of error introduced by the act of reading the performance registers.

Kernel Component

Since access to the PCR is always privileged, and access to the PIC and TICK is by default privileged, it was necessary to write some code that runs in the kernel context. For maximum flexibility and ease of installation, it was decided to write a loadable device driver rather than have a specially modified kernel.

The loadable driver is a standard, autoconfiguring SVR4 character device driver. In addition to the static structures and functions needed to support a device driver (see Writing Device Drivers in AnswerBook for more details), two functions in the driver required Perfmon-specific code.

When the device driver is initially loaded, the following sequence of events occurs:

At this point, the driver is loaded and ready to accept requests from the user. The communication channel that is used between a user program and the device driver is the ioctl() system call. The sequence of events from the driver's point of view is:

If the driver is ever unloaded, the following sequence of events takes place:

Kernel issues

Early in development, I saw cases on MP machines where the driver would load, make the cross-call to turn off the TICK.npt bit, and return, yet a user program that then tried to read TICK would crash with an Illegal instruction error, indicating that the TICK.npt bit had not been turned off. If I waited a minute or two, the problem would go away and everything would work perfectly, implying that the cross-calls were working, but taking their time doing it. This was finally resolved by adding calls to xc_attention() and xc_dismissed() around the cross-call. These functions basically force all CPUs into a tight loop, waiting to receive cross-calls, and then release them. Since adding this code, I have not been able to reproduce the problem, so I'm assuming that it's fixed.

The only tricky ioctl() to implement was PERFMON_FLUSH_CACHE, which flushes the cache on the current CPU. The actual flushing is done by calling a pre-existing kernel routine, cpu_flush_ecache(), that accesses a region of memory that aliases with each cache line in the CPU. The tricky part was getting access to this routine. Under Solaris 2.6 (where I did my initial development), cpu_flush_ecache() is a global kernel symbol, so I can simply reference it in my driver code and it is resolved when the driver is loaded. Under Solaris 2.5.1, however, the function is not a global symbol and cannot be resolved by the kernel module linker (krtld) at module load time, although it can be resolved once the driver is loaded and running in kernel space. To support this ioctl() there, I need to make calls into krtld to resolve cpu_flush_ecache() myself. Luckily, this turned out to be less complicated than it sounds. The first time the user requests a cache flush, I look up the address of cpu_flush_ecache() using kobj_getsymvalue(). I then keep a pointer to the function for later use, along with a flag indicating that the lookup has been attempted (since it's possible that the symbol doesn't exist). To keep the driver MT-safe, the lookup is protected by a mutex lock to avoid race conditions.
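The lookup-once-and-cache pattern described above can be modeled in user space. The sketch below is not the driver code: a pthread mutex stands in for the kernel mutex, lookup_flush_symbol() is a hypothetical stand-in for the call to kobj_getsymvalue() (whose real signature is not reproduced here), and fake_flush() stands in for cpu_flush_ecache().

```c
#include <pthread.h>
#include <stddef.h>

typedef void (*flush_fn)(void);

/* Hypothetical stand-ins so the sketch runs outside the kernel. */
static int lookup_calls = 0;
static int flush_calls = 0;

static void fake_flush(void)
{
    flush_calls++;
}

/* Stand-in for the symbol lookup; a real lookup may return NULL. */
static flush_fn lookup_flush_symbol(void)
{
    lookup_calls++;
    return fake_flush;
}

static pthread_mutex_t lookup_lock = PTHREAD_MUTEX_INITIALIZER;
static int lookup_attempted = 0;      /* set after the first try, even on failure */
static flush_fn cached_flush = NULL;  /* result of the one and only lookup */

/* Flush the cache; returns 0 on success, -1 if the routine is unavailable. */
int do_flush_cache(void)
{
    flush_fn fn;

    /* Only the lookup is serialized; the flush itself runs unlocked. */
    pthread_mutex_lock(&lookup_lock);
    if (!lookup_attempted) {
        cached_flush = lookup_flush_symbol();
        lookup_attempted = 1;
    }
    fn = cached_flush;
    pthread_mutex_unlock(&lookup_lock);

    if (fn == NULL)
        return -1;
    fn();
    return 0;
}
```

Keeping a separate "attempted" flag, rather than testing the pointer for NULL, avoids repeating a failed lookup on every request when the symbol does not exist.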

User Component

The user-land component of Perfmon was relatively easy and quick to implement. The library functions are all written in assembly since they need to use special machine instructions to do their work. Also, since most of the performance counter registers are 64-bit, and the Solaris compilers and OS are currently 32-bit, the library routines had to split the 64-bit registers into two separate registers so that the calling C code could deal with them properly.
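The register splitting mentioned above is done in assembly in the library. As an illustration only, the same operation can be written in C on any compiler that offers a 64-bit integer type; the function names here are hypothetical, and the order in which the real library hands back the two halves is not specified in this document.

```c
#include <stdint.h>

/* Split a 64-bit register value into its upper and lower 32-bit halves. */
void split_u64(uint64_t value, uint32_t *hi, uint32_t *lo)
{
    *hi = (uint32_t)(value >> 32);
    *lo = (uint32_t)(value & 0xFFFFFFFFu);
}

/* Reassemble the original 64-bit value from the two halves. */
uint64_t join_u64(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | (uint64_t)lo;
}
```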

Another issue for writing the user-level code was the fact that the performance counters are kept on a per-processor basis rather than a per-process basis. This means that if you run your program on an MP machine, and it migrates between CPUs during its run-time (which is pretty likely given Solaris' work-grabbing scheduler), the data read from the CPU performance counters is useless. Fortunately, there is a non-privileged system call named processor_bind() that will let you bind your process (or a single LWP) to a particular CPU.
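A minimal sketch of binding the calling LWP to one CPU with processor_bind() follows. The wrapper name bind_current_lwp() is hypothetical. The Solaris call is only compiled when building on Solaris; elsewhere the sketch falls back to a stub that reports the call as unsupported, so that it still compiles.

```c
#include <errno.h>
#include <stddef.h>

#if defined(__sun)
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>
#endif

/*
 * Bind the calling LWP to the given CPU.
 * Returns 0 on success, -1 on failure with errno set.
 */
int bind_current_lwp(int cpu)
{
#if defined(__sun)
    /* P_LWPID with P_MYID selects the calling LWP; the old binding is not needed. */
    return processor_bind(P_LWPID, P_MYID, (processorid_t)cpu, NULL);
#else
    /* Stub for non-Solaris builds: the call does not exist here. */
    (void)cpu;
    errno = ENOSYS;
    return -1;
#endif
}
```

Binding should happen before the counters are cleared, so that every later read comes from the same CPU.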

Also, since most programs using Perfmon require some setup, a skeleton program is provided to minimize the effort of developing and testing such programs. The basic outline of the skeleton program is:

Interaction

For a typical user program using Perfmon, the interaction between user code and driver code goes something like this:

            User Code             |          Kernel Code
       C             assembly     |      C                assembly
----------------------------------+----------------------------------
open("/dev/perfmon")              | perfmon_open()                   
                                  |                                  
ioctl(PERFMON_SETPCR)             | perfmon_ioctl()                   
                                  | copyin()                         
                                  |                       pm_set_pcr()
                                  |                                  
ioctl(PERFMON_FLUSH_CACHE)        | perfmon_ioctl()                  
                                  | cpu_flush_ecache()               
                                  |                                  
gethrtime()                       |                                  
                    clr_pic()     |                                  
                    cpu_sync()    |                                  
                    get_pic()     |                                  
                                  |                                  
Run code to be analyzed           |                                  
                                  |                                  
                    cpu_sync()    |                                  
                    get_pic()     |                                  
gethrtime()                       |                                  
                                  |                                  
Analyze results                   |                                  
                                  |                                  
close("/dev/perfmon")             | perfmon_close()                  
                                  |                                  
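
The bracketing reads in the table above reduce to two subtractions: counter after minus counter before, and time after minus time before. The sketch below shows that arithmetic only. The function names are hypothetical, the 32-bit counter width is an assumption about the individual event counters, and int64_t stands in for the nanosecond values returned by gethrtime().

```c
#include <stdint.h>

/*
 * Events counted between two reads of a free-running counter.
 * Assumes a 32-bit counter; unsigned subtraction wraps modulo 2^32,
 * so a single overflow between the two reads is still handled.
 */
uint32_t counter_delta(uint32_t before, uint32_t after)
{
    return after - before;
}

/*
 * Event rate over the measured interval, given start and end
 * timestamps in nanoseconds. Returns 0.0 for an empty interval.
 */
double events_per_second(uint32_t events, int64_t start_ns, int64_t end_ns)
{
    int64_t elapsed_ns = end_ns - start_ns;

    if (elapsed_ns <= 0)
        return 0.0;
    return (double)events * 1e9 / (double)elapsed_ns;
}
```

If the counter can overflow more than once during the measured code, the difference is no longer meaningful and the measured region needs to be shortened.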

Installation

Since Perfmon requires the installation of a device driver on every machine that is to run it, the installation procedure was designed to be as easy as possible for the system administrator.

Solaris supports the installation of a collection of files through a mechanism called packages. Each package consists of a collection of files to be installed, optional scripts that are run before and after installation/removal, and two files that are used by Solaris to identify the package and its components.

When the system administrator wishes to add Perfmon to a machine, they only need to type the command pkgadd -d MSUperf and answer "yes" to two questions. This causes the following sequence of events to occur:

At this point, the Perfmon driver is installed, loaded, and ready for use. If the system administrator ever wishes to remove Perfmon from the system, all they must do is execute the command pkgrm MSUperf and all of the above steps are undone.

Future work

There are plenty of things that can be done to extend the features and usefulness of Perfmon. Some of the items that are planned for the future are: