Perfmon for the Pentium Pro is a tool that allows user-level code to access the performance counters present in the Intel® P6 family of microprocessors running the Solaris® operating system by Sun® Microsystems. This is accomplished by a loadable driver that re-programs devices with performance counters so that user-level code can access these counters (normally, access to these counters is restricted to code running in privileged mode). Accessing the performance counters requires special machine instructions. There is a user library component of Perfmon that provides access to these instructions via C function calls. The library also includes a function for accessing the Time-Stamp-Counter (TSC) that is incremented every processor clock tick. Currently, the only devices supported are the Pentium Pro and the Pentium II processors. See the section on Future Work for devices that may be supported in future versions of Perfmon.
There are two parts to collecting performance data on the P6 CPU. The first is to program one or both of the Performance-Event-Select registers (PerfEvtSel0/1) to indicate the type of events that you wish to count. Access to PerfEvtSel0/1 is always privileged and requires a call into the Perfmon device driver. The second part is to read one or both of the Performance Counter registers (PerfCtr0/1) to get the current count of the watched events. Access to PerfCtr0/1 is normally privileged, but can be made non-privileged by setting the PCE bit (bit 8) of Control Register 4 (CR4). In addition, Perfmon enables user access to the TSC register (which is incremented once per machine clock cycle at all times) by turning off the TSD bit of the CR4 register. See the Perfmon For the Pentium Pro User's Guide and the Intel Architecture Software Developer's Manual, Volume 3: System Programming Guide for more information.
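As a rough illustration of the two register manipulations described above, the following C sketch assembles a PerfEvtSel value from its bit fields and computes the CR4 value that opens the counters to user code. The helper and macro names (make_perfevtsel, cr4_enable_user_counters, PES_*, CR4_*) are hypothetical and are not part of Perfmon; the bit positions follow the P6 register layout documented in the Intel manual. The sketch only computes values; actually writing the registers still has to happen in the driver.

```c
#include <assert.h>
#include <stdint.h>

/* PerfEvtSel0/1 flag bits (P6 family layout). Names are hypothetical. */
#define PES_USR  (1u << 16)   /* count events at privilege levels 1-3 */
#define PES_OS   (1u << 17)   /* count events at privilege level 0    */
#define PES_EDGE (1u << 18)   /* edge detect                          */
#define PES_EN   (1u << 22)   /* enable counting                      */
#define PES_INV  (1u << 23)   /* invert the counter-mask comparison   */

/* CR4 bits that gate user-level access. */
#define CR4_TSD  (1u << 2)    /* set: RDTSC is privileged             */
#define CR4_PCE  (1u << 8)    /* set: RDPMC is allowed at any level   */

/* Pack event code (bits 0-7), unit mask (bits 8-15), flags (bits 16-23)
 * and counter mask (bits 24-31) into one 32-bit PerfEvtSel value. */
uint32_t make_perfevtsel(uint8_t event, uint8_t umask,
                         uint32_t flags, uint8_t cmask)
{
    assert((flags & ~0x00FF0000u) == 0);   /* flags must stay in bits 16-23 */
    return (uint32_t)event | ((uint32_t)umask << 8)
         | flags | ((uint32_t)cmask << 24);
}

/* Given the current CR4 value, return the value that enables user-level
 * RDPMC (set PCE) and user-level RDTSC (clear TSD). */
uint32_t cr4_enable_user_counters(uint32_t cr4)
{
    return (cr4 | CR4_PCE) & ~CR4_TSD;
}
```

For example, make_perfevtsel(0xC0, 0x00, PES_USR | PES_EN, 0) yields 0x004100C0, which selects event code C0H (instructions retired on the P6) counted in user mode only.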
There were two basic requirements in designing Perfmon. The first was to allow user programs to access the performance counters, which is a privileged operation. The second was to have lightweight access to the accumulated data to minimize the amount of error introduced by the act of reading the performance registers as well as the TSC register.
Since access to the PerfEvtSel0/1 is always privileged, and access to the PerfCtr0/1 and TSC is by default privileged, it was necessary to write some code that runs in the kernel context. For maximum flexibility and ease of installation, it was decided to write a loadable device driver rather than have a specially modified kernel.
The loadable driver is a standard, autoconfiguring SVR4 character device driver. In addition to the static structures and functions needed to support a device driver (see Writing Device Drivers in AnswerBook for more details), there are two functions in the device driver that needed Perfmon-specific code to be written.
When the device driver is initially loaded, the following sequence of events takes place:
At this point, the driver is loaded and ready to accept requests from the user. The communication channel that is used between a user program and the device driver is the ioctl() system call. The sequence of events from the driver's point of view is:
The user program opens /dev/perfmon. This ends up calling perfmon_open(), which the kernel locates through the static cb_ops structure. In the case of the Perfmon device driver, there is no special permission checking or state information that needs to be taken care of upon an open(), so perfmon_open() always returns successfully.
TSC_PERFCTRS_EN: An ioctl() call with this command invokes the pm_tsc_perfctrs() call, which sets the appropriate bits in the CR4 register to enable non-privileged access to the TSC and the PerfCtr0/1 registers.
PERFEVTSEL0_W: The user passes in a pointer to a 32-bit value that is to be stored in the PerfEvtSel0 register. Since the pointer is a user address and is not valid in kernel space, we need to call copyin() to copy the value from the user's buffer into the kernel. Once we have the value, we simply call pm_set_perfevtsel0(), which sets the value of the PerfEvtSel0 register.
PERFEVTSEL0_R: The user passes in a pointer to a 32-bit buffer in which the current value of the PerfEvtSel0 register is to be stored. We have the same address-space problem as before, so we call pm_get_perfevtsel0() to get the current PerfEvtSel0 value and then store it in the user's buffer using copyout().
WBINVD_CACHES: This command invokes the function pm_wbinvd(), which simply issues the WBINVD instruction. This instruction can only be executed in privileged code and causes all caches to be written back to memory and then invalidated.
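The read and write commands above follow a common pattern: copy the argument across the user/kernel boundary, then touch the register. The following user-space C sketch imitates that pattern so it can be compiled and run anywhere. It is not the driver's code: the command codes, the sk_copyin()/sk_copyout() stand-ins (plain memcpy wrappers), and the variable that plays the role of the PerfEvtSel0 register are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical command codes; the real values come from the Perfmon headers. */
enum sketch_cmd { SK_PERFEVTSEL0_W = 1, SK_PERFEVTSEL0_R = 2 };

/* Stands in for the PerfEvtSel0 register in this simulation. */
static uint32_t fake_perfevtsel0;

/* Stand-ins for the kernel's copyin()/copyout(): 0 on success, -1 on failure. */
static int sk_copyin(const void *uaddr, void *kaddr, size_t len)
{
    if (uaddr == NULL)
        return -1;
    memcpy(kaddr, uaddr, len);
    return 0;
}

static int sk_copyout(const void *kaddr, void *uaddr, size_t len)
{
    if (uaddr == NULL)
        return -1;
    memcpy(uaddr, kaddr, len);
    return 0;
}

/* Dispatch one command; returns 0 or an errno value, as a driver's
 * ioctl entry point would. */
int sketch_ioctl(int cmd, void *arg)
{
    uint32_t val;

    switch (cmd) {
    case SK_PERFEVTSEL0_W:               /* user buffer -> register */
        if (sk_copyin(arg, &val, sizeof val) != 0)
            return EFAULT;
        fake_perfevtsel0 = val;
        return 0;
    case SK_PERFEVTSEL0_R:               /* register -> user buffer */
        val = fake_perfevtsel0;
        if (sk_copyout(&val, arg, sizeof val) != 0)
            return EFAULT;
        return 0;
    default:
        return ENOTTY;                   /* unknown command */
    }
}

/* Usage: write a value, read it back, and check that it round-trips. */
void sketch_roundtrip_demo(void)
{
    uint32_t in = 0x004100C0u, out = 0;
    int rc = sketch_ioctl(SK_PERFEVTSEL0_W, &in);
    assert(rc == 0);
    rc = sketch_ioctl(SK_PERFEVTSEL0_R, &out);
    assert(rc == 0 && out == in);
}
```

In the real driver the register access would be an assembly routine such as pm_set_perfevtsel0(), and the copies would go through the kernel's copyin() and copyout() rather than memcpy.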
If the driver is ever unloaded, the following sequence of events takes place:
The library functions are all written in assembly since they need to use special machine instructions to do their work. The read_perfctr0(), read_perfctr1(), and read_tsc() calls each return a 64-bit value. These functions can only be called after issuing an ioctl() call with the TSC_PERFCTRS_EN command. The read_perfctr0/1() calls simply issue an RDPMC instruction to read the appropriate counter. The read_tsc() function issues an RDTSC instruction.
A cpu_serialize() call has also been implemented. This call issues a serializing instruction, specifically CPUID, so that all modifications to flags, registers, and memory by previous instructions are completed, and all buffered writes are drained to memory, before the next instruction is fetched and executed. This call is usually unnecessary, however, because the read_perfctr0/1() calls already serialize by issuing a CPUID instruction. All the other calls made to the driver via ioctl() are serialized by nature, since the privileged instructions they use (RDMSR, WRMSR) are serializing instructions. The user therefore does not need to call cpu_serialize() except in possible special cases.
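The library routines themselves are assembly, but their general shape can be sketched in C with GCC-style inline assembly. The sketch below is an assumption about how such a routine might look, not the Perfmon source: join_edx_eax(), sketch_elapsed(), and sketch_read_tsc_serialized() are hypothetical names, and issuing CPUID before RDTSC is a choice made here for illustration. The x86 portion is compiled only on x86 targets with a GCC-compatible compiler.

```c
#include <assert.h>
#include <stdint.h>

/* RDTSC and RDPMC return their result split across EDX (high half) and
 * EAX (low half); join the halves into one 64-bit value. */
uint64_t join_edx_eax(uint32_t edx, uint32_t eax)
{
    return ((uint64_t)edx << 32) | eax;
}

/* Difference between two readings taken on the same CPU. A smaller end
 * value means the arguments are swapped or the readings came from
 * different processors. */
uint64_t sketch_elapsed(uint64_t start, uint64_t end)
{
    assert(end >= start);
    return end - start;
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* Serialize with CPUID (leaf 0), then read the time-stamp counter.
 * RDTSC faults in user mode if CR4.TSD is set. */
uint64_t sketch_read_tsc_serialized(void)
{
    uint32_t eax = 0, ebx, ecx = 0, edx;

    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    (void)ebx;                       /* CPUID output is not needed */
    __asm__ __volatile__("rdtsc" : "=a"(eax), "=d"(edx));
    return join_edx_eax(edx, eax);
}
#endif
```

A typical use is to take one reading before and one after the code under test and pass both to sketch_elapsed(); the cost of the CPUID instruction itself is included in the measured interval.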
Another issue for writing the user-level code was the fact that the performance counters are kept on a per-processor basis rather than a per-process basis. This means that if you run your program on an MP machine, and it migrates between CPUs during its run-time (which is pretty likely given Solaris' work-grabbing scheduler), the data read from the CPU performance counters is useless. Fortunately, there is a non-privileged system call named processor_bind() that will let you bind your process (or a single LWP) to a particular CPU.
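A minimal sketch of binding the calling process with processor_bind() follows. The wrapper name sketch_bind_self() is hypothetical. Because processor_bind() is specific to Solaris, the sketch guards the call and simply reports ENOSYS when built on any other system.

```c
#include <assert.h>
#include <errno.h>

#if defined(__sun)
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>
#endif

/* Bind the calling process to processor `cpu`.
 * Returns 0 on success, or -1 with errno set. */
int sketch_bind_self(int cpu)
{
    assert(cpu >= 0);
#if defined(__sun)
    /* P_PID + P_MYID selects the calling process; the previous binding
     * is not needed here, so the last argument is NULL. */
    return processor_bind(P_PID, P_MYID, (processorid_t)cpu, NULL);
#else
    (void)cpu;
    errno = ENOSYS;   /* processor_bind() is not available on this system */
    return -1;
#endif
}
```

Binding should happen before the counters are programmed and started, so that every later read comes from the same processor whose registers were set up.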
Also, since most programs using Perfmon require some setup, a skeleton program was provided to minimize development and testing effort. The basic outline of the skeleton program is:
The interaction between user code and driver code of a typical user program using Perfmon would go something like this:
User Code                         | Kernel Code
C                 assembly        | C                 assembly
----------------------------------+------------------------------------------------
open("/dev/perfmon")              | perfmon_open()
ioctl(TSC_PERFCTRS_EN)            | perfmon_ioctl()
                                  |                   pm_tsc_perfctrs_enable()
ioctl(PERFEVTSEL0_W)              | perfmon_ioctl()
                                  | copyin()
                                  |                   pm_set_perfevtsel0()
                                  |
ioctl(WBINVD_CACHES)              | perfmon_ioctl()
                                  |                   pm_wbinvd()
ioctl(PERFCTR0_W)                 | perfmon_ioctl()
gethrtime()                       |                   pm_set_perfctr0()
ioctl(STARTPERFCTRS)              | perfmon_ioctl()
                                  |                   pm_start_perfctrs()
                                  |
Run code to be analyzed           |
                                  |
ioctl(STOPPERFCTRS)               | perfmon_ioctl()
                                  |                   pm_stop_perfctrs()
                  read_perfctr0() |
                                  |
gethrtime()                       |
Analyze results                   |
close("/dev/perfmon")             | perfmon_close()
Since Perfmon requires the installation of a device driver on every machine that is to run it, the installation procedure was scripted so it can be used by the system administrator. Running the script does the following:
/etc/devlink.tab is examined and any existing entries relating to Perfmon are deleted.
The driver is added to the system with the add_drv command. This causes a major device number to be picked by Solaris and registered in the file /etc/name_to_major. The file /etc/minor_perm is updated to reflect the desired permission on the Perfmon device node when it gets created.
/etc/devlink.tab gets the entry for the Perfmon device driver added to it. This will cause a symbolic link to be created from /dev/perfmon to the actual device node, which usually resides at /devices/pseudo/perfmon@0:perfmon. This is just for convenience to the user.
drvconfig -i perfmon.
/etc/devlink.tab. This is forced by running the program
At this point, the Perfmon driver is installed, loaded, and ready for use.