Perfmon for the Pentium Pro is a tool for monitoring the performance of the Intel Pentium Pro processor and the Intel P6 family of processors. It is a device driver that runs with the Sun Solaris® operating system kernel. It allows user-level (non-privileged) code to read and write the performance-monitoring counters on the processor. This tool can be used in refining hardware designs, optimizing code, or diagnosing some system failures. It can also be used as a research tool for cross-platform or cross-operating-system benchmarking and comparisons. For an in-depth look at the Perfmon design, please refer to the Perfmon for the Pentium Pro Design document.
The following processors are supported.
Currently only the Solaris® operating system is supported.
There is also Perfmon support for the UltraSPARC-I and UltraSPARC-II CPUs running the Solaris operating system. This version of Perfmon can be obtained from the following link: http://www.cse.msu.edu/~enbody/perfmon/perfmon.html.
The driver source code, binary and documentation, including this document, are "tarred" in the file perfmon.tar. After downloading the tar file, uncompressing it and untarring it in a directory, the following directories should be created:
An installation script, perfinstall, has been provided to install the perfmon device driver. This script has to be run with an administrator account. The script does the following:
The perfinstall script should be run from the directory where perfmon.tar has been expanded:
%<become root>
%cd <perfmon directory>
%perfinstall
Note that Perfmon will only work on Intel Pentium Pro or other Intel P6-based machines.
In order to write programs using Perfmon, you will need
access to perfmon.h (found in the include
directory) and libperfmon.a (found in the lib
directory). perfmon.h contains all the commands
needed to control the performance counters and read the
time-stamp counter (TSC). libperfmon.a
contains the actual library routines that can be run in user
mode.
Note that code written using Perfmon will only run on machines where the Perfmon driver is installed and loaded. Attempting to run programs using Perfmon on other machines may result in strange behavior, core dumps, illegal instruction errors, etc.
There are two types of Perfmon routines:
The library routines can be run inside of a user application
as you would expect. The driver routines need to be accessed via ioctl().
To use ioctl(), you must first open the Perfmon
device (accessible through /dev/perfmon). After the
device is open, you simply use ioctl() to
communicate to the driver what routine you wish to run (passing
arguments as necessary). Here is an example code segment that
opens the device and issues a request to write back and invalidate
the external caches of the current CPU:
#include <stdio.h>
#include <stdlib.h>   /* exit() */
#include <fcntl.h>    /* open(), O_RDONLY */
#include <unistd.h>   /* ioctl() on Solaris, close() */
#include "perfmon.h"

int main(void)
{
    int fd;
    int rc;

    /* Open the Perfmon device */
    fd = open("/dev/perfmon", O_RDONLY);
    if (fd == -1) {
        perror("open(/dev/perfmon)");
        exit(1);
    }

    /* Enable reading of the TSC register and PerfCtrs */
    rc = ioctl(fd, TSC_PERFCTRS_EN);
    if (rc < 0) {
        perror("ioctl(TSC_PERFCTRS_EN)");
        exit(1);
    }

    /* Write back and invalidate all external caches of the current CPU */
    rc = ioctl(fd, WBINVD_CACHES);
    if (rc < 0) {
        perror("ioctl(WBINVD_CACHES)");
        exit(1);
    }

    close(fd);
    return 0;
}
The Pentium Pro processor has two 40-bit performance counters, allowing two types of events to be monitored simultaneously. These counters can either count events or measure duration. When counting events, a counter is incremented each time a specified event takes place or a specified number of events takes place. When measuring duration, a counter counts the number of processor clocks that occur while a specified condition is true. The counters can count events or measure duration that occur at any privilege level.
The performance-monitoring counters are supported by four Model Specific Registers (MSRs): the performance event select registers (PerfEvtSel0 and PerfEvtSel1) and the performance counter MSRs (PerfCtr0 and PerfCtr1). These registers reflect events that happen on a per-processor basis. For best results, it is recommended that you run your program on an MP machine and bind your process to a specific CPU. Binding prevents the process from migrating to another CPU, which would lose the performance data collected so far.
Access to the PerfEvtSel0/1 registers is privileged. They can
only be accessed by using the PERFEVTSEL0/1_W and PERFEVTSEL0/1_R
ioctl() routines in the Perfmon driver (see the
next section for more details). PerfEvtSel0 and
PerfEvtSel1 each have the following bitfields (taken from Section
10.6.1, Volume 3 of the Pentium Pro Family Developer's Manual):
| Name | Bits | Description |
|---|---|---|
| Event Select | 0-7 | Select the event to be monitored (see next section for a list of events) |
| Unit mask field | 8-15 | Further qualifies the event selected in the event select field. For example for cache events, the mask is used as MESI-protocol qualifier of cache states. |
| User mode flag | 16 | Events are counted only when the processor is operating at privilege level 1, 2 or 3. This flag can be used in conjunction with the OS flag. |
| OS flag | 17 | Events are counted only when the processor is operating at privilege level 0. This flag can be used in conjunction with the User mode flag. |
| E (Edge detect) flag | 18 | 1=Occurrence 0=Duration |
| PC (pin control) flag | 19 | Enables the signaling of performance counter overflow via the BP0 pin. |
| INT (APIC interrupt enable) flag | 20 | Enables the signaling of counter overflow via input to APIC, 1=Enable, 0=Disable. |
| EN (Enable Counters) flag | 22 | This flag is only present in the PerfEvtSel0 MSR. When set, performance counting is enabled in both performance-monitoring counters; when clear, both counters are disabled. |
| INV (invert) flag | 23 | Inverts the result of the counter-mask comparison when set, so that both greater than and less than comparisons can be made. |
| Counter mask field | 24-31 | When non-zero, the processor compares this mask to the number of events counted during a single cycle. If the event count is greater than or equal to this mask, the counter is incremented by one. Otherwise the counter is not incremented. This mask can be used to count events only if multiple occurrences happen per clock (e.g. two or more instructions retired per clock). If the counter-mask field is 0, then the counter is incremented each cycle by the number of events that occurred that cycle. |
All other bitfields are reserved and should be set to 0.
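The field layout in the table above can be packed into a 32-bit PerfEvtSel value with plain shifts. The following is a minimal sketch, not code taken from perfmon.h: the type `perfevtsel_fields` and the functions `perfevtsel_pack` and `perfevtsel_reserved_clear` are hypothetical names introduced here for illustration, and only the bit positions come from the table.

```c
#include <stdint.h>

/* Hypothetical helper type: one member per PerfEvtSel field in the table. */
struct perfevtsel_fields {
    uint8_t  event;  /* bits 0-7:   event select */
    uint8_t  umask;  /* bits 8-15:  unit mask */
    unsigned usr;    /* bit 16:     count at privilege levels 1-3 */
    unsigned os;     /* bit 17:     count at privilege level 0 */
    unsigned edge;   /* bit 18:     1 = occurrence, 0 = duration */
    unsigned pc;     /* bit 19:     pin control */
    unsigned intr;   /* bit 20:     APIC interrupt enable */
    unsigned en;     /* bit 22:     enable counters (PerfEvtSel0 only) */
    unsigned inv;    /* bit 23:     invert counter-mask comparison */
    uint8_t  cmask;  /* bits 24-31: counter mask */
};

/* Pack the fields into the 32-bit register image; flags are treated as booleans. */
uint32_t perfevtsel_pack(const struct perfevtsel_fields *f)
{
    uint32_t v = 0;

    v |= (uint32_t)f->event;
    v |= (uint32_t)f->umask << 8;
    v |= (uint32_t)(f->usr  != 0) << 16;
    v |= (uint32_t)(f->os   != 0) << 17;
    v |= (uint32_t)(f->edge != 0) << 18;
    v |= (uint32_t)(f->pc   != 0) << 19;
    v |= (uint32_t)(f->intr != 0) << 20;
    v |= (uint32_t)(f->en   != 0) << 22;
    v |= (uint32_t)(f->inv  != 0) << 23;
    v |= (uint32_t)f->cmask << 24;
    return v;
}

/* Bit 21 is the only bit in 0-31 that the table leaves undefined (reserved). */
int perfevtsel_reserved_clear(uint32_t v)
{
    return (v & (UINT32_C(1) << 21)) == 0;
}
```

For instance, an event select code of C0H with only the user mode and EN flags set packs to 004100C0H. Because bit 21 is never written by `perfevtsel_pack`, its results always pass `perfevtsel_reserved_clear`; the check is mainly useful for values assembled by hand.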
The performance-counter MSRs (PerfCtr0 and PerfCtr1) contain the event or duration count for the selected events being counted. Writing to these counters can be done only at privilege level 0 and is accomplished via ioctl() calls to the Perfmon device driver. Perfmon also provides an ioctl() call (TSC_PERFCTRS_EN) that makes reading PerfCtr0/1 and the TSC register non-privileged. The reads themselves are then done via Perfmon library calls.
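Because the counters are 40 bits wide, a long measurement can wrap past zero between two reads. A minimal sketch of computing the elapsed count is shown here; `perfctr_delta` is a hypothetical helper, not part of libperfmon, and it assumes the counter wrapped at most once between the two reads.

```c
#include <stdint.h>

#define PERFCTR_BITS 40
#define PERFCTR_MASK ((UINT64_C(1) << PERFCTR_BITS) - 1)

/*
 * Elapsed count between two readings of a 40-bit counter.
 * The subtraction is done modulo 2^64 and then reduced modulo 2^40,
 * so a single wraparound between the readings is handled correctly.
 */
uint64_t perfctr_delta(uint64_t start, uint64_t end)
{
    return (end - start) & PERFCTR_MASK;
}
```

The two arguments would be values read from the same counter before and after the region of interest.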
The include file perfmon.h has encodings for the
various fields in the PerfEvtSel0/1 registers. These encodings are
pre-shifted so that when they are inclusive-ORed together, they
produce a value suitable for writing directly to the PerfEvtSel0/1
register. The defined values are:
| Event Name | Unit | Unit Mask |
|---|---|---|
| EVT_DATA_MEM_REFS | Data Cache Unit (DCU) | 00H |
| EVT_DCU_LINES_IN | Data Cache Unit (DCU) | 00H |
| EVT_DCU_M_LINES_IN | Data Cache Unit (DCU) | 00H |
| EVT_DCU_M_LINES_OUT | Data Cache Unit (DCU) | 00H |
| EVT_DCU_MISS_OUTSTANDING | Data Cache Unit (DCU) | 00H |
| EVT_IFU_IFETCH | Instruction Fetch Unit (IFU) | 00H |
| EVT_IFU_IFETCH_MISS | Instruction Fetch Unit (IFU) | 00H |
| EVT_ITLB_MISS | Instruction Fetch Unit (IFU) | 00H |
| EVT_IFU_MEM_STALL | Instruction Fetch Unit (IFU) | 00H |
| EVT_ILD_STALL | Instruction Fetch Unit (IFU) | 00H |
| EVT_L2_IFETCH | L2 Cache | MESI 0FH |
| EVT_L2_LD | L2 Cache | MESI 0FH |
| EVT_L2_ST | L2 Cache | MESI 0FH |
| EVT_L2_LINES_IN | L2 Cache | 00H |
| EVT_L2_LINES_OUT | L2 Cache | 00H |
| EVT_L2_M_LINES_INM | L2 Cache | 00H |
| EVT_L2_M_LINES_OUTM | L2 Cache | 00H |
| EVT_L2_RQSTS | L2 Cache | MESI 0FH |
| EVT_L2_ADS | L2 Cache | 00H |
| EVT_L2_DBUS_BUSY | L2 Cache | 00H |
| EVT_L2_DBUS_BUSY_RD | L2 Cache | 00H |
| EVT_BUS_DRDY_CLOCKS | External Bus Logic (EBL) | 00H (Self), 20H (Any) |
| EVT_BUS_LOCK_CLOCKS | External Bus Logic (EBL) | 00H (Self), 20H (Any) |
| EVT_BUS_REQ_OUTSTANDING | External Bus Logic (EBL) | 00H (Self) |
| EVT_BUS_TRAN_BRD | External Bus Logic (EBL) | 00H (Self), 20H (Any) |
| EVT_BUS_TRAN_RFO | External Bus Logic (EBL) | 00H (Self), 20H (Any) |
| EVT_BUS_TRANS_WB | External Bus Logic (EBL) | 00H (Self), 20H (Any) |
| EVT_BUS_TRAN_IFETCH | External Bus Logic (EBL) | 00H (Self), 20H (Any) |
| EVT_BUS_TRAN_INVAL | External Bus Logic (EBL) | 00H (Self), 20H (Any) |
| EVT_BUS_TRAN_PWR | External Bus Logic (EBL) | 00H (Self), 20H (Any) |
| EVT_BUS_TRANS_P | External Bus Logic (EBL) | 00H (Self), 20H (Any) |
| EVT_BUS_TRANS_IO | External Bus Logic (EBL) | 00H (Self), 20H (Any) |
| EVT_BUS_TRANS_DEF | External Bus Logic (EBL) | 00H (Self), 20H (Any) |
| EVT_BUS_TRAN_BURST | External Bus Logic (EBL) | 00H (Self), 20H (Any) |
| EVT_BUS_TRAN_ANY | External Bus Logic (EBL) | 00H (Self), 20H (Any) |
| EVT_BUS_TRAN_MEM | External Bus Logic (EBL) | 00H (Self), 20H (Any) |
| EVT_BUS_DATA_RCV | External Bus Logic (EBL) | 00H (Self) |
| EVT_BUS_BNR_DRV | External Bus Logic (EBL) | 00H (Self) |
| EVT_BUS_HIT_DRV | External Bus Logic (EBL) | 00H (Self) |
| EVT_BUS_HITM_DRV | External Bus Logic (EBL) | 00H (Self) |
| EVT_BUS_SNOOP_STALL | External Bus Logic (EBL) | 00H (Self) |
| EVT_FLOPS | Floating Point Unit | 00H |
| EVT_FP_COMP_OPS_EXE | Floating Point Unit | 00H |
| EVT_FP_ASSIST | Floating Point Unit | 00H |
| EVT_MUL | Floating Point Unit | 00H |
| EVT_DIV | Floating Point Unit | 00H |
| EVT_CYCLES_DIV_BUSY | Floating Point Unit | 00H |
| EVT_LD_BLOCKS | Memory Ordering | 00H |
| EVT_SB_DRAINS | Memory Ordering | 00H |
| EVT_MISALIGN_MEM_REF | Memory Ordering | 00H |
| EVT_INST_RETIRED | Instruction Decoding and Retirement | 00H |
| EVT_UOPS_RETIRED | Instruction Decoding and Retirement | 00H |
| EVT_INST_DECODER | Instruction Decoding and Retirement | 00H |
| EVT_HW_INT_RX | Interrupts | 00H |
| EVT_CYCLES_INT_MASKED | Interrupts | 00H |
| EVT_CYCLES_INT_PENDING_AND_MASKED | Interrupts | 00H |
| EVT_BR_INST_RETIRED | Branches | 00H |
| EVT_BR_MISS_PRED_RETIRED | Branches | 00H |
| EVT_BR_TAKEN_RETIRED | Branches | 00H |
| EVT_BR_MISS_PRED_TAKEN_RET | Branches | 00H |
| EVT_BR_INST_DECODED | Branches | 00H |
| EVT_BTB_MISSES | Branches | 00H |
| EVT_BR_BOGUS | Branches | 00H |
| EVT_BACLEARS | Branches | 00H |
| EVT_RESOURCE_STALLS | Stalls | 00H |
| EVT_PARTIAL_RAT_STALLS | Stalls | 00H |
| EVT_SEGMENT_REG_LOADS | Segment Register Loads | 00H |
| EVT_CPU_CLK_UNHALTED | Clocks | 00H |
For example, to count in PerfCtr0 the number of
instructions retired in user mode only, you would
use: EVT_INST_RETIRED | USER_MODE.
This value would then be passed to the PERFEVTSEL0_W
ioctl(). For a fuller description of the above events,
please refer to Appendix A of the
Intel Architecture Software Developer's Manual, Volume 3: System
Programming Guide. This is a PDF file that is included with
the Perfmon documentation.
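The way pre-shifted encodings combine, and how the result is handed to the driver by address, can be sketched as follows. This is an illustration under stated assumptions: the `SKETCH_` constants are stand-ins defined here so that the fragment compiles without perfmon.h (real programs should use the definitions from perfmon.h, whose actual values may differ), and `write_evtsel` is a hypothetical wrapper. The `request` argument would be a command such as PERFEVTSEL0_W.

```c
#include <sys/ioctl.h>
#include <unistd.h>

/* Stand-in values for illustration only; take the real ones from perfmon.h. */
#define SKETCH_EVT_INST_RETIRED 0xC0ul       /* event select code, bits 0-7 */
#define SKETCH_USER_MODE        (1ul << 16)  /* user mode flag, bit 16 */

/*
 * Hypothetical wrapper: pass an event-select value to the driver.
 * The driver command takes the address of a buffer holding the value,
 * so the value is stored in a local variable and its address is passed.
 * Returns the ioctl() result (-1 on failure, with errno set).
 */
int write_evtsel(int fd, int request, unsigned long value)
{
    return ioctl(fd, request, &value);
}
```

With a descriptor obtained from opening /dev/perfmon, the call would look like `write_evtsel(fd, PERFEVTSEL0_W, EVT_INST_RETIRED | USER_MODE)`, using the real constants from perfmon.h.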
ioctl() Routines

| Command | Arguments | Description |
|---|---|---|
| PERFEVTSEL0_W | Address of 32-bit buffer (unsigned long) | Sets the value of the PerfEvtSel0 register to the value contained in the passed-in buffer. |
| PERFEVTSEL1_W | Address of 32-bit buffer (unsigned long) | Sets the value of the PerfEvtSel1 register to the value contained in the passed-in buffer. |
| PERFEVTSEL0_R | Address of 32-bit buffer (unsigned long) | Returns the value of the PerfEvtSel0 register in the passed-in buffer. |
| PERFEVTSEL1_R | Address of 32-bit buffer (unsigned long) | Returns the value of the PerfEvtSel1 register in the passed-in buffer. |
| STARTPERFCTRS | None | Enables performance counting in both PerfCtr0 and PerfCtr1. |
| STOPPERFCTRS | None | Disables performance counting in both PerfCtr0 and PerfCtr1. |
| STOPERFCTR0 | None | Disables performance counting in PerfCtr0 only. |
| STOPERFCT1 | None | Disables performance counting in PerfCtr1 only. |
| PERFCTR0_W | Address of 32-bit buffer (unsigned long long) | Sets the value of the PerfCtr0 register to the value contained in the passed-in buffer. Bits 32-39 are sign-extended from bit 31. |
| PERFCTR1_W | Address of 32-bit buffer (unsigned long long) | Sets the value of the PerfCtr1 register to the value contained in the passed-in buffer. Bits 32-39 are sign-extended from bit 31. |
| TSC_PERFCTRS_EN | None | This command invokes the perfmon routine allowing unprivileged access to the TSC and the performance-counting registers. |
| WBINVD_CACHES | None | This command invokes the perfmon routine that can issue a "write-back and invalidate" instruction. All external caches are invalidated and written back. |
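The sign-extension rule for the counter write commands means that the 40-bit counter contents after a write are fully determined by the low 32 bits written. The sketch below models that rule in plain C. Both function names are hypothetical; the second one relies on a common idiom (writing the negative of a count so that the counter reaches zero after that many events), which is an assumption here and not something the Perfmon documentation prescribes.

```c
#include <stdint.h>

/*
 * Model of the 40-bit counter contents after a 32-bit value is written:
 * bits 32-39 are copies of bit 31 (sign extension).
 */
uint64_t perfctr_after_write(uint32_t written)
{
    uint64_t v = written;

    if (written & UINT32_C(0x80000000))
        v |= UINT64_C(0xFF) << 32;  /* replicate bit 31 into bits 32-39 */
    return v;
}

/*
 * Assumed idiom: the 32-bit value to write so that the counter wraps to
 * zero after n_events further increments (meaningful for n_events up to 2^31).
 */
uint32_t perfctr_preset_for(uint32_t n_events)
{
    return (uint32_t)(0u - n_events);
}
```

For a preset of 1000 events, the written value is FFFFFC18H, and the modelled counter contents are FFFFFFFC18H, which is exactly 1000 short of 2^40.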
These library routines are prototyped in perfmon.h
and can be linked into your program by adding the
options -L$PERFMON_HOME/lib -lperfmon to the compile command. These
routines are in the table below:
| Prototype | Description |
|---|---|
| unsigned long long read_tsc(void) | Returns a 64-bit value of the current Time-Stamp-Counter (TSC) register. |
| unsigned long long read_perfctr0(void) | Returns a 64-bit value of the current PerfCtr0 register. |
| unsigned long long read_perfctr1(void) | Returns a 64-bit value of the current PerfCtr1 register. |
| void cpu_serialize(void) | Issues a cpuid instruction to serialize all preceding instructions. |
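A typical way to combine these routines is to serialize, read a counter, run the code of interest, serialize again, and read the counter a second time. The sketch below shows that pattern with function pointers so that it compiles without libperfmon. `measure_region`, `demo_counter`, and `demo_noop` are hypothetical names; the two `demo_` functions are stand-ins that exist only so the pattern can be exercised without the driver.

```c
/* Signatures matching the library prototypes in the table above. */
typedef unsigned long long (*counter_read_fn)(void);
typedef void (*void_fn)(void);

/*
 * Count how far a counter advances while work() runs.
 * serialize() is called before each read so that earlier instructions
 * have completed before the counter is sampled.
 */
unsigned long long measure_region(counter_read_fn read_counter,
                                  void_fn serialize, void_fn work)
{
    unsigned long long start, end;

    serialize();
    start = read_counter();
    work();
    serialize();
    end = read_counter();
    return end - start;
}

/* Stand-in counter: advances by 100 on every read. */
unsigned long long demo_counter(void)
{
    static unsigned long long ticks;

    ticks += 100;
    return ticks;
}

/* Stand-in for both the serializing call and the measured work. */
void demo_noop(void)
{
}
```

In a program linked with libperfmon, the call would be along the lines of `measure_region(read_tsc, cpu_serialize, my_function)`, where `my_function` is the code being measured. The result includes the overhead of the second serializing call and the counter read itself.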
Here is an example of how to compile a user source file, tick.c, and link it with the library routines using the GNU gcc compiler:
# Compile using the library routines:
gcc -I$(PERFMON_HOME)/include -o tick -O tick.c -L$(PERFMON_HOME)/lib -lperfmon
In the examples directory, there are several example programs that show how to use Perfmon:
#ifdef in the source. Read the
source for more details.