Perfmon For The Pentium Pro User's Guide

Introduction

Perfmon for the Pentium Pro is a tool for monitoring the performance of the Intel Pentium Pro processor and the Intel P6 family of processors. It is a device driver that runs with the Sun Solaris operating system kernel. It allows user-level (non-privileged) code to read and write the performance-monitoring counters on the processor. This tool can be used in refining hardware designs, optimizing code, or diagnosing some system failures. It can also be used as a research tool for cross-platform or cross-operating-system benchmarking and comparisons. For an in-depth look at the Perfmon design, please refer to the Perfmon for the Pentium Pro Design document.

Supported Processors and Operating Systems

The following processors are supported.

Currently only the Solaris operating system is supported.

There is also Perfmon support for the UltraSPARC-I and UltraSPARC-II CPUs running the Solaris operating system. That version of Perfmon can be obtained from the following link: http://www.cse.msu.edu/~enbody/perfmon/perfmon.html.

Installation

The driver source code, binary and documentation, including this document, are "tarred" in the file perfmon.tar. After downloading the tar file, uncompressing it and untarring it in a directory, the following directories should be created:

An installation script, perfinstall, has been provided to install the perfmon device driver. This script has to be run with an administrator account. The script does the following:

The perfinstall script should be run from the directory where perfmon.tar has been expanded:

%<become root>
%cd <perfmon directory>
%perfinstall

Note that Perfmon will only work on the Intel Pentium Pro or the Intel P6 based machines.

Usage

Requirements

To write programs using Perfmon, you need access to perfmon.h (found in the include directory) and libperfmon.a (found in the lib directory). perfmon.h contains all the commands needed to control the performance counters and read the time-stamp counter (TSC). libperfmon.a contains the actual library routines that can be run in user-level mode.

Note that code written using Perfmon will only run on machines where the Perfmon driver is installed and loaded. Attempting to run programs using Perfmon on other machines may result in strange behavior, core dumps, illegal instruction errors, etc.

Writing code

There are two types of Perfmon routines:

The library routines can be called from a user application like any other library function. The driver routines must be accessed via ioctl(). To use ioctl(), you must first open the Perfmon device (accessible through /dev/perfmon). Once the device is open, you use ioctl() to tell the driver which routine you wish to run (passing arguments as necessary). Here is an example code segment which opens the device, enables unprivileged reading of the counters, and issues a request to write back and invalidate the external caches on the current CPU:

	#include <stdio.h>
	#include <stdlib.h>	/* exit() */
	#include <fcntl.h>
	#include <unistd.h>	/* ioctl(), close() */
	#include "perfmon.h"

	int
	main(void)
	{
	    int fd;
	    int rc;

	    fd = open("/dev/perfmon", O_RDONLY);
	    if (fd == -1) {
		perror("open(/dev/perfmon)");
		exit(1);
	    }

	    /* Enable reading of the TSC register and PerfCtrs */
	    rc = ioctl(fd, TSC_PERFCTRS_EN);
	    if (rc < 0) {
		perror("ioctl(TSC_PERFCTRS_EN)");
		exit(1);
	    }

	    /* Write back and invalidate all external caches of the current CPU */
	    rc = ioctl(fd, WBINVD_CACHES);
	    if (rc < 0) {
		perror("ioctl(WBINVD_CACHES)");
		exit(1);
	    }

	    close(fd);
	    return 0;
	}

Pentium Pro Performance Monitoring Counters

The Pentium Pro processor has two 40-bit performance counters, allowing two types of events to be monitored simultaneously. These counters can either count events or measure duration. When counting events, a counter is incremented each time a specified event takes place or a specified number of events takes place. When measuring duration, a counter counts the number of processor clocks that occur while a specified condition is true. The counters can count events or measure duration that occur at any privilege level.
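Because the counters are only 40 bits wide, a long measurement can wrap around between two reads. The following is a minimal sketch of a wrap-safe difference, assuming that a counter reading carries the count in its low 40 bits. The names PERFCTR_BITS, PERFCTR_MASK and perfctr_delta are illustrative and are not part of perfmon.h.

```c
/* Width of a Pentium Pro performance counter, in bits. */
#define PERFCTR_BITS	40
#define PERFCTR_MASK	((1ULL << PERFCTR_BITS) - 1ULL)

/*
 * Hypothetical helper (not part of perfmon.h): number of counts between
 * two readings of the same counter. The subtraction is done modulo 2^40,
 * so the result is correct as long as the counter wrapped at most once
 * between the two readings.
 */
unsigned long long
perfctr_delta(unsigned long long start, unsigned long long end)
{
	return ((end - start) & PERFCTR_MASK);
}
```

For example, a reading of 0xFFFFFFFFF0 followed by a reading of 0x10 yields a difference of 0x20 counts rather than a huge or negative value.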

The performance monitoring counters are supported by four Model Specific Registers (MSRs): the performance event select registers (PerfEvtSel0 and PerfEvtSel1), and the performance counter MSRs (PerfCtr0 and PerfCtr1). These registers reflect events that happen on a per-processor basis. For best results, it is recommended that you run your program on an MP machine and bind your process to a specific CPU. This prevents the process from migrating to another CPU and losing the performance data that has been collected.

Access to the PerfEvtSel0/1 registers is privileged. They can only be accessed by using the PERFEVTSEL0/1_W and PERFEVTSEL0/1_R ioctl() routines in the Perfmon driver (see the next section for more details). PerfEvtSel0 and PerfEvtSel1 each have the following bitfields (taken from section 10.6.1, Vol 3 of the Pentium Pro Family Developer's Manual):

Name Bits Description
Event Select 0-7 Select the event to be monitored (see next section for a list of events)
Unit mask field 8-15 Further qualifies the event selected in the event select field. For example for cache events, the mask is used as MESI-protocol qualifier of cache states.
User mode flag 16 Events are counted only when the processor is operating at privilege level 1, 2 or 3. This flag can be used in conjunction with the OS flag.
OS flag 17 Events are counted only when the processor is operating at privilege level 0. This flag can be used in conjunction with the User mode flag.
E (Edge detect) flag 18 1=Occurrence, 0=Duration
PC (pin control) flag 19 Enables the signaling of performance counter overflow via BP0 pin
INT (APIC interrupt enable) flag 20 Enables the signaling of counter overflow via input to APIC, 1=Enable, 0=Disable.
EN (Enable Counters) flag 22 This flag is only present in the PerfEvtSel0 MSR. When set performance counting is enabled in both performance-monitoring counters; when clear, both counters are disabled.
INV (invert) flag 23 Inverts the result of the counter-mask comparison when set, so that both greater than and less than comparisons can be made.
Counter mask field 24-31 When non-zero, the processor compares this mask to the number of events counted during a single cycle. If the event count is greater than or equal to this mask, the counter is incremented by one. Otherwise the counter is not incremented. This mask can be used to count events only if multiple occurrences happen per clock (e.g. two or more instructions retired per clock). If the counter-mask field is 0, then the counter is incremented each cycle by the number of events that occurred that cycle.
PerfEvtSel0 and PerfEvtSel1 Registers.

All other bitfields are reserved and should be set to 0.
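The bit layout in the table can be sketched in C as follows. This is only an illustration of the field positions: every EX_ name and both functions are hypothetical and are not taken from perfmon.h, which provides its own pre-shifted encodings. The second function is a small model of how the counter mask and INV flag are described above, not a statement about the hardware beyond what the table says.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative field encodings (hypothetical names, not from perfmon.h). */
#define EX_EVENT(code)	((uint32_t)(code) & 0xFFu)		/* bits 0-7 */
#define EX_UMASK(mask)	(((uint32_t)(mask) & 0xFFu) << 8)	/* bits 8-15 */
#define EX_USR		(1u << 16)	/* count at privilege levels 1-3 */
#define EX_OS		(1u << 17)	/* count at privilege level 0 */
#define EX_EDGE		(1u << 18)	/* 1 = occurrence, 0 = duration */
#define EX_PC		(1u << 19)	/* pin control */
#define EX_INT		(1u << 20)	/* APIC interrupt on overflow */
#define EX_EN		(1u << 22)	/* enable counters (PerfEvtSel0 only) */
#define EX_INV		(1u << 23)	/* invert counter-mask comparison */
#define EX_CMASK(n)	(((uint32_t)(n) & 0xFFu) << 24)	/* bits 24-31 */
#define EX_RESERVED	(1u << 21)	/* the one bit the table does not list */

/* Combine the fields into a 32-bit value; reserved bits must stay 0. */
uint32_t
ex_perfevtsel(unsigned event, unsigned umask, uint32_t flags, unsigned cmask)
{
	uint32_t value;

	value = EX_EVENT(event) | EX_UMASK(umask) | flags | EX_CMASK(cmask);
	assert((value & EX_RESERVED) == 0);
	return (value);
}

/*
 * Model of the counter-mask rule: how much a counter advances in one
 * cycle in which `events` events occurred.
 */
unsigned
ex_cycle_increment(unsigned events, unsigned cmask, int inv)
{
	if (cmask == 0)
		return (events);		/* add every event */
	if (inv)
		return (events < cmask ? 1u : 0u);
	return (events >= cmask ? 1u : 0u);
}
```

With these definitions, ex_perfevtsel(0xC0, 0x00, EX_USR | EX_EN, 0) gives 0x004100C0, where 0xC0 is the event code Intel lists for retired instructions. With a counter mask of 2, a cycle in which three events occur advances the counter by one, and a cycle with a single event does not advance it.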

The performance-counter MSRs (PerfCtr0 and PerfCtr1) contain the event or duration count for the selected events being counted. Writing to these counters can only be done at privilege level 0 and is accomplished via ioctl() calls to the Perfmon device driver. Perfmon also provides an ioctl() call (TSC_PERFCTRS_EN) that makes reading PerfCtr0/1 and the TSC register non-privileged. Once it has been issued, PerfCtr0/1 and the TSC register are read via Perfmon library calls.

The include file perfmon.h has encodings for the various fields in the PerfEvtSel0/1 registers. These encodings are pre-shifted so that when they are inclusive-ORed together, they produce a value suitable for writing directly to the PerfEvtSel0/1 register. The defined values are:

Event Name Unit Unit Mask
EVT_DATA_MEM_REFS Data Cache Unit (DCU) 00H
EVT_DCU_LINES_IN   00H
EVT_DCU_M_LINES_IN   00H
EVT_DCU_M_LINES_OUT   00H
EVT_DCU_MISS_OUTSTANDING   00H
EVT_IFU_IFETCH Instruction Fetch Unit (IFU) 00H
EVT_IFU_IFETCH_MISS   00H
EVT_ITLB_MISS   00H
EVT_IFU_MEM_STALL   00H
EVT_ILD_STALL   00H
EVT_L2_IFETCH L2 Cache MESI 0FH
EVT_L2_LD   MESI 0FH
EVT_L2_ST   MESI 0FH
EVT_L2_LINES_IN   00H
EVT_L2_LINES_OUT   00H
EVT_L2_M_LINES_INM   00H
EVT_L2_M_LINES_OUTM   00H
EVT_L2_RQSTS   MESI 0FH
EVT_L2_ADS   00H
EVT_L2_DBUS_BUSY   00H
EVT_L2_DBUS_BUSY_RD   00H
EVT_BUS_DRDY_CLOCKS External Bus Logic (EBL) 00H(Self) 20H(Any)
EVT_BUS_LOCK_CLOCKS   00H(Self) 20H(Any)
EVT_BUS_REQ_OUTSTANDING   00H (Self)
EVT_BUS_TRAN_BRD   00H (Self) 20H(Any)
EVT_BUS_TRAN_RFO   00H (Self) 20H(Any)
EVT_BUS_TRANS_WB   00H (Self) 20H(Any)
EVT_BUS_TRAN_IFETCH   00H (Self) 20H(Any)
EVT_BUS_TRAN_INVAL   00H (Self) 20H(Any)
EVT_BUS_TRAN_PWR   00H (Self) 20H(Any)
EVT_BUS_TRANS_P   00H (Self) 20H(Any)
EVT_BUS_TRANS_IO   00H (Self) 20H(Any)
EVT_BUS_TRANS_DEF   00H (Self) 20H(Any)
EVT_BUS_TRAN_BURST   00H (Self) 20H(Any)
EVT_BUS_TRAN_ANY   00H (Self) 20H(Any)
EVT_BUS_TRAN_MEM   00H (Self) 20H(Any)
EVT_BUS_DATA_RCV   00H (Self)
EVT_BUS_BNR_DRV   00H (Self)
EVT_BUS_HIT_DRV   00H (Self)
EVT_BUS_HITM_DRV   00H (Self)
EVT_BUS_SNOOP_STALL   00H (Self)
EVT_FLOPS Floating Point Unit 00H
EVT_FP_COMP_OPS_EXE   00H
EVT_FP_ASSIST   00H
EVT_MUL   00H
EVT_DIV   00H
EVT_CYCLES_DIV_BUSY   00H
EVT_LD_BLOCKS Memory Ordering 00H
EVT_SB_DRAINS   00H
EVT_MISALIGN_MEM_REF   00H
EVT_INST_RETIRED Instruction Decoding and Retirement 00H
EVT_UOPS_RETIRED   00H
EVT_INST_DECODER   00H
EVT_HW_INT_RX Interrupts 00H
EVT_CYCLES_INT_MASKED   00H
EVT_CYCLES_INT_PENDING_AND_MASKED   00H
EVT_BR_INST_RETIRED Branches 00H
EVT_BR_MISS_PRED_RETIRED   00H
EVT_BR_TAKEN_RETIRED   00H
EVT_BR_MISS_PRED_TAKEN_RET   00H
EVT_BR_INST_DECODED   00H
EVT_BTB_MISSES   00H
EVT_BR_BOGUS   00H
EVT_BACLEARS   00H
EVT_RESOURCE_STALLS Stalls 00H
EVT_PARTIAL_RAT_STALLS   00H
EVT_SEGMENT_REG_LOADS Segment Register Loads 00H
EVT_CPU_CLK_UNHALTED Clocks 00H

For example, to accumulate in PerfCtr0 the number of instructions retired in user mode only, you would use: EVT_INST_RETIRED | USER_MODE. This value would then be passed to the PERFEVTSEL0_W ioctl(). For a fuller description of the above events, please refer to Appendix A of the Intel Architecture Software Developer's Manual, Volume 3: System Programming Guide. This is a pdf file that is included with the Perfmon documentation.

ioctl() Routines

Command Arguments Description
PERFEVTSEL0_W Address of 32-bit buffer (unsigned long) Sets the value of the PerfEvtSel0 register to the value contained in the passed-in buffer.
PERFEVTSEL1_W Address of 32-bit buffer (unsigned long) Sets the value of the PerfEvtSel1 register to the value contained in the passed-in buffer.
PERFEVTSEL0_R Address of 32-bit buffer (unsigned long) Returns the value of PerfEvtSel0 register in the passed-in buffer.
PERFEVTSEL1_R Address of 32-bit buffer (unsigned long) Returns the Value of PerfEvtSel1 register in the passed-in buffer.
STARTPERFCTRS None Enables performance counting in both PerfCtr0 and PerfCtr1
STOPPERFCTRS None Disables performance counting in both PerfCtr0 and PerfCtr1
STOPERFCTR0 None Disables performance counting in PerfCtr0 only.
STOPERFCT1 None Disables performance counting in PerfCtr1 only.
PERFCTR0_W Address of 64-bit buffer (unsigned long long) Sets the value of the PerfCtr0 register to the value contained in the passed-in buffer. Only the low 32 bits are written; bits 32-39 are sign-extended from bit 31.
PERFCTR1_W Address of 64-bit buffer (unsigned long long) Sets the value of the PerfCtr1 register to the value contained in the passed-in buffer. Only the low 32 bits are written; bits 32-39 are sign-extended from bit 31.
TSC_PERFCTRS_EN None This command invokes the perfmon routine allowing unprivileged access to the TSC and the performance-counting registers.
WBINVD_CACHES None This command invokes the perfmon routine that issues a "write-back and invalidate" instruction. All external caches are written back and invalidated.
ioctl() interface
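The commands in the table follow two calling patterns: commands that take a buffer receive its address as the third ioctl() argument, and commands listed with no arguments are issued with the command alone. A minimal sketch of both patterns is shown here. The helper names are hypothetical, and the command numbers are passed in as parameters so that the sketch does not depend on perfmon.h; in a real program they would be the constants from the table.

```c
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper: program an event-select register, then enable
 * counting. evtsel_w_cmd would be PERFEVTSEL0_W or PERFEVTSEL1_W and
 * start_cmd would be STARTPERFCTRS. Returns 0 on success, -1 on error
 * (errno is left as set by ioctl()).
 */
int
ex_start_counting(int fd, int evtsel_w_cmd, int start_cmd, unsigned long evtsel)
{
	unsigned long buf = evtsel;	/* the driver is given this address */

	if (ioctl(fd, evtsel_w_cmd, &buf) < 0)
		return (-1);
	if (ioctl(fd, start_cmd) < 0)
		return (-1);
	return (0);
}

/* Hypothetical helper: stop_cmd would be STOPPERFCTRS. */
int
ex_stop_counting(int fd, int stop_cmd)
{
	return (ioctl(fd, stop_cmd) < 0 ? -1 : 0);
}
```

With perfmon.h included and fd opened on /dev/perfmon, a call might look like ex_start_counting(fd, PERFEVTSEL0_W, STARTPERFCTRS, EVT_INST_RETIRED | USER_MODE), followed later by ex_stop_counting(fd, STOPPERFCTRS).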

Library Routines

These library routines are prototyped in perfmon.h and can be linked into your program by adding the link options: -L$PERFMON_HOME/lib -lperfmon. The routines are listed in the table below:

Prototype Description
unsigned long long read_tsc(void) Returns a 64-bit value of the current Time-Stamp-Counter (TSC) register
unsigned long long read_perfctr0(void) Returns a 64-bit value of the current PerfCtr0 register.
unsigned long long read_perfctr1(void) Returns a 64-bit value of the current PerfCtr1 register.
void cpu_serialize(void) Issues a cpuid instruction to serialize all preceding instructions.
Library routines
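A typical use of these routines is to read a counter before and after a region of code and subtract, calling cpu_serialize() before each read so that earlier instructions have completed. The sketch below shows that pattern with the reader and serializer passed in as function pointers whose types match the prototypes in the table. The ex_ names are hypothetical, and the ex_fake_ functions are stand-ins that only exist so the pattern can be tried on a machine without the driver.

```c
/* Function-pointer types matching the prototypes in the table above. */
typedef unsigned long long (*ex_reader_fn)(void);
typedef void (*ex_void_fn)(void);

/*
 * Hypothetical helper: returns the change in reader() across work().
 * serialize() is called before each read.
 */
unsigned long long
ex_measure(ex_reader_fn reader, ex_void_fn serialize, ex_void_fn work)
{
	unsigned long long before, after;

	serialize();
	before = reader();
	work();
	serialize();
	after = reader();
	return (after - before);
}

/* Stand-ins for trying the pattern without the Perfmon driver. */
static unsigned long long ex_fake_ticks;

unsigned long long
ex_fake_read(void)
{
	return (ex_fake_ticks);
}

void
ex_fake_serialize(void)
{
	/* nothing to serialize in the stand-in */
}

void
ex_fake_work(void)
{
	ex_fake_ticks += 1000;	/* pretend the work took 1000 ticks */
}
```

On a machine with the driver loaded, and after the TSC_PERFCTRS_EN ioctl() has been issued, the real routines would be passed instead, for example ex_measure(read_tsc, cpu_serialize, my_function), where my_function is the code being timed.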

Here is an example of how to compile a user source file, tick.c, and link it with the library routines using the GNU gcc compiler:


# Compile using the library routines:
gcc -I$PERFMON_HOME/include -o tick -O tick.c -L$PERFMON_HOME/lib -lperfmon

Examples

In the examples directory, there are several example programs that show how to use Perfmon: