Perfmon User's Guide

Introduction

Perfmon is a tool that allows user-level code to access the performance counters present in the Ultra-series workstations and servers produced by Sun Microsystems. This is accomplished by a loadable driver that re-programs devices with performance counters so that user-level code can access these counters (normally, access to these counters is restricted to code running in privileged mode).

Supported devices

Currently, the following devices are supported:

There are plans to add support for the following devices:

Installation

After you have extracted the distribution, you will find several directories:

Before you can use Perfmon, you must have your system administrator add the Perfmon package to the system(s) you wish to use. A quick rundown of how this might work:

	<become root>
	cd <installation_dir>/pkgs
	pkgadd -d MSUperf MSUperf

Note that Perfmon will only work on UltraSPARC-based machines (those where uname -m prints sun4u).

In case you ever need to uninstall Perfmon from a system, just do the following:

	<become root>
	pkgrm MSUperf

Usage

Requirements

In order to write programs using Perfmon, you will need to have access to perfmon.h (found in the include directory), libperfmon.a, and optionally perfmon32.il or perfmon64.il. If you are using a compiler that generates 32-bit code (such as Sun's C compiler version 4.x and lower), you will need to use perfmon32.il. If you are using a compiler that generates 64-bit code (such as some versions of gcc or Sun's C compiler version 5.x), you will need perfmon64.il.

The .il files mentioned above are UltraSPARC-specific assembly language routines for inlining (for more information, see the inline(1) man page). This format is known to work with Sun's C and C++ compilers, but may need some modification for use with gcc.

Note that code written using Perfmon will only run on machines where the Perfmon driver is installed and loaded. Attempting to run programs using Perfmon on other machines may result in strange behavior, core dumps, illegal instruction errors, etc.

Writing code

There are three types of Perfmon routines: inline routines, library routines, and driver routines.

Both the inline and library routines can be run inside of a user application as you would expect. The driver routines need to be accessed via ioctl(). To use ioctl(), you must first open the Perfmon device (accessible through /dev/perfmon). After the device is open, you simply use ioctl() to communicate to the driver what routine you wish to run (passing arguments as necessary). Here is an example code segment which opens the device and issues a cache flush request on the current CPU:

	#include <stdio.h>
	#include <stdlib.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include "perfmon.h"

	int main(void)
	{
	    int fd;
	    int rc;

	    /* Open the Perfmon device so requests can be sent to the driver */
	    fd = open("/dev/perfmon", O_RDONLY);
	    if (fd == -1) {
		perror("open(/dev/perfmon)");
		exit(1);
	    }

	    /* Tell the driver to flush the cache of the current CPU */
	    rc = ioctl(fd, PERFMON_FLUSH_CACHE);
	    if (rc < 0) {
		perror("ioctl(PERFMON_FLUSH_CACHE)");
		exit(1);
	    }

	    close(fd);
	    return 0;
	}

UltraSPARC Performance Registers

The UltraSPARC CPU has two 64-bit registers that are used for gathering performance data: the Performance Control Register (PCR) and the Performance Instrumentation Counters (PIC). These registers reflect events that happen on a per-processor basis. For best results, run your program on an MP machine and bind your process to a specific CPU. Otherwise the process may migrate to another CPU, and the performance data collected so far will be lost.

Access to the PCR is privileged. The PCR can only be accessed by using the PERFMON_GETPCR and PERFMON_SETPCR ioctl() routines in the Perfmon driver (see the next section for more details). The PCR has the following bitfields (taken from Appendix B of the UltraSPARC-I User's Manual):

PCR register format

  Name  Bits   Description
  ----  -----  ------------------------------------------------------------
  PRIV  0      Privileged. If set, non-privileged access to the PIC will
               cause a privileged_action trap. For programs using Perfmon,
               this should always be set to 0.
  ST    1      System_trace. If set, events in privileged (system) mode
               are accumulated. This may be set along with PCR.UT to
               accumulate all events.
  UT    2      User_trace. If set, events in non-privileged (user) mode
               are accumulated. This may be set along with PCR.ST to
               accumulate all events.
  S0    4-7    Designates the type of event to accumulate in PIC.D0
               (PIC0). See the table below for more information.
  S1    11-14  Designates the type of event to accumulate in PIC.D1
               (PIC1). See the table below for more information.

All other bitfields are reserved and should be set to 0.
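
To make the bit layout concrete, here is a minimal sketch that packs the fields into a 64-bit PCR image using only the bit positions from the PCR register format table. The helper names (pcr_pack, pcr_s0, pcr_s1) and the PCR_*_SHIFT macros are hypothetical and are not part of perfmon.h. The numeric event selector codes are not listed in this guide, so they are left as plain parameters.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions taken from the PCR register format table.
 * These macro names are illustrative, not perfmon.h definitions. */
#define PCR_ST_SHIFT   1       /* ST is bit 1 */
#define PCR_UT_SHIFT   2       /* UT is bit 2 */
#define PCR_S0_SHIFT   4       /* S0 occupies bits 4-7 */
#define PCR_S1_SHIFT   11      /* S1 occupies bits 11-14 */
#define PCR_SEL_MASK   0xFULL  /* each selector field is 4 bits wide */

/* Hypothetical helper: build a PCR image from its fields.
 * PRIV (bit 0) and all reserved bits are left at 0. */
static inline uint64_t pcr_pack(int st, int ut, unsigned s0, unsigned s1)
{
    uint64_t pcr = 0;

    /* A selector that does not fit in 4 bits would spill into reserved bits. */
    assert(s0 <= PCR_SEL_MASK && s1 <= PCR_SEL_MASK);

    if (st)
        pcr |= 1ULL << PCR_ST_SHIFT;
    if (ut)
        pcr |= 1ULL << PCR_UT_SHIFT;
    pcr |= (uint64_t)s0 << PCR_S0_SHIFT;
    pcr |= (uint64_t)s1 << PCR_S1_SHIFT;
    return pcr;
}

/* Extract the S0 and S1 selector fields from a PCR image. */
static inline unsigned pcr_s0(uint64_t pcr)
{
    return (unsigned)((pcr >> PCR_S0_SHIFT) & PCR_SEL_MASK);
}

static inline unsigned pcr_s1(uint64_t pcr)
{
    return (unsigned)((pcr >> PCR_S1_SHIFT) & PCR_SEL_MASK);
}
```

With arbitrary selector codes 0x3 and 0x5 and only UT set, pcr_pack(0, 1, 0x3, 0x5) produces 0x2834. In real programs the pre-shifted constants from perfmon.h should be preferred; the sketch only shows what such a value looks like bit by bit.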

If the PCR.PRIV bit is clear, the PIC register can be accessed by user mode programs. Hand-coded assembly routines for doing this are located in the perfmon.il files and in the perfmon library. The PIC register has the following format:

PIC register format

  Name  Bits   Description
  ----  -----  ------------------------------------------------------------
  D0    0-31   A 32-bit counter holding the number of events accumulated
               for the event type selected by the PCR.S0 field.
  D1    32-63  A 32-bit counter holding the number of events accumulated
               for the event type selected by the PCR.S1 field.
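
As an illustration of the PIC layout, the following sketch splits a 64-bit PIC value into its two 32-bit counters and puts them back together. The function names are hypothetical stand-ins chosen for this example; Perfmon's own extraction routines are described in the Inline Routines section.

```c
#include <stdint.h>

/* D0 (PIC0) lives in bits 0-31 of the 64-bit PIC value. */
static inline uint32_t pic_d0(uint64_t pic)
{
    return (uint32_t)(pic & 0xFFFFFFFFULL);
}

/* D1 (PIC1) lives in bits 32-63 of the 64-bit PIC value. */
static inline uint32_t pic_d1(uint64_t pic)
{
    return (uint32_t)(pic >> 32);
}

/* Rebuild a 64-bit PIC image from two 32-bit counts.
 * Mostly useful for testing code that consumes PIC values. */
static inline uint64_t pic_join(uint32_t d0, uint32_t d1)
{
    return ((uint64_t)d1 << 32) | (uint64_t)d0;
}
```

Reading the whole 64-bit value once and splitting it afterwards keeps the two counts consistent with each other, which two separate reads would not guarantee.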

The include file perfmon.h has encodings for the various fields in the PCR register. These encodings are pre-shifted so that when they are inclusive-ORed together, they produce a value suitable for writing directly to the PCR register. The defined values are:

Perfmon defined PCR values

  Name                     PIC field  Description
  -----------------------  ---------  ------------------------------------------------------------
  PCR_PRIV_MODE            N/A        Sets PIC access to privileged-mode only. This should
                                      probably never be used for programs using Perfmon.
  PCR_SYS_TRACE            N/A        Causes events to be accumulated while in privileged
                                      (system) mode.
  PCR_USER_TRACE           N/A        Causes events to be accumulated while in non-privileged
                                      (user) mode.
  PCR_S0_CYCLE_CNT         PIC0       Accumulated cycles. This is similar to the SPARC-V9 TICK
                                      register, except that cycle counting is controlled by the
                                      PCR.UT and PCR.ST fields.
  PCR_S0_INSTR_CNT         PIC0       The number of instructions completed. Annulled,
                                      mispredicted or trapped instructions are not counted.
  PCR_S0_STALL_IC_MISS     PIC0       I-buffer is empty due to an I-Cache miss. This includes
                                      E-Cache miss processing if an E-Cache miss also occurs.
  PCR_S0_STALL_STORBUF     PIC0       The store buffer cannot hold additional stores, and a
                                      store instruction is the first instruction in the group.
  PCR_S0_IC_REF            PIC0       I-Cache references. I-Cache references are fetches of up
                                      to four instructions from an aligned block of eight
                                      instructions. I-Cache references are generally prefetches
                                      and do not correspond exactly to the instructions
                                      executed.
  PCR_S0_DC_READ           PIC0       D-Cache read references (including accesses that
                                      subsequently trap). Non-D-Cacheable accesses are not
                                      counted. Atomic instructions, block loads, "internal" and
                                      "external" bad ASIs, quad LDD, and MEMBARs also fall into
                                      this class.
  PCR_S0_DC_WRITE          PIC0       D-Cache write references (including accesses that
                                      subsequently trap). Non-D-Cacheable accesses are not
                                      counted.
  PCR_S0_STALL_LOAD        PIC0       An instruction in the execute stage depends on an earlier
                                      load result that is not yet available. This stalls all
                                      instructions in the execute and grouping stages. This
                                      also counts cases where no instructions are dispatched
                                      due to a one cycle load-load dependency on the first
                                      instruction presented to the grouping logic.
  PCR_S0_EC_REF            PIC0       Total E-Cache references. Non-cacheable accesses are not
                                      counted. NOTE: The E-Cache write reference count is
                                      determined by subtracting the D-Cache read misses
                                      (D-Cache read references minus D-Cache read hits) and
                                      I-Cache misses (I-Cache references minus I-Cache hits)
                                      from the total E-Cache references. Because of store
                                      buffer compression, this is not the same as D-Cache
                                      write misses.
  PCR_S0_EC_WRITE_RO       PIC0       E-Cache hits that do a read for ownership UPA
                                      transaction.
  PCR_S0_EC_SNOOP_INV      PIC0       E-Cache invalidations from the following UPA
                                      transactions: S_INV_REQ, S_CPI_REQ.
  PCR_S0_EC_READ_HIT       PIC0       E-Cache read hits from D-Cache misses. NOTE: The E-Cache
                                      write hit count is determined by subtracting the read
                                      hit and the instruction hit count from the total E-Cache
                                      hit count.
  PCR_S1_CYCLE_CNT         PIC1       Accumulated cycles. This is similar to the SPARC-V9 TICK
                                      register, except that cycle counting is controlled by the
                                      PCR.UT and PCR.ST fields.
  PCR_S1_INSTR_CNT         PIC1       The number of instructions completed. Annulled,
                                      mispredicted or trapped instructions are not counted.
  PCR_S1_STALL_MISPRED     PIC1       I-buffer is empty from branch misprediction. Branch
                                      misprediction kills instructions after the dispatch
                                      point, so the total number of pipeline bubbles is
                                      approximately twice as big as measured from this count.
  PCR_S1_STALL_FPDEP       PIC1       First instruction in the group depends on an earlier
                                      floating point result that is not yet available, but only
                                      while the earlier instruction is not stalled for a
                                      PCR_S0_STALL_LOAD. Thus, PCR_S1_STALL_FPDEP and
                                      PCR_S0_STALL_LOAD are mutually exclusive counts.
  PCR_S1_IC_HIT            PIC1       I-Cache hits.
  PCR_S1_DC_READ_HIT       PIC1       D-Cache read hits are counted in one of two places:
                                      1) When they access the D-Cache tags and do not enter the
                                      load buffer (because it is already empty). 2) When they
                                      exit the load buffer (due to a D-Cache miss or a
                                      non-empty load buffer).
  PCR_S1_DC_WRITE_HIT      PIC1       D-Cache write hits.
  PCR_S1_LOAD_STALL_RAW    PIC1       There is a load use in the execute stage and there is a
                                      read-after-write hazard on the oldest outstanding load.
                                      This indicates that load data is being delayed by
                                      completion of an earlier store.
  PCR_S1_EC_HIT            PIC1       Total E-Cache hits.
  PCR_S1_EC_WRITEBACK      PIC1       E-Cache misses that do writebacks.
  PCR_S1_EC_SNOOP_COPYBCK  PIC1       E-Cache snoop copy-backs from the following UPA
                                      transactions: S_CPB_REQ, S_CPI_REQ, S_CPD_REQ,
                                      S_CPB_MSI_REQ.
  PCR_S1_EC_IC_HIT         PIC1       E-Cache read hits from I-Cache misses.
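
The two NOTE entries in the table (under PCR_S0_EC_REF and PCR_S0_EC_READ_HIT) describe derived E-Cache write counts. The sketch below spells out that arithmetic. The struct and function names are hypothetical, and the totals are assumed to have been collected beforehand. Since only one PIC0 event and one PIC1 event can be counted at a time, the inputs would have to come from several runs of the same workload, so treat the results as estimates.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for event totals gathered over several runs.
 * Each member is named after the PCR value that selects the event. */
struct ec_counts {
    uint64_t ec_ref;       /* PCR_S0_EC_REF      */
    uint64_t ec_hit;       /* PCR_S1_EC_HIT      */
    uint64_t ec_read_hit;  /* PCR_S0_EC_READ_HIT */
    uint64_t ec_ic_hit;    /* PCR_S1_EC_IC_HIT   */
    uint64_t dc_read;      /* PCR_S0_DC_READ     */
    uint64_t dc_read_hit;  /* PCR_S1_DC_READ_HIT */
    uint64_t ic_ref;       /* PCR_S0_IC_REF      */
    uint64_t ic_hit;       /* PCR_S1_IC_HIT      */
};

/* E-Cache write references =
 *   E-Cache references - D-Cache read misses - I-Cache misses */
static inline uint64_t ec_write_refs(const struct ec_counts *c)
{
    uint64_t dc_read_miss, ic_miss;

    /* Totals from different runs may not line up; fail loudly if not. */
    assert(c->dc_read >= c->dc_read_hit);
    assert(c->ic_ref >= c->ic_hit);
    dc_read_miss = c->dc_read - c->dc_read_hit;
    ic_miss = c->ic_ref - c->ic_hit;
    assert(c->ec_ref >= dc_read_miss + ic_miss);
    return c->ec_ref - dc_read_miss - ic_miss;
}

/* E-Cache write hits =
 *   total E-Cache hits - read hits - instruction hits */
static inline uint64_t ec_write_hits(const struct ec_counts *c)
{
    assert(c->ec_hit >= c->ec_read_hit + c->ec_ic_hit);
    return c->ec_hit - c->ec_read_hit - c->ec_ic_hit;
}
```

For example, with 1000 E-Cache references, 400 D-Cache read misses and 100 I-Cache misses, ec_write_refs() returns 500.
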

For example, to accumulate the number of instructions executed in PIC0 and the number of cycles used in PIC1 while executing in user mode only, you would use: PCR_S1_CYCLE_CNT | PCR_S0_INSTR_CNT | PCR_USER_TRACE. This value would then be passed to the PERFMON_SETPCR ioctl().
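
With that configuration (instructions in PIC0, cycles in PIC1), a cycles-per-instruction figure can be derived from two PIC samples taken before and after the code being measured. The sketch below assumes the two raw 64-bit PIC values have already been read; the function names are hypothetical and not part of Perfmon.

```c
#include <stdint.h>

/* The PIC counters are only 32 bits wide and wrap around. Unsigned
 * subtraction still gives the right delta as long as fewer than 2^32
 * events occurred between the two samples. */
static inline uint32_t count_delta(uint32_t before, uint32_t after)
{
    return (uint32_t)(after - before);
}

/* Cycles per instruction from two raw 64-bit PIC samples, assuming
 * PIC0 (bits 0-31) counts instructions and PIC1 (bits 32-63) counts
 * cycles. Returns 0.0 if no instructions completed in between. */
static inline double cycles_per_instr(uint64_t pic_before, uint64_t pic_after)
{
    uint32_t instrs = count_delta((uint32_t)pic_before,
                                  (uint32_t)pic_after);
    uint32_t cycles = count_delta((uint32_t)(pic_before >> 32),
                                  (uint32_t)(pic_after >> 32));

    if (instrs == 0)
        return 0.0;
    return (double)cycles / (double)instrs;
}
```

Because PCR_USER_TRACE was the only trace bit selected, both deltas cover user-mode execution only, so the ratio excludes time spent in the kernel.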

ioctl() Routines

ioctl() interface

  Function             Arguments                   Description
  -------------------  --------------------------  ---------------------------------------------
  PERFMON_FLUSH_CACHE  None                        Flushes both L1 and L2 caches on the CPU that
                                                   the calling thread is running on.
  PERFMON_GETPCR       Address of a 64-bit buffer  Gets the current value of the PCR register of
                       (unsigned long long for     the UltraSPARC CPU that the calling thread is
                       Sun's C 4.x compilers)      running on and places it in the passed
                                                   buffer. See the UltraSPARC User's Manual for
                                                   details on register format.
  PERFMON_SETPCR       Address of a 64-bit buffer  Sets the value of the UltraSPARC PCR register
                                                   to the value that is contained in the
                                                   passed-in buffer.

Library Routines

These library routines are prototyped in perfmon.h and can be included in your code by adding the compile time options: -L$PERFMON_HOME/lib -lperfmon. See the Inline Routines section for descriptions of library functions that are duplicated by inline functions.

Library routines

  Prototype        Description
  ---------------  ------------------------------------------------------------
  void cpu_sync()  This routine executes a membar #Sync instruction which does
                   a barrier synchronization. After this instruction
                   completes, all previous instructions and memory accesses
                   are complete.
  void clr_pic()   This clears all the bits in both PIC0 and PIC1.

Inline Routines

There are several inline assembly language routines that are part of Perfmon. They are prototyped in perfmon.h:

Inline routines

  Prototype                                       Description
  ----------------------------------------------  --------------------------------------------------
  unsigned long long get_tick()                   This gets the current value of the TICK register.
                                                  This register represents the number of clock
                                                  cycles that have happened since the processor was
                                                  last powered on (or reset).
  unsigned long long get_pic()                    This atomically reads both PIC0 and PIC1.
  unsigned long get_pic0()                        Read only the value in PIC0.
  unsigned long get_pic1()                        Read only the value in PIC1.
  unsigned long extract_pic0(unsigned long long)  Given the 64-bit PIC, extract PIC.D0 (lower 32
                                                  bits).
  unsigned long extract_pic1(unsigned long long)  Given the 64-bit PIC, extract PIC.D1 (upper 32
                                                  bits).

In order to compile programs that use the inline versions (as opposed to the library versions), you must include the .il file on your compile command line. Also, since the routines use SPARC-V9-specific instructions, you must add the -xarch=v8plusa flag to your compile line so that the compiler will allow V9 instructions in your final executable. For example:

	# Compile using inline routines
	cc -xarch=v8plusa -o tick tick.c perfmon32.il -L$PERFMON_HOME/lib -lperfmon

	# Compile without inline routines (use the library versions)
	cc -xarch=v8plusa -o tick tick.c -L$PERFMON_HOME/lib -lperfmon
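
Once a program such as tick.c has two get_tick() readings, it usually wants an elapsed time. The sketch below shows one way to do that conversion. The helper names are hypothetical, and the CPU clock rate is an assumed parameter that the caller must supply. In SPARC-V9, bit 63 of the TICK register is the NPT flag rather than part of the counter. This guide does not say whether get_tick() masks that bit, so the sketch masks it defensively.

```c
#include <stdint.h>

/* SPARC-V9 TICK layout: bit 63 is the NPT flag, bits 0-62 hold the
 * cycle counter. Only the counter bits take part in the subtraction. */
#define TICK_COUNTER_MASK 0x7FFFFFFFFFFFFFFFULL

/* Cycles elapsed between two raw TICK readings (for example, two
 * values returned by get_tick()). */
static inline uint64_t tick_elapsed(uint64_t start, uint64_t end)
{
    return ((end & TICK_COUNTER_MASK) - (start & TICK_COUNTER_MASK))
           & TICK_COUNTER_MASK;
}

/* Convert a cycle count to microseconds for a CPU clocked at cpu_hz.
 * The clock rate is an assumption supplied by the caller. */
static inline double ticks_to_usec(uint64_t ticks, double cpu_hz)
{
    return (double)ticks * 1.0e6 / cpu_hz;
}
```

On a 300 MHz CPU, for instance, a difference of 300000 ticks corresponds to 1000 microseconds. Unlike the PIC cycle counts, TICK keeps running regardless of the PCR.UT and PCR.ST settings, so it measures wall-clock cycles on that CPU.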

Examples

In the examples directory, there are several example programs that show how to use Perfmon: