linux/arch/x86
Andi Kleen 724697648e perf/x86: Use INST_RETIRED.PREC_DIST for cycles: ppp
Add a new 'three-p' precise level, that uses INST_RETIRED.PREC_DIST as
base. The basic mechanism of abusing the inverse cmask to get all
cycles works the same as before.
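
As a rough illustration of the inverse-cmask trick, the sketch below packs an event into a raw counter config. With a counter mask of N, the counter increments only in cycles where the event fires at least N times; setting the invert bit flips that to "fewer than N times". If N is larger than the number of instructions that can retire in one cycle, the inverted condition holds in every cycle, so the event counts cycles. The field positions, the 0xc0/0x01 event code for INST_RETIRED.PREC_DIST, and the cmask value of 16 are assumptions not stated in this commit message, and `raw_config()` and `cycles_ppp_config()` are hypothetical helper names, not kernel functions.

```c
#include <stdint.h>

/* Field positions in an x86 PERFEVTSEL-style raw config (assumed layout). */
#define EVSEL_EVENT_SHIFT  0   /* bits 0-7: event select */
#define EVSEL_UMASK_SHIFT  8   /* bits 8-15: unit mask */
#define EVSEL_INV_SHIFT    23  /* bit 23: invert the cmask comparison */
#define EVSEL_CMASK_SHIFT  24  /* bits 24-31: counter mask threshold */

/* Hypothetical helper: pack the four fields into one raw config value. */
static uint64_t raw_config(uint8_t event, uint8_t umask, unsigned inv,
			   uint8_t cmask)
{
	return ((uint64_t)event << EVSEL_EVENT_SHIFT) |
	       ((uint64_t)umask << EVSEL_UMASK_SHIFT) |
	       ((uint64_t)(inv & 1) << EVSEL_INV_SHIFT) |
	       ((uint64_t)cmask << EVSEL_CMASK_SHIFT);
}

/*
 * Assumed encoding of INST_RETIRED.PREC_DIST with inv=1, cmask=16:
 * "fewer than 16 instructions retired this cycle" holds in every cycle.
 */
static uint64_t cycles_ppp_config(void)
{
	return raw_config(0xc0, 0x01, 1, 16);
}
```

Only the event and umask fields select what is sampled. The inv and cmask fields turn the per-cycle event count into an always-true condition, so the same trick applies to any PEBS-capable event.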

PREC_DIST is available on Sandy Bridge or later. It had some problems
on Sandy Bridge, so we only use it on Ivy Bridge and later. I tested it
on Broadwell and Skylake.

PREC_DIST has special support for avoiding shadow effects, which can
give better results compared to UOPS_RETIRED. The drawback is that
PREC_DIST can only be scheduled on counter 1, but that is OK for cycle
sampling, as there is normally no need to run multiple cycle sampling
sessions in parallel. It is still possible to run perf top in parallel,
as that doesn't use precise mode. Multiplexing also still allows
parallel operation.

cycles:pp stays with the previous UOPS_RETIRED based event.

Example:

Sample a loop with 10 sqrt with old cycles:pp

	  0.14 │10:   sqrtps %xmm1,%xmm0     <--------------
	  9.13 │      sqrtps %xmm1,%xmm0
	 11.58 │      sqrtps %xmm1,%xmm0
	 11.51 │      sqrtps %xmm1,%xmm0
	  6.27 │      sqrtps %xmm1,%xmm0
	 10.38 │      sqrtps %xmm1,%xmm0
	 12.20 │      sqrtps %xmm1,%xmm0
	 12.74 │      sqrtps %xmm1,%xmm0
	  5.40 │      sqrtps %xmm1,%xmm0
	 10.14 │      sqrtps %xmm1,%xmm0
	 10.51 │    ↑ jmp    10

We expect all 10 sqrt instructions to get roughly the same number of samples.

But you can see that the instruction directly after the JMP is
systematically underestimated in the result, due to sampling shadow
effects.

With the new PREC_DIST based sampling this problem is gone and all
instructions show up roughly evenly:

	  9.51 │10:   sqrtps %xmm1,%xmm0
	 11.74 │      sqrtps %xmm1,%xmm0
	 11.84 │      sqrtps %xmm1,%xmm0
	  6.05 │      sqrtps %xmm1,%xmm0
	 10.46 │      sqrtps %xmm1,%xmm0
	 12.25 │      sqrtps %xmm1,%xmm0
	 12.18 │      sqrtps %xmm1,%xmm0
	  5.26 │      sqrtps %xmm1,%xmm0
	 10.13 │      sqrtps %xmm1,%xmm0
	 10.43 │      sqrtps %xmm1,%xmm0
	  0.16 │    ↑ jmp    10

Even with PREC_DIST there is still sampling skid and the result is not
completely even, but systematic shadow effects are significantly
reduced.

The improvements are mainly expected to make a difference in high IPC
code. With low IPC it should be similar.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1448929689-13771-2-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06 11:15:32 +01:00
..
boot x86/mm: Fix regression with huge pages on PAE 2015-12-04 09:14:27 +01:00
configs Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux 2015-09-04 15:49:32 -07:00
crypto Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 2015-11-04 09:11:12 -08:00
entry x86/entry/64: Fix irqflag tracing wrt context tracking 2015-11-24 09:55:02 +01:00
ia32 Merge branch 'x86-headers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-11-03 21:05:40 -08:00
include Linux 4.4-rc5 2015-12-14 09:31:23 +01:00
kernel perf/x86: Use INST_RETIRED.PREC_DIST for cycles: ppp 2016-01-06 11:15:32 +01:00
kvm KVM: nVMX: remove incorrect vpid check in nested invvpid emulation 2015-11-25 15:52:55 +01:00
lguest genirq: Remove irq argument from irq flow handlers 2015-09-16 15:47:51 +02:00
lib x86, tracing, perf: Add trace point for MSR accesses 2015-12-06 12:56:10 +01:00
math-emu Merge branch 'x86-headers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-11-03 21:05:40 -08:00
mm x86/mpx: Fix instruction decoder condition 2015-12-05 18:52:14 +01:00
net ebpf: migrate bpf_prog's flags to bitfield 2015-10-03 05:02:39 -07:00
oprofile
pci Merge branches 'acpi-pci' and 'pm-pci' 2015-12-04 14:01:02 +01:00
platform Merge branch 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-11-03 21:33:18 -08:00
power x86/ldt: Make modify_ldt synchronous 2015-07-31 10:23:23 +02:00
purgatory
ras x86/ras/mce_amd_inj: Inject bank 4 errors on the NBC 2015-10-12 16:15:48 +02:00
realmode
tools
um um: Fix fpstate handling 2015-12-08 22:25:40 +01:00
video
xen xen: features for 4.4-rc0 2015-11-04 17:32:42 -08:00
.gitignore
Kbuild x86/asm/entry, x86/vdso: Move the vDSO code to arch/x86/entry/vdso/ 2015-06-03 18:51:37 +02:00
Kconfig Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-11-03 18:59:10 -08:00
Kconfig.cpu x86/Kconfig/cpus: Fix/complete CPU type help texts 2015-10-21 11:12:56 +02:00
Kconfig.debug x86: don't make DEBUG_WX default to 'y' even with DEBUG_RODATA 2015-11-06 09:12:41 -08:00
Makefile Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 2015-11-04 09:11:12 -08:00
Makefile_32.cpu
Makefile.um