linux

korg/linux

Fork 0

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-11-16 16:54:20 +08:00

Commit Graph

Author	SHA1	Message	Date
Ingo Molnar	5042afe7fe	generic: Use raw local irq variant for generic cmpxchg The interrupt disabled region is extremly tiny and therefor not latency relevant. Avoid cluttering the traces with those pointless entries. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2013-02-19 08:43:37 +01:00
David Howells	df9ee29270	Fix IRQ flag handling naming Fix the IRQ flag handling naming. In linux/irqflags.h under one configuration, it maps: local_irq_enable() -> raw_local_irq_enable() local_irq_disable() -> raw_local_irq_disable() local_irq_save() -> raw_local_irq_save() ... and under the other configuration, it maps: raw_local_irq_enable() -> local_irq_enable() raw_local_irq_disable() -> local_irq_disable() raw_local_irq_save() -> local_irq_save() ... This is quite confusing. There should be one set of names expected of the arch, and this should be wrapped to give another set of names that are expected by users of this facility. Change this to have the arch provide: flags = arch_local_save_flags() flags = arch_local_irq_save() arch_local_irq_restore(flags) arch_local_irq_disable() arch_local_irq_enable() arch_irqs_disabled_flags(flags) arch_irqs_disabled() arch_safe_halt() Then linux/irqflags.h wraps these to provide: raw_local_save_flags(flags) raw_local_irq_save(flags) raw_local_irq_restore(flags) raw_local_irq_disable() raw_local_irq_enable() raw_irqs_disabled_flags(flags) raw_irqs_disabled() raw_safe_halt() with type checking on the flags 'arguments', and then wraps those to provide: local_save_flags(flags) local_irq_save(flags) local_irq_restore(flags) local_irq_disable() local_irq_enable() irqs_disabled_flags(flags) irqs_disabled() safe_halt() with tracing included if enabled. The arch functions can now all be inline functions rather than some of them having to be macros. Signed-off-by: David Howells <dhowells@redhat.com> [X86, FRV, MN10300] Signed-off-by: Chris Metcalf <cmetcalf@tilera.com> [Tile] Signed-off-by: Michal Simek <monstr@monstr.eu> [Microblaze] Tested-by: Catalin Marinas <catalin.marinas@arm.com> [ARM] Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Haavard Skinnemoen <haavard.skinnemoen@atmel.com> [AVR] Acked-by: Tony Luck <tony.luck@intel.com> [IA-64] Acked-by: Hirokazu Takata <takata@linux-m32r.org> [M32R] Acked-by: Greg Ungerer <gerg@uclinux.org> [M68K/M68KNOMMU] Acked-by: Ralf Baechle <ralf@linux-mips.org> [MIPS] Acked-by: Kyle McMartin <kyle@mcmartin.ca> [PA-RISC] Acked-by: Paul Mackerras <paulus@samba.org> [PowerPC] Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> [S390] Acked-by: Chen Liqin <liqin.chen@sunplusct.com> [Score] Acked-by: Matt Fleming <matt@console-pimps.org> [SH] Acked-by: David S. Miller <davem@davemloft.net> [Sparc] Acked-by: Chris Zankel <chris@zankel.net> [Xtensa] Reviewed-by: Richard Henderson <rth@twiddle.net> [Alpha] Reviewed-by: Yoshinori Sato <ysato@users.sourceforge.jp> [H8300] Cc: starvik@axis.com [CRIS] Cc: jesper.nilsson@axis.com [CRIS] Cc: linux-cris-kernel@axis.com	2010-10-07 14:08:55 +01:00
Mathieu Desnoyers	068fbad288	Add cmpxchg_local to asm-generic for per cpu atomic operations Emulates the cmpxchg_local by disabling interrupts around variable modification. This is not reentrant wrt NMIs and MCEs. It is only protected against normal interrupts, but this is enough for architectures without such interrupt sources or if used in a context where the data is not shared with such handlers. It can be used as a fallback for architectures lacking a real cmpxchg instruction. For architectures that have a real cmpxchg but does not have NMIs or MCE, testing which of the generic vs architecture specific cmpxchg is the fastest should be done. asm-generic/cmpxchg.h defines a cmpxchg that uses cmpxchg_local. It is meant to be used as a cmpxchg fallback for architectures that do not support SMP. * Patch series comments Using cmpxchg_local shows a performance improvements of the fast path goes from a 66% speedup on a Pentium 4 to a 14% speedup on AMD64. In detail: Tested-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Measurements on a Pentium4, 3GHz, Hyperthread. SLUB Performance testing ======================== 1. Kmalloc: Repeatedly allocate then free test * slub HEAD, test 1 kmalloc(8) = 201 cycles kfree = 351 cycles kmalloc(16) = 198 cycles kfree = 359 cycles kmalloc(32) = 200 cycles kfree = 381 cycles kmalloc(64) = 224 cycles kfree = 394 cycles kmalloc(128) = 285 cycles kfree = 424 cycles kmalloc(256) = 411 cycles kfree = 546 cycles kmalloc(512) = 480 cycles kfree = 619 cycles kmalloc(1024) = 623 cycles kfree = 750 cycles kmalloc(2048) = 686 cycles kfree = 811 cycles kmalloc(4096) = 482 cycles kfree = 538 cycles kmalloc(8192) = 680 cycles kfree = 734 cycles kmalloc(16384) = 713 cycles kfree = 843 cycles * Slub HEAD, test 2 kmalloc(8) = 190 cycles kfree = 351 cycles kmalloc(16) = 195 cycles kfree = 360 cycles kmalloc(32) = 201 cycles kfree = 370 cycles kmalloc(64) = 245 cycles kfree = 389 cycles kmalloc(128) = 283 cycles kfree = 413 cycles kmalloc(256) = 409 cycles kfree = 547 cycles kmalloc(512) = 476 cycles kfree = 616 cycles kmalloc(1024) = 628 cycles kfree = 753 cycles kmalloc(2048) = 684 cycles kfree = 811 cycles kmalloc(4096) = 480 cycles kfree = 539 cycles kmalloc(8192) = 661 cycles kfree = 746 cycles kmalloc(16384) = 741 cycles kfree = 856 cycles * cmpxchg_local Slub test kmalloc(8) = 83 cycles kfree = 363 cycles kmalloc(16) = 85 cycles kfree = 372 cycles kmalloc(32) = 92 cycles kfree = 377 cycles kmalloc(64) = 115 cycles kfree = 397 cycles kmalloc(128) = 179 cycles kfree = 438 cycles kmalloc(256) = 314 cycles kfree = 564 cycles kmalloc(512) = 398 cycles kfree = 615 cycles kmalloc(1024) = 573 cycles kfree = 745 cycles kmalloc(2048) = 629 cycles kfree = 816 cycles kmalloc(4096) = 473 cycles kfree = 548 cycles kmalloc(8192) = 659 cycles kfree = 745 cycles kmalloc(16384) = 724 cycles kfree = 843 cycles 2. Kmalloc: alloc/free test * slub HEAD, test 1 kmalloc(8)/kfree = 322 cycles kmalloc(16)/kfree = 318 cycles kmalloc(32)/kfree = 318 cycles kmalloc(64)/kfree = 325 cycles kmalloc(128)/kfree = 318 cycles kmalloc(256)/kfree = 328 cycles kmalloc(512)/kfree = 328 cycles kmalloc(1024)/kfree = 328 cycles kmalloc(2048)/kfree = 328 cycles kmalloc(4096)/kfree = 678 cycles kmalloc(8192)/kfree = 1013 cycles kmalloc(16384)/kfree = 1157 cycles * Slub HEAD, test 2 kmalloc(8)/kfree = 323 cycles kmalloc(16)/kfree = 318 cycles kmalloc(32)/kfree = 318 cycles kmalloc(64)/kfree = 318 cycles kmalloc(128)/kfree = 318 cycles kmalloc(256)/kfree = 328 cycles kmalloc(512)/kfree = 328 cycles kmalloc(1024)/kfree = 328 cycles kmalloc(2048)/kfree = 328 cycles kmalloc(4096)/kfree = 648 cycles kmalloc(8192)/kfree = 1009 cycles kmalloc(16384)/kfree = 1105 cycles * cmpxchg_local Slub test kmalloc(8)/kfree = 112 cycles kmalloc(16)/kfree = 103 cycles kmalloc(32)/kfree = 103 cycles kmalloc(64)/kfree = 103 cycles kmalloc(128)/kfree = 112 cycles kmalloc(256)/kfree = 111 cycles kmalloc(512)/kfree = 111 cycles kmalloc(1024)/kfree = 111 cycles kmalloc(2048)/kfree = 121 cycles kmalloc(4096)/kfree = 650 cycles kmalloc(8192)/kfree = 1042 cycles kmalloc(16384)/kfree = 1149 cycles Tested-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Measurements on a AMD64 2.0 GHz dual-core In this test, we seem to remove 10 cycles from the kmalloc fast path. On small allocations, it gives a 14% performance increase. kfree fast path also seems to have a 10 cycles improvement. 1. Kmalloc: Repeatedly allocate then free test * cmpxchg_local slub kmalloc(8) = 63 cycles kfree = 126 cycles kmalloc(16) = 66 cycles kfree = 129 cycles kmalloc(32) = 76 cycles kfree = 138 cycles kmalloc(64) = 100 cycles kfree = 288 cycles kmalloc(128) = 128 cycles kfree = 309 cycles kmalloc(256) = 170 cycles kfree = 315 cycles kmalloc(512) = 221 cycles kfree = 357 cycles kmalloc(1024) = 324 cycles kfree = 393 cycles kmalloc(2048) = 354 cycles kfree = 440 cycles kmalloc(4096) = 394 cycles kfree = 330 cycles kmalloc(8192) = 523 cycles kfree = 481 cycles kmalloc(16384) = 643 cycles kfree = 649 cycles * Base kmalloc(8) = 74 cycles kfree = 113 cycles kmalloc(16) = 76 cycles kfree = 116 cycles kmalloc(32) = 85 cycles kfree = 133 cycles kmalloc(64) = 111 cycles kfree = 279 cycles kmalloc(128) = 138 cycles kfree = 294 cycles kmalloc(256) = 181 cycles kfree = 304 cycles kmalloc(512) = 237 cycles kfree = 327 cycles kmalloc(1024) = 340 cycles kfree = 379 cycles kmalloc(2048) = 378 cycles kfree = 433 cycles kmalloc(4096) = 399 cycles kfree = 329 cycles kmalloc(8192) = 528 cycles kfree = 624 cycles kmalloc(16384) = 651 cycles kfree = 737 cycles 2. Kmalloc: alloc/free test * cmpxchg_local slub kmalloc(8)/kfree = 96 cycles kmalloc(16)/kfree = 97 cycles kmalloc(32)/kfree = 97 cycles kmalloc(64)/kfree = 97 cycles kmalloc(128)/kfree = 97 cycles kmalloc(256)/kfree = 105 cycles kmalloc(512)/kfree = 108 cycles kmalloc(1024)/kfree = 105 cycles kmalloc(2048)/kfree = 107 cycles kmalloc(4096)/kfree = 390 cycles kmalloc(8192)/kfree = 626 cycles kmalloc(16384)/kfree = 662 cycles * Base kmalloc(8)/kfree = 116 cycles kmalloc(16)/kfree = 116 cycles kmalloc(32)/kfree = 116 cycles kmalloc(64)/kfree = 116 cycles kmalloc(128)/kfree = 116 cycles kmalloc(256)/kfree = 126 cycles kmalloc(512)/kfree = 126 cycles kmalloc(1024)/kfree = 126 cycles kmalloc(2048)/kfree = 126 cycles kmalloc(4096)/kfree = 384 cycles kmalloc(8192)/kfree = 749 cycles kmalloc(16384)/kfree = 786 cycles Tested-by: Christoph Lameter <clameter@sgi.com> I can confirm Mathieus' measurement now: Athlon64: regular NUMA/discontig 1. Kmalloc: Repeatedly allocate then free test 10000 times kmalloc(8) -> 79 cycles kfree -> 92 cycles 10000 times kmalloc(16) -> 79 cycles kfree -> 93 cycles 10000 times kmalloc(32) -> 88 cycles kfree -> 95 cycles 10000 times kmalloc(64) -> 124 cycles kfree -> 132 cycles 10000 times kmalloc(128) -> 157 cycles kfree -> 247 cycles 10000 times kmalloc(256) -> 200 cycles kfree -> 257 cycles 10000 times kmalloc(512) -> 250 cycles kfree -> 277 cycles 10000 times kmalloc(1024) -> 337 cycles kfree -> 314 cycles 10000 times kmalloc(2048) -> 365 cycles kfree -> 330 cycles 10000 times kmalloc(4096) -> 352 cycles kfree -> 240 cycles 10000 times kmalloc(8192) -> 456 cycles kfree -> 340 cycles 10000 times kmalloc(16384) -> 646 cycles kfree -> 471 cycles 2. Kmalloc: alloc/free test 10000 times kmalloc(8)/kfree -> 124 cycles 10000 times kmalloc(16)/kfree -> 124 cycles 10000 times kmalloc(32)/kfree -> 124 cycles 10000 times kmalloc(64)/kfree -> 124 cycles 10000 times kmalloc(128)/kfree -> 124 cycles 10000 times kmalloc(256)/kfree -> 132 cycles 10000 times kmalloc(512)/kfree -> 132 cycles 10000 times kmalloc(1024)/kfree -> 132 cycles 10000 times kmalloc(2048)/kfree -> 132 cycles 10000 times kmalloc(4096)/kfree -> 319 cycles 10000 times kmalloc(8192)/kfree -> 486 cycles 10000 times kmalloc(16384)/kfree -> 539 cycles cmpxchg_local NUMA/discontig 1. Kmalloc: Repeatedly allocate then free test 10000 times kmalloc(8) -> 55 cycles kfree -> 90 cycles 10000 times kmalloc(16) -> 55 cycles kfree -> 92 cycles 10000 times kmalloc(32) -> 70 cycles kfree -> 91 cycles 10000 times kmalloc(64) -> 100 cycles kfree -> 141 cycles 10000 times kmalloc(128) -> 128 cycles kfree -> 233 cycles 10000 times kmalloc(256) -> 172 cycles kfree -> 251 cycles 10000 times kmalloc(512) -> 225 cycles kfree -> 275 cycles 10000 times kmalloc(1024) -> 325 cycles kfree -> 311 cycles 10000 times kmalloc(2048) -> 346 cycles kfree -> 330 cycles 10000 times kmalloc(4096) -> 351 cycles kfree -> 238 cycles 10000 times kmalloc(8192) -> 450 cycles kfree -> 342 cycles 10000 times kmalloc(16384) -> 630 cycles kfree -> 546 cycles 2. Kmalloc: alloc/free test 10000 times kmalloc(8)/kfree -> 81 cycles 10000 times kmalloc(16)/kfree -> 81 cycles 10000 times kmalloc(32)/kfree -> 81 cycles 10000 times kmalloc(64)/kfree -> 81 cycles 10000 times kmalloc(128)/kfree -> 81 cycles 10000 times kmalloc(256)/kfree -> 91 cycles 10000 times kmalloc(512)/kfree -> 90 cycles 10000 times kmalloc(1024)/kfree -> 91 cycles 10000 times kmalloc(2048)/kfree -> 90 cycles 10000 times kmalloc(4096)/kfree -> 318 cycles 10000 times kmalloc(8192)/kfree -> 483 cycles 10000 times kmalloc(16384)/kfree -> 536 cycles Changelog: - Ran though checkpatch. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:30 -08:00

Author

SHA1

Message

Date

Ingo Molnar

5042afe7fe

generic: Use raw local irq variant for generic cmpxchg

The interrupt disabled region is extremly tiny and therefor not
latency relevant. Avoid cluttering the traces with those pointless
entries.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

2013-02-19 08:43:37 +01:00

David Howells

df9ee29270

Fix IRQ flag handling naming

Fix the IRQ flag handling naming.  In linux/irqflags.h under one configuration,
it maps:

	local_irq_enable() -> raw_local_irq_enable()
	local_irq_disable() -> raw_local_irq_disable()
	local_irq_save() -> raw_local_irq_save()
	...

and under the other configuration, it maps:

	raw_local_irq_enable() -> local_irq_enable()
	raw_local_irq_disable() -> local_irq_disable()
	raw_local_irq_save() -> local_irq_save()
	...

This is quite confusing.  There should be one set of names expected of the
arch, and this should be wrapped to give another set of names that are expected
by users of this facility.

Change this to have the arch provide:

	flags = arch_local_save_flags()
	flags = arch_local_irq_save()
	arch_local_irq_restore(flags)
	arch_local_irq_disable()
	arch_local_irq_enable()
	arch_irqs_disabled_flags(flags)
	arch_irqs_disabled()
	arch_safe_halt()

Then linux/irqflags.h wraps these to provide:

	raw_local_save_flags(flags)
	raw_local_irq_save(flags)
	raw_local_irq_restore(flags)
	raw_local_irq_disable()
	raw_local_irq_enable()
	raw_irqs_disabled_flags(flags)
	raw_irqs_disabled()
	raw_safe_halt()

with type checking on the flags 'arguments', and then wraps those to provide:

	local_save_flags(flags)
	local_irq_save(flags)
	local_irq_restore(flags)
	local_irq_disable()
	local_irq_enable()
	irqs_disabled_flags(flags)
	irqs_disabled()
	safe_halt()

with tracing included if enabled.

The arch functions can now all be inline functions rather than some of them
having to be macros.

Signed-off-by: David Howells <dhowells@redhat.com> [X86, FRV, MN10300]
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com> [Tile]
Signed-off-by: Michal Simek <monstr@monstr.eu> [Microblaze]
Tested-by: Catalin Marinas <catalin.marinas@arm.com> [ARM]
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Haavard Skinnemoen <haavard.skinnemoen@atmel.com> [AVR]
Acked-by: Tony Luck <tony.luck@intel.com> [IA-64]
Acked-by: Hirokazu Takata <takata@linux-m32r.org> [M32R]
Acked-by: Greg Ungerer <gerg@uclinux.org> [M68K/M68KNOMMU]
Acked-by: Ralf Baechle <ralf@linux-mips.org> [MIPS]
Acked-by: Kyle McMartin <kyle@mcmartin.ca> [PA-RISC]
Acked-by: Paul Mackerras <paulus@samba.org> [PowerPC]
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> [S390]
Acked-by: Chen Liqin <liqin.chen@sunplusct.com> [Score]
Acked-by: Matt Fleming <matt@console-pimps.org> [SH]
Acked-by: David S. Miller <davem@davemloft.net> [Sparc]
Acked-by: Chris Zankel <chris@zankel.net> [Xtensa]
Reviewed-by: Richard Henderson <rth@twiddle.net> [Alpha]
Reviewed-by: Yoshinori Sato <ysato@users.sourceforge.jp> [H8300]
Cc: starvik@axis.com [CRIS]
Cc: jesper.nilsson@axis.com [CRIS]
Cc: linux-cris-kernel@axis.com

2010-10-07 14:08:55 +01:00

Mathieu Desnoyers

068fbad288

Add cmpxchg_local to asm-generic for per cpu atomic operations

Emulates the cmpxchg_local by disabling interrupts around variable modification.
This is not reentrant wrt NMIs and MCEs. It is only protected against normal
interrupts, but this is enough for architectures without such interrupt sources
or if used in a context where the data is not shared with such handlers.

It can be used as a fallback for architectures lacking a real cmpxchg
instruction.

For architectures that have a real cmpxchg but does not have NMIs or MCE,
testing which of the generic vs architecture specific cmpxchg is the fastest
should be done.

asm-generic/cmpxchg.h defines a cmpxchg that uses cmpxchg_local. It is meant to
be used as a cmpxchg fallback for architectures that do not support SMP.

* Patch series comments

Using cmpxchg_local shows a performance improvements of the fast path goes from
a 66% speedup on a Pentium 4 to a 14% speedup on AMD64.

In detail:

Tested-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Measurements on a Pentium4, 3GHz, Hyperthread.
SLUB Performance testing
========================
1. Kmalloc: Repeatedly allocate then free test

* slub HEAD, test 1
kmalloc(8) = 201 cycles         kfree = 351 cycles
kmalloc(16) = 198 cycles        kfree = 359 cycles
kmalloc(32) = 200 cycles        kfree = 381 cycles
kmalloc(64) = 224 cycles        kfree = 394 cycles
kmalloc(128) = 285 cycles       kfree = 424 cycles
kmalloc(256) = 411 cycles       kfree = 546 cycles
kmalloc(512) = 480 cycles       kfree = 619 cycles
kmalloc(1024) = 623 cycles      kfree = 750 cycles
kmalloc(2048) = 686 cycles      kfree = 811 cycles
kmalloc(4096) = 482 cycles      kfree = 538 cycles
kmalloc(8192) = 680 cycles      kfree = 734 cycles
kmalloc(16384) = 713 cycles     kfree = 843 cycles

* Slub HEAD, test 2
kmalloc(8) = 190 cycles         kfree = 351 cycles
kmalloc(16) = 195 cycles        kfree = 360 cycles
kmalloc(32) = 201 cycles        kfree = 370 cycles
kmalloc(64) = 245 cycles        kfree = 389 cycles
kmalloc(128) = 283 cycles       kfree = 413 cycles
kmalloc(256) = 409 cycles       kfree = 547 cycles
kmalloc(512) = 476 cycles       kfree = 616 cycles
kmalloc(1024) = 628 cycles      kfree = 753 cycles
kmalloc(2048) = 684 cycles      kfree = 811 cycles
kmalloc(4096) = 480 cycles      kfree = 539 cycles
kmalloc(8192) = 661 cycles      kfree = 746 cycles
kmalloc(16384) = 741 cycles     kfree = 856 cycles

* cmpxchg_local Slub test
kmalloc(8) = 83 cycles          kfree = 363 cycles
kmalloc(16) = 85 cycles         kfree = 372 cycles
kmalloc(32) = 92 cycles         kfree = 377 cycles
kmalloc(64) = 115 cycles        kfree = 397 cycles
kmalloc(128) = 179 cycles       kfree = 438 cycles
kmalloc(256) = 314 cycles       kfree = 564 cycles
kmalloc(512) = 398 cycles       kfree = 615 cycles
kmalloc(1024) = 573 cycles      kfree = 745 cycles
kmalloc(2048) = 629 cycles      kfree = 816 cycles
kmalloc(4096) = 473 cycles      kfree = 548 cycles
kmalloc(8192) = 659 cycles      kfree = 745 cycles
kmalloc(16384) = 724 cycles     kfree = 843 cycles

2. Kmalloc: alloc/free test

* slub HEAD, test 1
kmalloc(8)/kfree = 322 cycles
kmalloc(16)/kfree = 318 cycles
kmalloc(32)/kfree = 318 cycles
kmalloc(64)/kfree = 325 cycles
kmalloc(128)/kfree = 318 cycles
kmalloc(256)/kfree = 328 cycles
kmalloc(512)/kfree = 328 cycles
kmalloc(1024)/kfree = 328 cycles
kmalloc(2048)/kfree = 328 cycles
kmalloc(4096)/kfree = 678 cycles
kmalloc(8192)/kfree = 1013 cycles
kmalloc(16384)/kfree = 1157 cycles

* Slub HEAD, test 2
kmalloc(8)/kfree = 323 cycles
kmalloc(16)/kfree = 318 cycles
kmalloc(32)/kfree = 318 cycles
kmalloc(64)/kfree = 318 cycles
kmalloc(128)/kfree = 318 cycles
kmalloc(256)/kfree = 328 cycles
kmalloc(512)/kfree = 328 cycles
kmalloc(1024)/kfree = 328 cycles
kmalloc(2048)/kfree = 328 cycles
kmalloc(4096)/kfree = 648 cycles
kmalloc(8192)/kfree = 1009 cycles
kmalloc(16384)/kfree = 1105 cycles

* cmpxchg_local Slub test
kmalloc(8)/kfree = 112 cycles
kmalloc(16)/kfree = 103 cycles
kmalloc(32)/kfree = 103 cycles
kmalloc(64)/kfree = 103 cycles
kmalloc(128)/kfree = 112 cycles
kmalloc(256)/kfree = 111 cycles
kmalloc(512)/kfree = 111 cycles
kmalloc(1024)/kfree = 111 cycles
kmalloc(2048)/kfree = 121 cycles
kmalloc(4096)/kfree = 650 cycles
kmalloc(8192)/kfree = 1042 cycles
kmalloc(16384)/kfree = 1149 cycles

Tested-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Measurements on a AMD64 2.0 GHz dual-core

In this test, we seem to remove 10 cycles from the kmalloc fast path.
On small allocations, it gives a 14% performance increase. kfree fast
path also seems to have a 10 cycles improvement.

1. Kmalloc: Repeatedly allocate then free test

* cmpxchg_local slub
kmalloc(8) = 63 cycles      kfree = 126 cycles
kmalloc(16) = 66 cycles     kfree = 129 cycles
kmalloc(32) = 76 cycles     kfree = 138 cycles
kmalloc(64) = 100 cycles    kfree = 288 cycles
kmalloc(128) = 128 cycles   kfree = 309 cycles
kmalloc(256) = 170 cycles   kfree = 315 cycles
kmalloc(512) = 221 cycles   kfree = 357 cycles
kmalloc(1024) = 324 cycles  kfree = 393 cycles
kmalloc(2048) = 354 cycles  kfree = 440 cycles
kmalloc(4096) = 394 cycles  kfree = 330 cycles
kmalloc(8192) = 523 cycles  kfree = 481 cycles
kmalloc(16384) = 643 cycles kfree = 649 cycles

* Base
kmalloc(8) = 74 cycles      kfree = 113 cycles
kmalloc(16) = 76 cycles     kfree = 116 cycles
kmalloc(32) = 85 cycles     kfree = 133 cycles
kmalloc(64) = 111 cycles    kfree = 279 cycles
kmalloc(128) = 138 cycles   kfree = 294 cycles
kmalloc(256) = 181 cycles   kfree = 304 cycles
kmalloc(512) = 237 cycles   kfree = 327 cycles
kmalloc(1024) = 340 cycles  kfree = 379 cycles
kmalloc(2048) = 378 cycles  kfree = 433 cycles
kmalloc(4096) = 399 cycles  kfree = 329 cycles
kmalloc(8192) = 528 cycles  kfree = 624 cycles
kmalloc(16384) = 651 cycles kfree = 737 cycles

2. Kmalloc: alloc/free test

* cmpxchg_local slub
kmalloc(8)/kfree = 96 cycles
kmalloc(16)/kfree = 97 cycles
kmalloc(32)/kfree = 97 cycles
kmalloc(64)/kfree = 97 cycles
kmalloc(128)/kfree = 97 cycles
kmalloc(256)/kfree = 105 cycles
kmalloc(512)/kfree = 108 cycles
kmalloc(1024)/kfree = 105 cycles
kmalloc(2048)/kfree = 107 cycles
kmalloc(4096)/kfree = 390 cycles
kmalloc(8192)/kfree = 626 cycles
kmalloc(16384)/kfree = 662 cycles

* Base
kmalloc(8)/kfree = 116 cycles
kmalloc(16)/kfree = 116 cycles
kmalloc(32)/kfree = 116 cycles
kmalloc(64)/kfree = 116 cycles
kmalloc(128)/kfree = 116 cycles
kmalloc(256)/kfree = 126 cycles
kmalloc(512)/kfree = 126 cycles
kmalloc(1024)/kfree = 126 cycles
kmalloc(2048)/kfree = 126 cycles
kmalloc(4096)/kfree = 384 cycles
kmalloc(8192)/kfree = 749 cycles
kmalloc(16384)/kfree = 786 cycles

Tested-by: Christoph Lameter <clameter@sgi.com>
I can confirm Mathieus' measurement now:

Athlon64:

regular NUMA/discontig

1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 79 cycles kfree -> 92 cycles
10000 times kmalloc(16) -> 79 cycles kfree -> 93 cycles
10000 times kmalloc(32) -> 88 cycles kfree -> 95 cycles
10000 times kmalloc(64) -> 124 cycles kfree -> 132 cycles
10000 times kmalloc(128) -> 157 cycles kfree -> 247 cycles
10000 times kmalloc(256) -> 200 cycles kfree -> 257 cycles
10000 times kmalloc(512) -> 250 cycles kfree -> 277 cycles
10000 times kmalloc(1024) -> 337 cycles kfree -> 314 cycles
10000 times kmalloc(2048) -> 365 cycles kfree -> 330 cycles
10000 times kmalloc(4096) -> 352 cycles kfree -> 240 cycles
10000 times kmalloc(8192) -> 456 cycles kfree -> 340 cycles
10000 times kmalloc(16384) -> 646 cycles kfree -> 471 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 124 cycles
10000 times kmalloc(16)/kfree -> 124 cycles
10000 times kmalloc(32)/kfree -> 124 cycles
10000 times kmalloc(64)/kfree -> 124 cycles
10000 times kmalloc(128)/kfree -> 124 cycles
10000 times kmalloc(256)/kfree -> 132 cycles
10000 times kmalloc(512)/kfree -> 132 cycles
10000 times kmalloc(1024)/kfree -> 132 cycles
10000 times kmalloc(2048)/kfree -> 132 cycles
10000 times kmalloc(4096)/kfree -> 319 cycles
10000 times kmalloc(8192)/kfree -> 486 cycles
10000 times kmalloc(16384)/kfree -> 539 cycles

cmpxchg_local NUMA/discontig

1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 55 cycles kfree -> 90 cycles
10000 times kmalloc(16) -> 55 cycles kfree -> 92 cycles
10000 times kmalloc(32) -> 70 cycles kfree -> 91 cycles
10000 times kmalloc(64) -> 100 cycles kfree -> 141 cycles
10000 times kmalloc(128) -> 128 cycles kfree -> 233 cycles
10000 times kmalloc(256) -> 172 cycles kfree -> 251 cycles
10000 times kmalloc(512) -> 225 cycles kfree -> 275 cycles
10000 times kmalloc(1024) -> 325 cycles kfree -> 311 cycles
10000 times kmalloc(2048) -> 346 cycles kfree -> 330 cycles
10000 times kmalloc(4096) -> 351 cycles kfree -> 238 cycles
10000 times kmalloc(8192) -> 450 cycles kfree -> 342 cycles
10000 times kmalloc(16384) -> 630 cycles kfree -> 546 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 81 cycles
10000 times kmalloc(16)/kfree -> 81 cycles
10000 times kmalloc(32)/kfree -> 81 cycles
10000 times kmalloc(64)/kfree -> 81 cycles
10000 times kmalloc(128)/kfree -> 81 cycles
10000 times kmalloc(256)/kfree -> 91 cycles
10000 times kmalloc(512)/kfree -> 90 cycles
10000 times kmalloc(1024)/kfree -> 91 cycles
10000 times kmalloc(2048)/kfree -> 90 cycles
10000 times kmalloc(4096)/kfree -> 318 cycles
10000 times kmalloc(8192)/kfree -> 483 cycles
10000 times kmalloc(16384)/kfree -> 536 cycles

Changelog:
- Ran though checkpatch.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2008-02-07 08:42:30 -08:00

3 Commits