linux/mm/slab.c

4250 lines
107 KiB
C
Raw Normal View History

/*
* linux/mm/slab.c
* Written by Mark Hemment, 1996/97.
* (markhe@nextd.demon.co.uk)
*
* kmem_cache_destroy() + some cleanup - 1999 Andrea Arcangeli
*
* Major cleanup, different bufctl logic, per-cpu arrays
* (c) 2000 Manfred Spraul
*
* Cleanup, make the head arrays unconditional, preparation for NUMA
* (c) 2002 Manfred Spraul
*
* An implementation of the Slab Allocator as described in outline in;
* UNIX Internals: The New Frontiers by Uresh Vahalia
* Pub: Prentice Hall ISBN 0-13-101908-2
* or with a little more detail in;
* The Slab Allocator: An Object-Caching Kernel Memory Allocator
* Jeff Bonwick (Sun Microsystems).
* Presented at: USENIX Summer 1994 Technical Conference
*
* The memory is organized in caches, one cache for each object type.
* (e.g. inode_cache, dentry_cache, buffer_head, vm_area_struct)
* Each cache consists out of many slabs (they are small (usually one
* page long) and always contiguous), and each slab contains multiple
* initialized objects.
*
* This means, that your constructor is used only for newly allocated
* slabs and you must pass objects with the same initializations to
* kmem_cache_free.
*
* Each cache can only support one memory type (GFP_DMA, GFP_HIGHMEM,
* normal). If you need a special memory type, then must create a new
* cache for that memory type.
*
* In order to reduce fragmentation, the slabs are sorted in 3 groups:
* full slabs with 0 free objects
* partial slabs
* empty slabs with no allocated objects
*
* If partial slabs exist, then new allocations come from these slabs,
* otherwise from empty slabs or new slabs are allocated.
*
* kmem_cache_destroy() CAN CRASH if you try to allocate from the cache
* during kmem_cache_destroy(). The caller must prevent concurrent allocs.
*
* Each cache has a short per-cpu head array, most allocs
* and frees go into that array, and if that array overflows, then 1/2
* of the entries in the array are given back into the global cache.
* The head array is strictly LIFO and should improve the cache hit rates.
* On SMP, it additionally reduces the spinlock operations.
*
* The c_cpuarray may not be read with enabled local interrupts -
* it's changed with a smp_call_function().
*
* SMP synchronization:
* constructors and destructors are called without any locking.
* Several members in struct kmem_cache and struct slab never change, they
* are accessed without any locking.
* The per-cpu arrays are never accessed from the wrong cpu, no locking,
* and local interrupts are disabled so slab code is preempt-safe.
* The non-constant members are protected with a per-cache irq spinlock.
*
* Many thanks to Mark Hemment, who wrote another per-cpu slab patch
* in 2000 - many ideas in the current implementation are derived from
* his patch.
*
* Further notes from the original documentation:
*
* 11 April '97. Started multi-threading - markhe
* The global cache-chain is protected by the mutex 'slab_mutex'.
* The sem is only needed when accessing/extending the cache-chain, which
* can never happen inside an interrupt (kmem_cache_create(),
* kmem_cache_shrink() and kmem_cache_reap()).
*
* At present, each engine can be growing a cache. This should be blocked.
*
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
* 15 March 2005. NUMA slab allocator.
* Shai Fultheim <shai@scalex86.org>.
* Shobhit Dayal <shobhit@calsoftinc.com>
* Alok N Kataria <alokk@calsoftinc.com>
* Christoph Lameter <christoph@lameter.com>
*
* Modified the slab allocator to be node aware on NUMA systems.
* Each node has its own list of partial, free and full slabs.
* All object allocations for a node occur from node specific slab lists.
*/
#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/poison.h>
#include <linux/swap.h>
#include <linux/cache.h>
#include <linux/interrupt.h>
#include <linux/init.h>
#include <linux/compiler.h>
[PATCH] cpuset memory spread slab cache implementation Provide the slab cache infrastructure to support cpuset memory spreading. See the previous patches, cpuset_mem_spread, for an explanation of cpuset memory spreading. This patch provides a slab cache SLAB_MEM_SPREAD flag. If set in the kmem_cache_create() call defining a slab cache, then any task marked with the process state flag PF_MEMSPREAD will spread memory page allocations for that cache over all the allowed nodes, instead of preferring the local (faulting) node. On systems not configured with CONFIG_NUMA, this results in no change to the page allocation code path for slab caches. On systems with cpusets configured in the kernel, but the "memory_spread" cpuset option not enabled for the current tasks cpuset, this adds a call to a cpuset routine and failed bit test of the processor state flag PF_SPREAD_SLAB. For tasks so marked, a second inline test is done for the slab cache flag SLAB_MEM_SPREAD, and if that is set and if the allocation is not in_interrupt(), this adds a call to to a cpuset routine that computes which of the tasks mems_allowed nodes should be preferred for this allocation. ==> This patch adds another hook into the performance critical code path to allocating objects from the slab cache, in the ____cache_alloc() chunk, below. The next patch optimizes this hook, reducing the impact of the combined mempolicy plus memory spreading hooks on this critical code path to a single check against the tasks task_struct flags word. This patch provides the generic slab flags and logic needed to apply memory spreading to a particular slab. A subsequent patch will mark a few specific slab caches for this placement policy. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 19:16:07 +08:00
#include <linux/cpuset.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/notifier.h>
#include <linux/kallsyms.h>
#include <linux/cpu.h>
#include <linux/sysctl.h>
#include <linux/module.h>
#include <linux/rcupdate.h>
#include <linux/string.h>
#include <linux/uaccess.h>
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
#include <linux/nodemask.h>
#include <linux/kmemleak.h>
[PATCH] NUMA policies in the slab allocator V2 This patch fixes a regression in 2.6.14 against 2.6.13 that causes an imbalance in memory allocation during bootup. The slab allocator in 2.6.13 is not numa aware and simply calls alloc_pages(). This means that memory policies may control the behavior of alloc_pages(). During bootup the memory policy is set to MPOL_INTERLEAVE resulting in the spreading out of allocations during bootup over all available nodes. The slab allocator in 2.6.13 has only a single list of slab pages. As a result the per cpu slab cache and the spinlock controlled page lists may contain slab entries from off node memory. The slab allocator in 2.6.13 makes no effort to discern the locality of an entry on its lists. The NUMA aware slab allocator in 2.6.14 controls locality of the slab pages explicitly by calling alloc_pages_node(). The NUMA slab allocator manages slab entries by having lists of available slab pages for each node. The per cpu slab cache can only contain slab entries associated with the node local to the processor. This guarantees that the default allocation mode of the slab allocator always assigns local memory if available. Setting MPOL_INTERLEAVE as a default policy during bootup has no effect anymore. In 2.6.14 all node unspecific slab allocations are performed on the boot processor. This means that most of key data structures are allocated on one node. Most processors will have to refer to these structures making the boot node a potential bottleneck. This may reduce performance and cause unnecessary memory pressure on the boot node. This patch implements NUMA policies in the slab layer. There is the need of explicit application of NUMA memory policies by the slab allcator itself since the NUMA slab allocator does no longer let the page_allocator control locality. The check for policies is made directly at the beginning of __cache_alloc using current->mempolicy. The memory policy is already frequently checked by the page allocator (alloc_page_vma() and alloc_page_current()). So it is highly likely that the cacheline is present. For MPOL_INTERLEAVE kmalloc() will spread out each request to one node after another so that an equal distribution of allocations can be obtained during bootup. It is not possible to push the policy check to lower layers of the NUMA slab allocator since the per cpu caches are now only containing slab entries from the current node. If the policy says that the local node is not to be preferred or forbidden then there is no point in checking the slab cache or local list of slab pages. The allocation better be directed immediately to the lists containing slab entries for the allowed set of nodes. This way of applying policy also fixes another strange behavior in 2.6.13. alloc_pages() is controlled by the memory allocation policy of the current process. It could therefore be that one process is running with MPOL_INTERLEAVE and would f.e. obtain a new page following that policy since no slab entries are in the lists anymore. A page can typically be used for multiple slab entries but lets say that the current process is only using one. The other entries are then added to the slab lists. These are now non local entries in the slab lists despite of the possible availability of local pages that would provide faster access and increase the performance of the application. Another process without MPOL_INTERLEAVE may now run and expect a local slab entry from kmalloc(). However, there are still these free slab entries from the off node page obtained from the other process via MPOL_INTERLEAVE in the cache. The process will then get an off node slab entry although other slab entries may be available that are local to that process. This means that the policy if one process may contaminate the locality of the slab caches for other processes. This patch in effect insures that a per process policy is followed for the allocation of slab entries and that there cannot be a memory policy influence from one process to another. A process with default policy will always get a local slab entry if one is available. And the process using memory policies will get its memory arranged as requested. Off-node slab allocation will require the use of spinlocks and will make the use of per cpu caches not possible. A process using memory policies to redirect allocations offnode will have to cope with additional lock overhead in addition to the latency added by the need to access a remote slab entry. Changes V1->V2 - Remove #ifdef CONFIG_NUMA by moving forward declaration into prior #ifdef CONFIG_NUMA section. - Give the function determining the node number to use a saner name. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-19 09:42:36 +08:00
#include <linux/mempolicy.h>
#include <linux/mutex.h>
#include <linux/fault-inject.h>
#include <linux/rtmutex.h>
2006-12-13 16:34:27 +08:00
#include <linux/reciprocal_div.h>
infrastructure to debug (dynamic) objects We can see an ever repeating problem pattern with objects of any kind in the kernel: 1) freeing of active objects 2) reinitialization of active objects Both problems can be hard to debug because the crash happens at a point where we have no chance to decode the root cause anymore. One problem spot are kernel timers, where the detection of the problem often happens in interrupt context and usually causes the machine to panic. While working on a timer related bug report I had to hack specialized code into the timer subsystem to get a reasonable hint for the root cause. This debug hack was fine for temporary use, but far from a mergeable solution due to the intrusiveness into the timer code. The code further lacked the ability to detect and report the root cause instantly and keep the system operational. Keeping the system operational is important to get hold of the debug information without special debugging aids like serial consoles and special knowledge of the bug reporter. The problems described above are not restricted to timers, but timers tend to expose it usually in a full system crash. Other objects are less explosive, but the symptoms caused by such mistakes can be even harder to debug. Instead of creating specialized debugging code for the timer subsystem a generic infrastructure is created which allows developers to verify their code and provides an easy to enable debug facility for users in case of trouble. The debugobjects core code keeps track of operations on static and dynamic objects by inserting them into a hashed list and sanity checking them on object operations and provides additional checks whenever kernel memory is freed. The tracked object operations are: - initializing an object - adding an object to a subsystem list - deleting an object from a subsystem list Each operation is sanity checked before the operation is executed and the subsystem specific code can provide a fixup function which allows to prevent the damage of the operation. When the sanity check triggers a warning message and a stack trace is printed. The list of operations can be extended if the need arises. For now it's limited to the requirements of the first user (timers). The core code enqueues the objects into hash buckets. The hash index is generated from the address of the object to simplify the lookup for the check on kfree/vfree. Each bucket has it's own spinlock to avoid contention on a global lock. The debug code can be compiled in without being active. The runtime overhead is minimal and could be optimized by asm alternatives. A kernel command line option enables the debugging code. Thanks to Ingo Molnar for review, suggestions and cleanup patches. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Greg KH <greg@kroah.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 15:55:01 +08:00
#include <linux/debugobjects.h>
#include <linux/kmemcheck.h>
#include <linux/memory.h>
#include <linux/prefetch.h>
#include <net/sock.h>
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
#include <asm/page.h>
#include <trace/events/kmem.h>
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
#include "internal.h"
#include "slab.h"
/*
* DEBUG - 1 for kmem_cache_create() to honour; SLAB_RED_ZONE & SLAB_POISON.
* 0 for faster, smaller code (especially in the critical paths).
*
* STATS - 1 to collect stats for /proc/slabinfo.
* 0 for faster, smaller code (especially in the critical paths).
*
* FORCED_DEBUG - 1 enables SLAB_RED_ZONE and SLAB_POISON (if possible)
*/
#ifdef CONFIG_DEBUG_SLAB
#define DEBUG 1
#define STATS 1
#define FORCED_DEBUG 1
#else
#define DEBUG 0
#define STATS 0
#define FORCED_DEBUG 0
#endif
/* Shouldn't this be in a header file somewhere? */
#define BYTES_PER_WORD sizeof(void *)
#define REDZONE_ALIGN max(BYTES_PER_WORD, __alignof__(unsigned long long))
#ifndef ARCH_KMALLOC_FLAGS
#define ARCH_KMALLOC_FLAGS SLAB_HWCACHE_ALIGN
#endif
#define FREELIST_BYTE_INDEX (((PAGE_SIZE >> BITS_PER_BYTE) \
<= SLAB_OBJ_MIN_SIZE) ? 1 : 0)
#if FREELIST_BYTE_INDEX
typedef unsigned char freelist_idx_t;
#else
typedef unsigned short freelist_idx_t;
#endif
#define SLAB_OBJ_MAX_NUM ((1 << sizeof(freelist_idx_t) * BITS_PER_BYTE) - 1)
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
/*
* true if a page was allocated from pfmemalloc reserves for network-based
* swap
*/
static bool pfmemalloc_active __read_mostly;
/*
* struct array_cache
*
* Purpose:
* - LIFO ordering, to hand out cache-warm objects from _alloc
* - reduce the number of linked list operations
* - reduce spinlock operations
*
* The limit is stored in the per-cpu structure to reduce the data cache
* footprint.
*
*/
struct array_cache {
unsigned int avail;
unsigned int limit;
unsigned int batchcount;
unsigned int touched;
void *entry[]; /*
* Must have this definition in here for the proper
* alignment of array_cache. Also simplifies accessing
* the entries.
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
*
* Entries should not be directly dereferenced as
* entries belonging to slabs marked pfmemalloc will
* have the lower bits set SLAB_OBJ_PFMEMALLOC
*/
};
struct alien_cache {
spinlock_t lock;
struct array_cache ac;
};
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
#define SLAB_OBJ_PFMEMALLOC 1
static inline bool is_obj_pfmemalloc(void *objp)
{
return (unsigned long)objp & SLAB_OBJ_PFMEMALLOC;
}
static inline void set_obj_pfmemalloc(void **objp)
{
*objp = (void *)((unsigned long)*objp | SLAB_OBJ_PFMEMALLOC);
return;
}
static inline void clear_obj_pfmemalloc(void **objp)
{
*objp = (void *)((unsigned long)*objp & ~SLAB_OBJ_PFMEMALLOC);
}
/*
* bootstrap: The caches do not work without cpuarrays anymore, but the
* cpuarrays are allocated from the generic caches...
*/
#define BOOT_CPUCACHE_ENTRIES 1
struct arraycache_init {
struct array_cache cache;
void *entries[BOOT_CPUCACHE_ENTRIES];
};
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
/*
* Need this for bootstrapping a per node allocator.
*/
#define NUM_INIT_LISTS (2 * MAX_NUMNODES)
static struct kmem_cache_node __initdata init_kmem_cache_node[NUM_INIT_LISTS];
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
#define CACHE_CACHE 0
#define SIZE_NODE (MAX_NUMNODES)
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
static int drain_freelist(struct kmem_cache *cache,
struct kmem_cache_node *n, int tofree);
static void free_block(struct kmem_cache *cachep, void **objpp, int len,
int node, struct list_head *list);
static void slabs_destroy(struct kmem_cache *cachep, struct list_head *list);
slab: setup allocators earlier in the boot sequence This patch makes kmalloc() available earlier in the boot sequence so we can get rid of some bootmem allocations. The bulk of the changes are due to kmem_cache_init() being called with interrupts disabled which requires some changes to allocator boostrap code. Note: 32-bit x86 does WP protect test in mem_init() so we must setup traps before we call mem_init() during boot as reported by Ingo Molnar: We have a hard crash in the WP-protect code: [ 0.000000] Checking if this processor honours the WP bit even in supervisor mode...BUG: Int 14: CR2 ffcff000 [ 0.000000] EDI 00000188 ESI 00000ac7 EBP c17eaf9c ESP c17eaf8c [ 0.000000] EBX 000014e0 EDX 0000000e ECX 01856067 EAX 00000001 [ 0.000000] err 00000003 EIP c10135b1 CS 00000060 flg 00010002 [ 0.000000] Stack: c17eafa8 c17fd410 c16747bc c17eafc4 c17fd7e5 000011fd f8616000 c18237cc [ 0.000000] 00099800 c17bb000 c17eafec c17f1668 000001c5 c17f1322 c166e039 c1822bf0 [ 0.000000] c166e033 c153a014 c18237cc 00020800 c17eaff8 c17f106a 00020800 01ba5003 [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip-02161-g7a74539-dirty #52203 [ 0.000000] Call Trace: [ 0.000000] [<c15357c2>] ? printk+0x14/0x16 [ 0.000000] [<c10135b1>] ? do_test_wp_bit+0x19/0x23 [ 0.000000] [<c17fd410>] ? test_wp_bit+0x26/0x64 [ 0.000000] [<c17fd7e5>] ? mem_init+0x1ba/0x1d8 [ 0.000000] [<c17f1668>] ? start_kernel+0x164/0x2f7 [ 0.000000] [<c17f1322>] ? unknown_bootoption+0x0/0x19c [ 0.000000] [<c17f106a>] ? __init_begin+0x6a/0x6f Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matt Mackall <mpm@selenic.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11 00:40:04 +08:00
static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp);
2006-11-22 22:55:48 +08:00
static void cache_reap(struct work_struct *unused);
static int slab_early_init = 1;
#define INDEX_NODE kmalloc_index(sizeof(struct kmem_cache_node))
static void kmem_cache_node_init(struct kmem_cache_node *parent)
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
{
INIT_LIST_HEAD(&parent->slabs_full);
INIT_LIST_HEAD(&parent->slabs_partial);
INIT_LIST_HEAD(&parent->slabs_free);
parent->shared = NULL;
parent->alien = NULL;
parent->colour_next = 0;
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
spin_lock_init(&parent->list_lock);
parent->free_objects = 0;
parent->free_touched = 0;
}
#define MAKE_LIST(cachep, listp, slab, nodeid) \
do { \
INIT_LIST_HEAD(listp); \
list_splice(&get_node(cachep, nodeid)->slab, listp); \
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
} while (0)
#define MAKE_ALL_LISTS(cachep, ptr, nodeid) \
do { \
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
MAKE_LIST((cachep), (&(ptr)->slabs_full), slabs_full, nodeid); \
MAKE_LIST((cachep), (&(ptr)->slabs_partial), slabs_partial, nodeid); \
MAKE_LIST((cachep), (&(ptr)->slabs_free), slabs_free, nodeid); \
} while (0)
#define CFLGS_OFF_SLAB (0x80000000UL)
#define OFF_SLAB(x) ((x)->flags & CFLGS_OFF_SLAB)
mm: slab: only move management objects off-slab for sizes larger than KMALLOC_MIN_SIZE On systems with a KMALLOC_MIN_SIZE of 128 (arm64, some mips and powerpc configurations defining ARCH_DMA_MINALIGN to 128), the first kmalloc_caches[] entry to be initialised after slab_early_init = 0 is "kmalloc-128" with index 7. Depending on the debug kernel configuration, sizeof(struct kmem_cache) can be larger than 128 resulting in an INDEX_NODE of 8. Commit 8fc9cf420b36 ("slab: make more slab management structure off the slab") enables off-slab management objects for sizes starting with PAGE_SIZE >> 5 (128 bytes for a 4KB page configuration) and the creation of the "kmalloc-128" cache would try to place the management objects off-slab. However, since KMALLOC_MIN_SIZE is already 128 and freelist_size == 32 in __kmem_cache_create(), kmalloc_slab(freelist_size) returns NULL (kmalloc_caches[7] not populated yet). This triggers the following bug on arm64: kernel BUG at /work/Linux/linux-2.6-aarch64/mm/slab.c:2283! Internal error: Oops - BUG: 0 [#1] SMP Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 4.3.0-rc4+ #540 Hardware name: Juno (DT) PC is at __kmem_cache_create+0x21c/0x280 LR is at __kmem_cache_create+0x210/0x280 [...] Call trace: __kmem_cache_create+0x21c/0x280 create_boot_cache+0x48/0x80 create_kmalloc_cache+0x50/0x88 create_kmalloc_caches+0x4c/0xf4 kmem_cache_init+0x100/0x118 start_kernel+0x214/0x33c This patch introduces an OFF_SLAB_MIN_SIZE definition to avoid off-slab management objects for sizes equal to or smaller than KMALLOC_MIN_SIZE. Fixes: 8fc9cf420b36 ("slab: make more slab management structure off the slab") Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> [3.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 10:45:54 +08:00
#define OFF_SLAB_MIN_SIZE (max_t(size_t, PAGE_SIZE >> 5, KMALLOC_MIN_SIZE + 1))
#define BATCHREFILL_LIMIT 16
/*
* Optimization question: fewer reaps means less probability for unnessary
* cpucache drain/refill cycles.
*
2005-11-08 23:44:08 +08:00
* OTOH the cpuarrays can contain lots of objects,
* which could lock up otherwise freeable slabs.
*/
#define REAPTIMEOUT_AC (2*HZ)
#define REAPTIMEOUT_NODE (4*HZ)
#if STATS
#define STATS_INC_ACTIVE(x) ((x)->num_active++)
#define STATS_DEC_ACTIVE(x) ((x)->num_active--)
#define STATS_INC_ALLOCED(x) ((x)->num_allocations++)
#define STATS_INC_GROWN(x) ((x)->grown++)
#define STATS_ADD_REAPED(x,y) ((x)->reaped += (y))
#define STATS_SET_HIGH(x) \
do { \
if ((x)->num_active > (x)->high_mark) \
(x)->high_mark = (x)->num_active; \
} while (0)
#define STATS_INC_ERR(x) ((x)->errors++)
#define STATS_INC_NODEALLOCS(x) ((x)->node_allocs++)
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
#define STATS_INC_NODEFREES(x) ((x)->node_frees++)
#define STATS_INC_ACOVERFLOW(x) ((x)->node_overflow++)
#define STATS_SET_FREEABLE(x, i) \
do { \
if ((x)->max_freeable < i) \
(x)->max_freeable = i; \
} while (0)
#define STATS_INC_ALLOCHIT(x) atomic_inc(&(x)->allochit)
#define STATS_INC_ALLOCMISS(x) atomic_inc(&(x)->allocmiss)
#define STATS_INC_FREEHIT(x) atomic_inc(&(x)->freehit)
#define STATS_INC_FREEMISS(x) atomic_inc(&(x)->freemiss)
#else
#define STATS_INC_ACTIVE(x) do { } while (0)
#define STATS_DEC_ACTIVE(x) do { } while (0)
#define STATS_INC_ALLOCED(x) do { } while (0)
#define STATS_INC_GROWN(x) do { } while (0)
#define STATS_ADD_REAPED(x,y) do { (void)(y); } while (0)
#define STATS_SET_HIGH(x) do { } while (0)
#define STATS_INC_ERR(x) do { } while (0)
#define STATS_INC_NODEALLOCS(x) do { } while (0)
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
#define STATS_INC_NODEFREES(x) do { } while (0)
#define STATS_INC_ACOVERFLOW(x) do { } while (0)
#define STATS_SET_FREEABLE(x, i) do { } while (0)
#define STATS_INC_ALLOCHIT(x) do { } while (0)
#define STATS_INC_ALLOCMISS(x) do { } while (0)
#define STATS_INC_FREEHIT(x) do { } while (0)
#define STATS_INC_FREEMISS(x) do { } while (0)
#endif
#if DEBUG
/*
* memory layout of objects:
* 0 : objp
* 0 .. cachep->obj_offset - BYTES_PER_WORD - 1: padding. This ensures that
* the end of an object is aligned with the end of the real
* allocation. Catches writes behind the end of the allocation.
* cachep->obj_offset - BYTES_PER_WORD .. cachep->obj_offset - 1:
* redzone word.
* cachep->obj_offset: The real object.
* cachep->size - 2* BYTES_PER_WORD: redzone word [BYTES_PER_WORD long]
* cachep->size - 1* BYTES_PER_WORD: last caller address
* [BYTES_PER_WORD long]
*/
static int obj_offset(struct kmem_cache *cachep)
{
return cachep->obj_offset;
}
Increase slab redzone to 64bits There are two problems with the existing redzone implementation. Firstly, it's causing misalignment of structures which contain a 64-bit integer, such as netfilter's 'struct ipt_entry' -- causing netfilter modules to fail to load because of the misalignment. (In particular, the first check in net/ipv4/netfilter/ip_tables.c::check_entry_size_and_hooks()) On ppc32 and sparc32, amongst others, __alignof__(uint64_t) == 8. With slab debugging, we use 32-bit redzones. And allocated slab objects aren't sufficiently aligned to hold a structure containing a uint64_t. By _just_ setting ARCH_KMALLOC_MINALIGN to __alignof__(u64) we'd disable redzone checks on those architectures. By using 64-bit redzones we avoid that loss of debugging, and also fix the other problem while we're at it. When investigating this, I noticed that on 64-bit platforms we're using a 32-bit value of RED_ACTIVE/RED_INACTIVE in the 64-bit memory location set aside for the redzone. Which means that the four bytes immediately before or after the allocated object at 0x00,0x00,0x00,0x00 for LE and BE machines, respectively. Which is probably not the most useful choice of poison value. One way to fix both of those at once is just to switch to 64-bit redzones in all cases. Signed-off-by: David Woodhouse <dwmw2@infradead.org> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 15:22:59 +08:00
static unsigned long long *dbg_redzone1(struct kmem_cache *cachep, void *objp)
{
BUG_ON(!(cachep->flags & SLAB_RED_ZONE));
Increase slab redzone to 64bits There are two problems with the existing redzone implementation. Firstly, it's causing misalignment of structures which contain a 64-bit integer, such as netfilter's 'struct ipt_entry' -- causing netfilter modules to fail to load because of the misalignment. (In particular, the first check in net/ipv4/netfilter/ip_tables.c::check_entry_size_and_hooks()) On ppc32 and sparc32, amongst others, __alignof__(uint64_t) == 8. With slab debugging, we use 32-bit redzones. And allocated slab objects aren't sufficiently aligned to hold a structure containing a uint64_t. By _just_ setting ARCH_KMALLOC_MINALIGN to __alignof__(u64) we'd disable redzone checks on those architectures. By using 64-bit redzones we avoid that loss of debugging, and also fix the other problem while we're at it. When investigating this, I noticed that on 64-bit platforms we're using a 32-bit value of RED_ACTIVE/RED_INACTIVE in the 64-bit memory location set aside for the redzone. Which means that the four bytes immediately before or after the allocated object at 0x00,0x00,0x00,0x00 for LE and BE machines, respectively. Which is probably not the most useful choice of poison value. One way to fix both of those at once is just to switch to 64-bit redzones in all cases. Signed-off-by: David Woodhouse <dwmw2@infradead.org> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 15:22:59 +08:00
return (unsigned long long*) (objp + obj_offset(cachep) -
sizeof(unsigned long long));
}
Increase slab redzone to 64bits There are two problems with the existing redzone implementation. Firstly, it's causing misalignment of structures which contain a 64-bit integer, such as netfilter's 'struct ipt_entry' -- causing netfilter modules to fail to load because of the misalignment. (In particular, the first check in net/ipv4/netfilter/ip_tables.c::check_entry_size_and_hooks()) On ppc32 and sparc32, amongst others, __alignof__(uint64_t) == 8. With slab debugging, we use 32-bit redzones. And allocated slab objects aren't sufficiently aligned to hold a structure containing a uint64_t. By _just_ setting ARCH_KMALLOC_MINALIGN to __alignof__(u64) we'd disable redzone checks on those architectures. By using 64-bit redzones we avoid that loss of debugging, and also fix the other problem while we're at it. When investigating this, I noticed that on 64-bit platforms we're using a 32-bit value of RED_ACTIVE/RED_INACTIVE in the 64-bit memory location set aside for the redzone. Which means that the four bytes immediately before or after the allocated object at 0x00,0x00,0x00,0x00 for LE and BE machines, respectively. Which is probably not the most useful choice of poison value. One way to fix both of those at once is just to switch to 64-bit redzones in all cases. Signed-off-by: David Woodhouse <dwmw2@infradead.org> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 15:22:59 +08:00
static unsigned long long *dbg_redzone2(struct kmem_cache *cachep, void *objp)
{
BUG_ON(!(cachep->flags & SLAB_RED_ZONE));
if (cachep->flags & SLAB_STORE_USER)
return (unsigned long long *)(objp + cachep->size -
Increase slab redzone to 64bits There are two problems with the existing redzone implementation. Firstly, it's causing misalignment of structures which contain a 64-bit integer, such as netfilter's 'struct ipt_entry' -- causing netfilter modules to fail to load because of the misalignment. (In particular, the first check in net/ipv4/netfilter/ip_tables.c::check_entry_size_and_hooks()) On ppc32 and sparc32, amongst others, __alignof__(uint64_t) == 8. With slab debugging, we use 32-bit redzones. And allocated slab objects aren't sufficiently aligned to hold a structure containing a uint64_t. By _just_ setting ARCH_KMALLOC_MINALIGN to __alignof__(u64) we'd disable redzone checks on those architectures. By using 64-bit redzones we avoid that loss of debugging, and also fix the other problem while we're at it. When investigating this, I noticed that on 64-bit platforms we're using a 32-bit value of RED_ACTIVE/RED_INACTIVE in the 64-bit memory location set aside for the redzone. Which means that the four bytes immediately before or after the allocated object at 0x00,0x00,0x00,0x00 for LE and BE machines, respectively. Which is probably not the most useful choice of poison value. One way to fix both of those at once is just to switch to 64-bit redzones in all cases. Signed-off-by: David Woodhouse <dwmw2@infradead.org> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 15:22:59 +08:00
sizeof(unsigned long long) -
REDZONE_ALIGN);
return (unsigned long long *) (objp + cachep->size -
Increase slab redzone to 64bits There are two problems with the existing redzone implementation. Firstly, it's causing misalignment of structures which contain a 64-bit integer, such as netfilter's 'struct ipt_entry' -- causing netfilter modules to fail to load because of the misalignment. (In particular, the first check in net/ipv4/netfilter/ip_tables.c::check_entry_size_and_hooks()) On ppc32 and sparc32, amongst others, __alignof__(uint64_t) == 8. With slab debugging, we use 32-bit redzones. And allocated slab objects aren't sufficiently aligned to hold a structure containing a uint64_t. By _just_ setting ARCH_KMALLOC_MINALIGN to __alignof__(u64) we'd disable redzone checks on those architectures. By using 64-bit redzones we avoid that loss of debugging, and also fix the other problem while we're at it. When investigating this, I noticed that on 64-bit platforms we're using a 32-bit value of RED_ACTIVE/RED_INACTIVE in the 64-bit memory location set aside for the redzone. Which means that the four bytes immediately before or after the allocated object at 0x00,0x00,0x00,0x00 for LE and BE machines, respectively. Which is probably not the most useful choice of poison value. One way to fix both of those at once is just to switch to 64-bit redzones in all cases. Signed-off-by: David Woodhouse <dwmw2@infradead.org> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 15:22:59 +08:00
sizeof(unsigned long long));
}
static void **dbg_userword(struct kmem_cache *cachep, void *objp)
{
BUG_ON(!(cachep->flags & SLAB_STORE_USER));
return (void **)(objp + cachep->size - BYTES_PER_WORD);
}
#else
#define obj_offset(x) 0
Increase slab redzone to 64bits There are two problems with the existing redzone implementation. Firstly, it's causing misalignment of structures which contain a 64-bit integer, such as netfilter's 'struct ipt_entry' -- causing netfilter modules to fail to load because of the misalignment. (In particular, the first check in net/ipv4/netfilter/ip_tables.c::check_entry_size_and_hooks()) On ppc32 and sparc32, amongst others, __alignof__(uint64_t) == 8. With slab debugging, we use 32-bit redzones. And allocated slab objects aren't sufficiently aligned to hold a structure containing a uint64_t. By _just_ setting ARCH_KMALLOC_MINALIGN to __alignof__(u64) we'd disable redzone checks on those architectures. By using 64-bit redzones we avoid that loss of debugging, and also fix the other problem while we're at it. When investigating this, I noticed that on 64-bit platforms we're using a 32-bit value of RED_ACTIVE/RED_INACTIVE in the 64-bit memory location set aside for the redzone. Which means that the four bytes immediately before or after the allocated object at 0x00,0x00,0x00,0x00 for LE and BE machines, respectively. Which is probably not the most useful choice of poison value. One way to fix both of those at once is just to switch to 64-bit redzones in all cases. Signed-off-by: David Woodhouse <dwmw2@infradead.org> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 15:22:59 +08:00
#define dbg_redzone1(cachep, objp) ({BUG(); (unsigned long long *)NULL;})
#define dbg_redzone2(cachep, objp) ({BUG(); (unsigned long long *)NULL;})
#define dbg_userword(cachep, objp) ({BUG(); (void **)NULL;})
#endif
slab: fix oops when reading /proc/slab_allocators Commit b1cb0982bdd6 ("change the management method of free objects of the slab") introduced a bug on slab leak detector ('/proc/slab_allocators'). This detector works like as following decription. 1. traverse all objects on all the slabs. 2. determine whether it is active or not. 3. if active, print who allocate this object. but that commit changed the way how to manage free objects, so the logic determining whether it is active or not is also changed. In before, we regard object in cpu caches as inactive one, but, with this commit, we mistakenly regard object in cpu caches as active one. This intoduces kernel oops if DEBUG_PAGEALLOC is enabled. If DEBUG_PAGEALLOC is enabled, kernel_map_pages() is used to detect who corrupt free memory in the slab. It unmaps page table mapping if object is free and map it if object is active. When slab leak detector check object in cpu caches, it mistakenly think this object active so try to access object memory to retrieve caller of allocation. At this point, page table mapping to this object doesn't exist, so oops occurs. Following is oops message reported from Dave. It blew up when something tried to read /proc/slab_allocators (Just cat it, and you should see the oops below) Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: [snip...] CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131 task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000 RIP: 0010:[<ffffffffaa1a8f4a>] [<ffffffffaa1a8f4a>] handle_slab+0x8a/0x180 RSP: 0018:ffff880076925de0 EFLAGS: 00010002 RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7 RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000 RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000 R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0 FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0 DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602 Call Trace: leaks_show+0xce/0x240 seq_read+0x28e/0x490 proc_reg_read+0x3d/0x80 vfs_read+0x9b/0x160 SyS_read+0x58/0xb0 tracesys+0xd4/0xd9 Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 <4d> 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46 RIP handle_slab+0x8a/0x180 To fix the problem, I introduce an object status buffer on each slab. With this, we can track object status precisely, so slab leak detector would not access active object and no kernel oops would occur. Memory overhead caused by this fix is only imposed to CONFIG_DEBUG_SLAB_LEAK which is mainly used for debugging, so memory overhead isn't big problem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Dave Jones <davej@redhat.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-24 04:22:06 +08:00
#define OBJECT_FREE (0)
#define OBJECT_ACTIVE (1)
#ifdef CONFIG_DEBUG_SLAB_LEAK
static void set_obj_status(struct page *page, int idx, int val)
{
int freelist_size;
char *status;
struct kmem_cache *cachep = page->slab_cache;
freelist_size = cachep->num * sizeof(freelist_idx_t);
status = (char *)page->freelist + freelist_size;
status[idx] = val;
}
static inline unsigned int get_obj_status(struct page *page, int idx)
{
int freelist_size;
char *status;
struct kmem_cache *cachep = page->slab_cache;
freelist_size = cachep->num * sizeof(freelist_idx_t);
status = (char *)page->freelist + freelist_size;
return status[idx];
}
#else
static inline void set_obj_status(struct page *page, int idx, int val) {}
#endif
/*
* Do not go above this order unless 0 objects fit into the slab or
* overridden on the command line.
*/
#define SLAB_MAX_ORDER_HI 1
#define SLAB_MAX_ORDER_LO 0
static int slab_max_order = SLAB_MAX_ORDER_LO;
static bool slab_max_order_set __initdata;
static inline struct kmem_cache *virt_to_cache(const void *obj)
{
struct page *page = virt_to_head_page(obj);
return page->slab_cache;
}
static inline void *index_to_obj(struct kmem_cache *cache, struct page *page,
unsigned int idx)
{
return page->s_mem + cache->size * idx;
}
2006-12-13 16:34:27 +08:00
/*
* We want to avoid an expensive divide : (offset / cache->size)
* Using the fact that size is a constant for a particular cache,
* we can replace (offset / cache->size) by
2006-12-13 16:34:27 +08:00
* reciprocal_divide(offset, cache->reciprocal_buffer_size)
*/
static inline unsigned int obj_to_index(const struct kmem_cache *cache,
const struct page *page, void *obj)
{
u32 offset = (obj - page->s_mem);
2006-12-13 16:34:27 +08:00
return reciprocal_divide(offset, cache->reciprocal_buffer_size);
}
/* internal cache of cache description objs */
static struct kmem_cache kmem_cache_boot = {
.batchcount = 1,
.limit = BOOT_CPUCACHE_ENTRIES,
.shared = 1,
.size = sizeof(struct kmem_cache),
.name = "kmem_cache",
};
Revert "slab: remove BAD_ALIEN_MAGIC" This reverts commit a640616822b2 ("slab: remove BAD_ALIEN_MAGIC"). commit a640616822b2 ("slab: remove BAD_ALIEN_MAGIC") assumes that the system with !CONFIG_NUMA has only one memory node. But, it turns out to be false by the report from Geert. His system, m68k, has many memory nodes and is configured in !CONFIG_NUMA. So it couldn't boot with above change. Here goes his failure report. With latest mainline, I'm getting a crash during bootup on m68k/ARAnyM: enable_cpucache failed for radix_tree_node, error 12. kernel BUG at /scratch/geert/linux/linux-m68k/mm/slab.c:1522! *** TRAP #7 *** FORMAT=0 Current process id is 0 BAD KERNEL TRAP: 00000000 Modules linked in: PC: [<0039c92c>] kmem_cache_init_late+0x70/0x8c SR: 2200 SP: 00345f90 a2: 0034c2e8 d0: 0000003d d1: 00000000 d2: 00000000 d3: 003ac942 d4: 00000000 d5: 00000000 a0: 0034f686 a1: 0034f682 Process swapper (pid: 0, task=0034c2e8) Frame format=0 Stack from 00345fc4: 002f69ef 002ff7e5 000005f2 000360fa 0017d806 003921d4 00000000 00000000 00000000 00000000 00000000 00000000 003ac942 00000000 003912d6 Call Trace: [<000360fa>] parse_args+0x0/0x2ca [<0017d806>] strlen+0x0/0x1a [<003921d4>] start_kernel+0x23c/0x428 [<003912d6>] _sinittext+0x2d6/0x95e Code: f7e5 4879 002f 69ef 61ff ffca 462a 4e47 <4879> 0035 4b1c 61ff fff0 0cc4 7005 23c0 0037 fd20 588f 265f 285f 4e75 48e7 301c Disabling lock debugging due to kernel taint Kernel panic - not syncing: Attempted to kill the idle task! Although there is a alternative way to fix this issue such as disabling use of alien cache on !CONFIG_NUMA, but, reverting issued commit is better to me in this time. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-09 05:19:15 +08:00
#define BAD_ALIEN_MAGIC 0x01020304ul
static DEFINE_PER_CPU(struct delayed_work, slab_reap_work);
static inline struct array_cache *cpu_cache_get(struct kmem_cache *cachep)
{
return this_cpu_ptr(cachep->cpu_cache);
}
slab: fix oops when reading /proc/slab_allocators Commit b1cb0982bdd6 ("change the management method of free objects of the slab") introduced a bug on slab leak detector ('/proc/slab_allocators'). This detector works like as following decription. 1. traverse all objects on all the slabs. 2. determine whether it is active or not. 3. if active, print who allocate this object. but that commit changed the way how to manage free objects, so the logic determining whether it is active or not is also changed. In before, we regard object in cpu caches as inactive one, but, with this commit, we mistakenly regard object in cpu caches as active one. This intoduces kernel oops if DEBUG_PAGEALLOC is enabled. If DEBUG_PAGEALLOC is enabled, kernel_map_pages() is used to detect who corrupt free memory in the slab. It unmaps page table mapping if object is free and map it if object is active. When slab leak detector check object in cpu caches, it mistakenly think this object active so try to access object memory to retrieve caller of allocation. At this point, page table mapping to this object doesn't exist, so oops occurs. Following is oops message reported from Dave. It blew up when something tried to read /proc/slab_allocators (Just cat it, and you should see the oops below) Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: [snip...] CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131 task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000 RIP: 0010:[<ffffffffaa1a8f4a>] [<ffffffffaa1a8f4a>] handle_slab+0x8a/0x180 RSP: 0018:ffff880076925de0 EFLAGS: 00010002 RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7 RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000 RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000 R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0 FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0 DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602 Call Trace: leaks_show+0xce/0x240 seq_read+0x28e/0x490 proc_reg_read+0x3d/0x80 vfs_read+0x9b/0x160 SyS_read+0x58/0xb0 tracesys+0xd4/0xd9 Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 <4d> 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46 RIP handle_slab+0x8a/0x180 To fix the problem, I introduce an object status buffer on each slab. With this, we can track object status precisely, so slab leak detector would not access active object and no kernel oops would occur. Memory overhead caused by this fix is only imposed to CONFIG_DEBUG_SLAB_LEAK which is mainly used for debugging, so memory overhead isn't big problem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Dave Jones <davej@redhat.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-24 04:22:06 +08:00
static size_t calculate_freelist_size(int nr_objs, size_t align)
{
size_t freelist_size;
freelist_size = nr_objs * sizeof(freelist_idx_t);
if (IS_ENABLED(CONFIG_DEBUG_SLAB_LEAK))
freelist_size += nr_objs * sizeof(char);
if (align)
freelist_size = ALIGN(freelist_size, align);
return freelist_size;
}
static int calculate_nr_objs(size_t slab_size, size_t buffer_size,
size_t idx_size, size_t align)
{
int nr_objs;
slab: fix oops when reading /proc/slab_allocators Commit b1cb0982bdd6 ("change the management method of free objects of the slab") introduced a bug on slab leak detector ('/proc/slab_allocators'). This detector works like as following decription. 1. traverse all objects on all the slabs. 2. determine whether it is active or not. 3. if active, print who allocate this object. but that commit changed the way how to manage free objects, so the logic determining whether it is active or not is also changed. In before, we regard object in cpu caches as inactive one, but, with this commit, we mistakenly regard object in cpu caches as active one. This intoduces kernel oops if DEBUG_PAGEALLOC is enabled. If DEBUG_PAGEALLOC is enabled, kernel_map_pages() is used to detect who corrupt free memory in the slab. It unmaps page table mapping if object is free and map it if object is active. When slab leak detector check object in cpu caches, it mistakenly think this object active so try to access object memory to retrieve caller of allocation. At this point, page table mapping to this object doesn't exist, so oops occurs. Following is oops message reported from Dave. It blew up when something tried to read /proc/slab_allocators (Just cat it, and you should see the oops below) Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: [snip...] CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131 task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000 RIP: 0010:[<ffffffffaa1a8f4a>] [<ffffffffaa1a8f4a>] handle_slab+0x8a/0x180 RSP: 0018:ffff880076925de0 EFLAGS: 00010002 RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7 RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000 RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000 R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0 FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0 DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602 Call Trace: leaks_show+0xce/0x240 seq_read+0x28e/0x490 proc_reg_read+0x3d/0x80 vfs_read+0x9b/0x160 SyS_read+0x58/0xb0 tracesys+0xd4/0xd9 Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 <4d> 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46 RIP handle_slab+0x8a/0x180 To fix the problem, I introduce an object status buffer on each slab. With this, we can track object status precisely, so slab leak detector would not access active object and no kernel oops would occur. Memory overhead caused by this fix is only imposed to CONFIG_DEBUG_SLAB_LEAK which is mainly used for debugging, so memory overhead isn't big problem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Dave Jones <davej@redhat.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-24 04:22:06 +08:00
size_t remained_size;
size_t freelist_size;
slab: fix oops when reading /proc/slab_allocators Commit b1cb0982bdd6 ("change the management method of free objects of the slab") introduced a bug on slab leak detector ('/proc/slab_allocators'). This detector works like as following decription. 1. traverse all objects on all the slabs. 2. determine whether it is active or not. 3. if active, print who allocate this object. but that commit changed the way how to manage free objects, so the logic determining whether it is active or not is also changed. In before, we regard object in cpu caches as inactive one, but, with this commit, we mistakenly regard object in cpu caches as active one. This intoduces kernel oops if DEBUG_PAGEALLOC is enabled. If DEBUG_PAGEALLOC is enabled, kernel_map_pages() is used to detect who corrupt free memory in the slab. It unmaps page table mapping if object is free and map it if object is active. When slab leak detector check object in cpu caches, it mistakenly think this object active so try to access object memory to retrieve caller of allocation. At this point, page table mapping to this object doesn't exist, so oops occurs. Following is oops message reported from Dave. It blew up when something tried to read /proc/slab_allocators (Just cat it, and you should see the oops below) Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: [snip...] CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131 task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000 RIP: 0010:[<ffffffffaa1a8f4a>] [<ffffffffaa1a8f4a>] handle_slab+0x8a/0x180 RSP: 0018:ffff880076925de0 EFLAGS: 00010002 RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7 RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000 RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000 R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0 FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0 DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602 Call Trace: leaks_show+0xce/0x240 seq_read+0x28e/0x490 proc_reg_read+0x3d/0x80 vfs_read+0x9b/0x160 SyS_read+0x58/0xb0 tracesys+0xd4/0xd9 Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 <4d> 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46 RIP handle_slab+0x8a/0x180 To fix the problem, I introduce an object status buffer on each slab. With this, we can track object status precisely, so slab leak detector would not access active object and no kernel oops would occur. Memory overhead caused by this fix is only imposed to CONFIG_DEBUG_SLAB_LEAK which is mainly used for debugging, so memory overhead isn't big problem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Dave Jones <davej@redhat.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-24 04:22:06 +08:00
int extra_space = 0;
slab: fix oops when reading /proc/slab_allocators Commit b1cb0982bdd6 ("change the management method of free objects of the slab") introduced a bug on slab leak detector ('/proc/slab_allocators'). This detector works like as following decription. 1. traverse all objects on all the slabs. 2. determine whether it is active or not. 3. if active, print who allocate this object. but that commit changed the way how to manage free objects, so the logic determining whether it is active or not is also changed. In before, we regard object in cpu caches as inactive one, but, with this commit, we mistakenly regard object in cpu caches as active one. This intoduces kernel oops if DEBUG_PAGEALLOC is enabled. If DEBUG_PAGEALLOC is enabled, kernel_map_pages() is used to detect who corrupt free memory in the slab. It unmaps page table mapping if object is free and map it if object is active. When slab leak detector check object in cpu caches, it mistakenly think this object active so try to access object memory to retrieve caller of allocation. At this point, page table mapping to this object doesn't exist, so oops occurs. Following is oops message reported from Dave. It blew up when something tried to read /proc/slab_allocators (Just cat it, and you should see the oops below) Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: [snip...] CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131 task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000 RIP: 0010:[<ffffffffaa1a8f4a>] [<ffffffffaa1a8f4a>] handle_slab+0x8a/0x180 RSP: 0018:ffff880076925de0 EFLAGS: 00010002 RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7 RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000 RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000 R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0 FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0 DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602 Call Trace: leaks_show+0xce/0x240 seq_read+0x28e/0x490 proc_reg_read+0x3d/0x80 vfs_read+0x9b/0x160 SyS_read+0x58/0xb0 tracesys+0xd4/0xd9 Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 <4d> 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46 RIP handle_slab+0x8a/0x180 To fix the problem, I introduce an object status buffer on each slab. With this, we can track object status precisely, so slab leak detector would not access active object and no kernel oops would occur. Memory overhead caused by this fix is only imposed to CONFIG_DEBUG_SLAB_LEAK which is mainly used for debugging, so memory overhead isn't big problem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Dave Jones <davej@redhat.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-24 04:22:06 +08:00
if (IS_ENABLED(CONFIG_DEBUG_SLAB_LEAK))
extra_space = sizeof(char);
/*
* Ignore padding for the initial guess. The padding
* is at most @align-1 bytes, and @buffer_size is at
* least @align. In the worst case, this result will
* be one greater than the number of objects that fit
* into the memory allocation when taking the padding
* into account.
*/
slab: fix oops when reading /proc/slab_allocators Commit b1cb0982bdd6 ("change the management method of free objects of the slab") introduced a bug on slab leak detector ('/proc/slab_allocators'). This detector works like as following decription. 1. traverse all objects on all the slabs. 2. determine whether it is active or not. 3. if active, print who allocate this object. but that commit changed the way how to manage free objects, so the logic determining whether it is active or not is also changed. In before, we regard object in cpu caches as inactive one, but, with this commit, we mistakenly regard object in cpu caches as active one. This intoduces kernel oops if DEBUG_PAGEALLOC is enabled. If DEBUG_PAGEALLOC is enabled, kernel_map_pages() is used to detect who corrupt free memory in the slab. It unmaps page table mapping if object is free and map it if object is active. When slab leak detector check object in cpu caches, it mistakenly think this object active so try to access object memory to retrieve caller of allocation. At this point, page table mapping to this object doesn't exist, so oops occurs. Following is oops message reported from Dave. It blew up when something tried to read /proc/slab_allocators (Just cat it, and you should see the oops below) Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: [snip...] CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131 task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000 RIP: 0010:[<ffffffffaa1a8f4a>] [<ffffffffaa1a8f4a>] handle_slab+0x8a/0x180 RSP: 0018:ffff880076925de0 EFLAGS: 00010002 RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7 RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000 RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000 R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0 FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0 DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602 Call Trace: leaks_show+0xce/0x240 seq_read+0x28e/0x490 proc_reg_read+0x3d/0x80 vfs_read+0x9b/0x160 SyS_read+0x58/0xb0 tracesys+0xd4/0xd9 Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 <4d> 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46 RIP handle_slab+0x8a/0x180 To fix the problem, I introduce an object status buffer on each slab. With this, we can track object status precisely, so slab leak detector would not access active object and no kernel oops would occur. Memory overhead caused by this fix is only imposed to CONFIG_DEBUG_SLAB_LEAK which is mainly used for debugging, so memory overhead isn't big problem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Dave Jones <davej@redhat.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-24 04:22:06 +08:00
nr_objs = slab_size / (buffer_size + idx_size + extra_space);
/*
* This calculated number will be either the right
* amount, or one greater than what we want.
*/
slab: fix oops when reading /proc/slab_allocators Commit b1cb0982bdd6 ("change the management method of free objects of the slab") introduced a bug on slab leak detector ('/proc/slab_allocators'). This detector works like as following decription. 1. traverse all objects on all the slabs. 2. determine whether it is active or not. 3. if active, print who allocate this object. but that commit changed the way how to manage free objects, so the logic determining whether it is active or not is also changed. In before, we regard object in cpu caches as inactive one, but, with this commit, we mistakenly regard object in cpu caches as active one. This intoduces kernel oops if DEBUG_PAGEALLOC is enabled. If DEBUG_PAGEALLOC is enabled, kernel_map_pages() is used to detect who corrupt free memory in the slab. It unmaps page table mapping if object is free and map it if object is active. When slab leak detector check object in cpu caches, it mistakenly think this object active so try to access object memory to retrieve caller of allocation. At this point, page table mapping to this object doesn't exist, so oops occurs. Following is oops message reported from Dave. It blew up when something tried to read /proc/slab_allocators (Just cat it, and you should see the oops below) Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: [snip...] CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131 task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000 RIP: 0010:[<ffffffffaa1a8f4a>] [<ffffffffaa1a8f4a>] handle_slab+0x8a/0x180 RSP: 0018:ffff880076925de0 EFLAGS: 00010002 RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7 RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000 RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000 R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0 FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0 DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602 Call Trace: leaks_show+0xce/0x240 seq_read+0x28e/0x490 proc_reg_read+0x3d/0x80 vfs_read+0x9b/0x160 SyS_read+0x58/0xb0 tracesys+0xd4/0xd9 Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 <4d> 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46 RIP handle_slab+0x8a/0x180 To fix the problem, I introduce an object status buffer on each slab. With this, we can track object status precisely, so slab leak detector would not access active object and no kernel oops would occur. Memory overhead caused by this fix is only imposed to CONFIG_DEBUG_SLAB_LEAK which is mainly used for debugging, so memory overhead isn't big problem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Dave Jones <davej@redhat.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-24 04:22:06 +08:00
remained_size = slab_size - nr_objs * buffer_size;
freelist_size = calculate_freelist_size(nr_objs, align);
if (remained_size < freelist_size)
nr_objs--;
return nr_objs;
}
/*
* Calculate the number of objects and left-over bytes for a given buffer size.
*/
static void cache_estimate(unsigned long gfporder, size_t buffer_size,
size_t align, int flags, size_t *left_over,
unsigned int *num)
{
int nr_objs;
size_t mgmt_size;
size_t slab_size = PAGE_SIZE << gfporder;
/*
* The slab management structure can be either off the slab or
* on it. For the latter case, the memory allocated for a
* slab is used for:
*
* - One unsigned int for each object
* - Padding to respect alignment of @align
* - @buffer_size bytes for each object
*
* If the slab management structure is off the slab, then the
* alignment will already be calculated into the size. Because
* the slabs are all pages aligned, the objects will be at the
* correct alignment when allocated.
*/
if (flags & CFLGS_OFF_SLAB) {
mgmt_size = 0;
nr_objs = slab_size / buffer_size;
} else {
nr_objs = calculate_nr_objs(slab_size, buffer_size,
slab: introduce byte sized index for the freelist of a slab Currently, the freelist of a slab consist of unsigned int sized indexes. Since most of slabs have less number of objects than 256, large sized indexes is needless. For example, consider the minimum kmalloc slab. It's object size is 32 byte and it would consist of one page, so 256 indexes through byte sized index are enough to contain all possible indexes. There can be some slabs whose object size is 8 byte. We cannot handle this case with byte sized index, so we need to restrict minimum object size. Since these slabs are not major, wasted memory from these slabs would be negligible. Some architectures' page size isn't 4096 bytes and rather larger than 4096 bytes (One example is 64KB page size on PPC or IA64) so that byte sized index doesn't fit to them. In this case, we will use two bytes sized index. Below is some number for this patch. * Before * kmalloc-512 525 640 512 8 1 : tunables 54 27 0 : slabdata 80 80 0 kmalloc-256 210 210 256 15 1 : tunables 120 60 0 : slabdata 14 14 0 kmalloc-192 1016 1040 192 20 1 : tunables 120 60 0 : slabdata 52 52 0 kmalloc-96 560 620 128 31 1 : tunables 120 60 0 : slabdata 20 20 0 kmalloc-64 2148 2280 64 60 1 : tunables 120 60 0 : slabdata 38 38 0 kmalloc-128 647 682 128 31 1 : tunables 120 60 0 : slabdata 22 22 0 kmalloc-32 11360 11413 32 113 1 : tunables 120 60 0 : slabdata 101 101 0 kmem_cache 197 200 192 20 1 : tunables 120 60 0 : slabdata 10 10 0 * After * kmalloc-512 521 648 512 8 1 : tunables 54 27 0 : slabdata 81 81 0 kmalloc-256 208 208 256 16 1 : tunables 120 60 0 : slabdata 13 13 0 kmalloc-192 1029 1029 192 21 1 : tunables 120 60 0 : slabdata 49 49 0 kmalloc-96 529 589 128 31 1 : tunables 120 60 0 : slabdata 19 19 0 kmalloc-64 2142 2142 64 63 1 : tunables 120 60 0 : slabdata 34 34 0 kmalloc-128 660 682 128 31 1 : tunables 120 60 0 : slabdata 22 22 0 kmalloc-32 11716 11780 32 124 1 : tunables 120 60 0 : slabdata 95 95 0 kmem_cache 197 210 192 21 1 : tunables 120 60 0 : slabdata 10 10 0 kmem_caches consisting of objects less than or equal to 256 byte have one or more objects than before. In the case of kmalloc-32, we have 11 more objects, so 352 bytes (11 * 32) are saved and this is roughly 9% saving of memory. Of couse, this percentage decreases as the number of objects in a slab decreases. Here are the performance results on my 4 cpus machine. * Before * Performance counter stats for 'perf bench sched messaging -g 50 -l 1000' (10 runs): 229,945,138 cache-misses ( +- 0.23% ) 11.627897174 seconds time elapsed ( +- 0.14% ) * After * Performance counter stats for 'perf bench sched messaging -g 50 -l 1000' (10 runs): 218,640,472 cache-misses ( +- 0.42% ) 11.504999837 seconds time elapsed ( +- 0.21% ) cache-misses are reduced by this patchset, roughly 5%. And elapsed times are improved by 1%. Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-12-02 16:49:42 +08:00
sizeof(freelist_idx_t), align);
slab: fix oops when reading /proc/slab_allocators Commit b1cb0982bdd6 ("change the management method of free objects of the slab") introduced a bug on slab leak detector ('/proc/slab_allocators'). This detector works like as following decription. 1. traverse all objects on all the slabs. 2. determine whether it is active or not. 3. if active, print who allocate this object. but that commit changed the way how to manage free objects, so the logic determining whether it is active or not is also changed. In before, we regard object in cpu caches as inactive one, but, with this commit, we mistakenly regard object in cpu caches as active one. This intoduces kernel oops if DEBUG_PAGEALLOC is enabled. If DEBUG_PAGEALLOC is enabled, kernel_map_pages() is used to detect who corrupt free memory in the slab. It unmaps page table mapping if object is free and map it if object is active. When slab leak detector check object in cpu caches, it mistakenly think this object active so try to access object memory to retrieve caller of allocation. At this point, page table mapping to this object doesn't exist, so oops occurs. Following is oops message reported from Dave. It blew up when something tried to read /proc/slab_allocators (Just cat it, and you should see the oops below) Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: [snip...] CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131 task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000 RIP: 0010:[<ffffffffaa1a8f4a>] [<ffffffffaa1a8f4a>] handle_slab+0x8a/0x180 RSP: 0018:ffff880076925de0 EFLAGS: 00010002 RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7 RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000 RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000 R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0 FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0 DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602 Call Trace: leaks_show+0xce/0x240 seq_read+0x28e/0x490 proc_reg_read+0x3d/0x80 vfs_read+0x9b/0x160 SyS_read+0x58/0xb0 tracesys+0xd4/0xd9 Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 <4d> 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46 RIP handle_slab+0x8a/0x180 To fix the problem, I introduce an object status buffer on each slab. With this, we can track object status precisely, so slab leak detector would not access active object and no kernel oops would occur. Memory overhead caused by this fix is only imposed to CONFIG_DEBUG_SLAB_LEAK which is mainly used for debugging, so memory overhead isn't big problem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Dave Jones <davej@redhat.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-24 04:22:06 +08:00
mgmt_size = calculate_freelist_size(nr_objs, align);
}
*num = nr_objs;
*left_over = slab_size - nr_objs*buffer_size - mgmt_size;
}
#if DEBUG
#define slab_error(cachep, msg) __slab_error(__func__, cachep, msg)
static void __slab_error(const char *function, struct kmem_cache *cachep,
char *msg)
{
printk(KERN_ERR "slab error in %s(): cache `%s': %s\n",
function, cachep->name, msg);
dump_stack();
add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
}
#endif
/*
* By default on NUMA we use alien caches to stage the freeing of
* objects allocated from other nodes. This causes massive memory
* inefficiencies when using fake NUMA setup to split memory into a
* large number of small nodes, so it can be disabled on the command
* line
*/
static int use_alien_caches __read_mostly = 1;
static int __init noaliencache_setup(char *s)
{
use_alien_caches = 0;
return 1;
}
__setup("noaliencache", noaliencache_setup);
static int __init slab_max_order_setup(char *str)
{
get_option(&str, &slab_max_order);
slab_max_order = slab_max_order < 0 ? 0 :
min(slab_max_order, MAX_ORDER - 1);
slab_max_order_set = true;
return 1;
}
__setup("slab_max_order=", slab_max_order_setup);
[PATCH] slab: Node rotor for freeing alien caches and remote per cpu pages. The cache reaper currently tries to free all alien caches and all remote per cpu pages in each pass of cache_reap. For a machines with large number of nodes (such as Altix) this may lead to sporadic delays of around ~10ms. Interrupts are disabled while reclaiming creating unacceptable delays. This patch changes that behavior by adding a per cpu reap_node variable. Instead of attempting to free all caches, we free only one alien cache and the per cpu pages from one remote node. That reduces the time spend in cache_reap. However, doing so will lengthen the time it takes to completely drain all remote per cpu pagesets and all alien caches. The time needed will grow with the number of nodes in the system. All caches are drained when they overflow their respective capacity. So the drawback here is only that a bit of memory may be wasted for awhile longer. Details: 1. Rename drain_remote_pages to drain_node_pages to allow the specification of the node to drain of pcp pages. 2. Add additional functions init_reap_node, next_reap_node for NUMA that manage a per cpu reap_node counter. 3. Add a reap_alien function that reaps only from the current reap_node. For us this seems to be a critical issue. Holdoffs of an average of ~7ms cause some HPC benchmarks to slow down significantly. F.e. NAS parallel slows down dramatically. NAS parallel has a 12-16 seconds runtime w/o rotor compared to 5.8 secs with the rotor patches. It gets down to 5.05 secs with the additional interrupt holdoff reductions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-10 09:33:54 +08:00
#ifdef CONFIG_NUMA
/*
* Special reaping functions for NUMA systems called from cache_reap().
* These take care of doing round robin flushing of alien caches (containing
* objects freed on different nodes from which they were allocated) and the
* flushing of remote pcps by calling drain_node_pages.
*/
static DEFINE_PER_CPU(unsigned long, slab_reap_node);
[PATCH] slab: Node rotor for freeing alien caches and remote per cpu pages. The cache reaper currently tries to free all alien caches and all remote per cpu pages in each pass of cache_reap. For a machines with large number of nodes (such as Altix) this may lead to sporadic delays of around ~10ms. Interrupts are disabled while reclaiming creating unacceptable delays. This patch changes that behavior by adding a per cpu reap_node variable. Instead of attempting to free all caches, we free only one alien cache and the per cpu pages from one remote node. That reduces the time spend in cache_reap. However, doing so will lengthen the time it takes to completely drain all remote per cpu pagesets and all alien caches. The time needed will grow with the number of nodes in the system. All caches are drained when they overflow their respective capacity. So the drawback here is only that a bit of memory may be wasted for awhile longer. Details: 1. Rename drain_remote_pages to drain_node_pages to allow the specification of the node to drain of pcp pages. 2. Add additional functions init_reap_node, next_reap_node for NUMA that manage a per cpu reap_node counter. 3. Add a reap_alien function that reaps only from the current reap_node. For us this seems to be a critical issue. Holdoffs of an average of ~7ms cause some HPC benchmarks to slow down significantly. F.e. NAS parallel slows down dramatically. NAS parallel has a 12-16 seconds runtime w/o rotor compared to 5.8 secs with the rotor patches. It gets down to 5.05 secs with the additional interrupt holdoff reductions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-10 09:33:54 +08:00
static void init_reap_node(int cpu)
{
int node;
numa: slab: use numa_mem_id() for slab local memory node Example usage of generic "numa_mem_id()": The mainline slab code, since ~ 2.6.19, does not handle memoryless nodes well. Specifically, the "fast path"--____cache_alloc()--will never succeed as slab doesn't cache offnode object on the per cpu queues, and for memoryless nodes, all memory will be "off node" relative to numa_node_id(). This adds significant overhead to all kmem cache allocations, incurring a significant regression relative to earlier kernels [from before slab.c was reorganized]. This patch uses the generic topology function "numa_mem_id()" to return the "effective local memory node" for the calling context. This is the first node in the local node's generic fallback zonelist-- the same node that "local" mempolicy-based allocations would use. This lets slab cache these "local" allocations and avoid fallback/refill on every allocation. N.B.: Slab will need to handle node and memory hotplug events that could change the value returned by numa_mem_id() for any given node if recent changes to address memory hotplug don't already address this. E.g., flush all per cpu slab queues before rebuilding the zonelists while the "machine" is held in the stopped state. Performance impact on "hackbench 400 process 200" 2.6.34-rc3-mmotm-100405-1609 no-patch this-patch ia64 no memoryless nodes [avg of 10]: 11.713 11.637 ~0.65 diff ia64 cpus all on memless nodes [10]: 228.259 26.484 ~8.6x speedup The slowdown of the patched kernel from ~12 sec to ~28 seconds when configured with memoryless nodes is the result of all cpus allocating from a single node's mm pagepool. The cache lines of the single node are distributed/interleaved over the memory of the real physical nodes, but the zone lock, list heads, ... of the single node with memory still each live in a single cache line that is accessed from all processors. x86_64 [8x6 AMD] [avg of 40]: 2.883 2.845 Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 05:45:03 +08:00
node = next_node(cpu_to_mem(cpu), node_online_map);
[PATCH] slab: Node rotor for freeing alien caches and remote per cpu pages. The cache reaper currently tries to free all alien caches and all remote per cpu pages in each pass of cache_reap. For a machines with large number of nodes (such as Altix) this may lead to sporadic delays of around ~10ms. Interrupts are disabled while reclaiming creating unacceptable delays. This patch changes that behavior by adding a per cpu reap_node variable. Instead of attempting to free all caches, we free only one alien cache and the per cpu pages from one remote node. That reduces the time spend in cache_reap. However, doing so will lengthen the time it takes to completely drain all remote per cpu pagesets and all alien caches. The time needed will grow with the number of nodes in the system. All caches are drained when they overflow their respective capacity. So the drawback here is only that a bit of memory may be wasted for awhile longer. Details: 1. Rename drain_remote_pages to drain_node_pages to allow the specification of the node to drain of pcp pages. 2. Add additional functions init_reap_node, next_reap_node for NUMA that manage a per cpu reap_node counter. 3. Add a reap_alien function that reaps only from the current reap_node. For us this seems to be a critical issue. Holdoffs of an average of ~7ms cause some HPC benchmarks to slow down significantly. F.e. NAS parallel slows down dramatically. NAS parallel has a 12-16 seconds runtime w/o rotor compared to 5.8 secs with the rotor patches. It gets down to 5.05 secs with the additional interrupt holdoff reductions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-10 09:33:54 +08:00
if (node == MAX_NUMNODES)
node = first_node(node_online_map);
[PATCH] slab: Node rotor for freeing alien caches and remote per cpu pages. The cache reaper currently tries to free all alien caches and all remote per cpu pages in each pass of cache_reap. For a machines with large number of nodes (such as Altix) this may lead to sporadic delays of around ~10ms. Interrupts are disabled while reclaiming creating unacceptable delays. This patch changes that behavior by adding a per cpu reap_node variable. Instead of attempting to free all caches, we free only one alien cache and the per cpu pages from one remote node. That reduces the time spend in cache_reap. However, doing so will lengthen the time it takes to completely drain all remote per cpu pagesets and all alien caches. The time needed will grow with the number of nodes in the system. All caches are drained when they overflow their respective capacity. So the drawback here is only that a bit of memory may be wasted for awhile longer. Details: 1. Rename drain_remote_pages to drain_node_pages to allow the specification of the node to drain of pcp pages. 2. Add additional functions init_reap_node, next_reap_node for NUMA that manage a per cpu reap_node counter. 3. Add a reap_alien function that reaps only from the current reap_node. For us this seems to be a critical issue. Holdoffs of an average of ~7ms cause some HPC benchmarks to slow down significantly. F.e. NAS parallel slows down dramatically. NAS parallel has a 12-16 seconds runtime w/o rotor compared to 5.8 secs with the rotor patches. It gets down to 5.05 secs with the additional interrupt holdoff reductions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-10 09:33:54 +08:00
per_cpu(slab_reap_node, cpu) = node;
[PATCH] slab: Node rotor for freeing alien caches and remote per cpu pages. The cache reaper currently tries to free all alien caches and all remote per cpu pages in each pass of cache_reap. For a machines with large number of nodes (such as Altix) this may lead to sporadic delays of around ~10ms. Interrupts are disabled while reclaiming creating unacceptable delays. This patch changes that behavior by adding a per cpu reap_node variable. Instead of attempting to free all caches, we free only one alien cache and the per cpu pages from one remote node. That reduces the time spend in cache_reap. However, doing so will lengthen the time it takes to completely drain all remote per cpu pagesets and all alien caches. The time needed will grow with the number of nodes in the system. All caches are drained when they overflow their respective capacity. So the drawback here is only that a bit of memory may be wasted for awhile longer. Details: 1. Rename drain_remote_pages to drain_node_pages to allow the specification of the node to drain of pcp pages. 2. Add additional functions init_reap_node, next_reap_node for NUMA that manage a per cpu reap_node counter. 3. Add a reap_alien function that reaps only from the current reap_node. For us this seems to be a critical issue. Holdoffs of an average of ~7ms cause some HPC benchmarks to slow down significantly. F.e. NAS parallel slows down dramatically. NAS parallel has a 12-16 seconds runtime w/o rotor compared to 5.8 secs with the rotor patches. It gets down to 5.05 secs with the additional interrupt holdoff reductions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-10 09:33:54 +08:00
}
static void next_reap_node(void)
{
int node = __this_cpu_read(slab_reap_node);
[PATCH] slab: Node rotor for freeing alien caches and remote per cpu pages. The cache reaper currently tries to free all alien caches and all remote per cpu pages in each pass of cache_reap. For a machines with large number of nodes (such as Altix) this may lead to sporadic delays of around ~10ms. Interrupts are disabled while reclaiming creating unacceptable delays. This patch changes that behavior by adding a per cpu reap_node variable. Instead of attempting to free all caches, we free only one alien cache and the per cpu pages from one remote node. That reduces the time spend in cache_reap. However, doing so will lengthen the time it takes to completely drain all remote per cpu pagesets and all alien caches. The time needed will grow with the number of nodes in the system. All caches are drained when they overflow their respective capacity. So the drawback here is only that a bit of memory may be wasted for awhile longer. Details: 1. Rename drain_remote_pages to drain_node_pages to allow the specification of the node to drain of pcp pages. 2. Add additional functions init_reap_node, next_reap_node for NUMA that manage a per cpu reap_node counter. 3. Add a reap_alien function that reaps only from the current reap_node. For us this seems to be a critical issue. Holdoffs of an average of ~7ms cause some HPC benchmarks to slow down significantly. F.e. NAS parallel slows down dramatically. NAS parallel has a 12-16 seconds runtime w/o rotor compared to 5.8 secs with the rotor patches. It gets down to 5.05 secs with the additional interrupt holdoff reductions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-10 09:33:54 +08:00
node = next_node(node, node_online_map);
if (unlikely(node >= MAX_NUMNODES))
node = first_node(node_online_map);
__this_cpu_write(slab_reap_node, node);
[PATCH] slab: Node rotor for freeing alien caches and remote per cpu pages. The cache reaper currently tries to free all alien caches and all remote per cpu pages in each pass of cache_reap. For a machines with large number of nodes (such as Altix) this may lead to sporadic delays of around ~10ms. Interrupts are disabled while reclaiming creating unacceptable delays. This patch changes that behavior by adding a per cpu reap_node variable. Instead of attempting to free all caches, we free only one alien cache and the per cpu pages from one remote node. That reduces the time spend in cache_reap. However, doing so will lengthen the time it takes to completely drain all remote per cpu pagesets and all alien caches. The time needed will grow with the number of nodes in the system. All caches are drained when they overflow their respective capacity. So the drawback here is only that a bit of memory may be wasted for awhile longer. Details: 1. Rename drain_remote_pages to drain_node_pages to allow the specification of the node to drain of pcp pages. 2. Add additional functions init_reap_node, next_reap_node for NUMA that manage a per cpu reap_node counter. 3. Add a reap_alien function that reaps only from the current reap_node. For us this seems to be a critical issue. Holdoffs of an average of ~7ms cause some HPC benchmarks to slow down significantly. F.e. NAS parallel slows down dramatically. NAS parallel has a 12-16 seconds runtime w/o rotor compared to 5.8 secs with the rotor patches. It gets down to 5.05 secs with the additional interrupt holdoff reductions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-10 09:33:54 +08:00
}
#else
#define init_reap_node(cpu) do { } while (0)
#define next_reap_node(void) do { } while (0)
#endif
/*
* Initiate the reap timer running on the target CPU. We run at around 1 to 2Hz
* via the workqueue/eventd.
* Add the CPU number into the expiration time to minimize the possibility of
* the CPUs getting into lockstep and contending for the global cache chain
* lock.
*/
static void start_cpu_timer(int cpu)
{
struct delayed_work *reap_work = &per_cpu(slab_reap_work, cpu);
/*
* When this gets called from do_initcalls via cpucache_init(),
* init_workqueues() has already run, so keventd will be setup
* at that time.
*/
if (keventd_up() && reap_work->work.func == NULL) {
[PATCH] slab: Node rotor for freeing alien caches and remote per cpu pages. The cache reaper currently tries to free all alien caches and all remote per cpu pages in each pass of cache_reap. For a machines with large number of nodes (such as Altix) this may lead to sporadic delays of around ~10ms. Interrupts are disabled while reclaiming creating unacceptable delays. This patch changes that behavior by adding a per cpu reap_node variable. Instead of attempting to free all caches, we free only one alien cache and the per cpu pages from one remote node. That reduces the time spend in cache_reap. However, doing so will lengthen the time it takes to completely drain all remote per cpu pagesets and all alien caches. The time needed will grow with the number of nodes in the system. All caches are drained when they overflow their respective capacity. So the drawback here is only that a bit of memory may be wasted for awhile longer. Details: 1. Rename drain_remote_pages to drain_node_pages to allow the specification of the node to drain of pcp pages. 2. Add additional functions init_reap_node, next_reap_node for NUMA that manage a per cpu reap_node counter. 3. Add a reap_alien function that reaps only from the current reap_node. For us this seems to be a critical issue. Holdoffs of an average of ~7ms cause some HPC benchmarks to slow down significantly. F.e. NAS parallel slows down dramatically. NAS parallel has a 12-16 seconds runtime w/o rotor compared to 5.8 secs with the rotor patches. It gets down to 5.05 secs with the additional interrupt holdoff reductions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-10 09:33:54 +08:00
init_reap_node(cpu);
INIT_DEFERRABLE_WORK(reap_work, cache_reap);
schedule_delayed_work_on(cpu, reap_work,
__round_jiffies_relative(HZ, cpu));
}
}
static void init_arraycache(struct array_cache *ac, int limit, int batch)
{
/*
* The array_cache structures contain pointers to free object.
* However, when such objects are allocated or transferred to another
* cache the pointers are not cleared and they could be counted as
* valid references during a kmemleak scan. Therefore, kmemleak must
* not scan such objects.
*/
kmemleak_no_scan(ac);
if (ac) {
ac->avail = 0;
ac->limit = limit;
ac->batchcount = batch;
ac->touched = 0;
}
}
static struct array_cache *alloc_arraycache(int node, int entries,
int batchcount, gfp_t gfp)
{
size_t memsize = sizeof(void *) * entries + sizeof(struct array_cache);
struct array_cache *ac = NULL;
ac = kmalloc_node(memsize, gfp, node);
init_arraycache(ac, entries, batchcount);
return ac;
}
static inline bool is_slab_pfmemalloc(struct page *page)
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
{
return PageSlabPfmemalloc(page);
}
/* Clears pfmemalloc_active if no slabs have pfmalloc set */
static void recheck_pfmemalloc_active(struct kmem_cache *cachep,
struct array_cache *ac)
{
struct kmem_cache_node *n = get_node(cachep, numa_mem_id());
struct page *page;
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
unsigned long flags;
if (!pfmemalloc_active)
return;
spin_lock_irqsave(&n->list_lock, flags);
list_for_each_entry(page, &n->slabs_full, lru)
if (is_slab_pfmemalloc(page))
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
goto out;
list_for_each_entry(page, &n->slabs_partial, lru)
if (is_slab_pfmemalloc(page))
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
goto out;
list_for_each_entry(page, &n->slabs_free, lru)
if (is_slab_pfmemalloc(page))
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
goto out;
pfmemalloc_active = false;
out:
spin_unlock_irqrestore(&n->list_lock, flags);
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
}
static void *__ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
gfp_t flags, bool force_refill)
{
int i;
void *objp = ac->entry[--ac->avail];
/* Ensure the caller is allowed to use objects from PFMEMALLOC slab */
if (unlikely(is_obj_pfmemalloc(objp))) {
struct kmem_cache_node *n;
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
if (gfp_pfmemalloc_allowed(flags)) {
clear_obj_pfmemalloc(&objp);
return objp;
}
/* The caller cannot use PFMEMALLOC objects, find another one */
for (i = 0; i < ac->avail; i++) {
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
/* If a !PFMEMALLOC object is found, swap them */
if (!is_obj_pfmemalloc(ac->entry[i])) {
objp = ac->entry[i];
ac->entry[i] = ac->entry[ac->avail];
ac->entry[ac->avail] = objp;
return objp;
}
}
/*
* If there are empty slabs on the slabs_free list and we are
* being forced to refill the cache, mark this one !pfmemalloc.
*/
n = get_node(cachep, numa_mem_id());
if (!list_empty(&n->slabs_free) && force_refill) {
struct page *page = virt_to_head_page(objp);
ClearPageSlabPfmemalloc(page);
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
clear_obj_pfmemalloc(&objp);
recheck_pfmemalloc_active(cachep, ac);
return objp;
}
/* No !PFMEMALLOC objects available */
ac->avail++;
objp = NULL;
}
return objp;
}
static inline void *ac_get_obj(struct kmem_cache *cachep,
struct array_cache *ac, gfp_t flags, bool force_refill)
{
void *objp;
if (unlikely(sk_memalloc_socks()))
objp = __ac_get_obj(cachep, ac, flags, force_refill);
else
objp = ac->entry[--ac->avail];
return objp;
}
static noinline void *__ac_put_obj(struct kmem_cache *cachep,
struct array_cache *ac, void *objp)
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
{
if (unlikely(pfmemalloc_active)) {
/* Some pfmemalloc slabs exist, check if this is one */
struct page *page = virt_to_head_page(objp);
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
if (PageSlabPfmemalloc(page))
set_obj_pfmemalloc(&objp);
}
return objp;
}
static inline void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
void *objp)
{
if (unlikely(sk_memalloc_socks()))
objp = __ac_put_obj(cachep, ac, objp);
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
ac->entry[ac->avail++] = objp;
}
/*
* Transfer objects in one arraycache to another.
* Locking must be handled by the caller.
*
* Return the number of entries transferred.
*/
static int transfer_objects(struct array_cache *to,
struct array_cache *from, unsigned int max)
{
/* Figure out how many entries to transfer */
int nr = min3(from->avail, max, to->limit - to->avail);
if (!nr)
return 0;
memcpy(to->entry + to->avail, from->entry + from->avail -nr,
sizeof(void *) *nr);
from->avail -= nr;
to->avail += nr;
return nr;
}
[PATCH] GFP_THISNODE for the slab allocator This patch insures that the slab node lists in the NUMA case only contain slabs that belong to that specific node. All slab allocations use GFP_THISNODE when calling into the page allocator. If an allocation fails then we fall back in the slab allocator according to the zonelists appropriate for a certain context. This allows a replication of the behavior of alloc_pages and alloc_pages node in the slab layer. Currently allocations requested from the page allocator may be redirected via cpusets to other nodes. This results in remote pages on nodelists and that in turn results in interrupt latency issues during cache draining. Plus the slab is handing out memory as local when it is really remote. Fallback for slab memory allocations will occur within the slab allocator and not in the page allocator. This is necessary in order to be able to use the existing pools of objects on the nodes that we fall back to before adding more pages to a slab. The fallback function insures that the nodes we fall back to obey cpuset restrictions of the current context. We do not allocate objects from outside of the current cpuset context like before. Note that the implementation of locality constraints within the slab allocator requires importing logic from the page allocator. This is a mischmash that is not that great. Other allocators (uncached allocator, vmalloc, huge pages) face similar problems and have similar minimal reimplementations of the basic fallback logic of the page allocator. There is another way of implementing a slab by avoiding per node lists (see modular slab) but this wont work within the existing slab. V1->V2: - Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA - Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another #ifdef [akpm@osdl.org: build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 16:50:08 +08:00
#ifndef CONFIG_NUMA
#define drain_alien_cache(cachep, alien) do { } while (0)
#define reap_alien(cachep, n) do { } while (0)
[PATCH] GFP_THISNODE for the slab allocator This patch insures that the slab node lists in the NUMA case only contain slabs that belong to that specific node. All slab allocations use GFP_THISNODE when calling into the page allocator. If an allocation fails then we fall back in the slab allocator according to the zonelists appropriate for a certain context. This allows a replication of the behavior of alloc_pages and alloc_pages node in the slab layer. Currently allocations requested from the page allocator may be redirected via cpusets to other nodes. This results in remote pages on nodelists and that in turn results in interrupt latency issues during cache draining. Plus the slab is handing out memory as local when it is really remote. Fallback for slab memory allocations will occur within the slab allocator and not in the page allocator. This is necessary in order to be able to use the existing pools of objects on the nodes that we fall back to before adding more pages to a slab. The fallback function insures that the nodes we fall back to obey cpuset restrictions of the current context. We do not allocate objects from outside of the current cpuset context like before. Note that the implementation of locality constraints within the slab allocator requires importing logic from the page allocator. This is a mischmash that is not that great. Other allocators (uncached allocator, vmalloc, huge pages) face similar problems and have similar minimal reimplementations of the basic fallback logic of the page allocator. There is another way of implementing a slab by avoiding per node lists (see modular slab) but this wont work within the existing slab. V1->V2: - Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA - Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another #ifdef [akpm@osdl.org: build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 16:50:08 +08:00
static inline struct alien_cache **alloc_alien_cache(int node,
int limit, gfp_t gfp)
[PATCH] GFP_THISNODE for the slab allocator This patch insures that the slab node lists in the NUMA case only contain slabs that belong to that specific node. All slab allocations use GFP_THISNODE when calling into the page allocator. If an allocation fails then we fall back in the slab allocator according to the zonelists appropriate for a certain context. This allows a replication of the behavior of alloc_pages and alloc_pages node in the slab layer. Currently allocations requested from the page allocator may be redirected via cpusets to other nodes. This results in remote pages on nodelists and that in turn results in interrupt latency issues during cache draining. Plus the slab is handing out memory as local when it is really remote. Fallback for slab memory allocations will occur within the slab allocator and not in the page allocator. This is necessary in order to be able to use the existing pools of objects on the nodes that we fall back to before adding more pages to a slab. The fallback function insures that the nodes we fall back to obey cpuset restrictions of the current context. We do not allocate objects from outside of the current cpuset context like before. Note that the implementation of locality constraints within the slab allocator requires importing logic from the page allocator. This is a mischmash that is not that great. Other allocators (uncached allocator, vmalloc, huge pages) face similar problems and have similar minimal reimplementations of the basic fallback logic of the page allocator. There is another way of implementing a slab by avoiding per node lists (see modular slab) but this wont work within the existing slab. V1->V2: - Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA - Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another #ifdef [akpm@osdl.org: build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 16:50:08 +08:00
{
Revert "slab: remove BAD_ALIEN_MAGIC" This reverts commit a640616822b2 ("slab: remove BAD_ALIEN_MAGIC"). commit a640616822b2 ("slab: remove BAD_ALIEN_MAGIC") assumes that the system with !CONFIG_NUMA has only one memory node. But, it turns out to be false by the report from Geert. His system, m68k, has many memory nodes and is configured in !CONFIG_NUMA. So it couldn't boot with above change. Here goes his failure report. With latest mainline, I'm getting a crash during bootup on m68k/ARAnyM: enable_cpucache failed for radix_tree_node, error 12. kernel BUG at /scratch/geert/linux/linux-m68k/mm/slab.c:1522! *** TRAP #7 *** FORMAT=0 Current process id is 0 BAD KERNEL TRAP: 00000000 Modules linked in: PC: [<0039c92c>] kmem_cache_init_late+0x70/0x8c SR: 2200 SP: 00345f90 a2: 0034c2e8 d0: 0000003d d1: 00000000 d2: 00000000 d3: 003ac942 d4: 00000000 d5: 00000000 a0: 0034f686 a1: 0034f682 Process swapper (pid: 0, task=0034c2e8) Frame format=0 Stack from 00345fc4: 002f69ef 002ff7e5 000005f2 000360fa 0017d806 003921d4 00000000 00000000 00000000 00000000 00000000 00000000 003ac942 00000000 003912d6 Call Trace: [<000360fa>] parse_args+0x0/0x2ca [<0017d806>] strlen+0x0/0x1a [<003921d4>] start_kernel+0x23c/0x428 [<003912d6>] _sinittext+0x2d6/0x95e Code: f7e5 4879 002f 69ef 61ff ffca 462a 4e47 <4879> 0035 4b1c 61ff fff0 0cc4 7005 23c0 0037 fd20 588f 265f 285f 4e75 48e7 301c Disabling lock debugging due to kernel taint Kernel panic - not syncing: Attempted to kill the idle task! Although there is a alternative way to fix this issue such as disabling use of alien cache on !CONFIG_NUMA, but, reverting issued commit is better to me in this time. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-09 05:19:15 +08:00
return (struct alien_cache **)BAD_ALIEN_MAGIC;
[PATCH] GFP_THISNODE for the slab allocator This patch insures that the slab node lists in the NUMA case only contain slabs that belong to that specific node. All slab allocations use GFP_THISNODE when calling into the page allocator. If an allocation fails then we fall back in the slab allocator according to the zonelists appropriate for a certain context. This allows a replication of the behavior of alloc_pages and alloc_pages node in the slab layer. Currently allocations requested from the page allocator may be redirected via cpusets to other nodes. This results in remote pages on nodelists and that in turn results in interrupt latency issues during cache draining. Plus the slab is handing out memory as local when it is really remote. Fallback for slab memory allocations will occur within the slab allocator and not in the page allocator. This is necessary in order to be able to use the existing pools of objects on the nodes that we fall back to before adding more pages to a slab. The fallback function insures that the nodes we fall back to obey cpuset restrictions of the current context. We do not allocate objects from outside of the current cpuset context like before. Note that the implementation of locality constraints within the slab allocator requires importing logic from the page allocator. This is a mischmash that is not that great. Other allocators (uncached allocator, vmalloc, huge pages) face similar problems and have similar minimal reimplementations of the basic fallback logic of the page allocator. There is another way of implementing a slab by avoiding per node lists (see modular slab) but this wont work within the existing slab. V1->V2: - Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA - Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another #ifdef [akpm@osdl.org: build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 16:50:08 +08:00
}
static inline void free_alien_cache(struct alien_cache **ac_ptr)
[PATCH] GFP_THISNODE for the slab allocator This patch insures that the slab node lists in the NUMA case only contain slabs that belong to that specific node. All slab allocations use GFP_THISNODE when calling into the page allocator. If an allocation fails then we fall back in the slab allocator according to the zonelists appropriate for a certain context. This allows a replication of the behavior of alloc_pages and alloc_pages node in the slab layer. Currently allocations requested from the page allocator may be redirected via cpusets to other nodes. This results in remote pages on nodelists and that in turn results in interrupt latency issues during cache draining. Plus the slab is handing out memory as local when it is really remote. Fallback for slab memory allocations will occur within the slab allocator and not in the page allocator. This is necessary in order to be able to use the existing pools of objects on the nodes that we fall back to before adding more pages to a slab. The fallback function insures that the nodes we fall back to obey cpuset restrictions of the current context. We do not allocate objects from outside of the current cpuset context like before. Note that the implementation of locality constraints within the slab allocator requires importing logic from the page allocator. This is a mischmash that is not that great. Other allocators (uncached allocator, vmalloc, huge pages) face similar problems and have similar minimal reimplementations of the basic fallback logic of the page allocator. There is another way of implementing a slab by avoiding per node lists (see modular slab) but this wont work within the existing slab. V1->V2: - Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA - Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another #ifdef [akpm@osdl.org: build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 16:50:08 +08:00
{
}
static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
{
return 0;
}
static inline void *alternate_node_alloc(struct kmem_cache *cachep,
gfp_t flags)
{
return NULL;
}
static inline void *____cache_alloc_node(struct kmem_cache *cachep,
[PATCH] GFP_THISNODE for the slab allocator This patch insures that the slab node lists in the NUMA case only contain slabs that belong to that specific node. All slab allocations use GFP_THISNODE when calling into the page allocator. If an allocation fails then we fall back in the slab allocator according to the zonelists appropriate for a certain context. This allows a replication of the behavior of alloc_pages and alloc_pages node in the slab layer. Currently allocations requested from the page allocator may be redirected via cpusets to other nodes. This results in remote pages on nodelists and that in turn results in interrupt latency issues during cache draining. Plus the slab is handing out memory as local when it is really remote. Fallback for slab memory allocations will occur within the slab allocator and not in the page allocator. This is necessary in order to be able to use the existing pools of objects on the nodes that we fall back to before adding more pages to a slab. The fallback function insures that the nodes we fall back to obey cpuset restrictions of the current context. We do not allocate objects from outside of the current cpuset context like before. Note that the implementation of locality constraints within the slab allocator requires importing logic from the page allocator. This is a mischmash that is not that great. Other allocators (uncached allocator, vmalloc, huge pages) face similar problems and have similar minimal reimplementations of the basic fallback logic of the page allocator. There is another way of implementing a slab by avoiding per node lists (see modular slab) but this wont work within the existing slab. V1->V2: - Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA - Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another #ifdef [akpm@osdl.org: build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 16:50:08 +08:00
gfp_t flags, int nodeid)
{
return NULL;
}
mm: remove GFP_THISNODE NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE. GFP_THISNODE is a secret combination of gfp bits that have different behavior than expected. It is a combination of __GFP_THISNODE, __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator slowpath to fail without trying reclaim even though it may be used in combination with __GFP_WAIT. An example of the problem this creates: commit e97ca8e5b864 ("mm: fix GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE that really just wanted __GFP_THISNODE. The problem doesn't end there, however, because even it was a no-op for alloc_misplaced_dst_page(), which also sets __GFP_NORETRY and __GFP_NOWARN, and migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT is set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is a no-op in these cases since the page allocator special-cases __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN. It's time to just remove GFP_THISNODE entirely. We leave __GFP_THISNODE to restrict an allocation to a local node, but remove GFP_THISNODE and its obscurity. Instead, we require that a caller clear __GFP_WAIT if it wants to avoid reclaim. This allows the aforementioned functions to actually reclaim as they should. It also enables any future callers that want to do __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT. Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so it is unchanged. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Pravin Shelar <pshelar@nicira.com> Cc: Jarno Rajahalme <jrajahalme@nicira.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Greg Thelen <gthelen@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 06:46:55 +08:00
static inline gfp_t gfp_exact_node(gfp_t flags)
{
return flags;
}
[PATCH] GFP_THISNODE for the slab allocator This patch insures that the slab node lists in the NUMA case only contain slabs that belong to that specific node. All slab allocations use GFP_THISNODE when calling into the page allocator. If an allocation fails then we fall back in the slab allocator according to the zonelists appropriate for a certain context. This allows a replication of the behavior of alloc_pages and alloc_pages node in the slab layer. Currently allocations requested from the page allocator may be redirected via cpusets to other nodes. This results in remote pages on nodelists and that in turn results in interrupt latency issues during cache draining. Plus the slab is handing out memory as local when it is really remote. Fallback for slab memory allocations will occur within the slab allocator and not in the page allocator. This is necessary in order to be able to use the existing pools of objects on the nodes that we fall back to before adding more pages to a slab. The fallback function insures that the nodes we fall back to obey cpuset restrictions of the current context. We do not allocate objects from outside of the current cpuset context like before. Note that the implementation of locality constraints within the slab allocator requires importing logic from the page allocator. This is a mischmash that is not that great. Other allocators (uncached allocator, vmalloc, huge pages) face similar problems and have similar minimal reimplementations of the basic fallback logic of the page allocator. There is another way of implementing a slab by avoiding per node lists (see modular slab) but this wont work within the existing slab. V1->V2: - Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA - Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another #ifdef [akpm@osdl.org: build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 16:50:08 +08:00
#else /* CONFIG_NUMA */
static void *____cache_alloc_node(struct kmem_cache *, gfp_t, int);
static void *alternate_node_alloc(struct kmem_cache *, gfp_t);
[PATCH] NUMA policies in the slab allocator V2 This patch fixes a regression in 2.6.14 against 2.6.13 that causes an imbalance in memory allocation during bootup. The slab allocator in 2.6.13 is not numa aware and simply calls alloc_pages(). This means that memory policies may control the behavior of alloc_pages(). During bootup the memory policy is set to MPOL_INTERLEAVE resulting in the spreading out of allocations during bootup over all available nodes. The slab allocator in 2.6.13 has only a single list of slab pages. As a result the per cpu slab cache and the spinlock controlled page lists may contain slab entries from off node memory. The slab allocator in 2.6.13 makes no effort to discern the locality of an entry on its lists. The NUMA aware slab allocator in 2.6.14 controls locality of the slab pages explicitly by calling alloc_pages_node(). The NUMA slab allocator manages slab entries by having lists of available slab pages for each node. The per cpu slab cache can only contain slab entries associated with the node local to the processor. This guarantees that the default allocation mode of the slab allocator always assigns local memory if available. Setting MPOL_INTERLEAVE as a default policy during bootup has no effect anymore. In 2.6.14 all node unspecific slab allocations are performed on the boot processor. This means that most of key data structures are allocated on one node. Most processors will have to refer to these structures making the boot node a potential bottleneck. This may reduce performance and cause unnecessary memory pressure on the boot node. This patch implements NUMA policies in the slab layer. There is the need of explicit application of NUMA memory policies by the slab allcator itself since the NUMA slab allocator does no longer let the page_allocator control locality. The check for policies is made directly at the beginning of __cache_alloc using current->mempolicy. The memory policy is already frequently checked by the page allocator (alloc_page_vma() and alloc_page_current()). So it is highly likely that the cacheline is present. For MPOL_INTERLEAVE kmalloc() will spread out each request to one node after another so that an equal distribution of allocations can be obtained during bootup. It is not possible to push the policy check to lower layers of the NUMA slab allocator since the per cpu caches are now only containing slab entries from the current node. If the policy says that the local node is not to be preferred or forbidden then there is no point in checking the slab cache or local list of slab pages. The allocation better be directed immediately to the lists containing slab entries for the allowed set of nodes. This way of applying policy also fixes another strange behavior in 2.6.13. alloc_pages() is controlled by the memory allocation policy of the current process. It could therefore be that one process is running with MPOL_INTERLEAVE and would f.e. obtain a new page following that policy since no slab entries are in the lists anymore. A page can typically be used for multiple slab entries but lets say that the current process is only using one. The other entries are then added to the slab lists. These are now non local entries in the slab lists despite of the possible availability of local pages that would provide faster access and increase the performance of the application. Another process without MPOL_INTERLEAVE may now run and expect a local slab entry from kmalloc(). However, there are still these free slab entries from the off node page obtained from the other process via MPOL_INTERLEAVE in the cache. The process will then get an off node slab entry although other slab entries may be available that are local to that process. This means that the policy if one process may contaminate the locality of the slab caches for other processes. This patch in effect insures that a per process policy is followed for the allocation of slab entries and that there cannot be a memory policy influence from one process to another. A process with default policy will always get a local slab entry if one is available. And the process using memory policies will get its memory arranged as requested. Off-node slab allocation will require the use of spinlocks and will make the use of per cpu caches not possible. A process using memory policies to redirect allocations offnode will have to cope with additional lock overhead in addition to the latency added by the need to access a remote slab entry. Changes V1->V2 - Remove #ifdef CONFIG_NUMA by moving forward declaration into prior #ifdef CONFIG_NUMA section. - Give the function determining the node number to use a saner name. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-19 09:42:36 +08:00
static struct alien_cache *__alloc_alien_cache(int node, int entries,
int batch, gfp_t gfp)
{
size_t memsize = sizeof(void *) * entries + sizeof(struct alien_cache);
struct alien_cache *alc = NULL;
alc = kmalloc_node(memsize, gfp, node);
init_arraycache(&alc->ac, entries, batch);
spin_lock_init(&alc->lock);
return alc;
}
static struct alien_cache **alloc_alien_cache(int node, int limit, gfp_t gfp)
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
{
struct alien_cache **alc_ptr;
size_t memsize = sizeof(void *) * nr_node_ids;
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
int i;
if (limit > 1)
limit = 12;
alc_ptr = kzalloc_node(memsize, gfp, node);
if (!alc_ptr)
return NULL;
for_each_node(i) {
if (i == node || !node_online(i))
continue;
alc_ptr[i] = __alloc_alien_cache(node, limit, 0xbaadf00d, gfp);
if (!alc_ptr[i]) {
for (i--; i >= 0; i--)
kfree(alc_ptr[i]);
kfree(alc_ptr);
return NULL;
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
}
}
return alc_ptr;
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
}
static void free_alien_cache(struct alien_cache **alc_ptr)
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
{
int i;
if (!alc_ptr)
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
return;
for_each_node(i)
kfree(alc_ptr[i]);
kfree(alc_ptr);
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
}
static void __drain_alien_cache(struct kmem_cache *cachep,
struct array_cache *ac, int node,
struct list_head *list)
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
{
struct kmem_cache_node *n = get_node(cachep, node);
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
if (ac->avail) {
spin_lock(&n->list_lock);
/*
* Stuff objects into the remote nodes shared array first.
* That way we could avoid the overhead of putting the objects
* into the free lists and getting them back later.
*/
if (n->shared)
transfer_objects(n->shared, ac, ac->limit);
free_block(cachep, ac->entry, ac->avail, node, list);
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
ac->avail = 0;
spin_unlock(&n->list_lock);
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
}
}
[PATCH] slab: Node rotor for freeing alien caches and remote per cpu pages. The cache reaper currently tries to free all alien caches and all remote per cpu pages in each pass of cache_reap. For a machines with large number of nodes (such as Altix) this may lead to sporadic delays of around ~10ms. Interrupts are disabled while reclaiming creating unacceptable delays. This patch changes that behavior by adding a per cpu reap_node variable. Instead of attempting to free all caches, we free only one alien cache and the per cpu pages from one remote node. That reduces the time spend in cache_reap. However, doing so will lengthen the time it takes to completely drain all remote per cpu pagesets and all alien caches. The time needed will grow with the number of nodes in the system. All caches are drained when they overflow their respective capacity. So the drawback here is only that a bit of memory may be wasted for awhile longer. Details: 1. Rename drain_remote_pages to drain_node_pages to allow the specification of the node to drain of pcp pages. 2. Add additional functions init_reap_node, next_reap_node for NUMA that manage a per cpu reap_node counter. 3. Add a reap_alien function that reaps only from the current reap_node. For us this seems to be a critical issue. Holdoffs of an average of ~7ms cause some HPC benchmarks to slow down significantly. F.e. NAS parallel slows down dramatically. NAS parallel has a 12-16 seconds runtime w/o rotor compared to 5.8 secs with the rotor patches. It gets down to 5.05 secs with the additional interrupt holdoff reductions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-10 09:33:54 +08:00
/*
* Called from cache_reap() to regularly drain alien caches round robin.
*/
static void reap_alien(struct kmem_cache *cachep, struct kmem_cache_node *n)
[PATCH] slab: Node rotor for freeing alien caches and remote per cpu pages. The cache reaper currently tries to free all alien caches and all remote per cpu pages in each pass of cache_reap. For a machines with large number of nodes (such as Altix) this may lead to sporadic delays of around ~10ms. Interrupts are disabled while reclaiming creating unacceptable delays. This patch changes that behavior by adding a per cpu reap_node variable. Instead of attempting to free all caches, we free only one alien cache and the per cpu pages from one remote node. That reduces the time spend in cache_reap. However, doing so will lengthen the time it takes to completely drain all remote per cpu pagesets and all alien caches. The time needed will grow with the number of nodes in the system. All caches are drained when they overflow their respective capacity. So the drawback here is only that a bit of memory may be wasted for awhile longer. Details: 1. Rename drain_remote_pages to drain_node_pages to allow the specification of the node to drain of pcp pages. 2. Add additional functions init_reap_node, next_reap_node for NUMA that manage a per cpu reap_node counter. 3. Add a reap_alien function that reaps only from the current reap_node. For us this seems to be a critical issue. Holdoffs of an average of ~7ms cause some HPC benchmarks to slow down significantly. F.e. NAS parallel slows down dramatically. NAS parallel has a 12-16 seconds runtime w/o rotor compared to 5.8 secs with the rotor patches. It gets down to 5.05 secs with the additional interrupt holdoff reductions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-10 09:33:54 +08:00
{
int node = __this_cpu_read(slab_reap_node);
[PATCH] slab: Node rotor for freeing alien caches and remote per cpu pages. The cache reaper currently tries to free all alien caches and all remote per cpu pages in each pass of cache_reap. For a machines with large number of nodes (such as Altix) this may lead to sporadic delays of around ~10ms. Interrupts are disabled while reclaiming creating unacceptable delays. This patch changes that behavior by adding a per cpu reap_node variable. Instead of attempting to free all caches, we free only one alien cache and the per cpu pages from one remote node. That reduces the time spend in cache_reap. However, doing so will lengthen the time it takes to completely drain all remote per cpu pagesets and all alien caches. The time needed will grow with the number of nodes in the system. All caches are drained when they overflow their respective capacity. So the drawback here is only that a bit of memory may be wasted for awhile longer. Details: 1. Rename drain_remote_pages to drain_node_pages to allow the specification of the node to drain of pcp pages. 2. Add additional functions init_reap_node, next_reap_node for NUMA that manage a per cpu reap_node counter. 3. Add a reap_alien function that reaps only from the current reap_node. For us this seems to be a critical issue. Holdoffs of an average of ~7ms cause some HPC benchmarks to slow down significantly. F.e. NAS parallel slows down dramatically. NAS parallel has a 12-16 seconds runtime w/o rotor compared to 5.8 secs with the rotor patches. It gets down to 5.05 secs with the additional interrupt holdoff reductions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-10 09:33:54 +08:00
if (n->alien) {
struct alien_cache *alc = n->alien[node];
struct array_cache *ac;
if (alc) {
ac = &alc->ac;
if (ac->avail && spin_trylock_irq(&alc->lock)) {
LIST_HEAD(list);
__drain_alien_cache(cachep, ac, node, &list);
spin_unlock_irq(&alc->lock);
slabs_destroy(cachep, &list);
}
[PATCH] slab: Node rotor for freeing alien caches and remote per cpu pages. The cache reaper currently tries to free all alien caches and all remote per cpu pages in each pass of cache_reap. For a machines with large number of nodes (such as Altix) this may lead to sporadic delays of around ~10ms. Interrupts are disabled while reclaiming creating unacceptable delays. This patch changes that behavior by adding a per cpu reap_node variable. Instead of attempting to free all caches, we free only one alien cache and the per cpu pages from one remote node. That reduces the time spend in cache_reap. However, doing so will lengthen the time it takes to completely drain all remote per cpu pagesets and all alien caches. The time needed will grow with the number of nodes in the system. All caches are drained when they overflow their respective capacity. So the drawback here is only that a bit of memory may be wasted for awhile longer. Details: 1. Rename drain_remote_pages to drain_node_pages to allow the specification of the node to drain of pcp pages. 2. Add additional functions init_reap_node, next_reap_node for NUMA that manage a per cpu reap_node counter. 3. Add a reap_alien function that reaps only from the current reap_node. For us this seems to be a critical issue. Holdoffs of an average of ~7ms cause some HPC benchmarks to slow down significantly. F.e. NAS parallel slows down dramatically. NAS parallel has a 12-16 seconds runtime w/o rotor compared to 5.8 secs with the rotor patches. It gets down to 5.05 secs with the additional interrupt holdoff reductions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-10 09:33:54 +08:00
}
}
}
static void drain_alien_cache(struct kmem_cache *cachep,
struct alien_cache **alien)
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
{
int i = 0;
struct alien_cache *alc;
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
struct array_cache *ac;
unsigned long flags;
for_each_online_node(i) {
alc = alien[i];
if (alc) {
LIST_HEAD(list);
ac = &alc->ac;
spin_lock_irqsave(&alc->lock, flags);
__drain_alien_cache(cachep, ac, i, &list);
spin_unlock_irqrestore(&alc->lock, flags);
slabs_destroy(cachep, &list);
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
}
}
}
static int __cache_free_alien(struct kmem_cache *cachep, void *objp,
int node, int page_node)
{
struct kmem_cache_node *n;
struct alien_cache *alien = NULL;
struct array_cache *ac;
LIST_HEAD(list);
n = get_node(cachep, node);
STATS_INC_NODEFREES(cachep);
if (n->alien && n->alien[page_node]) {
alien = n->alien[page_node];
ac = &alien->ac;
spin_lock(&alien->lock);
if (unlikely(ac->avail == ac->limit)) {
STATS_INC_ACOVERFLOW(cachep);
__drain_alien_cache(cachep, ac, page_node, &list);
}
ac_put_obj(cachep, ac, objp);
spin_unlock(&alien->lock);
slabs_destroy(cachep, &list);
} else {
n = get_node(cachep, page_node);
spin_lock(&n->list_lock);
free_block(cachep, &objp, 1, page_node, &list);
spin_unlock(&n->list_lock);
slabs_destroy(cachep, &list);
}
return 1;
}
static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
{
int page_node = page_to_nid(virt_to_page(objp));
int node = numa_mem_id();
/*
* Make sure we are not freeing a object from another node to the array
* cache on this cpu.
*/
if (likely(node == page_node))
return 0;
return __cache_free_alien(cachep, objp, node, page_node);
}
mm: remove GFP_THISNODE NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE. GFP_THISNODE is a secret combination of gfp bits that have different behavior than expected. It is a combination of __GFP_THISNODE, __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator slowpath to fail without trying reclaim even though it may be used in combination with __GFP_WAIT. An example of the problem this creates: commit e97ca8e5b864 ("mm: fix GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE that really just wanted __GFP_THISNODE. The problem doesn't end there, however, because even it was a no-op for alloc_misplaced_dst_page(), which also sets __GFP_NORETRY and __GFP_NOWARN, and migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT is set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is a no-op in these cases since the page allocator special-cases __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN. It's time to just remove GFP_THISNODE entirely. We leave __GFP_THISNODE to restrict an allocation to a local node, but remove GFP_THISNODE and its obscurity. Instead, we require that a caller clear __GFP_WAIT if it wants to avoid reclaim. This allows the aforementioned functions to actually reclaim as they should. It also enables any future callers that want to do __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT. Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so it is unchanged. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Pravin Shelar <pshelar@nicira.com> Cc: Jarno Rajahalme <jrajahalme@nicira.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Greg Thelen <gthelen@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 06:46:55 +08:00
/*
mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-07 08:28:21 +08:00
* Construct gfp mask to allocate from a specific node but do not direct reclaim
* or warn about failures. kswapd may still wake to reclaim in the background.
mm: remove GFP_THISNODE NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE. GFP_THISNODE is a secret combination of gfp bits that have different behavior than expected. It is a combination of __GFP_THISNODE, __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator slowpath to fail without trying reclaim even though it may be used in combination with __GFP_WAIT. An example of the problem this creates: commit e97ca8e5b864 ("mm: fix GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE that really just wanted __GFP_THISNODE. The problem doesn't end there, however, because even it was a no-op for alloc_misplaced_dst_page(), which also sets __GFP_NORETRY and __GFP_NOWARN, and migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT is set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is a no-op in these cases since the page allocator special-cases __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN. It's time to just remove GFP_THISNODE entirely. We leave __GFP_THISNODE to restrict an allocation to a local node, but remove GFP_THISNODE and its obscurity. Instead, we require that a caller clear __GFP_WAIT if it wants to avoid reclaim. This allows the aforementioned functions to actually reclaim as they should. It also enables any future callers that want to do __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT. Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so it is unchanged. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Pravin Shelar <pshelar@nicira.com> Cc: Jarno Rajahalme <jrajahalme@nicira.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Greg Thelen <gthelen@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 06:46:55 +08:00
*/
static inline gfp_t gfp_exact_node(gfp_t flags)
{
mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-07 08:28:21 +08:00
return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM;
mm: remove GFP_THISNODE NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE. GFP_THISNODE is a secret combination of gfp bits that have different behavior than expected. It is a combination of __GFP_THISNODE, __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator slowpath to fail without trying reclaim even though it may be used in combination with __GFP_WAIT. An example of the problem this creates: commit e97ca8e5b864 ("mm: fix GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE that really just wanted __GFP_THISNODE. The problem doesn't end there, however, because even it was a no-op for alloc_misplaced_dst_page(), which also sets __GFP_NORETRY and __GFP_NOWARN, and migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT is set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is a no-op in these cases since the page allocator special-cases __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN. It's time to just remove GFP_THISNODE entirely. We leave __GFP_THISNODE to restrict an allocation to a local node, but remove GFP_THISNODE and its obscurity. Instead, we require that a caller clear __GFP_WAIT if it wants to avoid reclaim. This allows the aforementioned functions to actually reclaim as they should. It also enables any future callers that want to do __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT. Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so it is unchanged. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Pravin Shelar <pshelar@nicira.com> Cc: Jarno Rajahalme <jrajahalme@nicira.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Greg Thelen <gthelen@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 06:46:55 +08:00
}
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
#endif
/*
* Allocates and initializes node for a node on each slab cache, used for
* either memory or cpu hotplug. If memory is being hot-added, the kmem_cache_node
* will be allocated off-node since memory is not yet online for the new node.
* When hotplugging memory or a cpu, existing node are not replaced if
* already in use.
*
* Must hold slab_mutex.
*/
static int init_cache_node_node(int node)
{
struct kmem_cache *cachep;
struct kmem_cache_node *n;
const size_t memsize = sizeof(struct kmem_cache_node);
list_for_each_entry(cachep, &slab_caches, list) {
/*
* Set up the kmem_cache_node for cpu before we can
* begin anything. Make sure some other cpu on this
* node has not already allocated this
*/
n = get_node(cachep, node);
if (!n) {
n = kmalloc_node(memsize, GFP_KERNEL, node);
if (!n)
return -ENOMEM;
kmem_cache_node_init(n);
n->next_reap = jiffies + REAPTIMEOUT_NODE +
((unsigned long)cachep) % REAPTIMEOUT_NODE;
/*
* The kmem_cache_nodes don't come and go as CPUs
* come and go. slab_mutex is sufficient
* protection here.
*/
cachep->node[node] = n;
}
spin_lock_irq(&n->list_lock);
n->free_limit =
(1 + nr_cpus_node(node)) *
cachep->batchcount + cachep->num;
spin_unlock_irq(&n->list_lock);
}
return 0;
}
static inline int slabs_tofree(struct kmem_cache *cachep,
struct kmem_cache_node *n)
{
return (n->free_objects + cachep->num - 1) / cachep->num;
}
static void cpuup_canceled(long cpu)
{
struct kmem_cache *cachep;
struct kmem_cache_node *n = NULL;
numa: slab: use numa_mem_id() for slab local memory node Example usage of generic "numa_mem_id()": The mainline slab code, since ~ 2.6.19, does not handle memoryless nodes well. Specifically, the "fast path"--____cache_alloc()--will never succeed as slab doesn't cache offnode object on the per cpu queues, and for memoryless nodes, all memory will be "off node" relative to numa_node_id(). This adds significant overhead to all kmem cache allocations, incurring a significant regression relative to earlier kernels [from before slab.c was reorganized]. This patch uses the generic topology function "numa_mem_id()" to return the "effective local memory node" for the calling context. This is the first node in the local node's generic fallback zonelist-- the same node that "local" mempolicy-based allocations would use. This lets slab cache these "local" allocations and avoid fallback/refill on every allocation. N.B.: Slab will need to handle node and memory hotplug events that could change the value returned by numa_mem_id() for any given node if recent changes to address memory hotplug don't already address this. E.g., flush all per cpu slab queues before rebuilding the zonelists while the "machine" is held in the stopped state. Performance impact on "hackbench 400 process 200" 2.6.34-rc3-mmotm-100405-1609 no-patch this-patch ia64 no memoryless nodes [avg of 10]: 11.713 11.637 ~0.65 diff ia64 cpus all on memless nodes [10]: 228.259 26.484 ~8.6x speedup The slowdown of the patched kernel from ~12 sec to ~28 seconds when configured with memoryless nodes is the result of all cpus allocating from a single node's mm pagepool. The cache lines of the single node are distributed/interleaved over the memory of the real physical nodes, but the zone lock, list heads, ... of the single node with memory still each live in a single cache line that is accessed from all processors. x86_64 [8x6 AMD] [avg of 40]: 2.883 2.845 Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 05:45:03 +08:00
int node = cpu_to_mem(cpu);
const struct cpumask *mask = cpumask_of_node(node);
list_for_each_entry(cachep, &slab_caches, list) {
struct array_cache *nc;
struct array_cache *shared;
struct alien_cache **alien;
LIST_HEAD(list);
n = get_node(cachep, node);
if (!n)
continue;
spin_lock_irq(&n->list_lock);
/* Free limit for this kmem_cache_node */
n->free_limit -= cachep->batchcount;
/* cpu is dead; no one can alloc from it. */
nc = per_cpu_ptr(cachep->cpu_cache, cpu);
if (nc) {
free_block(cachep, nc->entry, nc->avail, node, &list);
nc->avail = 0;
}
if (!cpumask_empty(mask)) {
spin_unlock_irq(&n->list_lock);
goto free_slab;
}
shared = n->shared;
if (shared) {
free_block(cachep, shared->entry,
shared->avail, node, &list);
n->shared = NULL;
}
alien = n->alien;
n->alien = NULL;
spin_unlock_irq(&n->list_lock);
kfree(shared);
if (alien) {
drain_alien_cache(cachep, alien);
free_alien_cache(alien);
}
free_slab:
slabs_destroy(cachep, &list);
}
/*
* In the previous loop, all the objects were freed to
* the respective cache's slabs, now we can go ahead and
* shrink each nodelist to its limit.
*/
list_for_each_entry(cachep, &slab_caches, list) {
n = get_node(cachep, node);
if (!n)
continue;
drain_freelist(cachep, n, slabs_tofree(cachep, n));
}
}
static int cpuup_prepare(long cpu)
{
struct kmem_cache *cachep;
struct kmem_cache_node *n = NULL;
numa: slab: use numa_mem_id() for slab local memory node Example usage of generic "numa_mem_id()": The mainline slab code, since ~ 2.6.19, does not handle memoryless nodes well. Specifically, the "fast path"--____cache_alloc()--will never succeed as slab doesn't cache offnode object on the per cpu queues, and for memoryless nodes, all memory will be "off node" relative to numa_node_id(). This adds significant overhead to all kmem cache allocations, incurring a significant regression relative to earlier kernels [from before slab.c was reorganized]. This patch uses the generic topology function "numa_mem_id()" to return the "effective local memory node" for the calling context. This is the first node in the local node's generic fallback zonelist-- the same node that "local" mempolicy-based allocations would use. This lets slab cache these "local" allocations and avoid fallback/refill on every allocation. N.B.: Slab will need to handle node and memory hotplug events that could change the value returned by numa_mem_id() for any given node if recent changes to address memory hotplug don't already address this. E.g., flush all per cpu slab queues before rebuilding the zonelists while the "machine" is held in the stopped state. Performance impact on "hackbench 400 process 200" 2.6.34-rc3-mmotm-100405-1609 no-patch this-patch ia64 no memoryless nodes [avg of 10]: 11.713 11.637 ~0.65 diff ia64 cpus all on memless nodes [10]: 228.259 26.484 ~8.6x speedup The slowdown of the patched kernel from ~12 sec to ~28 seconds when configured with memoryless nodes is the result of all cpus allocating from a single node's mm pagepool. The cache lines of the single node are distributed/interleaved over the memory of the real physical nodes, but the zone lock, list heads, ... of the single node with memory still each live in a single cache line that is accessed from all processors. x86_64 [8x6 AMD] [avg of 40]: 2.883 2.845 Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 05:45:03 +08:00
int node = cpu_to_mem(cpu);
int err;
/*
* We need to do this right in the beginning since
* alloc_arraycache's are going to use this list.
* kmalloc_node allows us to add the slab to the right
* kmem_cache_node and not this cpu's kmem_cache_node
*/
err = init_cache_node_node(node);
if (err < 0)
goto bad;
/*
* Now we can go ahead with allocating the shared arrays and
* array caches
*/
list_for_each_entry(cachep, &slab_caches, list) {
struct array_cache *shared = NULL;
struct alien_cache **alien = NULL;
if (cachep->shared) {
shared = alloc_arraycache(node,
cachep->shared * cachep->batchcount,
slab: setup allocators earlier in the boot sequence This patch makes kmalloc() available earlier in the boot sequence so we can get rid of some bootmem allocations. The bulk of the changes are due to kmem_cache_init() being called with interrupts disabled which requires some changes to allocator boostrap code. Note: 32-bit x86 does WP protect test in mem_init() so we must setup traps before we call mem_init() during boot as reported by Ingo Molnar: We have a hard crash in the WP-protect code: [ 0.000000] Checking if this processor honours the WP bit even in supervisor mode...BUG: Int 14: CR2 ffcff000 [ 0.000000] EDI 00000188 ESI 00000ac7 EBP c17eaf9c ESP c17eaf8c [ 0.000000] EBX 000014e0 EDX 0000000e ECX 01856067 EAX 00000001 [ 0.000000] err 00000003 EIP c10135b1 CS 00000060 flg 00010002 [ 0.000000] Stack: c17eafa8 c17fd410 c16747bc c17eafc4 c17fd7e5 000011fd f8616000 c18237cc [ 0.000000] 00099800 c17bb000 c17eafec c17f1668 000001c5 c17f1322 c166e039 c1822bf0 [ 0.000000] c166e033 c153a014 c18237cc 00020800 c17eaff8 c17f106a 00020800 01ba5003 [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip-02161-g7a74539-dirty #52203 [ 0.000000] Call Trace: [ 0.000000] [<c15357c2>] ? printk+0x14/0x16 [ 0.000000] [<c10135b1>] ? do_test_wp_bit+0x19/0x23 [ 0.000000] [<c17fd410>] ? test_wp_bit+0x26/0x64 [ 0.000000] [<c17fd7e5>] ? mem_init+0x1ba/0x1d8 [ 0.000000] [<c17f1668>] ? start_kernel+0x164/0x2f7 [ 0.000000] [<c17f1322>] ? unknown_bootoption+0x0/0x19c [ 0.000000] [<c17f106a>] ? __init_begin+0x6a/0x6f Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matt Mackall <mpm@selenic.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11 00:40:04 +08:00
0xbaadf00d, GFP_KERNEL);
if (!shared)
goto bad;
}
if (use_alien_caches) {
slab: setup allocators earlier in the boot sequence This patch makes kmalloc() available earlier in the boot sequence so we can get rid of some bootmem allocations. The bulk of the changes are due to kmem_cache_init() being called with interrupts disabled which requires some changes to allocator boostrap code. Note: 32-bit x86 does WP protect test in mem_init() so we must setup traps before we call mem_init() during boot as reported by Ingo Molnar: We have a hard crash in the WP-protect code: [ 0.000000] Checking if this processor honours the WP bit even in supervisor mode...BUG: Int 14: CR2 ffcff000 [ 0.000000] EDI 00000188 ESI 00000ac7 EBP c17eaf9c ESP c17eaf8c [ 0.000000] EBX 000014e0 EDX 0000000e ECX 01856067 EAX 00000001 [ 0.000000] err 00000003 EIP c10135b1 CS 00000060 flg 00010002 [ 0.000000] Stack: c17eafa8 c17fd410 c16747bc c17eafc4 c17fd7e5 000011fd f8616000 c18237cc [ 0.000000] 00099800 c17bb000 c17eafec c17f1668 000001c5 c17f1322 c166e039 c1822bf0 [ 0.000000] c166e033 c153a014 c18237cc 00020800 c17eaff8 c17f106a 00020800 01ba5003 [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip-02161-g7a74539-dirty #52203 [ 0.000000] Call Trace: [ 0.000000] [<c15357c2>] ? printk+0x14/0x16 [ 0.000000] [<c10135b1>] ? do_test_wp_bit+0x19/0x23 [ 0.000000] [<c17fd410>] ? test_wp_bit+0x26/0x64 [ 0.000000] [<c17fd7e5>] ? mem_init+0x1ba/0x1d8 [ 0.000000] [<c17f1668>] ? start_kernel+0x164/0x2f7 [ 0.000000] [<c17f1322>] ? unknown_bootoption+0x0/0x19c [ 0.000000] [<c17f106a>] ? __init_begin+0x6a/0x6f Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matt Mackall <mpm@selenic.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11 00:40:04 +08:00
alien = alloc_alien_cache(node, cachep->limit, GFP_KERNEL);
if (!alien) {
kfree(shared);
goto bad;
}
}
n = get_node(cachep, node);
BUG_ON(!n);
spin_lock_irq(&n->list_lock);
if (!n->shared) {
/*
* We are serialised from CPU_DEAD or
* CPU_UP_CANCELLED by the cpucontrol lock
*/
n->shared = shared;
shared = NULL;
}
[PATCH] NUMA slab locking fixes: fix cpu down and up locking This fixes locking and bugs in cpu_down and cpu_up paths of the NUMA slab allocator. Sonny Rao <sonny@burdell.org> reported problems sometime back on POWER5 boxes, when the last cpu on the nodes were being offlined. We could not reproduce the same on x86_64 because the cpumask (node_to_cpumask) was not being updated on cpu down. Since that issue is now fixed, we can reproduce Sonny's problems on x86_64 NUMA, and here is the fix. The problem earlier was on CPU_DOWN, if it was the last cpu on the node to go down, the array_caches (shared, alien) and the kmem_list3 of the node were being freed (kfree) with the kmem_list3 lock held. If the l3 or the array_caches were to come from the same cache being cleared, we hit on badness. This patch cleans up the locking in cpu_up and cpu_down path. We cannot really free l3 on cpu down because, there is no node offlining yet and even though a cpu is not yet up, node local memory can be allocated for it. So l3s are usually allocated at keme_cache_create and destroyed at kmem_cache_destroy. Hence, we don't need cachep->spinlock protection to get to the cachep->nodelist[nodeid] either. Patch survived onlining and offlining on a 4 core 2 node Tyan box with a 4 dbench process running all the time. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05 15:27:59 +08:00
#ifdef CONFIG_NUMA
if (!n->alien) {
n->alien = alien;
alien = NULL;
}
#endif
spin_unlock_irq(&n->list_lock);
kfree(shared);
free_alien_cache(alien);
}
SLAB: Fix lockdep annotations for CPU hotplug As reported by Paul McKenney: I am seeing some lockdep complaints in rcutorture runs that include frequent CPU-hotplug operations. The tests are otherwise successful. My first thought was to send a patch that gave each array_cache structure's ->lock field its own struct lock_class_key, but you already have a init_lock_keys() that seems to be intended to deal with this. ------------------------------------------------------------------------ ============================================= [ INFO: possible recursive locking detected ] 2.6.32-rc4-autokern1 #1 --------------------------------------------- syslogd/2908 is trying to acquire lock: (&nc->lock){..-...}, at: [<c0000000001407f4>] .kmem_cache_free+0x118/0x2d4 but task is already holding lock: (&nc->lock){..-...}, at: [<c0000000001411bc>] .kfree+0x1f0/0x324 other info that might help us debug this: 3 locks held by syslogd/2908: #0: (&u->readlock){+.+.+.}, at: [<c0000000004556f8>] .unix_dgram_recvmsg+0x70/0x338 #1: (&nc->lock){..-...}, at: [<c0000000001411bc>] .kfree+0x1f0/0x324 #2: (&parent->list_lock){-.-...}, at: [<c000000000140f64>] .__drain_alien_cache+0x50/0xb8 stack backtrace: Call Trace: [c0000000e8ccafc0] [c0000000000101e4] .show_stack+0x70/0x184 (unreliable) [c0000000e8ccb070] [c0000000000afebc] .validate_chain+0x6ec/0xf58 [c0000000e8ccb180] [c0000000000b0ff0] .__lock_acquire+0x8c8/0x974 [c0000000e8ccb280] [c0000000000b2290] .lock_acquire+0x140/0x18c [c0000000e8ccb350] [c000000000468df0] ._spin_lock+0x48/0x70 [c0000000e8ccb3e0] [c0000000001407f4] .kmem_cache_free+0x118/0x2d4 [c0000000e8ccb4a0] [c000000000140b90] .free_block+0x130/0x1a8 [c0000000e8ccb540] [c000000000140f94] .__drain_alien_cache+0x80/0xb8 [c0000000e8ccb5e0] [c0000000001411e0] .kfree+0x214/0x324 [c0000000e8ccb6a0] [c0000000003ca860] .skb_release_data+0xe8/0x104 [c0000000e8ccb730] [c0000000003ca2ec] .__kfree_skb+0x20/0xd4 [c0000000e8ccb7b0] [c0000000003cf2c8] .skb_free_datagram+0x1c/0x5c [c0000000e8ccb830] [c00000000045597c] .unix_dgram_recvmsg+0x2f4/0x338 [c0000000e8ccb920] [c0000000003c0f14] .sock_recvmsg+0xf4/0x13c [c0000000e8ccbb30] [c0000000003c28ec] .SyS_recvfrom+0xb4/0x130 [c0000000e8ccbcb0] [c0000000003bfb78] .sys_recv+0x18/0x2c [c0000000e8ccbd20] [c0000000003ed388] .compat_sys_recv+0x14/0x28 [c0000000e8ccbd90] [c0000000003ee1bc] .compat_sys_socketcall+0x178/0x220 [c0000000e8ccbe30] [c0000000000085d4] syscall_exit+0x0/0x40 This patch fixes the issue by setting up lockdep annotations during CPU hotplug. Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-11-24 04:01:15 +08:00
return 0;
bad:
cpuup_canceled(cpu);
return -ENOMEM;
}
static int cpuup_callback(struct notifier_block *nfb,
unsigned long action, void *hcpu)
{
long cpu = (long)hcpu;
int err = 0;
switch (action) {
case CPU_UP_PREPARE:
case CPU_UP_PREPARE_FROZEN:
mutex_lock(&slab_mutex);
err = cpuup_prepare(cpu);
mutex_unlock(&slab_mutex);
break;
case CPU_ONLINE:
case CPU_ONLINE_FROZEN:
start_cpu_timer(cpu);
break;
#ifdef CONFIG_HOTPLUG_CPU
case CPU_DOWN_PREPARE:
case CPU_DOWN_PREPARE_FROZEN:
/*
* Shutdown cache reaper. Note that the slab_mutex is
* held so that if cache_reap() is invoked it cannot do
* anything expensive but will only modify reap_work
* and reschedule the timer.
*/
cancel_delayed_work_sync(&per_cpu(slab_reap_work, cpu));
/* Now the cache_reaper is guaranteed to be not running. */
per_cpu(slab_reap_work, cpu).work.func = NULL;
break;
case CPU_DOWN_FAILED:
case CPU_DOWN_FAILED_FROZEN:
start_cpu_timer(cpu);
break;
case CPU_DEAD:
case CPU_DEAD_FROZEN:
[PATCH] NUMA slab locking fixes: fix cpu down and up locking This fixes locking and bugs in cpu_down and cpu_up paths of the NUMA slab allocator. Sonny Rao <sonny@burdell.org> reported problems sometime back on POWER5 boxes, when the last cpu on the nodes were being offlined. We could not reproduce the same on x86_64 because the cpumask (node_to_cpumask) was not being updated on cpu down. Since that issue is now fixed, we can reproduce Sonny's problems on x86_64 NUMA, and here is the fix. The problem earlier was on CPU_DOWN, if it was the last cpu on the node to go down, the array_caches (shared, alien) and the kmem_list3 of the node were being freed (kfree) with the kmem_list3 lock held. If the l3 or the array_caches were to come from the same cache being cleared, we hit on badness. This patch cleans up the locking in cpu_up and cpu_down path. We cannot really free l3 on cpu down because, there is no node offlining yet and even though a cpu is not yet up, node local memory can be allocated for it. So l3s are usually allocated at keme_cache_create and destroyed at kmem_cache_destroy. Hence, we don't need cachep->spinlock protection to get to the cachep->nodelist[nodeid] either. Patch survived onlining and offlining on a 4 core 2 node Tyan box with a 4 dbench process running all the time. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05 15:27:59 +08:00
/*
* Even if all the cpus of a node are down, we don't free the
* kmem_cache_node of any cache. This to avoid a race between
[PATCH] NUMA slab locking fixes: fix cpu down and up locking This fixes locking and bugs in cpu_down and cpu_up paths of the NUMA slab allocator. Sonny Rao <sonny@burdell.org> reported problems sometime back on POWER5 boxes, when the last cpu on the nodes were being offlined. We could not reproduce the same on x86_64 because the cpumask (node_to_cpumask) was not being updated on cpu down. Since that issue is now fixed, we can reproduce Sonny's problems on x86_64 NUMA, and here is the fix. The problem earlier was on CPU_DOWN, if it was the last cpu on the node to go down, the array_caches (shared, alien) and the kmem_list3 of the node were being freed (kfree) with the kmem_list3 lock held. If the l3 or the array_caches were to come from the same cache being cleared, we hit on badness. This patch cleans up the locking in cpu_up and cpu_down path. We cannot really free l3 on cpu down because, there is no node offlining yet and even though a cpu is not yet up, node local memory can be allocated for it. So l3s are usually allocated at keme_cache_create and destroyed at kmem_cache_destroy. Hence, we don't need cachep->spinlock protection to get to the cachep->nodelist[nodeid] either. Patch survived onlining and offlining on a 4 core 2 node Tyan box with a 4 dbench process running all the time. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05 15:27:59 +08:00
* cpu_down, and a kmalloc allocation from another cpu for
* memory from the node of the cpu going down. The node
[PATCH] NUMA slab locking fixes: fix cpu down and up locking This fixes locking and bugs in cpu_down and cpu_up paths of the NUMA slab allocator. Sonny Rao <sonny@burdell.org> reported problems sometime back on POWER5 boxes, when the last cpu on the nodes were being offlined. We could not reproduce the same on x86_64 because the cpumask (node_to_cpumask) was not being updated on cpu down. Since that issue is now fixed, we can reproduce Sonny's problems on x86_64 NUMA, and here is the fix. The problem earlier was on CPU_DOWN, if it was the last cpu on the node to go down, the array_caches (shared, alien) and the kmem_list3 of the node were being freed (kfree) with the kmem_list3 lock held. If the l3 or the array_caches were to come from the same cache being cleared, we hit on badness. This patch cleans up the locking in cpu_up and cpu_down path. We cannot really free l3 on cpu down because, there is no node offlining yet and even though a cpu is not yet up, node local memory can be allocated for it. So l3s are usually allocated at keme_cache_create and destroyed at kmem_cache_destroy. Hence, we don't need cachep->spinlock protection to get to the cachep->nodelist[nodeid] either. Patch survived onlining and offlining on a 4 core 2 node Tyan box with a 4 dbench process running all the time. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05 15:27:59 +08:00
* structure is usually allocated from kmem_cache_create() and
* gets destroyed at kmem_cache_destroy().
*/
/* fall through */
[PATCH] mm: slab: eliminate lock_cpu_hotplug from slab Here's an attempt towards doing away with lock_cpu_hotplug in the slab subsystem. This approach also fixes a bug which shows up when cpus are being offlined/onlined and slab caches are being tuned simultaneously. http://marc.theaimsgroup.com/?l=linux-kernel&m=116098888100481&w=2 The patch has been stress tested overnight on a 2 socket 4 core AMD box with repeated cpu online and offline, while dbench and kernbench process are running, and slab caches being tuned at the same time. There were no lockdep warnings either. (This test on 2,6.18 as 2.6.19-rc crashes at __drain_pages http://marc.theaimsgroup.com/?l=linux-kernel&m=116172164217678&w=2 ) The approach here is to hold cache_chain_mutex from CPU_UP_PREPARE until CPU_ONLINE (similar in approach as worqueue_mutex) . Slab code sensitive to cpu_online_map (kmem_cache_create, kmem_cache_destroy, slabinfo_write, __cache_shrink) is already serialized with cache_chain_mutex. (This patch lengthens cache_chain_mutex hold time at kmem_cache_destroy to cover this). This patch also takes the cache_chain_sem at kmem_cache_shrink to protect sanity of cpu_online_map at __cache_shrink, as viewed by slab. (kmem_cache_shrink->__cache_shrink->drain_cpu_caches). But, really, kmem_cache_shrink is used at just one place in the acpi subsystem! Do we really need to keep kmem_cache_shrink at all? Another note. Looks like a cpu hotplug event can send CPU_UP_CANCELED to a registered subsystem even if the subsystem did not receive CPU_UP_PREPARE. This could be due to a subsystem registered for notification earlier than the current subsystem crapping out with NOTIFY_BAD. Badness can occur with in the CPU_UP_CANCELED code path at slab if this happens (The same would apply for workqueue.c as well). To overcome this, we might have to use either a) a per subsystem flag and avoid handling of CPU_UP_CANCELED, or b) Use a special notifier events like LOCK_ACQUIRE/RELEASE as Gautham was using in his experiments, or c) Do not send CPU_UP_CANCELED to a subsystem which did not receive CPU_UP_PREPARE. I would prefer c). Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 12:32:14 +08:00
#endif
case CPU_UP_CANCELED:
case CPU_UP_CANCELED_FROZEN:
mutex_lock(&slab_mutex);
cpuup_canceled(cpu);
mutex_unlock(&slab_mutex);
break;
}
return notifier_from_errno(err);
}
static struct notifier_block cpucache_notifier = {
&cpuup_callback, NULL, 0
};
#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
/*
* Drains freelist for a node on each slab cache, used for memory hot-remove.
* Returns -EBUSY if all objects cannot be drained so that the node is not
* removed.
*
* Must hold slab_mutex.
*/
static int __meminit drain_cache_node_node(int node)
{
struct kmem_cache *cachep;
int ret = 0;
list_for_each_entry(cachep, &slab_caches, list) {
struct kmem_cache_node *n;
n = get_node(cachep, node);
if (!n)
continue;
drain_freelist(cachep, n, slabs_tofree(cachep, n));
if (!list_empty(&n->slabs_full) ||
!list_empty(&n->slabs_partial)) {
ret = -EBUSY;
break;
}
}
return ret;
}
static int __meminit slab_memory_callback(struct notifier_block *self,
unsigned long action, void *arg)
{
struct memory_notify *mnb = arg;
int ret = 0;
int nid;
nid = mnb->status_change_nid;
if (nid < 0)
goto out;
switch (action) {
case MEM_GOING_ONLINE:
mutex_lock(&slab_mutex);
ret = init_cache_node_node(nid);
mutex_unlock(&slab_mutex);
break;
case MEM_GOING_OFFLINE:
mutex_lock(&slab_mutex);
ret = drain_cache_node_node(nid);
mutex_unlock(&slab_mutex);
break;
case MEM_ONLINE:
case MEM_OFFLINE:
case MEM_CANCEL_ONLINE:
case MEM_CANCEL_OFFLINE:
break;
}
out:
return notifier_from_errno(ret);
}
#endif /* CONFIG_NUMA && CONFIG_MEMORY_HOTPLUG */
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
/*
* swap the static kmem_cache_node with kmalloced memory
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
*/
static void __init init_list(struct kmem_cache *cachep, struct kmem_cache_node *list,
int nodeid)
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
{
struct kmem_cache_node *ptr;
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
ptr = kmalloc_node(sizeof(struct kmem_cache_node), GFP_NOWAIT, nodeid);
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
BUG_ON(!ptr);
memcpy(ptr, list, sizeof(struct kmem_cache_node));
/*
* Do not assume that spinlocks can be initialized via memcpy:
*/
spin_lock_init(&ptr->list_lock);
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
MAKE_ALL_LISTS(cachep, ptr, nodeid);
cachep->node[nodeid] = ptr;
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
}
/*
* For setting up all the kmem_cache_node for cache whose buffer_size is same as
* size of kmem_cache_node.
*/
static void __init set_up_node(struct kmem_cache *cachep, int index)
{
int node;
for_each_online_node(node) {
cachep->node[node] = &init_kmem_cache_node[index + node];
cachep->node[node]->next_reap = jiffies +
REAPTIMEOUT_NODE +
((unsigned long)cachep) % REAPTIMEOUT_NODE;
}
}
/*
* Initialisation. Called after the page allocator have been initialised and
* before smp_init().
*/
void __init kmem_cache_init(void)
{
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
int i;
BUILD_BUG_ON(sizeof(((struct page *)NULL)->lru) <
sizeof(struct rcu_head));
kmem_cache = &kmem_cache_boot;
if (num_possible_nodes() == 1)
use_alien_caches = 0;
for (i = 0; i < NUM_INIT_LISTS; i++)
kmem_cache_node_init(&init_kmem_cache_node[i]);
/*
* Fragmentation resistance on low memory - only use bigger
* page orders on machines with more than 32MB of memory if
* not overridden on the command line.
*/
if (!slab_max_order_set && totalram_pages > (32 << 20) >> PAGE_SHIFT)
slab_max_order = SLAB_MAX_ORDER_HI;
/* Bootstrap is tricky, because several objects are allocated
* from caches that do not exist yet:
* 1) initialize the kmem_cache cache: it contains the struct
* kmem_cache structures of all caches, except kmem_cache itself:
* kmem_cache is statically allocated.
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
* Initially an __init data area is used for the head array and the
* kmem_cache_node structures, it's replaced with a kmalloc allocated
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
* array at the end of the bootstrap.
* 2) Create the first kmalloc cache.
* The struct kmem_cache for the new cache is allocated normally.
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
* An __init data area is used for the head array.
* 3) Create the remaining kmalloc caches, with minimally sized
* head arrays.
* 4) Replace the __init data head arrays for kmem_cache and the first
* kmalloc cache with kmalloc allocated arrays.
* 5) Replace the __init data for kmem_cache_node for kmem_cache and
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
* the other cache's with kmalloc allocated memory.
* 6) Resize the head arrays of the kmalloc caches to their final sizes.
*/
/* 1) create the kmem_cache */
/*
* struct kmem_cache size depends on nr_node_ids & nr_cpu_ids
*/
create_boot_cache(kmem_cache, "kmem_cache",
offsetof(struct kmem_cache, node) +
nr_node_ids * sizeof(struct kmem_cache_node *),
SLAB_HWCACHE_ALIGN);
list_add(&kmem_cache->list, &slab_caches);
slab_state = PARTIAL;
/*
* Initialize the caches that provide memory for the kmem_cache_node
* structures first. Without this, further allocations will bug.
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
*/
kmalloc_caches[INDEX_NODE] = create_kmalloc_cache("kmalloc-node",
kmalloc_size(INDEX_NODE), ARCH_KMALLOC_FLAGS);
slab_state = PARTIAL_NODE;
slab: correct size_index table before replacing the bootstrap kmem_cache_node This patch moves the initialization of the size_index table slightly earlier so that the first few kmem_cache_node's can be safely allocated when KMALLOC_MIN_SIZE is large. There are currently two ways to generate indices into kmalloc_caches (via kmalloc_index() and via the size_index table in slab_common.c) and on some arches (possibly only MIPS) they potentially disagree with each other until create_kmalloc_caches() has been called. It seems that the intention is that the size_index table is a fast equivalent to kmalloc_index() and that create_kmalloc_caches() patches the table to return the correct value for the cases where kmalloc_index()'s if-statements apply. The failing sequence was: * kmalloc_caches contains NULL elements * kmem_cache_init initialises the element that 'struct kmem_cache_node' will be allocated to. For 32-bit Mips, this is a 56-byte struct and kmalloc_index returns KMALLOC_SHIFT_LOW (7). * init_list is called which calls kmalloc_node to allocate a 'struct kmem_cache_node'. * kmalloc_slab selects the kmem_caches element using size_index[size_index_elem(size)]. For MIPS, size is 56, and the expression returns 6. * This element of kmalloc_caches is NULL and allocation fails. * If it had not already failed, it would have called create_kmalloc_caches() at this point which would have changed size_index[size_index_elem(size)] to 7. I don't believe the bug to be LLVM specific but GCC doesn't normally encounter the problem. I haven't been able to identify exactly what GCC is doing better (probably inlining) but it seems that GCC is managing to optimize to the point that it eliminates the problematic allocations. This theory is supported by the fact that GCC can be made to fail in the same way by changing inline, __inline, __inline__, and __always_inline in include/linux/compiler-gcc.h such that they don't actually inline things. Signed-off-by: Daniel Sanders <daniel.sanders@imgtec.com> Acked-by: Pekka Enberg <penberg@kernel.org> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25 07:55:57 +08:00
setup_kmalloc_cache_index_table();
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
slab_early_init = 0;
/* 5) Replace the bootstrap kmem_cache_node */
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
{
int nid;
for_each_online_node(nid) {
init_list(kmem_cache, &init_kmem_cache_node[CACHE_CACHE + nid], nid);
init_list(kmalloc_caches[INDEX_NODE],
&init_kmem_cache_node[SIZE_NODE + nid], nid);
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
}
}
create_kmalloc_caches(ARCH_KMALLOC_FLAGS);
slab: setup cpu caches later on when interrupts are enabled Fixes the following boot-time warning: [ 0.000000] ------------[ cut here ]------------ [ 0.000000] WARNING: at kernel/smp.c:369 smp_call_function_many+0x56/0x1bc() [ 0.000000] Hardware name: [ 0.000000] Modules linked in: [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.30 #492 [ 0.000000] Call Trace: [ 0.000000] [<ffffffff8149e021>] ? _spin_unlock+0x4f/0x5c [ 0.000000] [<ffffffff8108f11b>] ? smp_call_function_many+0x56/0x1bc [ 0.000000] [<ffffffff81061764>] warn_slowpath_common+0x7c/0xa9 [ 0.000000] [<ffffffff810617a5>] warn_slowpath_null+0x14/0x16 [ 0.000000] [<ffffffff8108f11b>] smp_call_function_many+0x56/0x1bc [ 0.000000] [<ffffffff810f3e00>] ? do_ccupdate_local+0x0/0x54 [ 0.000000] [<ffffffff810f3e00>] ? do_ccupdate_local+0x0/0x54 [ 0.000000] [<ffffffff8108f2be>] smp_call_function+0x3d/0x68 [ 0.000000] [<ffffffff810f3e00>] ? do_ccupdate_local+0x0/0x54 [ 0.000000] [<ffffffff81066fd8>] on_each_cpu+0x31/0x7c [ 0.000000] [<ffffffff810f64f5>] do_tune_cpucache+0x119/0x454 [ 0.000000] [<ffffffff81087080>] ? lockdep_init_map+0x94/0x10b [ 0.000000] [<ffffffff818133b0>] ? kmem_cache_init+0x421/0x593 [ 0.000000] [<ffffffff810f69cf>] enable_cpucache+0x68/0xad [ 0.000000] [<ffffffff818133c3>] kmem_cache_init+0x434/0x593 [ 0.000000] [<ffffffff8180987c>] ? mem_init+0x156/0x161 [ 0.000000] [<ffffffff817f8aae>] start_kernel+0x1cc/0x3b9 [ 0.000000] [<ffffffff817f829a>] x86_64_start_reservations+0xaa/0xae [ 0.000000] [<ffffffff817f837f>] x86_64_start_kernel+0xe1/0xe8 [ 0.000000] ---[ end trace 4eaa2a86a8e2da22 ]--- Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-12 20:58:59 +08:00
}
void __init kmem_cache_init_late(void)
{
struct kmem_cache *cachep;
slab_state = UP;
slab: setup cpu caches later on when interrupts are enabled Fixes the following boot-time warning: [ 0.000000] ------------[ cut here ]------------ [ 0.000000] WARNING: at kernel/smp.c:369 smp_call_function_many+0x56/0x1bc() [ 0.000000] Hardware name: [ 0.000000] Modules linked in: [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.30 #492 [ 0.000000] Call Trace: [ 0.000000] [<ffffffff8149e021>] ? _spin_unlock+0x4f/0x5c [ 0.000000] [<ffffffff8108f11b>] ? smp_call_function_many+0x56/0x1bc [ 0.000000] [<ffffffff81061764>] warn_slowpath_common+0x7c/0xa9 [ 0.000000] [<ffffffff810617a5>] warn_slowpath_null+0x14/0x16 [ 0.000000] [<ffffffff8108f11b>] smp_call_function_many+0x56/0x1bc [ 0.000000] [<ffffffff810f3e00>] ? do_ccupdate_local+0x0/0x54 [ 0.000000] [<ffffffff810f3e00>] ? do_ccupdate_local+0x0/0x54 [ 0.000000] [<ffffffff8108f2be>] smp_call_function+0x3d/0x68 [ 0.000000] [<ffffffff810f3e00>] ? do_ccupdate_local+0x0/0x54 [ 0.000000] [<ffffffff81066fd8>] on_each_cpu+0x31/0x7c [ 0.000000] [<ffffffff810f64f5>] do_tune_cpucache+0x119/0x454 [ 0.000000] [<ffffffff81087080>] ? lockdep_init_map+0x94/0x10b [ 0.000000] [<ffffffff818133b0>] ? kmem_cache_init+0x421/0x593 [ 0.000000] [<ffffffff810f69cf>] enable_cpucache+0x68/0xad [ 0.000000] [<ffffffff818133c3>] kmem_cache_init+0x434/0x593 [ 0.000000] [<ffffffff8180987c>] ? mem_init+0x156/0x161 [ 0.000000] [<ffffffff817f8aae>] start_kernel+0x1cc/0x3b9 [ 0.000000] [<ffffffff817f829a>] x86_64_start_reservations+0xaa/0xae [ 0.000000] [<ffffffff817f837f>] x86_64_start_kernel+0xe1/0xe8 [ 0.000000] ---[ end trace 4eaa2a86a8e2da22 ]--- Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-12 20:58:59 +08:00
/* 6) resize the head arrays to their final sizes */
mutex_lock(&slab_mutex);
list_for_each_entry(cachep, &slab_caches, list)
slab: setup cpu caches later on when interrupts are enabled Fixes the following boot-time warning: [ 0.000000] ------------[ cut here ]------------ [ 0.000000] WARNING: at kernel/smp.c:369 smp_call_function_many+0x56/0x1bc() [ 0.000000] Hardware name: [ 0.000000] Modules linked in: [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.30 #492 [ 0.000000] Call Trace: [ 0.000000] [<ffffffff8149e021>] ? _spin_unlock+0x4f/0x5c [ 0.000000] [<ffffffff8108f11b>] ? smp_call_function_many+0x56/0x1bc [ 0.000000] [<ffffffff81061764>] warn_slowpath_common+0x7c/0xa9 [ 0.000000] [<ffffffff810617a5>] warn_slowpath_null+0x14/0x16 [ 0.000000] [<ffffffff8108f11b>] smp_call_function_many+0x56/0x1bc [ 0.000000] [<ffffffff810f3e00>] ? do_ccupdate_local+0x0/0x54 [ 0.000000] [<ffffffff810f3e00>] ? do_ccupdate_local+0x0/0x54 [ 0.000000] [<ffffffff8108f2be>] smp_call_function+0x3d/0x68 [ 0.000000] [<ffffffff810f3e00>] ? do_ccupdate_local+0x0/0x54 [ 0.000000] [<ffffffff81066fd8>] on_each_cpu+0x31/0x7c [ 0.000000] [<ffffffff810f64f5>] do_tune_cpucache+0x119/0x454 [ 0.000000] [<ffffffff81087080>] ? lockdep_init_map+0x94/0x10b [ 0.000000] [<ffffffff818133b0>] ? kmem_cache_init+0x421/0x593 [ 0.000000] [<ffffffff810f69cf>] enable_cpucache+0x68/0xad [ 0.000000] [<ffffffff818133c3>] kmem_cache_init+0x434/0x593 [ 0.000000] [<ffffffff8180987c>] ? mem_init+0x156/0x161 [ 0.000000] [<ffffffff817f8aae>] start_kernel+0x1cc/0x3b9 [ 0.000000] [<ffffffff817f829a>] x86_64_start_reservations+0xaa/0xae [ 0.000000] [<ffffffff817f837f>] x86_64_start_kernel+0xe1/0xe8 [ 0.000000] ---[ end trace 4eaa2a86a8e2da22 ]--- Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-12 20:58:59 +08:00
if (enable_cpucache(cachep, GFP_NOWAIT))
BUG();
mutex_unlock(&slab_mutex);
/* Done! */
slab_state = FULL;
/*
* Register a cpu startup notifier callback that initializes
* cpu_cache_get for all new cpus
*/
register_cpu_notifier(&cpucache_notifier);
#ifdef CONFIG_NUMA
/*
* Register a memory hotplug callback that initializes and frees
* node.
*/
hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);
#endif
/*
* The reap timers are started later, with a module init call: That part
* of the kernel is not yet operational.
*/
}
static int __init cpucache_init(void)
{
int cpu;
/*
* Register the timers that return unneeded pages to the page allocator
*/
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
for_each_online_cpu(cpu)
start_cpu_timer(cpu);
/* Done! */
slab_state = FULL;
return 0;
}
__initcall(cpucache_init);
static noinline void
slab_out_of_memory(struct kmem_cache *cachep, gfp_t gfpflags, int nodeid)
{
#if DEBUG
struct kmem_cache_node *n;
struct page *page;
unsigned long flags;
int node;
static DEFINE_RATELIMIT_STATE(slab_oom_rs, DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
if ((gfpflags & __GFP_NOWARN) || !__ratelimit(&slab_oom_rs))
return;
printk(KERN_WARNING
"SLAB: Unable to allocate memory on node %d (gfp=0x%x)\n",
nodeid, gfpflags);
printk(KERN_WARNING " cache: %s, object size: %d, order: %d\n",
cachep->name, cachep->size, cachep->gfporder);
for_each_kmem_cache_node(cachep, node, n) {
unsigned long active_objs = 0, num_objs = 0, free_objects = 0;
unsigned long active_slabs = 0, num_slabs = 0;
spin_lock_irqsave(&n->list_lock, flags);
list_for_each_entry(page, &n->slabs_full, lru) {
active_objs += cachep->num;
active_slabs++;
}
list_for_each_entry(page, &n->slabs_partial, lru) {
active_objs += page->active;
active_slabs++;
}
list_for_each_entry(page, &n->slabs_free, lru)
num_slabs++;
free_objects += n->free_objects;
spin_unlock_irqrestore(&n->list_lock, flags);
num_slabs += active_slabs;
num_objs = num_slabs * cachep->num;
printk(KERN_WARNING
" node %d: slabs: %ld/%ld, objs: %ld/%ld, free: %ld\n",
node, active_slabs, num_slabs, active_objs, num_objs,
free_objects);
}
#endif
}
/*
* Interface to system's page allocator. No need to hold the
* kmem_cache_node ->list_lock.
*
* If we requested dmaable memory, we will get it. Even if we
* did not request dmaable memory, we might get it, but that
* would be relatively rare and ignorable.
*/
static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
int nodeid)
{
struct page *page;
int nr_pages;
[PATCH] GFP_THISNODE for the slab allocator This patch insures that the slab node lists in the NUMA case only contain slabs that belong to that specific node. All slab allocations use GFP_THISNODE when calling into the page allocator. If an allocation fails then we fall back in the slab allocator according to the zonelists appropriate for a certain context. This allows a replication of the behavior of alloc_pages and alloc_pages node in the slab layer. Currently allocations requested from the page allocator may be redirected via cpusets to other nodes. This results in remote pages on nodelists and that in turn results in interrupt latency issues during cache draining. Plus the slab is handing out memory as local when it is really remote. Fallback for slab memory allocations will occur within the slab allocator and not in the page allocator. This is necessary in order to be able to use the existing pools of objects on the nodes that we fall back to before adding more pages to a slab. The fallback function insures that the nodes we fall back to obey cpuset restrictions of the current context. We do not allocate objects from outside of the current cpuset context like before. Note that the implementation of locality constraints within the slab allocator requires importing logic from the page allocator. This is a mischmash that is not that great. Other allocators (uncached allocator, vmalloc, huge pages) face similar problems and have similar minimal reimplementations of the basic fallback logic of the page allocator. There is another way of implementing a slab by avoiding per node lists (see modular slab) but this wont work within the existing slab. V1->V2: - Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA - Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another #ifdef [akpm@osdl.org: build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 16:50:08 +08:00
flags |= cachep->allocflags;
if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
flags |= __GFP_RECLAIMABLE;
mm: rename alloc_pages_exact_node() to __alloc_pages_node() alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page allocator: do not check NUMA node ID when the caller knows the node is valid") as an optimized variant of alloc_pages_node(), that doesn't fallback to current node for nid == NUMA_NO_NODE. Unfortunately the name of the function can easily suggest that the allocation is restricted to the given node and fails otherwise. In truth, the node is only preferred, unless __GFP_THISNODE is passed among the gfp flags. The misleading name has lead to mistakes in the past, see for example commits 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node") and b360edb43f8e ("mm, mempolicy: migrate_to_node should only migrate to node"). Another issue with the name is that there's a family of alloc_pages_exact*() functions where 'exact' means exact size (instead of page order), which leads to more confusion. To prevent further mistakes, this patch effectively renames alloc_pages_exact_node() to __alloc_pages_node() to better convey that it's an optimized variant of alloc_pages_node() not intended for general usage. Both functions get described in comments. It has been also considered to really provide a convenience function for allocations restricted to a node, but the major opinion seems to be that __GFP_THISNODE already provides that functionality and we shouldn't duplicate the API needlessly. The number of users would be small anyway. Existing callers of alloc_pages_exact_node() are simply converted to call __alloc_pages_node(), with the exception of sba_alloc_coherent() which open-codes the check for NUMA_NO_NODE, so it is converted to use alloc_pages_node() instead. This means it no longer performs some VM_BUG_ON checks, and since the current check for nid in alloc_pages_node() uses a 'nid < 0' comparison (which includes NUMA_NO_NODE), it may hide wrong values which would be previously exposed. Both differences will be rectified by the next patch. To sum up, this patch makes no functional changes, except temporarily hiding potentially buggy callers. Restricting the checks in alloc_pages_node() is left for the next patch which can in turn expose more existing buggy callers. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Robin Holt <robinmholt@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Cliff Whickman <cpw@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09 06:03:50 +08:00
page = __alloc_pages_node(nodeid, flags | __GFP_NOTRACK, cachep->gfporder);
if (!page) {
slab_out_of_memory(cachep, flags, nodeid);
return NULL;
}
memcg: unify slab and other kmem pages charging We have memcg_kmem_charge and memcg_kmem_uncharge methods for charging and uncharging kmem pages to memcg, but currently they are not used for charging slab pages (i.e. they are only used for charging pages allocated with alloc_kmem_pages). The only reason why the slab subsystem uses special helpers, memcg_charge_slab and memcg_uncharge_slab, is that it needs to charge to the memcg of kmem cache while memcg_charge_kmem charges to the memcg that the current task belongs to. To remove this diversity, this patch adds an extra argument to __memcg_kmem_charge that can be a pointer to a memcg or NULL. If it is not NULL, the function tries to charge to the memcg it points to, otherwise it charge to the current context. Next, it makes the slab subsystem use this function to charge slab pages. Since memcg_charge_kmem and memcg_uncharge_kmem helpers are now used only in __memcg_kmem_charge and __memcg_kmem_uncharge, they are inlined. Since __memcg_kmem_charge stores a pointer to the memcg in the page struct, we don't need memcg_uncharge_slab anymore and can use free_kmem_pages. Besides, one can now detect which memcg a slab page belongs to by reading /proc/kpagecgroup. Note, this patch switches slab to charge-after-alloc design. Since this design is already used for all other memcg charges, it should not make any difference. [hannes@cmpxchg.org: better to have an outer function than a magic parameter for the memcg lookup] Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 10:49:01 +08:00
if (memcg_charge_slab(page, flags, cachep->gfporder, cachep)) {
__free_pages(page, cachep->gfporder);
return NULL;
}
/* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
mm: make page pfmemalloc check more robust Commit c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb") added checks for page->pfmemalloc to __skb_fill_page_desc(): if (page->pfmemalloc && !page->mapping) skb->pfmemalloc = true; It assumes page->mapping == NULL implies that page->pfmemalloc can be trusted. However, __delete_from_page_cache() can set set page->mapping to NULL and leave page->index value alone. Due to being in union, a non-zero page->index will be interpreted as true page->pfmemalloc. So the assumption is invalid if the networking code can see such a page. And it seems it can. We have encountered this with a NFS over loopback setup when such a page is attached to a new skbuf. There is no copying going on in this case so the page confuses __skb_fill_page_desc which interprets the index as pfmemalloc flag and the network stack drops packets that have been allocated using the reserves unless they are to be queued on sockets handling the swapping which is the case here and that leads to hangs when the nfs client waits for a response from the server which has been dropped and thus never arrive. The struct page is already heavily packed so rather than finding another hole to put it in, let's do a trick instead. We can reuse the index again but define it to an impossible value (-1UL). This is the page index so it should never see the value that large. Replace all direct users of page->pfmemalloc by page_is_pfmemalloc which will hide this nastiness from unspoiled eyes. The information will get lost if somebody wants to use page->index obviously but that was the case before and the original code expected that the information should be persisted somewhere else if that is really needed (e.g. what SLAB and SLUB do). [akpm@linux-foundation.org: fix blooper in slub] Fixes: c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb") Signed-off-by: Michal Hocko <mhocko@suse.com> Debugged-by: Vlastimil Babka <vbabka@suse.com> Debugged-by: Jiri Bohac <jbohac@suse.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Acked-by: Mel Gorman <mgorman@suse.de> Cc: <stable@vger.kernel.org> [3.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-22 05:11:51 +08:00
if (page_is_pfmemalloc(page))
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
pfmemalloc_active = true;
nr_pages = (1 << cachep->gfporder);
if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
add_zone_page_state(page_zone(page),
NR_SLAB_RECLAIMABLE, nr_pages);
else
add_zone_page_state(page_zone(page),
NR_SLAB_UNRECLAIMABLE, nr_pages);
__SetPageSlab(page);
mm: make page pfmemalloc check more robust Commit c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb") added checks for page->pfmemalloc to __skb_fill_page_desc(): if (page->pfmemalloc && !page->mapping) skb->pfmemalloc = true; It assumes page->mapping == NULL implies that page->pfmemalloc can be trusted. However, __delete_from_page_cache() can set set page->mapping to NULL and leave page->index value alone. Due to being in union, a non-zero page->index will be interpreted as true page->pfmemalloc. So the assumption is invalid if the networking code can see such a page. And it seems it can. We have encountered this with a NFS over loopback setup when such a page is attached to a new skbuf. There is no copying going on in this case so the page confuses __skb_fill_page_desc which interprets the index as pfmemalloc flag and the network stack drops packets that have been allocated using the reserves unless they are to be queued on sockets handling the swapping which is the case here and that leads to hangs when the nfs client waits for a response from the server which has been dropped and thus never arrive. The struct page is already heavily packed so rather than finding another hole to put it in, let's do a trick instead. We can reuse the index again but define it to an impossible value (-1UL). This is the page index so it should never see the value that large. Replace all direct users of page->pfmemalloc by page_is_pfmemalloc which will hide this nastiness from unspoiled eyes. The information will get lost if somebody wants to use page->index obviously but that was the case before and the original code expected that the information should be persisted somewhere else if that is really needed (e.g. what SLAB and SLUB do). [akpm@linux-foundation.org: fix blooper in slub] Fixes: c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb") Signed-off-by: Michal Hocko <mhocko@suse.com> Debugged-by: Vlastimil Babka <vbabka@suse.com> Debugged-by: Jiri Bohac <jbohac@suse.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Acked-by: Mel Gorman <mgorman@suse.de> Cc: <stable@vger.kernel.org> [3.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-22 05:11:51 +08:00
if (page_is_pfmemalloc(page))
SetPageSlabPfmemalloc(page);
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) {
kmemcheck_alloc_shadow(page, cachep->gfporder, flags, nodeid);
if (cachep->ctor)
kmemcheck_mark_uninitialized_pages(page, nr_pages);
else
kmemcheck_mark_unallocated_pages(page, nr_pages);
}
return page;
}
/*
* Interface to system's page release.
*/
static void kmem_freepages(struct kmem_cache *cachep, struct page *page)
{
const unsigned long nr_freed = (1 << cachep->gfporder);
kmemcheck_free_shadow(page, cachep->gfporder);
if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
sub_zone_page_state(page_zone(page),
NR_SLAB_RECLAIMABLE, nr_freed);
else
sub_zone_page_state(page_zone(page),
NR_SLAB_UNRECLAIMABLE, nr_freed);
BUG_ON(!PageSlab(page));
__ClearPageSlabPfmemalloc(page);
__ClearPageSlab(page);
page_mapcount_reset(page);
page->mapping = NULL;
if (current->reclaim_state)
current->reclaim_state->reclaimed_slab += nr_freed;
memcg: unify slab and other kmem pages charging We have memcg_kmem_charge and memcg_kmem_uncharge methods for charging and uncharging kmem pages to memcg, but currently they are not used for charging slab pages (i.e. they are only used for charging pages allocated with alloc_kmem_pages). The only reason why the slab subsystem uses special helpers, memcg_charge_slab and memcg_uncharge_slab, is that it needs to charge to the memcg of kmem cache while memcg_charge_kmem charges to the memcg that the current task belongs to. To remove this diversity, this patch adds an extra argument to __memcg_kmem_charge that can be a pointer to a memcg or NULL. If it is not NULL, the function tries to charge to the memcg it points to, otherwise it charge to the current context. Next, it makes the slab subsystem use this function to charge slab pages. Since memcg_charge_kmem and memcg_uncharge_kmem helpers are now used only in __memcg_kmem_charge and __memcg_kmem_uncharge, they are inlined. Since __memcg_kmem_charge stores a pointer to the memcg in the page struct, we don't need memcg_uncharge_slab anymore and can use free_kmem_pages. Besides, one can now detect which memcg a slab page belongs to by reading /proc/kpagecgroup. Note, this patch switches slab to charge-after-alloc design. Since this design is already used for all other memcg charges, it should not make any difference. [hannes@cmpxchg.org: better to have an outer function than a magic parameter for the memcg lookup] Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 10:49:01 +08:00
__free_kmem_pages(page, cachep->gfporder);
}
static void kmem_rcu_free(struct rcu_head *head)
{
struct kmem_cache *cachep;
struct page *page;
page = container_of(head, struct page, rcu_head);
cachep = page->slab_cache;
kmem_freepages(cachep, page);
}
#if DEBUG
#ifdef CONFIG_DEBUG_PAGEALLOC
static void store_stackinfo(struct kmem_cache *cachep, unsigned long *addr,
unsigned long caller)
{
int size = cachep->object_size;
addr = (unsigned long *)&((char *)addr)[obj_offset(cachep)];
if (size < 5 * sizeof(unsigned long))
return;
*addr++ = 0x12345678;
*addr++ = caller;
*addr++ = smp_processor_id();
size -= 3 * sizeof(unsigned long);
{
unsigned long *sptr = &caller;
unsigned long svalue;
while (!kstack_end(sptr)) {
svalue = *sptr++;
if (kernel_text_address(svalue)) {
*addr++ = svalue;
size -= sizeof(unsigned long);
if (size <= sizeof(unsigned long))
break;
}
}
}
*addr++ = 0x87654321;
}
#endif
static void poison_obj(struct kmem_cache *cachep, void *addr, unsigned char val)
{
int size = cachep->object_size;
addr = &((char *)addr)[obj_offset(cachep)];
memset(addr, val, size);
*(unsigned char *)(addr + size - 1) = POISON_END;
}
static void dump_line(char *data, int offset, int limit)
{
int i;
unsigned char error = 0;
int bad_count = 0;
printk(KERN_ERR "%03x: ", offset);
for (i = 0; i < limit; i++) {
if (data[offset + i] != POISON_FREE) {
error = data[offset + i];
bad_count++;
}
}
print_hex_dump(KERN_CONT, "", 0, 16, 1,
&data[offset], limit, 1);
if (bad_count == 1) {
error ^= POISON_FREE;
if (!(error & (error - 1))) {
printk(KERN_ERR "Single bit error detected. Probably "
"bad RAM.\n");
#ifdef CONFIG_X86
printk(KERN_ERR "Run memtest86+ or a similar memory "
"test tool.\n");
#else
printk(KERN_ERR "Run a memory test tool.\n");
#endif
}
}
}
#endif
#if DEBUG
static void print_objinfo(struct kmem_cache *cachep, void *objp, int lines)
{
int i, size;
char *realobj;
if (cachep->flags & SLAB_RED_ZONE) {
Increase slab redzone to 64bits There are two problems with the existing redzone implementation. Firstly, it's causing misalignment of structures which contain a 64-bit integer, such as netfilter's 'struct ipt_entry' -- causing netfilter modules to fail to load because of the misalignment. (In particular, the first check in net/ipv4/netfilter/ip_tables.c::check_entry_size_and_hooks()) On ppc32 and sparc32, amongst others, __alignof__(uint64_t) == 8. With slab debugging, we use 32-bit redzones. And allocated slab objects aren't sufficiently aligned to hold a structure containing a uint64_t. By _just_ setting ARCH_KMALLOC_MINALIGN to __alignof__(u64) we'd disable redzone checks on those architectures. By using 64-bit redzones we avoid that loss of debugging, and also fix the other problem while we're at it. When investigating this, I noticed that on 64-bit platforms we're using a 32-bit value of RED_ACTIVE/RED_INACTIVE in the 64-bit memory location set aside for the redzone. Which means that the four bytes immediately before or after the allocated object at 0x00,0x00,0x00,0x00 for LE and BE machines, respectively. Which is probably not the most useful choice of poison value. One way to fix both of those at once is just to switch to 64-bit redzones in all cases. Signed-off-by: David Woodhouse <dwmw2@infradead.org> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 15:22:59 +08:00
printk(KERN_ERR "Redzone: 0x%llx/0x%llx.\n",
*dbg_redzone1(cachep, objp),
*dbg_redzone2(cachep, objp));
}
if (cachep->flags & SLAB_STORE_USER) {
printk(KERN_ERR "Last user: [<%p>](%pSR)\n",
*dbg_userword(cachep, objp),
*dbg_userword(cachep, objp));
}
realobj = (char *)objp + obj_offset(cachep);
size = cachep->object_size;
for (i = 0; i < size && lines; i += 16, lines--) {
int limit;
limit = 16;
if (i + limit > size)
limit = size - i;
dump_line(realobj, i, limit);
}
}
static void check_poison_obj(struct kmem_cache *cachep, void *objp)
{
char *realobj;
int size, i;
int lines = 0;
realobj = (char *)objp + obj_offset(cachep);
size = cachep->object_size;
for (i = 0; i < size; i++) {
char exp = POISON_FREE;
if (i == size - 1)
exp = POISON_END;
if (realobj[i] != exp) {
int limit;
/* Mismatch ! */
/* Print header */
if (lines == 0) {
printk(KERN_ERR
"Slab corruption (%s): %s start=%p, len=%d\n",
print_tainted(), cachep->name, realobj, size);
print_objinfo(cachep, objp, 0);
}
/* Hexdump the affected line */
i = (i / 16) * 16;
limit = 16;
if (i + limit > size)
limit = size - i;
dump_line(realobj, i, limit);
i += 16;
lines++;
/* Limit to 5 lines */
if (lines > 5)
break;
}
}
if (lines != 0) {
/* Print some data about the neighboring objects, if they
* exist:
*/
struct page *page = virt_to_head_page(objp);
unsigned int objnr;
objnr = obj_to_index(cachep, page, objp);
if (objnr) {
objp = index_to_obj(cachep, page, objnr - 1);
realobj = (char *)objp + obj_offset(cachep);
printk(KERN_ERR "Prev obj: start=%p, len=%d\n",
realobj, size);
print_objinfo(cachep, objp, 2);
}
if (objnr + 1 < cachep->num) {
objp = index_to_obj(cachep, page, objnr + 1);
realobj = (char *)objp + obj_offset(cachep);
printk(KERN_ERR "Next obj: start=%p, len=%d\n",
realobj, size);
print_objinfo(cachep, objp, 2);
}
}
}
#endif
#if DEBUG
static void slab_destroy_debugcheck(struct kmem_cache *cachep,
struct page *page)
{
int i;
for (i = 0; i < cachep->num; i++) {
void *objp = index_to_obj(cachep, page, i);
if (cachep->flags & SLAB_POISON) {
#ifdef CONFIG_DEBUG_PAGEALLOC
if (cachep->size % PAGE_SIZE == 0 &&
OFF_SLAB(cachep))
kernel_map_pages(virt_to_page(objp),
cachep->size / PAGE_SIZE, 1);
else
check_poison_obj(cachep, objp);
#else
check_poison_obj(cachep, objp);
#endif
}
if (cachep->flags & SLAB_RED_ZONE) {
if (*dbg_redzone1(cachep, objp) != RED_INACTIVE)
slab_error(cachep, "start of a freed object "
"was overwritten");
if (*dbg_redzone2(cachep, objp) != RED_INACTIVE)
slab_error(cachep, "end of a freed object "
"was overwritten");
}
}
}
#else
static void slab_destroy_debugcheck(struct kmem_cache *cachep,
struct page *page)
{
}
#endif
/**
* slab_destroy - destroy and release all objects in a slab
* @cachep: cache pointer being destroyed
* @page: page pointer being destroyed
*
* Destroy all the objs in a slab page, and release the mem back to the system.
* Before calling the slab page must have been unlinked from the cache. The
* kmem_cache_node ->list_lock is not held/needed.
*/
static void slab_destroy(struct kmem_cache *cachep, struct page *page)
{
void *freelist;
freelist = page->freelist;
slab_destroy_debugcheck(cachep, page);
if (unlikely(cachep->flags & SLAB_DESTROY_BY_RCU))
call_rcu(&page->rcu_head, kmem_rcu_free);
else
kmem_freepages(cachep, page);
/*
* From now on, we don't use freelist
* although actual page can be freed in rcu context
*/
if (OFF_SLAB(cachep))
kmem_cache_free(cachep->freelist_cache, freelist);
}
static void slabs_destroy(struct kmem_cache *cachep, struct list_head *list)
{
struct page *page, *n;
list_for_each_entry_safe(page, n, list, lru) {
list_del(&page->lru);
slab_destroy(cachep, page);
}
}
/**
* calculate_slab_order - calculate size (page order) of slabs
* @cachep: pointer to the cache that is being created
* @size: size of objects to be created in this cache.
* @align: required alignment for the objects.
* @flags: slab allocation flags
*
* Also calculates the number of objects per slab.
*
* This could be made much more intelligent. For now, try to avoid using
* high order pages for slabs. When the gfp() functions are more friendly
* towards high-order requests, this should be changed.
*/
static size_t calculate_slab_order(struct kmem_cache *cachep,
size_t size, size_t align, unsigned long flags)
{
[PATCH] slab.c: fix offslab_limit bug mm/slab.c's offlab_limit logic is totally broken. Firstly, "offslab_limit" is a global variable while it should either be calculated in situ or should be passed in as a parameter. Secondly, the more serious problem with it is that the condition for calculating it: if (!(OFF_SLAB(sizes->cs_cachep))) { offslab_limit = sizes->cs_size - sizeof(struct slab); offslab_limit /= sizeof(kmem_bufctl_t); is in total disconnect with the condition that makes use of it: /* More than offslab_limit objects will cause problems */ if ((flags & CFLGS_OFF_SLAB) && num > offslab_limit) break; but due to offslab_limit being a global variable this breakage was hidden. Up until lockdep came along and perturbed the slab sizes sufficiently so that the first off-slab cache would still see a (non-calculated) zero value for offslab_limit and would panic with: kmem_cache_create: couldn't create cache size-512. Call Trace: [<ffffffff8020a5b9>] show_trace+0x96/0x1c8 [<ffffffff8020a8f0>] dump_stack+0x13/0x15 [<ffffffff8022994f>] panic+0x39/0x21a [<ffffffff80270814>] kmem_cache_create+0x5a0/0x5d0 [<ffffffff80aced62>] kmem_cache_init+0x193/0x379 [<ffffffff80abf779>] start_kernel+0x17f/0x218 [<ffffffff80abf263>] _sinittext+0x263/0x26a Kernel panic - not syncing: kmem_cache_create(): failed to create slab `size-512' Paolo Ornati's config on x86_64 managed to trigger it. The fix is to move the calculation to the place that makes use of it. This also makes slab.o 54 bytes smaller. Btw., the check itself is quite silly. Its intention is to test whether the number of objects per slab would be higher than the number of slab control pointers possible. In theory it could be triggered: if someone tried to allocate 4-byte objects cache and explicitly requested with CFLGS_OFF_SLAB. So i kept the check. Out of historic interest i checked how old this bug was and it's ancient, 10 years old! It is the oldest hidden and then truly triggering bugs i ever saw being fixed in the kernel! Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-02 21:44:58 +08:00
unsigned long offslab_limit;
size_t left_over = 0;
int gfporder;
for (gfporder = 0; gfporder <= KMALLOC_MAX_ORDER; gfporder++) {
unsigned int num;
size_t remainder;
cache_estimate(gfporder, size, align, flags, &remainder, &num);
if (!num)
continue;
/* Can't handle number of objects more than SLAB_OBJ_MAX_NUM */
if (num > SLAB_OBJ_MAX_NUM)
break;
[PATCH] slab.c: fix offslab_limit bug mm/slab.c's offlab_limit logic is totally broken. Firstly, "offslab_limit" is a global variable while it should either be calculated in situ or should be passed in as a parameter. Secondly, the more serious problem with it is that the condition for calculating it: if (!(OFF_SLAB(sizes->cs_cachep))) { offslab_limit = sizes->cs_size - sizeof(struct slab); offslab_limit /= sizeof(kmem_bufctl_t); is in total disconnect with the condition that makes use of it: /* More than offslab_limit objects will cause problems */ if ((flags & CFLGS_OFF_SLAB) && num > offslab_limit) break; but due to offslab_limit being a global variable this breakage was hidden. Up until lockdep came along and perturbed the slab sizes sufficiently so that the first off-slab cache would still see a (non-calculated) zero value for offslab_limit and would panic with: kmem_cache_create: couldn't create cache size-512. Call Trace: [<ffffffff8020a5b9>] show_trace+0x96/0x1c8 [<ffffffff8020a8f0>] dump_stack+0x13/0x15 [<ffffffff8022994f>] panic+0x39/0x21a [<ffffffff80270814>] kmem_cache_create+0x5a0/0x5d0 [<ffffffff80aced62>] kmem_cache_init+0x193/0x379 [<ffffffff80abf779>] start_kernel+0x17f/0x218 [<ffffffff80abf263>] _sinittext+0x263/0x26a Kernel panic - not syncing: kmem_cache_create(): failed to create slab `size-512' Paolo Ornati's config on x86_64 managed to trigger it. The fix is to move the calculation to the place that makes use of it. This also makes slab.o 54 bytes smaller. Btw., the check itself is quite silly. Its intention is to test whether the number of objects per slab would be higher than the number of slab control pointers possible. In theory it could be triggered: if someone tried to allocate 4-byte objects cache and explicitly requested with CFLGS_OFF_SLAB. So i kept the check. Out of historic interest i checked how old this bug was and it's ancient, 10 years old! It is the oldest hidden and then truly triggering bugs i ever saw being fixed in the kernel! Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-02 21:44:58 +08:00
if (flags & CFLGS_OFF_SLAB) {
slab: fix oops when reading /proc/slab_allocators Commit b1cb0982bdd6 ("change the management method of free objects of the slab") introduced a bug on slab leak detector ('/proc/slab_allocators'). This detector works like as following decription. 1. traverse all objects on all the slabs. 2. determine whether it is active or not. 3. if active, print who allocate this object. but that commit changed the way how to manage free objects, so the logic determining whether it is active or not is also changed. In before, we regard object in cpu caches as inactive one, but, with this commit, we mistakenly regard object in cpu caches as active one. This intoduces kernel oops if DEBUG_PAGEALLOC is enabled. If DEBUG_PAGEALLOC is enabled, kernel_map_pages() is used to detect who corrupt free memory in the slab. It unmaps page table mapping if object is free and map it if object is active. When slab leak detector check object in cpu caches, it mistakenly think this object active so try to access object memory to retrieve caller of allocation. At this point, page table mapping to this object doesn't exist, so oops occurs. Following is oops message reported from Dave. It blew up when something tried to read /proc/slab_allocators (Just cat it, and you should see the oops below) Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: [snip...] CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131 task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000 RIP: 0010:[<ffffffffaa1a8f4a>] [<ffffffffaa1a8f4a>] handle_slab+0x8a/0x180 RSP: 0018:ffff880076925de0 EFLAGS: 00010002 RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7 RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000 RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000 R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0 FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0 DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602 Call Trace: leaks_show+0xce/0x240 seq_read+0x28e/0x490 proc_reg_read+0x3d/0x80 vfs_read+0x9b/0x160 SyS_read+0x58/0xb0 tracesys+0xd4/0xd9 Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 <4d> 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46 RIP handle_slab+0x8a/0x180 To fix the problem, I introduce an object status buffer on each slab. With this, we can track object status precisely, so slab leak detector would not access active object and no kernel oops would occur. Memory overhead caused by this fix is only imposed to CONFIG_DEBUG_SLAB_LEAK which is mainly used for debugging, so memory overhead isn't big problem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Dave Jones <davej@redhat.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-24 04:22:06 +08:00
size_t freelist_size_per_obj = sizeof(freelist_idx_t);
[PATCH] slab.c: fix offslab_limit bug mm/slab.c's offlab_limit logic is totally broken. Firstly, "offslab_limit" is a global variable while it should either be calculated in situ or should be passed in as a parameter. Secondly, the more serious problem with it is that the condition for calculating it: if (!(OFF_SLAB(sizes->cs_cachep))) { offslab_limit = sizes->cs_size - sizeof(struct slab); offslab_limit /= sizeof(kmem_bufctl_t); is in total disconnect with the condition that makes use of it: /* More than offslab_limit objects will cause problems */ if ((flags & CFLGS_OFF_SLAB) && num > offslab_limit) break; but due to offslab_limit being a global variable this breakage was hidden. Up until lockdep came along and perturbed the slab sizes sufficiently so that the first off-slab cache would still see a (non-calculated) zero value for offslab_limit and would panic with: kmem_cache_create: couldn't create cache size-512. Call Trace: [<ffffffff8020a5b9>] show_trace+0x96/0x1c8 [<ffffffff8020a8f0>] dump_stack+0x13/0x15 [<ffffffff8022994f>] panic+0x39/0x21a [<ffffffff80270814>] kmem_cache_create+0x5a0/0x5d0 [<ffffffff80aced62>] kmem_cache_init+0x193/0x379 [<ffffffff80abf779>] start_kernel+0x17f/0x218 [<ffffffff80abf263>] _sinittext+0x263/0x26a Kernel panic - not syncing: kmem_cache_create(): failed to create slab `size-512' Paolo Ornati's config on x86_64 managed to trigger it. The fix is to move the calculation to the place that makes use of it. This also makes slab.o 54 bytes smaller. Btw., the check itself is quite silly. Its intention is to test whether the number of objects per slab would be higher than the number of slab control pointers possible. In theory it could be triggered: if someone tried to allocate 4-byte objects cache and explicitly requested with CFLGS_OFF_SLAB. So i kept the check. Out of historic interest i checked how old this bug was and it's ancient, 10 years old! It is the oldest hidden and then truly triggering bugs i ever saw being fixed in the kernel! Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-02 21:44:58 +08:00
/*
* Max number of objs-per-slab for caches which
* use off-slab slabs. Needed to avoid a possible
* looping condition in cache_grow().
*/
slab: fix oops when reading /proc/slab_allocators Commit b1cb0982bdd6 ("change the management method of free objects of the slab") introduced a bug on slab leak detector ('/proc/slab_allocators'). This detector works like as following decription. 1. traverse all objects on all the slabs. 2. determine whether it is active or not. 3. if active, print who allocate this object. but that commit changed the way how to manage free objects, so the logic determining whether it is active or not is also changed. In before, we regard object in cpu caches as inactive one, but, with this commit, we mistakenly regard object in cpu caches as active one. This intoduces kernel oops if DEBUG_PAGEALLOC is enabled. If DEBUG_PAGEALLOC is enabled, kernel_map_pages() is used to detect who corrupt free memory in the slab. It unmaps page table mapping if object is free and map it if object is active. When slab leak detector check object in cpu caches, it mistakenly think this object active so try to access object memory to retrieve caller of allocation. At this point, page table mapping to this object doesn't exist, so oops occurs. Following is oops message reported from Dave. It blew up when something tried to read /proc/slab_allocators (Just cat it, and you should see the oops below) Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: [snip...] CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131 task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000 RIP: 0010:[<ffffffffaa1a8f4a>] [<ffffffffaa1a8f4a>] handle_slab+0x8a/0x180 RSP: 0018:ffff880076925de0 EFLAGS: 00010002 RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7 RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000 RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000 R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0 FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0 DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602 Call Trace: leaks_show+0xce/0x240 seq_read+0x28e/0x490 proc_reg_read+0x3d/0x80 vfs_read+0x9b/0x160 SyS_read+0x58/0xb0 tracesys+0xd4/0xd9 Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 <4d> 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46 RIP handle_slab+0x8a/0x180 To fix the problem, I introduce an object status buffer on each slab. With this, we can track object status precisely, so slab leak detector would not access active object and no kernel oops would occur. Memory overhead caused by this fix is only imposed to CONFIG_DEBUG_SLAB_LEAK which is mainly used for debugging, so memory overhead isn't big problem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Dave Jones <davej@redhat.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-24 04:22:06 +08:00
if (IS_ENABLED(CONFIG_DEBUG_SLAB_LEAK))
freelist_size_per_obj += sizeof(char);
offslab_limit = size;
slab: fix oops when reading /proc/slab_allocators Commit b1cb0982bdd6 ("change the management method of free objects of the slab") introduced a bug on slab leak detector ('/proc/slab_allocators'). This detector works like as following decription. 1. traverse all objects on all the slabs. 2. determine whether it is active or not. 3. if active, print who allocate this object. but that commit changed the way how to manage free objects, so the logic determining whether it is active or not is also changed. In before, we regard object in cpu caches as inactive one, but, with this commit, we mistakenly regard object in cpu caches as active one. This intoduces kernel oops if DEBUG_PAGEALLOC is enabled. If DEBUG_PAGEALLOC is enabled, kernel_map_pages() is used to detect who corrupt free memory in the slab. It unmaps page table mapping if object is free and map it if object is active. When slab leak detector check object in cpu caches, it mistakenly think this object active so try to access object memory to retrieve caller of allocation. At this point, page table mapping to this object doesn't exist, so oops occurs. Following is oops message reported from Dave. It blew up when something tried to read /proc/slab_allocators (Just cat it, and you should see the oops below) Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: [snip...] CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131 task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000 RIP: 0010:[<ffffffffaa1a8f4a>] [<ffffffffaa1a8f4a>] handle_slab+0x8a/0x180 RSP: 0018:ffff880076925de0 EFLAGS: 00010002 RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7 RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000 RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000 R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0 FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0 DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602 Call Trace: leaks_show+0xce/0x240 seq_read+0x28e/0x490 proc_reg_read+0x3d/0x80 vfs_read+0x9b/0x160 SyS_read+0x58/0xb0 tracesys+0xd4/0xd9 Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 <4d> 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46 RIP handle_slab+0x8a/0x180 To fix the problem, I introduce an object status buffer on each slab. With this, we can track object status precisely, so slab leak detector would not access active object and no kernel oops would occur. Memory overhead caused by this fix is only imposed to CONFIG_DEBUG_SLAB_LEAK which is mainly used for debugging, so memory overhead isn't big problem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Dave Jones <davej@redhat.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-24 04:22:06 +08:00
offslab_limit /= freelist_size_per_obj;
[PATCH] slab.c: fix offslab_limit bug mm/slab.c's offlab_limit logic is totally broken. Firstly, "offslab_limit" is a global variable while it should either be calculated in situ or should be passed in as a parameter. Secondly, the more serious problem with it is that the condition for calculating it: if (!(OFF_SLAB(sizes->cs_cachep))) { offslab_limit = sizes->cs_size - sizeof(struct slab); offslab_limit /= sizeof(kmem_bufctl_t); is in total disconnect with the condition that makes use of it: /* More than offslab_limit objects will cause problems */ if ((flags & CFLGS_OFF_SLAB) && num > offslab_limit) break; but due to offslab_limit being a global variable this breakage was hidden. Up until lockdep came along and perturbed the slab sizes sufficiently so that the first off-slab cache would still see a (non-calculated) zero value for offslab_limit and would panic with: kmem_cache_create: couldn't create cache size-512. Call Trace: [<ffffffff8020a5b9>] show_trace+0x96/0x1c8 [<ffffffff8020a8f0>] dump_stack+0x13/0x15 [<ffffffff8022994f>] panic+0x39/0x21a [<ffffffff80270814>] kmem_cache_create+0x5a0/0x5d0 [<ffffffff80aced62>] kmem_cache_init+0x193/0x379 [<ffffffff80abf779>] start_kernel+0x17f/0x218 [<ffffffff80abf263>] _sinittext+0x263/0x26a Kernel panic - not syncing: kmem_cache_create(): failed to create slab `size-512' Paolo Ornati's config on x86_64 managed to trigger it. The fix is to move the calculation to the place that makes use of it. This also makes slab.o 54 bytes smaller. Btw., the check itself is quite silly. Its intention is to test whether the number of objects per slab would be higher than the number of slab control pointers possible. In theory it could be triggered: if someone tried to allocate 4-byte objects cache and explicitly requested with CFLGS_OFF_SLAB. So i kept the check. Out of historic interest i checked how old this bug was and it's ancient, 10 years old! It is the oldest hidden and then truly triggering bugs i ever saw being fixed in the kernel! Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-02 21:44:58 +08:00
if (num > offslab_limit)
break;
}
/* Found something acceptable - save it away */
cachep->num = num;
cachep->gfporder = gfporder;
left_over = remainder;
/*
* A VFS-reclaimable slab tends to have most allocations
* as GFP_NOFS and we really don't want to have to be allocating
* higher-order pages when we are unable to shrink dcache.
*/
if (flags & SLAB_RECLAIM_ACCOUNT)
break;
/*
* Large number of objects is good, but very large slabs are
* currently bad for the gfp()s.
*/
if (gfporder >= slab_max_order)
break;
/*
* Acceptable internal fragmentation?
*/
if (left_over * 8 <= (PAGE_SIZE << gfporder))
break;
}
return left_over;
}
static struct array_cache __percpu *alloc_kmem_cache_cpus(
struct kmem_cache *cachep, int entries, int batchcount)
{
int cpu;
size_t size;
struct array_cache __percpu *cpu_cache;
size = sizeof(void *) * entries + sizeof(struct array_cache);
cpu_cache = __alloc_percpu(size, sizeof(void *));
if (!cpu_cache)
return NULL;
for_each_possible_cpu(cpu) {
init_arraycache(per_cpu_ptr(cpu_cache, cpu),
entries, batchcount);
}
return cpu_cache;
}
slab: setup allocators earlier in the boot sequence This patch makes kmalloc() available earlier in the boot sequence so we can get rid of some bootmem allocations. The bulk of the changes are due to kmem_cache_init() being called with interrupts disabled which requires some changes to allocator boostrap code. Note: 32-bit x86 does WP protect test in mem_init() so we must setup traps before we call mem_init() during boot as reported by Ingo Molnar: We have a hard crash in the WP-protect code: [ 0.000000] Checking if this processor honours the WP bit even in supervisor mode...BUG: Int 14: CR2 ffcff000 [ 0.000000] EDI 00000188 ESI 00000ac7 EBP c17eaf9c ESP c17eaf8c [ 0.000000] EBX 000014e0 EDX 0000000e ECX 01856067 EAX 00000001 [ 0.000000] err 00000003 EIP c10135b1 CS 00000060 flg 00010002 [ 0.000000] Stack: c17eafa8 c17fd410 c16747bc c17eafc4 c17fd7e5 000011fd f8616000 c18237cc [ 0.000000] 00099800 c17bb000 c17eafec c17f1668 000001c5 c17f1322 c166e039 c1822bf0 [ 0.000000] c166e033 c153a014 c18237cc 00020800 c17eaff8 c17f106a 00020800 01ba5003 [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip-02161-g7a74539-dirty #52203 [ 0.000000] Call Trace: [ 0.000000] [<c15357c2>] ? printk+0x14/0x16 [ 0.000000] [<c10135b1>] ? do_test_wp_bit+0x19/0x23 [ 0.000000] [<c17fd410>] ? test_wp_bit+0x26/0x64 [ 0.000000] [<c17fd7e5>] ? mem_init+0x1ba/0x1d8 [ 0.000000] [<c17f1668>] ? start_kernel+0x164/0x2f7 [ 0.000000] [<c17f1322>] ? unknown_bootoption+0x0/0x19c [ 0.000000] [<c17f106a>] ? __init_begin+0x6a/0x6f Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matt Mackall <mpm@selenic.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11 00:40:04 +08:00
static int __init_refok setup_cpu_cache(struct kmem_cache *cachep, gfp_t gfp)
{
if (slab_state >= FULL)
slab: setup allocators earlier in the boot sequence This patch makes kmalloc() available earlier in the boot sequence so we can get rid of some bootmem allocations. The bulk of the changes are due to kmem_cache_init() being called with interrupts disabled which requires some changes to allocator boostrap code. Note: 32-bit x86 does WP protect test in mem_init() so we must setup traps before we call mem_init() during boot as reported by Ingo Molnar: We have a hard crash in the WP-protect code: [ 0.000000] Checking if this processor honours the WP bit even in supervisor mode...BUG: Int 14: CR2 ffcff000 [ 0.000000] EDI 00000188 ESI 00000ac7 EBP c17eaf9c ESP c17eaf8c [ 0.000000] EBX 000014e0 EDX 0000000e ECX 01856067 EAX 00000001 [ 0.000000] err 00000003 EIP c10135b1 CS 00000060 flg 00010002 [ 0.000000] Stack: c17eafa8 c17fd410 c16747bc c17eafc4 c17fd7e5 000011fd f8616000 c18237cc [ 0.000000] 00099800 c17bb000 c17eafec c17f1668 000001c5 c17f1322 c166e039 c1822bf0 [ 0.000000] c166e033 c153a014 c18237cc 00020800 c17eaff8 c17f106a 00020800 01ba5003 [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip-02161-g7a74539-dirty #52203 [ 0.000000] Call Trace: [ 0.000000] [<c15357c2>] ? printk+0x14/0x16 [ 0.000000] [<c10135b1>] ? do_test_wp_bit+0x19/0x23 [ 0.000000] [<c17fd410>] ? test_wp_bit+0x26/0x64 [ 0.000000] [<c17fd7e5>] ? mem_init+0x1ba/0x1d8 [ 0.000000] [<c17f1668>] ? start_kernel+0x164/0x2f7 [ 0.000000] [<c17f1322>] ? unknown_bootoption+0x0/0x19c [ 0.000000] [<c17f106a>] ? __init_begin+0x6a/0x6f Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matt Mackall <mpm@selenic.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11 00:40:04 +08:00
return enable_cpucache(cachep, gfp);
cachep->cpu_cache = alloc_kmem_cache_cpus(cachep, 1, 1);
if (!cachep->cpu_cache)
return 1;
if (slab_state == DOWN) {
/* Creation of first cache (kmem_cache). */
set_up_node(kmem_cache, CACHE_CACHE);
} else if (slab_state == PARTIAL) {
/* For kmem_cache_node */
set_up_node(cachep, SIZE_NODE);
} else {
int node;
for_each_online_node(node) {
cachep->node[node] = kmalloc_node(
sizeof(struct kmem_cache_node), gfp, node);
BUG_ON(!cachep->node[node]);
kmem_cache_node_init(cachep->node[node]);
}
}
cachep->node[numa_mem_id()]->next_reap =
jiffies + REAPTIMEOUT_NODE +
((unsigned long)cachep) % REAPTIMEOUT_NODE;
cpu_cache_get(cachep)->avail = 0;
cpu_cache_get(cachep)->limit = BOOT_CPUCACHE_ENTRIES;
cpu_cache_get(cachep)->batchcount = 1;
cpu_cache_get(cachep)->touched = 0;
cachep->batchcount = 1;
cachep->limit = BOOT_CPUCACHE_ENTRIES;
return 0;
}
unsigned long kmem_cache_flags(unsigned long object_size,
unsigned long flags, const char *name,
void (*ctor)(void *))
{
return flags;
}
struct kmem_cache *
__kmem_cache_alias(const char *name, size_t size, size_t align,
unsigned long flags, void (*ctor)(void *))
{
struct kmem_cache *cachep;
cachep = find_mergeable(size, align, flags, name, ctor);
if (cachep) {
cachep->refcount++;
/*
* Adjust the object sizes so that we clear
* the complete object on kzalloc.
*/
cachep->object_size = max_t(int, cachep->object_size, size);
}
return cachep;
}
/**
* __kmem_cache_create - Create a cache.
* @cachep: cache management descriptor
* @flags: SLAB flags
*
* Returns a ptr to the cache on success, NULL on failure.
* Cannot be called within a int, but can be interrupted.
* The @ctor is run when new pages are allocated by the cache.
*
* The flags are
*
* %SLAB_POISON - Poison the slab with a known test pattern (a5a5a5a5)
* to catch references to uninitialised memory.
*
* %SLAB_RED_ZONE - Insert `Red' zones around the allocated memory to check
* for buffer overruns.
*
* %SLAB_HWCACHE_ALIGN - Align the objects in this cache to a hardware
* cacheline. This can be beneficial if you're counting cycles as closely
* as davem.
*/
int
__kmem_cache_create (struct kmem_cache *cachep, unsigned long flags)
{
size_t left_over, freelist_size;
size_t ralign = BYTES_PER_WORD;
slab: setup allocators earlier in the boot sequence This patch makes kmalloc() available earlier in the boot sequence so we can get rid of some bootmem allocations. The bulk of the changes are due to kmem_cache_init() being called with interrupts disabled which requires some changes to allocator boostrap code. Note: 32-bit x86 does WP protect test in mem_init() so we must setup traps before we call mem_init() during boot as reported by Ingo Molnar: We have a hard crash in the WP-protect code: [ 0.000000] Checking if this processor honours the WP bit even in supervisor mode...BUG: Int 14: CR2 ffcff000 [ 0.000000] EDI 00000188 ESI 00000ac7 EBP c17eaf9c ESP c17eaf8c [ 0.000000] EBX 000014e0 EDX 0000000e ECX 01856067 EAX 00000001 [ 0.000000] err 00000003 EIP c10135b1 CS 00000060 flg 00010002 [ 0.000000] Stack: c17eafa8 c17fd410 c16747bc c17eafc4 c17fd7e5 000011fd f8616000 c18237cc [ 0.000000] 00099800 c17bb000 c17eafec c17f1668 000001c5 c17f1322 c166e039 c1822bf0 [ 0.000000] c166e033 c153a014 c18237cc 00020800 c17eaff8 c17f106a 00020800 01ba5003 [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip-02161-g7a74539-dirty #52203 [ 0.000000] Call Trace: [ 0.000000] [<c15357c2>] ? printk+0x14/0x16 [ 0.000000] [<c10135b1>] ? do_test_wp_bit+0x19/0x23 [ 0.000000] [<c17fd410>] ? test_wp_bit+0x26/0x64 [ 0.000000] [<c17fd7e5>] ? mem_init+0x1ba/0x1d8 [ 0.000000] [<c17f1668>] ? start_kernel+0x164/0x2f7 [ 0.000000] [<c17f1322>] ? unknown_bootoption+0x0/0x19c [ 0.000000] [<c17f106a>] ? __init_begin+0x6a/0x6f Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matt Mackall <mpm@selenic.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11 00:40:04 +08:00
gfp_t gfp;
int err;
size_t size = cachep->size;
#if DEBUG
#if FORCED_DEBUG
/*
* Enable redzoning and last user accounting, except for caches with
* large objects, if the increased size would increase the object size
* above the next power of two: caches with object sizes just above a
* power of two have a significant amount of internal fragmentation.
*/
if (size < 4096 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
2 * sizeof(unsigned long long)))
flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
if (!(flags & SLAB_DESTROY_BY_RCU))
flags |= SLAB_POISON;
#endif
if (flags & SLAB_DESTROY_BY_RCU)
BUG_ON(flags & SLAB_POISON);
#endif
/*
* Check that size is in terms of words. This is needed to avoid
* unaligned accesses for some archs when redzoning is used, and makes
* sure any on-slab bufctl's are also correctly aligned.
*/
if (size & (BYTES_PER_WORD - 1)) {
size += (BYTES_PER_WORD - 1);
size &= ~(BYTES_PER_WORD - 1);
}
if (flags & SLAB_RED_ZONE) {
ralign = REDZONE_ALIGN;
/* If redzoning, ensure that the second redzone is suitably
* aligned, by adjusting the object size accordingly. */
size += REDZONE_ALIGN - 1;
size &= ~(REDZONE_ALIGN - 1);
}
/* 3) caller mandated alignment */
if (ralign < cachep->align) {
ralign = cachep->align;
}
Revert "slab: Fix missing DEBUG_SLAB last user" This reverts commit 5c5e3b33b7cb959a401f823707bee006caadd76e. The commit breaks ARM thusly: | Mount-cache hash table entries: 512 | slab error in verify_redzone_free(): cache `idr_layer_cache': memory outside object was overwritten | Backtrace: | [<c0227088>] (dump_backtrace+0x0/0x110) from [<c0431afc>] (dump_stack+0x18/0x1c) | [<c0431ae4>] (dump_stack+0x0/0x1c) from [<c0293304>] (__slab_error+0x28/0x30) | [<c02932dc>] (__slab_error+0x0/0x30) from [<c0293a74>] (cache_free_debugcheck+0x1c0/0x2b8) | [<c02938b4>] (cache_free_debugcheck+0x0/0x2b8) from [<c0293f78>] (kmem_cache_free+0x3c/0xc0) | [<c0293f3c>] (kmem_cache_free+0x0/0xc0) from [<c032b1c8>] (ida_get_new_above+0x19c/0x1c0) | [<c032b02c>] (ida_get_new_above+0x0/0x1c0) from [<c02af7ec>] (alloc_vfsmnt+0x54/0x144) | [<c02af798>] (alloc_vfsmnt+0x0/0x144) from [<c0299830>] (vfs_kern_mount+0x30/0xec) | [<c0299800>] (vfs_kern_mount+0x0/0xec) from [<c0299908>] (kern_mount_data+0x1c/0x20) | [<c02998ec>] (kern_mount_data+0x0/0x20) from [<c02146c4>] (sysfs_init+0x68/0xc8) | [<c021465c>] (sysfs_init+0x0/0xc8) from [<c02137d4>] (mnt_init+0x90/0x1b0) | [<c0213744>] (mnt_init+0x0/0x1b0) from [<c0213388>] (vfs_caches_init+0x100/0x140) | [<c0213288>] (vfs_caches_init+0x0/0x140) from [<c0208c0c>] (start_kernel+0x2e8/0x368) | [<c0208924>] (start_kernel+0x0/0x368) from [<c0208034>] (__enable_mmu+0x0/0x2c) | c0113268: redzone 1:0xd84156c5c032b3ac, redzone 2:0xd84156c5635688c0. | slab error in cache_alloc_debugcheck_after(): cache `idr_layer_cache': double free, or memory outside object was overwritten | ... | c011307c: redzone 1:0x9f91102ffffffff, redzone 2:0x9f911029d74e35b | slab: Internal list corruption detected in cache 'idr_layer_cache'(24), slabp c0113000(16). Hexdump: | | 000: 20 4f 10 c0 20 4f 10 c0 7c 00 00 00 7c 30 11 c0 | 010: 10 00 00 00 10 00 00 00 00 00 c9 17 fe ff ff ff | 020: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff | 030: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff | 040: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff | 050: fe ff ff ff fe ff ff ff fe ff ff ff 11 00 00 00 | 060: 12 00 00 00 13 00 00 00 14 00 00 00 15 00 00 00 | 070: 16 00 00 00 17 00 00 00 c0 88 56 63 | kernel BUG at /home/rmk/git/linux-2.6-rmk/mm/slab.c:2928! Reference: https://lkml.org/lkml/2011/2/7/238 Cc: <stable@kernel.org> # 2.6.35.y and later Reported-and-analyzed-by: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-02-14 23:46:21 +08:00
/* disable debug if necessary */
if (ralign > __alignof__(unsigned long long))
flags &= ~(SLAB_RED_ZONE | SLAB_STORE_USER);
/*
* 4) Store it.
*/
cachep->align = ralign;
slab: setup allocators earlier in the boot sequence This patch makes kmalloc() available earlier in the boot sequence so we can get rid of some bootmem allocations. The bulk of the changes are due to kmem_cache_init() being called with interrupts disabled which requires some changes to allocator boostrap code. Note: 32-bit x86 does WP protect test in mem_init() so we must setup traps before we call mem_init() during boot as reported by Ingo Molnar: We have a hard crash in the WP-protect code: [ 0.000000] Checking if this processor honours the WP bit even in supervisor mode...BUG: Int 14: CR2 ffcff000 [ 0.000000] EDI 00000188 ESI 00000ac7 EBP c17eaf9c ESP c17eaf8c [ 0.000000] EBX 000014e0 EDX 0000000e ECX 01856067 EAX 00000001 [ 0.000000] err 00000003 EIP c10135b1 CS 00000060 flg 00010002 [ 0.000000] Stack: c17eafa8 c17fd410 c16747bc c17eafc4 c17fd7e5 000011fd f8616000 c18237cc [ 0.000000] 00099800 c17bb000 c17eafec c17f1668 000001c5 c17f1322 c166e039 c1822bf0 [ 0.000000] c166e033 c153a014 c18237cc 00020800 c17eaff8 c17f106a 00020800 01ba5003 [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip-02161-g7a74539-dirty #52203 [ 0.000000] Call Trace: [ 0.000000] [<c15357c2>] ? printk+0x14/0x16 [ 0.000000] [<c10135b1>] ? do_test_wp_bit+0x19/0x23 [ 0.000000] [<c17fd410>] ? test_wp_bit+0x26/0x64 [ 0.000000] [<c17fd7e5>] ? mem_init+0x1ba/0x1d8 [ 0.000000] [<c17f1668>] ? start_kernel+0x164/0x2f7 [ 0.000000] [<c17f1322>] ? unknown_bootoption+0x0/0x19c [ 0.000000] [<c17f106a>] ? __init_begin+0x6a/0x6f Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matt Mackall <mpm@selenic.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11 00:40:04 +08:00
if (slab_is_available())
gfp = GFP_KERNEL;
else
gfp = GFP_NOWAIT;
#if DEBUG
/*
* Both debugging options require word-alignment which is calculated
* into align above.
*/
if (flags & SLAB_RED_ZONE) {
/* add space for red zone words */
Revert "slab: Fix missing DEBUG_SLAB last user" This reverts commit 5c5e3b33b7cb959a401f823707bee006caadd76e. The commit breaks ARM thusly: | Mount-cache hash table entries: 512 | slab error in verify_redzone_free(): cache `idr_layer_cache': memory outside object was overwritten | Backtrace: | [<c0227088>] (dump_backtrace+0x0/0x110) from [<c0431afc>] (dump_stack+0x18/0x1c) | [<c0431ae4>] (dump_stack+0x0/0x1c) from [<c0293304>] (__slab_error+0x28/0x30) | [<c02932dc>] (__slab_error+0x0/0x30) from [<c0293a74>] (cache_free_debugcheck+0x1c0/0x2b8) | [<c02938b4>] (cache_free_debugcheck+0x0/0x2b8) from [<c0293f78>] (kmem_cache_free+0x3c/0xc0) | [<c0293f3c>] (kmem_cache_free+0x0/0xc0) from [<c032b1c8>] (ida_get_new_above+0x19c/0x1c0) | [<c032b02c>] (ida_get_new_above+0x0/0x1c0) from [<c02af7ec>] (alloc_vfsmnt+0x54/0x144) | [<c02af798>] (alloc_vfsmnt+0x0/0x144) from [<c0299830>] (vfs_kern_mount+0x30/0xec) | [<c0299800>] (vfs_kern_mount+0x0/0xec) from [<c0299908>] (kern_mount_data+0x1c/0x20) | [<c02998ec>] (kern_mount_data+0x0/0x20) from [<c02146c4>] (sysfs_init+0x68/0xc8) | [<c021465c>] (sysfs_init+0x0/0xc8) from [<c02137d4>] (mnt_init+0x90/0x1b0) | [<c0213744>] (mnt_init+0x0/0x1b0) from [<c0213388>] (vfs_caches_init+0x100/0x140) | [<c0213288>] (vfs_caches_init+0x0/0x140) from [<c0208c0c>] (start_kernel+0x2e8/0x368) | [<c0208924>] (start_kernel+0x0/0x368) from [<c0208034>] (__enable_mmu+0x0/0x2c) | c0113268: redzone 1:0xd84156c5c032b3ac, redzone 2:0xd84156c5635688c0. | slab error in cache_alloc_debugcheck_after(): cache `idr_layer_cache': double free, or memory outside object was overwritten | ... | c011307c: redzone 1:0x9f91102ffffffff, redzone 2:0x9f911029d74e35b | slab: Internal list corruption detected in cache 'idr_layer_cache'(24), slabp c0113000(16). Hexdump: | | 000: 20 4f 10 c0 20 4f 10 c0 7c 00 00 00 7c 30 11 c0 | 010: 10 00 00 00 10 00 00 00 00 00 c9 17 fe ff ff ff | 020: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff | 030: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff | 040: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff | 050: fe ff ff ff fe ff ff ff fe ff ff ff 11 00 00 00 | 060: 12 00 00 00 13 00 00 00 14 00 00 00 15 00 00 00 | 070: 16 00 00 00 17 00 00 00 c0 88 56 63 | kernel BUG at /home/rmk/git/linux-2.6-rmk/mm/slab.c:2928! Reference: https://lkml.org/lkml/2011/2/7/238 Cc: <stable@kernel.org> # 2.6.35.y and later Reported-and-analyzed-by: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-02-14 23:46:21 +08:00
cachep->obj_offset += sizeof(unsigned long long);
size += 2 * sizeof(unsigned long long);
}
if (flags & SLAB_STORE_USER) {
/* user store requires one word storage behind the end of
* the real object. But if the second red zone needs to be
* aligned to 64 bits, we must allow that much space.
*/
if (flags & SLAB_RED_ZONE)
size += REDZONE_ALIGN;
else
size += BYTES_PER_WORD;
}
#if FORCED_DEBUG && defined(CONFIG_DEBUG_PAGEALLOC)
mm/slab: fix unexpected index mapping result of kmalloc_size(INDEX_NODE+1) Commit description is copied from the original post of this bug: http://comments.gmane.org/gmane.linux.kernel.mm/135349 Kernels after v3.9 use kmalloc_size(INDEX_NODE + 1) to get the next larger cache size than the size index INDEX_NODE mapping. In kernels 3.9 and earlier we used malloc_sizes[INDEX_L3 + 1].cs_size. However, sometimes we can't get the right output we expected via kmalloc_size(INDEX_NODE + 1), causing a BUG(). The mapping table in the latest kernel is like: index = {0, 1, 2 , 3, 4, 5, 6, n} size = {0, 96, 192, 8, 16, 32, 64, 2^n} The mapping table before 3.10 is like this: index = {0 , 1 , 2, 3, 4 , 5 , 6, n} size = {32, 64, 96, 128, 192, 256, 512, 2^(n+3)} The problem on my mips64 machine is as follows: (1) When configured DEBUG_SLAB && DEBUG_PAGEALLOC && DEBUG_LOCK_ALLOC && DEBUG_SPINLOCK, the sizeof(struct kmem_cache_node) will be "150", and the macro INDEX_NODE turns out to be "2": #define INDEX_NODE kmalloc_index(sizeof(struct kmem_cache_node)) (2) Then the result of kmalloc_size(INDEX_NODE + 1) is 8. (3) Then "if(size >= kmalloc_size(INDEX_NODE + 1)" will lead to "size = PAGE_SIZE". (4) Then "if ((size >= (PAGE_SIZE >> 3))" test will be satisfied and "flags |= CFLGS_OFF_SLAB" will be covered. (5) if (flags & CFLGS_OFF_SLAB)" test will be satisfied and will go to "cachep->slabp_cache = kmalloc_slab(slab_size, 0u)", and the result here may be NULL while kernel bootup. (6) Finally,"BUG_ON(ZERO_OR_NULL_PTR(cachep->slabp_cache));" causes the BUG info as the following shows (may be only mips64 has this problem): This patch fixes the problem of kmalloc_size(INDEX_NODE + 1) and removes the BUG by adding 'size >= 256' check to guarantee that all necessary small sized slabs are initialized regardless sequence of slab size in mapping table. Fixes: e33660165c90 ("slab: Use common kmalloc_index/kmalloc_size...") Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Liuhailong <liu.hailong6@zte.com.cn> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-02 06:36:54 +08:00
/*
* To activate debug pagealloc, off-slab management is necessary
* requirement. In early phase of initialization, small sized slab
* doesn't get initialized so it would not be possible. So, we need
* to check size >= 256. It guarantees that all necessary small
* sized slab is initialized in current slab initialization sequence.
*/
if (!slab_early_init && size >= kmalloc_size(INDEX_NODE) &&
size >= 256 && cachep->object_size > cache_line_size() &&
ALIGN(size, cachep->align) < PAGE_SIZE) {
cachep->obj_offset += PAGE_SIZE - ALIGN(size, cachep->align);
size = PAGE_SIZE;
}
#endif
#endif
/*
* Determine if the slab management is 'on' or 'off' slab.
* (bootstrapping cannot cope with offslab caches so don't do
* it too early on. Always use on-slab management when
* SLAB_NOLEAKTRACE to avoid recursive calls into kmemleak)
*/
mm: slab: only move management objects off-slab for sizes larger than KMALLOC_MIN_SIZE On systems with a KMALLOC_MIN_SIZE of 128 (arm64, some mips and powerpc configurations defining ARCH_DMA_MINALIGN to 128), the first kmalloc_caches[] entry to be initialised after slab_early_init = 0 is "kmalloc-128" with index 7. Depending on the debug kernel configuration, sizeof(struct kmem_cache) can be larger than 128 resulting in an INDEX_NODE of 8. Commit 8fc9cf420b36 ("slab: make more slab management structure off the slab") enables off-slab management objects for sizes starting with PAGE_SIZE >> 5 (128 bytes for a 4KB page configuration) and the creation of the "kmalloc-128" cache would try to place the management objects off-slab. However, since KMALLOC_MIN_SIZE is already 128 and freelist_size == 32 in __kmem_cache_create(), kmalloc_slab(freelist_size) returns NULL (kmalloc_caches[7] not populated yet). This triggers the following bug on arm64: kernel BUG at /work/Linux/linux-2.6-aarch64/mm/slab.c:2283! Internal error: Oops - BUG: 0 [#1] SMP Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 4.3.0-rc4+ #540 Hardware name: Juno (DT) PC is at __kmem_cache_create+0x21c/0x280 LR is at __kmem_cache_create+0x210/0x280 [...] Call trace: __kmem_cache_create+0x21c/0x280 create_boot_cache+0x48/0x80 create_kmalloc_cache+0x50/0x88 create_kmalloc_caches+0x4c/0xf4 kmem_cache_init+0x100/0x118 start_kernel+0x214/0x33c This patch introduces an OFF_SLAB_MIN_SIZE definition to avoid off-slab management objects for sizes equal to or smaller than KMALLOC_MIN_SIZE. Fixes: 8fc9cf420b36 ("slab: make more slab management structure off the slab") Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> [3.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 10:45:54 +08:00
if (size >= OFF_SLAB_MIN_SIZE && !slab_early_init &&
!(flags & SLAB_NOLEAKTRACE))
/*
* Size is large, assume best to place the slab management obj
* off-slab (should allow better packing of objs).
*/
flags |= CFLGS_OFF_SLAB;
size = ALIGN(size, cachep->align);
/*
* We should restrict the number of objects in a slab to implement
* byte sized index. Refer comment on SLAB_OBJ_MIN_SIZE definition.
*/
if (FREELIST_BYTE_INDEX && size < SLAB_OBJ_MIN_SIZE)
size = ALIGN(SLAB_OBJ_MIN_SIZE, cachep->align);
left_over = calculate_slab_order(cachep, size, cachep->align, flags);
if (!cachep->num)
return -E2BIG;
slab: fix oops when reading /proc/slab_allocators Commit b1cb0982bdd6 ("change the management method of free objects of the slab") introduced a bug on slab leak detector ('/proc/slab_allocators'). This detector works like as following decription. 1. traverse all objects on all the slabs. 2. determine whether it is active or not. 3. if active, print who allocate this object. but that commit changed the way how to manage free objects, so the logic determining whether it is active or not is also changed. In before, we regard object in cpu caches as inactive one, but, with this commit, we mistakenly regard object in cpu caches as active one. This intoduces kernel oops if DEBUG_PAGEALLOC is enabled. If DEBUG_PAGEALLOC is enabled, kernel_map_pages() is used to detect who corrupt free memory in the slab. It unmaps page table mapping if object is free and map it if object is active. When slab leak detector check object in cpu caches, it mistakenly think this object active so try to access object memory to retrieve caller of allocation. At this point, page table mapping to this object doesn't exist, so oops occurs. Following is oops message reported from Dave. It blew up when something tried to read /proc/slab_allocators (Just cat it, and you should see the oops below) Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: [snip...] CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131 task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000 RIP: 0010:[<ffffffffaa1a8f4a>] [<ffffffffaa1a8f4a>] handle_slab+0x8a/0x180 RSP: 0018:ffff880076925de0 EFLAGS: 00010002 RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7 RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000 RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000 R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0 FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0 DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602 Call Trace: leaks_show+0xce/0x240 seq_read+0x28e/0x490 proc_reg_read+0x3d/0x80 vfs_read+0x9b/0x160 SyS_read+0x58/0xb0 tracesys+0xd4/0xd9 Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 <4d> 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46 RIP handle_slab+0x8a/0x180 To fix the problem, I introduce an object status buffer on each slab. With this, we can track object status precisely, so slab leak detector would not access active object and no kernel oops would occur. Memory overhead caused by this fix is only imposed to CONFIG_DEBUG_SLAB_LEAK which is mainly used for debugging, so memory overhead isn't big problem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Dave Jones <davej@redhat.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-24 04:22:06 +08:00
freelist_size = calculate_freelist_size(cachep->num, cachep->align);
/*
* If the slab has been placed off-slab, and we have enough space then
* move it on-slab. This is at the expense of any extra colouring.
*/
if (flags & CFLGS_OFF_SLAB && left_over >= freelist_size) {
flags &= ~CFLGS_OFF_SLAB;
left_over -= freelist_size;
}
if (flags & CFLGS_OFF_SLAB) {
/* really off slab. No need for manual alignment */
slab: fix oops when reading /proc/slab_allocators Commit b1cb0982bdd6 ("change the management method of free objects of the slab") introduced a bug on slab leak detector ('/proc/slab_allocators'). This detector works like as following decription. 1. traverse all objects on all the slabs. 2. determine whether it is active or not. 3. if active, print who allocate this object. but that commit changed the way how to manage free objects, so the logic determining whether it is active or not is also changed. In before, we regard object in cpu caches as inactive one, but, with this commit, we mistakenly regard object in cpu caches as active one. This intoduces kernel oops if DEBUG_PAGEALLOC is enabled. If DEBUG_PAGEALLOC is enabled, kernel_map_pages() is used to detect who corrupt free memory in the slab. It unmaps page table mapping if object is free and map it if object is active. When slab leak detector check object in cpu caches, it mistakenly think this object active so try to access object memory to retrieve caller of allocation. At this point, page table mapping to this object doesn't exist, so oops occurs. Following is oops message reported from Dave. It blew up when something tried to read /proc/slab_allocators (Just cat it, and you should see the oops below) Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: [snip...] CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131 task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000 RIP: 0010:[<ffffffffaa1a8f4a>] [<ffffffffaa1a8f4a>] handle_slab+0x8a/0x180 RSP: 0018:ffff880076925de0 EFLAGS: 00010002 RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7 RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000 RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000 R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0 FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0 DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602 Call Trace: leaks_show+0xce/0x240 seq_read+0x28e/0x490 proc_reg_read+0x3d/0x80 vfs_read+0x9b/0x160 SyS_read+0x58/0xb0 tracesys+0xd4/0xd9 Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 <4d> 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46 RIP handle_slab+0x8a/0x180 To fix the problem, I introduce an object status buffer on each slab. With this, we can track object status precisely, so slab leak detector would not access active object and no kernel oops would occur. Memory overhead caused by this fix is only imposed to CONFIG_DEBUG_SLAB_LEAK which is mainly used for debugging, so memory overhead isn't big problem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Dave Jones <davej@redhat.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-24 04:22:06 +08:00
freelist_size = calculate_freelist_size(cachep->num, 0);
#ifdef CONFIG_PAGE_POISONING
/* If we're going to use the generic kernel_map_pages()
* poisoning, then it's going to smash the contents of
* the redzone and userword anyhow, so switch them off.
*/
if (size % PAGE_SIZE == 0 && flags & SLAB_POISON)
flags &= ~(SLAB_RED_ZONE | SLAB_STORE_USER);
#endif
}
cachep->colour_off = cache_line_size();
/* Offset must be a multiple of the alignment. */
if (cachep->colour_off < cachep->align)
cachep->colour_off = cachep->align;
cachep->colour = left_over / cachep->colour_off;
cachep->freelist_size = freelist_size;
cachep->flags = flags;
cachep->allocflags = __GFP_COMP;
if (CONFIG_ZONE_DMA_FLAG && (flags & SLAB_CACHE_DMA))
cachep->allocflags |= GFP_DMA;
cachep->size = size;
2006-12-13 16:34:27 +08:00
cachep->reciprocal_buffer_size = reciprocal_value(size);
if (flags & CFLGS_OFF_SLAB) {
cachep->freelist_cache = kmalloc_slab(freelist_size, 0u);
/*
* This is a possibility for one of the kmalloc_{dma,}_caches.
* But since we go off slab only for object size greater than
mm: slab: only move management objects off-slab for sizes larger than KMALLOC_MIN_SIZE On systems with a KMALLOC_MIN_SIZE of 128 (arm64, some mips and powerpc configurations defining ARCH_DMA_MINALIGN to 128), the first kmalloc_caches[] entry to be initialised after slab_early_init = 0 is "kmalloc-128" with index 7. Depending on the debug kernel configuration, sizeof(struct kmem_cache) can be larger than 128 resulting in an INDEX_NODE of 8. Commit 8fc9cf420b36 ("slab: make more slab management structure off the slab") enables off-slab management objects for sizes starting with PAGE_SIZE >> 5 (128 bytes for a 4KB page configuration) and the creation of the "kmalloc-128" cache would try to place the management objects off-slab. However, since KMALLOC_MIN_SIZE is already 128 and freelist_size == 32 in __kmem_cache_create(), kmalloc_slab(freelist_size) returns NULL (kmalloc_caches[7] not populated yet). This triggers the following bug on arm64: kernel BUG at /work/Linux/linux-2.6-aarch64/mm/slab.c:2283! Internal error: Oops - BUG: 0 [#1] SMP Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 4.3.0-rc4+ #540 Hardware name: Juno (DT) PC is at __kmem_cache_create+0x21c/0x280 LR is at __kmem_cache_create+0x210/0x280 [...] Call trace: __kmem_cache_create+0x21c/0x280 create_boot_cache+0x48/0x80 create_kmalloc_cache+0x50/0x88 create_kmalloc_caches+0x4c/0xf4 kmem_cache_init+0x100/0x118 start_kernel+0x214/0x33c This patch introduces an OFF_SLAB_MIN_SIZE definition to avoid off-slab management objects for sizes equal to or smaller than KMALLOC_MIN_SIZE. Fixes: 8fc9cf420b36 ("slab: make more slab management structure off the slab") Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> [3.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 10:45:54 +08:00
* OFF_SLAB_MIN_SIZE, and kmalloc_{dma,}_caches get created
* in ascending order,this should not happen at all.
* But leave a BUG_ON for some lucky dude.
*/
BUG_ON(ZERO_OR_NULL_PTR(cachep->freelist_cache));
}
err = setup_cpu_cache(cachep, gfp);
if (err) {
__kmem_cache_shutdown(cachep);
return err;
}
return 0;
}
#if DEBUG
static void check_irq_off(void)
{
BUG_ON(!irqs_disabled());
}
static void check_irq_on(void)
{
BUG_ON(irqs_disabled());
}
static void check_spinlock_acquired(struct kmem_cache *cachep)
{
#ifdef CONFIG_SMP
check_irq_off();
assert_spin_locked(&get_node(cachep, numa_mem_id())->list_lock);
#endif
}
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
static void check_spinlock_acquired_node(struct kmem_cache *cachep, int node)
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
{
#ifdef CONFIG_SMP
check_irq_off();
assert_spin_locked(&get_node(cachep, node)->list_lock);
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
#endif
}
#else
#define check_irq_off() do { } while(0)
#define check_irq_on() do { } while(0)
#define check_spinlock_acquired(x) do { } while(0)
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
#define check_spinlock_acquired_node(x, y) do { } while(0)
#endif
static void drain_array(struct kmem_cache *cachep, struct kmem_cache_node *n,
struct array_cache *ac,
int force, int node);
static void do_drain(void *arg)
{
struct kmem_cache *cachep = arg;
struct array_cache *ac;
numa: slab: use numa_mem_id() for slab local memory node Example usage of generic "numa_mem_id()": The mainline slab code, since ~ 2.6.19, does not handle memoryless nodes well. Specifically, the "fast path"--____cache_alloc()--will never succeed as slab doesn't cache offnode object on the per cpu queues, and for memoryless nodes, all memory will be "off node" relative to numa_node_id(). This adds significant overhead to all kmem cache allocations, incurring a significant regression relative to earlier kernels [from before slab.c was reorganized]. This patch uses the generic topology function "numa_mem_id()" to return the "effective local memory node" for the calling context. This is the first node in the local node's generic fallback zonelist-- the same node that "local" mempolicy-based allocations would use. This lets slab cache these "local" allocations and avoid fallback/refill on every allocation. N.B.: Slab will need to handle node and memory hotplug events that could change the value returned by numa_mem_id() for any given node if recent changes to address memory hotplug don't already address this. E.g., flush all per cpu slab queues before rebuilding the zonelists while the "machine" is held in the stopped state. Performance impact on "hackbench 400 process 200" 2.6.34-rc3-mmotm-100405-1609 no-patch this-patch ia64 no memoryless nodes [avg of 10]: 11.713 11.637 ~0.65 diff ia64 cpus all on memless nodes [10]: 228.259 26.484 ~8.6x speedup The slowdown of the patched kernel from ~12 sec to ~28 seconds when configured with memoryless nodes is the result of all cpus allocating from a single node's mm pagepool. The cache lines of the single node are distributed/interleaved over the memory of the real physical nodes, but the zone lock, list heads, ... of the single node with memory still each live in a single cache line that is accessed from all processors. x86_64 [8x6 AMD] [avg of 40]: 2.883 2.845 Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 05:45:03 +08:00
int node = numa_mem_id();
struct kmem_cache_node *n;
LIST_HEAD(list);
check_irq_off();
ac = cpu_cache_get(cachep);
n = get_node(cachep, node);
spin_lock(&n->list_lock);
free_block(cachep, ac->entry, ac->avail, node, &list);
spin_unlock(&n->list_lock);
slabs_destroy(cachep, &list);
ac->avail = 0;
}
static void drain_cpu_caches(struct kmem_cache *cachep)
{
struct kmem_cache_node *n;
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
int node;
on_each_cpu(do_drain, cachep, 1);
check_irq_on();
for_each_kmem_cache_node(cachep, node, n)
if (n->alien)
drain_alien_cache(cachep, n->alien);
for_each_kmem_cache_node(cachep, node, n)
drain_array(cachep, n, n->shared, 1, node);
}
/*
* Remove slabs from the list of free slabs.
* Specify the number of slabs to drain in tofree.
*
* Returns the actual number of slabs released.
*/
static int drain_freelist(struct kmem_cache *cache,
struct kmem_cache_node *n, int tofree)
{
struct list_head *p;
int nr_freed;
struct page *page;
nr_freed = 0;
while (nr_freed < tofree && !list_empty(&n->slabs_free)) {
spin_lock_irq(&n->list_lock);
p = n->slabs_free.prev;
if (p == &n->slabs_free) {
spin_unlock_irq(&n->list_lock);
goto out;
}
page = list_entry(p, struct page, lru);
#if DEBUG
BUG_ON(page->active);
#endif
list_del(&page->lru);
/*
* Safe to drop the lock. The slab is no longer linked
* to the cache.
*/
n->free_objects -= cache->num;
spin_unlock_irq(&n->list_lock);
slab_destroy(cache, page);
nr_freed++;
}
out:
return nr_freed;
}
slub: make dead caches discard free slabs immediately To speed up further allocations SLUB may store empty slabs in per cpu/node partial lists instead of freeing them immediately. This prevents per memcg caches destruction, because kmem caches created for a memory cgroup are only destroyed after the last page charged to the cgroup is freed. To fix this issue, this patch resurrects approach first proposed in [1]. It forbids SLUB to cache empty slabs after the memory cgroup that the cache belongs to was destroyed. It is achieved by setting kmem_cache's cpu_partial and min_partial constants to 0 and tuning put_cpu_partial() so that it would drop frozen empty slabs immediately if cpu_partial = 0. The runtime overhead is minimal. From all the hot functions, we only touch relatively cold put_cpu_partial(): we make it call unfreeze_partials() after freezing a slab that belongs to an offline memory cgroup. Since slab freezing exists to avoid moving slabs from/to a partial list on free/alloc, and there can't be allocations from dead caches, it shouldn't cause any overhead. We do have to disable preemption for put_cpu_partial() to achieve that though. The original patch was accepted well and even merged to the mm tree. However, I decided to withdraw it due to changes happening to the memcg core at that time. I had an idea of introducing per-memcg shrinkers for kmem caches, but now, as memcg has finally settled down, I do not see it as an option, because SLUB shrinker would be too costly to call since SLUB does not keep free slabs on a separate list. Besides, we currently do not even call per-memcg shrinkers for offline memcgs. Overall, it would introduce much more complexity to both SLUB and memcg than this small patch. Regarding to SLAB, there's no problem with it, because it shrinks per-cpu/node caches periodically. Thanks to list_lru reparenting, we no longer keep entries for offline cgroups in per-memcg arrays (such as memcg_cache_params->memcg_caches), so we do not have to bother if a per-memcg cache will be shrunk a bit later than it could be. [1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650 Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 06:59:47 +08:00
int __kmem_cache_shrink(struct kmem_cache *cachep, bool deactivate)
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
{
int ret = 0;
int node;
struct kmem_cache_node *n;
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
drain_cpu_caches(cachep);
check_irq_on();
for_each_kmem_cache_node(cachep, node, n) {
drain_freelist(cachep, n, slabs_tofree(cachep, n));
ret += !list_empty(&n->slabs_full) ||
!list_empty(&n->slabs_partial);
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
}
return (ret ? 1 : 0);
}
int __kmem_cache_shutdown(struct kmem_cache *cachep)
{
int i;
struct kmem_cache_node *n;
slub: make dead caches discard free slabs immediately To speed up further allocations SLUB may store empty slabs in per cpu/node partial lists instead of freeing them immediately. This prevents per memcg caches destruction, because kmem caches created for a memory cgroup are only destroyed after the last page charged to the cgroup is freed. To fix this issue, this patch resurrects approach first proposed in [1]. It forbids SLUB to cache empty slabs after the memory cgroup that the cache belongs to was destroyed. It is achieved by setting kmem_cache's cpu_partial and min_partial constants to 0 and tuning put_cpu_partial() so that it would drop frozen empty slabs immediately if cpu_partial = 0. The runtime overhead is minimal. From all the hot functions, we only touch relatively cold put_cpu_partial(): we make it call unfreeze_partials() after freezing a slab that belongs to an offline memory cgroup. Since slab freezing exists to avoid moving slabs from/to a partial list on free/alloc, and there can't be allocations from dead caches, it shouldn't cause any overhead. We do have to disable preemption for put_cpu_partial() to achieve that though. The original patch was accepted well and even merged to the mm tree. However, I decided to withdraw it due to changes happening to the memcg core at that time. I had an idea of introducing per-memcg shrinkers for kmem caches, but now, as memcg has finally settled down, I do not see it as an option, because SLUB shrinker would be too costly to call since SLUB does not keep free slabs on a separate list. Besides, we currently do not even call per-memcg shrinkers for offline memcgs. Overall, it would introduce much more complexity to both SLUB and memcg than this small patch. Regarding to SLAB, there's no problem with it, because it shrinks per-cpu/node caches periodically. Thanks to list_lru reparenting, we no longer keep entries for offline cgroups in per-memcg arrays (such as memcg_cache_params->memcg_caches), so we do not have to bother if a per-memcg cache will be shrunk a bit later than it could be. [1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650 Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 06:59:47 +08:00
int rc = __kmem_cache_shrink(cachep, false);
if (rc)
return rc;
free_percpu(cachep->cpu_cache);
/* NUMA: free the node structures */
for_each_kmem_cache_node(cachep, i, n) {
kfree(n->shared);
free_alien_cache(n->alien);
kfree(n);
cachep->node[i] = NULL;
}
return 0;
}
/*
* Get the memory for a slab management obj.
*
* For a slab cache when the slab descriptor is off-slab, the
* slab descriptor can't come from the same cache which is being created,
* Because if it is the case, that means we defer the creation of
* the kmalloc_{dma,}_cache of size sizeof(slab descriptor) to this point.
* And we eventually call down to __kmem_cache_create(), which
* in turn looks up in the kmalloc_{dma,}_caches for the disired-size one.
* This is a "chicken-and-egg" problem.
*
* So the off-slab slab descriptor shall come from the kmalloc_{dma,}_caches,
* which are all initialized during kmem_cache_init().
*/
static void *alloc_slabmgmt(struct kmem_cache *cachep,
struct page *page, int colour_off,
gfp_t local_flags, int nodeid)
{
void *freelist;
void *addr = page_address(page);
if (OFF_SLAB(cachep)) {
/* Slab management obj is off-slab. */
freelist = kmem_cache_alloc_node(cachep->freelist_cache,
local_flags, nodeid);
if (!freelist)
return NULL;
} else {
freelist = addr + colour_off;
colour_off += cachep->freelist_size;
}
page->active = 0;
page->s_mem = addr + colour_off;
return freelist;
}
slab: fix the type of the index on freelist index accessor Commit a41adfaa23df ("slab: introduce byte sized index for the freelist of a slab") changes the size of freelist index and also changes prototype of accessor function to freelist index. And there was a mistake. The mistake is that although it changes the size of freelist index correctly, it changes the size of the index of freelist index incorrectly. With patch, freelist index can be 1 byte or 2 bytes, that means that num of object on on a slab can be more than 255. So we need more than 1 byte for the index to find the index of free object on freelist. But, above patch makes this index type 1 byte, so slab which have more than 255 objects cannot work properly and in consequence of it, the system cannot boot. This issue was reported by Steven King on m68knommu which would use 2 bytes freelist index: https://lkml.org/lkml/2014/4/16/433 To fix is easy. To change the type of the index of freelist index on accessor functions is enough to fix this bug. Although 2 bytes is enough, I use 4 bytes since it have no bad effect and make things more easier. This fix was suggested and tested by Steven in his original report. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-and-acked-by: Steven King <sfking@fdwdc.com> Acked-by: Christoph Lameter <cl@linux.com> Tested-by: James Hogan <james.hogan@imgtec.com> Tested-by: David Miller <davem@davemloft.net> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-18 15:24:09 +08:00
static inline freelist_idx_t get_free_obj(struct page *page, unsigned int idx)
{
slab: introduce byte sized index for the freelist of a slab Currently, the freelist of a slab consist of unsigned int sized indexes. Since most of slabs have less number of objects than 256, large sized indexes is needless. For example, consider the minimum kmalloc slab. It's object size is 32 byte and it would consist of one page, so 256 indexes through byte sized index are enough to contain all possible indexes. There can be some slabs whose object size is 8 byte. We cannot handle this case with byte sized index, so we need to restrict minimum object size. Since these slabs are not major, wasted memory from these slabs would be negligible. Some architectures' page size isn't 4096 bytes and rather larger than 4096 bytes (One example is 64KB page size on PPC or IA64) so that byte sized index doesn't fit to them. In this case, we will use two bytes sized index. Below is some number for this patch. * Before * kmalloc-512 525 640 512 8 1 : tunables 54 27 0 : slabdata 80 80 0 kmalloc-256 210 210 256 15 1 : tunables 120 60 0 : slabdata 14 14 0 kmalloc-192 1016 1040 192 20 1 : tunables 120 60 0 : slabdata 52 52 0 kmalloc-96 560 620 128 31 1 : tunables 120 60 0 : slabdata 20 20 0 kmalloc-64 2148 2280 64 60 1 : tunables 120 60 0 : slabdata 38 38 0 kmalloc-128 647 682 128 31 1 : tunables 120 60 0 : slabdata 22 22 0 kmalloc-32 11360 11413 32 113 1 : tunables 120 60 0 : slabdata 101 101 0 kmem_cache 197 200 192 20 1 : tunables 120 60 0 : slabdata 10 10 0 * After * kmalloc-512 521 648 512 8 1 : tunables 54 27 0 : slabdata 81 81 0 kmalloc-256 208 208 256 16 1 : tunables 120 60 0 : slabdata 13 13 0 kmalloc-192 1029 1029 192 21 1 : tunables 120 60 0 : slabdata 49 49 0 kmalloc-96 529 589 128 31 1 : tunables 120 60 0 : slabdata 19 19 0 kmalloc-64 2142 2142 64 63 1 : tunables 120 60 0 : slabdata 34 34 0 kmalloc-128 660 682 128 31 1 : tunables 120 60 0 : slabdata 22 22 0 kmalloc-32 11716 11780 32 124 1 : tunables 120 60 0 : slabdata 95 95 0 kmem_cache 197 210 192 21 1 : tunables 120 60 0 : slabdata 10 10 0 kmem_caches consisting of objects less than or equal to 256 byte have one or more objects than before. In the case of kmalloc-32, we have 11 more objects, so 352 bytes (11 * 32) are saved and this is roughly 9% saving of memory. Of couse, this percentage decreases as the number of objects in a slab decreases. Here are the performance results on my 4 cpus machine. * Before * Performance counter stats for 'perf bench sched messaging -g 50 -l 1000' (10 runs): 229,945,138 cache-misses ( +- 0.23% ) 11.627897174 seconds time elapsed ( +- 0.14% ) * After * Performance counter stats for 'perf bench sched messaging -g 50 -l 1000' (10 runs): 218,640,472 cache-misses ( +- 0.42% ) 11.504999837 seconds time elapsed ( +- 0.21% ) cache-misses are reduced by this patchset, roughly 5%. And elapsed times are improved by 1%. Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-12-02 16:49:42 +08:00
return ((freelist_idx_t *)page->freelist)[idx];
}
static inline void set_free_obj(struct page *page,
slab: fix the type of the index on freelist index accessor Commit a41adfaa23df ("slab: introduce byte sized index for the freelist of a slab") changes the size of freelist index and also changes prototype of accessor function to freelist index. And there was a mistake. The mistake is that although it changes the size of freelist index correctly, it changes the size of the index of freelist index incorrectly. With patch, freelist index can be 1 byte or 2 bytes, that means that num of object on on a slab can be more than 255. So we need more than 1 byte for the index to find the index of free object on freelist. But, above patch makes this index type 1 byte, so slab which have more than 255 objects cannot work properly and in consequence of it, the system cannot boot. This issue was reported by Steven King on m68knommu which would use 2 bytes freelist index: https://lkml.org/lkml/2014/4/16/433 To fix is easy. To change the type of the index of freelist index on accessor functions is enough to fix this bug. Although 2 bytes is enough, I use 4 bytes since it have no bad effect and make things more easier. This fix was suggested and tested by Steven in his original report. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-and-acked-by: Steven King <sfking@fdwdc.com> Acked-by: Christoph Lameter <cl@linux.com> Tested-by: James Hogan <james.hogan@imgtec.com> Tested-by: David Miller <davem@davemloft.net> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-18 15:24:09 +08:00
unsigned int idx, freelist_idx_t val)
{
slab: introduce byte sized index for the freelist of a slab Currently, the freelist of a slab consist of unsigned int sized indexes. Since most of slabs have less number of objects than 256, large sized indexes is needless. For example, consider the minimum kmalloc slab. It's object size is 32 byte and it would consist of one page, so 256 indexes through byte sized index are enough to contain all possible indexes. There can be some slabs whose object size is 8 byte. We cannot handle this case with byte sized index, so we need to restrict minimum object size. Since these slabs are not major, wasted memory from these slabs would be negligible. Some architectures' page size isn't 4096 bytes and rather larger than 4096 bytes (One example is 64KB page size on PPC or IA64) so that byte sized index doesn't fit to them. In this case, we will use two bytes sized index. Below is some number for this patch. * Before * kmalloc-512 525 640 512 8 1 : tunables 54 27 0 : slabdata 80 80 0 kmalloc-256 210 210 256 15 1 : tunables 120 60 0 : slabdata 14 14 0 kmalloc-192 1016 1040 192 20 1 : tunables 120 60 0 : slabdata 52 52 0 kmalloc-96 560 620 128 31 1 : tunables 120 60 0 : slabdata 20 20 0 kmalloc-64 2148 2280 64 60 1 : tunables 120 60 0 : slabdata 38 38 0 kmalloc-128 647 682 128 31 1 : tunables 120 60 0 : slabdata 22 22 0 kmalloc-32 11360 11413 32 113 1 : tunables 120 60 0 : slabdata 101 101 0 kmem_cache 197 200 192 20 1 : tunables 120 60 0 : slabdata 10 10 0 * After * kmalloc-512 521 648 512 8 1 : tunables 54 27 0 : slabdata 81 81 0 kmalloc-256 208 208 256 16 1 : tunables 120 60 0 : slabdata 13 13 0 kmalloc-192 1029 1029 192 21 1 : tunables 120 60 0 : slabdata 49 49 0 kmalloc-96 529 589 128 31 1 : tunables 120 60 0 : slabdata 19 19 0 kmalloc-64 2142 2142 64 63 1 : tunables 120 60 0 : slabdata 34 34 0 kmalloc-128 660 682 128 31 1 : tunables 120 60 0 : slabdata 22 22 0 kmalloc-32 11716 11780 32 124 1 : tunables 120 60 0 : slabdata 95 95 0 kmem_cache 197 210 192 21 1 : tunables 120 60 0 : slabdata 10 10 0 kmem_caches consisting of objects less than or equal to 256 byte have one or more objects than before. In the case of kmalloc-32, we have 11 more objects, so 352 bytes (11 * 32) are saved and this is roughly 9% saving of memory. Of couse, this percentage decreases as the number of objects in a slab decreases. Here are the performance results on my 4 cpus machine. * Before * Performance counter stats for 'perf bench sched messaging -g 50 -l 1000' (10 runs): 229,945,138 cache-misses ( +- 0.23% ) 11.627897174 seconds time elapsed ( +- 0.14% ) * After * Performance counter stats for 'perf bench sched messaging -g 50 -l 1000' (10 runs): 218,640,472 cache-misses ( +- 0.42% ) 11.504999837 seconds time elapsed ( +- 0.21% ) cache-misses are reduced by this patchset, roughly 5%. And elapsed times are improved by 1%. Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-12-02 16:49:42 +08:00
((freelist_idx_t *)(page->freelist))[idx] = val;
}
static void cache_init_objs(struct kmem_cache *cachep,
struct page *page)
{
int i;
for (i = 0; i < cachep->num; i++) {
void *objp = index_to_obj(cachep, page, i);
#if DEBUG
/* need to poison the objs? */
if (cachep->flags & SLAB_POISON)
poison_obj(cachep, objp, POISON_FREE);
if (cachep->flags & SLAB_STORE_USER)
*dbg_userword(cachep, objp) = NULL;
if (cachep->flags & SLAB_RED_ZONE) {
*dbg_redzone1(cachep, objp) = RED_INACTIVE;
*dbg_redzone2(cachep, objp) = RED_INACTIVE;
}
/*
* Constructors are not allowed to allocate memory from the same
* cache which they are a constructor for. Otherwise, deadlock.
* They must also be threaded.
*/
if (cachep->ctor && !(cachep->flags & SLAB_POISON))
cachep->ctor(objp + obj_offset(cachep));
if (cachep->flags & SLAB_RED_ZONE) {
if (*dbg_redzone2(cachep, objp) != RED_INACTIVE)
slab_error(cachep, "constructor overwrote the"
" end of an object");
if (*dbg_redzone1(cachep, objp) != RED_INACTIVE)
slab_error(cachep, "constructor overwrote the"
" start of an object");
}
if ((cachep->size % PAGE_SIZE) == 0 &&
OFF_SLAB(cachep) && cachep->flags & SLAB_POISON)
kernel_map_pages(virt_to_page(objp),
cachep->size / PAGE_SIZE, 0);
#else
if (cachep->ctor)
cachep->ctor(objp);
#endif
slab: fix oops when reading /proc/slab_allocators Commit b1cb0982bdd6 ("change the management method of free objects of the slab") introduced a bug on slab leak detector ('/proc/slab_allocators'). This detector works like as following decription. 1. traverse all objects on all the slabs. 2. determine whether it is active or not. 3. if active, print who allocate this object. but that commit changed the way how to manage free objects, so the logic determining whether it is active or not is also changed. In before, we regard object in cpu caches as inactive one, but, with this commit, we mistakenly regard object in cpu caches as active one. This intoduces kernel oops if DEBUG_PAGEALLOC is enabled. If DEBUG_PAGEALLOC is enabled, kernel_map_pages() is used to detect who corrupt free memory in the slab. It unmaps page table mapping if object is free and map it if object is active. When slab leak detector check object in cpu caches, it mistakenly think this object active so try to access object memory to retrieve caller of allocation. At this point, page table mapping to this object doesn't exist, so oops occurs. Following is oops message reported from Dave. It blew up when something tried to read /proc/slab_allocators (Just cat it, and you should see the oops below) Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: [snip...] CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131 task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000 RIP: 0010:[<ffffffffaa1a8f4a>] [<ffffffffaa1a8f4a>] handle_slab+0x8a/0x180 RSP: 0018:ffff880076925de0 EFLAGS: 00010002 RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7 RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000 RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000 R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0 FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0 DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602 Call Trace: leaks_show+0xce/0x240 seq_read+0x28e/0x490 proc_reg_read+0x3d/0x80 vfs_read+0x9b/0x160 SyS_read+0x58/0xb0 tracesys+0xd4/0xd9 Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 <4d> 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46 RIP handle_slab+0x8a/0x180 To fix the problem, I introduce an object status buffer on each slab. With this, we can track object status precisely, so slab leak detector would not access active object and no kernel oops would occur. Memory overhead caused by this fix is only imposed to CONFIG_DEBUG_SLAB_LEAK which is mainly used for debugging, so memory overhead isn't big problem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Dave Jones <davej@redhat.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-24 04:22:06 +08:00
set_obj_status(page, i, OBJECT_FREE);
set_free_obj(page, i, i);
}
}
static void kmem_flagcheck(struct kmem_cache *cachep, gfp_t flags)
{
if (CONFIG_ZONE_DMA_FLAG) {
if (flags & GFP_DMA)
BUG_ON(!(cachep->allocflags & GFP_DMA));
else
BUG_ON(cachep->allocflags & GFP_DMA);
}
}
static void *slab_get_obj(struct kmem_cache *cachep, struct page *page,
int nodeid)
{
void *objp;
objp = index_to_obj(cachep, page, get_free_obj(page, page->active));
page->active++;
#if DEBUG
WARN_ON(page_to_nid(virt_to_page(objp)) != nodeid);
#endif
return objp;
}
static void slab_put_obj(struct kmem_cache *cachep, struct page *page,
void *objp, int nodeid)
{
unsigned int objnr = obj_to_index(cachep, page, objp);
#if DEBUG
unsigned int i;
/* Verify that the slab belongs to the intended node */
WARN_ON(page_to_nid(virt_to_page(objp)) != nodeid);
/* Verify double free bug */
for (i = page->active; i < cachep->num; i++) {
if (get_free_obj(page, i) == objnr) {
printk(KERN_ERR "slab: double free detected in cache "
"'%s', objp %p\n", cachep->name, objp);
BUG();
}
}
#endif
page->active--;
set_free_obj(page, page->active, objnr);
}
/*
* Map pages beginning at addr to the given cache and slab. This is required
* for the slab allocator to be able to lookup the cache and slab of a
* virtual address for kfree, ksize, and slab debugging.
*/
static void slab_map_pages(struct kmem_cache *cache, struct page *page,
void *freelist)
{
page->slab_cache = cache;
page->freelist = freelist;
}
/*
* Grow (by 1) the number of slabs within a cache. This is called by
* kmem_cache_alloc() when there are no active objs left in a cache.
*/
[PATCH] slab: better fallback allocation behavior Currently we simply attempt to allocate from all allowed nodes using GFP_THISNODE. However, GFP_THISNODE does not do reclaim (it wont do any at all if the recent GFP_THISNODE patch is accepted). If we truly run out of memory in the whole system then fallback_alloc may return NULL although memory may still be available if we would perform more thorough reclaim. This patch changes fallback_alloc() so that we first only inspect all the per node queues for available slabs. If we find any then we allocate from those. This avoids slab fragmentation by first getting rid of all partial allocated slabs on every node before allocating new memory. If we cannot satisfy the allocation from any per node queue then we extend a slab. We now call into the page allocator without specifying GFP_THISNODE. The page allocator will then implement its own fallback (in the given cpuset context), perform necessary reclaim (again considering not a single node but the whole set of allowed nodes) and then return pages for a new slab. We identify from which node the pages were allocated and then insert the pages into the corresponding per node structure. In order to do so we need to modify cache_grow() to take a parameter that specifies the new slab. kmem_getpages() can no longer set the GFP_THISNODE flag since we need to be able to use kmem_getpage to allocate from an arbitrary node. GFP_THISNODE needs to be specified when calling cache_grow(). One key advantage is that the decision from which node to allocate new memory is removed from slab fallback processing. The patch allows to go back to use of the page allocators fallback/reclaim logic. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 12:33:29 +08:00
static int cache_grow(struct kmem_cache *cachep,
gfp_t flags, int nodeid, struct page *page)
{
void *freelist;
size_t offset;
gfp_t local_flags;
struct kmem_cache_node *n;
/*
* Be lazy and only check for valid flags here, keeping it out of the
* critical path in kmem_cache_alloc().
*/
if (unlikely(flags & GFP_SLAB_BUG_MASK)) {
pr_emerg("gfp: %u\n", flags & GFP_SLAB_BUG_MASK);
BUG();
}
Categorize GFP flags The function of GFP_LEVEL_MASK seems to be unclear. In order to clear up the mystery we get rid of it and replace GFP_LEVEL_MASK with 3 sets of GFP flags: GFP_RECLAIM_MASK Flags used to control page allocator reclaim behavior. GFP_CONSTRAINT_MASK Flags used to limit where allocations can occur. GFP_SLAB_BUG_MASK Flags that the slab allocator BUG()s on. These replace the uses of GFP_LEVEL mask in the slab allocators and in vmalloc.c. The use of the flags not included in these sets may occur as a result of a slab allocation standing in for a page allocation when constructing scatter gather lists. Extraneous flags are cleared and not passed through to the page allocator. __GFP_MOVABLE/RECLAIMABLE, __GFP_COLD and __GFP_COMP will now be ignored if passed to a slab allocator. Change the allocation of allocator meta data in SLAB and vmalloc to not pass through flags listed in GFP_CONSTRAINT_MASK. SLAB already removes the __GFP_THISNODE flag for such allocations. Generalize that to also cover vmalloc. The use of GFP_CONSTRAINT_MASK also includes __GFP_HARDWALL. The impact of allocator metadata placement on access latency to the cachelines of the object itself is minimal since metadata is only referenced on alloc and free. The attempt is still made to place the meta data optimally but we consistently allow fallback both in SLAB and vmalloc (SLUB does not need to allocate metadata like that). Allocator metadata may serve multiple in kernel users and thus should not be subject to the limitations arising from a single allocation context. [akpm@linux-foundation.org: fix fallback_alloc()] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 16:25:41 +08:00
local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
/* Take the node list lock to change the colour_next on this node */
check_irq_off();
n = get_node(cachep, nodeid);
spin_lock(&n->list_lock);
/* Get colour for the slab, and cal the next value. */
offset = n->colour_next;
n->colour_next++;
if (n->colour_next >= cachep->colour)
n->colour_next = 0;
spin_unlock(&n->list_lock);
offset *= cachep->colour_off;
mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-07 08:28:21 +08:00
if (gfpflags_allow_blocking(local_flags))
local_irq_enable();
/*
* The test for missing atomic flag is performed here, rather than
* the more obvious place, simply to reduce the critical path length
* in kmem_cache_alloc(). If a caller is seriously mis-behaving they
* will eventually be caught here (where it matters).
*/
kmem_flagcheck(cachep, flags);
/*
* Get mem for the objs. Attempt to allocate a physical page from
* 'nodeid'.
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
*/
if (!page)
page = kmem_getpages(cachep, local_flags, nodeid);
if (!page)
goto failed;
/* Get slab management. */
freelist = alloc_slabmgmt(cachep, page, offset,
Categorize GFP flags The function of GFP_LEVEL_MASK seems to be unclear. In order to clear up the mystery we get rid of it and replace GFP_LEVEL_MASK with 3 sets of GFP flags: GFP_RECLAIM_MASK Flags used to control page allocator reclaim behavior. GFP_CONSTRAINT_MASK Flags used to limit where allocations can occur. GFP_SLAB_BUG_MASK Flags that the slab allocator BUG()s on. These replace the uses of GFP_LEVEL mask in the slab allocators and in vmalloc.c. The use of the flags not included in these sets may occur as a result of a slab allocation standing in for a page allocation when constructing scatter gather lists. Extraneous flags are cleared and not passed through to the page allocator. __GFP_MOVABLE/RECLAIMABLE, __GFP_COLD and __GFP_COMP will now be ignored if passed to a slab allocator. Change the allocation of allocator meta data in SLAB and vmalloc to not pass through flags listed in GFP_CONSTRAINT_MASK. SLAB already removes the __GFP_THISNODE flag for such allocations. Generalize that to also cover vmalloc. The use of GFP_CONSTRAINT_MASK also includes __GFP_HARDWALL. The impact of allocator metadata placement on access latency to the cachelines of the object itself is minimal since metadata is only referenced on alloc and free. The attempt is still made to place the meta data optimally but we consistently allow fallback both in SLAB and vmalloc (SLUB does not need to allocate metadata like that). Allocator metadata may serve multiple in kernel users and thus should not be subject to the limitations arising from a single allocation context. [akpm@linux-foundation.org: fix fallback_alloc()] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 16:25:41 +08:00
local_flags & ~GFP_CONSTRAINT_MASK, nodeid);
if (!freelist)
goto opps1;
slab_map_pages(cachep, page, freelist);
cache_init_objs(cachep, page);
mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-07 08:28:21 +08:00
if (gfpflags_allow_blocking(local_flags))
local_irq_disable();
check_irq_off();
spin_lock(&n->list_lock);
/* Make slab active. */
list_add_tail(&page->lru, &(n->slabs_free));
STATS_INC_GROWN(cachep);
n->free_objects += cachep->num;
spin_unlock(&n->list_lock);
return 1;
opps1:
kmem_freepages(cachep, page);
failed:
mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-07 08:28:21 +08:00
if (gfpflags_allow_blocking(local_flags))
local_irq_disable();
return 0;
}
#if DEBUG
/*
* Perform extra freeing checks:
* - detect bad pointers.
* - POISON/RED_ZONE checking
*/
static void kfree_debugcheck(const void *objp)
{
if (!virt_addr_valid(objp)) {
printk(KERN_ERR "kfree_debugcheck: out of range ptr %lxh.\n",
(unsigned long)objp);
BUG();
}
}
static inline void verify_redzone_free(struct kmem_cache *cache, void *obj)
{
Increase slab redzone to 64bits There are two problems with the existing redzone implementation. Firstly, it's causing misalignment of structures which contain a 64-bit integer, such as netfilter's 'struct ipt_entry' -- causing netfilter modules to fail to load because of the misalignment. (In particular, the first check in net/ipv4/netfilter/ip_tables.c::check_entry_size_and_hooks()) On ppc32 and sparc32, amongst others, __alignof__(uint64_t) == 8. With slab debugging, we use 32-bit redzones. And allocated slab objects aren't sufficiently aligned to hold a structure containing a uint64_t. By _just_ setting ARCH_KMALLOC_MINALIGN to __alignof__(u64) we'd disable redzone checks on those architectures. By using 64-bit redzones we avoid that loss of debugging, and also fix the other problem while we're at it. When investigating this, I noticed that on 64-bit platforms we're using a 32-bit value of RED_ACTIVE/RED_INACTIVE in the 64-bit memory location set aside for the redzone. Which means that the four bytes immediately before or after the allocated object at 0x00,0x00,0x00,0x00 for LE and BE machines, respectively. Which is probably not the most useful choice of poison value. One way to fix both of those at once is just to switch to 64-bit redzones in all cases. Signed-off-by: David Woodhouse <dwmw2@infradead.org> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 15:22:59 +08:00
unsigned long long redzone1, redzone2;
redzone1 = *dbg_redzone1(cache, obj);
redzone2 = *dbg_redzone2(cache, obj);
/*
* Redzone is ok.
*/
if (redzone1 == RED_ACTIVE && redzone2 == RED_ACTIVE)
return;
if (redzone1 == RED_INACTIVE && redzone2 == RED_INACTIVE)
slab_error(cache, "double free detected");
else
slab_error(cache, "memory outside object was overwritten");
Increase slab redzone to 64bits There are two problems with the existing redzone implementation. Firstly, it's causing misalignment of structures which contain a 64-bit integer, such as netfilter's 'struct ipt_entry' -- causing netfilter modules to fail to load because of the misalignment. (In particular, the first check in net/ipv4/netfilter/ip_tables.c::check_entry_size_and_hooks()) On ppc32 and sparc32, amongst others, __alignof__(uint64_t) == 8. With slab debugging, we use 32-bit redzones. And allocated slab objects aren't sufficiently aligned to hold a structure containing a uint64_t. By _just_ setting ARCH_KMALLOC_MINALIGN to __alignof__(u64) we'd disable redzone checks on those architectures. By using 64-bit redzones we avoid that loss of debugging, and also fix the other problem while we're at it. When investigating this, I noticed that on 64-bit platforms we're using a 32-bit value of RED_ACTIVE/RED_INACTIVE in the 64-bit memory location set aside for the redzone. Which means that the four bytes immediately before or after the allocated object at 0x00,0x00,0x00,0x00 for LE and BE machines, respectively. Which is probably not the most useful choice of poison value. One way to fix both of those at once is just to switch to 64-bit redzones in all cases. Signed-off-by: David Woodhouse <dwmw2@infradead.org> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 15:22:59 +08:00
printk(KERN_ERR "%p: redzone 1:0x%llx, redzone 2:0x%llx.\n",
obj, redzone1, redzone2);
}
static void *cache_free_debugcheck(struct kmem_cache *cachep, void *objp,
unsigned long caller)
{
unsigned int objnr;
struct page *page;
BUG_ON(virt_to_cache(objp) != cachep);
objp -= obj_offset(cachep);
kfree_debugcheck(objp);
page = virt_to_head_page(objp);
if (cachep->flags & SLAB_RED_ZONE) {
verify_redzone_free(cachep, objp);
*dbg_redzone1(cachep, objp) = RED_INACTIVE;
*dbg_redzone2(cachep, objp) = RED_INACTIVE;
}
if (cachep->flags & SLAB_STORE_USER)
*dbg_userword(cachep, objp) = (void *)caller;
objnr = obj_to_index(cachep, page, objp);
BUG_ON(objnr >= cachep->num);
BUG_ON(objp != index_to_obj(cachep, page, objnr));
slab: fix oops when reading /proc/slab_allocators Commit b1cb0982bdd6 ("change the management method of free objects of the slab") introduced a bug on slab leak detector ('/proc/slab_allocators'). This detector works like as following decription. 1. traverse all objects on all the slabs. 2. determine whether it is active or not. 3. if active, print who allocate this object. but that commit changed the way how to manage free objects, so the logic determining whether it is active or not is also changed. In before, we regard object in cpu caches as inactive one, but, with this commit, we mistakenly regard object in cpu caches as active one. This intoduces kernel oops if DEBUG_PAGEALLOC is enabled. If DEBUG_PAGEALLOC is enabled, kernel_map_pages() is used to detect who corrupt free memory in the slab. It unmaps page table mapping if object is free and map it if object is active. When slab leak detector check object in cpu caches, it mistakenly think this object active so try to access object memory to retrieve caller of allocation. At this point, page table mapping to this object doesn't exist, so oops occurs. Following is oops message reported from Dave. It blew up when something tried to read /proc/slab_allocators (Just cat it, and you should see the oops below) Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: [snip...] CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131 task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000 RIP: 0010:[<ffffffffaa1a8f4a>] [<ffffffffaa1a8f4a>] handle_slab+0x8a/0x180 RSP: 0018:ffff880076925de0 EFLAGS: 00010002 RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7 RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000 RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000 R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0 FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0 DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602 Call Trace: leaks_show+0xce/0x240 seq_read+0x28e/0x490 proc_reg_read+0x3d/0x80 vfs_read+0x9b/0x160 SyS_read+0x58/0xb0 tracesys+0xd4/0xd9 Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 <4d> 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46 RIP handle_slab+0x8a/0x180 To fix the problem, I introduce an object status buffer on each slab. With this, we can track object status precisely, so slab leak detector would not access active object and no kernel oops would occur. Memory overhead caused by this fix is only imposed to CONFIG_DEBUG_SLAB_LEAK which is mainly used for debugging, so memory overhead isn't big problem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Dave Jones <davej@redhat.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-24 04:22:06 +08:00
set_obj_status(page, objnr, OBJECT_FREE);
if (cachep->flags & SLAB_POISON) {
#ifdef CONFIG_DEBUG_PAGEALLOC
if ((cachep->size % PAGE_SIZE)==0 && OFF_SLAB(cachep)) {
store_stackinfo(cachep, objp, caller);
kernel_map_pages(virt_to_page(objp),
cachep->size / PAGE_SIZE, 0);
} else {
poison_obj(cachep, objp, POISON_FREE);
}
#else
poison_obj(cachep, objp, POISON_FREE);
#endif
}
return objp;
}
#else
#define kfree_debugcheck(x) do { } while(0)
#define cache_free_debugcheck(x,objp,z) (objp)
#endif
static struct page *get_first_slab(struct kmem_cache_node *n)
{
struct page *page;
page = list_first_entry_or_null(&n->slabs_partial,
struct page, lru);
if (!page) {
n->free_touched = 1;
page = list_first_entry_or_null(&n->slabs_free,
struct page, lru);
}
return page;
}
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags,
bool force_refill)
{
int batchcount;
struct kmem_cache_node *n;
struct array_cache *ac;
int node;
check_irq_off();
numa: slab: use numa_mem_id() for slab local memory node Example usage of generic "numa_mem_id()": The mainline slab code, since ~ 2.6.19, does not handle memoryless nodes well. Specifically, the "fast path"--____cache_alloc()--will never succeed as slab doesn't cache offnode object on the per cpu queues, and for memoryless nodes, all memory will be "off node" relative to numa_node_id(). This adds significant overhead to all kmem cache allocations, incurring a significant regression relative to earlier kernels [from before slab.c was reorganized]. This patch uses the generic topology function "numa_mem_id()" to return the "effective local memory node" for the calling context. This is the first node in the local node's generic fallback zonelist-- the same node that "local" mempolicy-based allocations would use. This lets slab cache these "local" allocations and avoid fallback/refill on every allocation. N.B.: Slab will need to handle node and memory hotplug events that could change the value returned by numa_mem_id() for any given node if recent changes to address memory hotplug don't already address this. E.g., flush all per cpu slab queues before rebuilding the zonelists while the "machine" is held in the stopped state. Performance impact on "hackbench 400 process 200" 2.6.34-rc3-mmotm-100405-1609 no-patch this-patch ia64 no memoryless nodes [avg of 10]: 11.713 11.637 ~0.65 diff ia64 cpus all on memless nodes [10]: 228.259 26.484 ~8.6x speedup The slowdown of the patched kernel from ~12 sec to ~28 seconds when configured with memoryless nodes is the result of all cpus allocating from a single node's mm pagepool. The cache lines of the single node are distributed/interleaved over the memory of the real physical nodes, but the zone lock, list heads, ... of the single node with memory still each live in a single cache line that is accessed from all processors. x86_64 [8x6 AMD] [avg of 40]: 2.883 2.845 Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 05:45:03 +08:00
node = numa_mem_id();
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
if (unlikely(force_refill))
goto force_grow;
retry:
ac = cpu_cache_get(cachep);
batchcount = ac->batchcount;
if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
/*
* If there was little recent activity on this cache, then
* perform only a partial refill. Otherwise we could generate
* refill bouncing.
*/
batchcount = BATCHREFILL_LIMIT;
}
n = get_node(cachep, node);
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
BUG_ON(ac->avail > 0 || !n);
spin_lock(&n->list_lock);
/* See if we can refill from the shared array */
if (n->shared && transfer_objects(ac, n->shared, batchcount)) {
n->shared->touched = 1;
goto alloc_done;
}
while (batchcount > 0) {
struct page *page;
/* Get slab alloc is to come from. */
page = get_first_slab(n);
if (!page)
goto must_grow;
check_spinlock_acquired(cachep);
/*
* The slab was either on partial or free list so
* there must be at least one object available for
* allocation.
*/
BUG_ON(page->active >= cachep->num);
while (page->active < cachep->num && batchcount--) {
STATS_INC_ALLOCED(cachep);
STATS_INC_ACTIVE(cachep);
STATS_SET_HIGH(cachep);
ac_put_obj(cachep, ac, slab_get_obj(cachep, page,
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
node));
}
/* move slabp to correct slabp list: */
list_del(&page->lru);
if (page->active == cachep->num)
list_add(&page->lru, &n->slabs_full);
else
list_add(&page->lru, &n->slabs_partial);
}
must_grow:
n->free_objects -= ac->avail;
alloc_done:
spin_unlock(&n->list_lock);
if (unlikely(!ac->avail)) {
int x;
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
force_grow:
mm: remove GFP_THISNODE NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE. GFP_THISNODE is a secret combination of gfp bits that have different behavior than expected. It is a combination of __GFP_THISNODE, __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator slowpath to fail without trying reclaim even though it may be used in combination with __GFP_WAIT. An example of the problem this creates: commit e97ca8e5b864 ("mm: fix GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE that really just wanted __GFP_THISNODE. The problem doesn't end there, however, because even it was a no-op for alloc_misplaced_dst_page(), which also sets __GFP_NORETRY and __GFP_NOWARN, and migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT is set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is a no-op in these cases since the page allocator special-cases __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN. It's time to just remove GFP_THISNODE entirely. We leave __GFP_THISNODE to restrict an allocation to a local node, but remove GFP_THISNODE and its obscurity. Instead, we require that a caller clear __GFP_WAIT if it wants to avoid reclaim. This allows the aforementioned functions to actually reclaim as they should. It also enables any future callers that want to do __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT. Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so it is unchanged. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Pravin Shelar <pshelar@nicira.com> Cc: Jarno Rajahalme <jrajahalme@nicira.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Greg Thelen <gthelen@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 06:46:55 +08:00
x = cache_grow(cachep, gfp_exact_node(flags), node, NULL);
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
/* cache_grow can reenable interrupts, then ac could change. */
ac = cpu_cache_get(cachep);
node = numa_mem_id();
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
/* no objects in sight? abort */
if (!x && (ac->avail == 0 || force_refill))
return NULL;
if (!ac->avail) /* objects refilled by interrupt? */
goto retry;
}
ac->touched = 1;
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
return ac_get_obj(cachep, ac, flags, force_refill);
}
static inline void cache_alloc_debugcheck_before(struct kmem_cache *cachep,
gfp_t flags)
{
mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-07 08:28:21 +08:00
might_sleep_if(gfpflags_allow_blocking(flags));
#if DEBUG
kmem_flagcheck(cachep, flags);
#endif
}
#if DEBUG
static void *cache_alloc_debugcheck_after(struct kmem_cache *cachep,
gfp_t flags, void *objp, unsigned long caller)
{
slab: fix oops when reading /proc/slab_allocators Commit b1cb0982bdd6 ("change the management method of free objects of the slab") introduced a bug on slab leak detector ('/proc/slab_allocators'). This detector works like as following decription. 1. traverse all objects on all the slabs. 2. determine whether it is active or not. 3. if active, print who allocate this object. but that commit changed the way how to manage free objects, so the logic determining whether it is active or not is also changed. In before, we regard object in cpu caches as inactive one, but, with this commit, we mistakenly regard object in cpu caches as active one. This intoduces kernel oops if DEBUG_PAGEALLOC is enabled. If DEBUG_PAGEALLOC is enabled, kernel_map_pages() is used to detect who corrupt free memory in the slab. It unmaps page table mapping if object is free and map it if object is active. When slab leak detector check object in cpu caches, it mistakenly think this object active so try to access object memory to retrieve caller of allocation. At this point, page table mapping to this object doesn't exist, so oops occurs. Following is oops message reported from Dave. It blew up when something tried to read /proc/slab_allocators (Just cat it, and you should see the oops below) Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: [snip...] CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131 task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000 RIP: 0010:[<ffffffffaa1a8f4a>] [<ffffffffaa1a8f4a>] handle_slab+0x8a/0x180 RSP: 0018:ffff880076925de0 EFLAGS: 00010002 RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7 RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000 RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000 R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0 FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0 DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602 Call Trace: leaks_show+0xce/0x240 seq_read+0x28e/0x490 proc_reg_read+0x3d/0x80 vfs_read+0x9b/0x160 SyS_read+0x58/0xb0 tracesys+0xd4/0xd9 Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 <4d> 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46 RIP handle_slab+0x8a/0x180 To fix the problem, I introduce an object status buffer on each slab. With this, we can track object status precisely, so slab leak detector would not access active object and no kernel oops would occur. Memory overhead caused by this fix is only imposed to CONFIG_DEBUG_SLAB_LEAK which is mainly used for debugging, so memory overhead isn't big problem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Dave Jones <davej@redhat.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-24 04:22:06 +08:00
struct page *page;
if (!objp)
return objp;
if (cachep->flags & SLAB_POISON) {
#ifdef CONFIG_DEBUG_PAGEALLOC
if ((cachep->size % PAGE_SIZE) == 0 && OFF_SLAB(cachep))
kernel_map_pages(virt_to_page(objp),
cachep->size / PAGE_SIZE, 1);
else
check_poison_obj(cachep, objp);
#else
check_poison_obj(cachep, objp);
#endif
poison_obj(cachep, objp, POISON_INUSE);
}
if (cachep->flags & SLAB_STORE_USER)
*dbg_userword(cachep, objp) = (void *)caller;
if (cachep->flags & SLAB_RED_ZONE) {
if (*dbg_redzone1(cachep, objp) != RED_INACTIVE ||
*dbg_redzone2(cachep, objp) != RED_INACTIVE) {
slab_error(cachep, "double free, or memory outside"
" object was overwritten");
printk(KERN_ERR
Increase slab redzone to 64bits There are two problems with the existing redzone implementation. Firstly, it's causing misalignment of structures which contain a 64-bit integer, such as netfilter's 'struct ipt_entry' -- causing netfilter modules to fail to load because of the misalignment. (In particular, the first check in net/ipv4/netfilter/ip_tables.c::check_entry_size_and_hooks()) On ppc32 and sparc32, amongst others, __alignof__(uint64_t) == 8. With slab debugging, we use 32-bit redzones. And allocated slab objects aren't sufficiently aligned to hold a structure containing a uint64_t. By _just_ setting ARCH_KMALLOC_MINALIGN to __alignof__(u64) we'd disable redzone checks on those architectures. By using 64-bit redzones we avoid that loss of debugging, and also fix the other problem while we're at it. When investigating this, I noticed that on 64-bit platforms we're using a 32-bit value of RED_ACTIVE/RED_INACTIVE in the 64-bit memory location set aside for the redzone. Which means that the four bytes immediately before or after the allocated object at 0x00,0x00,0x00,0x00 for LE and BE machines, respectively. Which is probably not the most useful choice of poison value. One way to fix both of those at once is just to switch to 64-bit redzones in all cases. Signed-off-by: David Woodhouse <dwmw2@infradead.org> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 15:22:59 +08:00
"%p: redzone 1:0x%llx, redzone 2:0x%llx\n",
objp, *dbg_redzone1(cachep, objp),
*dbg_redzone2(cachep, objp));
}
*dbg_redzone1(cachep, objp) = RED_ACTIVE;
*dbg_redzone2(cachep, objp) = RED_ACTIVE;
}
slab: fix oops when reading /proc/slab_allocators Commit b1cb0982bdd6 ("change the management method of free objects of the slab") introduced a bug on slab leak detector ('/proc/slab_allocators'). This detector works like as following decription. 1. traverse all objects on all the slabs. 2. determine whether it is active or not. 3. if active, print who allocate this object. but that commit changed the way how to manage free objects, so the logic determining whether it is active or not is also changed. In before, we regard object in cpu caches as inactive one, but, with this commit, we mistakenly regard object in cpu caches as active one. This intoduces kernel oops if DEBUG_PAGEALLOC is enabled. If DEBUG_PAGEALLOC is enabled, kernel_map_pages() is used to detect who corrupt free memory in the slab. It unmaps page table mapping if object is free and map it if object is active. When slab leak detector check object in cpu caches, it mistakenly think this object active so try to access object memory to retrieve caller of allocation. At this point, page table mapping to this object doesn't exist, so oops occurs. Following is oops message reported from Dave. It blew up when something tried to read /proc/slab_allocators (Just cat it, and you should see the oops below) Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: [snip...] CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131 task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000 RIP: 0010:[<ffffffffaa1a8f4a>] [<ffffffffaa1a8f4a>] handle_slab+0x8a/0x180 RSP: 0018:ffff880076925de0 EFLAGS: 00010002 RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7 RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000 RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000 R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0 FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0 DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602 Call Trace: leaks_show+0xce/0x240 seq_read+0x28e/0x490 proc_reg_read+0x3d/0x80 vfs_read+0x9b/0x160 SyS_read+0x58/0xb0 tracesys+0xd4/0xd9 Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 <4d> 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46 RIP handle_slab+0x8a/0x180 To fix the problem, I introduce an object status buffer on each slab. With this, we can track object status precisely, so slab leak detector would not access active object and no kernel oops would occur. Memory overhead caused by this fix is only imposed to CONFIG_DEBUG_SLAB_LEAK which is mainly used for debugging, so memory overhead isn't big problem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Dave Jones <davej@redhat.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-24 04:22:06 +08:00
page = virt_to_head_page(objp);
set_obj_status(page, obj_to_index(cachep, page, objp), OBJECT_ACTIVE);
objp += obj_offset(cachep);
if (cachep->ctor && cachep->flags & SLAB_POISON)
cachep->ctor(objp);
if (ARCH_SLAB_MINALIGN &&
((unsigned long)objp & (ARCH_SLAB_MINALIGN-1))) {
printk(KERN_ERR "0x%p: not aligned to ARCH_SLAB_MINALIGN=%d\n",
objp, (int)ARCH_SLAB_MINALIGN);
}
return objp;
}
#else
#define cache_alloc_debugcheck_after(a,b,objp,d) (objp)
#endif
static bool slab_should_failslab(struct kmem_cache *cachep, gfp_t flags)
{
slab: add unlikely macro to help compiler This patchset does some cleanup and tries to remove lockdep annotation. Patches 1~2 are just for really really minor improvement. Patches 3~9 are for clean-up and removing lockdep annotation. There are two cases that lockdep annotation is needed in SLAB. 1) holding two node locks 2) holding two array cache(alien cache) locks I looked at the code and found that we can avoid these cases without any negative effect. 1) occurs if freeing object makes new free slab and we decide to destroy it. Although we don't need to hold the lock during destroying a slab, current code do that. Destroying a slab without holding the lock would help the reduction of the lock contention. To do it, I change the implementation that new free slab is destroyed after releasing the lock. 2) occurs on similar situation. When we free object from non-local node, we put this object to alien cache with holding the alien cache lock. If alien cache is full, we try to flush alien cache to proper node cache, and, in this time, new free slab could be made. Destroying it would be started and we will free metadata object which comes from another node. In this case, we need another node's alien cache lock to free object. This forces us to hold two array cache locks and then we need lockdep annotation although they are always different locks and deadlock cannot be possible. To prevent this situation, I use same way as 1). In this way, we can avoid 1) and 2) cases, and then, can remove lockdep annotation. As short stat noted, this makes SLAB code much simpler. This patch (of 9): slab_should_failslab() is called on every allocation, so to optimize it is reasonable. We normally don't allocate from kmem_cache. It is just used when new kmem_cache is created, so it's very rare case. Therefore, add unlikely macro to help compiler optimization. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-07 07:04:20 +08:00
if (unlikely(cachep == kmem_cache))
return false;
return should_failslab(cachep->object_size, flags, cachep->flags);
}
static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
{
void *objp;
struct array_cache *ac;
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
bool force_refill = false;
check_irq_off();
ac = cpu_cache_get(cachep);
if (likely(ac->avail)) {
ac->touched = 1;
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
objp = ac_get_obj(cachep, ac, flags, false);
/*
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
* Allow for the possibility all avail objects are not allowed
* by the current flags
*/
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
if (objp) {
STATS_INC_ALLOCHIT(cachep);
goto out;
}
force_refill = true;
}
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
STATS_INC_ALLOCMISS(cachep);
objp = cache_alloc_refill(cachep, flags, force_refill);
/*
* the 'ac' may be updated by cache_alloc_refill(),
* and kmemleak_erase() requires its correct value.
*/
ac = cpu_cache_get(cachep);
out:
/*
* To avoid a false negative, if an object that is in one of the
* per-CPU caches is leaked, we need to make sure kmemleak doesn't
* treat the array pointers as a reference to the object.
*/
if (objp)
kmemleak_erase(&ac->entry[ac->avail]);
return objp;
}
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
#ifdef CONFIG_NUMA
/*
* Try allocating on another node if PFA_SPREAD_SLAB is a mempolicy is set.
*
* If we are in_interrupt, then process context, including cpusets and
* mempolicy, may not apply and should not be used for allocation policy.
*/
static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags)
{
int nid_alloc, nid_here;
[PATCH] GFP_THISNODE for the slab allocator This patch insures that the slab node lists in the NUMA case only contain slabs that belong to that specific node. All slab allocations use GFP_THISNODE when calling into the page allocator. If an allocation fails then we fall back in the slab allocator according to the zonelists appropriate for a certain context. This allows a replication of the behavior of alloc_pages and alloc_pages node in the slab layer. Currently allocations requested from the page allocator may be redirected via cpusets to other nodes. This results in remote pages on nodelists and that in turn results in interrupt latency issues during cache draining. Plus the slab is handing out memory as local when it is really remote. Fallback for slab memory allocations will occur within the slab allocator and not in the page allocator. This is necessary in order to be able to use the existing pools of objects on the nodes that we fall back to before adding more pages to a slab. The fallback function insures that the nodes we fall back to obey cpuset restrictions of the current context. We do not allocate objects from outside of the current cpuset context like before. Note that the implementation of locality constraints within the slab allocator requires importing logic from the page allocator. This is a mischmash that is not that great. Other allocators (uncached allocator, vmalloc, huge pages) face similar problems and have similar minimal reimplementations of the basic fallback logic of the page allocator. There is another way of implementing a slab by avoiding per node lists (see modular slab) but this wont work within the existing slab. V1->V2: - Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA - Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another #ifdef [akpm@osdl.org: build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 16:50:08 +08:00
if (in_interrupt() || (flags & __GFP_THISNODE))
return NULL;
numa: slab: use numa_mem_id() for slab local memory node Example usage of generic "numa_mem_id()": The mainline slab code, since ~ 2.6.19, does not handle memoryless nodes well. Specifically, the "fast path"--____cache_alloc()--will never succeed as slab doesn't cache offnode object on the per cpu queues, and for memoryless nodes, all memory will be "off node" relative to numa_node_id(). This adds significant overhead to all kmem cache allocations, incurring a significant regression relative to earlier kernels [from before slab.c was reorganized]. This patch uses the generic topology function "numa_mem_id()" to return the "effective local memory node" for the calling context. This is the first node in the local node's generic fallback zonelist-- the same node that "local" mempolicy-based allocations would use. This lets slab cache these "local" allocations and avoid fallback/refill on every allocation. N.B.: Slab will need to handle node and memory hotplug events that could change the value returned by numa_mem_id() for any given node if recent changes to address memory hotplug don't already address this. E.g., flush all per cpu slab queues before rebuilding the zonelists while the "machine" is held in the stopped state. Performance impact on "hackbench 400 process 200" 2.6.34-rc3-mmotm-100405-1609 no-patch this-patch ia64 no memoryless nodes [avg of 10]: 11.713 11.637 ~0.65 diff ia64 cpus all on memless nodes [10]: 228.259 26.484 ~8.6x speedup The slowdown of the patched kernel from ~12 sec to ~28 seconds when configured with memoryless nodes is the result of all cpus allocating from a single node's mm pagepool. The cache lines of the single node are distributed/interleaved over the memory of the real physical nodes, but the zone lock, list heads, ... of the single node with memory still each live in a single cache line that is accessed from all processors. x86_64 [8x6 AMD] [avg of 40]: 2.883 2.845 Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 05:45:03 +08:00
nid_alloc = nid_here = numa_mem_id();
if (cpuset_do_slab_mem_spread() && (cachep->flags & SLAB_MEM_SPREAD))
cpusets: new round-robin rotor for SLAB allocations We have observed several workloads running on multi-node systems where memory is assigned unevenly across the nodes in the system. There are numerous reasons for this but one is the round-robin rotor in cpuset_mem_spread_node(). For example, a simple test that writes a multi-page file will allocate pages on nodes 0 2 4 6 ... Odd nodes are skipped. (Sometimes it allocates on odd nodes & skips even nodes). An example is shown below. The program "lfile" writes a file consisting of 10 pages. The program then mmaps the file & uses get_mempolicy(..., MPOL_F_NODE) to determine the nodes where the file pages were allocated. The output is shown below: # ./lfile allocated on nodes: 2 4 6 0 1 2 6 0 2 There is a single rotor that is used for allocating both file pages & slab pages. Writing the file allocates both a data page & a slab page (buffer_head). This advances the RR rotor 2 nodes for each page allocated. A quick confirmation seems to confirm this is the cause of the uneven allocation: # echo 0 >/dev/cpuset/memory_spread_slab # ./lfile allocated on nodes: 6 7 8 9 0 1 2 3 4 5 This patch introduces a second rotor that is used for slab allocations. Signed-off-by: Jack Steiner <steiner@sgi.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Paul Menage <menage@google.com> Cc: Jack Steiner <steiner@sgi.com> Cc: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 05:42:49 +08:00
nid_alloc = cpuset_slab_spread_node();
else if (current->mempolicy)
nid_alloc = mempolicy_slab_node();
if (nid_alloc != nid_here)
return ____cache_alloc_node(cachep, flags, nid_alloc);
return NULL;
}
[PATCH] GFP_THISNODE for the slab allocator This patch insures that the slab node lists in the NUMA case only contain slabs that belong to that specific node. All slab allocations use GFP_THISNODE when calling into the page allocator. If an allocation fails then we fall back in the slab allocator according to the zonelists appropriate for a certain context. This allows a replication of the behavior of alloc_pages and alloc_pages node in the slab layer. Currently allocations requested from the page allocator may be redirected via cpusets to other nodes. This results in remote pages on nodelists and that in turn results in interrupt latency issues during cache draining. Plus the slab is handing out memory as local when it is really remote. Fallback for slab memory allocations will occur within the slab allocator and not in the page allocator. This is necessary in order to be able to use the existing pools of objects on the nodes that we fall back to before adding more pages to a slab. The fallback function insures that the nodes we fall back to obey cpuset restrictions of the current context. We do not allocate objects from outside of the current cpuset context like before. Note that the implementation of locality constraints within the slab allocator requires importing logic from the page allocator. This is a mischmash that is not that great. Other allocators (uncached allocator, vmalloc, huge pages) face similar problems and have similar minimal reimplementations of the basic fallback logic of the page allocator. There is another way of implementing a slab by avoiding per node lists (see modular slab) but this wont work within the existing slab. V1->V2: - Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA - Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another #ifdef [akpm@osdl.org: build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 16:50:08 +08:00
/*
* Fallback function if there was no memory available and no objects on a
[PATCH] slab: better fallback allocation behavior Currently we simply attempt to allocate from all allowed nodes using GFP_THISNODE. However, GFP_THISNODE does not do reclaim (it wont do any at all if the recent GFP_THISNODE patch is accepted). If we truly run out of memory in the whole system then fallback_alloc may return NULL although memory may still be available if we would perform more thorough reclaim. This patch changes fallback_alloc() so that we first only inspect all the per node queues for available slabs. If we find any then we allocate from those. This avoids slab fragmentation by first getting rid of all partial allocated slabs on every node before allocating new memory. If we cannot satisfy the allocation from any per node queue then we extend a slab. We now call into the page allocator without specifying GFP_THISNODE. The page allocator will then implement its own fallback (in the given cpuset context), perform necessary reclaim (again considering not a single node but the whole set of allowed nodes) and then return pages for a new slab. We identify from which node the pages were allocated and then insert the pages into the corresponding per node structure. In order to do so we need to modify cache_grow() to take a parameter that specifies the new slab. kmem_getpages() can no longer set the GFP_THISNODE flag since we need to be able to use kmem_getpage to allocate from an arbitrary node. GFP_THISNODE needs to be specified when calling cache_grow(). One key advantage is that the decision from which node to allocate new memory is removed from slab fallback processing. The patch allows to go back to use of the page allocators fallback/reclaim logic. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 12:33:29 +08:00
* certain node and fall back is permitted. First we scan all the
* available node for available objects. If that fails then we
[PATCH] slab: better fallback allocation behavior Currently we simply attempt to allocate from all allowed nodes using GFP_THISNODE. However, GFP_THISNODE does not do reclaim (it wont do any at all if the recent GFP_THISNODE patch is accepted). If we truly run out of memory in the whole system then fallback_alloc may return NULL although memory may still be available if we would perform more thorough reclaim. This patch changes fallback_alloc() so that we first only inspect all the per node queues for available slabs. If we find any then we allocate from those. This avoids slab fragmentation by first getting rid of all partial allocated slabs on every node before allocating new memory. If we cannot satisfy the allocation from any per node queue then we extend a slab. We now call into the page allocator without specifying GFP_THISNODE. The page allocator will then implement its own fallback (in the given cpuset context), perform necessary reclaim (again considering not a single node but the whole set of allowed nodes) and then return pages for a new slab. We identify from which node the pages were allocated and then insert the pages into the corresponding per node structure. In order to do so we need to modify cache_grow() to take a parameter that specifies the new slab. kmem_getpages() can no longer set the GFP_THISNODE flag since we need to be able to use kmem_getpage to allocate from an arbitrary node. GFP_THISNODE needs to be specified when calling cache_grow(). One key advantage is that the decision from which node to allocate new memory is removed from slab fallback processing. The patch allows to go back to use of the page allocators fallback/reclaim logic. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 12:33:29 +08:00
* perform an allocation without specifying a node. This allows the page
* allocator to do its reclaim / fallback magic. We then insert the
* slab into the proper nodelist and then allocate from it.
[PATCH] GFP_THISNODE for the slab allocator This patch insures that the slab node lists in the NUMA case only contain slabs that belong to that specific node. All slab allocations use GFP_THISNODE when calling into the page allocator. If an allocation fails then we fall back in the slab allocator according to the zonelists appropriate for a certain context. This allows a replication of the behavior of alloc_pages and alloc_pages node in the slab layer. Currently allocations requested from the page allocator may be redirected via cpusets to other nodes. This results in remote pages on nodelists and that in turn results in interrupt latency issues during cache draining. Plus the slab is handing out memory as local when it is really remote. Fallback for slab memory allocations will occur within the slab allocator and not in the page allocator. This is necessary in order to be able to use the existing pools of objects on the nodes that we fall back to before adding more pages to a slab. The fallback function insures that the nodes we fall back to obey cpuset restrictions of the current context. We do not allocate objects from outside of the current cpuset context like before. Note that the implementation of locality constraints within the slab allocator requires importing logic from the page allocator. This is a mischmash that is not that great. Other allocators (uncached allocator, vmalloc, huge pages) face similar problems and have similar minimal reimplementations of the basic fallback logic of the page allocator. There is another way of implementing a slab by avoiding per node lists (see modular slab) but this wont work within the existing slab. V1->V2: - Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA - Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another #ifdef [akpm@osdl.org: build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 16:50:08 +08:00
*/
static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
[PATCH] GFP_THISNODE for the slab allocator This patch insures that the slab node lists in the NUMA case only contain slabs that belong to that specific node. All slab allocations use GFP_THISNODE when calling into the page allocator. If an allocation fails then we fall back in the slab allocator according to the zonelists appropriate for a certain context. This allows a replication of the behavior of alloc_pages and alloc_pages node in the slab layer. Currently allocations requested from the page allocator may be redirected via cpusets to other nodes. This results in remote pages on nodelists and that in turn results in interrupt latency issues during cache draining. Plus the slab is handing out memory as local when it is really remote. Fallback for slab memory allocations will occur within the slab allocator and not in the page allocator. This is necessary in order to be able to use the existing pools of objects on the nodes that we fall back to before adding more pages to a slab. The fallback function insures that the nodes we fall back to obey cpuset restrictions of the current context. We do not allocate objects from outside of the current cpuset context like before. Note that the implementation of locality constraints within the slab allocator requires importing logic from the page allocator. This is a mischmash that is not that great. Other allocators (uncached allocator, vmalloc, huge pages) face similar problems and have similar minimal reimplementations of the basic fallback logic of the page allocator. There is another way of implementing a slab by avoiding per node lists (see modular slab) but this wont work within the existing slab. V1->V2: - Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA - Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another #ifdef [akpm@osdl.org: build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 16:50:08 +08:00
{
struct zonelist *zonelist;
gfp_t local_flags;
mm: have zonelist contains structs with both a zone pointer and zone_idx Filtering zonelists requires very frequent use of zone_idx(). This is costly as it involves a lookup of another structure and a substraction operation. As the zone_idx is often required, it should be quickly accessible. The node idx could also be stored here if it was found that accessing zone->node is significant which may be the case on workloads where nodemasks are heavily used. This patch introduces a struct zoneref to store a zone pointer and a zone index. The zonelist then consists of an array of these struct zonerefs which are looked up as necessary. Helpers are given for accessing the zone index as well as the node index. [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers] [hugh@veritas.com: mm-have-zonelist: fix memcg ooms] [hugh@veritas.com: just return do_try_to_free_pages] [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:12:17 +08:00
struct zoneref *z;
struct zone *zone;
enum zone_type high_zoneidx = gfp_zone(flags);
[PATCH] GFP_THISNODE for the slab allocator This patch insures that the slab node lists in the NUMA case only contain slabs that belong to that specific node. All slab allocations use GFP_THISNODE when calling into the page allocator. If an allocation fails then we fall back in the slab allocator according to the zonelists appropriate for a certain context. This allows a replication of the behavior of alloc_pages and alloc_pages node in the slab layer. Currently allocations requested from the page allocator may be redirected via cpusets to other nodes. This results in remote pages on nodelists and that in turn results in interrupt latency issues during cache draining. Plus the slab is handing out memory as local when it is really remote. Fallback for slab memory allocations will occur within the slab allocator and not in the page allocator. This is necessary in order to be able to use the existing pools of objects on the nodes that we fall back to before adding more pages to a slab. The fallback function insures that the nodes we fall back to obey cpuset restrictions of the current context. We do not allocate objects from outside of the current cpuset context like before. Note that the implementation of locality constraints within the slab allocator requires importing logic from the page allocator. This is a mischmash that is not that great. Other allocators (uncached allocator, vmalloc, huge pages) face similar problems and have similar minimal reimplementations of the basic fallback logic of the page allocator. There is another way of implementing a slab by avoiding per node lists (see modular slab) but this wont work within the existing slab. V1->V2: - Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA - Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another #ifdef [akpm@osdl.org: build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 16:50:08 +08:00
void *obj = NULL;
[PATCH] slab: better fallback allocation behavior Currently we simply attempt to allocate from all allowed nodes using GFP_THISNODE. However, GFP_THISNODE does not do reclaim (it wont do any at all if the recent GFP_THISNODE patch is accepted). If we truly run out of memory in the whole system then fallback_alloc may return NULL although memory may still be available if we would perform more thorough reclaim. This patch changes fallback_alloc() so that we first only inspect all the per node queues for available slabs. If we find any then we allocate from those. This avoids slab fragmentation by first getting rid of all partial allocated slabs on every node before allocating new memory. If we cannot satisfy the allocation from any per node queue then we extend a slab. We now call into the page allocator without specifying GFP_THISNODE. The page allocator will then implement its own fallback (in the given cpuset context), perform necessary reclaim (again considering not a single node but the whole set of allowed nodes) and then return pages for a new slab. We identify from which node the pages were allocated and then insert the pages into the corresponding per node structure. In order to do so we need to modify cache_grow() to take a parameter that specifies the new slab. kmem_getpages() can no longer set the GFP_THISNODE flag since we need to be able to use kmem_getpage to allocate from an arbitrary node. GFP_THISNODE needs to be specified when calling cache_grow(). One key advantage is that the decision from which node to allocate new memory is removed from slab fallback processing. The patch allows to go back to use of the page allocators fallback/reclaim logic. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 12:33:29 +08:00
int nid;
cpuset: mm: reduce large amounts of memory barrier related damage v3 Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing cpuset's mems") wins a super prize for the largest number of memory barriers entered into fast paths for one commit. [get|put]_mems_allowed is incredibly heavy with pairs of full memory barriers inserted into a number of hot paths. This was detected while investigating at large page allocator slowdown introduced some time after 2.6.32. The largest portion of this overhead was shown by oprofile to be at an mfence introduced by this commit into the page allocator hot path. For extra style points, the commit introduced the use of yield() in an implementation of what looks like a spinning mutex. This patch replaces the full memory barriers on both read and write sides with a sequence counter with just read barriers on the fast path side. This is much cheaper on some architectures, including x86. The main bulk of the patch is the retry logic if the nodemask changes in a manner that can cause a false failure. While updating the nodemask, a check is made to see if a false failure is a risk. If it is, the sequence number gets bumped and parallel allocators will briefly stall while the nodemask update takes place. In a page fault test microbenchmark, oprofile samples from __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The actual results were 3.3.0-rc3 3.3.0-rc3 rc3-vanilla nobarrier-v2r1 Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%) Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%) Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%) Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%) Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%) Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%) Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%) Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%) Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%) Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%) Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%) Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%) Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%) Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%) Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%) MMTests Statistics: duration Sys Time Running Test (seconds) 135.68 132.17 User+Sys Time Running Test (seconds) 164.2 160.13 Total Elapsed Time (seconds) 123.46 120.87 The overall improvement is small but the System CPU time is much improved and roughly in correlation to what oprofile reported (these performance figures are without profiling so skew is expected). The actual number of page faults is noticeably improved. For benchmarks like kernel builds, the overall benefit is marginal but the system CPU time is slightly reduced. To test the actual bug the commit fixed I opened two terminals. The first ran within a cpuset and continually ran a small program that faulted 100M of anonymous data. In a second window, the nodemask of the cpuset was continually randomised in a loop. Without the commit, the program would fail every so often (usually within 10 seconds) and obviously with the commit everything worked fine. With this patch applied, it also worked fine so the fix should be functionally equivalent. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22 07:34:11 +08:00
unsigned int cpuset_mems_cookie;
if (flags & __GFP_THISNODE)
return NULL;
Categorize GFP flags The function of GFP_LEVEL_MASK seems to be unclear. In order to clear up the mystery we get rid of it and replace GFP_LEVEL_MASK with 3 sets of GFP flags: GFP_RECLAIM_MASK Flags used to control page allocator reclaim behavior. GFP_CONSTRAINT_MASK Flags used to limit where allocations can occur. GFP_SLAB_BUG_MASK Flags that the slab allocator BUG()s on. These replace the uses of GFP_LEVEL mask in the slab allocators and in vmalloc.c. The use of the flags not included in these sets may occur as a result of a slab allocation standing in for a page allocation when constructing scatter gather lists. Extraneous flags are cleared and not passed through to the page allocator. __GFP_MOVABLE/RECLAIMABLE, __GFP_COLD and __GFP_COMP will now be ignored if passed to a slab allocator. Change the allocation of allocator meta data in SLAB and vmalloc to not pass through flags listed in GFP_CONSTRAINT_MASK. SLAB already removes the __GFP_THISNODE flag for such allocations. Generalize that to also cover vmalloc. The use of GFP_CONSTRAINT_MASK also includes __GFP_HARDWALL. The impact of allocator metadata placement on access latency to the cachelines of the object itself is minimal since metadata is only referenced on alloc and free. The attempt is still made to place the meta data optimally but we consistently allow fallback both in SLAB and vmalloc (SLUB does not need to allocate metadata like that). Allocator metadata may serve multiple in kernel users and thus should not be subject to the limitations arising from a single allocation context. [akpm@linux-foundation.org: fix fallback_alloc()] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 16:25:41 +08:00
local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
[PATCH] GFP_THISNODE for the slab allocator This patch insures that the slab node lists in the NUMA case only contain slabs that belong to that specific node. All slab allocations use GFP_THISNODE when calling into the page allocator. If an allocation fails then we fall back in the slab allocator according to the zonelists appropriate for a certain context. This allows a replication of the behavior of alloc_pages and alloc_pages node in the slab layer. Currently allocations requested from the page allocator may be redirected via cpusets to other nodes. This results in remote pages on nodelists and that in turn results in interrupt latency issues during cache draining. Plus the slab is handing out memory as local when it is really remote. Fallback for slab memory allocations will occur within the slab allocator and not in the page allocator. This is necessary in order to be able to use the existing pools of objects on the nodes that we fall back to before adding more pages to a slab. The fallback function insures that the nodes we fall back to obey cpuset restrictions of the current context. We do not allocate objects from outside of the current cpuset context like before. Note that the implementation of locality constraints within the slab allocator requires importing logic from the page allocator. This is a mischmash that is not that great. Other allocators (uncached allocator, vmalloc, huge pages) face similar problems and have similar minimal reimplementations of the basic fallback logic of the page allocator. There is another way of implementing a slab by avoiding per node lists (see modular slab) but this wont work within the existing slab. V1->V2: - Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA - Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another #ifdef [akpm@osdl.org: build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 16:50:08 +08:00
cpuset: mm: reduce large amounts of memory barrier related damage v3 Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing cpuset's mems") wins a super prize for the largest number of memory barriers entered into fast paths for one commit. [get|put]_mems_allowed is incredibly heavy with pairs of full memory barriers inserted into a number of hot paths. This was detected while investigating at large page allocator slowdown introduced some time after 2.6.32. The largest portion of this overhead was shown by oprofile to be at an mfence introduced by this commit into the page allocator hot path. For extra style points, the commit introduced the use of yield() in an implementation of what looks like a spinning mutex. This patch replaces the full memory barriers on both read and write sides with a sequence counter with just read barriers on the fast path side. This is much cheaper on some architectures, including x86. The main bulk of the patch is the retry logic if the nodemask changes in a manner that can cause a false failure. While updating the nodemask, a check is made to see if a false failure is a risk. If it is, the sequence number gets bumped and parallel allocators will briefly stall while the nodemask update takes place. In a page fault test microbenchmark, oprofile samples from __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The actual results were 3.3.0-rc3 3.3.0-rc3 rc3-vanilla nobarrier-v2r1 Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%) Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%) Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%) Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%) Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%) Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%) Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%) Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%) Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%) Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%) Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%) Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%) Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%) Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%) Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%) MMTests Statistics: duration Sys Time Running Test (seconds) 135.68 132.17 User+Sys Time Running Test (seconds) 164.2 160.13 Total Elapsed Time (seconds) 123.46 120.87 The overall improvement is small but the System CPU time is much improved and roughly in correlation to what oprofile reported (these performance figures are without profiling so skew is expected). The actual number of page faults is noticeably improved. For benchmarks like kernel builds, the overall benefit is marginal but the system CPU time is slightly reduced. To test the actual bug the commit fixed I opened two terminals. The first ran within a cpuset and continually ran a small program that faulted 100M of anonymous data. In a second window, the nodemask of the cpuset was continually randomised in a loop. Without the commit, the program would fail every so often (usually within 10 seconds) and obviously with the commit everything worked fine. With this patch applied, it also worked fine so the fix should be functionally equivalent. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22 07:34:11 +08:00
retry_cpuset:
cpuset_mems_cookie = read_mems_allowed_begin();
zonelist = node_zonelist(mempolicy_slab_node(), flags);
cpuset: mm: reduce large amounts of memory barrier related damage v3 Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing cpuset's mems") wins a super prize for the largest number of memory barriers entered into fast paths for one commit. [get|put]_mems_allowed is incredibly heavy with pairs of full memory barriers inserted into a number of hot paths. This was detected while investigating at large page allocator slowdown introduced some time after 2.6.32. The largest portion of this overhead was shown by oprofile to be at an mfence introduced by this commit into the page allocator hot path. For extra style points, the commit introduced the use of yield() in an implementation of what looks like a spinning mutex. This patch replaces the full memory barriers on both read and write sides with a sequence counter with just read barriers on the fast path side. This is much cheaper on some architectures, including x86. The main bulk of the patch is the retry logic if the nodemask changes in a manner that can cause a false failure. While updating the nodemask, a check is made to see if a false failure is a risk. If it is, the sequence number gets bumped and parallel allocators will briefly stall while the nodemask update takes place. In a page fault test microbenchmark, oprofile samples from __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The actual results were 3.3.0-rc3 3.3.0-rc3 rc3-vanilla nobarrier-v2r1 Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%) Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%) Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%) Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%) Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%) Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%) Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%) Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%) Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%) Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%) Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%) Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%) Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%) Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%) Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%) MMTests Statistics: duration Sys Time Running Test (seconds) 135.68 132.17 User+Sys Time Running Test (seconds) 164.2 160.13 Total Elapsed Time (seconds) 123.46 120.87 The overall improvement is small but the System CPU time is much improved and roughly in correlation to what oprofile reported (these performance figures are without profiling so skew is expected). The actual number of page faults is noticeably improved. For benchmarks like kernel builds, the overall benefit is marginal but the system CPU time is slightly reduced. To test the actual bug the commit fixed I opened two terminals. The first ran within a cpuset and continually ran a small program that faulted 100M of anonymous data. In a second window, the nodemask of the cpuset was continually randomised in a loop. Without the commit, the program would fail every so often (usually within 10 seconds) and obviously with the commit everything worked fine. With this patch applied, it also worked fine so the fix should be functionally equivalent. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22 07:34:11 +08:00
[PATCH] slab: better fallback allocation behavior Currently we simply attempt to allocate from all allowed nodes using GFP_THISNODE. However, GFP_THISNODE does not do reclaim (it wont do any at all if the recent GFP_THISNODE patch is accepted). If we truly run out of memory in the whole system then fallback_alloc may return NULL although memory may still be available if we would perform more thorough reclaim. This patch changes fallback_alloc() so that we first only inspect all the per node queues for available slabs. If we find any then we allocate from those. This avoids slab fragmentation by first getting rid of all partial allocated slabs on every node before allocating new memory. If we cannot satisfy the allocation from any per node queue then we extend a slab. We now call into the page allocator without specifying GFP_THISNODE. The page allocator will then implement its own fallback (in the given cpuset context), perform necessary reclaim (again considering not a single node but the whole set of allowed nodes) and then return pages for a new slab. We identify from which node the pages were allocated and then insert the pages into the corresponding per node structure. In order to do so we need to modify cache_grow() to take a parameter that specifies the new slab. kmem_getpages() can no longer set the GFP_THISNODE flag since we need to be able to use kmem_getpage to allocate from an arbitrary node. GFP_THISNODE needs to be specified when calling cache_grow(). One key advantage is that the decision from which node to allocate new memory is removed from slab fallback processing. The patch allows to go back to use of the page allocators fallback/reclaim logic. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 12:33:29 +08:00
retry:
/*
* Look through allowed nodes for objects available
* from existing per node queues.
*/
for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
nid = zone_to_nid(zone);
slab: fix cpuset check in fallback_alloc fallback_alloc is called on kmalloc if the preferred node doesn't have free or partial slabs and there's no pages on the node's free list (GFP_THISNODE allocations fail). Before invoking the reclaimer it tries to locate a free or partial slab on other allowed nodes' lists. While iterating over the preferred node's zonelist it skips those zones which hardwall cpuset check returns false for. That means that for a task bound to a specific node using cpusets fallback_alloc will always ignore free slabs on other nodes and go directly to the reclaimer, which, however, may allocate from other nodes if cpuset.mem_hardwall is unset (default). As a result, we may get lists of free slabs grow without bounds on other nodes, which is bad, because inactive slabs are only evicted by cache_reap at a very slow rate and cannot be dropped forcefully. To reproduce the issue, run a process that will walk over a directory tree with lots of files inside a cpuset bound to a node that constantly experiences memory pressure. Look at num_slabs vs active_slabs growth as reported by /proc/slabinfo. To avoid this we should use softwall cpuset check in fallback_alloc. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Zefan Li <lizefan@huawei.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 08:58:25 +08:00
if (cpuset_zone_allowed(zone, flags) &&
get_node(cache, nid) &&
get_node(cache, nid)->free_objects) {
[PATCH] slab: better fallback allocation behavior Currently we simply attempt to allocate from all allowed nodes using GFP_THISNODE. However, GFP_THISNODE does not do reclaim (it wont do any at all if the recent GFP_THISNODE patch is accepted). If we truly run out of memory in the whole system then fallback_alloc may return NULL although memory may still be available if we would perform more thorough reclaim. This patch changes fallback_alloc() so that we first only inspect all the per node queues for available slabs. If we find any then we allocate from those. This avoids slab fragmentation by first getting rid of all partial allocated slabs on every node before allocating new memory. If we cannot satisfy the allocation from any per node queue then we extend a slab. We now call into the page allocator without specifying GFP_THISNODE. The page allocator will then implement its own fallback (in the given cpuset context), perform necessary reclaim (again considering not a single node but the whole set of allowed nodes) and then return pages for a new slab. We identify from which node the pages were allocated and then insert the pages into the corresponding per node structure. In order to do so we need to modify cache_grow() to take a parameter that specifies the new slab. kmem_getpages() can no longer set the GFP_THISNODE flag since we need to be able to use kmem_getpage to allocate from an arbitrary node. GFP_THISNODE needs to be specified when calling cache_grow(). One key advantage is that the decision from which node to allocate new memory is removed from slab fallback processing. The patch allows to go back to use of the page allocators fallback/reclaim logic. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 12:33:29 +08:00
obj = ____cache_alloc_node(cache,
mm: remove GFP_THISNODE NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE. GFP_THISNODE is a secret combination of gfp bits that have different behavior than expected. It is a combination of __GFP_THISNODE, __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator slowpath to fail without trying reclaim even though it may be used in combination with __GFP_WAIT. An example of the problem this creates: commit e97ca8e5b864 ("mm: fix GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE that really just wanted __GFP_THISNODE. The problem doesn't end there, however, because even it was a no-op for alloc_misplaced_dst_page(), which also sets __GFP_NORETRY and __GFP_NOWARN, and migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT is set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is a no-op in these cases since the page allocator special-cases __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN. It's time to just remove GFP_THISNODE entirely. We leave __GFP_THISNODE to restrict an allocation to a local node, but remove GFP_THISNODE and its obscurity. Instead, we require that a caller clear __GFP_WAIT if it wants to avoid reclaim. This allows the aforementioned functions to actually reclaim as they should. It also enables any future callers that want to do __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT. Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so it is unchanged. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Pravin Shelar <pshelar@nicira.com> Cc: Jarno Rajahalme <jrajahalme@nicira.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Greg Thelen <gthelen@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 06:46:55 +08:00
gfp_exact_node(flags), nid);
if (obj)
break;
}
[PATCH] slab: better fallback allocation behavior Currently we simply attempt to allocate from all allowed nodes using GFP_THISNODE. However, GFP_THISNODE does not do reclaim (it wont do any at all if the recent GFP_THISNODE patch is accepted). If we truly run out of memory in the whole system then fallback_alloc may return NULL although memory may still be available if we would perform more thorough reclaim. This patch changes fallback_alloc() so that we first only inspect all the per node queues for available slabs. If we find any then we allocate from those. This avoids slab fragmentation by first getting rid of all partial allocated slabs on every node before allocating new memory. If we cannot satisfy the allocation from any per node queue then we extend a slab. We now call into the page allocator without specifying GFP_THISNODE. The page allocator will then implement its own fallback (in the given cpuset context), perform necessary reclaim (again considering not a single node but the whole set of allowed nodes) and then return pages for a new slab. We identify from which node the pages were allocated and then insert the pages into the corresponding per node structure. In order to do so we need to modify cache_grow() to take a parameter that specifies the new slab. kmem_getpages() can no longer set the GFP_THISNODE flag since we need to be able to use kmem_getpage to allocate from an arbitrary node. GFP_THISNODE needs to be specified when calling cache_grow(). One key advantage is that the decision from which node to allocate new memory is removed from slab fallback processing. The patch allows to go back to use of the page allocators fallback/reclaim logic. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 12:33:29 +08:00
}
if (!obj) {
[PATCH] slab: better fallback allocation behavior Currently we simply attempt to allocate from all allowed nodes using GFP_THISNODE. However, GFP_THISNODE does not do reclaim (it wont do any at all if the recent GFP_THISNODE patch is accepted). If we truly run out of memory in the whole system then fallback_alloc may return NULL although memory may still be available if we would perform more thorough reclaim. This patch changes fallback_alloc() so that we first only inspect all the per node queues for available slabs. If we find any then we allocate from those. This avoids slab fragmentation by first getting rid of all partial allocated slabs on every node before allocating new memory. If we cannot satisfy the allocation from any per node queue then we extend a slab. We now call into the page allocator without specifying GFP_THISNODE. The page allocator will then implement its own fallback (in the given cpuset context), perform necessary reclaim (again considering not a single node but the whole set of allowed nodes) and then return pages for a new slab. We identify from which node the pages were allocated and then insert the pages into the corresponding per node structure. In order to do so we need to modify cache_grow() to take a parameter that specifies the new slab. kmem_getpages() can no longer set the GFP_THISNODE flag since we need to be able to use kmem_getpage to allocate from an arbitrary node. GFP_THISNODE needs to be specified when calling cache_grow(). One key advantage is that the decision from which node to allocate new memory is removed from slab fallback processing. The patch allows to go back to use of the page allocators fallback/reclaim logic. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 12:33:29 +08:00
/*
* This allocation will be performed within the constraints
* of the current cpuset / memory policy requirements.
* We may trigger various forms of reclaim on the allowed
* set and go into memory reserves if necessary.
*/
struct page *page;
mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-07 08:28:21 +08:00
if (gfpflags_allow_blocking(local_flags))
local_irq_enable();
kmem_flagcheck(cache, flags);
page = kmem_getpages(cache, local_flags, numa_mem_id());
mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-07 08:28:21 +08:00
if (gfpflags_allow_blocking(local_flags))
local_irq_disable();
if (page) {
[PATCH] slab: better fallback allocation behavior Currently we simply attempt to allocate from all allowed nodes using GFP_THISNODE. However, GFP_THISNODE does not do reclaim (it wont do any at all if the recent GFP_THISNODE patch is accepted). If we truly run out of memory in the whole system then fallback_alloc may return NULL although memory may still be available if we would perform more thorough reclaim. This patch changes fallback_alloc() so that we first only inspect all the per node queues for available slabs. If we find any then we allocate from those. This avoids slab fragmentation by first getting rid of all partial allocated slabs on every node before allocating new memory. If we cannot satisfy the allocation from any per node queue then we extend a slab. We now call into the page allocator without specifying GFP_THISNODE. The page allocator will then implement its own fallback (in the given cpuset context), perform necessary reclaim (again considering not a single node but the whole set of allowed nodes) and then return pages for a new slab. We identify from which node the pages were allocated and then insert the pages into the corresponding per node structure. In order to do so we need to modify cache_grow() to take a parameter that specifies the new slab. kmem_getpages() can no longer set the GFP_THISNODE flag since we need to be able to use kmem_getpage to allocate from an arbitrary node. GFP_THISNODE needs to be specified when calling cache_grow(). One key advantage is that the decision from which node to allocate new memory is removed from slab fallback processing. The patch allows to go back to use of the page allocators fallback/reclaim logic. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 12:33:29 +08:00
/*
* Insert into the appropriate per node queues
*/
nid = page_to_nid(page);
if (cache_grow(cache, flags, nid, page)) {
[PATCH] slab: better fallback allocation behavior Currently we simply attempt to allocate from all allowed nodes using GFP_THISNODE. However, GFP_THISNODE does not do reclaim (it wont do any at all if the recent GFP_THISNODE patch is accepted). If we truly run out of memory in the whole system then fallback_alloc may return NULL although memory may still be available if we would perform more thorough reclaim. This patch changes fallback_alloc() so that we first only inspect all the per node queues for available slabs. If we find any then we allocate from those. This avoids slab fragmentation by first getting rid of all partial allocated slabs on every node before allocating new memory. If we cannot satisfy the allocation from any per node queue then we extend a slab. We now call into the page allocator without specifying GFP_THISNODE. The page allocator will then implement its own fallback (in the given cpuset context), perform necessary reclaim (again considering not a single node but the whole set of allowed nodes) and then return pages for a new slab. We identify from which node the pages were allocated and then insert the pages into the corresponding per node structure. In order to do so we need to modify cache_grow() to take a parameter that specifies the new slab. kmem_getpages() can no longer set the GFP_THISNODE flag since we need to be able to use kmem_getpage to allocate from an arbitrary node. GFP_THISNODE needs to be specified when calling cache_grow(). One key advantage is that the decision from which node to allocate new memory is removed from slab fallback processing. The patch allows to go back to use of the page allocators fallback/reclaim logic. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 12:33:29 +08:00
obj = ____cache_alloc_node(cache,
mm: remove GFP_THISNODE NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE. GFP_THISNODE is a secret combination of gfp bits that have different behavior than expected. It is a combination of __GFP_THISNODE, __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator slowpath to fail without trying reclaim even though it may be used in combination with __GFP_WAIT. An example of the problem this creates: commit e97ca8e5b864 ("mm: fix GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE that really just wanted __GFP_THISNODE. The problem doesn't end there, however, because even it was a no-op for alloc_misplaced_dst_page(), which also sets __GFP_NORETRY and __GFP_NOWARN, and migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT is set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is a no-op in these cases since the page allocator special-cases __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN. It's time to just remove GFP_THISNODE entirely. We leave __GFP_THISNODE to restrict an allocation to a local node, but remove GFP_THISNODE and its obscurity. Instead, we require that a caller clear __GFP_WAIT if it wants to avoid reclaim. This allows the aforementioned functions to actually reclaim as they should. It also enables any future callers that want to do __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT. Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so it is unchanged. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Pravin Shelar <pshelar@nicira.com> Cc: Jarno Rajahalme <jrajahalme@nicira.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Greg Thelen <gthelen@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 06:46:55 +08:00
gfp_exact_node(flags), nid);
[PATCH] slab: better fallback allocation behavior Currently we simply attempt to allocate from all allowed nodes using GFP_THISNODE. However, GFP_THISNODE does not do reclaim (it wont do any at all if the recent GFP_THISNODE patch is accepted). If we truly run out of memory in the whole system then fallback_alloc may return NULL although memory may still be available if we would perform more thorough reclaim. This patch changes fallback_alloc() so that we first only inspect all the per node queues for available slabs. If we find any then we allocate from those. This avoids slab fragmentation by first getting rid of all partial allocated slabs on every node before allocating new memory. If we cannot satisfy the allocation from any per node queue then we extend a slab. We now call into the page allocator without specifying GFP_THISNODE. The page allocator will then implement its own fallback (in the given cpuset context), perform necessary reclaim (again considering not a single node but the whole set of allowed nodes) and then return pages for a new slab. We identify from which node the pages were allocated and then insert the pages into the corresponding per node structure. In order to do so we need to modify cache_grow() to take a parameter that specifies the new slab. kmem_getpages() can no longer set the GFP_THISNODE flag since we need to be able to use kmem_getpage to allocate from an arbitrary node. GFP_THISNODE needs to be specified when calling cache_grow(). One key advantage is that the decision from which node to allocate new memory is removed from slab fallback processing. The patch allows to go back to use of the page allocators fallback/reclaim logic. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 12:33:29 +08:00
if (!obj)
/*
* Another processor may allocate the
* objects in the slab since we are
* not holding any locks.
*/
goto retry;
} else {
/* cache_grow already freed obj */
[PATCH] slab: better fallback allocation behavior Currently we simply attempt to allocate from all allowed nodes using GFP_THISNODE. However, GFP_THISNODE does not do reclaim (it wont do any at all if the recent GFP_THISNODE patch is accepted). If we truly run out of memory in the whole system then fallback_alloc may return NULL although memory may still be available if we would perform more thorough reclaim. This patch changes fallback_alloc() so that we first only inspect all the per node queues for available slabs. If we find any then we allocate from those. This avoids slab fragmentation by first getting rid of all partial allocated slabs on every node before allocating new memory. If we cannot satisfy the allocation from any per node queue then we extend a slab. We now call into the page allocator without specifying GFP_THISNODE. The page allocator will then implement its own fallback (in the given cpuset context), perform necessary reclaim (again considering not a single node but the whole set of allowed nodes) and then return pages for a new slab. We identify from which node the pages were allocated and then insert the pages into the corresponding per node structure. In order to do so we need to modify cache_grow() to take a parameter that specifies the new slab. kmem_getpages() can no longer set the GFP_THISNODE flag since we need to be able to use kmem_getpage to allocate from an arbitrary node. GFP_THISNODE needs to be specified when calling cache_grow(). One key advantage is that the decision from which node to allocate new memory is removed from slab fallback processing. The patch allows to go back to use of the page allocators fallback/reclaim logic. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 12:33:29 +08:00
obj = NULL;
}
}
}
cpuset: mm: reduce large amounts of memory barrier related damage v3 Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing cpuset's mems") wins a super prize for the largest number of memory barriers entered into fast paths for one commit. [get|put]_mems_allowed is incredibly heavy with pairs of full memory barriers inserted into a number of hot paths. This was detected while investigating at large page allocator slowdown introduced some time after 2.6.32. The largest portion of this overhead was shown by oprofile to be at an mfence introduced by this commit into the page allocator hot path. For extra style points, the commit introduced the use of yield() in an implementation of what looks like a spinning mutex. This patch replaces the full memory barriers on both read and write sides with a sequence counter with just read barriers on the fast path side. This is much cheaper on some architectures, including x86. The main bulk of the patch is the retry logic if the nodemask changes in a manner that can cause a false failure. While updating the nodemask, a check is made to see if a false failure is a risk. If it is, the sequence number gets bumped and parallel allocators will briefly stall while the nodemask update takes place. In a page fault test microbenchmark, oprofile samples from __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The actual results were 3.3.0-rc3 3.3.0-rc3 rc3-vanilla nobarrier-v2r1 Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%) Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%) Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%) Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%) Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%) Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%) Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%) Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%) Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%) Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%) Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%) Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%) Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%) Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%) Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%) MMTests Statistics: duration Sys Time Running Test (seconds) 135.68 132.17 User+Sys Time Running Test (seconds) 164.2 160.13 Total Elapsed Time (seconds) 123.46 120.87 The overall improvement is small but the System CPU time is much improved and roughly in correlation to what oprofile reported (these performance figures are without profiling so skew is expected). The actual number of page faults is noticeably improved. For benchmarks like kernel builds, the overall benefit is marginal but the system CPU time is slightly reduced. To test the actual bug the commit fixed I opened two terminals. The first ran within a cpuset and continually ran a small program that faulted 100M of anonymous data. In a second window, the nodemask of the cpuset was continually randomised in a loop. Without the commit, the program would fail every so often (usually within 10 seconds) and obviously with the commit everything worked fine. With this patch applied, it also worked fine so the fix should be functionally equivalent. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22 07:34:11 +08:00
if (unlikely(!obj && read_mems_allowed_retry(cpuset_mems_cookie)))
cpuset: mm: reduce large amounts of memory barrier related damage v3 Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing cpuset's mems") wins a super prize for the largest number of memory barriers entered into fast paths for one commit. [get|put]_mems_allowed is incredibly heavy with pairs of full memory barriers inserted into a number of hot paths. This was detected while investigating at large page allocator slowdown introduced some time after 2.6.32. The largest portion of this overhead was shown by oprofile to be at an mfence introduced by this commit into the page allocator hot path. For extra style points, the commit introduced the use of yield() in an implementation of what looks like a spinning mutex. This patch replaces the full memory barriers on both read and write sides with a sequence counter with just read barriers on the fast path side. This is much cheaper on some architectures, including x86. The main bulk of the patch is the retry logic if the nodemask changes in a manner that can cause a false failure. While updating the nodemask, a check is made to see if a false failure is a risk. If it is, the sequence number gets bumped and parallel allocators will briefly stall while the nodemask update takes place. In a page fault test microbenchmark, oprofile samples from __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The actual results were 3.3.0-rc3 3.3.0-rc3 rc3-vanilla nobarrier-v2r1 Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%) Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%) Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%) Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%) Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%) Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%) Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%) Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%) Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%) Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%) Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%) Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%) Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%) Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%) Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%) MMTests Statistics: duration Sys Time Running Test (seconds) 135.68 132.17 User+Sys Time Running Test (seconds) 164.2 160.13 Total Elapsed Time (seconds) 123.46 120.87 The overall improvement is small but the System CPU time is much improved and roughly in correlation to what oprofile reported (these performance figures are without profiling so skew is expected). The actual number of page faults is noticeably improved. For benchmarks like kernel builds, the overall benefit is marginal but the system CPU time is slightly reduced. To test the actual bug the commit fixed I opened two terminals. The first ran within a cpuset and continually ran a small program that faulted 100M of anonymous data. In a second window, the nodemask of the cpuset was continually randomised in a loop. Without the commit, the program would fail every so often (usually within 10 seconds) and obviously with the commit everything worked fine. With this patch applied, it also worked fine so the fix should be functionally equivalent. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22 07:34:11 +08:00
goto retry_cpuset;
[PATCH] GFP_THISNODE for the slab allocator This patch insures that the slab node lists in the NUMA case only contain slabs that belong to that specific node. All slab allocations use GFP_THISNODE when calling into the page allocator. If an allocation fails then we fall back in the slab allocator according to the zonelists appropriate for a certain context. This allows a replication of the behavior of alloc_pages and alloc_pages node in the slab layer. Currently allocations requested from the page allocator may be redirected via cpusets to other nodes. This results in remote pages on nodelists and that in turn results in interrupt latency issues during cache draining. Plus the slab is handing out memory as local when it is really remote. Fallback for slab memory allocations will occur within the slab allocator and not in the page allocator. This is necessary in order to be able to use the existing pools of objects on the nodes that we fall back to before adding more pages to a slab. The fallback function insures that the nodes we fall back to obey cpuset restrictions of the current context. We do not allocate objects from outside of the current cpuset context like before. Note that the implementation of locality constraints within the slab allocator requires importing logic from the page allocator. This is a mischmash that is not that great. Other allocators (uncached allocator, vmalloc, huge pages) face similar problems and have similar minimal reimplementations of the basic fallback logic of the page allocator. There is another way of implementing a slab by avoiding per node lists (see modular slab) but this wont work within the existing slab. V1->V2: - Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA - Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another #ifdef [akpm@osdl.org: build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 16:50:08 +08:00
return obj;
}
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
/*
* A interface to enable slab creation on nodeid
*/
static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
int nodeid)
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
{
struct page *page;
struct kmem_cache_node *n;
void *obj;
int x;
slab: fix nodeid bounds check for non-contiguous node IDs The bounds check for nodeid in ____cache_alloc_node gives false positives on machines where the node IDs are not contiguous, leading to a panic at boot time. For example, on a POWER8 machine the node IDs are typically 0, 1, 16 and 17. This means that num_online_nodes() returns 4, so when ____cache_alloc_node is called with nodeid = 16 the VM_BUG_ON triggers, like this: kernel BUG at /home/paulus/kernel/kvm/mm/slab.c:3079! Call Trace: .____cache_alloc_node+0x5c/0x270 (unreliable) .kmem_cache_alloc_node_trace+0xdc/0x360 .init_list+0x3c/0x128 .kmem_cache_init+0x1dc/0x258 .start_kernel+0x2a0/0x568 start_here_common+0x20/0xa8 To fix this, we instead compare the nodeid with MAX_NUMNODES, and additionally make sure it isn't negative (since nodeid is an int). The check is there mainly to protect the array dereference in the get_node() call in the next line, and the array being dereferenced is of size MAX_NUMNODES. If the nodeid is in range but invalid (for example if the node is off-line), the BUG_ON in the next line will catch that. Fixes: 14e50c6a9bc2 ("mm: slab: Verify the nodeid passed to ____cache_alloc_node") Signed-off-by: Paul Mackerras <paulus@samba.org> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-03 07:59:48 +08:00
VM_BUG_ON(nodeid < 0 || nodeid >= MAX_NUMNODES);
n = get_node(cachep, nodeid);
BUG_ON(!n);
retry:
check_irq_off();
spin_lock(&n->list_lock);
page = get_first_slab(n);
if (!page)
goto must_grow;
check_spinlock_acquired_node(cachep, nodeid);
STATS_INC_NODEALLOCS(cachep);
STATS_INC_ACTIVE(cachep);
STATS_SET_HIGH(cachep);
BUG_ON(page->active == cachep->num);
obj = slab_get_obj(cachep, page, nodeid);
n->free_objects--;
/* move slabp to correct slabp list: */
list_del(&page->lru);
if (page->active == cachep->num)
list_add(&page->lru, &n->slabs_full);
else
list_add(&page->lru, &n->slabs_partial);
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
spin_unlock(&n->list_lock);
goto done;
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
must_grow:
spin_unlock(&n->list_lock);
mm: remove GFP_THISNODE NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE. GFP_THISNODE is a secret combination of gfp bits that have different behavior than expected. It is a combination of __GFP_THISNODE, __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator slowpath to fail without trying reclaim even though it may be used in combination with __GFP_WAIT. An example of the problem this creates: commit e97ca8e5b864 ("mm: fix GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE that really just wanted __GFP_THISNODE. The problem doesn't end there, however, because even it was a no-op for alloc_misplaced_dst_page(), which also sets __GFP_NORETRY and __GFP_NOWARN, and migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT is set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is a no-op in these cases since the page allocator special-cases __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN. It's time to just remove GFP_THISNODE entirely. We leave __GFP_THISNODE to restrict an allocation to a local node, but remove GFP_THISNODE and its obscurity. Instead, we require that a caller clear __GFP_WAIT if it wants to avoid reclaim. This allows the aforementioned functions to actually reclaim as they should. It also enables any future callers that want to do __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT. Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so it is unchanged. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Pravin Shelar <pshelar@nicira.com> Cc: Jarno Rajahalme <jrajahalme@nicira.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Greg Thelen <gthelen@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 06:46:55 +08:00
x = cache_grow(cachep, gfp_exact_node(flags), nodeid, NULL);
[PATCH] GFP_THISNODE for the slab allocator This patch insures that the slab node lists in the NUMA case only contain slabs that belong to that specific node. All slab allocations use GFP_THISNODE when calling into the page allocator. If an allocation fails then we fall back in the slab allocator according to the zonelists appropriate for a certain context. This allows a replication of the behavior of alloc_pages and alloc_pages node in the slab layer. Currently allocations requested from the page allocator may be redirected via cpusets to other nodes. This results in remote pages on nodelists and that in turn results in interrupt latency issues during cache draining. Plus the slab is handing out memory as local when it is really remote. Fallback for slab memory allocations will occur within the slab allocator and not in the page allocator. This is necessary in order to be able to use the existing pools of objects on the nodes that we fall back to before adding more pages to a slab. The fallback function insures that the nodes we fall back to obey cpuset restrictions of the current context. We do not allocate objects from outside of the current cpuset context like before. Note that the implementation of locality constraints within the slab allocator requires importing logic from the page allocator. This is a mischmash that is not that great. Other allocators (uncached allocator, vmalloc, huge pages) face similar problems and have similar minimal reimplementations of the basic fallback logic of the page allocator. There is another way of implementing a slab by avoiding per node lists (see modular slab) but this wont work within the existing slab. V1->V2: - Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA - Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another #ifdef [akpm@osdl.org: build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 16:50:08 +08:00
if (x)
goto retry;
return fallback_alloc(cachep, flags);
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
done:
return obj;
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
}
static __always_inline void *
slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
unsigned long caller)
{
unsigned long save_flags;
void *ptr;
numa: slab: use numa_mem_id() for slab local memory node Example usage of generic "numa_mem_id()": The mainline slab code, since ~ 2.6.19, does not handle memoryless nodes well. Specifically, the "fast path"--____cache_alloc()--will never succeed as slab doesn't cache offnode object on the per cpu queues, and for memoryless nodes, all memory will be "off node" relative to numa_node_id(). This adds significant overhead to all kmem cache allocations, incurring a significant regression relative to earlier kernels [from before slab.c was reorganized]. This patch uses the generic topology function "numa_mem_id()" to return the "effective local memory node" for the calling context. This is the first node in the local node's generic fallback zonelist-- the same node that "local" mempolicy-based allocations would use. This lets slab cache these "local" allocations and avoid fallback/refill on every allocation. N.B.: Slab will need to handle node and memory hotplug events that could change the value returned by numa_mem_id() for any given node if recent changes to address memory hotplug don't already address this. E.g., flush all per cpu slab queues before rebuilding the zonelists while the "machine" is held in the stopped state. Performance impact on "hackbench 400 process 200" 2.6.34-rc3-mmotm-100405-1609 no-patch this-patch ia64 no memoryless nodes [avg of 10]: 11.713 11.637 ~0.65 diff ia64 cpus all on memless nodes [10]: 228.259 26.484 ~8.6x speedup The slowdown of the patched kernel from ~12 sec to ~28 seconds when configured with memoryless nodes is the result of all cpus allocating from a single node's mm pagepool. The cache lines of the single node are distributed/interleaved over the memory of the real physical nodes, but the zone lock, list heads, ... of the single node with memory still each live in a single cache line that is accessed from all processors. x86_64 [8x6 AMD] [avg of 40]: 2.883 2.845 Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 05:45:03 +08:00
int slab_node = numa_mem_id();
flags &= gfp_allowed_mask;
lockdep: annotate reclaim context (__GFP_NOFS) Here is another version, with the incremental patch rolled up, and added reclaim context annotation to kswapd, and allocation tracing to slab allocators (which may only ever reach the page allocator in rare cases, so it is good to put annotations here too). Haven't tested this version as such, but it should be getting closer to merge worthy ;) -- After noticing some code in mm/filemap.c accidentally perform a __GFP_FS allocation when it should not have been, I thought it might be a good idea to try to catch this kind of thing with lockdep. I coded up a little idea that seems to work. Unfortunately the system has to actually be in __GFP_FS page reclaim, then take the lock, before it will mark it. But at least that might still be some orders of magnitude more common (and more debuggable) than an actual deadlock condition, so we have some improvement I hope (the concept is no less complete than discovery of a lock's interrupt contexts). I guess we could even do the same thing with __GFP_IO (normal reclaim), and even GFP_NOIO locks too... but filesystems will have the most locks and fiddly code paths, so let's start there and see how it goes. It *seems* to work. I did a quick test. ================================= [ INFO: inconsistent lock state ] 2.6.28-rc6-00007-ged31348-dirty #26 --------------------------------- inconsistent {in-reclaim-W} -> {ov-reclaim-W} usage. modprobe/8526 [HC0[0]:SC0[0]:HE1:SE1] takes: (testlock){--..}, at: [<ffffffffa0020055>] brd_init+0x55/0x216 [brd] {in-reclaim-W} state was registered at: [<ffffffff80267bdb>] __lock_acquire+0x75b/0x1a60 [<ffffffff80268f71>] lock_acquire+0x91/0xc0 [<ffffffff8070f0e1>] mutex_lock_nested+0xb1/0x310 [<ffffffffa002002b>] brd_init+0x2b/0x216 [brd] [<ffffffff8020903b>] _stext+0x3b/0x170 [<ffffffff80272ebf>] sys_init_module+0xaf/0x1e0 [<ffffffff8020c3fb>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff irq event stamp: 3929 hardirqs last enabled at (3929): [<ffffffff8070f2b5>] mutex_lock_nested+0x285/0x310 hardirqs last disabled at (3928): [<ffffffff8070f089>] mutex_lock_nested+0x59/0x310 softirqs last enabled at (3732): [<ffffffff8061f623>] sk_filter+0x83/0xe0 softirqs last disabled at (3730): [<ffffffff8061f5b6>] sk_filter+0x16/0xe0 other info that might help us debug this: 1 lock held by modprobe/8526: #0: (testlock){--..}, at: [<ffffffffa0020055>] brd_init+0x55/0x216 [brd] stack backtrace: Pid: 8526, comm: modprobe Not tainted 2.6.28-rc6-00007-ged31348-dirty #26 Call Trace: [<ffffffff80265483>] print_usage_bug+0x193/0x1d0 [<ffffffff80266530>] mark_lock+0xaf0/0xca0 [<ffffffff80266735>] mark_held_locks+0x55/0xc0 [<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd] [<ffffffff802667ca>] trace_reclaim_fs+0x2a/0x60 [<ffffffff80285005>] __alloc_pages_internal+0x475/0x580 [<ffffffff8070f29e>] ? mutex_lock_nested+0x26e/0x310 [<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd] [<ffffffffa002006a>] brd_init+0x6a/0x216 [brd] [<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd] [<ffffffff8020903b>] _stext+0x3b/0x170 [<ffffffff8070f8b9>] ? mutex_unlock+0x9/0x10 [<ffffffff8070f83d>] ? __mutex_unlock_slowpath+0x10d/0x180 [<ffffffff802669ec>] ? trace_hardirqs_on_caller+0x12c/0x190 [<ffffffff80272ebf>] sys_init_module+0xaf/0x1e0 [<ffffffff8020c3fb>] system_call_fastpath+0x16/0x1b Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-21 15:12:39 +08:00
lockdep_trace_alloc(flags);
if (slab_should_failslab(cachep, flags))
return NULL;
cachep = memcg_kmem_get_cache(cachep, flags);
cache_alloc_debugcheck_before(cachep, flags);
local_irq_save(save_flags);
if (nodeid == NUMA_NO_NODE)
numa: slab: use numa_mem_id() for slab local memory node Example usage of generic "numa_mem_id()": The mainline slab code, since ~ 2.6.19, does not handle memoryless nodes well. Specifically, the "fast path"--____cache_alloc()--will never succeed as slab doesn't cache offnode object on the per cpu queues, and for memoryless nodes, all memory will be "off node" relative to numa_node_id(). This adds significant overhead to all kmem cache allocations, incurring a significant regression relative to earlier kernels [from before slab.c was reorganized]. This patch uses the generic topology function "numa_mem_id()" to return the "effective local memory node" for the calling context. This is the first node in the local node's generic fallback zonelist-- the same node that "local" mempolicy-based allocations would use. This lets slab cache these "local" allocations and avoid fallback/refill on every allocation. N.B.: Slab will need to handle node and memory hotplug events that could change the value returned by numa_mem_id() for any given node if recent changes to address memory hotplug don't already address this. E.g., flush all per cpu slab queues before rebuilding the zonelists while the "machine" is held in the stopped state. Performance impact on "hackbench 400 process 200" 2.6.34-rc3-mmotm-100405-1609 no-patch this-patch ia64 no memoryless nodes [avg of 10]: 11.713 11.637 ~0.65 diff ia64 cpus all on memless nodes [10]: 228.259 26.484 ~8.6x speedup The slowdown of the patched kernel from ~12 sec to ~28 seconds when configured with memoryless nodes is the result of all cpus allocating from a single node's mm pagepool. The cache lines of the single node are distributed/interleaved over the memory of the real physical nodes, but the zone lock, list heads, ... of the single node with memory still each live in a single cache line that is accessed from all processors. x86_64 [8x6 AMD] [avg of 40]: 2.883 2.845 Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 05:45:03 +08:00
nodeid = slab_node;
if (unlikely(!get_node(cachep, nodeid))) {
/* Node not bootstrapped yet */
ptr = fallback_alloc(cachep, flags);
goto out;
}
numa: slab: use numa_mem_id() for slab local memory node Example usage of generic "numa_mem_id()": The mainline slab code, since ~ 2.6.19, does not handle memoryless nodes well. Specifically, the "fast path"--____cache_alloc()--will never succeed as slab doesn't cache offnode object on the per cpu queues, and for memoryless nodes, all memory will be "off node" relative to numa_node_id(). This adds significant overhead to all kmem cache allocations, incurring a significant regression relative to earlier kernels [from before slab.c was reorganized]. This patch uses the generic topology function "numa_mem_id()" to return the "effective local memory node" for the calling context. This is the first node in the local node's generic fallback zonelist-- the same node that "local" mempolicy-based allocations would use. This lets slab cache these "local" allocations and avoid fallback/refill on every allocation. N.B.: Slab will need to handle node and memory hotplug events that could change the value returned by numa_mem_id() for any given node if recent changes to address memory hotplug don't already address this. E.g., flush all per cpu slab queues before rebuilding the zonelists while the "machine" is held in the stopped state. Performance impact on "hackbench 400 process 200" 2.6.34-rc3-mmotm-100405-1609 no-patch this-patch ia64 no memoryless nodes [avg of 10]: 11.713 11.637 ~0.65 diff ia64 cpus all on memless nodes [10]: 228.259 26.484 ~8.6x speedup The slowdown of the patched kernel from ~12 sec to ~28 seconds when configured with memoryless nodes is the result of all cpus allocating from a single node's mm pagepool. The cache lines of the single node are distributed/interleaved over the memory of the real physical nodes, but the zone lock, list heads, ... of the single node with memory still each live in a single cache line that is accessed from all processors. x86_64 [8x6 AMD] [avg of 40]: 2.883 2.845 Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 05:45:03 +08:00
if (nodeid == slab_node) {
/*
* Use the locally cached objects if possible.
* However ____cache_alloc does not allow fallback
* to other nodes. It may fail while we still have
* objects on other nodes available.
*/
ptr = ____cache_alloc(cachep, flags);
if (ptr)
goto out;
}
/* ___cache_alloc_node can fall back to other nodes */
ptr = ____cache_alloc_node(cachep, flags, nodeid);
out:
local_irq_restore(save_flags);
ptr = cache_alloc_debugcheck_after(cachep, flags, ptr, caller);
kmemleak_alloc_recursive(ptr, cachep->object_size, 1, cachep->flags,
flags);
if (likely(ptr)) {
kmemcheck_slab_alloc(cachep, flags, ptr, cachep->object_size);
if (unlikely(flags & __GFP_ZERO))
memset(ptr, 0, cachep->object_size);
}
memcg: fix possible use-after-free in memcg_kmem_get_cache() Suppose task @t that belongs to a memory cgroup @memcg is going to allocate an object from a kmem cache @c. The copy of @c corresponding to @memcg, @mc, is empty. Then if kmem_cache_alloc races with the memory cgroup destruction we can access the memory cgroup's copy of the cache after it was destroyed: CPU0 CPU1 ---- ---- [ current=@t @mc->memcg_params->nr_pages=0 ] kmem_cache_alloc(@c): call memcg_kmem_get_cache(@c); proceed to allocation from @mc: alloc a page for @mc: ... move @t from @memcg destroy @memcg: mem_cgroup_css_offline(@memcg): memcg_unregister_all_caches(@memcg): kmem_cache_destroy(@mc) add page to @mc We could fix this issue by taking a reference to a per-memcg cache, but that would require adding a per-cpu reference counter to per-memcg caches, which would look cumbersome. Instead, let's take a reference to a memory cgroup, which already has a per-cpu reference counter, in the beginning of kmem_cache_alloc to be dropped in the end, and move per memcg caches destruction from css offline to css free. As a side effect, per-memcg caches will be destroyed not one by one, but all at once when the last page accounted to the memory cgroup is freed. This doesn't sound as a high price for code readability though. Note, this patch does add some overhead to the kmem_cache_alloc hot path, but it is pretty negligible - it's just a function call plus a per cpu counter decrement, which is comparable to what we already have in memcg_kmem_get_cache. Besides, it's only relevant if there are memory cgroups with kmem accounting enabled. I don't think we can find a way to handle this race w/o it, because alloc_page called from kmem_cache_alloc may sleep so we can't flush all pending kmallocs w/o reference counting. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 08:56:38 +08:00
memcg_kmem_put_cache(cachep);
return ptr;
}
static __always_inline void *
__do_cache_alloc(struct kmem_cache *cache, gfp_t flags)
{
void *objp;
if (current->mempolicy || cpuset_do_slab_mem_spread()) {
objp = alternate_node_alloc(cache, flags);
if (objp)
goto out;
}
objp = ____cache_alloc(cache, flags);
/*
* We may just have run out of memory on the local node.
* ____cache_alloc_node() knows how to locate memory on other nodes
*/
numa: slab: use numa_mem_id() for slab local memory node Example usage of generic "numa_mem_id()": The mainline slab code, since ~ 2.6.19, does not handle memoryless nodes well. Specifically, the "fast path"--____cache_alloc()--will never succeed as slab doesn't cache offnode object on the per cpu queues, and for memoryless nodes, all memory will be "off node" relative to numa_node_id(). This adds significant overhead to all kmem cache allocations, incurring a significant regression relative to earlier kernels [from before slab.c was reorganized]. This patch uses the generic topology function "numa_mem_id()" to return the "effective local memory node" for the calling context. This is the first node in the local node's generic fallback zonelist-- the same node that "local" mempolicy-based allocations would use. This lets slab cache these "local" allocations and avoid fallback/refill on every allocation. N.B.: Slab will need to handle node and memory hotplug events that could change the value returned by numa_mem_id() for any given node if recent changes to address memory hotplug don't already address this. E.g., flush all per cpu slab queues before rebuilding the zonelists while the "machine" is held in the stopped state. Performance impact on "hackbench 400 process 200" 2.6.34-rc3-mmotm-100405-1609 no-patch this-patch ia64 no memoryless nodes [avg of 10]: 11.713 11.637 ~0.65 diff ia64 cpus all on memless nodes [10]: 228.259 26.484 ~8.6x speedup The slowdown of the patched kernel from ~12 sec to ~28 seconds when configured with memoryless nodes is the result of all cpus allocating from a single node's mm pagepool. The cache lines of the single node are distributed/interleaved over the memory of the real physical nodes, but the zone lock, list heads, ... of the single node with memory still each live in a single cache line that is accessed from all processors. x86_64 [8x6 AMD] [avg of 40]: 2.883 2.845 Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 05:45:03 +08:00
if (!objp)
objp = ____cache_alloc_node(cache, flags, numa_mem_id());
out:
return objp;
}
#else
static __always_inline void *
__do_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
{
return ____cache_alloc(cachep, flags);
}
#endif /* CONFIG_NUMA */
static __always_inline void *
slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
{
unsigned long save_flags;
void *objp;
flags &= gfp_allowed_mask;
lockdep: annotate reclaim context (__GFP_NOFS) Here is another version, with the incremental patch rolled up, and added reclaim context annotation to kswapd, and allocation tracing to slab allocators (which may only ever reach the page allocator in rare cases, so it is good to put annotations here too). Haven't tested this version as such, but it should be getting closer to merge worthy ;) -- After noticing some code in mm/filemap.c accidentally perform a __GFP_FS allocation when it should not have been, I thought it might be a good idea to try to catch this kind of thing with lockdep. I coded up a little idea that seems to work. Unfortunately the system has to actually be in __GFP_FS page reclaim, then take the lock, before it will mark it. But at least that might still be some orders of magnitude more common (and more debuggable) than an actual deadlock condition, so we have some improvement I hope (the concept is no less complete than discovery of a lock's interrupt contexts). I guess we could even do the same thing with __GFP_IO (normal reclaim), and even GFP_NOIO locks too... but filesystems will have the most locks and fiddly code paths, so let's start there and see how it goes. It *seems* to work. I did a quick test. ================================= [ INFO: inconsistent lock state ] 2.6.28-rc6-00007-ged31348-dirty #26 --------------------------------- inconsistent {in-reclaim-W} -> {ov-reclaim-W} usage. modprobe/8526 [HC0[0]:SC0[0]:HE1:SE1] takes: (testlock){--..}, at: [<ffffffffa0020055>] brd_init+0x55/0x216 [brd] {in-reclaim-W} state was registered at: [<ffffffff80267bdb>] __lock_acquire+0x75b/0x1a60 [<ffffffff80268f71>] lock_acquire+0x91/0xc0 [<ffffffff8070f0e1>] mutex_lock_nested+0xb1/0x310 [<ffffffffa002002b>] brd_init+0x2b/0x216 [brd] [<ffffffff8020903b>] _stext+0x3b/0x170 [<ffffffff80272ebf>] sys_init_module+0xaf/0x1e0 [<ffffffff8020c3fb>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff irq event stamp: 3929 hardirqs last enabled at (3929): [<ffffffff8070f2b5>] mutex_lock_nested+0x285/0x310 hardirqs last disabled at (3928): [<ffffffff8070f089>] mutex_lock_nested+0x59/0x310 softirqs last enabled at (3732): [<ffffffff8061f623>] sk_filter+0x83/0xe0 softirqs last disabled at (3730): [<ffffffff8061f5b6>] sk_filter+0x16/0xe0 other info that might help us debug this: 1 lock held by modprobe/8526: #0: (testlock){--..}, at: [<ffffffffa0020055>] brd_init+0x55/0x216 [brd] stack backtrace: Pid: 8526, comm: modprobe Not tainted 2.6.28-rc6-00007-ged31348-dirty #26 Call Trace: [<ffffffff80265483>] print_usage_bug+0x193/0x1d0 [<ffffffff80266530>] mark_lock+0xaf0/0xca0 [<ffffffff80266735>] mark_held_locks+0x55/0xc0 [<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd] [<ffffffff802667ca>] trace_reclaim_fs+0x2a/0x60 [<ffffffff80285005>] __alloc_pages_internal+0x475/0x580 [<ffffffff8070f29e>] ? mutex_lock_nested+0x26e/0x310 [<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd] [<ffffffffa002006a>] brd_init+0x6a/0x216 [brd] [<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd] [<ffffffff8020903b>] _stext+0x3b/0x170 [<ffffffff8070f8b9>] ? mutex_unlock+0x9/0x10 [<ffffffff8070f83d>] ? __mutex_unlock_slowpath+0x10d/0x180 [<ffffffff802669ec>] ? trace_hardirqs_on_caller+0x12c/0x190 [<ffffffff80272ebf>] sys_init_module+0xaf/0x1e0 [<ffffffff8020c3fb>] system_call_fastpath+0x16/0x1b Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-21 15:12:39 +08:00
lockdep_trace_alloc(flags);
if (slab_should_failslab(cachep, flags))
return NULL;
cachep = memcg_kmem_get_cache(cachep, flags);
cache_alloc_debugcheck_before(cachep, flags);
local_irq_save(save_flags);
objp = __do_cache_alloc(cachep, flags);
local_irq_restore(save_flags);
objp = cache_alloc_debugcheck_after(cachep, flags, objp, caller);
kmemleak_alloc_recursive(objp, cachep->object_size, 1, cachep->flags,
flags);
prefetchw(objp);
if (likely(objp)) {
kmemcheck_slab_alloc(cachep, flags, objp, cachep->object_size);
if (unlikely(flags & __GFP_ZERO))
memset(objp, 0, cachep->object_size);
}
memcg: fix possible use-after-free in memcg_kmem_get_cache() Suppose task @t that belongs to a memory cgroup @memcg is going to allocate an object from a kmem cache @c. The copy of @c corresponding to @memcg, @mc, is empty. Then if kmem_cache_alloc races with the memory cgroup destruction we can access the memory cgroup's copy of the cache after it was destroyed: CPU0 CPU1 ---- ---- [ current=@t @mc->memcg_params->nr_pages=0 ] kmem_cache_alloc(@c): call memcg_kmem_get_cache(@c); proceed to allocation from @mc: alloc a page for @mc: ... move @t from @memcg destroy @memcg: mem_cgroup_css_offline(@memcg): memcg_unregister_all_caches(@memcg): kmem_cache_destroy(@mc) add page to @mc We could fix this issue by taking a reference to a per-memcg cache, but that would require adding a per-cpu reference counter to per-memcg caches, which would look cumbersome. Instead, let's take a reference to a memory cgroup, which already has a per-cpu reference counter, in the beginning of kmem_cache_alloc to be dropped in the end, and move per memcg caches destruction from css offline to css free. As a side effect, per-memcg caches will be destroyed not one by one, but all at once when the last page accounted to the memory cgroup is freed. This doesn't sound as a high price for code readability though. Note, this patch does add some overhead to the kmem_cache_alloc hot path, but it is pretty negligible - it's just a function call plus a per cpu counter decrement, which is comparable to what we already have in memcg_kmem_get_cache. Besides, it's only relevant if there are memory cgroups with kmem accounting enabled. I don't think we can find a way to handle this race w/o it, because alloc_page called from kmem_cache_alloc may sleep so we can't flush all pending kmallocs w/o reference counting. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 08:56:38 +08:00
memcg_kmem_put_cache(cachep);
return objp;
}
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
/*
* Caller needs to acquire correct kmem_cache_node's list_lock
* @list: List of detached free slabs should be freed by caller
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
*/
static void free_block(struct kmem_cache *cachep, void **objpp,
int nr_objects, int node, struct list_head *list)
{
int i;
struct kmem_cache_node *n = get_node(cachep, node);
for (i = 0; i < nr_objects; i++) {
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
void *objp;
struct page *page;
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
clear_obj_pfmemalloc(&objpp[i]);
objp = objpp[i];
page = virt_to_head_page(objp);
list_del(&page->lru);
check_spinlock_acquired_node(cachep, node);
slab_put_obj(cachep, page, objp, node);
STATS_DEC_ACTIVE(cachep);
n->free_objects++;
/* fixup slab chains */
if (page->active == 0) {
if (n->free_objects > n->free_limit) {
n->free_objects -= cachep->num;
list_add_tail(&page->lru, list);
} else {
list_add(&page->lru, &n->slabs_free);
}
} else {
/* Unconditionally move a slab to the end of the
* partial list on free - maximum time for the
* other objects to be freed, too.
*/
list_add_tail(&page->lru, &n->slabs_partial);
}
}
}
static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
{
int batchcount;
struct kmem_cache_node *n;
numa: slab: use numa_mem_id() for slab local memory node Example usage of generic "numa_mem_id()": The mainline slab code, since ~ 2.6.19, does not handle memoryless nodes well. Specifically, the "fast path"--____cache_alloc()--will never succeed as slab doesn't cache offnode object on the per cpu queues, and for memoryless nodes, all memory will be "off node" relative to numa_node_id(). This adds significant overhead to all kmem cache allocations, incurring a significant regression relative to earlier kernels [from before slab.c was reorganized]. This patch uses the generic topology function "numa_mem_id()" to return the "effective local memory node" for the calling context. This is the first node in the local node's generic fallback zonelist-- the same node that "local" mempolicy-based allocations would use. This lets slab cache these "local" allocations and avoid fallback/refill on every allocation. N.B.: Slab will need to handle node and memory hotplug events that could change the value returned by numa_mem_id() for any given node if recent changes to address memory hotplug don't already address this. E.g., flush all per cpu slab queues before rebuilding the zonelists while the "machine" is held in the stopped state. Performance impact on "hackbench 400 process 200" 2.6.34-rc3-mmotm-100405-1609 no-patch this-patch ia64 no memoryless nodes [avg of 10]: 11.713 11.637 ~0.65 diff ia64 cpus all on memless nodes [10]: 228.259 26.484 ~8.6x speedup The slowdown of the patched kernel from ~12 sec to ~28 seconds when configured with memoryless nodes is the result of all cpus allocating from a single node's mm pagepool. The cache lines of the single node are distributed/interleaved over the memory of the real physical nodes, but the zone lock, list heads, ... of the single node with memory still each live in a single cache line that is accessed from all processors. x86_64 [8x6 AMD] [avg of 40]: 2.883 2.845 Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 05:45:03 +08:00
int node = numa_mem_id();
LIST_HEAD(list);
batchcount = ac->batchcount;
#if DEBUG
BUG_ON(!batchcount || batchcount > ac->avail);
#endif
check_irq_off();
n = get_node(cachep, node);
spin_lock(&n->list_lock);
if (n->shared) {
struct array_cache *shared_array = n->shared;
int max = shared_array->limit - shared_array->avail;
if (max) {
if (batchcount > max)
batchcount = max;
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
memcpy(&(shared_array->entry[shared_array->avail]),
ac->entry, sizeof(void *) * batchcount);
shared_array->avail += batchcount;
goto free_done;
}
}
free_block(cachep, ac->entry, batchcount, node, &list);
free_done:
#if STATS
{
int i = 0;
struct page *page;
list_for_each_entry(page, &n->slabs_free, lru) {
BUG_ON(page->active);
i++;
}
STATS_SET_FREEABLE(cachep, i);
}
#endif
spin_unlock(&n->list_lock);
slabs_destroy(cachep, &list);
ac->avail -= batchcount;
memmove(ac->entry, &(ac->entry[batchcount]), sizeof(void *)*ac->avail);
}
/*
* Release an obj back to its cache. If the obj has a constructed state, it must
* be in this state _before_ it is released. Called with disabled ints.
*/
static inline void __cache_free(struct kmem_cache *cachep, void *objp,
unsigned long caller)
{
struct array_cache *ac = cpu_cache_get(cachep);
check_irq_off();
kmemleak_free_recursive(objp, cachep->flags);
objp = cache_free_debugcheck(cachep, objp, caller);
kmemcheck_slab_free(cachep, objp, cachep->object_size);
/*
* Skip calling cache_free_alien() when the platform is not numa.
* This will avoid cache misses that happen while accessing slabp (which
* is per page memory reference) to get nodeid. Instead use a global
* variable to skip the call, which is mostly likely to be present in
* the cache.
*/
if (nr_online_nodes > 1 && cache_free_alien(cachep, objp))
return;
if (ac->avail < ac->limit) {
STATS_INC_FREEHIT(cachep);
} else {
STATS_INC_FREEMISS(cachep);
cache_flusharray(cachep, ac);
}
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:43:58 +08:00
ac_put_obj(cachep, ac, objp);
}
/**
* kmem_cache_alloc - Allocate an object
* @cachep: The cache to allocate from.
* @flags: See kmalloc().
*
* Allocate an object from this cache. The flags are only relevant
* if the cache has no available objects.
*/
void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
{
void *ret = slab_alloc(cachep, flags, _RET_IP_);
trace_kmem_cache_alloc(_RET_IP_, ret,
cachep->object_size, cachep->size, flags);
return ret;
}
EXPORT_SYMBOL(kmem_cache_alloc);
void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
{
__kmem_cache_free_bulk(s, size, p);
}
EXPORT_SYMBOL(kmem_cache_free_bulk);
int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
void **p)
{
return __kmem_cache_alloc_bulk(s, flags, size, p);
}
EXPORT_SYMBOL(kmem_cache_alloc_bulk);
#ifdef CONFIG_TRACING
void *
kmem_cache_alloc_trace(struct kmem_cache *cachep, gfp_t flags, size_t size)
{
void *ret;
ret = slab_alloc(cachep, flags, _RET_IP_);
trace_kmalloc(_RET_IP_, ret,
size, cachep->size, flags);
return ret;
}
EXPORT_SYMBOL(kmem_cache_alloc_trace);
#endif
#ifdef CONFIG_NUMA
/**
* kmem_cache_alloc_node - Allocate an object on the specified node
* @cachep: The cache to allocate from.
* @flags: See kmalloc().
* @nodeid: node number of the target node.
*
* Identical to kmem_cache_alloc but it will allocate memory on the given
* node, which can improve the performance for cpu bound structures.
*
* Fallback to other node is possible if __GFP_THISNODE is not set.
*/
void *kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)
{
void *ret = slab_alloc_node(cachep, flags, nodeid, _RET_IP_);
trace_kmem_cache_alloc_node(_RET_IP_, ret,
cachep->object_size, cachep->size,
flags, nodeid);
return ret;
}
EXPORT_SYMBOL(kmem_cache_alloc_node);
#ifdef CONFIG_TRACING
void *kmem_cache_alloc_node_trace(struct kmem_cache *cachep,
gfp_t flags,
int nodeid,
size_t size)
{
void *ret;
ret = slab_alloc_node(cachep, flags, nodeid, _RET_IP_);
trace_kmalloc_node(_RET_IP_, ret,
size, cachep->size,
flags, nodeid);
return ret;
}
EXPORT_SYMBOL(kmem_cache_alloc_node_trace);
#endif
static __always_inline void *
__do_kmalloc_node(size_t size, gfp_t flags, int node, unsigned long caller)
[PATCH] add kmalloc_node, inline cleanup The patch makes the following function calls available to allocate memory on a specific node without changing the basic operation of the slab allocator: kmem_cache_alloc_node(kmem_cache_t *cachep, unsigned int flags, int node); kmalloc_node(size_t size, unsigned int flags, int node); in a similar way to the existing node-blind functions: kmem_cache_alloc(kmem_cache_t *cachep, unsigned int flags); kmalloc(size, flags); kmem_cache_alloc_node was changed to pass flags and the node information through the existing layers of the slab allocator (which lead to some minor rearrangements). The functions at the lowest layer (kmem_getpages, cache_grow) are already node aware. Also __alloc_percpu can call kmalloc_node now. Performance measurements (using the pageset localization patch) yields: w/o patches: Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.97 Wed Mar 30 20:50:43 2005 100 25170.83 91 251.7083 23.12 150.10 Wed Mar 30 20:51:06 2005 200 34601.66 84 173.0083 33.64 294.14 Wed Mar 30 20:51:40 2005 300 37154.47 86 123.8482 46.99 436.56 Wed Mar 30 20:52:28 2005 400 39839.82 80 99.5995 58.43 580.46 Wed Mar 30 20:53:27 2005 500 40036.32 79 80.0726 72.68 728.60 Wed Mar 30 20:54:40 2005 600 44074.21 79 73.4570 79.23 872.10 Wed Mar 30 20:55:59 2005 700 44016.60 78 62.8809 92.56 1015.84 Wed Mar 30 20:57:32 2005 800 40411.05 80 50.5138 115.22 1161.13 Wed Mar 30 20:59:28 2005 900 42298.56 79 46.9984 123.83 1303.42 Wed Mar 30 21:01:33 2005 1000 40955.05 80 40.9551 142.11 1441.92 Wed Mar 30 21:03:55 2005 with pageset localization and slab API patches: Tasks jobs/min jti jobs/min/task real cpu 1 484.19 100 484.1930 12.02 1.98 Wed Mar 30 21:10:18 2005 100 27428.25 92 274.2825 21.22 149.79 Wed Mar 30 21:10:40 2005 200 37228.94 86 186.1447 31.27 293.49 Wed Mar 30 21:11:12 2005 300 41725.42 85 139.0847 41.84 434.10 Wed Mar 30 21:11:54 2005 400 43032.22 82 107.5805 54.10 582.06 Wed Mar 30 21:12:48 2005 500 42211.23 83 84.4225 68.94 722.61 Wed Mar 30 21:13:58 2005 600 40084.49 82 66.8075 87.12 873.11 Wed Mar 30 21:15:25 2005 700 44169.30 79 63.0990 92.24 1008.77 Wed Mar 30 21:16:58 2005 800 43097.94 79 53.8724 108.03 1155.88 Wed Mar 30 21:18:47 2005 900 41846.75 79 46.4964 125.17 1303.38 Wed Mar 30 21:20:52 2005 1000 40247.85 79 40.2478 144.60 1442.21 Wed Mar 30 21:23:17 2005 Signed-off-by: Christoph Lameter <christoph@lameter.com> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01 23:58:38 +08:00
{
struct kmem_cache *cachep;
[PATCH] add kmalloc_node, inline cleanup The patch makes the following function calls available to allocate memory on a specific node without changing the basic operation of the slab allocator: kmem_cache_alloc_node(kmem_cache_t *cachep, unsigned int flags, int node); kmalloc_node(size_t size, unsigned int flags, int node); in a similar way to the existing node-blind functions: kmem_cache_alloc(kmem_cache_t *cachep, unsigned int flags); kmalloc(size, flags); kmem_cache_alloc_node was changed to pass flags and the node information through the existing layers of the slab allocator (which lead to some minor rearrangements). The functions at the lowest layer (kmem_getpages, cache_grow) are already node aware. Also __alloc_percpu can call kmalloc_node now. Performance measurements (using the pageset localization patch) yields: w/o patches: Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.97 Wed Mar 30 20:50:43 2005 100 25170.83 91 251.7083 23.12 150.10 Wed Mar 30 20:51:06 2005 200 34601.66 84 173.0083 33.64 294.14 Wed Mar 30 20:51:40 2005 300 37154.47 86 123.8482 46.99 436.56 Wed Mar 30 20:52:28 2005 400 39839.82 80 99.5995 58.43 580.46 Wed Mar 30 20:53:27 2005 500 40036.32 79 80.0726 72.68 728.60 Wed Mar 30 20:54:40 2005 600 44074.21 79 73.4570 79.23 872.10 Wed Mar 30 20:55:59 2005 700 44016.60 78 62.8809 92.56 1015.84 Wed Mar 30 20:57:32 2005 800 40411.05 80 50.5138 115.22 1161.13 Wed Mar 30 20:59:28 2005 900 42298.56 79 46.9984 123.83 1303.42 Wed Mar 30 21:01:33 2005 1000 40955.05 80 40.9551 142.11 1441.92 Wed Mar 30 21:03:55 2005 with pageset localization and slab API patches: Tasks jobs/min jti jobs/min/task real cpu 1 484.19 100 484.1930 12.02 1.98 Wed Mar 30 21:10:18 2005 100 27428.25 92 274.2825 21.22 149.79 Wed Mar 30 21:10:40 2005 200 37228.94 86 186.1447 31.27 293.49 Wed Mar 30 21:11:12 2005 300 41725.42 85 139.0847 41.84 434.10 Wed Mar 30 21:11:54 2005 400 43032.22 82 107.5805 54.10 582.06 Wed Mar 30 21:12:48 2005 500 42211.23 83 84.4225 68.94 722.61 Wed Mar 30 21:13:58 2005 600 40084.49 82 66.8075 87.12 873.11 Wed Mar 30 21:15:25 2005 700 44169.30 79 63.0990 92.24 1008.77 Wed Mar 30 21:16:58 2005 800 43097.94 79 53.8724 108.03 1155.88 Wed Mar 30 21:18:47 2005 900 41846.75 79 46.4964 125.17 1303.38 Wed Mar 30 21:20:52 2005 1000 40247.85 79 40.2478 144.60 1442.21 Wed Mar 30 21:23:17 2005 Signed-off-by: Christoph Lameter <christoph@lameter.com> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01 23:58:38 +08:00
cachep = kmalloc_slab(size, flags);
if (unlikely(ZERO_OR_NULL_PTR(cachep)))
return cachep;
return kmem_cache_alloc_node_trace(cachep, flags, node, size);
[PATCH] add kmalloc_node, inline cleanup The patch makes the following function calls available to allocate memory on a specific node without changing the basic operation of the slab allocator: kmem_cache_alloc_node(kmem_cache_t *cachep, unsigned int flags, int node); kmalloc_node(size_t size, unsigned int flags, int node); in a similar way to the existing node-blind functions: kmem_cache_alloc(kmem_cache_t *cachep, unsigned int flags); kmalloc(size, flags); kmem_cache_alloc_node was changed to pass flags and the node information through the existing layers of the slab allocator (which lead to some minor rearrangements). The functions at the lowest layer (kmem_getpages, cache_grow) are already node aware. Also __alloc_percpu can call kmalloc_node now. Performance measurements (using the pageset localization patch) yields: w/o patches: Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.97 Wed Mar 30 20:50:43 2005 100 25170.83 91 251.7083 23.12 150.10 Wed Mar 30 20:51:06 2005 200 34601.66 84 173.0083 33.64 294.14 Wed Mar 30 20:51:40 2005 300 37154.47 86 123.8482 46.99 436.56 Wed Mar 30 20:52:28 2005 400 39839.82 80 99.5995 58.43 580.46 Wed Mar 30 20:53:27 2005 500 40036.32 79 80.0726 72.68 728.60 Wed Mar 30 20:54:40 2005 600 44074.21 79 73.4570 79.23 872.10 Wed Mar 30 20:55:59 2005 700 44016.60 78 62.8809 92.56 1015.84 Wed Mar 30 20:57:32 2005 800 40411.05 80 50.5138 115.22 1161.13 Wed Mar 30 20:59:28 2005 900 42298.56 79 46.9984 123.83 1303.42 Wed Mar 30 21:01:33 2005 1000 40955.05 80 40.9551 142.11 1441.92 Wed Mar 30 21:03:55 2005 with pageset localization and slab API patches: Tasks jobs/min jti jobs/min/task real cpu 1 484.19 100 484.1930 12.02 1.98 Wed Mar 30 21:10:18 2005 100 27428.25 92 274.2825 21.22 149.79 Wed Mar 30 21:10:40 2005 200 37228.94 86 186.1447 31.27 293.49 Wed Mar 30 21:11:12 2005 300 41725.42 85 139.0847 41.84 434.10 Wed Mar 30 21:11:54 2005 400 43032.22 82 107.5805 54.10 582.06 Wed Mar 30 21:12:48 2005 500 42211.23 83 84.4225 68.94 722.61 Wed Mar 30 21:13:58 2005 600 40084.49 82 66.8075 87.12 873.11 Wed Mar 30 21:15:25 2005 700 44169.30 79 63.0990 92.24 1008.77 Wed Mar 30 21:16:58 2005 800 43097.94 79 53.8724 108.03 1155.88 Wed Mar 30 21:18:47 2005 900 41846.75 79 46.4964 125.17 1303.38 Wed Mar 30 21:20:52 2005 1000 40247.85 79 40.2478 144.60 1442.21 Wed Mar 30 21:23:17 2005 Signed-off-by: Christoph Lameter <christoph@lameter.com> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01 23:58:38 +08:00
}
void *__kmalloc_node(size_t size, gfp_t flags, int node)
{
return __do_kmalloc_node(size, flags, node, _RET_IP_);
}
EXPORT_SYMBOL(__kmalloc_node);
void *__kmalloc_node_track_caller(size_t size, gfp_t flags,
int node, unsigned long caller)
{
return __do_kmalloc_node(size, flags, node, caller);
}
EXPORT_SYMBOL(__kmalloc_node_track_caller);
#endif /* CONFIG_NUMA */
/**
* __do_kmalloc - allocate memory
* @size: how many bytes of memory are required.
* @flags: the type of memory to allocate (see kmalloc).
* @caller: function caller for debug tracking of the caller
*/
static __always_inline void *__do_kmalloc(size_t size, gfp_t flags,
unsigned long caller)
{
struct kmem_cache *cachep;
void *ret;
cachep = kmalloc_slab(size, flags);
if (unlikely(ZERO_OR_NULL_PTR(cachep)))
return cachep;
ret = slab_alloc(cachep, flags, caller);
trace_kmalloc(caller, ret,
size, cachep->size, flags);
return ret;
}
void *__kmalloc(size_t size, gfp_t flags)
{
return __do_kmalloc(size, flags, _RET_IP_);
}
EXPORT_SYMBOL(__kmalloc);
void *__kmalloc_track_caller(size_t size, gfp_t flags, unsigned long caller)
{
return __do_kmalloc(size, flags, caller);
}
EXPORT_SYMBOL(__kmalloc_track_caller);
/**
* kmem_cache_free - Deallocate an object
* @cachep: The cache the allocation was from.
* @objp: The previously allocated object.
*
* Free an object which was previously allocated from this
* cache.
*/
void kmem_cache_free(struct kmem_cache *cachep, void *objp)
{
unsigned long flags;
cachep = cache_from_obj(cachep, objp);
if (!cachep)
return;
local_irq_save(flags);
debug_check_no_locks_freed(objp, cachep->object_size);
infrastructure to debug (dynamic) objects We can see an ever repeating problem pattern with objects of any kind in the kernel: 1) freeing of active objects 2) reinitialization of active objects Both problems can be hard to debug because the crash happens at a point where we have no chance to decode the root cause anymore. One problem spot are kernel timers, where the detection of the problem often happens in interrupt context and usually causes the machine to panic. While working on a timer related bug report I had to hack specialized code into the timer subsystem to get a reasonable hint for the root cause. This debug hack was fine for temporary use, but far from a mergeable solution due to the intrusiveness into the timer code. The code further lacked the ability to detect and report the root cause instantly and keep the system operational. Keeping the system operational is important to get hold of the debug information without special debugging aids like serial consoles and special knowledge of the bug reporter. The problems described above are not restricted to timers, but timers tend to expose it usually in a full system crash. Other objects are less explosive, but the symptoms caused by such mistakes can be even harder to debug. Instead of creating specialized debugging code for the timer subsystem a generic infrastructure is created which allows developers to verify their code and provides an easy to enable debug facility for users in case of trouble. The debugobjects core code keeps track of operations on static and dynamic objects by inserting them into a hashed list and sanity checking them on object operations and provides additional checks whenever kernel memory is freed. The tracked object operations are: - initializing an object - adding an object to a subsystem list - deleting an object from a subsystem list Each operation is sanity checked before the operation is executed and the subsystem specific code can provide a fixup function which allows to prevent the damage of the operation. When the sanity check triggers a warning message and a stack trace is printed. The list of operations can be extended if the need arises. For now it's limited to the requirements of the first user (timers). The core code enqueues the objects into hash buckets. The hash index is generated from the address of the object to simplify the lookup for the check on kfree/vfree. Each bucket has it's own spinlock to avoid contention on a global lock. The debug code can be compiled in without being active. The runtime overhead is minimal and could be optimized by asm alternatives. A kernel command line option enables the debugging code. Thanks to Ingo Molnar for review, suggestions and cleanup patches. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Greg KH <greg@kroah.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 15:55:01 +08:00
if (!(cachep->flags & SLAB_DEBUG_OBJECTS))
debug_check_no_obj_freed(objp, cachep->object_size);
__cache_free(cachep, objp, _RET_IP_);
local_irq_restore(flags);
trace_kmem_cache_free(_RET_IP_, objp);
}
EXPORT_SYMBOL(kmem_cache_free);
/**
* kfree - free previously allocated memory
* @objp: pointer returned by kmalloc.
*
* If @objp is NULL, no operation is performed.
*
* Don't free memory not originally allocated by kmalloc()
* or you will run into trouble.
*/
void kfree(const void *objp)
{
struct kmem_cache *c;
unsigned long flags;
trace_kfree(_RET_IP_, objp);
if (unlikely(ZERO_OR_NULL_PTR(objp)))
return;
local_irq_save(flags);
kfree_debugcheck(objp);
c = virt_to_cache(objp);
debug_check_no_locks_freed(objp, c->object_size);
debug_check_no_obj_freed(objp, c->object_size);
__cache_free(c, (void *)objp, _RET_IP_);
local_irq_restore(flags);
}
EXPORT_SYMBOL(kfree);
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
/*
* This initializes kmem_cache_node or resizes various caches for all nodes.
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
*/
static int alloc_kmem_cache_node(struct kmem_cache *cachep, gfp_t gfp)
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
{
int node;
struct kmem_cache_node *n;
struct array_cache *new_shared;
struct alien_cache **new_alien = NULL;
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
for_each_online_node(node) {
if (use_alien_caches) {
new_alien = alloc_alien_cache(node, cachep->limit, gfp);
if (!new_alien)
goto fail;
}
new_shared = NULL;
if (cachep->shared) {
new_shared = alloc_arraycache(node,
cachep->shared*cachep->batchcount,
slab: setup allocators earlier in the boot sequence This patch makes kmalloc() available earlier in the boot sequence so we can get rid of some bootmem allocations. The bulk of the changes are due to kmem_cache_init() being called with interrupts disabled which requires some changes to allocator boostrap code. Note: 32-bit x86 does WP protect test in mem_init() so we must setup traps before we call mem_init() during boot as reported by Ingo Molnar: We have a hard crash in the WP-protect code: [ 0.000000] Checking if this processor honours the WP bit even in supervisor mode...BUG: Int 14: CR2 ffcff000 [ 0.000000] EDI 00000188 ESI 00000ac7 EBP c17eaf9c ESP c17eaf8c [ 0.000000] EBX 000014e0 EDX 0000000e ECX 01856067 EAX 00000001 [ 0.000000] err 00000003 EIP c10135b1 CS 00000060 flg 00010002 [ 0.000000] Stack: c17eafa8 c17fd410 c16747bc c17eafc4 c17fd7e5 000011fd f8616000 c18237cc [ 0.000000] 00099800 c17bb000 c17eafec c17f1668 000001c5 c17f1322 c166e039 c1822bf0 [ 0.000000] c166e033 c153a014 c18237cc 00020800 c17eaff8 c17f106a 00020800 01ba5003 [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip-02161-g7a74539-dirty #52203 [ 0.000000] Call Trace: [ 0.000000] [<c15357c2>] ? printk+0x14/0x16 [ 0.000000] [<c10135b1>] ? do_test_wp_bit+0x19/0x23 [ 0.000000] [<c17fd410>] ? test_wp_bit+0x26/0x64 [ 0.000000] [<c17fd7e5>] ? mem_init+0x1ba/0x1d8 [ 0.000000] [<c17f1668>] ? start_kernel+0x164/0x2f7 [ 0.000000] [<c17f1322>] ? unknown_bootoption+0x0/0x19c [ 0.000000] [<c17f106a>] ? __init_begin+0x6a/0x6f Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matt Mackall <mpm@selenic.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11 00:40:04 +08:00
0xbaadf00d, gfp);
if (!new_shared) {
free_alien_cache(new_alien);
goto fail;
}
}
n = get_node(cachep, node);
if (n) {
struct array_cache *shared = n->shared;
LIST_HEAD(list);
spin_lock_irq(&n->list_lock);
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
if (shared)
free_block(cachep, shared->entry,
shared->avail, node, &list);
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
n->shared = new_shared;
if (!n->alien) {
n->alien = new_alien;
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
new_alien = NULL;
}
n->free_limit = (1 + nr_cpus_node(node)) *
cachep->batchcount + cachep->num;
spin_unlock_irq(&n->list_lock);
slabs_destroy(cachep, &list);
kfree(shared);
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
free_alien_cache(new_alien);
continue;
}
n = kmalloc_node(sizeof(struct kmem_cache_node), gfp, node);
if (!n) {
free_alien_cache(new_alien);
kfree(new_shared);
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
goto fail;
}
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
kmem_cache_node_init(n);
n->next_reap = jiffies + REAPTIMEOUT_NODE +
((unsigned long)cachep) % REAPTIMEOUT_NODE;
n->shared = new_shared;
n->alien = new_alien;
n->free_limit = (1 + nr_cpus_node(node)) *
cachep->batchcount + cachep->num;
cachep->node[node] = n;
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
}
return 0;
fail:
if (!cachep->list.next) {
/* Cache is not active yet. Roll back what we did */
node--;
while (node >= 0) {
n = get_node(cachep, node);
if (n) {
kfree(n->shared);
free_alien_cache(n->alien);
kfree(n);
cachep->node[node] = NULL;
}
node--;
}
}
return -ENOMEM;
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
}
/* Always called with the slab_mutex held */
slab: propagate tunable values SLAB allows us to tune a particular cache behavior with tunables. When creating a new memcg cache copy, we'd like to preserve any tunables the parent cache already had. This could be done by an explicit call to do_tune_cpucache() after the cache is created. But this is not very convenient now that the caches are created from common code, since this function is SLAB-specific. Another method of doing that is taking advantage of the fact that do_tune_cpucache() is always called from enable_cpucache(), which is called at cache initialization. We can just preset the values, and then things work as expected. It can also happen that a root cache has its tunables updated during normal system operation. In this case, we will propagate the change to all caches that are already active. This change will require us to move the assignment of root_cache in memcg_params a bit earlier. We need this to be already set - which memcg_kmem_register_cache will do - when we reach __kmem_cache_create() Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-19 06:23:03 +08:00
static int __do_tune_cpucache(struct kmem_cache *cachep, int limit,
slab: setup allocators earlier in the boot sequence This patch makes kmalloc() available earlier in the boot sequence so we can get rid of some bootmem allocations. The bulk of the changes are due to kmem_cache_init() being called with interrupts disabled which requires some changes to allocator boostrap code. Note: 32-bit x86 does WP protect test in mem_init() so we must setup traps before we call mem_init() during boot as reported by Ingo Molnar: We have a hard crash in the WP-protect code: [ 0.000000] Checking if this processor honours the WP bit even in supervisor mode...BUG: Int 14: CR2 ffcff000 [ 0.000000] EDI 00000188 ESI 00000ac7 EBP c17eaf9c ESP c17eaf8c [ 0.000000] EBX 000014e0 EDX 0000000e ECX 01856067 EAX 00000001 [ 0.000000] err 00000003 EIP c10135b1 CS 00000060 flg 00010002 [ 0.000000] Stack: c17eafa8 c17fd410 c16747bc c17eafc4 c17fd7e5 000011fd f8616000 c18237cc [ 0.000000] 00099800 c17bb000 c17eafec c17f1668 000001c5 c17f1322 c166e039 c1822bf0 [ 0.000000] c166e033 c153a014 c18237cc 00020800 c17eaff8 c17f106a 00020800 01ba5003 [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip-02161-g7a74539-dirty #52203 [ 0.000000] Call Trace: [ 0.000000] [<c15357c2>] ? printk+0x14/0x16 [ 0.000000] [<c10135b1>] ? do_test_wp_bit+0x19/0x23 [ 0.000000] [<c17fd410>] ? test_wp_bit+0x26/0x64 [ 0.000000] [<c17fd7e5>] ? mem_init+0x1ba/0x1d8 [ 0.000000] [<c17f1668>] ? start_kernel+0x164/0x2f7 [ 0.000000] [<c17f1322>] ? unknown_bootoption+0x0/0x19c [ 0.000000] [<c17f106a>] ? __init_begin+0x6a/0x6f Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matt Mackall <mpm@selenic.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11 00:40:04 +08:00
int batchcount, int shared, gfp_t gfp)
{
struct array_cache __percpu *cpu_cache, *prev;
int cpu;
cpu_cache = alloc_kmem_cache_cpus(cachep, limit, batchcount);
if (!cpu_cache)
return -ENOMEM;
prev = cachep->cpu_cache;
cachep->cpu_cache = cpu_cache;
kick_all_cpus_sync();
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
check_irq_on();
cachep->batchcount = batchcount;
cachep->limit = limit;
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
cachep->shared = shared;
if (!prev)
goto alloc_node;
for_each_online_cpu(cpu) {
LIST_HEAD(list);
int node;
struct kmem_cache_node *n;
struct array_cache *ac = per_cpu_ptr(prev, cpu);
node = cpu_to_mem(cpu);
n = get_node(cachep, node);
spin_lock_irq(&n->list_lock);
free_block(cachep, ac->entry, ac->avail, node, &list);
spin_unlock_irq(&n->list_lock);
slabs_destroy(cachep, &list);
}
free_percpu(prev);
alloc_node:
return alloc_kmem_cache_node(cachep, gfp);
}
slab: propagate tunable values SLAB allows us to tune a particular cache behavior with tunables. When creating a new memcg cache copy, we'd like to preserve any tunables the parent cache already had. This could be done by an explicit call to do_tune_cpucache() after the cache is created. But this is not very convenient now that the caches are created from common code, since this function is SLAB-specific. Another method of doing that is taking advantage of the fact that do_tune_cpucache() is always called from enable_cpucache(), which is called at cache initialization. We can just preset the values, and then things work as expected. It can also happen that a root cache has its tunables updated during normal system operation. In this case, we will propagate the change to all caches that are already active. This change will require us to move the assignment of root_cache in memcg_params a bit earlier. We need this to be already set - which memcg_kmem_register_cache will do - when we reach __kmem_cache_create() Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-19 06:23:03 +08:00
static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
int batchcount, int shared, gfp_t gfp)
{
int ret;
struct kmem_cache *c;
slab: propagate tunable values SLAB allows us to tune a particular cache behavior with tunables. When creating a new memcg cache copy, we'd like to preserve any tunables the parent cache already had. This could be done by an explicit call to do_tune_cpucache() after the cache is created. But this is not very convenient now that the caches are created from common code, since this function is SLAB-specific. Another method of doing that is taking advantage of the fact that do_tune_cpucache() is always called from enable_cpucache(), which is called at cache initialization. We can just preset the values, and then things work as expected. It can also happen that a root cache has its tunables updated during normal system operation. In this case, we will propagate the change to all caches that are already active. This change will require us to move the assignment of root_cache in memcg_params a bit earlier. We need this to be already set - which memcg_kmem_register_cache will do - when we reach __kmem_cache_create() Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-19 06:23:03 +08:00
ret = __do_tune_cpucache(cachep, limit, batchcount, shared, gfp);
if (slab_state < FULL)
return ret;
if ((ret < 0) || !is_root_cache(cachep))
return ret;
lockdep_assert_held(&slab_mutex);
for_each_memcg_cache(c, cachep) {
/* return value determined by the root cache only */
__do_tune_cpucache(c, limit, batchcount, shared, gfp);
slab: propagate tunable values SLAB allows us to tune a particular cache behavior with tunables. When creating a new memcg cache copy, we'd like to preserve any tunables the parent cache already had. This could be done by an explicit call to do_tune_cpucache() after the cache is created. But this is not very convenient now that the caches are created from common code, since this function is SLAB-specific. Another method of doing that is taking advantage of the fact that do_tune_cpucache() is always called from enable_cpucache(), which is called at cache initialization. We can just preset the values, and then things work as expected. It can also happen that a root cache has its tunables updated during normal system operation. In this case, we will propagate the change to all caches that are already active. This change will require us to move the assignment of root_cache in memcg_params a bit earlier. We need this to be already set - which memcg_kmem_register_cache will do - when we reach __kmem_cache_create() Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-19 06:23:03 +08:00
}
return ret;
}
/* Called with slab_mutex held always */
slab: setup allocators earlier in the boot sequence This patch makes kmalloc() available earlier in the boot sequence so we can get rid of some bootmem allocations. The bulk of the changes are due to kmem_cache_init() being called with interrupts disabled which requires some changes to allocator boostrap code. Note: 32-bit x86 does WP protect test in mem_init() so we must setup traps before we call mem_init() during boot as reported by Ingo Molnar: We have a hard crash in the WP-protect code: [ 0.000000] Checking if this processor honours the WP bit even in supervisor mode...BUG: Int 14: CR2 ffcff000 [ 0.000000] EDI 00000188 ESI 00000ac7 EBP c17eaf9c ESP c17eaf8c [ 0.000000] EBX 000014e0 EDX 0000000e ECX 01856067 EAX 00000001 [ 0.000000] err 00000003 EIP c10135b1 CS 00000060 flg 00010002 [ 0.000000] Stack: c17eafa8 c17fd410 c16747bc c17eafc4 c17fd7e5 000011fd f8616000 c18237cc [ 0.000000] 00099800 c17bb000 c17eafec c17f1668 000001c5 c17f1322 c166e039 c1822bf0 [ 0.000000] c166e033 c153a014 c18237cc 00020800 c17eaff8 c17f106a 00020800 01ba5003 [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip-02161-g7a74539-dirty #52203 [ 0.000000] Call Trace: [ 0.000000] [<c15357c2>] ? printk+0x14/0x16 [ 0.000000] [<c10135b1>] ? do_test_wp_bit+0x19/0x23 [ 0.000000] [<c17fd410>] ? test_wp_bit+0x26/0x64 [ 0.000000] [<c17fd7e5>] ? mem_init+0x1ba/0x1d8 [ 0.000000] [<c17f1668>] ? start_kernel+0x164/0x2f7 [ 0.000000] [<c17f1322>] ? unknown_bootoption+0x0/0x19c [ 0.000000] [<c17f106a>] ? __init_begin+0x6a/0x6f Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matt Mackall <mpm@selenic.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11 00:40:04 +08:00
static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp)
{
int err;
slab: propagate tunable values SLAB allows us to tune a particular cache behavior with tunables. When creating a new memcg cache copy, we'd like to preserve any tunables the parent cache already had. This could be done by an explicit call to do_tune_cpucache() after the cache is created. But this is not very convenient now that the caches are created from common code, since this function is SLAB-specific. Another method of doing that is taking advantage of the fact that do_tune_cpucache() is always called from enable_cpucache(), which is called at cache initialization. We can just preset the values, and then things work as expected. It can also happen that a root cache has its tunables updated during normal system operation. In this case, we will propagate the change to all caches that are already active. This change will require us to move the assignment of root_cache in memcg_params a bit earlier. We need this to be already set - which memcg_kmem_register_cache will do - when we reach __kmem_cache_create() Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-19 06:23:03 +08:00
int limit = 0;
int shared = 0;
int batchcount = 0;
if (!is_root_cache(cachep)) {
struct kmem_cache *root = memcg_root_cache(cachep);
limit = root->limit;
shared = root->shared;
batchcount = root->batchcount;
}
slab: propagate tunable values SLAB allows us to tune a particular cache behavior with tunables. When creating a new memcg cache copy, we'd like to preserve any tunables the parent cache already had. This could be done by an explicit call to do_tune_cpucache() after the cache is created. But this is not very convenient now that the caches are created from common code, since this function is SLAB-specific. Another method of doing that is taking advantage of the fact that do_tune_cpucache() is always called from enable_cpucache(), which is called at cache initialization. We can just preset the values, and then things work as expected. It can also happen that a root cache has its tunables updated during normal system operation. In this case, we will propagate the change to all caches that are already active. This change will require us to move the assignment of root_cache in memcg_params a bit earlier. We need this to be already set - which memcg_kmem_register_cache will do - when we reach __kmem_cache_create() Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-19 06:23:03 +08:00
if (limit && shared && batchcount)
goto skip_setup;
/*
* The head array serves three purposes:
* - create a LIFO ordering, i.e. return objects that are cache-warm
* - reduce the number of spinlock operations.
* - reduce the number of linked list operations on the slab and
* bufctl chains: array operations are cheaper.
* The numbers are guessed, we should auto-tune as described by
* Bonwick.
*/
if (cachep->size > 131072)
limit = 1;
else if (cachep->size > PAGE_SIZE)
limit = 8;
else if (cachep->size > 1024)
limit = 24;
else if (cachep->size > 256)
limit = 54;
else
limit = 120;
/*
* CPU bound tasks (e.g. network routing) can exhibit cpu bound
* allocation behaviour: Most allocs on one cpu, most free operations
* on another cpu. For these cases, an efficient object passing between
* cpus is necessary. This is provided by a shared array. The array
* replaces Bonwick's magazine layer.
* On uniprocessor, it's functionally equivalent (but less efficient)
* to a larger limit. Thus disabled by default.
*/
shared = 0;
if (cachep->size <= PAGE_SIZE && num_possible_cpus() > 1)
shared = 8;
#if DEBUG
/*
* With debugging enabled, large batchcount lead to excessively long
* periods with disabled local interrupts. Limit the batchcount
*/
if (limit > 32)
limit = 32;
#endif
slab: propagate tunable values SLAB allows us to tune a particular cache behavior with tunables. When creating a new memcg cache copy, we'd like to preserve any tunables the parent cache already had. This could be done by an explicit call to do_tune_cpucache() after the cache is created. But this is not very convenient now that the caches are created from common code, since this function is SLAB-specific. Another method of doing that is taking advantage of the fact that do_tune_cpucache() is always called from enable_cpucache(), which is called at cache initialization. We can just preset the values, and then things work as expected. It can also happen that a root cache has its tunables updated during normal system operation. In this case, we will propagate the change to all caches that are already active. This change will require us to move the assignment of root_cache in memcg_params a bit earlier. We need this to be already set - which memcg_kmem_register_cache will do - when we reach __kmem_cache_create() Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-19 06:23:03 +08:00
batchcount = (limit + 1) / 2;
skip_setup:
err = do_tune_cpucache(cachep, limit, batchcount, shared, gfp);
if (err)
printk(KERN_ERR "enable_cpucache failed for %s, error %d.\n",
cachep->name, -err);
return err;
}
/*
* Drain an array if it contains any elements taking the node lock only if
* necessary. Note that the node listlock also protects the array_cache
* if drain_array() is used on the shared array.
*/
static void drain_array(struct kmem_cache *cachep, struct kmem_cache_node *n,
struct array_cache *ac, int force, int node)
{
LIST_HEAD(list);
int tofree;
if (!ac || !ac->avail)
return;
if (ac->touched && !force) {
ac->touched = 0;
} else {
spin_lock_irq(&n->list_lock);
if (ac->avail) {
tofree = force ? ac->avail : (ac->limit + 4) / 5;
if (tofree > ac->avail)
tofree = (ac->avail + 1) / 2;
free_block(cachep, ac->entry, tofree, node, &list);
ac->avail -= tofree;
memmove(ac->entry, &(ac->entry[tofree]),
sizeof(void *) * ac->avail);
}
spin_unlock_irq(&n->list_lock);
slabs_destroy(cachep, &list);
}
}
/**
* cache_reap - Reclaim memory from caches.
* @w: work descriptor
*
* Called from workqueue/eventd every few seconds.
* Purpose:
* - clear the per-cpu caches for this CPU.
* - return freeable pages to the main free memory pool.
*
* If we cannot acquire the cache chain mutex then just give up - we'll try
* again on the next iteration.
*/
static void cache_reap(struct work_struct *w)
{
struct kmem_cache *searchp;
struct kmem_cache_node *n;
numa: slab: use numa_mem_id() for slab local memory node Example usage of generic "numa_mem_id()": The mainline slab code, since ~ 2.6.19, does not handle memoryless nodes well. Specifically, the "fast path"--____cache_alloc()--will never succeed as slab doesn't cache offnode object on the per cpu queues, and for memoryless nodes, all memory will be "off node" relative to numa_node_id(). This adds significant overhead to all kmem cache allocations, incurring a significant regression relative to earlier kernels [from before slab.c was reorganized]. This patch uses the generic topology function "numa_mem_id()" to return the "effective local memory node" for the calling context. This is the first node in the local node's generic fallback zonelist-- the same node that "local" mempolicy-based allocations would use. This lets slab cache these "local" allocations and avoid fallback/refill on every allocation. N.B.: Slab will need to handle node and memory hotplug events that could change the value returned by numa_mem_id() for any given node if recent changes to address memory hotplug don't already address this. E.g., flush all per cpu slab queues before rebuilding the zonelists while the "machine" is held in the stopped state. Performance impact on "hackbench 400 process 200" 2.6.34-rc3-mmotm-100405-1609 no-patch this-patch ia64 no memoryless nodes [avg of 10]: 11.713 11.637 ~0.65 diff ia64 cpus all on memless nodes [10]: 228.259 26.484 ~8.6x speedup The slowdown of the patched kernel from ~12 sec to ~28 seconds when configured with memoryless nodes is the result of all cpus allocating from a single node's mm pagepool. The cache lines of the single node are distributed/interleaved over the memory of the real physical nodes, but the zone lock, list heads, ... of the single node with memory still each live in a single cache line that is accessed from all processors. x86_64 [8x6 AMD] [avg of 40]: 2.883 2.845 Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 05:45:03 +08:00
int node = numa_mem_id();
struct delayed_work *work = to_delayed_work(w);
if (!mutex_trylock(&slab_mutex))
/* Give up. Setup the next iteration. */
goto out;
list_for_each_entry(searchp, &slab_caches, list) {
check_irq_on();
/*
* We only take the node lock if absolutely necessary and we
* have established with reasonable certainty that
* we can do some work if the lock was obtained.
*/
n = get_node(searchp, node);
reap_alien(searchp, n);
drain_array(searchp, n, cpu_cache_get(searchp), 0, node);
/*
* These are racy checks but it does not matter
* if we skip one check or scan twice.
*/
if (time_after(n->next_reap, jiffies))
goto next;
n->next_reap = jiffies + REAPTIMEOUT_NODE;
drain_array(searchp, n, n->shared, 0, node);
if (n->free_touched)
n->free_touched = 0;
else {
int freed;
freed = drain_freelist(searchp, n, (n->free_limit +
5 * searchp->num - 1) / (5 * searchp->num));
STATS_ADD_REAPED(searchp, freed);
}
next:
cond_resched();
}
check_irq_on();
mutex_unlock(&slab_mutex);
[PATCH] slab: Node rotor for freeing alien caches and remote per cpu pages. The cache reaper currently tries to free all alien caches and all remote per cpu pages in each pass of cache_reap. For a machines with large number of nodes (such as Altix) this may lead to sporadic delays of around ~10ms. Interrupts are disabled while reclaiming creating unacceptable delays. This patch changes that behavior by adding a per cpu reap_node variable. Instead of attempting to free all caches, we free only one alien cache and the per cpu pages from one remote node. That reduces the time spend in cache_reap. However, doing so will lengthen the time it takes to completely drain all remote per cpu pagesets and all alien caches. The time needed will grow with the number of nodes in the system. All caches are drained when they overflow their respective capacity. So the drawback here is only that a bit of memory may be wasted for awhile longer. Details: 1. Rename drain_remote_pages to drain_node_pages to allow the specification of the node to drain of pcp pages. 2. Add additional functions init_reap_node, next_reap_node for NUMA that manage a per cpu reap_node counter. 3. Add a reap_alien function that reaps only from the current reap_node. For us this seems to be a critical issue. Holdoffs of an average of ~7ms cause some HPC benchmarks to slow down significantly. F.e. NAS parallel slows down dramatically. NAS parallel has a 12-16 seconds runtime w/o rotor compared to 5.8 secs with the rotor patches. It gets down to 5.05 secs with the additional interrupt holdoff reductions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-10 09:33:54 +08:00
next_reap_node();
out:
/* Set up the next iteration */
schedule_delayed_work(work, round_jiffies_relative(REAPTIMEOUT_AC));
}
#ifdef CONFIG_SLABINFO
void get_slabinfo(struct kmem_cache *cachep, struct slabinfo *sinfo)
{
struct page *page;
unsigned long active_objs;
unsigned long num_objs;
unsigned long active_slabs = 0;
unsigned long num_slabs, free_objects = 0, shared_avail = 0;
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
const char *name;
char *error = NULL;
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
int node;
struct kmem_cache_node *n;
active_objs = 0;
num_slabs = 0;
for_each_kmem_cache_node(cachep, node, n) {
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
check_irq_on();
spin_lock_irq(&n->list_lock);
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
list_for_each_entry(page, &n->slabs_full, lru) {
if (page->active != cachep->num && !error)
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
error = "slabs_full accounting error";
active_objs += cachep->num;
active_slabs++;
}
list_for_each_entry(page, &n->slabs_partial, lru) {
if (page->active == cachep->num && !error)
error = "slabs_partial accounting error";
if (!page->active && !error)
error = "slabs_partial accounting error";
active_objs += page->active;
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
active_slabs++;
}
list_for_each_entry(page, &n->slabs_free, lru) {
if (page->active && !error)
error = "slabs_free accounting error";
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
num_slabs++;
}
free_objects += n->free_objects;
if (n->shared)
shared_avail += n->shared->avail;
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
spin_unlock_irq(&n->list_lock);
}
num_slabs += active_slabs;
num_objs = num_slabs * cachep->num;
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
if (num_objs - active_objs != free_objects && !error)
error = "free_objects accounting error";
name = cachep->name;
if (error)
printk(KERN_ERR "slab: cache %s error: %s\n", name, error);
sinfo->active_objs = active_objs;
sinfo->num_objs = num_objs;
sinfo->active_slabs = active_slabs;
sinfo->num_slabs = num_slabs;
sinfo->shared_avail = shared_avail;
sinfo->limit = cachep->limit;
sinfo->batchcount = cachep->batchcount;
sinfo->shared = cachep->shared;
sinfo->objects_per_slab = cachep->num;
sinfo->cache_order = cachep->gfporder;
}
void slabinfo_show_stats(struct seq_file *m, struct kmem_cache *cachep)
{
#if STATS
{ /* node stats */
unsigned long high = cachep->high_mark;
unsigned long allocs = cachep->num_allocations;
unsigned long grown = cachep->grown;
unsigned long reaped = cachep->reaped;
unsigned long errors = cachep->errors;
unsigned long max_freeable = cachep->max_freeable;
unsigned long node_allocs = cachep->node_allocs;
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
unsigned long node_frees = cachep->node_frees;
unsigned long overflows = cachep->node_overflow;
seq_printf(m, " : globalstat %7lu %6lu %5lu %4lu "
"%4lu %4lu %4lu %4lu %4lu",
allocs, high, grown,
reaped, errors, max_freeable, node_allocs,
node_frees, overflows);
}
/* cpu stats */
{
unsigned long allochit = atomic_read(&cachep->allochit);
unsigned long allocmiss = atomic_read(&cachep->allocmiss);
unsigned long freehit = atomic_read(&cachep->freehit);
unsigned long freemiss = atomic_read(&cachep->freemiss);
seq_printf(m, " : cpustat %6lu %6lu %6lu %6lu",
allochit, allocmiss, freehit, freemiss);
}
#endif
}
#define MAX_SLABINFO_WRITE 128
/**
* slabinfo_write - Tuning for the slab allocator
* @file: unused
* @buffer: user buffer
* @count: data length
* @ppos: unused
*/
ssize_t slabinfo_write(struct file *file, const char __user *buffer,
size_t count, loff_t *ppos)
{
char kbuf[MAX_SLABINFO_WRITE + 1], *tmp;
int limit, batchcount, shared, res;
struct kmem_cache *cachep;
if (count > MAX_SLABINFO_WRITE)
return -EINVAL;
if (copy_from_user(&kbuf, buffer, count))
return -EFAULT;
kbuf[MAX_SLABINFO_WRITE] = '\0';
tmp = strchr(kbuf, ' ');
if (!tmp)
return -EINVAL;
*tmp = '\0';
tmp++;
if (sscanf(tmp, " %d %d %d", &limit, &batchcount, &shared) != 3)
return -EINVAL;
/* Find the cache in the chain of caches. */
mutex_lock(&slab_mutex);
res = -EINVAL;
list_for_each_entry(cachep, &slab_caches, list) {
if (!strcmp(cachep->name, kbuf)) {
if (limit < 1 || batchcount < 1 ||
batchcount > limit || shared < 0) {
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
res = 0;
} else {
[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 04:03:32 +08:00
res = do_tune_cpucache(cachep, limit,
slab: setup allocators earlier in the boot sequence This patch makes kmalloc() available earlier in the boot sequence so we can get rid of some bootmem allocations. The bulk of the changes are due to kmem_cache_init() being called with interrupts disabled which requires some changes to allocator boostrap code. Note: 32-bit x86 does WP protect test in mem_init() so we must setup traps before we call mem_init() during boot as reported by Ingo Molnar: We have a hard crash in the WP-protect code: [ 0.000000] Checking if this processor honours the WP bit even in supervisor mode...BUG: Int 14: CR2 ffcff000 [ 0.000000] EDI 00000188 ESI 00000ac7 EBP c17eaf9c ESP c17eaf8c [ 0.000000] EBX 000014e0 EDX 0000000e ECX 01856067 EAX 00000001 [ 0.000000] err 00000003 EIP c10135b1 CS 00000060 flg 00010002 [ 0.000000] Stack: c17eafa8 c17fd410 c16747bc c17eafc4 c17fd7e5 000011fd f8616000 c18237cc [ 0.000000] 00099800 c17bb000 c17eafec c17f1668 000001c5 c17f1322 c166e039 c1822bf0 [ 0.000000] c166e033 c153a014 c18237cc 00020800 c17eaff8 c17f106a 00020800 01ba5003 [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip-02161-g7a74539-dirty #52203 [ 0.000000] Call Trace: [ 0.000000] [<c15357c2>] ? printk+0x14/0x16 [ 0.000000] [<c10135b1>] ? do_test_wp_bit+0x19/0x23 [ 0.000000] [<c17fd410>] ? test_wp_bit+0x26/0x64 [ 0.000000] [<c17fd7e5>] ? mem_init+0x1ba/0x1d8 [ 0.000000] [<c17f1668>] ? start_kernel+0x164/0x2f7 [ 0.000000] [<c17f1322>] ? unknown_bootoption+0x0/0x19c [ 0.000000] [<c17f106a>] ? __init_begin+0x6a/0x6f Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matt Mackall <mpm@selenic.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11 00:40:04 +08:00
batchcount, shared,
GFP_KERNEL);
}
break;
}
}
mutex_unlock(&slab_mutex);
if (res >= 0)
res = count;
return res;
}
[PATCH] slab: implement /proc/slab_allocators Implement /proc/slab_allocators. It produces output like: idr_layer_cache: 80 idr_pre_get+0x33/0x4e buffer_head: 2555 alloc_buffer_head+0x20/0x75 mm_struct: 9 mm_alloc+0x1e/0x42 mm_struct: 20 dup_mm+0x36/0x370 vm_area_struct: 384 dup_mm+0x18f/0x370 vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3 vm_area_struct: 1 split_vma+0x5a/0x10e vm_area_struct: 11 do_brk+0x206/0x2e2 vm_area_struct: 2 copy_vma+0xda/0x142 vm_area_struct: 9 setup_arg_pages+0x99/0x214 fs_cache: 8 copy_fs_struct+0x21/0x133 fs_cache: 29 copy_process+0xf38/0x10e3 files_cache: 30 alloc_files+0x1b/0xcf signal_cache: 81 copy_process+0xbaa/0x10e3 sighand_cache: 77 copy_process+0xe65/0x10e3 sighand_cache: 1 de_thread+0x4d/0x5f8 anon_vma: 241 anon_vma_prepare+0xd9/0xf3 size-2048: 1 add_sect_attrs+0x5f/0x145 size-2048: 2 journal_init_revoke+0x99/0x302 size-2048: 2 journal_init_revoke+0x137/0x302 size-2048: 2 journal_init_inode+0xf9/0x1c4 Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> DESC slab-leaks3-locking-fix EDESC From: Andrew Morton <akpm@osdl.org> Update for slab-remove-cachep-spinlock.patch Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 19:06:39 +08:00
#ifdef CONFIG_DEBUG_SLAB_LEAK
static inline int add_caller(unsigned long *n, unsigned long v)
{
unsigned long *p;
int l;
if (!v)
return 1;
l = n[1];
p = n + 2;
while (l) {
int i = l/2;
unsigned long *q = p + 2 * i;
if (*q == v) {
q[1]++;
return 1;
}
if (*q > v) {
l = i;
} else {
p = q + 2;
l -= i + 1;
}
}
if (++n[1] == n[0])
return 0;
memmove(p + 2, p, n[1] * 2 * sizeof(unsigned long) - ((void *)p - (void *)n));
p[0] = v;
p[1] = 1;
return 1;
}
static void handle_slab(unsigned long *n, struct kmem_cache *c,
struct page *page)
[PATCH] slab: implement /proc/slab_allocators Implement /proc/slab_allocators. It produces output like: idr_layer_cache: 80 idr_pre_get+0x33/0x4e buffer_head: 2555 alloc_buffer_head+0x20/0x75 mm_struct: 9 mm_alloc+0x1e/0x42 mm_struct: 20 dup_mm+0x36/0x370 vm_area_struct: 384 dup_mm+0x18f/0x370 vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3 vm_area_struct: 1 split_vma+0x5a/0x10e vm_area_struct: 11 do_brk+0x206/0x2e2 vm_area_struct: 2 copy_vma+0xda/0x142 vm_area_struct: 9 setup_arg_pages+0x99/0x214 fs_cache: 8 copy_fs_struct+0x21/0x133 fs_cache: 29 copy_process+0xf38/0x10e3 files_cache: 30 alloc_files+0x1b/0xcf signal_cache: 81 copy_process+0xbaa/0x10e3 sighand_cache: 77 copy_process+0xe65/0x10e3 sighand_cache: 1 de_thread+0x4d/0x5f8 anon_vma: 241 anon_vma_prepare+0xd9/0xf3 size-2048: 1 add_sect_attrs+0x5f/0x145 size-2048: 2 journal_init_revoke+0x99/0x302 size-2048: 2 journal_init_revoke+0x137/0x302 size-2048: 2 journal_init_inode+0xf9/0x1c4 Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> DESC slab-leaks3-locking-fix EDESC From: Andrew Morton <akpm@osdl.org> Update for slab-remove-cachep-spinlock.patch Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 19:06:39 +08:00
{
void *p;
slab: fix oops when reading /proc/slab_allocators Commit b1cb0982bdd6 ("change the management method of free objects of the slab") introduced a bug on slab leak detector ('/proc/slab_allocators'). This detector works like as following decription. 1. traverse all objects on all the slabs. 2. determine whether it is active or not. 3. if active, print who allocate this object. but that commit changed the way how to manage free objects, so the logic determining whether it is active or not is also changed. In before, we regard object in cpu caches as inactive one, but, with this commit, we mistakenly regard object in cpu caches as active one. This intoduces kernel oops if DEBUG_PAGEALLOC is enabled. If DEBUG_PAGEALLOC is enabled, kernel_map_pages() is used to detect who corrupt free memory in the slab. It unmaps page table mapping if object is free and map it if object is active. When slab leak detector check object in cpu caches, it mistakenly think this object active so try to access object memory to retrieve caller of allocation. At this point, page table mapping to this object doesn't exist, so oops occurs. Following is oops message reported from Dave. It blew up when something tried to read /proc/slab_allocators (Just cat it, and you should see the oops below) Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: [snip...] CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131 task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000 RIP: 0010:[<ffffffffaa1a8f4a>] [<ffffffffaa1a8f4a>] handle_slab+0x8a/0x180 RSP: 0018:ffff880076925de0 EFLAGS: 00010002 RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7 RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000 RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000 R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0 FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0 DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602 Call Trace: leaks_show+0xce/0x240 seq_read+0x28e/0x490 proc_reg_read+0x3d/0x80 vfs_read+0x9b/0x160 SyS_read+0x58/0xb0 tracesys+0xd4/0xd9 Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 <4d> 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46 RIP handle_slab+0x8a/0x180 To fix the problem, I introduce an object status buffer on each slab. With this, we can track object status precisely, so slab leak detector would not access active object and no kernel oops would occur. Memory overhead caused by this fix is only imposed to CONFIG_DEBUG_SLAB_LEAK which is mainly used for debugging, so memory overhead isn't big problem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Dave Jones <davej@redhat.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-24 04:22:06 +08:00
int i;
[PATCH] slab: implement /proc/slab_allocators Implement /proc/slab_allocators. It produces output like: idr_layer_cache: 80 idr_pre_get+0x33/0x4e buffer_head: 2555 alloc_buffer_head+0x20/0x75 mm_struct: 9 mm_alloc+0x1e/0x42 mm_struct: 20 dup_mm+0x36/0x370 vm_area_struct: 384 dup_mm+0x18f/0x370 vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3 vm_area_struct: 1 split_vma+0x5a/0x10e vm_area_struct: 11 do_brk+0x206/0x2e2 vm_area_struct: 2 copy_vma+0xda/0x142 vm_area_struct: 9 setup_arg_pages+0x99/0x214 fs_cache: 8 copy_fs_struct+0x21/0x133 fs_cache: 29 copy_process+0xf38/0x10e3 files_cache: 30 alloc_files+0x1b/0xcf signal_cache: 81 copy_process+0xbaa/0x10e3 sighand_cache: 77 copy_process+0xe65/0x10e3 sighand_cache: 1 de_thread+0x4d/0x5f8 anon_vma: 241 anon_vma_prepare+0xd9/0xf3 size-2048: 1 add_sect_attrs+0x5f/0x145 size-2048: 2 journal_init_revoke+0x99/0x302 size-2048: 2 journal_init_revoke+0x137/0x302 size-2048: 2 journal_init_inode+0xf9/0x1c4 Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> DESC slab-leaks3-locking-fix EDESC From: Andrew Morton <akpm@osdl.org> Update for slab-remove-cachep-spinlock.patch Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 19:06:39 +08:00
if (n[0] == n[1])
return;
for (i = 0, p = page->s_mem; i < c->num; i++, p += c->size) {
slab: fix oops when reading /proc/slab_allocators Commit b1cb0982bdd6 ("change the management method of free objects of the slab") introduced a bug on slab leak detector ('/proc/slab_allocators'). This detector works like as following decription. 1. traverse all objects on all the slabs. 2. determine whether it is active or not. 3. if active, print who allocate this object. but that commit changed the way how to manage free objects, so the logic determining whether it is active or not is also changed. In before, we regard object in cpu caches as inactive one, but, with this commit, we mistakenly regard object in cpu caches as active one. This intoduces kernel oops if DEBUG_PAGEALLOC is enabled. If DEBUG_PAGEALLOC is enabled, kernel_map_pages() is used to detect who corrupt free memory in the slab. It unmaps page table mapping if object is free and map it if object is active. When slab leak detector check object in cpu caches, it mistakenly think this object active so try to access object memory to retrieve caller of allocation. At this point, page table mapping to this object doesn't exist, so oops occurs. Following is oops message reported from Dave. It blew up when something tried to read /proc/slab_allocators (Just cat it, and you should see the oops below) Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: [snip...] CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131 task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000 RIP: 0010:[<ffffffffaa1a8f4a>] [<ffffffffaa1a8f4a>] handle_slab+0x8a/0x180 RSP: 0018:ffff880076925de0 EFLAGS: 00010002 RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7 RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000 RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000 R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0 FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0 DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602 Call Trace: leaks_show+0xce/0x240 seq_read+0x28e/0x490 proc_reg_read+0x3d/0x80 vfs_read+0x9b/0x160 SyS_read+0x58/0xb0 tracesys+0xd4/0xd9 Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 <4d> 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46 RIP handle_slab+0x8a/0x180 To fix the problem, I introduce an object status buffer on each slab. With this, we can track object status precisely, so slab leak detector would not access active object and no kernel oops would occur. Memory overhead caused by this fix is only imposed to CONFIG_DEBUG_SLAB_LEAK which is mainly used for debugging, so memory overhead isn't big problem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Dave Jones <davej@redhat.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-24 04:22:06 +08:00
if (get_obj_status(page, i) != OBJECT_ACTIVE)
[PATCH] slab: implement /proc/slab_allocators Implement /proc/slab_allocators. It produces output like: idr_layer_cache: 80 idr_pre_get+0x33/0x4e buffer_head: 2555 alloc_buffer_head+0x20/0x75 mm_struct: 9 mm_alloc+0x1e/0x42 mm_struct: 20 dup_mm+0x36/0x370 vm_area_struct: 384 dup_mm+0x18f/0x370 vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3 vm_area_struct: 1 split_vma+0x5a/0x10e vm_area_struct: 11 do_brk+0x206/0x2e2 vm_area_struct: 2 copy_vma+0xda/0x142 vm_area_struct: 9 setup_arg_pages+0x99/0x214 fs_cache: 8 copy_fs_struct+0x21/0x133 fs_cache: 29 copy_process+0xf38/0x10e3 files_cache: 30 alloc_files+0x1b/0xcf signal_cache: 81 copy_process+0xbaa/0x10e3 sighand_cache: 77 copy_process+0xe65/0x10e3 sighand_cache: 1 de_thread+0x4d/0x5f8 anon_vma: 241 anon_vma_prepare+0xd9/0xf3 size-2048: 1 add_sect_attrs+0x5f/0x145 size-2048: 2 journal_init_revoke+0x99/0x302 size-2048: 2 journal_init_revoke+0x137/0x302 size-2048: 2 journal_init_inode+0xf9/0x1c4 Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> DESC slab-leaks3-locking-fix EDESC From: Andrew Morton <akpm@osdl.org> Update for slab-remove-cachep-spinlock.patch Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 19:06:39 +08:00
continue;
[PATCH] slab: implement /proc/slab_allocators Implement /proc/slab_allocators. It produces output like: idr_layer_cache: 80 idr_pre_get+0x33/0x4e buffer_head: 2555 alloc_buffer_head+0x20/0x75 mm_struct: 9 mm_alloc+0x1e/0x42 mm_struct: 20 dup_mm+0x36/0x370 vm_area_struct: 384 dup_mm+0x18f/0x370 vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3 vm_area_struct: 1 split_vma+0x5a/0x10e vm_area_struct: 11 do_brk+0x206/0x2e2 vm_area_struct: 2 copy_vma+0xda/0x142 vm_area_struct: 9 setup_arg_pages+0x99/0x214 fs_cache: 8 copy_fs_struct+0x21/0x133 fs_cache: 29 copy_process+0xf38/0x10e3 files_cache: 30 alloc_files+0x1b/0xcf signal_cache: 81 copy_process+0xbaa/0x10e3 sighand_cache: 77 copy_process+0xe65/0x10e3 sighand_cache: 1 de_thread+0x4d/0x5f8 anon_vma: 241 anon_vma_prepare+0xd9/0xf3 size-2048: 1 add_sect_attrs+0x5f/0x145 size-2048: 2 journal_init_revoke+0x99/0x302 size-2048: 2 journal_init_revoke+0x137/0x302 size-2048: 2 journal_init_inode+0xf9/0x1c4 Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> DESC slab-leaks3-locking-fix EDESC From: Andrew Morton <akpm@osdl.org> Update for slab-remove-cachep-spinlock.patch Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 19:06:39 +08:00
if (!add_caller(n, (unsigned long)*dbg_userword(c, p)))
return;
}
}
static void show_symbol(struct seq_file *m, unsigned long address)
{
#ifdef CONFIG_KALLSYMS
unsigned long offset, size;
char modname[MODULE_NAME_LEN], name[KSYM_NAME_LEN];
[PATCH] slab: implement /proc/slab_allocators Implement /proc/slab_allocators. It produces output like: idr_layer_cache: 80 idr_pre_get+0x33/0x4e buffer_head: 2555 alloc_buffer_head+0x20/0x75 mm_struct: 9 mm_alloc+0x1e/0x42 mm_struct: 20 dup_mm+0x36/0x370 vm_area_struct: 384 dup_mm+0x18f/0x370 vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3 vm_area_struct: 1 split_vma+0x5a/0x10e vm_area_struct: 11 do_brk+0x206/0x2e2 vm_area_struct: 2 copy_vma+0xda/0x142 vm_area_struct: 9 setup_arg_pages+0x99/0x214 fs_cache: 8 copy_fs_struct+0x21/0x133 fs_cache: 29 copy_process+0xf38/0x10e3 files_cache: 30 alloc_files+0x1b/0xcf signal_cache: 81 copy_process+0xbaa/0x10e3 sighand_cache: 77 copy_process+0xe65/0x10e3 sighand_cache: 1 de_thread+0x4d/0x5f8 anon_vma: 241 anon_vma_prepare+0xd9/0xf3 size-2048: 1 add_sect_attrs+0x5f/0x145 size-2048: 2 journal_init_revoke+0x99/0x302 size-2048: 2 journal_init_revoke+0x137/0x302 size-2048: 2 journal_init_inode+0xf9/0x1c4 Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> DESC slab-leaks3-locking-fix EDESC From: Andrew Morton <akpm@osdl.org> Update for slab-remove-cachep-spinlock.patch Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 19:06:39 +08:00
if (lookup_symbol_attrs(address, &size, &offset, modname, name) == 0) {
[PATCH] slab: implement /proc/slab_allocators Implement /proc/slab_allocators. It produces output like: idr_layer_cache: 80 idr_pre_get+0x33/0x4e buffer_head: 2555 alloc_buffer_head+0x20/0x75 mm_struct: 9 mm_alloc+0x1e/0x42 mm_struct: 20 dup_mm+0x36/0x370 vm_area_struct: 384 dup_mm+0x18f/0x370 vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3 vm_area_struct: 1 split_vma+0x5a/0x10e vm_area_struct: 11 do_brk+0x206/0x2e2 vm_area_struct: 2 copy_vma+0xda/0x142 vm_area_struct: 9 setup_arg_pages+0x99/0x214 fs_cache: 8 copy_fs_struct+0x21/0x133 fs_cache: 29 copy_process+0xf38/0x10e3 files_cache: 30 alloc_files+0x1b/0xcf signal_cache: 81 copy_process+0xbaa/0x10e3 sighand_cache: 77 copy_process+0xe65/0x10e3 sighand_cache: 1 de_thread+0x4d/0x5f8 anon_vma: 241 anon_vma_prepare+0xd9/0xf3 size-2048: 1 add_sect_attrs+0x5f/0x145 size-2048: 2 journal_init_revoke+0x99/0x302 size-2048: 2 journal_init_revoke+0x137/0x302 size-2048: 2 journal_init_inode+0xf9/0x1c4 Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> DESC slab-leaks3-locking-fix EDESC From: Andrew Morton <akpm@osdl.org> Update for slab-remove-cachep-spinlock.patch Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 19:06:39 +08:00
seq_printf(m, "%s+%#lx/%#lx", name, offset, size);
if (modname[0])
[PATCH] slab: implement /proc/slab_allocators Implement /proc/slab_allocators. It produces output like: idr_layer_cache: 80 idr_pre_get+0x33/0x4e buffer_head: 2555 alloc_buffer_head+0x20/0x75 mm_struct: 9 mm_alloc+0x1e/0x42 mm_struct: 20 dup_mm+0x36/0x370 vm_area_struct: 384 dup_mm+0x18f/0x370 vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3 vm_area_struct: 1 split_vma+0x5a/0x10e vm_area_struct: 11 do_brk+0x206/0x2e2 vm_area_struct: 2 copy_vma+0xda/0x142 vm_area_struct: 9 setup_arg_pages+0x99/0x214 fs_cache: 8 copy_fs_struct+0x21/0x133 fs_cache: 29 copy_process+0xf38/0x10e3 files_cache: 30 alloc_files+0x1b/0xcf signal_cache: 81 copy_process+0xbaa/0x10e3 sighand_cache: 77 copy_process+0xe65/0x10e3 sighand_cache: 1 de_thread+0x4d/0x5f8 anon_vma: 241 anon_vma_prepare+0xd9/0xf3 size-2048: 1 add_sect_attrs+0x5f/0x145 size-2048: 2 journal_init_revoke+0x99/0x302 size-2048: 2 journal_init_revoke+0x137/0x302 size-2048: 2 journal_init_inode+0xf9/0x1c4 Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> DESC slab-leaks3-locking-fix EDESC From: Andrew Morton <akpm@osdl.org> Update for slab-remove-cachep-spinlock.patch Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 19:06:39 +08:00
seq_printf(m, " [%s]", modname);
return;
}
#endif
seq_printf(m, "%p", (void *)address);
}
static int leaks_show(struct seq_file *m, void *p)
{
struct kmem_cache *cachep = list_entry(p, struct kmem_cache, list);
struct page *page;
struct kmem_cache_node *n;
[PATCH] slab: implement /proc/slab_allocators Implement /proc/slab_allocators. It produces output like: idr_layer_cache: 80 idr_pre_get+0x33/0x4e buffer_head: 2555 alloc_buffer_head+0x20/0x75 mm_struct: 9 mm_alloc+0x1e/0x42 mm_struct: 20 dup_mm+0x36/0x370 vm_area_struct: 384 dup_mm+0x18f/0x370 vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3 vm_area_struct: 1 split_vma+0x5a/0x10e vm_area_struct: 11 do_brk+0x206/0x2e2 vm_area_struct: 2 copy_vma+0xda/0x142 vm_area_struct: 9 setup_arg_pages+0x99/0x214 fs_cache: 8 copy_fs_struct+0x21/0x133 fs_cache: 29 copy_process+0xf38/0x10e3 files_cache: 30 alloc_files+0x1b/0xcf signal_cache: 81 copy_process+0xbaa/0x10e3 sighand_cache: 77 copy_process+0xe65/0x10e3 sighand_cache: 1 de_thread+0x4d/0x5f8 anon_vma: 241 anon_vma_prepare+0xd9/0xf3 size-2048: 1 add_sect_attrs+0x5f/0x145 size-2048: 2 journal_init_revoke+0x99/0x302 size-2048: 2 journal_init_revoke+0x137/0x302 size-2048: 2 journal_init_inode+0xf9/0x1c4 Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> DESC slab-leaks3-locking-fix EDESC From: Andrew Morton <akpm@osdl.org> Update for slab-remove-cachep-spinlock.patch Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 19:06:39 +08:00
const char *name;
unsigned long *x = m->private;
[PATCH] slab: implement /proc/slab_allocators Implement /proc/slab_allocators. It produces output like: idr_layer_cache: 80 idr_pre_get+0x33/0x4e buffer_head: 2555 alloc_buffer_head+0x20/0x75 mm_struct: 9 mm_alloc+0x1e/0x42 mm_struct: 20 dup_mm+0x36/0x370 vm_area_struct: 384 dup_mm+0x18f/0x370 vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3 vm_area_struct: 1 split_vma+0x5a/0x10e vm_area_struct: 11 do_brk+0x206/0x2e2 vm_area_struct: 2 copy_vma+0xda/0x142 vm_area_struct: 9 setup_arg_pages+0x99/0x214 fs_cache: 8 copy_fs_struct+0x21/0x133 fs_cache: 29 copy_process+0xf38/0x10e3 files_cache: 30 alloc_files+0x1b/0xcf signal_cache: 81 copy_process+0xbaa/0x10e3 sighand_cache: 77 copy_process+0xe65/0x10e3 sighand_cache: 1 de_thread+0x4d/0x5f8 anon_vma: 241 anon_vma_prepare+0xd9/0xf3 size-2048: 1 add_sect_attrs+0x5f/0x145 size-2048: 2 journal_init_revoke+0x99/0x302 size-2048: 2 journal_init_revoke+0x137/0x302 size-2048: 2 journal_init_inode+0xf9/0x1c4 Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> DESC slab-leaks3-locking-fix EDESC From: Andrew Morton <akpm@osdl.org> Update for slab-remove-cachep-spinlock.patch Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 19:06:39 +08:00
int node;
int i;
if (!(cachep->flags & SLAB_STORE_USER))
return 0;
if (!(cachep->flags & SLAB_RED_ZONE))
return 0;
/* OK, we can do it */
x[1] = 0;
[PATCH] slab: implement /proc/slab_allocators Implement /proc/slab_allocators. It produces output like: idr_layer_cache: 80 idr_pre_get+0x33/0x4e buffer_head: 2555 alloc_buffer_head+0x20/0x75 mm_struct: 9 mm_alloc+0x1e/0x42 mm_struct: 20 dup_mm+0x36/0x370 vm_area_struct: 384 dup_mm+0x18f/0x370 vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3 vm_area_struct: 1 split_vma+0x5a/0x10e vm_area_struct: 11 do_brk+0x206/0x2e2 vm_area_struct: 2 copy_vma+0xda/0x142 vm_area_struct: 9 setup_arg_pages+0x99/0x214 fs_cache: 8 copy_fs_struct+0x21/0x133 fs_cache: 29 copy_process+0xf38/0x10e3 files_cache: 30 alloc_files+0x1b/0xcf signal_cache: 81 copy_process+0xbaa/0x10e3 sighand_cache: 77 copy_process+0xe65/0x10e3 sighand_cache: 1 de_thread+0x4d/0x5f8 anon_vma: 241 anon_vma_prepare+0xd9/0xf3 size-2048: 1 add_sect_attrs+0x5f/0x145 size-2048: 2 journal_init_revoke+0x99/0x302 size-2048: 2 journal_init_revoke+0x137/0x302 size-2048: 2 journal_init_inode+0xf9/0x1c4 Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> DESC slab-leaks3-locking-fix EDESC From: Andrew Morton <akpm@osdl.org> Update for slab-remove-cachep-spinlock.patch Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 19:06:39 +08:00
for_each_kmem_cache_node(cachep, node, n) {
[PATCH] slab: implement /proc/slab_allocators Implement /proc/slab_allocators. It produces output like: idr_layer_cache: 80 idr_pre_get+0x33/0x4e buffer_head: 2555 alloc_buffer_head+0x20/0x75 mm_struct: 9 mm_alloc+0x1e/0x42 mm_struct: 20 dup_mm+0x36/0x370 vm_area_struct: 384 dup_mm+0x18f/0x370 vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3 vm_area_struct: 1 split_vma+0x5a/0x10e vm_area_struct: 11 do_brk+0x206/0x2e2 vm_area_struct: 2 copy_vma+0xda/0x142 vm_area_struct: 9 setup_arg_pages+0x99/0x214 fs_cache: 8 copy_fs_struct+0x21/0x133 fs_cache: 29 copy_process+0xf38/0x10e3 files_cache: 30 alloc_files+0x1b/0xcf signal_cache: 81 copy_process+0xbaa/0x10e3 sighand_cache: 77 copy_process+0xe65/0x10e3 sighand_cache: 1 de_thread+0x4d/0x5f8 anon_vma: 241 anon_vma_prepare+0xd9/0xf3 size-2048: 1 add_sect_attrs+0x5f/0x145 size-2048: 2 journal_init_revoke+0x99/0x302 size-2048: 2 journal_init_revoke+0x137/0x302 size-2048: 2 journal_init_inode+0xf9/0x1c4 Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> DESC slab-leaks3-locking-fix EDESC From: Andrew Morton <akpm@osdl.org> Update for slab-remove-cachep-spinlock.patch Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 19:06:39 +08:00
check_irq_on();
spin_lock_irq(&n->list_lock);
[PATCH] slab: implement /proc/slab_allocators Implement /proc/slab_allocators. It produces output like: idr_layer_cache: 80 idr_pre_get+0x33/0x4e buffer_head: 2555 alloc_buffer_head+0x20/0x75 mm_struct: 9 mm_alloc+0x1e/0x42 mm_struct: 20 dup_mm+0x36/0x370 vm_area_struct: 384 dup_mm+0x18f/0x370 vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3 vm_area_struct: 1 split_vma+0x5a/0x10e vm_area_struct: 11 do_brk+0x206/0x2e2 vm_area_struct: 2 copy_vma+0xda/0x142 vm_area_struct: 9 setup_arg_pages+0x99/0x214 fs_cache: 8 copy_fs_struct+0x21/0x133 fs_cache: 29 copy_process+0xf38/0x10e3 files_cache: 30 alloc_files+0x1b/0xcf signal_cache: 81 copy_process+0xbaa/0x10e3 sighand_cache: 77 copy_process+0xe65/0x10e3 sighand_cache: 1 de_thread+0x4d/0x5f8 anon_vma: 241 anon_vma_prepare+0xd9/0xf3 size-2048: 1 add_sect_attrs+0x5f/0x145 size-2048: 2 journal_init_revoke+0x99/0x302 size-2048: 2 journal_init_revoke+0x137/0x302 size-2048: 2 journal_init_inode+0xf9/0x1c4 Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> DESC slab-leaks3-locking-fix EDESC From: Andrew Morton <akpm@osdl.org> Update for slab-remove-cachep-spinlock.patch Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 19:06:39 +08:00
list_for_each_entry(page, &n->slabs_full, lru)
handle_slab(x, cachep, page);
list_for_each_entry(page, &n->slabs_partial, lru)
handle_slab(x, cachep, page);
spin_unlock_irq(&n->list_lock);
[PATCH] slab: implement /proc/slab_allocators Implement /proc/slab_allocators. It produces output like: idr_layer_cache: 80 idr_pre_get+0x33/0x4e buffer_head: 2555 alloc_buffer_head+0x20/0x75 mm_struct: 9 mm_alloc+0x1e/0x42 mm_struct: 20 dup_mm+0x36/0x370 vm_area_struct: 384 dup_mm+0x18f/0x370 vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3 vm_area_struct: 1 split_vma+0x5a/0x10e vm_area_struct: 11 do_brk+0x206/0x2e2 vm_area_struct: 2 copy_vma+0xda/0x142 vm_area_struct: 9 setup_arg_pages+0x99/0x214 fs_cache: 8 copy_fs_struct+0x21/0x133 fs_cache: 29 copy_process+0xf38/0x10e3 files_cache: 30 alloc_files+0x1b/0xcf signal_cache: 81 copy_process+0xbaa/0x10e3 sighand_cache: 77 copy_process+0xe65/0x10e3 sighand_cache: 1 de_thread+0x4d/0x5f8 anon_vma: 241 anon_vma_prepare+0xd9/0xf3 size-2048: 1 add_sect_attrs+0x5f/0x145 size-2048: 2 journal_init_revoke+0x99/0x302 size-2048: 2 journal_init_revoke+0x137/0x302 size-2048: 2 journal_init_inode+0xf9/0x1c4 Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> DESC slab-leaks3-locking-fix EDESC From: Andrew Morton <akpm@osdl.org> Update for slab-remove-cachep-spinlock.patch Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 19:06:39 +08:00
}
name = cachep->name;
if (x[0] == x[1]) {
[PATCH] slab: implement /proc/slab_allocators Implement /proc/slab_allocators. It produces output like: idr_layer_cache: 80 idr_pre_get+0x33/0x4e buffer_head: 2555 alloc_buffer_head+0x20/0x75 mm_struct: 9 mm_alloc+0x1e/0x42 mm_struct: 20 dup_mm+0x36/0x370 vm_area_struct: 384 dup_mm+0x18f/0x370 vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3 vm_area_struct: 1 split_vma+0x5a/0x10e vm_area_struct: 11 do_brk+0x206/0x2e2 vm_area_struct: 2 copy_vma+0xda/0x142 vm_area_struct: 9 setup_arg_pages+0x99/0x214 fs_cache: 8 copy_fs_struct+0x21/0x133 fs_cache: 29 copy_process+0xf38/0x10e3 files_cache: 30 alloc_files+0x1b/0xcf signal_cache: 81 copy_process+0xbaa/0x10e3 sighand_cache: 77 copy_process+0xe65/0x10e3 sighand_cache: 1 de_thread+0x4d/0x5f8 anon_vma: 241 anon_vma_prepare+0xd9/0xf3 size-2048: 1 add_sect_attrs+0x5f/0x145 size-2048: 2 journal_init_revoke+0x99/0x302 size-2048: 2 journal_init_revoke+0x137/0x302 size-2048: 2 journal_init_inode+0xf9/0x1c4 Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> DESC slab-leaks3-locking-fix EDESC From: Andrew Morton <akpm@osdl.org> Update for slab-remove-cachep-spinlock.patch Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 19:06:39 +08:00
/* Increase the buffer size */
mutex_unlock(&slab_mutex);
m->private = kzalloc(x[0] * 4 * sizeof(unsigned long), GFP_KERNEL);
[PATCH] slab: implement /proc/slab_allocators Implement /proc/slab_allocators. It produces output like: idr_layer_cache: 80 idr_pre_get+0x33/0x4e buffer_head: 2555 alloc_buffer_head+0x20/0x75 mm_struct: 9 mm_alloc+0x1e/0x42 mm_struct: 20 dup_mm+0x36/0x370 vm_area_struct: 384 dup_mm+0x18f/0x370 vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3 vm_area_struct: 1 split_vma+0x5a/0x10e vm_area_struct: 11 do_brk+0x206/0x2e2 vm_area_struct: 2 copy_vma+0xda/0x142 vm_area_struct: 9 setup_arg_pages+0x99/0x214 fs_cache: 8 copy_fs_struct+0x21/0x133 fs_cache: 29 copy_process+0xf38/0x10e3 files_cache: 30 alloc_files+0x1b/0xcf signal_cache: 81 copy_process+0xbaa/0x10e3 sighand_cache: 77 copy_process+0xe65/0x10e3 sighand_cache: 1 de_thread+0x4d/0x5f8 anon_vma: 241 anon_vma_prepare+0xd9/0xf3 size-2048: 1 add_sect_attrs+0x5f/0x145 size-2048: 2 journal_init_revoke+0x99/0x302 size-2048: 2 journal_init_revoke+0x137/0x302 size-2048: 2 journal_init_inode+0xf9/0x1c4 Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> DESC slab-leaks3-locking-fix EDESC From: Andrew Morton <akpm@osdl.org> Update for slab-remove-cachep-spinlock.patch Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 19:06:39 +08:00
if (!m->private) {
/* Too bad, we are really out */
m->private = x;
mutex_lock(&slab_mutex);
[PATCH] slab: implement /proc/slab_allocators Implement /proc/slab_allocators. It produces output like: idr_layer_cache: 80 idr_pre_get+0x33/0x4e buffer_head: 2555 alloc_buffer_head+0x20/0x75 mm_struct: 9 mm_alloc+0x1e/0x42 mm_struct: 20 dup_mm+0x36/0x370 vm_area_struct: 384 dup_mm+0x18f/0x370 vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3 vm_area_struct: 1 split_vma+0x5a/0x10e vm_area_struct: 11 do_brk+0x206/0x2e2 vm_area_struct: 2 copy_vma+0xda/0x142 vm_area_struct: 9 setup_arg_pages+0x99/0x214 fs_cache: 8 copy_fs_struct+0x21/0x133 fs_cache: 29 copy_process+0xf38/0x10e3 files_cache: 30 alloc_files+0x1b/0xcf signal_cache: 81 copy_process+0xbaa/0x10e3 sighand_cache: 77 copy_process+0xe65/0x10e3 sighand_cache: 1 de_thread+0x4d/0x5f8 anon_vma: 241 anon_vma_prepare+0xd9/0xf3 size-2048: 1 add_sect_attrs+0x5f/0x145 size-2048: 2 journal_init_revoke+0x99/0x302 size-2048: 2 journal_init_revoke+0x137/0x302 size-2048: 2 journal_init_inode+0xf9/0x1c4 Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> DESC slab-leaks3-locking-fix EDESC From: Andrew Morton <akpm@osdl.org> Update for slab-remove-cachep-spinlock.patch Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 19:06:39 +08:00
return -ENOMEM;
}
*(unsigned long *)m->private = x[0] * 2;
kfree(x);
mutex_lock(&slab_mutex);
[PATCH] slab: implement /proc/slab_allocators Implement /proc/slab_allocators. It produces output like: idr_layer_cache: 80 idr_pre_get+0x33/0x4e buffer_head: 2555 alloc_buffer_head+0x20/0x75 mm_struct: 9 mm_alloc+0x1e/0x42 mm_struct: 20 dup_mm+0x36/0x370 vm_area_struct: 384 dup_mm+0x18f/0x370 vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3 vm_area_struct: 1 split_vma+0x5a/0x10e vm_area_struct: 11 do_brk+0x206/0x2e2 vm_area_struct: 2 copy_vma+0xda/0x142 vm_area_struct: 9 setup_arg_pages+0x99/0x214 fs_cache: 8 copy_fs_struct+0x21/0x133 fs_cache: 29 copy_process+0xf38/0x10e3 files_cache: 30 alloc_files+0x1b/0xcf signal_cache: 81 copy_process+0xbaa/0x10e3 sighand_cache: 77 copy_process+0xe65/0x10e3 sighand_cache: 1 de_thread+0x4d/0x5f8 anon_vma: 241 anon_vma_prepare+0xd9/0xf3 size-2048: 1 add_sect_attrs+0x5f/0x145 size-2048: 2 journal_init_revoke+0x99/0x302 size-2048: 2 journal_init_revoke+0x137/0x302 size-2048: 2 journal_init_inode+0xf9/0x1c4 Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> DESC slab-leaks3-locking-fix EDESC From: Andrew Morton <akpm@osdl.org> Update for slab-remove-cachep-spinlock.patch Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 19:06:39 +08:00
/* Now make sure this entry will be retried */
m->count = m->size;
return 0;
}
for (i = 0; i < x[1]; i++) {
seq_printf(m, "%s: %lu ", name, x[2*i+3]);
show_symbol(m, x[2*i+2]);
[PATCH] slab: implement /proc/slab_allocators Implement /proc/slab_allocators. It produces output like: idr_layer_cache: 80 idr_pre_get+0x33/0x4e buffer_head: 2555 alloc_buffer_head+0x20/0x75 mm_struct: 9 mm_alloc+0x1e/0x42 mm_struct: 20 dup_mm+0x36/0x370 vm_area_struct: 384 dup_mm+0x18f/0x370 vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3 vm_area_struct: 1 split_vma+0x5a/0x10e vm_area_struct: 11 do_brk+0x206/0x2e2 vm_area_struct: 2 copy_vma+0xda/0x142 vm_area_struct: 9 setup_arg_pages+0x99/0x214 fs_cache: 8 copy_fs_struct+0x21/0x133 fs_cache: 29 copy_process+0xf38/0x10e3 files_cache: 30 alloc_files+0x1b/0xcf signal_cache: 81 copy_process+0xbaa/0x10e3 sighand_cache: 77 copy_process+0xe65/0x10e3 sighand_cache: 1 de_thread+0x4d/0x5f8 anon_vma: 241 anon_vma_prepare+0xd9/0xf3 size-2048: 1 add_sect_attrs+0x5f/0x145 size-2048: 2 journal_init_revoke+0x99/0x302 size-2048: 2 journal_init_revoke+0x137/0x302 size-2048: 2 journal_init_inode+0xf9/0x1c4 Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> DESC slab-leaks3-locking-fix EDESC From: Andrew Morton <akpm@osdl.org> Update for slab-remove-cachep-spinlock.patch Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 19:06:39 +08:00
seq_putc(m, '\n');
}
[PATCH] slab: implement /proc/slab_allocators Implement /proc/slab_allocators. It produces output like: idr_layer_cache: 80 idr_pre_get+0x33/0x4e buffer_head: 2555 alloc_buffer_head+0x20/0x75 mm_struct: 9 mm_alloc+0x1e/0x42 mm_struct: 20 dup_mm+0x36/0x370 vm_area_struct: 384 dup_mm+0x18f/0x370 vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3 vm_area_struct: 1 split_vma+0x5a/0x10e vm_area_struct: 11 do_brk+0x206/0x2e2 vm_area_struct: 2 copy_vma+0xda/0x142 vm_area_struct: 9 setup_arg_pages+0x99/0x214 fs_cache: 8 copy_fs_struct+0x21/0x133 fs_cache: 29 copy_process+0xf38/0x10e3 files_cache: 30 alloc_files+0x1b/0xcf signal_cache: 81 copy_process+0xbaa/0x10e3 sighand_cache: 77 copy_process+0xe65/0x10e3 sighand_cache: 1 de_thread+0x4d/0x5f8 anon_vma: 241 anon_vma_prepare+0xd9/0xf3 size-2048: 1 add_sect_attrs+0x5f/0x145 size-2048: 2 journal_init_revoke+0x99/0x302 size-2048: 2 journal_init_revoke+0x137/0x302 size-2048: 2 journal_init_inode+0xf9/0x1c4 Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> DESC slab-leaks3-locking-fix EDESC From: Andrew Morton <akpm@osdl.org> Update for slab-remove-cachep-spinlock.patch Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 19:06:39 +08:00
return 0;
}
static const struct seq_operations slabstats_op = {
.start = slab_start,
.next = slab_next,
.stop = slab_stop,
[PATCH] slab: implement /proc/slab_allocators Implement /proc/slab_allocators. It produces output like: idr_layer_cache: 80 idr_pre_get+0x33/0x4e buffer_head: 2555 alloc_buffer_head+0x20/0x75 mm_struct: 9 mm_alloc+0x1e/0x42 mm_struct: 20 dup_mm+0x36/0x370 vm_area_struct: 384 dup_mm+0x18f/0x370 vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3 vm_area_struct: 1 split_vma+0x5a/0x10e vm_area_struct: 11 do_brk+0x206/0x2e2 vm_area_struct: 2 copy_vma+0xda/0x142 vm_area_struct: 9 setup_arg_pages+0x99/0x214 fs_cache: 8 copy_fs_struct+0x21/0x133 fs_cache: 29 copy_process+0xf38/0x10e3 files_cache: 30 alloc_files+0x1b/0xcf signal_cache: 81 copy_process+0xbaa/0x10e3 sighand_cache: 77 copy_process+0xe65/0x10e3 sighand_cache: 1 de_thread+0x4d/0x5f8 anon_vma: 241 anon_vma_prepare+0xd9/0xf3 size-2048: 1 add_sect_attrs+0x5f/0x145 size-2048: 2 journal_init_revoke+0x99/0x302 size-2048: 2 journal_init_revoke+0x137/0x302 size-2048: 2 journal_init_inode+0xf9/0x1c4 Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> DESC slab-leaks3-locking-fix EDESC From: Andrew Morton <akpm@osdl.org> Update for slab-remove-cachep-spinlock.patch Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 19:06:39 +08:00
.show = leaks_show,
};
static int slabstats_open(struct inode *inode, struct file *file)
{
unsigned long *n;
n = __seq_open_private(file, &slabstats_op, PAGE_SIZE);
if (!n)
return -ENOMEM;
*n = PAGE_SIZE / (2 * sizeof(unsigned long));
return 0;
}
static const struct file_operations proc_slabstats_operations = {
.open = slabstats_open,
.read = seq_read,
.llseek = seq_lseek,
.release = seq_release_private,
};
#endif
static int __init slab_proc_init(void)
{
#ifdef CONFIG_DEBUG_SLAB_LEAK
proc_create("slab_allocators", 0, NULL, &proc_slabstats_operations);
[PATCH] slab: implement /proc/slab_allocators Implement /proc/slab_allocators. It produces output like: idr_layer_cache: 80 idr_pre_get+0x33/0x4e buffer_head: 2555 alloc_buffer_head+0x20/0x75 mm_struct: 9 mm_alloc+0x1e/0x42 mm_struct: 20 dup_mm+0x36/0x370 vm_area_struct: 384 dup_mm+0x18f/0x370 vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3 vm_area_struct: 1 split_vma+0x5a/0x10e vm_area_struct: 11 do_brk+0x206/0x2e2 vm_area_struct: 2 copy_vma+0xda/0x142 vm_area_struct: 9 setup_arg_pages+0x99/0x214 fs_cache: 8 copy_fs_struct+0x21/0x133 fs_cache: 29 copy_process+0xf38/0x10e3 files_cache: 30 alloc_files+0x1b/0xcf signal_cache: 81 copy_process+0xbaa/0x10e3 sighand_cache: 77 copy_process+0xe65/0x10e3 sighand_cache: 1 de_thread+0x4d/0x5f8 anon_vma: 241 anon_vma_prepare+0xd9/0xf3 size-2048: 1 add_sect_attrs+0x5f/0x145 size-2048: 2 journal_init_revoke+0x99/0x302 size-2048: 2 journal_init_revoke+0x137/0x302 size-2048: 2 journal_init_inode+0xf9/0x1c4 Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> DESC slab-leaks3-locking-fix EDESC From: Andrew Morton <akpm@osdl.org> Update for slab-remove-cachep-spinlock.patch Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 19:06:39 +08:00
#endif
return 0;
}
module_init(slab_proc_init);
#endif
/**
* ksize - get the actual amount of memory allocated for a given object
* @objp: Pointer to the object
*
* kmalloc may internally round up allocations and return more memory
* than requested. ksize() can be used to determine the actual amount of
* memory allocated. The caller may use this additional memory, even though
* a smaller amount of memory was initially specified with the kmalloc call.
* The caller must guarantee that objp points to a valid object previously
* allocated with either kmalloc() or kmem_cache_alloc(). The object
* must not be freed during the duration of the call.
*/
size_t ksize(const void *objp)
{
BUG_ON(!objp);
if (unlikely(objp == ZERO_SIZE_PTR))
return 0;
return virt_to_cache(objp)->object_size;
}
EXPORT_SYMBOL(ksize);