/*
 *  linux/mm/memory_hotplug.c
 *
 *  Copyright (C)
 */

#include <linux/stddef.h>
#include <linux/mm.h>
#include <linux/sched/signal.h>
#include <linux/swap.h>
#include <linux/interrupt.h>
#include <linux/pagemap.h>
#include <linux/compiler.h>
#include <linux/export.h>
#include <linux/pagevec.h>
#include <linux/writeback.h>
#include <linux/slab.h>
#include <linux/sysctl.h>
#include <linux/cpu.h>
#include <linux/memory.h>
#include <linux/memremap.h>
#include <linux/memory_hotplug.h>
#include <linux/highmem.h>
#include <linux/vmalloc.h>
#include <linux/ioport.h>
#include <linux/delay.h>
#include <linux/migrate.h>
#include <linux/page-isolation.h>
#include <linux/pfn.h>
#include <linux/suspend.h>
#include <linux/mm_inline.h>
#include <linux/firmware-map.h>
#include <linux/stop_machine.h>
#include <linux/hugetlb.h>
mem-hotplug: introduce movable_node boot option
The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel, it
cannot be hot-removed. So memory hotplug users may want to set all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.
Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.
But the kernel cannot use memory in ZONE_MOVABLE. By doing this, the
kernel cannot use memory in movable nodes. This degrades NUMA
performance, and other users may be unhappy.
So we need a way to allow users to enable and disable this functionality.
In this patch, we introduce the movable_node boot option to allow users to
choose not to consume hotpluggable memory at early boot time, so that later
we can set it as ZONE_MOVABLE.
To achieve this, the movable_node boot option will control the memblock
allocation direction. That said, after memblock is ready, before SRAT is
parsed, we should allocate memory near the kernel image as we explained in
the previous patches. So if movable_node boot option is set, the kernel
does the following:
1. After memblock is ready, make memblock allocate memory bottom up.
2. After SRAT is parsed, make memblock behave as default, allocate memory
top down.
Users can specify "movable_node" in kernel commandline to enable this
functionality. For those who don't use memory hotplug or who don't want
to lose their NUMA performance, just don't specify anything. The kernel
will work as before.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Suggested-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 07:08:10 +08:00
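The two allocation directions described in the commit message above can be illustrated outside the kernel. What follows is a minimal userspace sketch under stated assumptions, not memblock itself: toy_region, toy_alloc() and the bottom_up flag are hypothetical names invented for this illustration, and the model is a single contiguous free range with no alignment handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one free physical range [base, base + size). */
struct toy_region {
	uint64_t base;
	uint64_t size;
	bool bottom_up;	/* true: allocate from the low end; false: from the high end */
};

/* Returns the start address of the allocated chunk, or UINT64_MAX on failure. */
static uint64_t toy_alloc(struct toy_region *r, uint64_t len)
{
	uint64_t addr;

	assert(r != NULL);
	if (len == 0 || len > r->size)
		return UINT64_MAX;

	if (r->bottom_up) {
		/* Lowest free address: stays close to where the range begins. */
		addr = r->base;
		r->base += len;
	} else {
		/* Highest free address: the default, top-down behaviour. */
		addr = r->base + r->size - len;
	}
	r->size -= len;
	return addr;
}
```

In this toy model, switching bottom_up from true to false after an early allocation mirrors the two steps listed in the commit message: early allocations land at the low end of the range, later ones at the high end.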
#include <linux/memblock.h>
#include <linux/bootmem.h>
mm, compaction: introduce kcompactd
Memory compaction can be currently performed in several contexts:
- kswapd balancing a zone after a high-order allocation failure
- direct compaction to satisfy a high-order allocation, including THP
page fault attempts
- khugepaged trying to collapse a hugepage
- manually from /proc
The purpose of compaction is two-fold. The obvious purpose is to
satisfy a (pending or future) high-order allocation, and is easy to
evaluate. The other purpose is to keep overall memory fragmentation low
and help the anti-fragmentation mechanism. The success wrt the latter
purpose is more
The current situation wrt the purposes has a few drawbacks:
- compaction is invoked only when a high-order page or hugepage is not
available (or manually). This might be too late for the purposes of
keeping memory fragmentation low.
- direct compaction increases latency of allocations. Again, it would
be better if compaction was performed asynchronously to keep
fragmentation low, before the allocation itself comes.
- (a special case of the previous) the cost of compaction during THP
page faults can easily offset the benefits of THP.
- kswapd compaction appears to be complex, fragile and not working in
some scenarios. It could also end up compacting for a high-order
allocation request when it should be reclaiming memory for a later
order-0 request.
To improve the situation, we should be able to benefit from an
equivalent of kswapd, but for compaction - i.e. a background thread
which responds to fragmentation and the need for high-order allocations
(including hugepages) somewhat proactively.
One possibility is to extend the responsibilities of kswapd, which could
however complicate its design too much. It should be better to let
kswapd handle reclaim, as order-0 allocations are often more critical
than high-order ones.
Another possibility is to extend khugepaged, but this kthread is a
single instance and tied to THP configs.
This patch goes with the option of a new set of per-node kthreads called
kcompactd, and lays the foundations, without introducing any new
tunables. The lifecycle mimics kswapd kthreads, including the memory
hotplug hooks.
For compaction, kcompactd uses the standard compaction_suitable() and
compact_finished() criteria and the deferred compaction functionality.
Unlike direct compaction, it uses only sync compaction, as there's no
allocation latency to minimize.
This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
compact/reclaim loop for high-order pages will be replaced by waking up
kcompactd in the next patch with the description of what's wrong with
the old approach.
Waking up of the kcompactd threads is also tied to kswapd activity and
follows these rules:
- we don't want to affect any fastpaths, so wake up kcompactd only from
the slowpath, as it's done for kswapd
- if kswapd is doing reclaim, it's more important than compaction, so
don't invoke kcompactd until kswapd goes to sleep
- the target order used for kswapd is passed to kcompactd
Possible future uses for kcompactd include the ability to wake up
kcompactd on demand in special situations, such as when hugepages are
not available (currently not done due to __GFP_NO_KSWAPD) or when a
fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also
possible to perform periodic compaction with kcompactd.
[arnd@arndb.de: fix build errors with kcompactd]
[paul.gortmaker@windriver.com: don't use modular references for non modular code]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-18 05:18:08 +08:00
#include <linux/compaction.h>

#include <asm/tlbflush.h>

#include "internal.h"

/*
 * online_page_callback contains pointer to current page onlining function.
 * Initially it is generic_online_page(). If it is required it could be
 * changed by calling set_online_page_callback() for callback registration
 * and restore_online_page_callback() for generic callback restore.
 */

static void generic_online_page(struct page *page);

static online_page_callback_t online_page_callback = generic_online_page;

mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them deadlock prone. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
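The reader/writer relationship described in the commit message above (many concurrent get/put_online_mems sections, excluded only by a hotplug operation) can be sketched in userspace. This is an illustration under stated assumptions, not the kernel implementation: every toy_* name is hypothetical, the state word is invented for this sketch, and the functions only try to take the lock, whereas the kernel primitives block until they succeed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical state word (not a kernel structure):
 *   >  0  number of active readers
 *   == 0  unlocked
 *   == -1 one writer (a hotplug operation in progress)
 */
static atomic_int toy_hotplug_state;

/* Reader side: succeeds unless a writer currently holds the lock. */
static bool toy_try_get_online_mems(void)
{
	int cur = atomic_load(&toy_hotplug_state);

	while (cur >= 0) {
		/* On failure, cur is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&toy_hotplug_state, &cur, cur + 1))
			return true;
	}
	return false;
}

static void toy_put_online_mems(void)
{
	int prev = atomic_fetch_sub(&toy_hotplug_state, 1);

	assert(prev > 0);	/* must pair with a successful get */
	(void)prev;
}

/* Writer side: succeeds only when there are no readers and no writer. */
static bool toy_try_mem_hotplug_begin(void)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&toy_hotplug_state, &expected, -1);
}

static void toy_mem_hotplug_done(void)
{
	atomic_store(&toy_hotplug_state, 0);
}
```

The point of the sketch is the asymmetry: any number of readers may hold the state at once, which a plain mutex does not allow, while a writer needs the state to be completely idle.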
static DEFINE_MUTEX(online_page_callback_lock);

DEFINE_STATIC_PERCPU_RWSEM(mem_hotplug_lock);

void get_online_mems(void)
{
	percpu_down_read(&mem_hotplug_lock);
}

void put_online_mems(void)
{
	percpu_up_read(&mem_hotplug_lock);
}

bool movable_node_enabled = false;

#ifndef CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
bool memhp_auto_online;
#else
bool memhp_auto_online = true;
#endif
EXPORT_SYMBOL_GPL(memhp_auto_online);

static int __init setup_memhp_default_state(char *str)
{
	if (!strcmp(str, "online"))
		memhp_auto_online = true;
	else if (!strcmp(str, "offline"))
		memhp_auto_online = false;

	return 1;
}
__setup("memhp_default_state=", setup_memhp_default_state);

void mem_hotplug_begin(void)
{
	cpus_read_lock();
	percpu_down_write(&mem_hotplug_lock);
}

void mem_hotplug_done(void)
{
	percpu_up_write(&mem_hotplug_lock);
	cpus_read_unlock();
}

/* add this memory to iomem resource */
static struct resource *register_memory_resource(u64 start, u64 size)
{
	struct resource *res, *conflict;
	res = kzalloc(sizeof(struct resource), GFP_KERNEL);
	if (!res)
		return ERR_PTR(-ENOMEM);

	res->name = "System RAM";
	res->start = start;
	res->end = start + size - 1;
	res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
	conflict = request_resource_conflict(&iomem_resource, res);
	if (conflict) {
		if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
			pr_debug("Device unaddressable memory block "
				 "memory hotplug at %#010llx !\n",
				 (unsigned long long)start);
		}
		pr_debug("System RAM resource %pR cannot be added\n", res);
		kfree(res);
		return ERR_PTR(-EEXIST);
	}
	return res;
}
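
register_memory_resource() stores the range with an inclusive end (start + size - 1) and fails when the new range conflicts with an existing resource. The sketch below illustrates that arithmetic and a simple overlap test in userspace. It is an assumption-laden illustration only: toy_res, toy_make_res() and toy_conflicts() are hypothetical names, and the real resource tree lookup is more involved than a pairwise comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a resource range; end is inclusive. */
struct toy_res {
	uint64_t start;
	uint64_t end;
};

/* Build a range from (start, size); size must be non-zero. */
static struct toy_res toy_make_res(uint64_t start, uint64_t size)
{
	struct toy_res r;

	assert(size > 0);
	r.start = start;
	r.end = start + size - 1;	/* last byte that belongs to the range */
	return r;
}

/* Two inclusive ranges overlap when each starts at or before the other's end. */
static bool toy_conflicts(const struct toy_res *a, const struct toy_res *b)
{
	return a->start <= b->end && b->start <= a->end;
}
```

With an inclusive end, two back-to-back blocks such as [0x1000, 0x1fff] and [0x2000, 0x2fff] do not conflict, which is why the subtraction of 1 matters.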

static void release_memory_resource(struct resource *res)
{
	if (!res)
		return;
	release_resource(res);
	kfree(res);
	return;
}

#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
void get_page_bootmem(unsigned long info, struct page *page,
		      unsigned long type)
memory hotplug: register section/node id to free
This patch set frees pages that were allocated by bootmem, for
memory hot-remove. Some memory management structures (e.g. memmap) are
allocated by bootmem.
To remove memory physically, some of them must be freed depending on the
circumstances. This patch set lays the groundwork for freeing those pages,
and frees memmaps.
The basic idea is to use the remaining members of struct page to record
information about the users of bootmem (section number or node id). When a
section is being removed, the kernel can check it. With this information,
some issues can be solved.
1) When the memmap of the section being removed was allocated on another
section by bootmem, it should/can be freed.
2) When the memmap of the section being removed was allocated on the
same section, it shouldn't be freed, because the section has to be
logically offlined already and all its pages must be isolated from the
page allocator. If it were freed, the page allocator might use memory
that will soon be physically removed.
3) When the section being removed holds another section's memmap, the
kernel will be able to easily show the user which section should be
removed before it. (Not implemented yet)
4) In case 2) above, page isolation will be able to check for and skip
the memmap's pages during logical memory offline (offline_pages()).
The current page isolation code fails in this case because such a page
is just a reserved page, and it cannot tell whether the page can be
removed or not. With this patch, it will be able to.
(Not implemented yet.)
5) Node information such as pgdat has similar issues, which can
also be solved this way.
(Not implemented yet, but the node id is remembered in the pages.)
Fortunately, the current bootmem allocator just keeps the PageReserved flag
and doesn't use any other members of struct page. The users of
bootmem don't use them either.
This patch:
This registers information, namely the node or section id. The kernel can
then distinguish which node/section uses the pages allocated by bootmem.
This is the basis for hot-removing sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
{
	page->freelist = (void *)type;
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
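The bookkeeping described in this commit message can be sketched as a small user-space model. This is only an illustration under stated assumptions, not the kernel implementation: `struct model_page`, the `MODEL_*` tags, and the `model_*` functions are hypothetical names, and a plain `freed` flag stands in for returning the page to the allocator. It shows a page starting with one reference (the bootmem reservation), gaining one reference per registered user, and being released when the count drops back to 1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tags; the kernel's real values are not reproduced here. */
enum model_type { MODEL_NONE = 0, MODEL_NODE_INFO, MODEL_SECTION_INFO };

/* Hypothetical stand-in for the struct page fields the scheme borrows. */
struct model_page {
	int refcount;           /* 1 while only the bootmem reservation holds it */
	enum model_type type;   /* what kind of bootmem data lives in the page */
	unsigned long info;     /* owning section number or node id */
	int freed;              /* set when the model hands the page back */
};

/* Record one user of the page: tag it and take a reference. */
void model_get_page_bootmem(unsigned long info, struct model_page *page,
			    enum model_type type)
{
	page->type = type;
	page->info = info;
	page->refcount++;
}

/* Drop one user; the last user clears the tags and releases the page. */
void model_put_page_bootmem(struct model_page *page)
{
	assert(page->type != MODEL_NONE);
	if (--page->refcount == 1) {
		page->type = MODEL_NONE;
		page->info = 0;
		page->freed = 1;
	}
}

/* Register a memmap page for section 7, then remove the section. */
int model_roundtrip(void)
{
	struct model_page page = { .refcount = 1 };

	model_get_page_bootmem(7, &page, MODEL_SECTION_INFO);
	assert(page.refcount == 2 && page.info == 7);
	model_put_page_bootmem(&page);
	return page.freed;      /* 1: the page went back to the allocator */
}
```

Calling `model_roundtrip()` walks one page through registration and removal. A page registered twice would survive a single put, which is the property the later double-registration fix depends on.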
2008-04-28 17:13:31 +08:00
|
|
|
SetPagePrivate(page);
|
|
|
|
set_page_private(page, info);
|
2016-03-18 05:19:26 +08:00
|
|
|
page_ref_inc(page);
|
2008-04-28 17:13:31 +08:00
|
|
|
}
|
|
|
|
|
2013-07-04 06:03:17 +08:00
|
|
|
void put_page_bootmem(struct page *page)
|
2008-04-28 17:13:31 +08:00
|
|
|
{
|
2011-01-14 07:47:00 +08:00
|
|
|
unsigned long type;
|
2008-04-28 17:13:31 +08:00
|
|
|
|
2017-02-23 07:45:13 +08:00
|
|
|
type = (unsigned long) page->freelist;
|
2011-01-14 07:47:00 +08:00
|
|
|
BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
|
|
|
|
type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE);
|
2008-04-28 17:13:31 +08:00
|
|
|
|
2016-03-18 05:19:26 +08:00
|
|
|
if (page_ref_dec_return(page) == 1) {
|
2017-02-23 07:45:13 +08:00
|
|
|
page->freelist = NULL;
|
2008-04-28 17:13:31 +08:00
|
|
|
ClearPagePrivate(page);
|
|
|
|
set_page_private(page, 0);
|
2011-01-14 07:47:00 +08:00
|
|
|
INIT_LIST_HEAD(&page->lru);
|
2013-07-04 06:03:17 +08:00
|
|
|
free_reserved_page(page);
|
2008-04-28 17:13:31 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2013-02-23 08:33:00 +08:00
|
|
|
#ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
|
|
|
|
#ifndef CONFIG_SPARSEMEM_VMEMMAP
|
2008-07-24 12:28:12 +08:00
|
|
|
static void register_page_bootmem_info_section(unsigned long start_pfn)
|
2008-04-28 17:13:31 +08:00
|
|
|
{
|
|
|
|
unsigned long *usemap, mapsize, section_nr, i;
|
|
|
|
struct mem_section *ms;
|
|
|
|
struct page *page, *memmap;
|
|
|
|
|
|
|
|
section_nr = pfn_to_section_nr(start_pfn);
|
|
|
|
ms = __nr_to_section(section_nr);
|
|
|
|
|
|
|
|
/* Get section's memmap address */
|
|
|
|
memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Get page for the memmap's phys address
|
|
|
|
* XXX: need more consideration for sparse_vmemmap...
|
|
|
|
*/
|
|
|
|
page = virt_to_page(memmap);
|
|
|
|
mapsize = sizeof(struct page) * PAGES_PER_SECTION;
|
|
|
|
mapsize = PAGE_ALIGN(mapsize) >> PAGE_SHIFT;
|
|
|
|
|
|
|
|
/* remember memmap's page */
|
|
|
|
for (i = 0; i < mapsize; i++, page++)
|
|
|
|
get_page_bootmem(section_nr, page, SECTION_INFO);
|
|
|
|
|
2018-02-01 08:17:25 +08:00
|
|
|
usemap = ms->pageblock_flags;
|
2008-04-28 17:13:31 +08:00
|
|
|
page = virt_to_page(usemap);
|
|
|
|
|
|
|
|
mapsize = PAGE_ALIGN(usemap_size()) >> PAGE_SHIFT;
|
|
|
|
|
|
|
|
for (i = 0; i < mapsize; i++, page++)
|
2008-07-24 12:28:17 +08:00
|
|
|
get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
|
2008-04-28 17:13:31 +08:00
|
|
|
|
|
|
|
}
|
2013-02-23 08:33:00 +08:00
|
|
|
#else /* CONFIG_SPARSEMEM_VMEMMAP */
|
|
|
|
static void register_page_bootmem_info_section(unsigned long start_pfn)
|
|
|
|
{
|
|
|
|
unsigned long *usemap, mapsize, section_nr, i;
|
|
|
|
struct mem_section *ms;
|
|
|
|
struct page *page, *memmap;
|
|
|
|
|
|
|
|
section_nr = pfn_to_section_nr(start_pfn);
|
|
|
|
ms = __nr_to_section(section_nr);
|
|
|
|
|
|
|
|
memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
|
|
|
|
|
|
|
|
register_page_bootmem_memmap(section_nr, memmap, PAGES_PER_SECTION);
|
|
|
|
|
2018-02-01 08:17:25 +08:00
|
|
|
usemap = ms->pageblock_flags;
|
2013-02-23 08:33:00 +08:00
|
|
|
page = virt_to_page(usemap);
|
|
|
|
|
|
|
|
mapsize = PAGE_ALIGN(usemap_size()) >> PAGE_SHIFT;
|
|
|
|
|
|
|
|
for (i = 0; i < mapsize; i++, page++)
|
|
|
|
get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
|
|
|
|
}
|
|
|
|
#endif /* !CONFIG_SPARSEMEM_VMEMMAP */
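Both variants of register_page_bootmem_info_section() turn a byte size into a page count with `PAGE_ALIGN(size) >> PAGE_SHIFT`, which is a round-up division by the page size. The helper below is a minimal sketch of that arithmetic; `model_pages_for` is a hypothetical name, and the figures in the note after it (a 64-byte struct page, 32768 pages per section, 4 KiB pages) are assumptions chosen for illustration rather than values taken from this file.

```c
#include <assert.h>

/*
 * Number of whole pages needed to hold `bytes`, for pages of
 * (1 << page_shift) bytes. Equivalent to PAGE_ALIGN(bytes) >> PAGE_SHIFT.
 */
unsigned long model_pages_for(unsigned long bytes, unsigned int page_shift)
{
	unsigned long page_size = 1UL << page_shift;

	return (bytes + page_size - 1) >> page_shift;
}
```

Under the assumed figures, a section's memmap occupies 64 * 32768 = 2097152 bytes, so `model_pages_for(64UL * 32768, 12)` yields 512 pages, each of which would be passed to get_page_bootmem() once.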
|
2008-04-28 17:13:31 +08:00
|
|
|
|
2016-05-28 06:23:32 +08:00
|
|
|
void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
|
2008-04-28 17:13:31 +08:00
|
|
|
{
|
|
|
|
unsigned long i, pfn, end_pfn, nr_pages;
|
|
|
|
int node = pgdat->node_id;
|
|
|
|
struct page *page;
|
|
|
|
|
|
|
|
nr_pages = PAGE_ALIGN(sizeof(struct pglist_data)) >> PAGE_SHIFT;
|
|
|
|
page = virt_to_page(pgdat);
|
|
|
|
|
|
|
|
for (i = 0; i < nr_pages; i++, page++)
|
|
|
|
get_page_bootmem(node, page, NODE_INFO);
|
|
|
|
|
|
|
|
pfn = pgdat->node_start_pfn;
|
2013-02-23 08:35:32 +08:00
|
|
|
end_pfn = pgdat_end_pfn(pgdat);
|
2008-04-28 17:13:31 +08:00
|
|
|
|
2013-07-09 07:00:23 +08:00
|
|
|
/* register section info */
|
memory hotplug: fix section info double registration bug
There may be a bug when registering section info. For example, on my
Itanium platform, the pfn range of node0 includes the other nodes, so
other nodes' section info will be double registered, and memmap's page
count will equal to 3.
node0: start_pfn=0x100, spanned_pfn=0x20fb00, present_pfn=0x7f8a3, => 0x000100-0x20fc00
node1: start_pfn=0x80000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x080000-0x100000
node2: start_pfn=0x100000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x100000-0x180000
node3: start_pfn=0x180000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x180000-0x200000
free_all_bootmem_node()
register_page_bootmem_info_node()
register_page_bootmem_info_section()
When hot remove memory, we can't free the memmap's page because
page_count() is 2 after put_page_bootmem().
sparse_remove_one_section()
free_section_usemap()
free_map_bootmem()
put_page_bootmem()
[akpm@linux-foundation.org: add code comment]
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-18 05:09:24 +08:00
|
|
|
for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
|
|
|
|
/*
|
|
|
|
* Some platforms can assign the same pfn to multiple nodes - on
|
|
|
|
* node0 as well as nodeN. To avoid registering a pfn against
|
|
|
|
* multiple nodes we check that this pfn does not already
|
2013-07-09 07:00:23 +08:00
|
|
|
* reside in some other node.
|
2012-09-18 05:09:24 +08:00
|
|
|
*/
|
2016-05-28 05:27:32 +08:00
|
|
|
if (pfn_valid(pfn) && (early_pfn_to_nid(pfn) == node))
|
2012-09-18 05:09:24 +08:00
|
|
|
register_page_bootmem_info_section(pfn);
|
|
|
|
}
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocator. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allocated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
}
|
2013-02-23 08:33:00 +08:00
|
|
|
#endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
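The double registration described in the commit message above can be illustrated with a small userspace model. This is only a sketch: `struct node_span`, `model_nodes`, `model_pfn_to_nid()` and `count_registrations()` are hypothetical names, the node layout is copied from the Itanium example in the commit message, and the rule "the highest-numbered node containing the pfn owns it" is an assumption standing in for `early_pfn_to_nid()`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: each node spans the half-open pfn range [start, end). */
struct node_span {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/* Node layout taken from the Itanium example in the commit message. */
static const struct node_span model_nodes[4] = {
	{ 0x000100UL, 0x20fc00UL },	/* node0 spans all other nodes */
	{ 0x080000UL, 0x100000UL },	/* node1 */
	{ 0x100000UL, 0x180000UL },	/* node2 */
	{ 0x180000UL, 0x200000UL },	/* node3 */
};

/*
 * Stand-in for early_pfn_to_nid(). Assumption: the highest-numbered node
 * whose span contains the pfn owns it. Returns -1 if no node contains it.
 */
static int model_pfn_to_nid(const struct node_span *nodes, size_t n,
			    unsigned long pfn)
{
	int nid = -1;
	size_t i;

	for (i = 0; i < n; i++)
		if (pfn >= nodes[i].start_pfn && pfn < nodes[i].end_pfn)
			nid = (int)i;
	return nid;
}

/*
 * Count how often a pfn would be registered when every node walks its
 * own span. With check_nid set, a node only registers pfns it owns,
 * which mirrors the early_pfn_to_nid(pfn) == node test in the loop.
 */
static int count_registrations(const struct node_span *nodes, size_t n,
			       unsigned long pfn, int check_nid)
{
	int count = 0;
	size_t i;

	assert(nodes != NULL);
	for (i = 0; i < n; i++) {
		if (pfn < nodes[i].start_pfn || pfn >= nodes[i].end_pfn)
			continue;
		if (check_nid && model_pfn_to_nid(nodes, n, pfn) != (int)i)
			continue;
		count++;
	}
	return count;
}
```

In this model a pfn inside node1's range is counted twice without the ownership check, because node0's span also covers it, and once with the check.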
|
2008-04-28 17:13:31 +08:00
|
|
|
|
2017-07-07 06:38:11 +08:00
|
|
|
static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
|
2017-12-29 15:53:54 +08:00
|
|
|
struct vmem_altmap *altmap, bool want_memblock)
|
2005-10-30 09:16:54 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
2006-08-06 03:15:06 +08:00
|
|
|
if (pfn_valid(phys_start_pfn))
|
|
|
|
return -EEXIST;
|
|
|
|
|
2017-12-29 15:53:54 +08:00
|
|
|
ret = sparse_add_one_section(NODE_DATA(nid), phys_start_pfn, altmap);
|
2005-10-30 09:16:54 +08:00
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
2017-07-07 06:37:45 +08:00
|
|
|
if (!want_memblock)
|
|
|
|
return 0;
|
|
|
|
|
2018-04-06 07:22:56 +08:00
|
|
|
return hotplug_memory_register(nid, __pfn_to_section(phys_start_pfn));
|
2005-10-30 09:16:54 +08:00
|
|
|
}
|
|
|
|
|
2013-04-30 06:08:22 +08:00
|
|
|
/*
|
|
|
|
* Reasonably generic function for adding memory. It is
|
|
|
|
* expected that archs that support memory hotplug will
|
|
|
|
* call this function after deciding the zone to which to
|
|
|
|
* add the new pages.
|
|
|
|
*/
|
2017-07-07 06:38:11 +08:00
|
|
|
int __ref __add_pages(int nid, unsigned long phys_start_pfn,
|
2017-12-29 15:53:53 +08:00
|
|
|
unsigned long nr_pages, struct vmem_altmap *altmap,
|
|
|
|
bool want_memblock)
|
2013-04-30 06:08:22 +08:00
|
|
|
{
|
|
|
|
unsigned long i;
|
|
|
|
int err = 0;
|
|
|
|
int start_sec, end_sec;
|
2016-01-16 08:56:22 +08:00
|
|
|
|
2013-04-30 06:08:22 +08:00
|
|
|
/* while initializing mem_map, align the hot-added range to sections */
|
|
|
|
start_sec = pfn_to_section_nr(phys_start_pfn);
|
|
|
|
end_sec = pfn_to_section_nr(phys_start_pfn + nr_pages - 1);
|
|
|
|
|
2016-01-16 08:56:22 +08:00
|
|
|
if (altmap) {
|
|
|
|
/*
|
|
|
|
* Validate altmap is within bounds of the total request
|
|
|
|
*/
|
|
|
|
if (altmap->base_pfn != phys_start_pfn
|
|
|
|
|| vmem_altmap_offset(altmap) > nr_pages) {
|
|
|
|
pr_warn_once("memory add fail, invalid altmap\n");
|
2016-03-16 05:57:51 +08:00
|
|
|
err = -EINVAL;
|
|
|
|
goto out;
|
2016-01-16 08:56:22 +08:00
|
|
|
}
|
|
|
|
altmap->alloc = 0;
|
|
|
|
}
|
|
|
|
|
2013-04-30 06:08:22 +08:00
|
|
|
for (i = start_sec; i <= end_sec; i++) {
|
2017-12-29 15:53:54 +08:00
|
|
|
err = __add_section(nid, section_nr_to_pfn(i), altmap,
|
|
|
|
want_memblock);
|
2013-04-30 06:08:22 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* -EEXIST is ultimately handled by the ioresource collision
|
|
|
|
* check; see add_memory() => register_memory_resource().
|
|
|
|
* A warning will be printed if there is a collision.
|
|
|
|
*/
|
|
|
|
if (err && (err != -EEXIST))
|
|
|
|
break;
|
|
|
|
err = 0;
|
2017-10-04 07:16:16 +08:00
|
|
|
cond_resched();
|
2013-04-30 06:08:22 +08:00
|
|
|
}
|
2015-06-25 07:58:42 +08:00
|
|
|
vmemmap_populate_print_last();
|
2016-03-16 05:57:51 +08:00
|
|
|
out:
|
2013-04-30 06:08:22 +08:00
|
|
|
return err;
|
|
|
|
}
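The section arithmetic used by __add_pages() can be tried out in userspace. The following is a minimal sketch, not the kernel's implementation: the `model_` names are hypothetical, and the shift of 15 is an assumption (it corresponds to 128 MiB sections with 4 KiB pages); the real value is architecture specific.

```c
#include <assert.h>

/* Assumption: 2^15 pages per section; the real shift is arch-specific. */
#define MODEL_PFN_SECTION_SHIFT	15
#define MODEL_PAGES_PER_SECTION	(1UL << MODEL_PFN_SECTION_SHIFT)

/* Section number that contains the given pfn. */
static unsigned long model_pfn_to_section_nr(unsigned long pfn)
{
	return pfn >> MODEL_PFN_SECTION_SHIFT;
}

/* First pfn of the given section. */
static unsigned long model_section_nr_to_pfn(unsigned long nr)
{
	return nr << MODEL_PFN_SECTION_SHIFT;
}

/*
 * Number of sections touched by [start_pfn, start_pfn + nr_pages).
 * The last pfn of the range is start_pfn + nr_pages - 1, which is why
 * the loop over sections runs from start_sec to end_sec inclusive.
 */
static unsigned long model_sections_spanned(unsigned long start_pfn,
					    unsigned long nr_pages)
{
	unsigned long start_sec, end_sec;

	assert(nr_pages > 0);
	start_sec = model_pfn_to_section_nr(start_pfn);
	end_sec = model_pfn_to_section_nr(start_pfn + nr_pages - 1);
	return end_sec - start_sec + 1;
}
```

A range that is not section aligned still touches every section it overlaps, so two pages straddling a section boundary count as two sections in this model.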
|
|
|
|
|
|
|
|
#ifdef CONFIG_MEMORY_HOTREMOVE
|
2013-02-23 08:33:12 +08:00
|
|
|
/* find the smallest valid pfn in the range [start_pfn, end_pfn) */
|
2017-10-04 07:16:32 +08:00
|
|
|
static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
|
2013-02-23 08:33:12 +08:00
|
|
|
unsigned long start_pfn,
|
|
|
|
unsigned long end_pfn)
|
|
|
|
{
|
|
|
|
struct mem_section *ms;
|
|
|
|
|
|
|
|
for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION) {
|
|
|
|
ms = __pfn_to_section(start_pfn);
|
|
|
|
|
|
|
|
if (unlikely(!valid_section(ms)))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (unlikely(pfn_to_nid(start_pfn) != nid))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (zone && zone != page_zone(pfn_to_page(start_pfn)))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
return start_pfn;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* find the biggest valid pfn in the range [start_pfn, end_pfn). */
|
2017-10-04 07:16:32 +08:00
|
|
|
static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
|
2013-02-23 08:33:12 +08:00
|
|
|
unsigned long start_pfn,
|
|
|
|
unsigned long end_pfn)
|
|
|
|
{
|
|
|
|
struct mem_section *ms;
|
|
|
|
unsigned long pfn;
|
|
|
|
|
|
|
|
/* pfn is the last pfn of a memory section. */
|
|
|
|
pfn = end_pfn - 1;
|
|
|
|
for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
|
|
|
|
ms = __pfn_to_section(pfn);
|
|
|
|
|
|
|
|
if (unlikely(!valid_section(ms)))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (unlikely(pfn_to_nid(pfn) != nid))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (zone && zone != page_zone(pfn_to_page(pfn)))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
return pfn;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
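The two scans above can be modelled on a plain array of per-section validity flags. This is a hedged sketch with hypothetical names (`model_section_valid`, `model_find_smallest()`, `model_find_biggest()`); it works on section indices rather than pfns and ignores the node and zone filters. Unlike the kernel helpers, which return 0 when nothing is found, the model returns -1 so that index 0 stays distinguishable from "not found".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical validity map: sections 1, 3 and 4 are present. */
static const bool model_section_valid[8] = {
	false, true, false, true, true, false, false, false,
};

/* Index of the first valid section in [start, end), or -1 if none. */
static long model_find_smallest(const bool *valid, size_t start, size_t end)
{
	size_t i;

	assert(valid != NULL);
	for (i = start; i < end; i++)
		if (valid[i])
			return (long)i;
	return -1;
}

/* Index of the last valid section in [start, end), or -1 if none. */
static long model_find_biggest(const bool *valid, size_t start, size_t end)
{
	size_t i;

	assert(valid != NULL);
	/* Count down with i > start so the unsigned index never wraps. */
	for (i = end; i > start; i--)
		if (valid[i - 1])
			return (long)(i - 1);
	return -1;
}
```

The backward scan tests `valid[i - 1]` while `i > start`, which is one way to walk an unsigned index down to zero without underflow.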
|
|
|
|
|
|
|
|
static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
|
|
|
|
unsigned long end_pfn)
|
|
|
|
{
|
2013-09-12 05:21:44 +08:00
|
|
|
unsigned long zone_start_pfn = zone->zone_start_pfn;
|
|
|
|
unsigned long z = zone_end_pfn(zone); /* zone_end_pfn namespace clash */
|
|
|
|
unsigned long zone_end_pfn = z;
|
2013-02-23 08:33:12 +08:00
|
|
|
unsigned long pfn;
|
|
|
|
struct mem_section *ms;
|
|
|
|
int nid = zone_to_nid(zone);
|
|
|
|
|
|
|
|
zone_span_writelock(zone);
|
|
|
|
if (zone_start_pfn == start_pfn) {
|
|
|
|
/*
|
|
|
|
* If the section is the smallest section in the zone, we need to
|
|
|
|
* shrink zone->zone_start_pfn and zone->spanned_pages.
|
|
|
|
* In this case, we find the second smallest valid mem_section
|
|
|
|
* for shrinking the zone.
|
|
|
|
*/
|
|
|
|
pfn = find_smallest_section_pfn(nid, zone, end_pfn,
|
|
|
|
zone_end_pfn);
|
|
|
|
if (pfn) {
|
|
|
|
zone->zone_start_pfn = pfn;
|
|
|
|
zone->spanned_pages = zone_end_pfn - pfn;
|
|
|
|
}
|
|
|
|
} else if (zone_end_pfn == end_pfn) {
|
|
|
|
/*
|
|
|
|
* If the section is the biggest section in the zone, we need to
|
|
|
|
* shrink zone->spanned_pages.
|
|
|
|
* In this case, we find the second biggest valid mem_section for
|
|
|
|
* shrinking the zone.
|
|
|
|
*/
|
|
|
|
pfn = find_biggest_section_pfn(nid, zone, zone_start_pfn,
|
|
|
|
start_pfn);
|
|
|
|
if (pfn)
|
|
|
|
zone->spanned_pages = pfn - zone_start_pfn + 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the section is neither the biggest nor the smallest mem_section in
|
|
|
|
* the zone, removing it only creates a hole in the zone, so we need not
|
|
|
|
* change the zone. However, the zone may now consist only of holes, so
|
|
|
|
* check whether any valid section remains.
|
|
|
|
*/
|
|
|
|
pfn = zone_start_pfn;
|
|
|
|
for (; pfn < zone_end_pfn; pfn += PAGES_PER_SECTION) {
|
|
|
|
ms = __pfn_to_section(pfn);
|
|
|
|
|
|
|
|
if (unlikely(!valid_section(ms)))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (page_zone(pfn_to_page(pfn)) != zone)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* Skip the section that is being removed */
|
|
|
|
if (start_pfn == pfn)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* If we find another valid section, there is nothing more to do */
|
|
|
|
zone_span_writeunlock(zone);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* The zone has no valid section */
|
|
|
|
zone->zone_start_pfn = 0;
|
|
|
|
zone->spanned_pages = 0;
|
|
|
|
zone_span_writeunlock(zone);
|
|
|
|
}
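The three cases handled by shrink_zone_span() (removing the first section, the last section, or one in the middle) can be sketched on section indices. This is an illustrative model under stated assumptions, not the kernel code: `struct model_span`, `model_present` and `model_shrink_span()` are hypothetical, locking is omitted, the zone check is dropped, and the span is returned by value instead of being updated in place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical span measured in sections, holes included. */
struct model_span {
	size_t start;	/* first section index covered */
	size_t nr;	/* number of sections spanned */
};

/* Hypothetical validity map: sections 1, 3 and 4 are present. */
static const bool model_present[6] = {
	false, true, false, true, true, false,
};

/*
 * Return the span that remains after section 'removed' goes away.
 * valid[removed] may still be set; that section is always skipped.
 */
static struct model_span model_shrink_span(struct model_span span,
					   const bool *valid, size_t removed)
{
	size_t start = span.start;
	size_t end = span.start + span.nr;
	size_t i;

	assert(removed >= start && removed < end);

	if (removed == start) {
		/* Removing the first section: move the start forward. */
		for (i = removed + 1; i < end; i++) {
			if (valid[i]) {
				span.start = i;
				span.nr = end - i;
				return span;
			}
		}
	} else if (removed + 1 == end) {
		/* Removing the last section: pull the end backward. */
		for (i = removed; i > start; i--) {
			if (valid[i - 1]) {
				span.nr = i - start;
				return span;
			}
		}
	} else {
		/* Removing a middle section only creates a hole. */
		for (i = start; i < end; i++)
			if (i != removed && valid[i])
				return span;
	}

	/* No other valid section is left: the span becomes empty. */
	span.start = 0;
	span.nr = 0;
	return span;
}
```

With sections 1, 3 and 4 present and a span covering 1 through 4, removing section 1 moves the start to 3, removing section 4 shortens the span to three sections, and removing section 3 leaves the span unchanged.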
|
|
|
|
|
|
|
|
static void shrink_pgdat_span(struct pglist_data *pgdat,
|
|
|
|
unsigned long start_pfn, unsigned long end_pfn)
|
|
|
|
{
|
2013-11-13 07:07:19 +08:00
|
|
|
unsigned long pgdat_start_pfn = pgdat->node_start_pfn;
|
|
|
|
unsigned long p = pgdat_end_pfn(pgdat); /* pgdat_end_pfn namespace clash */
|
|
|
|
unsigned long pgdat_end_pfn = p;
|
2013-02-23 08:33:12 +08:00
|
|
|
unsigned long pfn;
|
|
|
|
struct mem_section *ms;
|
|
|
|
int nid = pgdat->node_id;
|
|
|
|
|
|
|
|
if (pgdat_start_pfn == start_pfn) {
|
|
|
|
/*
|
|
|
|
* If the section is the smallest section in the pgdat, we need to
|
|
|
|
* shrink pgdat->node_start_pfn and pgdat->node_spanned_pages.
|
|
|
|
* In this case, we find the second smallest valid mem_section
|
|
|
|
* for shrinking the pgdat.
|
|
|
|
*/
|
|
|
|
pfn = find_smallest_section_pfn(nid, NULL, end_pfn,
|
|
|
|
pgdat_end_pfn);
|
|
|
|
if (pfn) {
|
|
|
|
pgdat->node_start_pfn = pfn;
|
|
|
|
pgdat->node_spanned_pages = pgdat_end_pfn - pfn;
|
|
|
|
}
|
|
|
|
} else if (pgdat_end_pfn == end_pfn) {
|
|
|
|
/*
|
|
|
|
* If the section is the biggest section in the pgdat, we need to
|
|
|
|
* shrink pgdat->node_spanned_pages.
|
|
|
|
* In this case, we find the second biggest valid mem_section for
|
|
|
|
* shrinking the pgdat.
|
|
|
|
*/
|
|
|
|
pfn = find_biggest_section_pfn(nid, NULL, pgdat_start_pfn,
|
|
|
|
start_pfn);
|
|
|
|
if (pfn)
|
|
|
|
pgdat->node_spanned_pages = pfn - pgdat_start_pfn + 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the section is neither the biggest nor the smallest mem_section in
|
|
|
|
* the pgdat, removing it only creates a hole in the pgdat, so we need not
|
|
|
|
* change the pgdat.
|
|
|
|
* However, the pgdat may now consist only of holes, so check whether any
|
|
|
|
* valid section remains.
|
|
|
|
*/
|
|
|
|
pfn = pgdat_start_pfn;
|
|
|
|
for (; pfn < pgdat_end_pfn; pfn += PAGES_PER_SECTION) {
|
|
|
|
ms = __pfn_to_section(pfn);
|
|
|
|
|
|
|
|
if (unlikely(!valid_section(ms)))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (pfn_to_nid(pfn) != nid)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* Skip the section that is being removed */
|
|
|
|
if (start_pfn == pfn)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* If we find another valid section, there is nothing more to do */
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* The pgdat has no valid section */
|
|
|
|
pgdat->node_start_pfn = 0;
|
|
|
|
pgdat->node_spanned_pages = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void __remove_zone(struct zone *zone, unsigned long start_pfn)
|
|
|
|
{
|
|
|
|
struct pglist_data *pgdat = zone->zone_pgdat;
|
|
|
|
int nr_pages = PAGES_PER_SECTION;
|
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
pgdat_resize_lock(zone->zone_pgdat, &flags);
|
|
|
|
shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
|
|
|
|
shrink_pgdat_span(pgdat, start_pfn, start_pfn + nr_pages);
|
|
|
|
pgdat_resize_unlock(zone->zone_pgdat, &flags);
|
|
|
|
}
|
|
|
|
|
2016-01-16 08:56:22 +08:00
|
|
|
static int __remove_section(struct zone *zone, struct mem_section *ms,
|
2017-12-29 15:53:56 +08:00
|
|
|
unsigned long map_offset, struct vmem_altmap *altmap)
|
2008-04-28 17:12:01 +08:00
|
|
|
{
|
2013-02-23 08:33:12 +08:00
|
|
|
unsigned long start_pfn;
|
|
|
|
int scn_nr;
|
2008-04-28 17:12:01 +08:00
|
|
|
int ret = -EINVAL;
|
|
|
|
|
|
|
|
if (!valid_section(ms))
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
ret = unregister_memory_section(ms);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2013-02-23 08:33:12 +08:00
|
|
|
scn_nr = __section_nr(ms);
|
2017-10-04 07:16:29 +08:00
|
|
|
start_pfn = section_nr_to_pfn((unsigned long)scn_nr);
|
2013-02-23 08:33:12 +08:00
|
|
|
__remove_zone(zone, start_pfn);
|
|
|
|
|
2017-12-29 15:53:56 +08:00
|
|
|
sparse_remove_one_section(zone, ms, map_offset, altmap);
|
2008-04-28 17:12:01 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* __remove_pages() - remove sections of pages from a zone
|
|
|
|
* @zone: zone from which pages need to be removed
|
|
|
|
* @phys_start_pfn: starting pageframe (must be aligned to start of a section)
|
|
|
|
* @nr_pages: number of pages to remove (must be multiple of section size)
|
2018-04-06 07:24:57 +08:00
|
|
|
* @altmap: alternative device page map or %NULL if default memmap is used
|
2008-04-28 17:12:01 +08:00
|
|
|
*
|
|
|
|
* Generic helper function to remove section mappings and sysfs entries
|
|
|
|
* for the section of the memory we are removing. Caller needs to make
|
|
|
|
* sure that pages are marked reserved and zones are adjusted properly by
|
|
|
|
* calling offline_pages().
|
|
|
|
*/
|
|
|
|
int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
|
2017-12-29 15:53:55 +08:00
|
|
|
unsigned long nr_pages, struct vmem_altmap *altmap)
|
2008-04-28 17:12:01 +08:00
|
|
|
{
|
2013-04-30 06:08:20 +08:00
|
|
|
unsigned long i;
|
2016-01-16 08:56:22 +08:00
|
|
|
unsigned long map_offset = 0;
|
|
|
|
int sections_to_remove, ret = 0;
|
|
|
|
|
|
|
|
/* In the ZONE_DEVICE case the device driver owns the memory region */
|
|
|
|
if (is_dev_zone(zone)) {
|
|
|
|
if (altmap)
|
|
|
|
map_offset = vmem_altmap_offset(altmap);
|
|
|
|
} else {
|
|
|
|
resource_size_t start, size;
|
|
|
|
|
|
|
|
start = phys_start_pfn << PAGE_SHIFT;
|
|
|
|
size = nr_pages * PAGE_SIZE;
|
|
|
|
|
|
|
|
ret = release_mem_region_adjustable(&iomem_resource, start,
|
|
|
|
size);
|
|
|
|
if (ret) {
|
|
|
|
resource_size_t endres = start + size - 1;
|
|
|
|
|
|
|
|
pr_warn("Unable to release resource <%pa-%pa> (%d)\n",
|
|
|
|
&start, &endres, ret);
|
|
|
|
}
|
|
|
|
}
|
2008-04-28 17:12:01 +08:00
|
|
|
|
2016-03-16 05:57:51 +08:00
|
|
|
clear_zone_contiguous(zone);
|
|
|
|
|
2008-04-28 17:12:01 +08:00
|
|
|
/*
|
|
|
|
* We can only remove entire sections
|
|
|
|
*/
|
|
|
|
BUG_ON(phys_start_pfn & ~PAGE_SECTION_MASK);
|
|
|
|
BUG_ON(nr_pages % PAGES_PER_SECTION);
|
|
|
|
|
|
|
|
sections_to_remove = nr_pages / PAGES_PER_SECTION;
|
|
|
|
for (i = 0; i < sections_to_remove; i++) {
|
|
|
|
unsigned long pfn = phys_start_pfn + i*PAGES_PER_SECTION;
|
2016-01-16 08:56:22 +08:00
|
|
|
|
2017-12-29 15:53:56 +08:00
|
|
|
ret = __remove_section(zone, __pfn_to_section(pfn), map_offset,
|
|
|
|
altmap);
|
2016-01-16 08:56:22 +08:00
|
|
|
map_offset = 0;
|
2008-04-28 17:12:01 +08:00
|
|
|
if (ret)
|
|
|
|
break;
|
|
|
|
}
|
2016-03-16 05:57:51 +08:00
|
|
|
|
|
|
|
set_zone_contiguous(zone);
|
|
|
|
|
2008-04-28 17:12:01 +08:00
|
|
|
return ret;
|
|
|
|
}
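The two BUG_ON() checks in __remove_pages() enforce that only whole sections are removed. A minimal userspace sketch of that precondition follows; the `model_`/`MODEL_` names are hypothetical and the section shift of 15 is an assumption, since the real value depends on the architecture.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 2^15 pages per section; the real shift is arch-specific. */
#define MODEL_PFN_SECTION_SHIFT	15
#define MODEL_PAGES_PER_SECTION	(1UL << MODEL_PFN_SECTION_SHIFT)
#define MODEL_PAGE_SECTION_MASK	(~(MODEL_PAGES_PER_SECTION - 1))

/* True if the range starts on a section boundary and covers whole sections. */
static bool model_range_is_section_aligned(unsigned long start_pfn,
					   unsigned long nr_pages)
{
	/* Any low bit set means the start lies inside a section. */
	if (start_pfn & ~MODEL_PAGE_SECTION_MASK)
		return false;
	/* The length must be a multiple of the section size. */
	if (nr_pages % MODEL_PAGES_PER_SECTION)
		return false;
	return true;
}

/* Number of loop iterations the removal would perform for the range. */
static unsigned long model_sections_to_remove(unsigned long start_pfn,
					      unsigned long nr_pages)
{
	assert(model_range_is_section_aligned(start_pfn, nr_pages));
	return nr_pages / MODEL_PAGES_PER_SECTION;
}
```

The model returns false where the kernel would hit BUG_ON(), which lets a caller validate a range before attempting removal.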
|
2013-04-30 06:08:22 +08:00
|
|
|
#endif /* CONFIG_MEMORY_HOTREMOVE */
|
2008-04-28 17:12:01 +08:00
|
|
|
|
2011-07-26 08:12:05 +08:00
|
|
|
int set_online_page_callback(online_page_callback_t callback)
|
|
|
|
{
|
|
|
|
int rc = -EINVAL;
|
|
|
|
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them deadlock prone. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
get_online_mems();
|
|
|
|
mutex_lock(&online_page_callback_lock);
|
2011-07-26 08:12:05 +08:00
|
|
|
|
|
|
|
if (online_page_callback == generic_online_page) {
|
|
|
|
online_page_callback = callback;
|
|
|
|
rc = 0;
|
|
|
|
}
|
|
|
|
|
2014-06-05 07:07:18 +08:00
|
|
|
mutex_unlock(&online_page_callback_lock);
|
|
|
|
put_online_mems();
|
2011-07-26 08:12:05 +08:00
|
|
|
|
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(set_online_page_callback);
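The set/restore pair implements a single-slot callback: a custom callback can only be installed while the generic one is active, and it can only be restored by whoever passes the callback that is currently installed. The sketch below models just that state machine. It is single-threaded and omits the get_online_mems() and mutex protection of the real code; all `model_` names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef void (*model_callback_t)(void *page);

/* Call counters keep the three stand-in callbacks distinct. */
static int model_generic_calls;
static int model_custom_calls;
static int model_other_calls;

static void model_generic_online_page(void *page)
{
	(void)page;
	model_generic_calls++;
}

static void model_custom_online_page(void *page)
{
	(void)page;
	model_custom_calls++;
}

static void model_other_online_page(void *page)
{
	(void)page;
	model_other_calls++;
}

/* The single callback slot, initially pointing at the generic handler. */
static model_callback_t model_callback = model_generic_online_page;

/* Install cb only if no custom callback is currently installed. */
static int model_set_callback(model_callback_t cb)
{
	int rc = -EINVAL;

	assert(cb != NULL);
	if (model_callback == model_generic_online_page) {
		model_callback = cb;
		rc = 0;
	}
	return rc;
}

/* Revert to the generic handler only if cb is the installed callback. */
static int model_restore_callback(model_callback_t cb)
{
	int rc = -EINVAL;

	if (model_callback == cb) {
		model_callback = model_generic_online_page;
		rc = 0;
	}
	return rc;
}
```

A second registration attempt fails with -EINVAL until the first owner restores the generic handler, so two users cannot silently overwrite each other's callback.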
|
|
|
|
|
|
|
|
int restore_online_page_callback(online_page_callback_t callback)
|
|
|
|
{
|
|
|
|
int rc = -EINVAL;
|
|
|
|
|
2014-06-05 07:07:18 +08:00
|
|
|
get_online_mems();
|
|
|
|
mutex_lock(&online_page_callback_lock);
|
2011-07-26 08:12:05 +08:00
|
|
|
|
|
|
|
if (online_page_callback == callback) {
|
|
|
|
online_page_callback = generic_online_page;
|
|
|
|
rc = 0;
|
|
|
|
}
|
|
|
|
|
2014-06-05 07:07:18 +08:00
|
|
|
mutex_unlock(&online_page_callback_lock);
|
|
|
|
put_online_mems();
|
2011-07-26 08:12:05 +08:00
|
|
|
|
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(restore_online_page_callback);
|
|
|
|
|
|
|
|
void __online_page_set_limits(struct page *page)
|
2008-04-28 17:12:03 +08:00
|
|
|
{
|
2011-07-26 08:12:05 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(__online_page_set_limits);
|
|
|
|
|
|
|
|
void __online_page_increment_counters(struct page *page)
|
|
|
|
{
|
2013-07-04 06:03:21 +08:00
|
|
|
adjust_managed_page_count(page, 1);
|
2011-07-26 08:12:05 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(__online_page_increment_counters);
|
2008-04-28 17:12:03 +08:00
|
|
|
|
2011-07-26 08:12:05 +08:00
|
|
|
void __online_page_free(struct page *page)
|
|
|
|
{
|
2013-07-04 06:03:21 +08:00
|
|
|
__free_reserved_page(page);
|
2008-04-28 17:12:03 +08:00
|
|
|
}
|
2011-07-26 08:12:05 +08:00
|
|
|
EXPORT_SYMBOL_GPL(__online_page_free);
|
|
|
|
|
|
|
|
static void generic_online_page(struct page *page)
|
|
|
|
{
|
|
|
|
__online_page_set_limits(page);
|
|
|
|
__online_page_increment_counters(page);
|
|
|
|
__online_page_free(page);
|
|
|
|
}
|
2008-04-28 17:12:03 +08:00
|
|
|
|
2007-10-16 16:26:10 +08:00
|
|
|
static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
|
|
|
|
void *arg)
|
2005-10-30 09:16:54 +08:00
|
|
|
{
|
|
|
|
unsigned long i;
|
2007-10-16 16:26:10 +08:00
|
|
|
unsigned long onlined_pages = *(unsigned long *)arg;
|
|
|
|
struct page *page;
|
2017-07-07 06:37:56 +08:00
|
|
|
|
2007-10-16 16:26:10 +08:00
|
|
|
if (PageReserved(pfn_to_page(start_pfn)))
|
|
|
|
for (i = 0; i < nr_pages; i++) {
|
|
|
|
page = pfn_to_page(start_pfn + i);
|
2011-07-26 08:12:05 +08:00
|
|
|
(*online_page_callback)(page);
|
2007-10-16 16:26:10 +08:00
|
|
|
onlined_pages++;
|
|
|
|
}
|
2017-07-07 06:37:56 +08:00
|
|
|
|
|
|
|
online_mem_sections(start_pfn, start_pfn + nr_pages);
|
|
|
|
|
2007-10-16 16:26:10 +08:00
|
|
|
*(unsigned long *)arg = onlined_pages;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2012-12-12 08:01:03 +08:00
|
|
|
/* check which state of node_states will be changed when online memory */
|
|
|
|
static void node_states_check_changes_online(unsigned long nr_pages,
|
|
|
|
struct zone *zone, struct memory_notify *arg)
|
|
|
|
{
|
|
|
|
int nid = zone_to_nid(zone);
|
|
|
|
enum zone_type zone_last = ZONE_NORMAL;
|
|
|
|
|
|
|
|
/*
|
2012-12-13 05:51:49 +08:00
|
|
|
* If we have HIGHMEM or movable node, node_states[N_NORMAL_MEMORY]
|
|
|
|
* contains nodes which have zones of 0...ZONE_NORMAL,
|
|
|
|
* set zone_last to ZONE_NORMAL.
|
2012-12-12 08:01:03 +08:00
|
|
|
*
|
2012-12-13 05:51:49 +08:00
|
|
|
* If we don't have HIGHMEM nor movable node,
|
|
|
|
* node_states[N_NORMAL_MEMORY] contains nodes which have zones of
|
|
|
|
* 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
|
2012-12-12 08:01:03 +08:00
|
|
|
*/
|
2012-12-13 05:51:49 +08:00
|
|
|
if (N_MEMORY == N_NORMAL_MEMORY)
|
2012-12-12 08:01:03 +08:00
|
|
|
zone_last = ZONE_MOVABLE;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* if the memory to be online is in a zone of 0...zone_last, and
|
|
|
|
* the zones of 0...zone_last don't have memory before online, we will
|
|
|
|
* need to set the node to node_states[N_NORMAL_MEMORY] after
|
|
|
|
* the memory is online.
|
|
|
|
*/
|
|
|
|
if (zone_idx(zone) <= zone_last && !node_state(nid, N_NORMAL_MEMORY))
|
|
|
|
arg->status_change_nid_normal = nid;
|
|
|
|
else
|
|
|
|
arg->status_change_nid_normal = -1;
|
|
|
|
|
2012-12-13 05:51:49 +08:00
|
|
|
#ifdef CONFIG_HIGHMEM
|
|
|
|
/*
|
|
|
|
* If we have movable node, node_states[N_HIGH_MEMORY]
|
|
|
|
* contains nodes which have zones of 0...ZONE_HIGHMEM,
|
|
|
|
* set zone_last to ZONE_HIGHMEM.
|
|
|
|
*
|
|
|
|
* If we don't have movable node, node_states[N_HIGH_MEMORY]
|
|
|
|
* contains nodes which have zones of 0...ZONE_MOVABLE,
|
|
|
|
* set zone_last to ZONE_MOVABLE.
|
|
|
|
*/
|
|
|
|
zone_last = ZONE_HIGHMEM;
|
|
|
|
if (N_MEMORY == N_HIGH_MEMORY)
|
|
|
|
zone_last = ZONE_MOVABLE;
|
|
|
|
|
|
|
|
if (zone_idx(zone) <= zone_last && !node_state(nid, N_HIGH_MEMORY))
|
|
|
|
arg->status_change_nid_high = nid;
|
|
|
|
else
|
|
|
|
arg->status_change_nid_high = -1;
|
|
|
|
#else
|
|
|
|
arg->status_change_nid_high = arg->status_change_nid_normal;
|
|
|
|
#endif
|
|
|
|
|
2012-12-12 08:01:03 +08:00
|
|
|
/*
|
|
|
|
* if the node doesn't have memory before online, we will need to
|
2012-12-13 05:51:49 +08:00
|
|
|
* set the node to node_states[N_MEMORY] after the memory
|
2012-12-12 08:01:03 +08:00
|
|
|
* is online.
|
|
|
|
*/
|
2012-12-13 05:51:49 +08:00
|
|
|
if (!node_state(nid, N_MEMORY))
|
2012-12-12 08:01:03 +08:00
|
|
|
arg->status_change_nid = nid;
|
|
|
|
else
|
|
|
|
arg->status_change_nid = -1;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void node_states_set_node(int node, struct memory_notify *arg)
|
|
|
|
{
|
|
|
|
if (arg->status_change_nid_normal >= 0)
|
|
|
|
node_set_state(node, N_NORMAL_MEMORY);
|
|
|
|
|
2012-12-13 05:51:49 +08:00
|
|
|
if (arg->status_change_nid_high >= 0)
|
|
|
|
node_set_state(node, N_HIGH_MEMORY);
|
|
|
|
|
|
|
|
node_set_state(node, N_MEMORY);
|
2012-12-12 08:01:03 +08:00
|
|
|
}
|
|
|
|
|
2017-07-07 06:38:11 +08:00
|
|
|
static void __meminit resize_zone_range(struct zone *zone, unsigned long start_pfn,
|
|
|
|
unsigned long nr_pages)
|
|
|
|
{
|
|
|
|
unsigned long old_end_pfn = zone_end_pfn(zone);
|
|
|
|
|
|
|
|
if (zone_is_empty(zone) || start_pfn < zone->zone_start_pfn)
|
|
|
|
zone->zone_start_pfn = start_pfn;
|
|
|
|
|
|
|
|
zone->spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - zone->zone_start_pfn;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned long start_pfn,
|
|
|
|
unsigned long nr_pages)
|
|
|
|
{
|
|
|
|
unsigned long old_end_pfn = pgdat_end_pfn(pgdat);
|
|
|
|
|
|
|
|
if (!pgdat->node_spanned_pages || start_pfn < pgdat->node_start_pfn)
|
|
|
|
pgdat->node_start_pfn = start_pfn;
|
|
|
|
|
|
|
|
pgdat->node_spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - pgdat->node_start_pfn;
|
|
|
|
}
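The arithmetic shared by resize_zone_range() and resize_pgdat_range() can be tried out in isolation. The sketch below is a userspace illustration only: `struct span`, `span_end()` and `span_grow()` are hypothetical names that mimic the start/spanned pair, and an empty span is assumed to be one with `spanned_pages == 0`.

```c
#include <assert.h>

/* Hypothetical stand-in for the (start pfn, spanned pages) pair. */
struct span {
	unsigned long start_pfn;
	unsigned long spanned_pages;	/* 0 means the span is empty */
};

static unsigned long span_end(const struct span *s)
{
	return s->start_pfn + s->spanned_pages;
}

/* Grow the span so that it also covers [start_pfn, start_pfn + nr_pages). */
static void span_grow(struct span *s, unsigned long start_pfn,
		      unsigned long nr_pages)
{
	unsigned long old_end = span_end(s);	/* sampled before start moves */
	unsigned long new_end = start_pfn + nr_pages;

	if (s->spanned_pages == 0 || start_pfn < s->start_pfn)
		s->start_pfn = start_pfn;

	s->spanned_pages = (new_end > old_end ? new_end : old_end) - s->start_pfn;
}
```

Note that the span only ever grows and may include holes: adding a range that is not adjacent to the existing one makes the span cover the gap as well.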
|
|
|
|
|
2017-12-29 15:53:57 +08:00
|
|
|
void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
|
|
|
|
unsigned long nr_pages, struct vmem_altmap *altmap)
|
2017-07-07 06:38:11 +08:00
|
|
|
{
|
|
|
|
struct pglist_data *pgdat = zone->zone_pgdat;
|
|
|
|
int nid = pgdat->node_id;
|
|
|
|
unsigned long flags;
|
2016-07-27 06:22:23 +08:00
|
|
|
|
2017-07-07 06:38:11 +08:00
|
|
|
if (zone_is_empty(zone))
|
|
|
|
init_currently_empty_zone(zone, start_pfn, nr_pages);
|
2016-07-27 06:22:23 +08:00
|
|
|
|
2017-07-07 06:38:11 +08:00
|
|
|
clear_zone_contiguous(zone);
|
|
|
|
|
|
|
|
/* TODO Huh pgdat is irqsave while zone is not. It used to be like that before */
|
|
|
|
pgdat_resize_lock(pgdat, &flags);
|
|
|
|
zone_span_writelock(zone);
|
|
|
|
resize_zone_range(zone, start_pfn, nr_pages);
|
|
|
|
zone_span_writeunlock(zone);
|
|
|
|
resize_pgdat_range(pgdat, start_pfn, nr_pages);
|
|
|
|
pgdat_resize_unlock(pgdat, &flags);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* TODO now we have a visible range of pages which are not associated
|
|
|
|
* with their zone properly. Not nice but set_pfnblock_flags_mask
|
|
|
|
* expects the zone spans the pfn range. All the pages in the range
|
|
|
|
* are reserved so nobody should be touching them so we should be safe
|
|
|
|
*/
|
2017-12-29 15:53:57 +08:00
|
|
|
memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn,
|
|
|
|
MEMMAP_HOTPLUG, altmap);
|
2017-07-07 06:38:11 +08:00
|
|
|
|
|
|
|
set_zone_contiguous(zone);
|
|
|
|
}
|
|
|
|
|
2017-07-07 06:38:18 +08:00
|
|
|
/*
|
|
|
|
* Returns a default kernel memory zone for the given pfn range.
|
|
|
|
* If no kernel zone covers this pfn range it will automatically go
|
|
|
|
* to the ZONE_NORMAL.
|
|
|
|
*/
|
2017-09-07 07:19:40 +08:00
|
|
|
static struct zone *default_kernel_zone_for_pfn(int nid, unsigned long start_pfn,
|
2017-07-07 06:38:18 +08:00
|
|
|
unsigned long nr_pages)
|
|
|
|
{
|
|
|
|
struct pglist_data *pgdat = NODE_DATA(nid);
|
|
|
|
int zid;
|
|
|
|
|
|
|
|
for (zid = 0; zid <= ZONE_NORMAL; zid++) {
|
|
|
|
struct zone *zone = &pgdat->node_zones[zid];
|
|
|
|
|
|
|
|
if (zone_intersects(zone, start_pfn, nr_pages))
|
|
|
|
return zone;
|
|
|
|
}
|
|
|
|
|
|
|
|
return &pgdat->node_zones[ZONE_NORMAL];
|
|
|
|
}
|
|
|
|
|
2017-09-07 07:19:40 +08:00
|
|
|
static inline struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn,
|
|
|
|
unsigned long nr_pages)
|
2017-09-07 07:19:37 +08:00
|
|
|
{
|
2017-09-07 07:19:40 +08:00
|
|
|
struct zone *kernel_zone = default_kernel_zone_for_pfn(nid, start_pfn,
|
|
|
|
nr_pages);
|
|
|
|
struct zone *movable_zone = &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
|
|
|
|
bool in_kernel = zone_intersects(kernel_zone, start_pfn, nr_pages);
|
|
|
|
bool in_movable = zone_intersects(movable_zone, start_pfn, nr_pages);
|
2017-09-07 07:19:37 +08:00
|
|
|
|
|
|
|
/*
|
2017-09-07 07:19:40 +08:00
|
|
|
* We inherit the existing zone in a simple case where zones do not
|
|
|
|
* overlap in the given range
|
2017-09-07 07:19:37 +08:00
|
|
|
*/
|
2017-09-07 07:19:40 +08:00
|
|
|
if (in_kernel ^ in_movable)
|
|
|
|
return (in_kernel) ? kernel_zone : movable_zone;
|
2017-07-11 06:48:37 +08:00
|
|
|
|
2017-09-07 07:19:40 +08:00
|
|
|
/*
|
|
|
|
* If the range doesn't belong to any zone or two zones overlap in the
|
|
|
|
* given range then we use movable zone only if movable_node is
|
|
|
|
* enabled because we always online to a kernel zone by default.
|
|
|
|
*/
|
|
|
|
return movable_node_enabled ? movable_zone : kernel_zone;
|
2017-07-11 06:48:37 +08:00
|
|
|
}
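The decision made in default_zone_for_pfn() reduces to a small truth table over three booleans. The sketch below restates it in standalone form; `enum zone_choice` and `pick_default_zone()` are hypothetical names introduced for illustration, and the three inputs are assumed to be computed as in the function above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result type: which of the two candidate zones is chosen. */
enum zone_choice { CHOICE_KERNEL, CHOICE_MOVABLE };

static enum zone_choice pick_default_zone(bool in_kernel, bool in_movable,
					  bool movable_node_enabled)
{
	/* Exactly one zone intersects the range: inherit that zone. */
	if (in_kernel != in_movable)
		return in_kernel ? CHOICE_KERNEL : CHOICE_MOVABLE;

	/* Neither or both intersect: kernel zone unless movable_node is set. */
	return movable_node_enabled ? CHOICE_MOVABLE : CHOICE_KERNEL;
}
```

For two booleans, `!=` is equivalent to the `^` used in the original, so the movable_node setting only matters in the ambiguous cases.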
|
|
|
|
|
2017-09-07 07:19:37 +08:00
|
|
|
struct zone * zone_for_pfn_range(int online_type, int nid, unsigned long start_pfn,
|
|
|
|
unsigned long nr_pages)
|
2017-07-07 06:38:11 +08:00
|
|
|
{
|
2017-09-07 07:19:40 +08:00
|
|
|
if (online_type == MMOP_ONLINE_KERNEL)
|
|
|
|
return default_kernel_zone_for_pfn(nid, start_pfn, nr_pages);
|
2017-07-07 06:38:11 +08:00
|
|
|
|
2017-09-07 07:19:40 +08:00
|
|
|
if (online_type == MMOP_ONLINE_MOVABLE)
|
|
|
|
return &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
|
2016-07-27 06:22:23 +08:00
|
|
|
|
2017-09-07 07:19:40 +08:00
|
|
|
return default_zone_for_pfn(nid, start_pfn, nr_pages);
|
2017-09-07 07:19:37 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Associates the given pfn range with the given node and the zone appropriate
|
|
|
|
* for the given online type.
|
|
|
|
*/
|
|
|
|
static struct zone * __meminit move_pfn_range(int online_type, int nid,
|
|
|
|
unsigned long start_pfn, unsigned long nr_pages)
|
|
|
|
{
|
|
|
|
struct zone *zone;
|
|
|
|
|
|
|
|
zone = zone_for_pfn_range(online_type, nid, start_pfn, nr_pages);
|
2017-12-29 15:53:57 +08:00
|
|
|
move_pfn_range_to_zone(zone, start_pfn, nr_pages, NULL);
|
2017-07-07 06:38:11 +08:00
|
|
|
return zone;
|
2016-07-27 06:22:23 +08:00
|
|
|
}
|
2007-10-16 16:26:10 +08:00
|
|
|
|
2017-09-07 07:20:37 +08:00
|
|
|
/* Must be protected by mem_hotplug_begin() or a device_lock */
|
mm, memory-hotplug: dynamic configure movable memory and portion memory
Add online_movable and online_kernel for logical memory hotplug. This is
the dynamic version of "movablecore" & "kernelcore".
The motivation is the same as for "movablecore" & "kernelcore", but
the configuration is dynamic, i.e. done at run time:
o We can configure memory as kernelcore or movablecore after boot.
When the userspace workload increases and we need more hugepages, we can use
"online_movable" to add memory and allow the system to use more
THP (transparent huge pages), and vice versa when the kernel workload increases.
This also helps virtualization to dynamically configure host/guest memory,
to save memory (reduce waste).
Memory capacity on Demand
o When a new node is physically onlined after boot, we need to use
"online_movable" or "online_kernel" to configure/portion it as
expected when we logically online it.
This configuration also helps physical memory migration.
o All the benefits are the same as for the existing "movablecore" & "kernelcore".
o Preparing for movable-node, which is very important for power saving,
hardware partitioning and high-availability systems (hardware fault
management).
(Note, we don't introduce movable-node here.)
Action behavior:
When a memoryblock/memorysection is onlined by "online_movable", the kernel
will not hold direct references to the pages of the memoryblock,
thus we can remove that memory at any time when needed.
When it is onlined by "online_kernel", the kernel can use it.
When it is onlined by "online", the zone type is not changed.
Current constraints:
Only a memoryblock which is adjacent to ZONE_MOVABLE
can be onlined from ZONE_NORMAL to ZONE_MOVABLE.
[akpm@linux-foundation.org: use min_t, cleanups]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
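From userspace, the three online modes named in this message are selected by writing the mode string to a memory block's state file. The sketch below assumes the sysfs layout `/sys/devices/system/memory/memoryN/state`, which is not stated in this message; the helper names `memblock_state_path()`, `memblock_mode_valid()` and `memblock_online()` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the (assumed) sysfs path of a block's state file; 0 on success. */
static int memblock_state_path(char *buf, size_t len, unsigned int block)
{
	int n = snprintf(buf, len,
			 "/sys/devices/system/memory/memory%u/state", block);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Accept only the three online modes named in the commit message. */
static int memblock_mode_valid(const char *mode)
{
	return !strcmp(mode, "online") ||
	       !strcmp(mode, "online_kernel") ||
	       !strcmp(mode, "online_movable");
}

/* Request onlining of one block; needs root and a hotplug-capable kernel. */
static int memblock_online(unsigned int block, const char *mode)
{
	char path[64];
	FILE *f;
	int rc;

	if (!memblock_mode_valid(mode) ||
	    memblock_state_path(path, sizeof(path), block))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	rc = fputs(mode, f) < 0 ? -1 : 0;
	if (fclose(f) != 0)	/* the kernel reports errors when the write is flushed */
		rc = -1;
	return rc;
}
```

A call such as `memblock_online(32, "online_movable")` would then ask the kernel to place block 32 in ZONE_MOVABLE, subject to the constraints listed above.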
2012-12-12 08:03:16 +08:00
|
|
|
int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_type)
|
2007-10-16 16:26:10 +08:00
|
|
|
{
|
2013-07-04 06:02:10 +08:00
|
|
|
unsigned long flags;
|
2005-10-30 09:16:54 +08:00
|
|
|
unsigned long onlined_pages = 0;
|
|
|
|
struct zone *zone;
|
2006-06-23 17:03:11 +08:00
|
|
|
int need_zonelists_rebuild = 0;
|
2007-10-22 07:41:36 +08:00
|
|
|
int nid;
|
|
|
|
int ret;
|
|
|
|
struct memory_notify arg;
|
2018-04-06 07:23:00 +08:00
|
|
|
struct memory_block *mem;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We can't use pfn_to_nid() because nid might be stored in struct page
|
|
|
|
* which is not yet initialized. Instead, we find nid from memory block.
|
|
|
|
*/
|
|
|
|
mem = find_memory_block(__pfn_to_section(pfn));
|
|
|
|
nid = mem->nid;
|
2007-10-22 07:41:36 +08:00
|
|
|
|
2017-07-07 06:38:11 +08:00
|
|
|
/* associate pfn range with the zone */
|
|
|
|
zone = move_pfn_range(online_type, nid, pfn, nr_pages);
|
|
|
|
|
2007-10-22 07:41:36 +08:00
|
|
|
arg.start_pfn = pfn;
|
|
|
|
arg.nr_pages = nr_pages;
|
2012-12-12 08:01:03 +08:00
|
|
|
node_states_check_changes_online(nr_pages, zone, &arg);
|
2007-10-22 07:41:36 +08:00
|
|
|
|
|
|
|
ret = memory_notify(MEM_GOING_ONLINE, &arg);
|
|
|
|
ret = notifier_to_errno(ret);
|
2016-03-18 05:19:35 +08:00
|
|
|
if (ret)
|
|
|
|
goto failed_addition;
|
|
|
|
|
2006-06-23 17:03:11 +08:00
|
|
|
/*
|
|
|
|
* If this zone is not populated, then it is not in zonelist.
|
|
|
|
* This means the page allocator ignores this zone.
|
|
|
|
* So, zonelist must be updated after online.
|
|
|
|
*/
|
2012-12-12 08:01:01 +08:00
|
|
|
if (!populated_zone(zone)) {
|
2006-06-23 17:03:11 +08:00
|
|
|
need_zonelists_rebuild = 1;
|
2017-09-07 07:20:24 +08:00
|
|
|
setup_zone_pageset(zone);
|
2012-12-12 08:01:01 +08:00
|
|
|
}
|
2006-06-23 17:03:11 +08:00
|
|
|
|
2009-09-23 07:45:46 +08:00
|
|
|
ret = walk_system_ram_range(pfn, nr_pages, &onlined_pages,
|
2007-10-16 16:26:10 +08:00
|
|
|
online_pages_range);
|
2008-05-15 07:05:50 +08:00
|
|
|
if (ret) {
|
2012-12-12 08:01:01 +08:00
|
|
|
if (need_zonelists_rebuild)
|
|
|
|
zone_pcp_reset(zone);
|
2016-03-18 05:19:35 +08:00
|
|
|
goto failed_addition;
|
2008-05-15 07:05:50 +08:00
|
|
|
}
|
|
|
|
|
2005-10-30 09:16:54 +08:00
|
|
|
zone->present_pages += onlined_pages;
|
2013-07-04 06:02:10 +08:00
|
|
|
|
|
|
|
pgdat_resize_lock(zone->zone_pgdat, &flags);
|
2006-03-10 09:33:51 +08:00
|
|
|
zone->zone_pgdat->node_present_pages += onlined_pages;
|
2013-07-04 06:02:10 +08:00
|
|
|
pgdat_resize_unlock(zone->zone_pgdat, &flags);
|
|
|
|
|
2012-08-01 07:43:30 +08:00
|
|
|
if (onlined_pages) {
|
2016-03-18 05:18:12 +08:00
|
|
|
node_states_set_node(nid, &arg);
|
2012-08-01 07:43:30 +08:00
|
|
|
if (need_zonelists_rebuild)
|
2017-09-07 07:20:24 +08:00
|
|
|
build_all_zonelists(NULL);
|
2012-08-01 07:43:30 +08:00
|
|
|
else
|
|
|
|
zone_pcp_update(zone);
|
|
|
|
}
|
2005-10-30 09:16:54 +08:00
|
|
|
|
2011-05-25 08:11:32 +08:00
|
|
|
init_per_zone_wmark_min();
|
|
|
|
|
mm, compaction: introduce kcompactd
Memory compaction can be currently performed in several contexts:
- kswapd balancing a zone after a high-order allocation failure
- direct compaction to satisfy a high-order allocation, including THP
page fault attempts
- khugepaged trying to collapse a hugepage
- manually from /proc
The purpose of compaction is two-fold. The obvious purpose is to
satisfy a (pending or future) high-order allocation, and is easy to
evaluate. The other purpose is to keep overall memory fragmentation low
and help the anti-fragmentation mechanism. The success wrt the latter
purpose is more
The current situation wrt the purposes has a few drawbacks:
- compaction is invoked only when a high-order page or hugepage is not
available (or manually). This might be too late for the purposes of
keeping memory fragmentation low.
- direct compaction increases latency of allocations. Again, it would
be better if compaction was performed asynchronously to keep
fragmentation low, before the allocation itself comes.
- (a special case of the previous) the cost of compaction during THP
page faults can easily offset the benefits of THP.
- kswapd compaction appears to be complex, fragile and not working in
some scenarios. It could also end up compacting for a high-order
allocation request when it should be reclaiming memory for a later
order-0 request.
To improve the situation, we should be able to benefit from an
equivalent of kswapd, but for compaction - i.e. a background thread
which responds to fragmentation and the need for high-order allocations
(including hugepages) somewhat proactively.
One possibility is to extend the responsibilities of kswapd, which could
however complicate its design too much. It should be better to let
kswapd handle reclaim, as order-0 allocations are often more critical
than high-order ones.
Another possibility is to extend khugepaged, but this kthread is a
single instance and tied to THP configs.
This patch goes with the option of a new set of per-node kthreads called
kcompactd, and lays the foundations, without introducing any new
tunables. The lifecycle mimics kswapd kthreads, including the memory
hotplug hooks.
For compaction, kcompactd uses the standard compaction_suitable() and
compact_finished() criteria and the deferred compaction functionality.
Unlike direct compaction, it uses only sync compaction, as there's no
allocation latency to minimize.
This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
compact/reclaim loop for high-order pages will be replaced by waking up
kcompactd in the next patch with the description of what's wrong with
the old approach.
Waking up of the kcompactd threads is also tied to kswapd activity and
follows these rules:
- we don't want to affect any fastpaths, so wake up kcompactd only from
the slowpath, as it's done for kswapd
- if kswapd is doing reclaim, it's more important than compaction, so
don't invoke kcompactd until kswapd goes to sleep
- the target order used for kswapd is passed to kcompactd
Possible future uses for kcompactd include the ability to wake up
kcompactd on demand in special situations, such as when hugepages are
not available (currently not done due to __GFP_NO_KSWAPD) or when a
fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also
possible to perform periodic compaction with kcompactd.
[arnd@arndb.de: fix build errors with kcompactd]
[paul.gortmaker@windriver.com: don't use modular references for non modular code]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-18 05:18:08 +08:00
|
|
|
if (onlined_pages) {
|
2016-03-18 05:18:12 +08:00
|
|
|
kswapd_run(nid);
|
2016-03-18 05:18:08 +08:00
|
|
|
kcompactd_run(nid);
|
|
|
|
}
|
2005-10-30 09:16:56 +08:00
|
|
|
|
2010-05-25 05:32:51 +08:00
|
|
|
vm_total_pages = nr_free_pagecache_pages();
|
2008-07-24 12:28:18 +08:00
|
|
|
|
2006-09-29 17:01:25 +08:00
|
|
|
writeback_set_ratelimit();
|
2007-10-22 07:41:36 +08:00
|
|
|
|
|
|
|
if (onlined_pages)
|
|
|
|
memory_notify(MEM_ONLINE, &arg);
|
2015-04-15 06:45:11 +08:00
|
|
|
return 0;
|
2016-03-18 05:19:35 +08:00
|
|
|
|
|
|
|
failed_addition:
|
|
|
|
pr_debug("online_pages [mem %#010llx-%#010llx] failed\n",
|
|
|
|
(unsigned long long) pfn << PAGE_SHIFT,
|
|
|
|
(((unsigned long long) pfn + nr_pages) << PAGE_SHIFT) - 1);
|
|
|
|
memory_notify(MEM_CANCEL_ONLINE, &arg);
|
|
|
|
return ret;
|
2005-10-30 09:16:54 +08:00
|
|
|
}
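The failure message in online_pages() converts a pfn range into an inclusive physical byte range. The conversion can be sketched on its own as follows; `SKETCH_PAGE_SHIFT` is an assumed value of 12 (4 KiB pages), and the two helper names are hypothetical.

```c
#include <assert.h>

/* Assumption for illustration: 4 KiB pages, i.e. a page shift of 12. */
#define SKETCH_PAGE_SHIFT 12

/* First byte covered by the range starting at pfn. */
static unsigned long long range_first_byte(unsigned long pfn)
{
	/* Widen before shifting so the result cannot overflow a 32-bit long. */
	return (unsigned long long)pfn << SKETCH_PAGE_SHIFT;
}

/* Last byte covered by [pfn, pfn + nr_pages), hence the trailing "- 1". */
static unsigned long long range_last_byte(unsigned long pfn,
					  unsigned long nr_pages)
{
	return (((unsigned long long)pfn + nr_pages) << SKETCH_PAGE_SHIFT) - 1;
}
```

The cast to `unsigned long long` happens before the addition and the shift, which is what keeps the printed addresses correct on configurations where `unsigned long` is 32 bits wide.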
|
2006-10-01 14:27:08 +08:00
|
|
|
#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
|
2006-06-27 17:53:30 +08:00
|
|
|
|
2014-11-14 07:19:41 +08:00
|
|
|
static void reset_node_present_pages(pg_data_t *pgdat)
|
|
|
|
{
|
|
|
|
struct zone *z;
|
|
|
|
|
|
|
|
for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
|
|
|
|
z->present_pages = 0;
|
|
|
|
|
|
|
|
pgdat->node_present_pages = 0;
|
|
|
|
}
|
|
|
|
|
2009-11-18 06:06:18 +08:00
|
|
|
/* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
|
|
|
|
static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
because its update phase is different.)
Note:
To share as much common code as possible,
there are 2 changes from v2.
- The old add_memory(), which is defined by each arch,
is renamed to arch_add_memory(). The new add_memory() becomes
the caller of the arch dependent function as common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
The reason is that similar code for finding the node id from a
physical address was inside the old add_memory() on each arch.
In addition, the acpi memory hotplug driver can find the node id more easily.
In v2, it must walk DSDT'S _CRS by matching the physical address to
get the handle of its memory device, then get _PXM and the node id,
because the input is just a physical address.
However, in v3, the acpi driver can use the handle to get _PXM and the node id
for the new memory device. It can pass just the node id to add_memory().
The fix for the interface of arch_add_memory() is in the next patch.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
{
|
|
|
|
struct pglist_data *pgdat;
|
|
|
|
unsigned long zones_size[MAX_NR_ZONES] = {0};
|
|
|
|
unsigned long zholes_size[MAX_NR_ZONES] = {0};
|
2014-06-05 07:07:51 +08:00
|
|
|
unsigned long start_pfn = PFN_DOWN(start);
|
2006-06-27 17:53:34 +08:00
|
|
|
|
2013-02-23 08:33:18 +08:00
|
|
|
pgdat = NODE_DATA(nid);
|
|
|
|
if (!pgdat) {
|
|
|
|
pgdat = arch_alloc_nodedata(nid);
|
|
|
|
if (!pgdat)
|
|
|
|
return NULL;
|
2006-06-27 17:53:34 +08:00
|
|
|
|
2013-02-23 08:33:18 +08:00
|
|
|
arch_refresh_nodedata(nid, pgdat);
|
mm/memory hotplug: postpone the reset of obsolete pgdat
Qiu Xishi reported the following BUG when testing hot-add/hot-remove node under
stress condition:
BUG: unable to handle kernel paging request at 0000000000025f60
IP: next_online_pgdat+0x1/0x50
PGD 0
Oops: 0000 [#1] SMP
ACPI: Device does not support D3cold
Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb dca i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core iTCO_vendor_support tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad rtc_cmos button ext3 jbd mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh ahci libahci libata scsi_mod [last unloaded: rasf]
CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G O 3.10.15-5885-euler0302 #1
Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 03/02/2015
Workqueue: events vmstat_update
task: ffffa800d32c0000 ti: ffffa800d32ae000 task.ti: ffffa800d32ae000
RIP: 0010: next_online_pgdat+0x1/0x50
RSP: 0018:ffffa800d32afce8 EFLAGS: 00010286
RAX: 0000000000001440 RBX: ffffffff81da53b8 RCX: 0000000000000082
RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000000
RBP: ffffa800d32afd28 R08: ffffffff81c93bfc R09: ffffffff81cbdc96
R10: 00000000000040ec R11: 00000000000000a0 R12: ffffa800fffb3440
R13: ffffa800d32afd38 R14: 0000000000000017 R15: ffffa800e6616800
FS: 0000000000000000(0000) GS:ffffa800e6600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000025f60 CR3: 0000000001a0b000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
refresh_cpu_vm_stats+0xd0/0x140
vmstat_update+0x11/0x50
process_one_work+0x194/0x3d0
worker_thread+0x12b/0x410
kthread+0xc6/0xd0
ret_from_fork+0x7c/0xb0
The cause is the "memset(pgdat, 0, sizeof(*pgdat))" at the end of
try_offline_node(), which resets the whole pgdat to 0. Because the
pgdat is accessed lock-free, users that are still using it, such as
the vmstat_update routine, will panic.
process A:                              offline node XX:
vmstat_update()
  refresh_cpu_vm_stats()
    for_each_populated_zone()
      find online node XX
    cond_resched()
                                        offline cpu and memory, then try_offline_node()
                                        node_set_offline(nid), and memset(pgdat, 0, sizeof(*pgdat))
      zone = next_zone(zone)
        pg_data_t *pgdat = zone->zone_pgdat; // here pgdat is NULL now
          next_online_pgdat(pgdat)
            next_online_node(pgdat->node_id); // NULL pointer access
So the solution is to postpone the reset of the obsolete pgdat from
try_offline_node() to hotadd_new_pgdat(), and to reset only
pgdat->nr_zones and pgdat->classzone_idx to 0 instead of calling
memset(), which avoids destroying the pointer information in pgdat.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reported-by: Xishi Qiu <qiuxishi@huawei.com>
Suggested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-26 06:55:20 +08:00
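The difference between the two reset strategies described in the commit message above can be illustrated with a user-space model. This is a sketch under assumptions: `struct pgdat_model`, `struct zone_model` and the `model_*` functions are hypothetical and only mimic the relevant fields (`zone_pgdat`, `nr_zones`, `classzone_idx`, `node_id`); they are not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_ZONES 3

struct pgdat_model;

/* Hypothetical stand-in for struct zone: keeps a back-pointer to its node. */
struct zone_model {
	struct pgdat_model *zone_pgdat;
};

/* Hypothetical stand-in for pg_data_t. */
struct pgdat_model {
	struct zone_model zones[MODEL_MAX_ZONES];
	int nr_zones;
	int classzone_idx;
	int node_id;
};

/* Set up a populated node whose zones all point back to it. */
void model_init(struct pgdat_model *p, int nid)
{
	memset(p, 0, sizeof(*p));
	p->node_id = nid;
	p->nr_zones = MODEL_MAX_ZONES;
	for (int i = 0; i < MODEL_MAX_ZONES; i++)
		p->zones[i].zone_pgdat = p;
}

/* Old behaviour at offline time: wipes the back-pointers as well. */
void model_reset_memset(struct pgdat_model *p)
{
	memset(p, 0, sizeof(*p));
}

/* Behaviour after the patch: clear only the counters, at re-add time. */
void model_reset_fields(struct pgdat_model *p)
{
	assert(p != NULL);
	p->nr_zones = 0;
	p->classzone_idx = 0;
}
```

A lock-free reader that still holds a zone pointer can keep following `zone_pgdat` after `model_reset_fields()`, whereas after `model_reset_memset()` the pointer is gone, which corresponds to the crash in the trace.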
|
|
|
} else {
|
mm, vmscan: prevent kswapd sleeping prematurely due to mismatched classzone_idx
kswapd is woken to reclaim a node based on a failed allocation request
from any eligible zone. Once reclaiming in balance_pgdat(), it will
continue reclaiming until there is an eligible zone available for the
zone it was woken for. kswapd tracks what zone it was recently woken
for in pgdat->kswapd_classzone_idx. If it has not been woken recently,
this zone will be 0.
However, the decision on whether to sleep is made on
kswapd_classzone_idx which is 0 without a recent wakeup request and that
classzone does not account for lowmem reserves. This allows kswapd to
sleep when a low small zone such as ZONE_DMA is balanced for a GFP_DMA
request even if a stream of allocations cannot use that zone. While
kswapd may be woken again shortly in the near future there are two
consequences -- the pgdat bits that control congestion are cleared
prematurely and direct reclaim is more likely as kswapd slept
prematurely.
This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an
invalid index) when there have been no recent wakeups. If there are no
wakeups, it'll decide whether to sleep based on the highest possible
zone available (MAX_NR_ZONES - 1). It then becomes critical that the
"pgdat balanced" decisions during reclaim and when deciding to sleep are
the same. If there is a mismatch, kswapd can stay awake continually
trying to balance tiny zones.
simoop was used to evaluate it again. Two of the preparation patches
regressed the workload so they are included as the second set of
results. Otherwise this patch looks artificially excellent
4.11.0-rc1 4.11.0-rc1 4.11.0-rc1
vanilla clear-v2 keepawake-v2
Amean p50-Read 21670074.18 ( 0.00%) 19786774.76 ( 8.69%) 22668332.52 ( -4.61%)
Amean p95-Read 25456267.64 ( 0.00%) 24101956.27 ( 5.32%) 26738688.00 ( -5.04%)
Amean p99-Read 29369064.73 ( 0.00%) 27691872.71 ( 5.71%) 30991404.52 ( -5.52%)
Amean p50-Write 1390.30 ( 0.00%) 1011.91 ( 27.22%) 924.91 ( 33.47%)
Amean p95-Write 412901.57 ( 0.00%) 34874.98 ( 91.55%) 1362.62 ( 99.67%)
Amean p99-Write 6668722.09 ( 0.00%) 575449.60 ( 91.37%) 16854.04 ( 99.75%)
Amean p50-Allocation 78714.31 ( 0.00%) 84246.26 ( -7.03%) 74729.74 ( 5.06%)
Amean p95-Allocation 175533.51 ( 0.00%) 400058.43 (-127.91%) 101609.74 ( 42.11%)
Amean p99-Allocation 247003.02 ( 0.00%) 10905600.00 (-4315.17%) 125765.57 ( 49.08%)
With this patch on top, write and allocation latencies are massively
improved. The read latencies are slightly impaired but it's worth
noting that this is mostly due to the IO scheduler and not directly
related to reclaim. The vmstats are a bit of a mix but the relevant
ones are as follows;
4.10.0-rc7 4.10.0-rc7 4.10.0-rc7
mmots-20170209 clear-v1r25 keepawake-v1r25
Swap Ins 0 0 0
Swap Outs 0 608 0
Direct pages scanned 6910672 3132699 6357298
Kswapd pages scanned 57036946 82488665 56986286
Kswapd pages reclaimed 55993488 63474329 55939113
Direct pages reclaimed 6905990 2964843 6352115
Kswapd efficiency 98% 76% 98%
Kswapd velocity 12494.375 17597.507 12488.065
Direct efficiency 99% 94% 99%
Direct velocity 1513.835 668.306 1393.148
Page writes by reclaim 0.000 4410243.000 0.000
Page writes file 0 4409635 0
Page writes anon 0 608 0
Page reclaim immediate 1036792 14175203 1042571
4.11.0-rc1 4.11.0-rc1 4.11.0-rc1
vanilla clear-v2 keepawake-v2
Swap Ins 0 12 0
Swap Outs 0 838 0
Direct pages scanned 6579706 3237270 6256811
Kswapd pages scanned 61853702 79961486 54837791
Kswapd pages reclaimed 60768764 60755788 53849586
Direct pages reclaimed 6579055 2987453 6256151
Kswapd efficiency 98% 75% 98%
Page writes by reclaim 0.000 4389496.000 0.000
Page writes file 0 4388658 0
Page writes anon 0 838 0
Page reclaim immediate 1073573 14473009 982507
Swap-outs are equivalent to baseline.
Direct reclaim is reduced but not eliminated. It's worth noting that
there are two periods of direct reclaim for this workload. The first is
when it switches from preparing the files for the actual test itself.
It's a lot of file IO followed by a lot of allocs that reclaims heavily
for a brief window. While direct reclaim is lower with clear-v2, it is
due to kswapd scanning aggressively and trying to reclaim the world
which is not the right thing to do. With the patches applied, there is
still direct reclaim but the phase change from "creating work files" to
starting multiple threads that allocate a lot of anonymous memory faster
than kswapd can reclaim.
Scanning/reclaim efficiency is restored by this patch.
Page writes from reclaim context are back at 0 which is ideal.
Pages immediately reclaimed after IO completes is slightly improved but
it is expected this will vary slightly.
On UMA, there is almost no change so this is not expected to be a
universal win.
[mgorman@suse.de: fix ->kswapd_classzone_idx initialization]
Link: http://lkml.kernel.org/r/20170406174538.5msrznj6nt6qpbx5@suse.de
Link: http://lkml.kernel.org/r/20170309075657.25121-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shantanu Goel <sgoel01@yahoo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-04 05:53:45 +08:00
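The sentinel convention described in the commit message above (MAX_NR_ZONES means "no recent wakeup", and the sleep decision then uses the highest possible zone) can be sketched as follows. This is a minimal illustration, not the kernel's implementation: `MODEL_MAX_NR_ZONES` and `model_sleep_classzone_idx()` are hypothetical names, and the zone count of 4 is an arbitrary assumption.

```c
#include <assert.h>

#define MODEL_MAX_NR_ZONES 4	/* assumed zone count for this model */

/*
 * Hypothetical helper: pick the zone index used for the sleep decision.
 * MODEL_MAX_NR_ZONES is an invalid index that marks "no recent wakeup".
 */
int model_sleep_classzone_idx(int kswapd_classzone_idx)
{
	assert(kswapd_classzone_idx >= 0 &&
	       kswapd_classzone_idx <= MODEL_MAX_NR_ZONES);

	if (kswapd_classzone_idx == MODEL_MAX_NR_ZONES)
		return MODEL_MAX_NR_ZONES - 1;	/* highest possible zone */
	return kswapd_classzone_idx;		/* zone of the recent wakeup */
}
```

With the previous default of 0, a node without a recent wakeup would have been judged against the lowest zone only, which is the premature-sleep case the commit message describes.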
|
|
|
/*
|
|
|
|
* Reset the nr_zones, order and classzone_idx before reuse.
|
|
|
|
* Note that kswapd will init kswapd_classzone_idx properly
|
|
|
|
* when it starts in the near future.
|
|
|
|
*/
|
2015-03-26 06:55:20 +08:00
|
|
|
pgdat->nr_zones = 0;
|
2016-07-29 06:45:49 +08:00
|
|
|
pgdat->kswapd_order = 0;
|
|
|
|
pgdat->kswapd_classzone_idx = 0;
|
2013-02-23 08:33:18 +08:00
|
|
|
}
|
2006-06-27 17:53:34 +08:00
|
|
|
|
|
|
|
/* we can use NODE_DATA(nid) from here */
|
|
|
|
|
|
|
|
/* init node's zones as empty zones, we don't have any present pages.*/
|
2008-07-24 12:27:20 +08:00
|
|
|
free_area_init_node(nid, zones_size, start_pfn, zholes_size);
|
mm/memory_hotplug.c: initialize per_cpu_nodestats for hotadded pgdats
The following oops occurs after a pgdat is hotadded:
Unable to handle kernel paging request for data at address 0x00c30001
Faulting instruction address: 0xc00000000022f8f4
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter nls_utf8 isofs sg virtio_balloon uio_pdrv_genirq uio ip_tables xfs libcrc32c sr_mod cdrom sd_mod virtio_net ibmvscsi scsi_transport_srp virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 4.8.0-rc1-device #110
task: c000000000ef3080 task.stack: c000000000f6c000
NIP: c00000000022f8f4 LR: c00000000022f948 CTR: 0000000000000000
REGS: c000000000f6fa50 TRAP: 0300 Tainted: G W (4.8.0-rc1-device)
MSR: 800000010280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 84002028 XER: 20000000
CFAR: d000000001d2013c DAR: 0000000000c30001 DSISR: 40000000 SOFTE: 0
NIP refresh_cpu_vm_stats+0x1a4/0x2f0
LR refresh_cpu_vm_stats+0x1f8/0x2f0
Call Trace:
refresh_cpu_vm_stats+0x1f8/0x2f0 (unreliable)
Add per_cpu_nodestats initialization to the hotplug codepath.
Link: http://lkml.kernel.org/r/1470931473-7090-1-git-send-email-arbab@linux.vnet.ibm.com
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-12 06:33:12 +08:00
|
|
|
pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat);
|
2006-06-27 17:53:34 +08:00
|
|
|
|
2011-06-16 06:08:38 +08:00
|
|
|
/*
|
|
|
|
* The node we allocated has no zone fallback lists. To avoid
|
|
|
|
* accessing an uninitialized zonelist, build them here.
|
|
|
|
*/
|
2017-09-07 07:20:24 +08:00
|
|
|
build_all_zonelists(pgdat);
|
2011-06-16 06:08:38 +08:00
|
|
|
|
2014-11-14 07:19:39 +08:00
|
|
|
/*
|
|
|
|
* zone->managed_pages is set to an approximate value in
|
|
|
|
* free_area_init_core(), which will cause
|
|
|
|
* /sys/devices/system/node/nodeX/meminfo to report wrong data.
|
|
|
|
* So reset it to 0 before any memory is onlined.
|
|
|
|
*/
|
|
|
|
reset_node_managed_pages(pgdat);
|
|
|
|
|
2014-11-14 07:19:41 +08:00
|
|
|
/*
|
|
|
|
* When memory is hot-added, all the memory is in offline state. So
|
|
|
|
* clear all zones' present_pages because they will be updated in
|
|
|
|
* online_pages() and offline_pages().
|
|
|
|
*/
|
|
|
|
reset_node_present_pages(pgdat);
|
|
|
|
|
2006-06-27 17:53:34 +08:00
|
|
|
return pgdat;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void rollback_node_hotadd(int nid, pg_data_t *pgdat)
|
|
|
|
{
|
|
|
|
arch_refresh_nodedata(nid, NULL);
|
2016-08-12 06:33:12 +08:00
|
|
|
free_percpu(pgdat->per_cpu_nodestats);
|
2006-06-27 17:53:34 +08:00
|
|
|
arch_free_nodedata(pgdat);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2006-06-27 17:53:35 +08:00
|
|
|
|
2013-11-13 07:07:25 +08:00
|
|
|
/**
|
|
|
|
* try_online_node - online a node if offlined
|
2018-04-06 07:24:57 +08:00
|
|
|
* @nid: the node ID
|
2013-11-13 07:07:25 +08:00
|
|
|
*
|
2010-05-25 05:32:41 +08:00
|
|
|
* called by cpu_up() to online a node without onlined memory.
|
|
|
|
*/
|
2013-11-13 07:07:25 +08:00
|
|
|
int try_online_node(int nid)
|
2010-05-25 05:32:41 +08:00
|
|
|
{
|
|
|
|
pg_data_t *pgdat;
|
|
|
|
int ret;
|
|
|
|
|
2013-11-13 07:07:25 +08:00
|
|
|
if (node_online(nid))
|
|
|
|
return 0;
|
|
|
|
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it
myself, because it used an rw semaphore for get/put_online_mems,
making them deadlock prone. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing code
inside a get/put_online_mems section guarantees a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
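The relationship between the read-side get/put_online_mems sections and the hotplug side (mem_hotplug_begin()/mem_hotplug_done(), as used in try_online_node()) can be sketched with a single-threaded bookkeeping model. This is an assumption-laden illustration: `struct mem_hotplug_model` and the `model_*` functions are hypothetical, they do no real locking, and where the kernel would block or wait they simply return false.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, single-threaded model of the bookkeeping; no real locking. */
struct mem_hotplug_model {
	int readers;		/* sections currently between get and put */
	bool writer_active;	/* a hotplug operation is in progress */
};

/* Enter a read-side section; several may be active at the same time. */
bool model_get_online_mems(struct mem_hotplug_model *m)
{
	if (m->writer_active)
		return false;	/* the kernel would block here instead */
	m->readers++;
	return true;
}

/* Leave a read-side section. */
void model_put_online_mems(struct mem_hotplug_model *m)
{
	assert(m->readers > 0);
	m->readers--;
}

/* Start a hotplug operation; only possible once all readers have left. */
bool model_mem_hotplug_begin(struct mem_hotplug_model *m)
{
	if (m->readers > 0 || m->writer_active)
		return false;	/* the kernel would wait here instead */
	m->writer_active = true;
	return true;
}

/* Finish the hotplug operation and let readers in again. */
void model_mem_hotplug_done(struct mem_hotplug_model *m)
{
	m->writer_active = false;
}
```

Unlike a plain mutex, this arrangement lets any number of readers that only need a stable view of the online nodes proceed concurrently, while a hotplug operation still runs exclusively.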
|
|
|
mem_hotplug_begin();
|
2010-05-25 05:32:41 +08:00
|
|
|
pgdat = hotadd_new_pgdat(nid, 0);
|
2011-06-23 09:13:01 +08:00
|
|
|
if (!pgdat) {
|
2013-11-13 07:07:25 +08:00
|
|
|
pr_err("Cannot online node %d due to NULL pgdat\n", nid);
|
2010-05-25 05:32:41 +08:00
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
node_set_online(nid);
|
|
|
|
ret = register_one_node(nid);
|
|
|
|
BUG_ON(ret);
|
|
|
|
out:
|
2014-06-05 07:07:18 +08:00
|
|
|
mem_hotplug_done();
|
2010-05-25 05:32:41 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-09-12 05:21:49 +08:00
|
|
|
static int check_hotplug_memory_range(u64 start, u64 size)
|
|
|
|
{
|
mm/memory_hotplug: enforce block size aligned range check
Patch series "optimize memory hotplug", v3.
This patchset:
- Improves hotplug performance by eliminating a number of struct page
traverses during memory hotplug.
- Fixes some issues with hotplugging, where boundaries were not
properly checked. And on x86 block size was not properly aligned with
end of memory
- Also, potentially improves boot performance by eliminating condition
from __init_single_page().
- Adds robustness by verifying that struct pages are correctly
poisoned when flags are accessed.
The following experiments were performed on Xeon(R) CPU E7-8895 v3 @
2.60GHz with 1T RAM:
booting in qemu with 960G of memory, time to initialize struct pages:
no-kvm:
TRY1 TRY2
BEFORE: 39.433668 39.39705
AFTER: 36.903781 36.989329
with-kvm:
BEFORE: 10.977447 11.103164
AFTER: 10.929072 10.751885
Hotplug 896G memory:
no-kvm:
TRY1 TRY2
BEFORE: 848.740000 846.910000
AFTER: 783.070000 786.560000
with-kvm:
TRY1 TRY2
BEFORE: 34.410000 33.57
AFTER: 29.810000 29.580000
This patch (of 6):
Start qemu with the following arguments:
-m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G
This boots the machine with 64G and adds a device mem1 with 2G, which can
be hotplugged later.
Also make sure that config has the following turned on:
CONFIG_MEMORY_HOTPLUG
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
CONFIG_ACPI_HOTPLUG_MEMORY
Using the qemu monitor, hotplug the memory:
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
The operation will fail with the following trace:
WARNING: CPU: 0 PID: 91 at drivers/base/memory.c:205
pages_correctly_reserved+0xe6/0x110
Modules linked in:
CPU: 0 PID: 91 Comm: systemd-udevd Not tainted 4.16.0-rc1_pt_master #29
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:pages_correctly_reserved+0xe6/0x110
Call Trace:
memory_subsys_online+0x44/0xa0
device_online+0x51/0x80
store_mem_state+0x5e/0xe0
kernfs_fop_write+0xfa/0x170
__vfs_write+0x2e/0x150
vfs_write+0xa8/0x1a0
SyS_write+0x4d/0xb0
do_syscall_64+0x5d/0x110
entry_SYSCALL_64_after_hwframe+0x21/0x86
---[ end trace 6203bc4f1a5d30e8 ]---
The problem is detected in: drivers/base/memory.c
static bool pages_correctly_reserved(unsigned long start_pfn)
205 if (WARN_ON_ONCE(!pfn_valid(pfn)))
This function loops through every section in the newly added memory
block and verifies that the first pfn is valid, meaning section exists,
has mapping (struct page array), and is online.
The block size on x86 is usually 128M, but when the machine is booted with
more than 64G of memory, the block size is changed to 2G:
$ cat /sys/devices/system/memory/block_size_bytes
80000000
or
$ dmesg | grep "block size"
[ 0.086469] x86/mm: Memory block size: 2048MB
During memory hotplug and hotremove, we verify that the range is section
size aligned, but we actually must verify that it is block size aligned,
because that is the proper unit for hotplug operations. See:
Documentation/memory-hotplug.txt
So, when the start_pfn of newly added memory is not block size aligned,
we can get a memory block that has only part of it with properly
populated sections.
In our case the start_pfn starts from the last_pfn (end of physical
memory).
$ dmesg | grep last_pfn
[ 0.000000] e820: last_pfn = 0x1040000 max_arch_pfn = 0x400000000
pfn 0x1040000 corresponds to 65G, which is not 2G aligned!
The fix is to enforce that memory that is hotplugged and hotremoved is
block size aligned.
With this fix, running the above sequence yields the following result:
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
size 0x80000000
acpi PNP0C80:00: add_memory failed
acpi PNP0C80:00: acpi_memory_enable_device() error
acpi PNP0C80:00: Enumeration failure
Link: http://lkml.kernel.org/r/20180213193159.14606-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-06 07:22:39 +08:00
|
|
|
unsigned long block_sz = memory_block_size_bytes();
|
|
|
|
u64 block_nr_pages = block_sz >> PAGE_SHIFT;
|
2013-09-12 05:21:49 +08:00
|
|
|
u64 nr_pages = size >> PAGE_SHIFT;
|
2018-04-06 07:22:39 +08:00
|
|
|
u64 start_pfn = PFN_DOWN(start);
|
2013-09-12 05:21:49 +08:00
|
|
|
|
2018-04-06 07:22:39 +08:00
|
|
|
/* memory range must be block size aligned */
|
|
|
|
if (!nr_pages || !IS_ALIGNED(start_pfn, block_nr_pages) ||
|
|
|
|
!IS_ALIGNED(nr_pages, block_nr_pages)) {
|
|
|
|
pr_err("Block size [%#lx] unaligned hotplug range: start %#llx, size %#llx",
|
|
|
|
block_sz, start, size);
|
2013-09-12 05:21:49 +08:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
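The check above can be modelled outside the kernel. The following is a minimal userspace sketch, not the kernel's implementation: the name `hotplug_range_is_block_aligned`, the `SKETCH_PAGE_SHIFT` constant (4 KiB pages assumed), and the use of `%` in place of the kernel's `IS_ALIGNED()` macro are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages, as on x86 */

/*
 * Userspace model of the range check: the range must be non-empty, and
 * both its start and its length must be whole multiples of the memory
 * block size. All quantities are compared in units of pages.
 */
static bool hotplug_range_is_block_aligned(uint64_t start, uint64_t size,
                                           uint64_t block_sz)
{
    uint64_t block_nr_pages = block_sz >> SKETCH_PAGE_SHIFT;
    uint64_t nr_pages = size >> SKETCH_PAGE_SHIFT;
    uint64_t start_pfn = start >> SKETCH_PAGE_SHIFT;

    /* a block must cover at least one page, otherwise % is undefined */
    assert(block_nr_pages != 0);

    return nr_pages != 0 &&
           start_pfn % block_nr_pages == 0 &&
           nr_pages % block_nr_pages == 0;
}
```

With the values from the commit message (start 0x1040000000, size 0x80000000, block size 0x80000000) the function returns false, because the start pfn 0x1040000 leaves a remainder of 0x40000 when divided by the 0x80000 pages in a 2G block. A start of 0x1080000000 (66G) with the same size and block size would pass.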
|
|
|
|
|
2016-03-16 05:56:48 +08:00
|
|
|
static int online_memory_block(struct memory_block *mem, void *arg)
|
|
|
|
{
|
2017-02-25 07:00:02 +08:00
|
|
|
return device_online(&mem->dev);
|
2016-03-16 05:56:48 +08:00
|
|
|
}
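online_memory_block() is a per-block callback of the kind handed to walk_memory_range(). A rough userspace sketch of that callback pattern follows. Every name in it (`sketch_block`, `sketch_walk_blocks`, `sketch_online_block`) is hypothetical, and the stop-on-first-nonzero-return behaviour is an assumption about how such a walker typically behaves, not a statement of the kernel's exact code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct memory_block. */
struct sketch_block {
    uint64_t start_pfn; /* first pfn covered by this block */
    int online;         /* 0 = offline, 1 = online */
};

typedef int (*sketch_block_fn)(struct sketch_block *blk, void *arg);

/*
 * Call fn on every block whose first pfn lies in [start_pfn, end_pfn].
 * Assumed behaviour: stop at the first nonzero return and pass it up.
 */
static int sketch_walk_blocks(struct sketch_block *blocks, int nr,
                              uint64_t start_pfn, uint64_t end_pfn,
                              void *arg, sketch_block_fn fn)
{
    for (int i = 0; i < nr; i++) {
        int ret;

        if (blocks[i].start_pfn < start_pfn || blocks[i].start_pfn > end_pfn)
            continue;
        ret = fn(&blocks[i], arg);
        if (ret)
            return ret;
    }
    return 0;
}

/* Callback in the spirit of online_memory_block(): mark the block online. */
static int sketch_online_block(struct sketch_block *blk, void *arg)
{
    (void)arg; /* unused, mirroring the NULL argument passed by the caller */
    blk->online = 1;
    return 0;
}
```

Keeping the per-block action in a callback lets the same walker serve other operations, such as offlining or counting blocks, without duplicating the range iteration.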
|
|
|
|
|
2008-11-23 01:33:24 +08:00
|
|
|
/* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
|
2016-03-16 05:56:48 +08:00
|
|
|
int __ref add_memory_resource(int nid, struct resource *res, bool online)
|
2006-06-27 17:53:30 +08:00
|
|
|
{
|
2015-06-25 23:35:49 +08:00
|
|
|
u64 start, size;
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
because the update phase is different.)
Note:
To share as much common code as possible,
there are 2 changes from v2.
- The old add_memory(), which is defined by each arch,
is renamed to arch_add_memory(). The new add_memory() becomes
the common-code caller of the arch-dependent function.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
This was done because similar code for finding the node id from a
physical address existed inside the old add_memory() on each arch.
In addition, the acpi memory hotplug driver can find the node id more easily.
In v2, it had to walk the DSDT's _CRS, matching the physical address, to
get the handle of its memory device, and then get _PXM and the node id,
because the input was just a physical address.
However, in v3, the acpi driver can use the handle to get _PXM and the node id
for the new memory device. It can pass just the node id to add_memory().
The fix to the interface of arch_add_memory() is in the next patch.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
pg_data_t *pgdat = NULL;
|
2013-02-23 08:33:18 +08:00
|
|
|
bool new_pgdat;
|
|
|
|
bool new_node;
|
2006-06-27 17:53:30 +08:00
|
|
|
int ret;
|
|
|
|
|
2015-06-25 23:35:49 +08:00
|
|
|
start = res->start;
|
|
|
|
size = resource_size(res);
|
|
|
|
|
2013-09-12 05:21:49 +08:00
|
|
|
ret = check_hotplug_memory_range(start, size);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2013-02-23 08:33:18 +08:00
|
|
|
{ /* Stupid hack to suppress address-never-null warning */
|
|
|
|
void *p = NODE_DATA(nid);
|
|
|
|
new_pgdat = !p;
|
|
|
|
}
|
2014-01-24 07:53:26 +08:00
|
|
|
|
2014-06-05 07:07:18 +08:00
|
|
|
mem_hotplug_begin();
|
2014-01-24 07:53:26 +08:00
|
|
|
|
2015-09-05 06:42:32 +08:00
|
|
|
/*
|
|
|
|
* Add new range to memblock so that when hotadd_new_pgdat() is called
|
|
|
|
* to allocate new pgdat, get_pfn_range_for_nid() will be able to find
|
|
|
|
* this new range and calculate total pages correctly. The range will
|
|
|
|
* be removed at hot-remove time.
|
|
|
|
*/
|
|
|
|
memblock_add_node(start, size, nid);
|
|
|
|
|
2013-02-23 08:33:18 +08:00
|
|
|
new_node = !node_online(nid);
|
|
|
|
if (new_node) {
|
2006-06-27 17:53:34 +08:00
|
|
|
pgdat = hotadd_new_pgdat(nid, start);
|
2009-11-18 06:06:22 +08:00
|
|
|
ret = -ENOMEM;
|
2006-06-27 17:53:34 +08:00
|
|
|
if (!pgdat)
|
2012-07-12 05:02:31 +08:00
|
|
|
goto error;
|
2006-06-27 17:53:34 +08:00
|
|
|
}
|
|
|
|
|
2006-06-27 17:53:30 +08:00
|
|
|
/* call arch's memory hotadd */
|
2017-12-29 15:53:53 +08:00
|
|
|
ret = arch_add_memory(nid, start, size, NULL, true);
|
2006-06-27 17:53:30 +08:00
|
|
|
|
2006-06-27 17:53:34 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto error;
|
|
|
|
|
2006-06-27 17:53:38 +08:00
|
|
|
/* we online node here. we can't roll back from here. */
|
2006-06-27 17:53:34 +08:00
|
|
|
node_set_online(nid);
|
|
|
|
|
2013-02-23 08:33:18 +08:00
|
|
|
if (new_node) {
|
2017-07-07 06:37:49 +08:00
|
|
|
unsigned long start_pfn = start >> PAGE_SHIFT;
|
|
|
|
unsigned long nr_pages = size >> PAGE_SHIFT;
|
|
|
|
|
|
|
|
ret = __register_one_node(nid);
|
|
|
|
if (ret)
|
|
|
|
goto register_fail;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* link memory sections under this node. This is already
|
|
|
|
* done when creating memory section in register_new_memory
|
|
|
|
* but that depends on the node being registered, so offline
|
|
|
|
* nodes have to go through register_node.
|
|
|
|
* TODO clean up this mess.
|
|
|
|
*/
|
2018-05-26 05:47:53 +08:00
|
|
|
ret = link_mem_sections(nid, start_pfn, nr_pages, false);
|
2017-07-07 06:37:49 +08:00
|
|
|
register_fail:
|
2006-06-27 17:53:38 +08:00
|
|
|
/*
|
|
|
|
* If the sysfs file of the new node can't be created, cpus on the node
|
|
|
|
* can't be hot-added. There is no way to roll back now.
|
|
|
|
* So, check by BUG_ON() to catch it reluctantly..
|
|
|
|
*/
|
|
|
|
BUG_ON(ret);
|
|
|
|
}
|
|
|
|
|
2010-03-06 05:41:58 +08:00
|
|
|
/* create new memmap entry */
|
|
|
|
firmware_map_add_hotplug(start, start + size, "System RAM");
|
|
|
|
|
2016-03-16 05:56:48 +08:00
|
|
|
/* online pages if requested */
|
|
|
|
if (online)
|
|
|
|
walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1),
|
|
|
|
NULL, online_memory_block);
|
|
|
|
|
2009-11-18 06:06:22 +08:00
|
|
|
goto out;
|
|
|
|
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patch.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
error:
|
|
|
|
/* rollback pgdat allocation and others */
|
2017-07-11 06:47:23 +08:00
|
|
|
if (new_pgdat && pgdat)
|
2006-06-27 17:53:34 +08:00
|
|
|
rollback_node_hotadd(nid, pgdat);
|
2015-09-05 06:42:32 +08:00
|
|
|
memblock_remove(start, size);
|
2006-06-27 17:53:34 +08:00
|
|
|
|
2009-11-18 06:06:22 +08:00
|
|
|
out:
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them deadlock prone. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
mem_hotplug_done();
|
2006-06-27 17:53:30 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2015-06-25 23:35:49 +08:00
|
|
|
EXPORT_SYMBOL_GPL(add_memory_resource);
|
|
|
|
|
|
|
|
int __ref add_memory(int nid, u64 start, u64 size)
|
|
|
|
{
|
|
|
|
struct resource *res;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
res = register_memory_resource(start, size);
|
2016-01-15 07:21:55 +08:00
|
|
|
if (IS_ERR(res))
|
|
|
|
return PTR_ERR(res);
|
2015-06-25 23:35:49 +08:00
|
|
|
|
2016-03-16 05:56:48 +08:00
|
|
|
ret = add_memory_resource(nid, res, memhp_auto_online);
|
2015-06-25 23:35:49 +08:00
|
|
|
if (ret < 0)
|
|
|
|
release_memory_resource(res);
|
|
|
|
return ret;
|
|
|
|
}
|
2006-06-27 17:53:30 +08:00
|
|
|
EXPORT_SYMBOL_GPL(add_memory);
|
2007-10-16 16:26:12 +08:00
|
|
|
|
|
|
|
#ifdef CONFIG_MEMORY_HOTREMOVE
|
2008-07-24 12:28:19 +08:00
|
|
|
/*
|
|
|
|
* A free page on the buddy free lists (not the per-cpu lists) has PageBuddy
|
|
|
|
* set and the size of the free page is given by page_order(). Using this,
|
|
|
|
* the function determines if the pageblock contains only free pages.
|
|
|
|
* Due to buddy constraints, a free page at least the size of a pageblock will
|
|
|
|
* be located at the start of the pageblock
|
|
|
|
*/
|
|
|
|
static inline int pageblock_free(struct page *page)
|
|
|
|
{
|
|
|
|
return PageBuddy(page) && page_order(page) >= pageblock_order;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Return the start of the next active pageblock after a given page */
|
|
|
|
static struct page *next_active_pageblock(struct page *page)
|
|
|
|
{
|
|
|
|
/* Ensure the starting page is pageblock-aligned */
|
|
|
|
BUG_ON(page_to_pfn(page) & (pageblock_nr_pages - 1));
|
|
|
|
|
|
|
|
/* If the entire pageblock is free, move to the end of the free page */
|
2010-09-10 07:38:01 +08:00
|
|
|
if (pageblock_free(page)) {
|
|
|
|
int order;
|
|
|
|
/* Be careful: we hold no locks, so page_order() can change under us. */
|
|
|
|
order = page_order(page);
|
|
|
|
if ((order < MAX_ORDER) && (order >= pageblock_order))
|
|
|
|
return page + (1 << order);
|
|
|
|
}
|
2008-07-24 12:28:19 +08:00
|
|
|
|
2010-09-10 07:38:01 +08:00
|
|
|
return page + pageblock_nr_pages;
|
2008-07-24 12:28:19 +08:00
|
|
|
}
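The stepping rule used by next_active_pageblock() can be modelled in user space. This is only a sketch: the `sketch_` names, the order values 9 and 11, and the convention of passing the buddy order as a plain integer (-1 meaning "not a free buddy page") are assumptions made for illustration, not kernel definitions.

```c
#include <assert.h>

/* Assumed values for illustration; the kernel derives these per configuration. */
#define SKETCH_PAGEBLOCK_ORDER 9
#define SKETCH_MAX_ORDER 11
#define SKETCH_PAGEBLOCK_NR_PAGES (1UL << SKETCH_PAGEBLOCK_ORDER)

/*
 * Hypothetical model: free_order is the buddy order of a free page that
 * starts at pfn, or -1 if the page at pfn is not a free buddy page.
 * Returns the pfn at which the next pageblock check should start.
 */
static unsigned long sketch_next_pageblock_pfn(unsigned long pfn, int free_order)
{
	/* Mirrors the BUG_ON(): the starting pfn must be pageblock-aligned. */
	assert((pfn & (SKETCH_PAGEBLOCK_NR_PAGES - 1)) == 0);

	/* A sane free page of at least pageblock size: jump past all of it. */
	if (free_order >= SKETCH_PAGEBLOCK_ORDER && free_order < SKETCH_MAX_ORDER)
		return pfn + (1UL << free_order);

	/* Otherwise (in use, or the order looked bogus): step one pageblock. */
	return pfn + SKETCH_PAGEBLOCK_NR_PAGES;
}
```

The upper bound on the order matters because the order is read without a lock; a value that changed underneath the reader falls back to the single-pageblock step instead of producing a wild jump.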
|
|
|
|
|
2018-06-08 08:07:43 +08:00
|
|
|
static bool is_pageblock_removable_nolock(struct page *page)
|
|
|
|
{
|
|
|
|
struct zone *zone;
|
|
|
|
unsigned long pfn;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We have to be careful here because we are iterating over memory
|
|
|
|
* sections which are not zone aware so we might end up outside of
|
|
|
|
* the zone but still within the section.
|
|
|
|
* We have to take care of the node as well. If the node is offline
|
|
|
|
* its NODE_DATA will be NULL - see page_zone.
|
|
|
|
*/
|
|
|
|
if (!node_online(page_to_nid(page)))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
zone = page_zone(page);
|
|
|
|
pfn = page_to_pfn(page);
|
|
|
|
if (!zone_spans_pfn(zone, pfn))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
return !has_unmovable_pages(zone, page, 0, MIGRATE_MOVABLE, true);
|
|
|
|
}
|
|
|
|
|
2008-07-24 12:28:19 +08:00
|
|
|
/* Checks if this range of memory is likely to be hot-removable. */
|
2016-05-20 08:11:26 +08:00
|
|
|
bool is_mem_section_removable(unsigned long start_pfn, unsigned long nr_pages)
|
2008-07-24 12:28:19 +08:00
|
|
|
{
|
|
|
|
struct page *page = pfn_to_page(start_pfn);
|
|
|
|
struct page *end_page = page + nr_pages;
|
|
|
|
|
|
|
|
/* Check the starting page of each pageblock within the range */
|
|
|
|
for (; page < end_page; page = next_active_pageblock(page)) {
|
2010-10-27 05:21:30 +08:00
|
|
|
if (!is_pageblock_removable_nolock(page))
|
2016-05-20 08:11:26 +08:00
|
|
|
return false;
|
2010-10-27 05:21:30 +08:00
|
|
|
cond_resched();
|
2008-07-24 12:28:19 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* All pageblocks in the memory block are likely to be hot-removable */
|
2016-05-20 08:11:26 +08:00
|
|
|
return true;
|
2008-07-24 12:28:19 +08:00
|
|
|
}
|
|
|
|
|
2007-10-16 16:26:12 +08:00
|
|
|
/*
|
mm/memory_hotplug.c: check start_pfn in test_pages_in_a_zone()
Patch series "fix a kernel oops when reading sysfs valid_zones", v2.
A sysfs memory file is created for each 2GiB memory block on x86-64 when
the system has 64GiB or more memory. [1] When the start address of a
memory block is not backed by struct page, i.e. a memory range is not
aligned by 2GiB, reading its 'valid_zones' attribute file leads to a
kernel oops. This issue was observed on multiple x86-64 systems with
more than 64GiB of memory. This patch-set fixes this issue.
Patch 1 first fixes an issue in test_pages_in_a_zone(), which does not
test the start section.
Patch 2 then fixes the kernel oops by extending test_pages_in_a_zone()
to return valid [start, end).
Note for stable kernels: The memory block size change was made by commit
bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64
systems"), which was accepted to 3.9. However, this patch-set depends
on (and fixes) the change to test_pages_in_a_zone() made by commit
5f0f2887f4de ("mm/memory_hotplug.c: check for missing sections in
test_pages_in_a_zone()"), which was accepted to 4.4.
So, I recommend that we backport it up to 4.4.
[1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on
large-memory x86-64 systems")'
This patch (of 2):
test_pages_in_a_zone() does not check 'start_pfn' when it is aligned by
section since 'sec_end_pfn' is set equal to 'pfn'. Since this function
is called for testing the range of a sysfs memory file, 'start_pfn' is
always aligned by section.
Fix it by properly setting 'sec_end_pfn' to the next section pfn.
Also make sure that this function returns 1 only when the range belongs
to a zone.
Link: http://lkml.kernel.org/r/20170127222149.30893-2-toshi.kani@hpe.com
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Banman <abanman@sgi.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Greg KH <greg@kroah.com>
Cc: <stable@vger.kernel.org> [4.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-04 05:13:20 +08:00
|
|
|
* Confirm all pages in a range [start, end) belong to the same zone.
|
2017-02-04 05:13:23 +08:00
|
|
|
* When true, return its valid [start, end).
|
2007-10-16 16:26:12 +08:00
|
|
|
*/
|
2017-02-04 05:13:23 +08:00
|
|
|
int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn,
|
|
|
|
unsigned long *valid_start, unsigned long *valid_end)
|
2007-10-16 16:26:12 +08:00
|
|
|
{
|
2015-12-30 06:54:25 +08:00
|
|
|
unsigned long pfn, sec_end_pfn;
|
2017-02-04 05:13:23 +08:00
|
|
|
unsigned long start, end;
|
2007-10-16 16:26:12 +08:00
|
|
|
struct zone *zone = NULL;
|
|
|
|
struct page *page;
|
|
|
|
int i;
|
2017-02-04 05:13:20 +08:00
|
|
|
for (pfn = start_pfn, sec_end_pfn = SECTION_ALIGN_UP(start_pfn + 1);
|
2007-10-16 16:26:12 +08:00
|
|
|
pfn < end_pfn;
|
2017-02-04 05:13:20 +08:00
|
|
|
pfn = sec_end_pfn, sec_end_pfn += PAGES_PER_SECTION) {
|
2015-12-30 06:54:25 +08:00
|
|
|
/* Make sure the memory section is present first */
|
|
|
|
if (!present_section_nr(pfn_to_section_nr(pfn)))
|
2007-10-16 16:26:12 +08:00
|
|
|
continue;
|
2015-12-30 06:54:25 +08:00
|
|
|
for (; pfn < sec_end_pfn && pfn < end_pfn;
|
|
|
|
pfn += MAX_ORDER_NR_PAGES) {
|
|
|
|
i = 0;
|
|
|
|
/* This is just a CONFIG_HOLES_IN_ZONE check.*/
|
|
|
|
while ((i < MAX_ORDER_NR_PAGES) &&
|
|
|
|
!pfn_valid_within(pfn + i))
|
|
|
|
i++;
|
2017-02-25 06:59:30 +08:00
|
|
|
if (i == MAX_ORDER_NR_PAGES || pfn + i >= end_pfn)
|
2015-12-30 06:54:25 +08:00
|
|
|
continue;
|
|
|
|
page = pfn_to_page(pfn + i);
|
|
|
|
if (zone && page_zone(page) != zone)
|
|
|
|
return 0;
|
2017-02-04 05:13:23 +08:00
|
|
|
if (!zone)
|
|
|
|
start = pfn + i;
|
2015-12-30 06:54:25 +08:00
|
|
|
zone = page_zone(page);
|
2017-02-04 05:13:23 +08:00
|
|
|
end = pfn + MAX_ORDER_NR_PAGES;
|
2015-12-30 06:54:25 +08:00
|
|
|
}
|
2007-10-16 16:26:12 +08:00
|
|
|
}
|
2017-02-04 05:13:20 +08:00
|
|
|
|
2017-02-04 05:13:23 +08:00
|
|
|
if (zone) {
|
|
|
|
*valid_start = start;
|
2017-02-25 06:59:30 +08:00
|
|
|
*valid_end = min(end, end_pfn);
|
2017-02-04 05:13:20 +08:00
|
|
|
return 1;
|
2017-02-04 05:13:23 +08:00
|
|
|
} else {
|
2017-02-04 05:13:20 +08:00
|
|
|
return 0;
|
2017-02-04 05:13:23 +08:00
|
|
|
}
|
2007-10-16 16:26:12 +08:00
|
|
|
}
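The commit message for test_pages_in_a_zone() says that a section-aligned 'start_pfn' used to leave 'sec_end_pfn' equal to 'pfn', so the first section was never tested. The arithmetic can be checked with a small user-space sketch. The `sketch_` names and the shift of 15 (128 MiB sections with 4 KiB pages) are assumptions for illustration; the real values depend on the architecture and configuration.

```c
#include <assert.h>

/* Assumed: 2^15 pages per section (128 MiB sections, 4 KiB pages). */
#define SKETCH_PFN_SECTION_SHIFT 15
#define SKETCH_PAGES_PER_SECTION (1UL << SKETCH_PFN_SECTION_SHIFT)

/* Round a pfn up to the next section boundary; aligned pfns are unchanged. */
static unsigned long sketch_section_align_up(unsigned long pfn)
{
	return (pfn + SKETCH_PAGES_PER_SECTION - 1) &
	       ~(SKETCH_PAGES_PER_SECTION - 1);
}

/*
 * End pfn of the section that contains start_pfn. Adding 1 before rounding
 * is what makes an already aligned start_pfn map to the NEXT boundary.
 */
static unsigned long sketch_first_sec_end(unsigned long start_pfn)
{
	assert(start_pfn + 1 != 0); /* guard against overflow in the sketch */
	return sketch_section_align_up(start_pfn + 1);
}
```

With an aligned start such as pfn 32768, plain rounding returns 32768 itself, which makes the inner loop's `pfn < sec_end_pfn` condition false immediately; rounding `start_pfn + 1` gives 65536, so the first section is scanned.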
|
|
|
|
|
|
|
|
/*
|
2017-02-25 06:57:39 +08:00
|
|
|
* Scan pfn range [start,end) to find movable/migratable pages (LRU pages,
|
|
|
|
* non-lru movable pages and hugepages). We scan pfn because it's much
|
|
|
|
* easier than scanning over linked list. This function returns the pfn
|
|
|
|
* of the first found movable page if it's found, otherwise 0.
|
2007-10-16 16:26:12 +08:00
|
|
|
*/
|
2013-09-12 05:22:09 +08:00
|
|
|
static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
|
2007-10-16 16:26:12 +08:00
|
|
|
{
|
|
|
|
unsigned long pfn;
|
|
|
|
struct page *page;
|
|
|
|
for (pfn = start; pfn < end; pfn++) {
|
|
|
|
if (pfn_valid(pfn)) {
|
|
|
|
page = pfn_to_page(pfn);
|
|
|
|
if (PageLRU(page))
|
|
|
|
return pfn;
|
2017-02-25 06:57:39 +08:00
|
|
|
if (__PageMovable(page))
|
|
|
|
return pfn;
|
2013-09-12 05:22:09 +08:00
|
|
|
if (PageHuge(page)) {
|
2015-04-16 07:14:41 +08:00
|
|
|
if (page_huge_active(page))
|
2013-09-12 05:22:09 +08:00
|
|
|
return pfn;
|
|
|
|
else
|
|
|
|
pfn = round_up(pfn + 1,
|
|
|
|
1 << compound_order(page)) - 1;
|
|
|
|
}
|
2007-10-16 16:26:12 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
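When scan_movable_pages() meets a hugepage that is not active, it advances pfn to the last page of that hugepage so that the loop's own pfn++ continues right after it. A minimal user-space sketch of that arithmetic follows. The `sketch_` names are hypothetical, and the sketch assumes the hugepage is naturally aligned to its own size.

```c
#include <assert.h>

/* Power-of-two round-up; gives the same result as the kernel's round_up()
 * for power-of-two alignments. */
static unsigned long sketch_round_up(unsigned long x, unsigned long align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/*
 * Hypothetical helper: given any pfn inside a compound page of the given
 * order, return the pfn of its last page. A following pfn++ then lands on
 * the first page after the compound page.
 */
static unsigned long sketch_skip_inactive_hugepage(unsigned long pfn,
						   unsigned int order)
{
	return sketch_round_up(pfn + 1, 1UL << order) - 1;
}
```

For an order-9 hugepage covering pfns 512 to 1023, any pfn in that range maps to 1023, so the scan never revisits tail pages of the same hugepage.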
|
|
|
|
|
2018-04-11 07:30:03 +08:00
|
|
|
static struct page *new_node_page(struct page *page, unsigned long private)
|
2016-07-29 06:48:53 +08:00
|
|
|
{
|
|
|
|
int nid = page_to_nid(page);
|
2016-09-29 06:22:38 +08:00
|
|
|
nodemask_t nmask = node_states[N_MEMORY];
|
2017-07-11 06:48:41 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* try to allocate from a different node but reuse this node if there
|
|
|
|
* are no other online nodes to be used (e.g. we are offlining a part
|
|
|
|
* of the only existing node)
|
|
|
|
*/
|
|
|
|
node_clear(nid, nmask);
|
|
|
|
if (nodes_empty(nmask))
|
|
|
|
node_set(nid, nmask);
|
2016-07-29 06:48:53 +08:00
|
|
|
|
2017-07-11 06:48:47 +08:00
|
|
|
return new_page_nodemask(page, nid, &nmask);
|
2016-07-29 06:48:53 +08:00
|
|
|
}
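The node selection in new_node_page() prefers any node other than the one being offlined and falls back to that node only when no alternative exists. The sketch below models this with a 64-bit bitmask instead of nodemask_t. The `sketch_` name and the 64-node limit are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of the target-node choice: bit n of nodes_with_memory
 * is set when node n has memory. The result is the set of nodes that a
 * replacement page may be allocated from.
 */
static uint64_t sketch_migration_targets(uint64_t nodes_with_memory,
					 unsigned int nid)
{
	uint64_t mask;

	assert(nid < 64);

	/* Prefer every node except the one whose memory is going away. */
	mask = nodes_with_memory & ~(UINT64_C(1) << nid);

	/* No other node available: reuse the source node after all. */
	if (mask == 0)
		mask = UINT64_C(1) << nid;

	return mask;
}
```

The fallback covers the case named in the comment above: offlining part of the only existing node, where refusing the source node would leave no allocation target at all.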
|
|
|
|
|
2007-10-16 16:26:12 +08:00
|
|
|
#define NR_OFFLINE_AT_ONCE_PAGES (256)
|
|
|
|
static int
|
|
|
|
do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
|
|
|
|
{
|
|
|
|
unsigned long pfn;
|
|
|
|
struct page *page;
|
|
|
|
int move_pages = NR_OFFLINE_AT_ONCE_PAGES;
|
|
|
|
int not_managed = 0;
|
|
|
|
int ret = 0;
|
|
|
|
LIST_HEAD(source);
|
|
|
|
|
|
|
|
for (pfn = start_pfn; pfn < end_pfn && move_pages > 0; pfn++) {
|
|
|
|
if (!pfn_valid(pfn))
|
|
|
|
continue;
|
|
|
|
page = pfn_to_page(pfn);
|
2013-09-12 05:22:09 +08:00
|
|
|
|
|
|
|
if (PageHuge(page)) {
|
|
|
|
struct page *head = compound_head(page);
|
|
|
|
pfn = page_to_pfn(head) + (1<<compound_order(head)) - 1;
|
|
|
|
if (compound_order(head) > PFN_SECTION_SHIFT) {
|
|
|
|
ret = -EBUSY;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (isolate_huge_page(page, &source))
|
|
|
|
move_pages -= 1 << compound_order(head);
|
|
|
|
continue;
|
2018-04-11 07:30:07 +08:00
|
|
|
} else if (PageTransHuge(page))
|
2017-09-09 07:11:15 +08:00
|
|
|
pfn = page_to_pfn(compound_head(page))
|
|
|
|
+ hpage_nr_pages(page) - 1;
|
2013-09-12 05:22:09 +08:00
|
|
|
|
2011-05-25 08:12:19 +08:00
|
|
|
if (!get_page_unless_zero(page))
|
2007-10-16 16:26:12 +08:00
|
|
|
continue;
|
|
|
|
/*
|
2017-02-25 06:57:39 +08:00
|
|
|
* We can skip free pages. And we can deal with pages on
|
|
|
|
* LRU and non-lru movable pages.
|
2007-10-16 16:26:12 +08:00
|
|
|
*/
|
2017-02-25 06:57:39 +08:00
|
|
|
if (PageLRU(page))
|
|
|
|
ret = isolate_lru_page(page);
|
|
|
|
else
|
|
|
|
ret = isolate_movable_page(page, ISOLATE_UNEVICTABLE);
|
2007-10-16 16:26:12 +08:00
|
|
|
if (!ret) { /* Success */
|
2011-05-25 08:12:19 +08:00
|
|
|
put_page(page);
|
vmscan: move isolate_lru_page() to vmscan.c
On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory. Not only
does it use up CPU time, but it also provokes lock contention and can
leave large systems under memory pressure in a catatonic state.
This patch series improves VM scalability by:
1) putting filesystem backed, swap backed and unevictable pages
onto their own LRUs, so the system only scans the pages that it
can/should evict from memory
2) switching to two handed clock replacement for the anonymous LRUs,
so the number of pages that need to be scanned when the system
starts swapping is bound to a reasonable number
3) keeping unevictable pages off the LRU completely, so the
VM does not waste CPU time scanning them. ramfs, ramdisk,
SHM_LOCKED shared memory segments and mlock()ed VMA pages
are kept on the unevictable list.
This patch:
isolate_lru_page logically belongs to be in vmscan.c than migrate.c.
It is tough, because we don't need that function without memory migration
so there is a valid argument to have it in migrate.c. However a
subsequent patch needs to make use of it in the core mm, so we can happily
move it to vmscan.c.
Also, make the function a little more generic by not requiring that it
adds an isolated page to a given list. Callers can do that.
Note that we now have '__isolate_lru_page()', that does
something quite different, visible outside of vmscan.c
for use with memory controller. Methinks we need to
rationalize these names/purposes. --lts
[akpm@linux-foundation.org: fix mm/memory_hotplug.c build]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-19 11:26:09 +08:00
|
|
|
list_add_tail(&page->lru, &source);
|
2007-10-16 16:26:12 +08:00
|
|
|
move_pages--;
|
2017-02-25 06:57:39 +08:00
|
|
|
if (!__PageMovable(page))
|
|
|
|
inc_node_page_state(page, NR_ISOLATED_ANON +
|
|
|
|
page_is_file_cache(page));
|
2009-12-15 09:58:11 +08:00
|
|
|
|
2007-10-16 16:26:12 +08:00
|
|
|
} else {
|
|
|
|
#ifdef CONFIG_DEBUG_VM
|
2017-02-25 06:57:39 +08:00
|
|
|
pr_alert("failed to isolate pfn %lx\n", pfn);
|
|
|
|
dump_page(page, "isolation failed");
|
2007-10-16 16:26:12 +08:00
|
|
|
#endif
|
2011-05-25 08:12:19 +08:00
|
|
|
put_page(page);
|
2011-03-31 09:57:33 +08:00
|
|
|
/* Because we don't hold the big zone->lock, we should
|
2010-10-27 05:22:10 +08:00
|
|
|
check this again here. */
|
|
|
|
if (page_count(page)) {
|
|
|
|
not_managed++;
|
2010-10-27 05:22:10 +08:00
|
|
|
ret = -EBUSY;
|
2010-10-27 05:22:10 +08:00
|
|
|
break;
|
|
|
|
}
|
2007-10-16 16:26:12 +08:00
|
|
|
}
|
|
|
|
}
|
2010-10-27 05:22:10 +08:00
|
|
|
if (!list_empty(&source)) {
|
|
|
|
if (not_managed) {
|
2013-09-12 05:22:09 +08:00
|
|
|
putback_movable_pages(&source);
|
2010-10-27 05:22:10 +08:00
|
|
|
goto out;
|
|
|
|
}
|
2012-10-09 07:32:54 +08:00
|
|
|
|
2016-07-29 06:48:53 +08:00
|
|
|
/* Allocate a new page from the nearest neighbor node */
|
|
|
|
ret = migrate_pages(&source, new_node_page, NULL, 0,
|
2013-02-23 08:35:14 +08:00
|
|
|
MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
|
2010-10-27 05:22:10 +08:00
|
|
|
if (ret)
|
2013-09-12 05:22:09 +08:00
|
|
|
putback_movable_pages(&source);
|
2007-10-16 16:26:12 +08:00
|
|
|
}
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* remove from free_area[] and mark all as Reserved.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
offline_isolated_pages_cb(unsigned long start, unsigned long nr_pages,
|
|
|
|
void *data)
|
|
|
|
{
|
|
|
|
__offline_isolated_pages(start, start + nr_pages);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
|
|
|
|
{
|
2009-09-23 07:45:46 +08:00
|
|
|
walk_system_ram_range(start_pfn, end_pfn - start_pfn, NULL,
|
2007-10-16 16:26:12 +08:00
|
|
|
offline_isolated_pages_cb);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check all pages in range, recorded as memory resource, are isolated.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
check_pages_isolated_cb(unsigned long start_pfn, unsigned long nr_pages,
|
|
|
|
void *data)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
long offlined = *(long *)data;
|
2012-12-12 08:00:45 +08:00
|
|
|
ret = test_pages_isolated(start_pfn, start_pfn + nr_pages, true);
|
2007-10-16 16:26:12 +08:00
|
|
|
offlined = nr_pages;
|
|
|
|
if (!ret)
|
|
|
|
*(long *)data += offlined;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static long
|
|
|
|
check_pages_isolated(unsigned long start_pfn, unsigned long end_pfn)
|
|
|
|
{
|
|
|
|
long offlined = 0;
|
|
|
|
int ret;
|
|
|
|
|
2009-09-23 07:45:46 +08:00
|
|
|
ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn, &offlined,
|
2007-10-16 16:26:12 +08:00
|
|
|
check_pages_isolated_cb);
|
|
|
|
if (ret < 0)
|
|
|
|
offlined = (long)ret;
|
|
|
|
return offlined;
|
|
|
|
}
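The callback-plus-accumulator pattern used by check_pages_isolated() above can be tried out in user space. The following is a minimal sketch, not the kernel implementation: struct pfn_range, walk_ranges() and the isolated flag are hypothetical stand-ins for the RAM resource walk and for test_pages_isolated().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one RAM resource range: [start, start + nr). */
struct pfn_range {
	unsigned long start;
	unsigned long nr;
	int isolated;	/* 1 if every page in the range is isolated */
};

typedef int (*range_cb)(const struct pfn_range *r, void *data);

/* Simplified walker: calls cb on each range, stops at the first non-zero result. */
static int walk_ranges(const struct pfn_range *r, size_t n, void *data,
		       range_cb cb)
{
	int ret = 0;
	size_t i;

	assert(cb != NULL);
	for (i = 0; i < n && !ret; i++)
		ret = cb(&r[i], data);
	return ret;
}

/* Adds the range size to the accumulator only if the range is isolated. */
static int count_isolated_cb(const struct pfn_range *r, void *data)
{
	if (!r->isolated)
		return -EBUSY;
	*(long *)data += (long)r->nr;
	return 0;
}

/* Returns the number of isolated pages, or a negative errno on failure. */
static long count_isolated(const struct pfn_range *r, size_t n)
{
	long total = 0;
	int ret = walk_ranges(r, n, &total, count_isolated_cb);

	return ret < 0 ? (long)ret : total;
}
```

With two isolated ranges of 512 and 256 pages, count_isolated() returns 768. If any range is not isolated, the walk stops and the negative error code is returned instead, which is what lets the caller treat a negative result as "try again".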
|
|
|
|
|
mem-hotplug: introduce movable_node boot option
The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel, it
cannot be hot-removed. So memory hotplug users may want to set all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.
Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.
But the kernel cannot use memory in ZONE_MOVABLE, so by doing this the
kernel cannot use memory in movable nodes. This will degrade NUMA
performance, and other users may be unhappy.
So we need a way to allow users to enable and disable this functionality.
In this patch, we introduce movable_node boot option to allow users to
choose not to consume hotpluggable memory at early boot time and later
we can set it as ZONE_MOVABLE.
To achieve this, the movable_node boot option will control the memblock
allocation direction. That said, after memblock is ready, before SRAT is
parsed, we should allocate memory near the kernel image as we explained in
the previous patches. So if movable_node boot option is set, the kernel
does the following:
1. After memblock is ready, make memblock allocate memory bottom up.
2. After SRAT is parsed, make memblock behave as default, allocate memory
top down.
Users can specify "movable_node" in kernel commandline to enable this
functionality. For those who don't use memory hotplug or who don't want
to lose their NUMA performance, just don't specify anything. The kernel
will work as before.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Suggested-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 07:08:10 +08:00
|
|
|
static int __init cmdline_parse_movable_node(char *p)
|
|
|
|
{
|
2017-07-07 06:41:05 +08:00
|
|
|
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
|
2014-01-22 07:49:35 +08:00
|
|
|
movable_node_enabled = true;
|
2017-07-07 06:41:05 +08:00
|
|
|
#else
|
|
|
|
pr_warn("movable_node parameter depends on CONFIG_HAVE_MEMBLOCK_NODE_MAP to work properly\n");
|
|
|
|
#endif
|
2013-11-13 07:08:10 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
early_param("movable_node", cmdline_parse_movable_node);
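To see what a flag-style boot parameter such as "movable_node" amounts to, here is a small user-space sketch that scans a command line string for a parameter name. It is an illustration only: cmdline_has_param() is a hypothetical helper, not the kernel's parser, and it ignores details such as quoting and the kernel's treatment of '-' and '_'.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Returns true if the space-separated command line contains a token that is
 * exactly "name" or starts with "name=". Hypothetical helper for illustration.
 */
static bool cmdline_has_param(const char *cmdline, const char *name)
{
	size_t len = strlen(name);
	const char *p = cmdline;

	while (*p) {
		const char *tok;
		size_t toklen;

		while (*p == ' ')	/* skip separators */
			p++;
		tok = p;
		while (*p && *p != ' ')	/* find end of token */
			p++;
		toklen = (size_t)(p - tok);

		if (toklen >= len && strncmp(tok, name, len) == 0 &&
		    (toklen == len || tok[len] == '='))
			return true;
	}
	return false;
}
```

For example, cmdline_has_param("root=/dev/sda1 movable_node quiet", "movable_node") is true, while a command line containing only "movable_node_x" does not match.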
|
|
|
|
|
2012-12-12 08:01:03 +08:00
|
|
|
/* check which state of node_states will be changed when offlining memory */
|
|
|
|
static void node_states_check_changes_offline(unsigned long nr_pages,
|
|
|
|
struct zone *zone, struct memory_notify *arg)
|
|
|
|
{
|
|
|
|
struct pglist_data *pgdat = zone->zone_pgdat;
|
|
|
|
unsigned long present_pages = 0;
|
|
|
|
enum zone_type zt, zone_last = ZONE_NORMAL;
|
|
|
|
|
|
|
|
/*
|
2012-12-13 05:51:49 +08:00
|
|
|
* If we have HIGHMEM or movable node, node_states[N_NORMAL_MEMORY]
|
|
|
|
* contains nodes which have zones of 0...ZONE_NORMAL,
|
|
|
|
* set zone_last to ZONE_NORMAL.
|
2012-12-12 08:01:03 +08:00
|
|
|
*
|
2012-12-13 05:51:49 +08:00
|
|
|
* If we don't have HIGHMEM nor movable node,
|
|
|
|
* node_states[N_NORMAL_MEMORY] contains nodes which have zones of
|
|
|
|
* 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
|
2012-12-12 08:01:03 +08:00
|
|
|
*/
|
2012-12-13 05:51:49 +08:00
|
|
|
if (N_MEMORY == N_NORMAL_MEMORY)
|
2012-12-12 08:01:03 +08:00
|
|
|
zone_last = ZONE_MOVABLE;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* check whether node_states[N_NORMAL_MEMORY] will be changed.
|
|
|
|
* If the memory to be offline is in a zone of 0...zone_last,
|
|
|
|
* and it is the last present memory, 0...zone_last will
|
|
|
|
* become empty after offline, thus we can determine that we will
|
|
|
|
* need to clear the node from node_states[N_NORMAL_MEMORY].
|
|
|
|
*/
|
|
|
|
for (zt = 0; zt <= zone_last; zt++)
|
|
|
|
present_pages += pgdat->node_zones[zt].present_pages;
|
|
|
|
if (zone_idx(zone) <= zone_last && nr_pages >= present_pages)
|
|
|
|
arg->status_change_nid_normal = zone_to_nid(zone);
|
|
|
|
else
|
|
|
|
arg->status_change_nid_normal = -1;
|
|
|
|
|
2012-12-13 05:51:49 +08:00
|
|
|
#ifdef CONFIG_HIGHMEM
|
|
|
|
/*
|
|
|
|
* If we have movable node, node_states[N_HIGH_MEMORY]
|
|
|
|
* contains nodes which have zones of 0...ZONE_HIGHMEM,
|
|
|
|
* set zone_last to ZONE_HIGHMEM.
|
|
|
|
*
|
|
|
|
* If we don't have movable node, node_states[N_HIGH_MEMORY]
|
|
|
|
* contains nodes which have zones of 0...ZONE_MOVABLE,
|
|
|
|
* set zone_last to ZONE_MOVABLE.
|
|
|
|
*/
|
|
|
|
zone_last = ZONE_HIGHMEM;
|
|
|
|
if (N_MEMORY == N_HIGH_MEMORY)
|
|
|
|
zone_last = ZONE_MOVABLE;
|
|
|
|
|
|
|
|
for (; zt <= zone_last; zt++)
|
|
|
|
present_pages += pgdat->node_zones[zt].present_pages;
|
|
|
|
if (zone_idx(zone) <= zone_last && nr_pages >= present_pages)
|
|
|
|
arg->status_change_nid_high = zone_to_nid(zone);
|
|
|
|
else
|
|
|
|
arg->status_change_nid_high = -1;
|
|
|
|
#else
|
|
|
|
arg->status_change_nid_high = arg->status_change_nid_normal;
|
|
|
|
#endif
|
|
|
|
|
2012-12-12 08:01:03 +08:00
|
|
|
/*
|
|
|
|
* node_states[N_HIGH_MEMORY] contains nodes which have 0...ZONE_MOVABLE
|
|
|
|
*/
|
|
|
|
zone_last = ZONE_MOVABLE;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* check whether node_states[N_HIGH_MEMORY] will be changed
|
|
|
|
* If we try to offline the last present @nr_pages from the node,
|
|
|
|
* we can determine that we will need to clear the node from
|
|
|
|
* node_states[N_HIGH_MEMORY].
|
|
|
|
*/
|
|
|
|
for (; zt <= zone_last; zt++)
|
|
|
|
present_pages += pgdat->node_zones[zt].present_pages;
|
|
|
|
if (nr_pages >= present_pages)
|
|
|
|
arg->status_change_nid = zone_to_nid(zone);
|
|
|
|
else
|
|
|
|
arg->status_change_nid = -1;
|
|
|
|
}
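The first check in node_states_check_changes_offline() above boils down to "does offlining nr_pages empty every zone from 0 up to zone_last?". A minimal user-space sketch of just that test follows. The zone indices, the present[] array and status_change_nid() are hypothetical names for illustration; the sketch does not reproduce the way the kernel function carries present_pages across its later checks.

```c
#include <assert.h>

/* Hypothetical zone indices for the sketch. */
enum { Z_DMA, Z_NORMAL, Z_MOVABLE, Z_MAX };

/*
 * Returns nid if removing nr_pages from zone zidx would leave zones
 * 0..zone_last without present pages, otherwise -1.
 */
static int status_change_nid(const unsigned long present[Z_MAX], int zidx,
			     int zone_last, unsigned long nr_pages, int nid)
{
	unsigned long sum = 0;
	int zt;

	assert(zone_last >= 0 && zone_last < Z_MAX);
	for (zt = 0; zt <= zone_last; zt++)
		sum += present[zt];

	if (zidx <= zone_last && nr_pages >= sum)
		return nid;
	return -1;
}
```

With present = {0, 1000, 500}, offlining all 1000 pages of the normal zone with zone_last set to Z_NORMAL reports the node, while offlining only 500 of them, or offlining pages from the movable zone, reports -1.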
|
|
|
|
|
|
|
|
static void node_states_clear_node(int node, struct memory_notify *arg)
|
|
|
|
{
|
|
|
|
if (arg->status_change_nid_normal >= 0)
|
|
|
|
node_clear_state(node, N_NORMAL_MEMORY);
|
|
|
|
|
2012-12-13 05:51:49 +08:00
|
|
|
if ((N_MEMORY != N_NORMAL_MEMORY) &&
|
|
|
|
(arg->status_change_nid_high >= 0))
|
2012-12-12 08:01:03 +08:00
|
|
|
node_clear_state(node, N_HIGH_MEMORY);
|
2012-12-13 05:51:49 +08:00
|
|
|
|
|
|
|
if ((N_MEMORY != N_HIGH_MEMORY) &&
|
|
|
|
(arg->status_change_nid >= 0))
|
|
|
|
node_clear_state(node, N_MEMORY);
|
2012-12-12 08:01:03 +08:00
|
|
|
}
|
|
|
|
|
2012-10-09 07:33:58 +08:00
|
|
|
static int __ref __offline_pages(unsigned long start_pfn,
|
2017-11-16 09:33:38 +08:00
|
|
|
unsigned long end_pfn)
|
2007-10-16 16:26:12 +08:00
|
|
|
{
|
2017-11-16 09:33:38 +08:00
|
|
|
unsigned long pfn, nr_pages;
|
2007-10-16 16:26:12 +08:00
|
|
|
long offlined_pages;
|
2017-11-16 09:33:34 +08:00
|
|
|
int ret, node;
|
2013-07-04 06:02:11 +08:00
|
|
|
unsigned long flags;
|
2017-02-04 05:13:23 +08:00
|
|
|
unsigned long valid_start, valid_end;
|
2007-10-16 16:26:12 +08:00
|
|
|
struct zone *zone;
|
2007-10-22 07:41:36 +08:00
|
|
|
struct memory_notify arg;
|
2007-10-16 16:26:12 +08:00
|
|
|
|
|
|
|
/* at least, alignment against pageblock is necessary */
|
|
|
|
if (!IS_ALIGNED(start_pfn, pageblock_nr_pages))
|
|
|
|
return -EINVAL;
|
|
|
|
if (!IS_ALIGNED(end_pfn, pageblock_nr_pages))
|
|
|
|
return -EINVAL;
|
|
|
|
/* This makes hotplug much easier...and readable.
|
|
|
|
we assume this for now. */
|
2017-02-04 05:13:23 +08:00
|
|
|
if (!test_pages_in_a_zone(start_pfn, end_pfn, &valid_start, &valid_end))
|
2007-10-16 16:26:12 +08:00
|
|
|
return -EINVAL;
|
2007-10-22 07:41:36 +08:00
|
|
|
|
2017-02-04 05:13:23 +08:00
|
|
|
zone = page_zone(pfn_to_page(valid_start));
|
2007-10-22 07:41:36 +08:00
|
|
|
node = zone_to_nid(zone);
|
|
|
|
nr_pages = end_pfn - start_pfn;
|
|
|
|
|
2007-10-16 16:26:12 +08:00
|
|
|
/* set above range as isolated */
|
2012-12-12 08:00:45 +08:00
|
|
|
ret = start_isolate_page_range(start_pfn, end_pfn,
|
|
|
|
MIGRATE_MOVABLE, true);
|
2007-10-16 16:26:12 +08:00
|
|
|
if (ret)
|
2015-04-15 06:45:11 +08:00
|
|
|
return ret;
|
2007-10-22 07:41:36 +08:00
|
|
|
|
|
|
|
arg.start_pfn = start_pfn;
|
|
|
|
arg.nr_pages = nr_pages;
|
2012-12-12 08:01:03 +08:00
|
|
|
node_states_check_changes_offline(nr_pages, zone, &arg);
|
2007-10-22 07:41:36 +08:00
|
|
|
|
|
|
|
ret = memory_notify(MEM_GOING_OFFLINE, &arg);
|
|
|
|
ret = notifier_to_errno(ret);
|
|
|
|
if (ret)
|
|
|
|
goto failed_removal;
|
|
|
|
|
2007-10-16 16:26:12 +08:00
|
|
|
pfn = start_pfn;
|
|
|
|
repeat:
|
|
|
|
/* start memory hot removal */
|
|
|
|
ret = -EINTR;
|
|
|
|
if (signal_pending(current))
|
|
|
|
goto failed_removal;
|
2017-11-16 09:33:34 +08:00
|
|
|
|
|
|
|
cond_resched();
|
2018-02-01 08:16:19 +08:00
|
|
|
lru_add_drain_all();
|
2017-11-16 09:33:34 +08:00
|
|
|
drain_all_pages(zone);
|
2007-10-16 16:26:12 +08:00
|
|
|
|
2013-09-12 05:22:09 +08:00
|
|
|
pfn = scan_movable_pages(start_pfn, end_pfn);
|
|
|
|
if (pfn) { /* We have movable pages */
|
2007-10-16 16:26:12 +08:00
|
|
|
ret = do_migrate_range(pfn, end_pfn);
|
2017-11-16 09:33:34 +08:00
|
|
|
goto repeat;
|
2007-10-16 16:26:12 +08:00
|
|
|
}
|
2017-11-16 09:33:34 +08:00
|
|
|
|
2013-09-12 05:22:09 +08:00
|
|
|
/*
|
|
|
|
* dissolve free hugepages in the memory block before actually offlining
|
|
|
|
* it, in order to keep hugetlbfs's object counting consistent.
|
|
|
|
*/
|
2016-10-08 08:01:10 +08:00
|
|
|
ret = dissolve_free_huge_pages(start_pfn, end_pfn);
|
|
|
|
if (ret)
|
|
|
|
goto failed_removal;
|
2007-10-16 16:26:12 +08:00
|
|
|
/* check again */
|
|
|
|
offlined_pages = check_pages_isolated(start_pfn, end_pfn);
|
2017-11-16 09:33:34 +08:00
|
|
|
if (offlined_pages < 0)
|
|
|
|
goto repeat;
|
2016-03-18 05:19:35 +08:00
|
|
|
pr_info("Offlined Pages %ld\n", offlined_pages);
|
2012-09-20 09:48:02 +08:00
|
|
|
/* Ok, all of our target is isolated.
|
2007-10-16 16:26:12 +08:00
|
|
|
We cannot do rollback at this point. */
|
|
|
|
offline_isolated_pages(start_pfn, end_pfn);
|
2007-11-15 08:59:12 +08:00
|
|
|
/* reset pagetype flags and make the migrate type MOVABLE */
|
2012-04-03 21:06:15 +08:00
|
|
|
undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
|
2007-10-16 16:26:12 +08:00
|
|
|
/* removal success */
|
2013-07-04 06:03:21 +08:00
|
|
|
adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
|
2007-10-16 16:26:12 +08:00
|
|
|
zone->present_pages -= offlined_pages;
|
2013-07-04 06:02:11 +08:00
|
|
|
|
|
|
|
pgdat_resize_lock(zone->zone_pgdat, &flags);
|
2007-10-16 16:26:12 +08:00
|
|
|
zone->zone_pgdat->node_present_pages -= offlined_pages;
|
2013-07-04 06:02:11 +08:00
|
|
|
pgdat_resize_unlock(zone->zone_pgdat, &flags);
|
2007-10-22 07:41:36 +08:00
|
|
|
|
2011-05-25 08:11:32 +08:00
|
|
|
init_per_zone_wmark_min();
|
|
|
|
|
2012-10-09 07:31:51 +08:00
|
|
|
if (!populated_zone(zone)) {
|
2012-08-01 07:43:32 +08:00
|
|
|
zone_pcp_reset(zone);
|
2017-09-07 07:20:24 +08:00
|
|
|
build_all_zonelists(NULL);
|
2012-10-09 07:31:51 +08:00
|
|
|
} else
|
|
|
|
zone_pcp_update(zone);
|
2012-08-01 07:43:32 +08:00
|
|
|
|
2012-12-12 08:01:03 +08:00
|
|
|
node_states_clear_node(node, &arg);
|
mm, compaction: introduce kcompactd
Memory compaction can be currently performed in several contexts:
- kswapd balancing a zone after a high-order allocation failure
- direct compaction to satisfy a high-order allocation, including THP
page fault attempts
- khugepaged trying to collapse a hugepage
- manually from /proc
The purpose of compaction is two-fold. The obvious purpose is to
satisfy a (pending or future) high-order allocation, and is easy to
evaluate. The other purpose is to keep overall memory fragmentation low
and help the anti-fragmentation mechanism. The success wrt the latter
purpose is more
The current situation wrt the purposes has a few drawbacks:
- compaction is invoked only when a high-order page or hugepage is not
available (or manually). This might be too late for the purposes of
keeping memory fragmentation low.
- direct compaction increases latency of allocations. Again, it would
be better if compaction was performed asynchronously to keep
fragmentation low, before the allocation itself comes.
- (a special case of the previous) the cost of compaction during THP
page faults can easily offset the benefits of THP.
- kswapd compaction appears to be complex, fragile and not working in
some scenarios. It could also end up compacting for a high-order
allocation request when it should be reclaiming memory for a later
order-0 request.
To improve the situation, we should be able to benefit from an
equivalent of kswapd, but for compaction - i.e. a background thread
which responds to fragmentation and the need for high-order allocations
(including hugepages) somewhat proactively.
One possibility is to extend the responsibilities of kswapd, which could
however complicate its design too much. It should be better to let
kswapd handle reclaim, as order-0 allocations are often more critical
than high-order ones.
Another possibility is to extend khugepaged, but this kthread is a
single instance and tied to THP configs.
This patch goes with the option of a new set of per-node kthreads called
kcompactd, and lays the foundations, without introducing any new
tunables. The lifecycle mimics kswapd kthreads, including the memory
hotplug hooks.
For compaction, kcompactd uses the standard compaction_suitable() and
compact_finished() criteria and the deferred compaction functionality.
Unlike direct compaction, it uses only sync compaction, as there's no
allocation latency to minimize.
This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
compact/reclaim loop for high-order pages will be replaced by waking up
kcompactd in the next patch with the description of what's wrong with
the old approach.
Waking up of the kcompactd threads is also tied to kswapd activity and
follows these rules:
- we don't want to affect any fastpaths, so wake up kcompactd only from
the slowpath, as it's done for kswapd
- if kswapd is doing reclaim, it's more important than compaction, so
don't invoke kcompactd until kswapd goes to sleep
- the target order used for kswapd is passed to kcompactd
Possible future uses for kcompactd include the ability to wake up
kcompactd on demand in special situations, such as when hugepages are
not available (currently not done due to __GFP_NO_KSWAPD) or when a
fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also
possible to perform periodic compaction with kcompactd.
[arnd@arndb.de: fix build errors with kcompactd]
[paul.gortmaker@windriver.com: don't use modular references for non modular code]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-18 05:18:08 +08:00
|
|
|
if (arg.status_change_nid >= 0) {
|
2009-12-15 09:58:33 +08:00
|
|
|
kswapd_stop(node);
|
2016-03-18 05:18:08 +08:00
|
|
|
kcompactd_stop(node);
|
|
|
|
}
|
2009-06-17 06:32:50 +08:00
|
|
|
|
2007-10-16 16:26:12 +08:00
|
|
|
vm_total_pages = nr_free_pagecache_pages();
|
|
|
|
writeback_set_ratelimit();
|
2007-10-22 07:41:36 +08:00
|
|
|
|
|
|
|
memory_notify(MEM_OFFLINE, &arg);
|
2007-10-16 16:26:12 +08:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
failed_removal:
|
2016-03-18 05:19:35 +08:00
|
|
|
pr_debug("memory offlining [mem %#010llx-%#010llx] failed\n",
|
|
|
|
(unsigned long long) start_pfn << PAGE_SHIFT,
|
|
|
|
((unsigned long long) end_pfn << PAGE_SHIFT) - 1);
|
2007-10-22 07:41:36 +08:00
|
|
|
memory_notify(MEM_CANCEL_OFFLINE, &arg);
|
2007-10-16 16:26:12 +08:00
|
|
|
/* pushback to free area */
|
2012-04-03 21:06:15 +08:00
|
|
|
undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
|
2007-10-16 16:26:12 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2008-10-19 11:25:58 +08:00
|
|
|
|
2017-09-07 07:20:37 +08:00
|
|
|
/* Must be protected by mem_hotplug_begin() or a device_lock */
|
2012-10-09 07:33:58 +08:00
|
|
|
int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
|
|
|
|
{
|
2017-11-16 09:33:38 +08:00
|
|
|
return __offline_pages(start_pfn, start_pfn + nr_pages);
|
2012-10-09 07:33:58 +08:00
|
|
|
}
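The range handling around offline_pages() can be checked with plain arithmetic: the pfn range is [start_pfn, start_pfn + nr_pages), both ends must be pageblock aligned, and the failure message prints the matching physical byte range. The sketch below assumes 4 KiB pages (shift of 12) and 512-page pageblocks; both values and all SKETCH_ names are assumptions for illustration, since the real values depend on architecture and configuration.

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_PAGE_SHIFT		12	/* assumed 4 KiB pages */
#define SKETCH_PAGEBLOCK_NR_PAGES	512UL	/* assumed pageblock size */
#define SKETCH_IS_ALIGNED(x, a)		(((x) & ((a) - 1)) == 0)

/* Mirrors the two alignment checks at the top of __offline_pages(). */
static int check_offline_range(unsigned long start_pfn, unsigned long nr_pages)
{
	unsigned long end_pfn = start_pfn + nr_pages;

	if (!SKETCH_IS_ALIGNED(start_pfn, SKETCH_PAGEBLOCK_NR_PAGES))
		return -EINVAL;
	if (!SKETCH_IS_ALIGNED(end_pfn, SKETCH_PAGEBLOCK_NR_PAGES))
		return -EINVAL;
	return 0;
}

/* First physical byte covered by the range. */
static unsigned long long range_first_byte(unsigned long start_pfn)
{
	return (unsigned long long)start_pfn << SKETCH_PAGE_SHIFT;
}

/* Last physical byte covered by the range (end_pfn is exclusive). */
static unsigned long long range_last_byte(unsigned long end_pfn)
{
	return ((unsigned long long)end_pfn << SKETCH_PAGE_SHIFT) - 1;
}
```

Under these assumptions, offlining 0x8000 pages starting at pfn 0x8000 passes the checks and covers physical addresses 0x8000000 through 0xfffffff, while a request for 100 pages is rejected with -EINVAL.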
|
2013-05-08 06:29:49 +08:00
|
|
|
#endif /* CONFIG_MEMORY_HOTREMOVE */
|
2012-10-09 07:33:58 +08:00
|
|
|
|
2013-02-23 08:32:54 +08:00
|
|
|
/**
|
|
|
|
* walk_memory_range - walks through all mem sections in [start_pfn, end_pfn)
|
|
|
|
* @start_pfn: start pfn of the memory range
|
2013-04-30 06:06:16 +08:00
|
|
|
* @end_pfn: end pfn of the memory range
|
2013-02-23 08:32:54 +08:00
|
|
|
* @arg: argument passed to func
|
|
|
|
* @func: callback for each memory section walked
|
|
|
|
*
|
|
|
|
* This function walks through all present mem sections in range
|
|
|
|
* [start_pfn, end_pfn) and calls func on each mem section.
|
|
|
|
*
|
|
|
|
* Returns the return value of func.
|
|
|
|
*/
|
2013-05-08 06:29:49 +08:00
|
|
|
int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
|
2013-02-23 08:32:54 +08:00
|
|
|
void *arg, int (*func)(struct memory_block *, void *))
|
2008-10-19 11:25:58 +08:00
|
|
|
{
|
2012-10-09 07:34:01 +08:00
|
|
|
struct memory_block *mem = NULL;
|
|
|
|
struct mem_section *section;
|
|
|
|
unsigned long pfn, section_nr;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
|
|
|
|
section_nr = pfn_to_section_nr(pfn);
|
|
|
|
if (!present_section_nr(section_nr))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
section = __nr_to_section(section_nr);
|
|
|
|
/* same memblock? */
|
|
|
|
if (mem)
|
|
|
|
if ((section_nr >= mem->start_section_nr) &&
|
|
|
|
(section_nr <= mem->end_section_nr))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
mem = find_memory_block_hinted(section, mem);
|
|
|
|
if (!mem)
|
|
|
|
continue;
|
|
|
|
|
2013-02-23 08:32:54 +08:00
|
|
|
ret = func(mem, arg);
|
2012-10-09 07:34:01 +08:00
|
|
|
if (ret) {
|
2013-02-23 08:32:54 +08:00
|
|
|
kobject_put(&mem->dev.kobj);
|
|
|
|
return ret;
|
2012-10-09 07:34:01 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (mem)
|
|
|
|
kobject_put(&mem->dev.kobj);
|
|
|
|
|
2013-02-23 08:32:54 +08:00
|
|
|
return 0;
|
|
|
|
}
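walk_memory_range() above steps through sections but invokes the callback only once per memory block, skipping sections that fall inside the block it has already visited. A minimal user-space sketch of that skipping logic follows. struct mem_block_sketch, walk_blocks() and the fixed secs_per_block parameter are hypothetical; the kernel looks blocks up with find_memory_block_hinted() and manages kobject references, which the sketch leaves out.

```c
#include <assert.h>

/* Hypothetical stand-in for a memory block: an inclusive section range. */
struct mem_block_sketch {
	unsigned long start_section_nr;
	unsigned long end_section_nr;
};

typedef int (*block_cb)(const struct mem_block_sketch *blk, void *arg);

/*
 * Walks sections [first_sec, end_sec) and calls func once per block.
 * Blocks are assumed to be secs_per_block sections long and aligned.
 */
static int walk_blocks(unsigned long first_sec, unsigned long end_sec,
		       unsigned long secs_per_block, void *arg, block_cb func)
{
	struct mem_block_sketch cur = { 0, 0 };
	unsigned long sec;
	int have_block = 0;
	int ret;

	assert(secs_per_block > 0);
	for (sec = first_sec; sec < end_sec; sec++) {
		/* same block as the previous section? then skip */
		if (have_block && sec >= cur.start_section_nr &&
		    sec <= cur.end_section_nr)
			continue;

		cur.start_section_nr = sec - sec % secs_per_block;
		cur.end_section_nr = cur.start_section_nr + secs_per_block - 1;
		have_block = 1;

		ret = func(&cur, arg);
		if (ret)
			return ret;	/* first non-zero result ends the walk */
	}
	return 0;
}

/* Example callback: counts how many blocks were visited. */
static int count_blocks_cb(const struct mem_block_sketch *blk, void *arg)
{
	(void)blk;
	++*(int *)arg;
	return 0;
}
```

Walking sections 0 through 7 with four sections per block calls the callback twice, once for each block, and the same holds for a walk that starts mid-block at section 2.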
|
|
|
|
|
2013-05-08 06:29:49 +08:00
|
|
|
#ifdef CONFIG_MEMORY_HOTREMOVE
|
2013-11-13 07:07:20 +08:00
|
|
|
static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
|
2013-02-23 08:32:54 +08:00
|
|
|
{
|
|
|
|
int ret = !is_memblock_offlined(mem);
|
|
|
|
|
2013-04-30 06:08:49 +08:00
|
|
|
if (unlikely(ret)) {
|
|
|
|
phys_addr_t beginpa, endpa;
|
|
|
|
|
|
|
|
beginpa = PFN_PHYS(section_nr_to_pfn(mem->start_section_nr));
|
|
|
|
endpa = PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1))-1;
|
2016-03-18 05:19:47 +08:00
|
|
|
pr_warn("removing memory fails, because memory [%pa-%pa] is onlined\n",
|
2013-04-30 06:08:49 +08:00
|
|
|
&beginpa, &endpa);
|
|
|
|
}
|
2013-02-23 08:32:54 +08:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-09-12 05:21:50 +08:00
|
|
|
static int check_cpu_on_node(pg_data_t *pgdat)
|
2013-02-23 08:33:14 +08:00
|
|
|
{
|
|
|
|
int cpu;
|
|
|
|
|
|
|
|
for_each_present_cpu(cpu) {
|
|
|
|
if (cpu_to_node(cpu) == pgdat->node_id)
|
|
|
|
/*
|
|
|
|
* the cpu on this node isn't removed, and we can't
|
|
|
|
* offline this node.
|
|
|
|
*/
|
|
|
|
return -EBUSY;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2013-09-12 05:21:50 +08:00
|
|
|
static void unmap_cpu_on_node(pg_data_t *pgdat)
|
2013-02-23 08:33:31 +08:00
|
|
|
{
|
|
|
|
#ifdef CONFIG_ACPI_NUMA
|
|
|
|
int cpu;
|
|
|
|
|
|
|
|
for_each_possible_cpu(cpu)
|
|
|
|
if (cpu_to_node(cpu) == pgdat->node_id)
|
|
|
|
numa_clear_node(cpu);
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2013-09-12 05:21:50 +08:00
|
|
|
static int check_and_unmap_cpu_on_node(pg_data_t *pgdat)
|
2013-02-23 08:33:31 +08:00
|
|
|
{
|
2013-09-12 05:21:50 +08:00
|
|
|
int ret;
|
2013-02-23 08:33:31 +08:00
|
|
|
|
2013-09-12 05:21:50 +08:00
|
|
|
ret = check_cpu_on_node(pgdat);
|
2013-02-23 08:33:31 +08:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* the node will be offlined when we come here, so we can clear
|
|
|
|
* the cpu_to_node() now.
|
|
|
|
*/
|
|
|
|
|
2013-09-12 05:21:50 +08:00
|
|
|
unmap_cpu_on_node(pgdat);
|
2013-02-23 08:33:31 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2013-09-12 05:21:50 +08:00
|
|
|
/**
|
|
|
|
* try_offline_node
|
2018-04-06 07:24:57 +08:00
|
|
|
* @nid: the node ID
|
2013-09-12 05:21:50 +08:00
|
|
|
*
|
|
|
|
* Offline a node if all memory sections and cpus of the node are removed.
|
|
|
|
*
|
|
|
|
* NOTE: The caller must call lock_device_hotplug() to serialize hotplug
|
|
|
|
* and online/offline operations before this call.
|
|
|
|
*/
|
2013-02-23 08:33:27 +08:00
|
|
|
void try_offline_node(int nid)
|
2013-02-23 08:33:14 +08:00
|
|
|
{
|
2013-02-23 08:33:16 +08:00
|
|
|
pg_data_t *pgdat = NODE_DATA(nid);
|
|
|
|
unsigned long start_pfn = pgdat->node_start_pfn;
|
|
|
|
unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
|
2013-02-23 08:33:14 +08:00
|
|
|
unsigned long pfn;
|
|
|
|
|
|
|
|
for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
|
|
|
|
unsigned long section_nr = pfn_to_section_nr(pfn);
|
|
|
|
|
|
|
|
if (!present_section_nr(section_nr))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (pfn_to_nid(pfn) != nid)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* some memory sections of this node are not removed, and we
|
|
|
|
* can't offline node now.
|
|
|
|
*/
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2013-09-12 05:21:50 +08:00
|
|
|
if (check_and_unmap_cpu_on_node(pgdat))
|
2013-02-23 08:33:14 +08:00
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* all memory/cpu of this node are removed, we can offline this
|
|
|
|
* node now.
|
|
|
|
*/
|
|
|
|
node_set_offline(nid);
|
|
|
|
unregister_one_node(nid);
|
|
|
|
}
|
2013-02-23 08:33:27 +08:00
|
|
|
EXPORT_SYMBOL(try_offline_node);
|
2013-02-23 08:33:14 +08:00
|
|
|
|
2013-09-12 05:21:50 +08:00
|
|
|
/**
|
|
|
|
* remove_memory - remove an offlined memory region from the system
|
2018-04-06 07:24:57 +08:00
|
|
|
* @nid: the node ID
|
|
|
|
* @start: physical address of the region to remove
|
|
|
|
* @size: size of the region to remove
|
2013-09-12 05:21:50 +08:00
|
|
|
*
|
|
|
|
* NOTE: The caller must call lock_device_hotplug() to serialize hotplug
|
|
|
|
* and online/offline operations before this call, as required by
|
|
|
|
* try_offline_node().
|
|
|
|
*/
|
2013-05-27 18:58:46 +08:00
|
|
|
void __ref remove_memory(int nid, u64 start, u64 size)
|
2013-02-23 08:32:54 +08:00
|
|
|
{
|
2013-05-27 18:58:46 +08:00
|
|
|
int ret;
|
memory-hotplug: try to offline the memory twice to avoid dependence
Memory can fail to be offlined when CONFIG_MEMCG is selected. For example:
there is a memory device on node 1 with the address range [1G, 1.5G).
You will find 4 new directories, memory8, memory9, memory10, and memory11,
under the directory /sys/devices/system/memory/.
If CONFIG_MEMCG is selected, we allocate memory to store the page
cgroup when we online pages. When we online memory8, the memory that
stores its page cgroup is not provided by this memory device. But when we
online memory9, the memory that stores its page cgroup may be provided by
memory8, so memory8 can't be offlined while memory9 is online. We should
offline the memory in the reverse order.
When the memory device is hot-removed, we automatically offline the memory
provided by this memory device. But we don't know which memory was
onlined first, so offlining memory may fail. In that case, iterate
twice to offline the memory. 1st iteration: offline every non-primary
memory block. 2nd iteration: offline the primary (i.e. first added) memory
block.
This idea is suggested by KOSAKI Motohiro.
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 08:32:50 +08:00
|
|
|
|
2013-09-12 05:21:49 +08:00
|
|
|
BUG_ON(check_hotplug_memory_range(start, size));
|
|
|
|
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it
myself, because it used an rw semaphore for get/put_online_mems,
making them deadlock prone. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
mem_hotplug_begin();
|
2013-02-23 08:32:52 +08:00
|
|
|
|
|
|
|
/*
|
2013-05-27 18:58:46 +08:00
|
|
|
* All memory blocks must be offlined before removing memory. Check
|
|
|
|
* whether all memory blocks in question are offline and trigger a BUG()
|
|
|
|
* if this is not the case.
|
2013-02-23 08:32:52 +08:00
|
|
|
*/
|
2013-05-27 18:58:46 +08:00
|
|
|
ret = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
|
2013-11-13 07:07:20 +08:00
|
|
|
check_memblock_offlined_cb);
|
2014-06-05 07:07:18 +08:00
|
|
|
if (ret)
|
2013-05-27 18:58:46 +08:00
|
|
|
BUG();
|
2013-02-23 08:32:52 +08:00
|
|
|
|
2013-02-23 08:32:56 +08:00
|
|
|
/* remove memmap entry */
|
|
|
|
firmware_map_remove(start, start + size, "System RAM");
|
2015-08-15 06:35:16 +08:00
|
|
|
memblock_free(start, size);
|
|
|
|
memblock_remove(start, size);
|
2013-02-23 08:32:56 +08:00
|
|
|
|
2017-12-29 15:53:55 +08:00
|
|
|
arch_remove_memory(start, size, NULL);
|
2013-02-23 08:32:58 +08:00
|
|
|
|
2013-02-23 08:33:14 +08:00
|
|
|
try_offline_node(nid);
|
|
|
|
|
2014-06-05 07:07:18 +08:00
|
|
|
mem_hotplug_done();
|
2008-10-19 11:25:58 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(remove_memory);
|
2013-06-02 04:24:07 +08:00
|
|
|
#endif /* CONFIG_MEMORY_HOTREMOVE */
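remove_memory() converts the byte range [start, start + size) into page frame numbers with PFN_DOWN() and PFN_UP() before walking the memory blocks. The sketch below reproduces that rounding in user space. It assumes 4 KiB pages, and the `sketch_` names are hypothetical, chosen to avoid suggesting they are kernel symbols.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)

/* Round a physical address down to the page frame that contains it. */
uint64_t sketch_pfn_down(uint64_t addr)
{
	return addr >> SKETCH_PAGE_SHIFT;
}

/* Round a physical address up to the next page frame boundary. */
uint64_t sketch_pfn_up(uint64_t addr)
{
	return (addr + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}
```

For the [1G, 1.5G) device used as an example in the commit message above, sketch_pfn_down(start) gives 0x40000 and sketch_pfn_up(start + size - 1) gives 0x60000, so the walk covers exactly the frames of that region under the 4 KiB assumption.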
|