Split page table lock
=====================

Originally, the mm->page_table_lock spinlock protected all page tables of
the mm_struct. This approach leads to poor page fault scalability in
multi-threaded applications due to high contention on the lock. To improve
scalability, the split page table lock was introduced.

With the split page table lock, each table has its own lock to serialize
access to it. At the moment we use split locks for PTE and PMD tables.
Access to higher-level tables is protected by mm->page_table_lock.

There are helpers to lock/unlock a table and other accessor functions:
 - pte_offset_map_lock()
	maps pte and takes PTE table lock, returns pointer to the taken
	lock;
 - pte_unmap_unlock()
	unlocks and unmaps PTE table;
 - pte_alloc_map_lock()
	allocates PTE table if needed and takes the lock, returns pointer
	to the taken lock or NULL if allocation failed;
 - pte_lockptr()
	returns pointer to PTE table lock;
 - pmd_lock()
	takes PMD table lock, returns pointer to taken lock;
 - pmd_lockptr()
	returns pointer to PMD table lock;
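
The lock / modify / unlock pattern behind these helpers can be modelled in
user space. The sketch below is not kernel code: a C11 atomic_flag stands in
for spinlock_t, and every model_* name is hypothetical. It assumes the taken
lock is handed back through an out-parameter, and only illustrates that each
table carries its own lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MODEL_ENTRIES 4

/* Hypothetical user-space stand-in for spinlock_t. */
struct model_spinlock {
	atomic_flag locked;
};

void model_spin_init(struct model_spinlock *l)
{
	atomic_flag_clear(&l->locked);
}

void model_spin_lock(struct model_spinlock *l)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&l->locked,
						 memory_order_acquire))
		;
}

void model_spin_unlock(struct model_spinlock *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* One table together with its own lock (the "split" lock). */
struct model_table {
	struct model_spinlock ptl;
	unsigned long entry[MODEL_ENTRIES];
};

void model_table_init(struct model_table *t)
{
	model_spin_init(&t->ptl);
	for (size_t i = 0; i < MODEL_ENTRIES; i++)
		t->entry[i] = 0;
}

/* Analogue of pte_lockptr(): return the table's lock without taking it. */
struct model_spinlock *model_lockptr(struct model_table *t)
{
	return &t->ptl;
}

/* Analogue of pte_offset_map_lock(): take the lock, hand back entry + lock. */
unsigned long *model_offset_lock(struct model_table *t, size_t idx,
				 struct model_spinlock **ptlp)
{
	struct model_spinlock *ptl = model_lockptr(t);

	assert(idx < MODEL_ENTRIES);
	model_spin_lock(ptl);
	*ptlp = ptl;
	return &t->entry[idx];
}

/* Analogue of pte_unmap_unlock(): release the lock taken above. */
void model_unlock(unsigned long *entry, struct model_spinlock *ptl)
{
	(void)entry;	/* nothing to unmap in the model */
	model_spin_unlock(ptl);
}

/* Typical caller: lock, modify, unlock. */
void model_set_entry(struct model_table *t, size_t idx, unsigned long val)
{
	struct model_spinlock *ptl;
	unsigned long *e = model_offset_lock(t, idx, &ptl);

	*e = val;
	model_unlock(e, ptl);
}
```

Because the lock lives in the table, holding the lock of one model_table
does not block updates to another one, which is the point of splitting.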

Split page table lock for PTE tables is enabled at compile time if
CONFIG_SPLIT_PTLOCK_CPUS (usually 4) is less than or equal to NR_CPUS.
If split lock is disabled, all tables are guarded by mm->page_table_lock.

Split page table lock for PMD tables is enabled if it is enabled for PTE
tables and the architecture supports it (see below).
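
The two conditions above can be written down as predicates. This is only a
sketch: in the kernel the decision is made at compile time from the
configuration, while the hypothetical model_* functions below take the
values as run-time arguments so that both outcomes can be checked.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * PTE split lock: enabled when the CONFIG_SPLIT_PTLOCK_CPUS threshold
 * is less than or equal to NR_CPUS.
 */
bool model_use_split_pte_ptlocks(unsigned int nr_cpus,
				 unsigned int split_ptlock_cpus)
{
	assert(split_ptlock_cpus > 0);	/* model assumption: sane threshold */
	return split_ptlock_cpus <= nr_cpus;
}

/*
 * PMD split lock: needs the PTE split lock plus architecture support
 * (modelled here by the arch_enable_split_pmd flag).
 */
bool model_use_split_pmd_ptlocks(unsigned int nr_cpus,
				 unsigned int split_ptlock_cpus,
				 bool arch_enable_split_pmd)
{
	return model_use_split_pte_ptlocks(nr_cpus, split_ptlock_cpus) &&
	       arch_enable_split_pmd;
}
```

For example, with the usual threshold of 4, a 2-CPU configuration keeps
everything under mm->page_table_lock, while a 4-CPU one uses split locks.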

Hugetlb and split page table lock
---------------------------------

Hugetlb can support several page sizes. We use the split lock only at the
PMD level, not at the PUD level.

Hugetlb-specific helpers:
 - huge_pte_lock()
	takes pmd split lock for PMD_SIZE page, mm->page_table_lock
	otherwise;
 - huge_pte_lockptr()
	returns pointer to table lock;
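
The lock selection described for these helpers can be sketched as follows.
This is a user-space model, not the kernel implementation: the model_*
types and functions are hypothetical, the lock is reduced to a flag, and
MODEL_PMD_SIZE is assumed to be 2 MiB purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PMD_SIZE (2UL << 20)	/* assumption: 2 MiB PMD-sized page */

/* Lock reduced to a flag; enough to show which lock gets picked. */
struct model_lock {
	int held;
};

struct model_mm {
	struct model_lock page_table_lock;	/* per-mm fallback lock */
};

struct model_pmd_table {
	struct model_lock ptl;			/* split lock of one PMD table */
};

/* Analogue of huge_pte_lockptr(): pick the lock, do not take it. */
struct model_lock *model_huge_pte_lockptr(unsigned long huge_page_size,
					  struct model_mm *mm,
					  struct model_pmd_table *table)
{
	assert(huge_page_size != 0);
	if (huge_page_size == MODEL_PMD_SIZE)
		return &table->ptl;
	return &mm->page_table_lock;
}

/* Analogue of huge_pte_lock(): pick the lock and take it. */
struct model_lock *model_huge_pte_lock(unsigned long huge_page_size,
				       struct model_mm *mm,
				       struct model_pmd_table *table)
{
	struct model_lock *ptl;

	ptl = model_huge_pte_lockptr(huge_page_size, mm, table);
	ptl->held = 1;
	return ptl;
}
```

A PMD_SIZE page ends up under the per-table lock; any other huge page size
(a PUD-sized page, for instance) falls back to the per-mm lock.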

Support of split page table lock by an architecture
---------------------------------------------------

No special enabling is needed for the PTE split page table lock:
everything required is done by pgtable_page_ctor() and pgtable_page_dtor(),
which must be called on PTE table allocation / freeing.

Make sure the architecture doesn't use the slab allocator for page table
allocation: slab uses page->slab_cache for its pages, and this field
shares storage with page->ptl.

PMD split lock only makes sense if you have more than two page table
levels.

Enabling the PMD split lock requires calling pgtable_pmd_page_ctor() on
PMD table allocation and pgtable_pmd_page_dtor() on freeing.

Allocation usually happens in pmd_alloc_one(), freeing in pmd_free() and
pmd_free_tlb(), but make sure you cover all PMD table allocation / freeing
paths: e.g. X86_PAE preallocates a few PMDs in pgd_alloc().

With everything in place you can set CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK.

NOTE: pgtable_page_ctor() and pgtable_pmd_page_ctor() can fail -- the
failure must be handled properly.
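
One way to handle a constructor failure is to undo the table allocation and
report the failure to the caller. The sketch below models that in user
space: malloc()/free() stand in for the page allocator, all model_* names
are hypothetical, and the fail_ctor argument exists only to simulate a
failing constructor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the struct page backing a page table. */
struct model_page {
	void *ptl;	/* lock set up by the constructor */
};

/* Model of a table constructor that may fail. */
bool model_pgtable_ctor(struct model_page *page, bool fail_ctor)
{
	if (fail_ctor)
		return false;
	page->ptl = calloc(1, sizeof(long));
	return page->ptl != NULL;
}

/* Model of the matching destructor. */
void model_pgtable_dtor(struct model_page *page)
{
	assert(page != NULL);
	free(page->ptl);
	page->ptl = NULL;
}

/* Allocation path: a failed constructor must not leak the table. */
struct model_page *model_table_alloc(bool fail_ctor)
{
	struct model_page *page = calloc(1, sizeof(*page));

	if (!page)
		return NULL;
	if (!model_pgtable_ctor(page, fail_ctor)) {
		free(page);	/* undo the allocation */
		return NULL;
	}
	return page;
}

/* Freeing path: always run the destructor before freeing the table. */
void model_table_free(struct model_page *page)
{
	model_pgtable_dtor(page);
	free(page);
}
```

The freeing path mirrors the allocation path: every route that frees a
table has to run the destructor, otherwise the lock would be leaked.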

page->ptl
---------

page->ptl is used to access the split page table lock, where 'page' is the
struct page of the page containing the table. It shares storage with
page->private (and a few other fields in the union).

To avoid increasing the size of struct page and to get the best
performance, we use a trick:
 - if spinlock_t fits into a long, we use page->ptl as the spinlock itself,
   so we avoid an indirect access and save a cache line;
 - if spinlock_t is bigger than a long, we use page->ptl as a pointer to
   spinlock_t and allocate it dynamically. This allows the split lock to be
   used with DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC enabled, but costs one more
   cache line for the indirect access.

The spinlock_t is allocated in pgtable_page_ctor() for PTE tables and in
pgtable_pmd_page_ctor() for PMD tables.
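
The embedded-versus-pointer trick can be illustrated with a small user-space
model. It is a sketch under stated assumptions: the kernel chooses between
the two cases at compile time from the size of spinlock_t, whereas the
hypothetical model_* code below records the lock size in the page at run
time so that both cases can be exercised in one program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct page; lock_size exists only in the model. */
struct model_page {
	uintptr_t ptl;		/* the lock itself, or a pointer to it */
	size_t lock_size;	/* size of the lock type being modelled */
};

/* Returns 1 on success, 0 if the dynamic allocation failed. */
int model_ptlock_init(struct model_page *page, size_t lock_size)
{
	void *lock;

	assert(lock_size > 0);
	page->lock_size = lock_size;
	if (lock_size <= sizeof(page->ptl)) {
		page->ptl = 0;		/* lock is embedded in the word */
		return 1;
	}
	lock = calloc(1, lock_size);	/* lock is too big: allocate it */
	if (!lock)
		return 0;
	page->ptl = (uintptr_t)lock;
	return 1;
}

/* The helper hides which representation is in use. */
void *model_ptlock_ptr(struct model_page *page)
{
	if (page->lock_size <= sizeof(page->ptl))
		return &page->ptl;	/* direct access, no extra cache line */
	return (void *)page->ptl;	/* one more indirection */
}

void model_ptlock_free(struct model_page *page)
{
	if (page->lock_size > sizeof(page->ptl))
		free((void *)page->ptl);
	page->ptl = 0;
}
```

Callers that always go through model_ptlock_ptr() do not need to know which
case applies, which is why direct access to the field is discouraged.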

Never access page->ptl directly -- use the appropriate helper.