linux/mm
Shaohua Li ebc2a1a691 swap: make cluster allocation per-cpu
swap cluster allocation is to get better request merge to improve
performance.  But the cluster is shared globally, if multiple tasks are
doing swap, this will cause interleave disk access.  While multiple tasks
swap is quite common, for example, each numa node has a kswapd thread
doing swap and multiple threads/processes doing direct page reclaim.

ioscheduler can't help too much here, because tasks don't send swapout IO
down to block layer in the meantime.  Block layer does merge some IOs, but
a lot not, depending on how many tasks are doing swapout concurrently.  In
practice, I've seen a lot of small size IO in swapout workloads.

We makes the cluster allocation per-cpu here.  The interleave disk access
issue goes away.  All tasks swapout to their own cluster, so swapout will
become sequential, which can be easily merged to big size IO.  If one CPU
can't get its per-cpu cluster (for example, there is no free cluster
anymore in the swap), it will fallback to scan swap_map.  The CPU can
still continue swap.  We don't need recycle free swap entries of other
CPUs.

In my test (swap to a 2-disk raid0 partition), this improves around 10%
swapout throughput, and request size is increased significantly.

How does this impact swap readahead is uncertain though.  On one side,
page reclaim always isolates and swaps several adjancent pages, this will
make page reclaim write the pages sequentially and benefit readahead.  On
the other side, several CPU write pages interleave means the pages don't
live _sequentially_ but relatively _near_.  In the per-cpu allocation
case, if adjancent pages are written by different cpus, they will live
relatively _far_.  So how this impacts swap readahead depends on how many
pages page reclaim isolates and swaps one time.  If the number is big,
this patch will benefit swap readahead.  Of course, this is about
sequential access pattern.  The patch has no impact for random access
pattern, because the new cluster allocation algorithm is just for SSD.

Alternative solution is organizing swap layout to be per-mm instead of
this per-cpu approach.  In the per-mm layout, we allocate a disk range for
each mm, so pages of one mm live in swap disk adjacently.  per-mm layout
has potential issues of lock contention if multiple reclaimers are swap
pages from one mm.  For a sequential workload, per-mm layout is better to
implement swap readahead, because pages from the mm are adjacent in disk.
But per-cpu layout isn't very bad in this workload, as page reclaim always
isolates and swaps several pages one time, such pages will still live in
disk sequentially and readahead can utilize this.  For a random workload,
per-mm layout isn't beneficial of request merge, because it's quite
possible pages from different mm are swapout in the meantime and IO can't
be merged in per-mm layout.  while with per-cpu layout we can merge
requests from any mm.  Considering random workload is more popular in
workloads with swap (and per-cpu approach isn't too bad for sequential
workload too), I'm choosing per-cpu layout.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kyungmin Park <kmpark@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:57:17 -07:00
..
backing-dev.c backing-dev: convert class code to use dev_groups 2013-08-19 21:22:34 -07:00
balloon_compaction.c
bootmem.c
bounce.c
cleancache.c
compaction.c
debug-pagealloc.c
dmapool.c
fadvise.c
failslab.c
filemap_xip.c
filemap.c direct-io: Handle O_(D)SYNC AIO 2013-09-04 09:23:46 -04:00
fremap.c mm: save soft-dirty bits on file pages 2013-08-13 17:57:48 -07:00
frontswap.c
highmem.c
huge_memory.c mm: replace strict_strtoul() with kstrtoul() 2013-09-11 15:57:11 -07:00
hugetlb_cgroup.c cgroup: pass around cgroup_subsys_state instead of cgroup in file methods 2013-08-08 20:11:24 -04:00
hugetlb.c mm: replace strict_strtoul() with kstrtoul() 2013-09-11 15:57:11 -07:00
hwpoison-inject.c
init-mm.c
internal.h
interval_tree.c
Kconfig Merge remote-tracking branch 'origin/next' into kvm-ppc-next 2013-08-29 00:41:59 +02:00
Kconfig.debug
kmemcheck.c
kmemleak-test.c
kmemleak.c mm: replace strict_strtoul() with kstrtoul() 2013-09-11 15:57:11 -07:00
ksm.c mm: replace strict_strtoul() with kstrtoul() 2013-09-11 15:57:11 -07:00
maccess.c
madvise.c mm/madvise.c: fix coding-style errors 2013-09-11 15:57:00 -07:00
Makefile
memblock.c
memcontrol.c Merge branch 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2013-09-03 18:25:03 -07:00
memory_hotplug.c
memory-failure.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-09-06 09:36:28 -07:00
memory.c Merge 3.11-rc6 into char-misc-next 2013-08-18 20:40:33 -07:00
mempolicy.c mm: mempolicy: turn vma_set_policy() into vma_dup_policy() 2013-09-11 15:57:00 -07:00
mempool.c
migrate.c
mincore.c
mlock.c
mm_init.c
mmap.c mm: mmap_region: kill correct_wcount/inode, use allow_write_access() 2013-09-11 15:57:07 -07:00
mmu_context.c
mmu_notifier.c
mmzone.c
mprotect.c
mremap.c mm: move_ptes -- Set soft dirty bit depending on pte type 2013-08-27 09:36:17 -07:00
msync.c
nobootmem.c
nommu.c
oom_kill.c mm/oom_kill: remove weird use of ERR_PTR()/PTR_ERR(). 2013-07-15 11:25:05 +09:30
page_alloc.c mm/page_alloc.c: use '__paginginit' instead of '__init' 2013-09-11 15:57:13 -07:00
page_cgroup.c
page_io.c
page_isolation.c page_isolation: Fix a comment typo in test_pages_isolated() 2013-08-20 13:03:41 +02:00
page-writeback.c
pagewalk.c
percpu-km.c
percpu-vm.c
percpu.c
pgtable-generic.c
process_vm_access.c
quicklist.c
readahead.c
rmap.c s390/mm: implement software referenced bits 2013-08-29 13:20:11 +02:00
shmem.c shm_mnt is as longterm as it gets, TYVM... 2013-09-03 22:50:27 -04:00
slab_common.c
slab.c
slab.h memcg: check that kmem_cache has memcg_params before accessing it 2013-08-28 19:26:38 -07:00
slob.c
slub.c mm: replace strict_strtoul() with kstrtoul() 2013-09-11 15:57:11 -07:00
sparse-vmemmap.c
sparse.c
swap_state.c
swap.c thp, mm: avoid PageUnevictable on active/inactive lru lists 2013-07-31 14:41:03 -07:00
swapfile.c swap: make cluster allocation per-cpu 2013-09-11 15:57:17 -07:00
truncate.c
util.c
vmalloc.c
vmpressure.c Merge branch 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2013-09-03 18:25:03 -07:00
vmscan.c
vmstat.c mm: vmstats: track TLB flush stats on UP too 2013-09-11 15:57:09 -07:00
zbud.c mm: zbud: fix condition check on allocation size 2013-07-31 14:41:03 -07:00
zswap.c mm/zswap.c: get swapper address_space by using macro 2013-09-11 15:57:08 -07:00