linux/fs
Ira Weiny 932f4a630a mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM
Pach series "Add FOLL_LONGTERM to GUP fast and use it".

HFI1, qib, and mthca, use get_user_pages_fast() due to its performance
advantages.  These pages can be held for a significant time.  But
get_user_pages_fast() does not protect against mapping FS DAX pages.

Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
retains the performance while also adding the FS DAX checks.  XDP has also
shown interest in using this functionality.[1]

In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
and remove the specialized get_user_pages_longterm call.

[1] https://lkml.org/lkml/2019/3/19/939

"longterm" is a relative thing and at this point is probably a misnomer.
This is really flagging a pin which is going to be given to hardware and
can't move.  I've thought of a couple of alternative names but I think we
have to settle on if we are going to use FL_LAYOUT or something else to
solve the "longterm" problem.  Then I think we can change the flag to a
better name.

Secondly, it depends on how often you are registering memory.  I have
spoken with some RDMA users who consider MR in the performance path...
For the overall application performance.  I don't have the numbers as the
tests for HFI1 were done a long time ago.  But there was a significant
advantage.  Some of which is probably due to the fact that you don't have
to hold mmap_sem.

Finally, architecturally I think it would be good for everyone to use
*_fast.  There are patches submitted to the RDMA list which would allow
the use of *_fast (they reworking the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well.  Also
to this point others are looking to use *_fast.

As an aside, Jasons pointed out in my previous submission that *_fast and
*_unlocked look very much the same.  I agree and I think further cleanup
will be coming.  But I'm focused on getting the final solution for DAX at
the moment.

This patch (of 7):

This patch starts a series which aims to support FOLL_LONGTERM in
get_user_pages_fast().  Some callers who would like to do a longterm (user
controlled pin) of pages with the fast variant of GUP for performance
purposes.

Rather than have a separate get_user_pages_longterm() call, introduce
FOLL_LONGTERM and change the longterm callers to use it.

This patch does not change any functionality.  In the short term
"longterm" or user controlled pins are unsafe for Filesystems and FS DAX
in particular has been blocked.  However, callers of get_user_pages_fast()
were not "protected".

FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
requires vmas to determine if DAX is in use.

NOTE: In merging with the CMA changes we opt to change the
get_user_pages() call in check_and_migrate_cma_pages() to a call of
__get_user_pages_locked() on the newly migrated pages.  This makes the
code read better in that we are calling __get_user_pages_locked() on the
pages before and after a potential migration.

As a side affect some of the interfaces are cleaned up but this is not the
primary purpose of the series.

In review[1] it was asked:

<quote>
> This I don't get - if you do lock down long term mappings performance
> of the actual get_user_pages call shouldn't matter to start with.
>
> What do I miss?

A couple of points.

First "longterm" is a relative thing and at this point is probably a
misnomer.  This is really flagging a pin which is going to be given to
hardware and can't move.  I've thought of a couple of alternative names
but I think we have to settle on if we are going to use FL_LAYOUT or
something else to solve the "longterm" problem.  Then I think we can
change the flag to a better name.

Second, It depends on how often you are registering memory.  I have spoken
with some RDMA users who consider MR in the performance path...  For the
overall application performance.  I don't have the numbers as the tests
for HFI1 were done a long time ago.  But there was a significant
advantage.  Some of which is probably due to the fact that you don't have
to hold mmap_sem.

Finally, architecturally I think it would be good for everyone to use
*_fast.  There are patches submitted to the RDMA list which would allow
the use of *_fast (they reworking the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well.  Also
to this point others are looking to use *_fast.

As an asside, Jasons pointed out in my previous submission that *_fast and
*_unlocked look very much the same.  I agree and I think further cleanup
will be coming.  But I'm focused on getting the final solution for DAX at
the moment.

</quote>

[1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965

[ira.weiny@intel.com: v3]
  Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:45 -07:00
..
9p 9p: switch to ->free_inode() 2019-05-01 22:43:23 -04:00
adfs Wimplicit-fallthrough patches for 5.2-rc1 2019-05-07 12:48:10 -07:00
affs Wimplicit-fallthrough patches for 5.2-rc1 2019-05-07 12:48:10 -07:00
afs AFS Development 2019-05-07 20:51:58 -07:00
autofs autofs: fix use-after-free in lockless ->d_manage() 2019-04-09 19:18:19 -04:00
befs befs: switch to ->free_inode() 2019-05-01 22:43:23 -04:00
bfs bfs: switch to ->free_inode() 2019-05-01 22:43:24 -04:00
btrfs for-5.2/block-20190507 2019-05-07 18:14:36 -07:00
cachefiles fscache, cachefiles: remove redundant variable 'cache' 2018-11-30 16:00:58 +00:00
ceph Wimplicit-fallthrough patches for 5.2-rc1 2019-05-07 12:48:10 -07:00
cifs cifs: update module internal version number 2019-05-07 23:24:56 -05:00
coda coda: switch to ->free_inode() 2019-05-01 22:43:26 -04:00
configfs fs: mark expected switch fall-throughs 2019-04-08 18:21:02 -05:00
cramfs Make the Cramfs code more robust against filesystem corruptions, 2018-10-30 12:46:25 -07:00
crypto Clean up fscrypt's dcache revalidation support, and other 2019-05-07 21:28:04 -07:00
debugfs Merge branch 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-05-07 20:03:32 -07:00
devpts fs/devpts: always delete dcache dentry-s in dput() 2019-01-24 13:38:30 -05:00
dlm genetlink: optionally validate strictly/dumps 2019-04-27 17:07:22 -04:00
ecryptfs Merge branch 'work.icache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-05-07 10:57:05 -07:00
efivarfs
efs efs: switch to ->free_inode() 2019-05-01 22:43:24 -04:00
exportfs exportfs: do not read dentry after free 2018-11-23 09:08:17 -05:00
ext2 ext2: switch to ->free_inode() 2019-05-01 22:43:24 -04:00
ext4 Clean up fscrypt's dcache revalidation support, and other 2019-05-07 21:28:04 -07:00
f2fs Clean up fscrypt's dcache revalidation support, and other 2019-05-07 21:28:04 -07:00
fat fat: switch to ->free_inode() 2019-05-01 22:43:24 -04:00
freevxfs freevxfs: switch to ->free_inode() 2019-05-01 22:43:24 -04:00
fscache fscache: fix race between enablement and dropping of object 2018-11-30 15:57:31 +00:00
fuse Merge branch 'work.icache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-05-07 10:57:05 -07:00
gfs2 We've got the following patches ready for this merge window: 2019-05-08 13:16:07 -07:00
hfs hfs: switch to ->free_inode() 2019-05-01 22:43:24 -04:00
hfsplus hfsplus: switch to ->free_inode() 2019-05-01 22:43:24 -04:00
hostfs This pull request contains the following changes for UML: 2019-05-12 17:52:13 -04:00
hpfs hpfs: switch to ->free_inode() 2019-05-01 22:43:24 -04:00
hugetlbfs hugetlb: make use of ->free_inode() 2019-05-01 22:43:27 -04:00
isofs isofs: switch to ->free_inode() 2019-05-01 22:43:24 -04:00
jbd2 jbd2: check superblock mapped prior to committing 2019-04-06 18:57:40 -04:00
jffs2 Wimplicit-fallthrough patches for 5.2-rc1 2019-05-07 12:48:10 -07:00
jfs Several minor jfs fixes 2019-05-07 11:37:27 -07:00
kernfs Merge branch 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-05-07 20:03:32 -07:00
lockd lockd: Store the lockd client credential in struct nlm_host 2019-04-26 17:51:23 -04:00
minix minix: switch to ->free_inode() 2019-05-01 22:43:25 -04:00
nfs NFS client updates for Linux 5.2 2019-05-09 14:33:15 -07:00
nfs_common
nfsd NFS client updates for Linux 5.2 2019-05-09 14:33:15 -07:00
nilfs2 nilfs2: switch to ->free_inode() 2019-05-01 22:43:25 -04:00
nls
notify Merge branch 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-05-07 20:03:32 -07:00
ntfs ntfs: switch to ->free_inode() 2019-05-01 22:43:26 -04:00
ocfs2 ocfs2: fix ocfs2 read inode data panic in ocfs2_iget 2019-05-14 09:47:44 -07:00
omfs
openpromfs openpromfs: switch to ->free_inode() 2019-05-01 22:43:25 -04:00
orangefs Orangefs: This pull request includes one fix and our "Orangefs through 2019-05-09 09:37:25 -07:00
overlayfs Merge branch 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-05-07 20:03:32 -07:00
proc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2019-05-07 22:03:58 -07:00
pstore treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively 2019-04-09 14:19:06 +02:00
qnx4 qnx4: switch to ->free_inode() 2019-05-01 22:43:25 -04:00
qnx6 qnx6: switch to ->free_inode() 2019-05-01 22:43:25 -04:00
quota quota: Lock s_umount in exclusive mode for Q_XQUOTA{ON,OFF} quotactls. 2018-12-18 18:29:15 +01:00
ramfs
reiserfs reiserfs: convert to ->free_inode() 2019-05-01 22:43:25 -04:00
romfs romfs: convert to ->free_inode() 2019-05-01 22:43:25 -04:00
squashfs squashfs: switch to ->free_inode() 2019-05-01 22:43:25 -04:00
sysfs Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-03-16 10:31:02 -07:00
sysv Merge branch 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-05-07 20:03:32 -07:00
tracefs
ubifs This pull request contains the following changes for UBI/UBIFS 2019-05-12 18:16:31 -04:00
udf udf: switch to ->free_inode() 2019-05-01 22:43:25 -04:00
ufs Wimplicit-fallthrough patches for 5.2-rc1 2019-05-07 12:48:10 -07:00
unicode unicode: refactor the rule for regenerating utf8data.h 2019-04-28 13:45:36 -04:00
xfs for-5.2/block-20190507 2019-05-07 18:14:36 -07:00
aio.c aio: use kmem_cache_free() instead of kfree() 2019-04-04 20:13:59 -04:00
anon_inodes.c
attr.c
bad_inode.c
binfmt_aout.c a.out: remove core dumping support 2019-03-05 10:00:35 -08:00
binfmt_elf_fdpic.c
binfmt_elf.c fs/binfmt_elf.c: spread const a little 2019-03-07 18:32:01 -08:00
binfmt_em86.c
binfmt_flat.c
binfmt_misc.c
binfmt_script.c exec: load_script: Do not exec truncated interpreter path 2019-02-18 16:49:36 -08:00
block_dev.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-05-07 20:50:27 -07:00
buffer.c iomap: Fix use-after-free error in page_done callback 2019-05-01 07:47:37 -07:00
char_dev.c chardev: update comment based on the code 2019-04-02 17:49:58 +02:00
compat_binfmt_elf.c
compat_ioctl.c media updates for v4.20-rc1 2018-10-29 14:29:58 -07:00
compat.c
coredump.c
d_path.c
dax.c mm/huge_memory: fix vmf_insert_pfn_{pmd, pud}() crash, handle unaligned addresses 2019-05-14 09:47:44 -07:00
dcache.c Clean up fscrypt's dcache revalidation support, and other 2019-05-07 21:28:04 -07:00
dcookies.c
direct-io.c block: remove the i argument to bio_for_each_segment_all 2019-04-30 09:26:13 -06:00
drop_caches.c fs/drop_caches.c: avoid softlockups in drop_pagecache_sb() 2019-02-01 15:46:24 -08:00
eventfd.c
eventpoll.c epoll: use rwlock in order to reduce ep_poll_callback() contention 2019-03-07 18:32:01 -08:00
exec.c exec: increase BINPRM_BUF_SIZE to 256 2019-03-07 18:32:01 -08:00
fcntl.c fs: mark expected switch fall-throughs 2019-04-08 18:21:02 -05:00
fhandle.c
file_table.c vfs: syscall: Add open_tree(2) to reference or clone a mount 2019-03-20 18:49:06 -04:00
file.c io_uring-2019-03-06 2019-03-08 14:48:40 -08:00
filesystems.c vfs: Implement a filesystem superblock creation/configuration context 2019-02-28 03:29:26 -05:00
fs_context.c vfs: syscall: Add fsconfig() for configuring and managing a context 2019-03-20 18:49:06 -04:00
fs_parser.c fs: fs_parser: fix printk format warning 2019-03-29 10:01:38 -07:00
fs_pin.c
fs_struct.c
fs_types.c fs: common implementation of file type 2019-01-21 17:48:13 +01:00
fs-writeback.c writeback: synchronize sync(2) against cgroup writeback membership switches 2019-01-22 14:39:38 -07:00
fsopen.c vfs: syscall: Add fspick() to select a superblock for reconfiguration 2019-03-20 18:49:06 -04:00
inode.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-05-07 20:50:27 -07:00
internal.h Merge branch 'work.mount-syscalls' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-05-07 20:17:51 -07:00
io_uring.c mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM 2019-05-14 09:47:45 -07:00
ioctl.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
iomap.c for-5.2/block-20190507 2019-05-07 18:14:36 -07:00
Kconfig unicode: introduce UTF-8 character database 2019-04-25 13:38:44 -04:00
Kconfig.binfmt
libfs.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-05-07 20:50:27 -07:00
locks.c Wimplicit-fallthrough patches for 5.2-rc1 2019-05-07 12:48:10 -07:00
Makefile Add as a feature case-insensitive directories (the casefold feature) 2019-05-07 21:12:44 -07:00
mbcache.c
mount.h saner handling of temporary namespaces 2019-01-30 17:44:07 -05:00
mpage.c block: remove the i argument to bio_for_each_segment_all 2019-04-30 09:26:13 -06:00
namei.c Clean up fscrypt's dcache revalidation support, and other 2019-05-07 21:28:04 -07:00
namespace.c do_move_mount(): fix an unsafe use of is_anon_ns() 2019-05-09 02:32:50 -04:00
no-block.c
nsfs.c nsfs: unobfuscate 2019-04-09 19:20:57 -04:00
open.c vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files 2019-05-06 17:46:52 +03:00
pipe.c Merge branch 'page-refs' (page ref overflow) 2019-04-14 15:09:40 -07:00
pnode.c separate copying and locking mount tree on cross-userns copies 2019-01-30 17:14:50 -05:00
pnode.h separate copying and locking mount tree on cross-userns copies 2019-01-30 17:14:50 -05:00
posix_acl.c
proc_namespace.c
read_write.c vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files 2019-05-06 17:46:52 +03:00
readdir.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
select.c y2038: syscalls: rename y2038 compat syscalls 2019-02-07 00:13:27 +01:00
seq_file.c fs: mark expected switch fall-throughs 2019-04-08 18:21:02 -05:00
signalfd.c fs: mark expected switch fall-throughs 2019-04-08 18:21:02 -05:00
splice.c There tracing fixes: 2019-04-26 11:09:55 -07:00
stack.c block: remove CONFIG_LBDAF 2019-04-06 10:48:35 -06:00
stat.c fs: move generic stat response attr handling to vfs_getattr_nosec 2019-02-01 01:55:45 -05:00
statfs.c vfs: add vfs_get_fsid() helper 2019-02-07 16:38:35 +01:00
super.c [fix] get rid of checking for absent device name in vfs_get_tree() 2019-04-28 21:34:21 -04:00
sync.c fs: add sync_file_range() helper 2019-05-02 14:08:53 -06:00
timerfd.c y2038: syscalls: rename y2038 compat syscalls 2019-02-07 00:13:27 +01:00
userfaultfd.c userfaultfd/sysctl: add vm.unprivileged_userfaultfd 2019-05-14 09:47:45 -07:00
utimes.c y2038: syscalls: rename y2038 compat syscalls 2019-02-07 00:13:27 +01:00
xattr.c