Commit Graph

947 Commits

Author SHA1 Message Date
Kirill Smelkov
d4b13963f2 fuse: require /dev/fuse reads to have enough buffer capacity
A FUSE filesystem server queues /dev/fuse sys_read calls to get
filesystem requests to handle. It does not know in advance what would be
that request as it can be anything that client issues - LOOKUP, READ,
WRITE, ... Many requests are short and retrieve data from the
filesystem. However WRITE and NOTIFY_REPLY write data into filesystem.

Before getting into operation phase, FUSE filesystem server and kernel
client negotiate what should be the maximum write size the client will
ever issue. After negotiation the contract in between server/client is
that the filesystem server then should queue /dev/fuse sys_read calls with
enough buffer capacity to receive any client request - WRITE in
particular, while FUSE client should not, in particular, send WRITE
requests with > negotiated max_write payload. FUSE client in kernel and
libfuse historically reserve 4K for request header. This way the
contract is that filesystem server should queue sys_reads with
4K+max_write buffer.

If the filesystem server does not follow this contract, what can happen
is that fuse_dev_do_read will see that request size is > buffer size,
and then it will return EIO to client who issued the request but won't
indicate in any way that there is a problem to filesystem server.
This can be hard to diagnose because for some requests, e.g. for
NOTIFY_REPLY which mimics WRITE, there is no client thread that is
waiting for request completion and that EIO goes nowhere, while on
filesystem server side things look like the kernel is not replying back
after successful NOTIFY_RETRIEVE request made by the server.

We can make the problem easy to diagnose if we indicate via error return to
filesystem server when it is violating the contract.  This should not
practically cause problems because if a filesystem server is using shorter
buffer, writes to it were already very likely to cause EIO, and if the
filesystem is read-only it should be too following FUSE_MIN_READ_BUFFER
minimum buffer size.

Please see [1] for context where the problem of stuck filesystem was hit
for real (because kernel client was incorrectly sending more than
max_write data with NOTIFY_REPLY; see also previous patch), how the
situation was traced and for more involving patch that did not make it
into the tree.

[1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2

Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-04-24 17:05:07 +02:00
Kirill Smelkov
7640682e67 fuse: retrieve: cap requested size to negotiated max_write
FUSE filesystem server and kernel client negotiate during initialization
phase, what should be the maximum write size the client will ever issue.
Correspondingly the filesystem server then queues sys_read calls to read
requests with buffer capacity large enough to carry request header + that
max_write bytes. A filesystem server is free to set its max_write in
anywhere in the range between [1*page, fc->max_pages*page]. In particular
go-fuse[2] sets max_write by default as 64K, wheres default fc->max_pages
corresponds to 128K. Libfuse also allows users to configure max_write, but
by default presets it to possible maximum.

If max_write is < fc->max_pages*page, and in NOTIFY_RETRIEVE handler we
allow to retrieve more than max_write bytes, corresponding prepared
NOTIFY_REPLY will be thrown away by fuse_dev_do_read, because the
filesystem server, in full correspondence with server/client contract, will
be only queuing sys_read with ~max_write buffer capacity, and
fuse_dev_do_read throws away requests that cannot fit into server request
buffer. In turn the filesystem server could get stuck waiting indefinitely
for NOTIFY_REPLY since NOTIFY_RETRIEVE handler returned OK which is
understood by clients as that NOTIFY_REPLY was queued and will be sent
back.

Cap requested size to negotiate max_write to avoid the problem.  This
aligns with the way NOTIFY_RETRIEVE handler works, which already
unconditionally caps requested retrieve size to fuse_conn->max_pages.  This
way it should not hurt NOTIFY_RETRIEVE semantic if we return less data than
was originally requested.

Please see [1] for context where the problem of stuck filesystem was hit
for real, how the situation was traced and for more involving patch that
did not make it into the tree.

[1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2
[2] https://github.com/hanwen/go-fuse

Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-04-24 17:05:06 +02:00
Kirill Smelkov
ad2ba64dd4 fuse: allow filesystems to have precise control over data cache
On networked filesystems file data can be changed externally.  FUSE
provides notification messages for filesystem to inform kernel that
metadata or data region of a file needs to be invalidated in local page
cache. That provides the basis for filesystem implementations to invalidate
kernel cache explicitly based on observed filesystem-specific events.

FUSE has also "automatic" invalidation mode(*) when the kernel
automatically invalidates data cache of a file if it sees mtime change.  It
also automatically invalidates whole data cache of a file if it sees file
size being changed.

The automatic mode has corresponding capability - FUSE_AUTO_INVAL_DATA.
However, due to probably historical reason, that capability controls only
whether mtime change should be resulting in automatic invalidation or
not. A change in file size always results in invalidating whole data cache
of a file irregardless of whether FUSE_AUTO_INVAL_DATA was negotiated(+).

The filesystem I write[1] represents data arrays stored in networked
database as local files suitable for mmap. It is read-only filesystem -
changes to data are committed externally via database interfaces and the
filesystem only glues data into contiguous file streams suitable for mmap
and traditional array processing. The files are big - starting from
hundreds gigabytes and more. The files change regularly, and frequently by
data being appended to their end. The size of files thus changes
frequently.

If a file was accessed locally and some part of its data got into page
cache, we want that data to stay cached unless there is memory pressure, or
unless corresponding part of the file was actually changed. However current
FUSE behaviour - when it sees file size change - is to invalidate the whole
file. The data cache of the file is thus completely lost even on small size
change, and despite that the filesystem server is careful to accurately
translate database changes into FUSE invalidation messages to kernel.

Let's fix it: if a filesystem, through new FUSE_EXPLICIT_INVAL_DATA
capability, indicates to kernel that it is fully responsible for data cache
invalidation, then the kernel won't invalidate files data cache on size
change and only truncate that cache to new size in case the size decreased.

(*) see 72d0d248ca "fuse: add FUSE_AUTO_INVAL_DATA init flag",
eed2179efe "fuse: invalidate inode mapping if mtime changes"

(+) in writeback mode the kernel does not invalidate data cache on file
size change, but neither it allows the filesystem to set the size due to
external event (see 8373200b12 "fuse: Trust kernel i_size only")

[1] https://lab.nexedi.com/kirr/wendelin.core/blob/a50f1d9f/wcfs/wcfs.go#L20

Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-04-24 17:05:06 +02:00
Kirill Smelkov
f2294482ff fuse: convert printk -> pr_*
Functions, like pr_err, are a more modern variant of printing compared to
printk. They could be used to denoise sources by using needed level in
the print function name, and by automatically inserting per-driver /
function / ... print prefix as defined by pr_fmt macro. pr_* are also
said to be used in Documentation/process/coding-style.rst and more
recent code - for example overlayfs - uses them instead of printk.

Convert CUSE and FUSE to use the new pr_* functions.

CUSE output stays completely unchanged, while FUSE output is amended a
bit for "trying to steal weird page" warning - the second line now comes
also with "fuse:" prefix. I hope it is ok.

Suggested-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-04-24 17:05:06 +02:00
Liu Bo
0cbade024b fuse: honor RLIMIT_FSIZE in fuse_file_fallocate
fstests generic/228 reported this failure that fuse fallocate does not
honor what 'ulimit -f' has set.

This adds the necessary inode_newsize_ok() check.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Fixes: 05ba1f0823 ("fuse: add FALLOCATE operation")
Cc: <stable@vger.kernel.org> # v3.5
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-04-24 17:05:06 +02:00
Miklos Szeredi
9de5be06d0 fuse: fix writepages on 32bit
Writepage requests were cropped to i_size & 0xffffffff, which meant that
mmaped writes to any file larger than 4G might be silently discarded.

Fix by storing the file size in a properly sized variable (loff_t instead
of size_t).

Reported-by: Antonio SJ Musumeci <trapexit@spawn.link>
Fixes: 6eaf4782eb ("fuse: writepages: crop secondary requests")
Cc: <stable@vger.kernel.org> # v3.13
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-04-24 17:05:06 +02:00
Linus Torvalds
6b3a707736 Merge branch 'page-refs' (page ref overflow)
Merge page ref overflow branch.

Jann Horn reported that he can overflow the page ref count with
sufficient memory (and a filesystem that is intentionally extremely
slow).

Admittedly it's not exactly easy.  To have more than four billion
references to a page requires a minimum of 32GB of kernel memory just
for the pointers to the pages, much less any metadata to keep track of
those pointers.  Jann needed a total of 140GB of memory and a specially
crafted filesystem that leaves all reads pending (in order to not ever
free the page references and just keep adding more).

Still, we have a fairly straightforward way to limit the two obvious
user-controllable sources of page references: direct-IO like page
references gotten through get_user_pages(), and the splice pipe page
duplication.  So let's just do that.

* branch page-refs:
  fs: prevent page refcount overflow in pipe_buf_get
  mm: prevent get_user_pages() from overflowing page refcount
  mm: add 'try_get_page()' helper function
  mm: make page ref count overflow check tighter and more explicit
2019-04-14 15:09:40 -07:00
Matthew Wilcox
15fab63e1e fs: prevent page refcount overflow in pipe_buf_get
Change pipe_buf_get() to return a bool indicating whether it succeeded
in raising the refcount of the page (if the thing in the pipe is a page).
This removes another mechanism for overflowing the page refcount.  All
callers converted to handle a failure.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-14 10:00:04 -07:00
Linus Torvalds
dfee9c257b fuse update for 5.1
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXIdqOwAKCRDh3BK/laaZ
 PFRlAP0RZr7vDfGcZTXGApcIr63YDjzi8Gg1/Jhd0jrzLbKcdAD+P0d6bupWWwOl
 yGjVxY9LkXNJiTI2Q+Equ7AgMYvDcQk=
 =Lvcr
 -----END PGP SIGNATURE-----

Merge tag 'fuse-update-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse updates from Miklos Szeredi:
 "Scalability and performance improvements, as well as minor bug fixes
  and cleanups"

* tag 'fuse-update-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits)
  fuse: cache readdir calls if filesystem opts out of opendir
  fuse: support clients that don't implement 'opendir'
  fuse: lift bad inode checks into callers
  fuse: multiplex cached/direct_io file operations
  fuse add copy_file_range to direct io fops
  fuse: use iov_iter based generic splice helpers
  fuse: Switch to using async direct IO for FOPEN_DIRECT_IO
  fuse: use atomic64_t for khctr
  fuse: clean up aborted
  fuse: Protect ff->reserved_req via corresponding fi->lock
  fuse: Protect fi->nlookup with fi->lock
  fuse: Introduce fi->lock to protect write related fields
  fuse: Convert fc->attr_version into atomic64_t
  fuse: Add fuse_inode argument to fuse_prepare_release()
  fuse: Verify userspace asks to requeue interrupt that we really sent
  fuse: Do some refactoring in fuse_dev_do_write()
  fuse: Wake up req->waitq of only if not background
  fuse: Optimize request_end() by not taking fiq->waitq.lock
  fuse: Kill fasync only if interrupt is queued in queue_interrupt()
  fuse: Remove stale comment in end_requests()
  ...
2019-03-12 14:46:26 -07:00
Nikolay Borisov
b5420237ec mm: refactor readahead defines in mm.h
All users of VM_MAX_READAHEAD actually convert it to kbytes and then to
pages. Define the macro explicitly as (SZ_128K / PAGE_SIZE). This
simplifies the expression in every filesystem. Also rename the macro to
VM_READAHEAD_PAGES to properly convey its meaning. Finally remove unused
VM_MIN_READAHEAD

[akpm@linux-foundation.org: fix fs/io_uring.c, per Stephen]
Link: http://lkml.kernel.org/r/20181221144053.24318-1-nborisov@suse.com
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:01 -07:00
Chad Austin
fabf7e0262 fuse: cache readdir calls if filesystem opts out of opendir
If a filesystem returns ENOSYS from opendir and thus opts out of
opendir and releasedir requests, it almost certainly would also like
readdir results cached. Default open_flags to FOPEN_KEEP_CACHE and
FOPEN_CACHE_DIR in that case.

With this patch, I've measured recursive directory enumeration across
large FUSE mounts to be faster than native mounts.

Signed-off-by: Chad Austin <chadaustin@fb.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:15 +01:00
Chad Austin
d9a9ea94f7 fuse: support clients that don't implement 'opendir'
Allow filesystems to return ENOSYS from opendir, preventing the kernel from
sending opendir and releasedir messages in the future. This avoids
userspace transitions when filesystems don't need to keep track of state
per directory handle.

A new capability flag, FUSE_NO_OPENDIR_SUPPORT, parallels
FUSE_NO_OPEN_SUPPORT, indicating the new semantics for returning ENOSYS
from opendir.

Signed-off-by: Chad Austin <chadaustin@fb.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:15 +01:00
Miklos Szeredi
2f7b6f5bed fuse: lift bad inode checks into callers
Bad inode checks were done  done in various places, and move them into
fuse_file_{read|write}_iter().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:15 +01:00
Miklos Szeredi
55752a3aba fuse: multiplex cached/direct_io file operations
This is cleanup, as well as allowing switching between I/O modes while the
file is open in the future.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:15 +01:00
Miklos Szeredi
d4136d6075 fuse add copy_file_range to direct io fops
Nothing preventing copy_file_range to work on files opened with
FOPEN_DIRECT_IO.

Fixes: 88bc7d5097 ("fuse: add support for copy_file_range()")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:14 +01:00
Miklos Szeredi
3c3db095b6 fuse: use iov_iter based generic splice helpers
The default splice implementation is grossly inefficient and the iter based
ones work just fine, so use those instead.  I've measured an 8x speedup for
splice write (with len = 128k).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:14 +01:00
Martin Raiber
23c94e1cdc fuse: Switch to using async direct IO for FOPEN_DIRECT_IO
Switch to using the async directo IO code path in fuse_direct_read_iter()
and fuse_direct_write_iter().  This is especially important in connection
with loop devices with direct IO enabled as loop assumes async direct io is
actually async.

Signed-off-by: Martin Raiber <martin@urbackup.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:14 +01:00
Miklos Szeredi
75126f5504 fuse: use atomic64_t for khctr
...to get rid of one more fc->lock use.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:14 +01:00
Miklos Szeredi
eb98e3bdf3 fuse: clean up aborted
The only caller that needs fc->aborted set is fuse_conn_abort_write().
Setting fc->aborted is now racy (fuse_abort_conn() may already be in
progress or finished) but there's no reason to care.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:14 +01:00
Kirill Tkhai
6b675738ce fuse: Protect ff->reserved_req via corresponding fi->lock
This is rather natural action after previous patches, and it just decreases
load of fc->lock.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:14 +01:00
Kirill Tkhai
c9d8f5f069 fuse: Protect fi->nlookup with fi->lock
This continues previous patch and introduces the same protection for
nlookup field.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:14 +01:00
Kirill Tkhai
f15ecfef05 fuse: Introduce fi->lock to protect write related fields
To minimize contention of fc->lock, this patch introduces a new spinlock
for protection fuse_inode metadata:

fuse_inode:
	writectr
	writepages
	write_files
	queued_writes
	attr_version

inode:
	i_size
	i_nlink
	i_mtime
	i_ctime

Also, it protects the fields changed in fuse_change_attributes_common()
(too many to list).

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:14 +01:00
Kirill Tkhai
4510d86fbb fuse: Convert fc->attr_version into atomic64_t
This patch makes fc->attr_version of atomic64_t type, so fc->lock won't be
needed to read or modify it anymore.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:13 +01:00
Kirill Tkhai
ebf84d0c72 fuse: Add fuse_inode argument to fuse_prepare_release()
Here is preparation for next patches, which introduce new fi->lock for
protection of ff->write_entry linked into fi->write_files.

This patch just passes new argument to the function.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:13 +01:00
Kirill Tkhai
b782911b52 fuse: Verify userspace asks to requeue interrupt that we really sent
When queue_interrupt() is called from fuse_dev_do_write(), it came from
userspace directly. Userspace may pass any request id, even the request's
we have not interrupted (or even background's request). This patch adds
sanity check to make kernel safe against that.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:13 +01:00
Kirill Tkhai
7407a10de5 fuse: Do some refactoring in fuse_dev_do_write()
This is needed for next patch.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:13 +01:00
Kirill Tkhai
5e0fed717a fuse: Wake up req->waitq of only if not background
Currently, we wait on req->waitq in request_wait_answer() function only,
and it's never used for background requests.  Since wake_up() is not a
light-weight macros, instead of this, it unfolds in really called function,
which makes locking operations taking some cpu cycles, let's avoid its call
for the case we definitely know it's completely useless.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:13 +01:00
Kirill Tkhai
217316a601 fuse: Optimize request_end() by not taking fiq->waitq.lock
We take global fiq->waitq.lock every time, when we are in this function,
but interrupted requests are just small subset of all requests. This patch
optimizes request_end() and makes it to take the lock when it's really
needed.

queue_interrupt() needs small change for that. After req is linked to
interrupt list, we do smp_mb() and check for FR_FINISHED again. In case of
FR_FINISHED bit has appeared, we remove req and leave the function:

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:13 +01:00
Kirill Tkhai
8da6e91832 fuse: Kill fasync only if interrupt is queued in queue_interrupt()
We should sent signal only in case of interrupt is really queued.  Not a
real problem, but this makes the code clearer and intuitive.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:13 +01:00
Kirill Tkhai
340617508d fuse: Remove stale comment in end_requests()
Function end_requests() does not take fc->lock.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:12 +01:00
Kirill Tkhai
c5de16cca2 fuse: Replace page without copying in fuse_writepage_in_flight()
It looks like we can optimize page replacement and avoid copying by simple
updating the request's page.

[SzM: swap with new request's tmp page to avoid use after free.]

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:12 +01:00
Miklos Szeredi
e2653bd53a fuse: fix leaked aux requests
Auxiliary requests chained on req->misc.write.next may be leaked on
truncate.  Free these as well if the parent request was truncated off.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:12 +01:00
Miklos Szeredi
419234d595 fuse: only reuse auxiliary request in fuse_writepage_in_flight()
Don't reuse the queued request, even if it only contains a single page.
This is needed because previous locking changes (spliting out
fiq->waitq.lock from fc->lock) broke the assumption that request will
remain in FR_PENDING at least until the new page contents are copied.

This fix removes a slight optimization for a rare corner case, so we really
shoudln't care.

Reported-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Fixes: fd22d62ed0 ("fuse: no fc->lock for iqueue parts")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:12 +01:00
Miklos Szeredi
7f305ca192 fuse: clean up fuse_writepage_in_flight()
Restructure the function to better separate the locked and the unlocked
parts.  Use the "old_req" local variable to mean only the queued request,
and not any auxiliary requests added onto its misc.write.next list.  These
changes are in preparation for the following patch.

Also turn BUG_ON instances into WARN_ON and add a header comment explaining
what the function does.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:12 +01:00
Miklos Szeredi
2fe93bd432 fuse: extract fuse_find_writeback() helper
Call this from fuse_range_is_writeback() and fuse_writepage_in_flight().

Turn a BUG_ON() into a WARN_ON() in the process.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:12 +01:00
Miklos Szeredi
a2ebba8241 fuse: decrement NR_WRITEBACK_TEMP on the right page
NR_WRITEBACK_TEMP is accounted on the temporary page in the request, not
the page cache page.

Fixes: 8b284dc472 ("fuse: writepages: handle same page rewrites")
Cc: <stable@vger.kernel.org> # v3.13
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-01-16 10:27:59 +01:00
Jann Horn
9509941e9c fuse: call pipe_buf_release() under pipe lock
Some of the pipe_buf_release() handlers seem to assume that the pipe is
locked - in particular, anon_pipe_buf_release() accesses pipe->tmp_page
without taking any extra locks. From a glance through the callers of
pipe_buf_release(), it looks like FUSE is the only one that calls
pipe_buf_release() without having the pipe locked.

This bug should only lead to a memory leak, nothing terrible.

Fixes: dd3bb14f44 ("fuse: support splice() writing to fuse device")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-01-16 10:27:59 +01:00
Miklos Szeredi
8a3177db59 cuse: fix ioctl
cuse_process_init_reply() doesn't initialize fc->max_pages and thus all
cuse bases ioctls fail with ENOMEM.

Reported-by: Andreas Steinmetz <ast@domdv.de>
Fixes: 5da784cce4 ("fuse: add max_pages to init_out")
Cc: <stable@vger.kernel.org> # v4.20
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-01-16 10:27:59 +01:00
Miklos Szeredi
97e1532ef8 fuse: handle zero sized retrieve correctly
Dereferencing req->page_descs[0] will Oops if req->max_pages is zero.

Reported-by: syzbot+c1e36d30ee3416289cc0@syzkaller.appspotmail.com
Tested-by: syzbot+c1e36d30ee3416289cc0@syzkaller.appspotmail.com
Fixes: b2430d7567 ("fuse: add per-page descriptor <offset, length> to fuse_req")
Cc: <stable@vger.kernel.org> # v3.9
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-01-16 10:27:59 +01:00
Arun KS
ca79b0c211 mm: convert totalram_pages and totalhigh_pages variables to atomic
totalram_pages and totalhigh_pages are made static inline function.

Main motivation was that managed_page_count_lock handling was complicating
things.  It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes
better to remove the lock and convert variables to atomic, with preventing
poteintial store-to-read tearing as a bonus.

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Suggested-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:47 -08:00
Chad Austin
2e64ff154c fuse: continue to send FUSE_RELEASEDIR when FUSE_OPEN returns ENOSYS
When FUSE_OPEN returns ENOSYS, the no_open bit is set on the connection.

Because the FUSE_RELEASE and FUSE_RELEASEDIR paths share code, this
incorrectly caused the FUSE_RELEASEDIR request to be dropped and never sent
to userspace.

Pass an isdir bool to distinguish between FUSE_RELEASE and FUSE_RELEASEDIR
inside of fuse_file_put.

Fixes: 7678ac5061 ("fuse: support clients that don't implement 'open'")
Cc: <stable@vger.kernel.org> # v3.14
Signed-off-by: Chad Austin <chadaustin@fb.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-12-11 21:47:28 +01:00
Takeshi Misawa
d72f70da60 fuse: Fix memory leak in fuse_dev_free()
When ntfs is unmounted, the following leak is
reported by kmemleak.

kmemleak report:

unreferenced object 0xffff880052bf4400 (size 4096):
  comm "mount.ntfs", pid 16530, jiffies 4294861127 (age 3215.836s)
  hex dump (first 32 bytes):
    00 44 bf 52 00 88 ff ff 00 44 bf 52 00 88 ff ff  .D.R.....D.R....
    10 44 bf 52 00 88 ff ff 10 44 bf 52 00 88 ff ff  .D.R.....D.R....
  backtrace:
    [<00000000bf4a2f8d>] fuse_fill_super+0xb22/0x1da0 [fuse]
    [<000000004dde0f0c>] mount_bdev+0x263/0x320
    [<0000000025aebc66>] mount_fs+0x82/0x2bf
    [<0000000042c5a6be>] vfs_kern_mount.part.33+0xbf/0x480
    [<00000000ed10cd5b>] do_mount+0x3de/0x2ad0
    [<00000000d59ff068>] ksys_mount+0xba/0xd0
    [<000000001bda1bcc>] __x64_sys_mount+0xba/0x150
    [<00000000ebe26304>] do_syscall_64+0x151/0x490
    [<00000000d25f2b42>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [<000000002e0abd2c>] 0xffffffffffffffff

fuse_dev_alloc() allocate fud->pq.processing.
But this hash table is not freed.

Fix this by freeing fud->pq.processing.

Signed-off-by: Takeshi Misawa <jeliantsurux@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: be2ff42c5d ("fuse: Use hash table to link processing request")
2018-12-10 09:57:54 +01:00
Miklos Szeredi
d233c7dd16 fuse: fix revalidation of attributes for permission check
fuse_invalidate_attr() now sets fi->inval_mask instead of fi->i_time, hence
we need to check the inval mask in fuse_permission() as well.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 2f1e81965f ("fuse: allow fine grained attr cache invaldation")
2018-12-03 10:14:43 +01:00
Miklos Szeredi
a9c2d1e82f fuse: fix fsync on directory
Commit ab2257e994 ("fuse: reduce size of struct fuse_inode") moved parts
of fields related to writeback on regular file and to directory caching
into a union.  However fuse_fsync_common() called from fuse_dir_fsync()
touches some writeback related fields, resulting in a crash.

Move writeback related parts from fuse_fsync_common() to fuse_fysnc().

Reported-by: Brett Girton <btgirton@gmail.com>
Tested-by: Brett Girton <btgirton@gmail.com>
Fixes: ab2257e994 ("fuse: reduce size of struct fuse_inode")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-12-03 10:14:43 +01:00
Myungho Jung
4fc4bb796b fuse: Add bad inode check in fuse_destroy_inode()
make_bad_inode() sets inode->i_mode to S_IFREG if I/O error is detected
in fuse_do_getattr()/fuse_do_setattr(). If the inode is not a regular
file, write_files and queued_writes in fuse_inode are not initialized
and have NULL or invalid pointers written by other members in a union.
So, list_empty() returns false in fuse_destroy_inode(). Add
is_bad_inode() to check if make_bad_inode() was called.

Reported-by: syzbot+b9c89b84423073226299@syzkaller.appspotmail.com
Fixes: ab2257e994 ("fuse: reduce size of struct fuse_inode")
Signed-off-by: Myungho Jung <mhjungk@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-11-22 10:20:19 +01:00
Lukas Czerner
ebacb81273 fuse: fix use-after-free in fuse_direct_IO()
In async IO blocking case the additional reference to the io is taken for
it to survive fuse_aio_complete(). In non blocking case this additional
reference is not needed, however we still reference io to figure out
whether to wait for completion or not. This is wrong and will lead to
use-after-free. Fix it by storing blocking information in separate
variable.

This was spotted by KASAN when running generic/208 fstest.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 744742d692 ("fuse: Add reference counting for fuse_io_priv")
Cc: <stable@vger.kernel.org> # v4.6
2018-11-09 15:52:17 +01:00
Miklos Szeredi
2d84a2d19b fuse: fix possibly missed wake-up after abort
In current fuse_drop_waiting() implementation it's possible that
fuse_wait_aborted() will not be woken up in the unlikely case that
fuse_abort_conn() + fuse_wait_aborted() runs in between checking
fc->connected and calling atomic_dec(&fc->num_waiting).

Do the atomic_dec_and_test() unconditionally, which also provides the
necessary barrier against reordering with the fc->connected check.

The explicit smp_mb() in fuse_wait_aborted() is not actually needed, since
the spin_unlock() in fuse_abort_conn() provides the necessary RELEASE
barrier after resetting fc->connected.  However, this is not a performance
sensitive path, and adding the explicit barrier makes it easier to
document.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: b8f95e5d13 ("fuse: umount should wait for all requests")
Cc: <stable@vger.kernel.org> #v4.19
2018-11-09 15:52:16 +01:00
Miklos Szeredi
7fabaf3034 fuse: fix leaked notify reply
fuse_request_send_notify_reply() may fail if the connection was reset for
some reason (e.g. fs was unmounted).  Don't leak request reference in this
case.  Besides leaking memory, this resulted in fc->num_waiting not being
decremented and hence fuse_wait_aborted() left in a hanging and unkillable
state.

Fixes: 2d45ba381a ("fuse: add retrieve request")
Fixes: b8f95e5d13 ("fuse: umount should wait for all requests")
Reported-and-tested-by: syzbot+6339eda9cb4ebbc4c37b@syzkaller.appspotmail.com
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> #v2.6.36
2018-11-09 15:52:16 +01:00
Linus Torvalds
9931a07d51 Merge branch 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull AFS updates from Al Viro:
 "AFS series, with some iov_iter bits included"

* 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
  missing bits of "iov_iter: Separate type from direction and use accessor functions"
  afs: Probe multiple fileservers simultaneously
  afs: Fix callback handling
  afs: Eliminate the address pointer from the address list cursor
  afs: Allow dumping of server cursor on operation failure
  afs: Implement YFS support in the fs client
  afs: Expand data structure fields to support YFS
  afs: Get the target vnode in afs_rmdir() and get a callback on it
  afs: Calc callback expiry in op reply delivery
  afs: Fix FS.FetchStatus delivery from updating wrong vnode
  afs: Implement the YFS cache manager service
  afs: Remove callback details from afs_callback_break struct
  afs: Commit the status on a new file/dir/symlink
  afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS
  afs: Don't invoke the server to read data beyond EOF
  afs: Add a couple of tracepoints to log I/O errors
  afs: Handle EIO from delivery function
  afs: Fix TTL on VL server and address lists
  afs: Implement VL server rotation
  afs: Improve FS server rotation error handling
  ...
2018-11-01 19:58:52 -07:00
David Howells
00e2370744 iov_iter: Use accessor function
Use accessor functions to access an iterator's type and direction.  This
allows for the possibility of using some other method of determining the
type of iterator than if-chains with bitwise-AND conditions.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:40:44 +01:00