fuse_dax_mem_range_init() does not need the address or the pfn of the
memory requested in dax_direct_access(). It is only calling direct
access to get the number of pages.
Remove the unused variables and stop requesting the kaddr and pfn from
dax_direct_access().
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Link: https://lore.kernel.org/r/20210525172428.3634316-2-ira.weiny@intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYORvYQAKCRDh3BK/laaZ
PCfvAQCbU+PW2RbwlqjZMet6w9qorh29XYe786P5pNRVbMYCygD+N45l66Sbd/Rz
7M7ioVDseyTW4dnLhb8SzSNB0zr6jQs=
=MDvD
-----END PGP SIGNATURE-----
Merge tag 'fuse-update-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
- Fixes for virtiofs submounts
- Misc fixes and cleanups
* tag 'fuse-update-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
virtiofs: Fix spelling mistakes
fuse: use DIV_ROUND_UP helper macro for calculations
fuse: fix illegal access to inode with reused nodeid
fuse: allow fallocate(FALLOC_FL_ZERO_RANGE)
fuse: Make fuse_fill_super_submount() static
fuse: Switch to fc_mount() for submounts
fuse: Call vfs_get_tree() for submounts
fuse: add dedicated filesystem context ops for submounts
virtiofs: propagate sync() to file server
fuse: reject internal errno
fuse: check connected before queueing on fpq->io
fuse: ignore PG_workingset after stealing
fuse: Fix infinite loop in sget_fc()
fuse: Fix crash if superblock of submount gets killed early
fuse: Fix crash in fuse_dentry_automount() error path
These functions implement the address_space ->set_page_dirty operation and
should live in pagemap.h, not mm.h so that the rest of the kernel doesn't
get funny ideas about calling them directly.
Link: https://lkml.kernel.org/r/20210615162342.1669332-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use __set_page_dirty_no_writeback() instead. This will set the dirty bit
on the page, which will be used to avoid calling set_page_dirty() in the
future. It will have no effect on actually writing the page back, as the
pages are not on any LRU lists.
[akpm@linux-foundation.org: export __set_page_dirty_no_writeback() to modules]
Link: https://lkml.kernel.org/r/20210615162342.1669332-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We want to allow submounts for the same fuse_conn, but with different
superblocks so that each of the submounts has its own device ID. To do
so, we need to split all mount-specific information off of fuse_conn
into a new fuse_mount structure, so that multiple mounts can share a
single fuse_conn.
We need to take care only to perform connection-level actions once (i.e.
when the fuse_conn and thus the first fuse_mount are established, or
when the last fuse_mount and thus the fuse_conn are destroyed). For
example, fuse_sb_destroy() must invoke fuse_send_destroy() until the
last superblock is released.
To do so, we keep track of which fuse_mount is the root mount and
perform all fuse_conn-level actions only when this fuse_mount is
involved.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Add logic to free up a busy memory range. Freed memory range will be
returned to free pool. Add a worker which can be started to select
and free some busy memory ranges.
Process can also steal one of its busy dax ranges if free range is not
available. I will refer it to as direct reclaim.
If free range is not available and nothing can't be stolen from same
inode, caller waits on a waitq for free range to become available.
For reclaiming a range, as of now we need to hold following locks in
specified order.
down_write(&fi->i_mmap_sem);
down_write(&fi->dax->sem);
We look for a free range in following order.
A. Try to get a free range.
B. If not, try direct reclaim.
C. If not, wait for a memory range to become free
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This list will be used selecting fuse_dax_mapping to free when number of
free mappings drops below a threshold.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Currently in fuse we don't seem have any lock which can serialize fault
path with truncate/punch_hole path. With dax support I need one for
following reasons.
1. Dax requirement
DAX fault code relies on inode size being stable for the duration of
fault and want to serialize with truncate/punch_hole and they explicitly
mention it.
static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
const struct iomap_ops *ops)
/*
* Check whether offset isn't beyond end of file now. Caller is
* supposed to hold locks serializing us with truncate / punch hole so
* this is a reliable test.
*/
max_pgoff = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
2. Make sure there are no users of pages being truncated/punch_hole
get_user_pages() might take references to page and then do some DMA
to said pages. Filesystem might truncate those pages without knowing
that a DMA is in progress or some I/O is in progress. So use
dax_layout_busy_page() to make sure there are no such references
and I/O is not in progress on said pages before moving ahead with
truncation.
3. Limitation of kvm page fault error reporting
If we are truncating file on host first and then removing mappings in
guest lateter (truncate page cache etc), then this could lead to a
problem with KVM. Say a mapping is in place in guest and truncation
happens on host. Now if guest accesses that mapping, then host will
take a fault and kvm will either exit to qemu or spin infinitely.
IOW, before we do truncation on host, we need to make sure that guest
inode does not have any mapping in that region or whole file.
4. virtiofs memory range reclaim
Soon I will introduce the notion of being able to reclaim dax memory
ranges from a fuse dax inode. There also I need to make sure that
no I/O or fault is going on in the reclaimed range and nobody is using
it so that range can be reclaimed without issues.
Currently if we take inode lock, that serializes read/write. But it does
not do anything for faults. So I add another semaphore fuse_inode->i_mmap_sem
for this purpose. It can be used to serialize with faults.
As of now, I am adding taking this semaphore only in dax fault path and
not regular fault path because existing code does not have one. May
be existing code can benefit from it as well to take care of some
races, but that we can fix later if need be. For now, I am just focussing
only on DAX path which is new path.
Also added logic to take fuse_inode->i_mmap_sem in
truncate/punch_hole/open(O_TRUNC) path to make sure file truncation and
fuse dax fault are mutually exlusive and avoid all the above problems.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This is done along the lines of ext4 and xfs. I primarily wanted
->writepages hook at this time so that I could call into
dax_writeback_mapping_range(). This in turn will decide which pfns need to
be written back.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This patch implements basic DAX support. mmap() is not implemented
yet and will come in later patches. This patch looks into implemeting
read/write.
We make use of interval tree to keep track of per inode dax mappings.
Do not use dax for file extending writes, instead just send WRITE message
to daemon (like we do for direct I/O path). This will keep write and
i_size change atomic w.r.t crash.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The device communicates FUSE_SETUPMAPPING/FUSE_REMOVMAPPING alignment
constraints via the FUST_INIT map_alignment field. Parse this field and
ensure our DAX mappings meet the alignment constraints.
We don't actually align anything differently since our mappings are
already 2MB aligned. Just check the value when the connection is
established. If it becomes necessary to honor arbitrary alignments in
the future we'll have to adjust how mappings are sized.
The upshot of this commit is that we can be confident that mappings will
work even when emulating x86 on Power and similar combinations where the
host page sizes are different.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Divide the dax memory range into fixed size ranges (2MB for now) and put
them in a list. This will track free ranges. Once an inode requires a
free range, we will take one from here and put it in interval-tree
of ranges assigned to inode.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Add a mount option to allow using dax with virtio_fs.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>