Restore support for passing data lifetime information from filesystems to
block drivers. This patch reverts commit b179c98f76 ("block: Remove
request.write_hint") and commit c75e707fe1 ("block: remove the
per-bio/request write hint").
This patch does not modify the size of struct bio because the new
bi_write_hint member fills a hole in struct bio. pahole reports the
following for struct bio on an x86_64 system with this patch applied:
/* size: 112, cachelines: 2, members: 20 */
/* sum members: 110, holes: 1, sum holes: 2 */
/* last cacheline: 48 bytes */
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240202203926.2478590-7-bvanassche@acm.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
If IOCB_DIO_CALLER_COMP is set, utilize that to set kiocb->dio_complete
handler and data for that callback. Rather than punt the completion to a
workqueue, we pass back the handler and data to the issuer and will get
a callback from a safe task context.
Using the following fio job to randomly dio write 4k blocks at
queue depths of 1..16:
fio --name=dio-write --filename=/data1/file --time_based=1 \
--runtime=10 --bs=4096 --rw=randwrite --norandommap --buffered=0 \
--cpus_allowed=4 --ioengine=io_uring --iodepth=$depth
shows the following results before and after this patch:
Stock Patched Diff
=======================================
QD1 155K 162K + 4.5%
QD2 290K 313K + 7.9%
QD4 533K 597K +12.0%
QD8 604K 827K +36.9%
QD16 615K 845K +37.4%
which shows nice wins all around. If we factored in per-IOP efficiency,
the wins look even nicer. This becomes apparent as queue depth rises,
as the offloaded workqueue completions runs out of steam.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than gate whether or not we need to punt a dio completion to a
workqueue on whether the IO is a write or not, add an explicit flag for
it. For now we treat them the same, reads always set the flags and async
writes do not.
No functional changes in this patch.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
iocb->private is only used for polled IO, where the completer will
find the bio to poll through that field.
Assign it when we're submitting a polled bio, and get rid of the
dio->poll_bio indirection.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Whether we have a write back cache and are using FUA or don't have
a write back cache at all is the same situation. Treat them the same.
This makes the IOMAP_DIO_WRITE_FUA name a bit misleading, as we have
two cases that provide stable writes:
1) Volatile write cache with FUA writes
2) Normal write without a volatile write cache
Rename that flag to IOMAP_DIO_STABLE_WRITE to make that clearer, and
update some of the FUA comments as well.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IOMAP_DIO_DIRTY shifts by 31 bits, which makes UBSAN unhappy. Clean up
all the defines by making the shifted value an unsigned value.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reported-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make the logic a bit easier to follow:
1) Add a release_bio out path, as everybody needs to touch that, and
have our bio ref check jump there if it's non-zero.
2) Add a kiocb local variable.
3) Add comments for each of the three conditions (sync, inline, or
async workqueue punt).
No functional changes in this patch.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Yosry has also eliminated cgroup's atomic rstat flushing.
- Nhat Pham adds the new cachestat() syscall. It provides userspace
with the ability to query pagecache status - a similar concept to
mincore() but more powerful and with improved usability.
- Mel Gorman provides more optimizations for compaction, reducing the
prevalence of page rescanning.
- Lorenzo Stoakes has done some maintanance work on the get_user_pages()
interface.
- Liam Howlett continues with cleanups and maintenance work to the maple
tree code. Peng Zhang also does some work on maple tree.
- Johannes Weiner has done some cleanup work on the compaction code.
- David Hildenbrand has contributed additional selftests for
get_user_pages().
- Thomas Gleixner has contributed some maintenance and optimization work
for the vmalloc code.
- Baolin Wang has provided some compaction cleanups,
- SeongJae Park continues maintenance work on the DAMON code.
- Huang Ying has done some maintenance on the swap code's usage of
device refcounting.
- Christoph Hellwig has some cleanups for the filemap/directio code.
- Ryan Roberts provides two patch series which yield some
rationalization of the kernel's access to pte entries - use the provided
APIs rather than open-coding accesses.
- Lorenzo Stoakes has some fixes to the interaction between pagecache
and directio access to file mappings.
- John Hubbard has a series of fixes to the MM selftesting code.
- ZhangPeng continues the folio conversion campaign.
- Hugh Dickins has been working on the pagetable handling code, mainly
with a view to reducing the load on the mmap_lock.
- Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from
128 to 8.
- Domenico Cerasuolo has improved the zswap reclaim mechanism by
reorganizing the LRU management.
- Matthew Wilcox provides some fixups to make gfs2 work better with the
buffer_head code.
- Vishal Moola also has done some folio conversion work.
- Matthew Wilcox has removed the remnants of the pagevec code - their
functionality is migrated over to struct folio_batch.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJejewAKCRDdBJ7gKXxA
joggAPwKMfT9lvDBEUnJagY7dbDPky1cSYZdJKxxM2cApGa42gEA6Cl8HRAWqSOh
J0qXCzqaaN8+BuEyLGDVPaXur9KirwY=
=B7yQ
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull mm updates from Andrew Morton:
- Yosry Ahmed brought back some cgroup v1 stats in OOM logs
- Yosry has also eliminated cgroup's atomic rstat flushing
- Nhat Pham adds the new cachestat() syscall. It provides userspace
with the ability to query pagecache status - a similar concept to
mincore() but more powerful and with improved usability
- Mel Gorman provides more optimizations for compaction, reducing the
prevalence of page rescanning
- Lorenzo Stoakes has done some maintanance work on the
get_user_pages() interface
- Liam Howlett continues with cleanups and maintenance work to the
maple tree code. Peng Zhang also does some work on maple tree
- Johannes Weiner has done some cleanup work on the compaction code
- David Hildenbrand has contributed additional selftests for
get_user_pages()
- Thomas Gleixner has contributed some maintenance and optimization
work for the vmalloc code
- Baolin Wang has provided some compaction cleanups,
- SeongJae Park continues maintenance work on the DAMON code
- Huang Ying has done some maintenance on the swap code's usage of
device refcounting
- Christoph Hellwig has some cleanups for the filemap/directio code
- Ryan Roberts provides two patch series which yield some
rationalization of the kernel's access to pte entries - use the
provided APIs rather than open-coding accesses
- Lorenzo Stoakes has some fixes to the interaction between pagecache
and directio access to file mappings
- John Hubbard has a series of fixes to the MM selftesting code
- ZhangPeng continues the folio conversion campaign
- Hugh Dickins has been working on the pagetable handling code, mainly
with a view to reducing the load on the mmap_lock
- Catalin Marinas has reduced the arm64 kmalloc() minimum alignment
from 128 to 8
- Domenico Cerasuolo has improved the zswap reclaim mechanism by
reorganizing the LRU management
- Matthew Wilcox provides some fixups to make gfs2 work better with the
buffer_head code
- Vishal Moola also has done some folio conversion work
- Matthew Wilcox has removed the remnants of the pagevec code - their
functionality is migrated over to struct folio_batch
* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
mm/hugetlb: remove hugetlb_set_page_subpool()
mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
hugetlb: revert use of page_cache_next_miss()
Revert "page cache: fix page_cache_next/prev_miss off by one"
mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
mm: memcg: rename and document global_reclaim()
mm: kill [add|del]_page_to_lru_list()
mm: compaction: convert to use a folio in isolate_migratepages_block()
mm: zswap: fix double invalidate with exclusive loads
mm: remove unnecessary pagevec includes
mm: remove references to pagevec
mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
mm: remove struct pagevec
net: convert sunrpc from pagevec to folio_batch
i915: convert i915_gpu_error to use a folio_batch
pagevec: rename fbatch_count()
mm: remove check_move_unevictable_pages()
drm: convert drm_gem_put_pages() to use a folio_batch
i915: convert shmem_sg_free_table() to use a folio_batch
scatterlist: add sg_set_folio()
...
Use the common helpers for direct I/O page invalidation instead of open
coding the logic. This leads to a slight reordering of checks in
__iomap_dio_rw to keep the logic straight.
Link: https://lkml.kernel.org/r/20230601145904.1385409-9-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a helper to invalidate page cache after a dio write.
Link: https://lkml.kernel.org/r/20230601145904.1385409-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Move the ki_pos update down a bit to prepare for a better common helper
that invalidates pages based of an iocb.
Link: https://lkml.kernel.org/r/20230601145904.1385409-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Replace BIO_NO_PAGE_REF with a BIO_PAGE_REFFED flag that has the inverted
meaning is only set when a page reference has been acquired that needs to
be released by bio_release_pages().
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230522205744.2825689-4-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ZERO_PAGE can't go away, no need to hold an extra reference.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230522205744.2825689-2-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IOMAP_DIO_NOSYNC earlier was added for use in btrfs. But it seems for
aio dsync writes this is not useful anyway. For aio dsync case, we
we queue the request and return -EIOCBQUEUED. Now, since IOMAP_DIO_NOSYNC
doesn't let iomap_dio_complete() to call generic_write_sync(),
hence we may lose the sync write.
Hence kill this flag as it is not in use by any FS now.
Tested-by: Disha Goel <disgoel@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
No users left now that btrfs takes REQ_OP_WRITE bios from iomap and
splits and converts them to REQ_OP_ZONE_APPEND internally.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Don't transform the logical block size to a bit shift only to shift it
back to the original block size. Just use the size.
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Equivalent of single-segment iovec. Initialized by iov_iter_ubuf(),
checked for by iter_is_ubuf(), otherwise behaves like ITER_IOVEC
ones.
We are going to expose the things like ->write_iter() et.al. to those
in subsequent commits.
New predicate (user_backed_iter()) that is true for ITER_IOVEC and
ITER_UBUF; places like direct-IO handling should use that for
checking that pages we modify after getting them from iov_iter_get_pages()
would need to be dirtied.
DO NOT assume that replacing iter_is_iovec() with user_backed_iter()
will solve all problems - there's code that uses iter_is_iovec() to
decide how to poke around in iov_iter guts and for that the predicate
replacement obviously won't suffice.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
One of the goals is to reduce the overhead of using ->read_iter()
and ->write_iter() instead of ->read()/->write(); new_sync_{read,write}()
has a surprising amount of overhead, in particular inside iocb_flags().
That's why the beginning of the series is in this pile; it's not directly
iov_iter-related, but it's a part of the same work...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYurGOQAKCRBZ7Krx/gZQ
6ysyAP91lvBfMRepcxpd9kvtuzWkU8A3rfSziZZteEHANB9Q7QEAiPn2a2OjWkcZ
uAyUWfCkHCNx+dSMkEvUgR5okQ0exAM=
=9UCV
-----END PGP SIGNATURE-----
Merge tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs iov_iter updates from Al Viro:
"Part 1 - isolated cleanups and optimizations.
One of the goals is to reduce the overhead of using ->read_iter() and
->write_iter() instead of ->read()/->write().
new_sync_{read,write}() has a surprising amount of overhead, in
particular inside iocb_flags(). That's the explanation for the
beginning of the series is in this pile; it's not directly
iov_iter-related, but it's a part of the same work..."
* tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
first_iovec_segment(): just return address
iov_iter: massage calling conventions for first_{iovec,bvec}_segment()
iov_iter: first_{iovec,bvec}_segment() - simplify a bit
iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment()
iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT
iov_iter_bvec_advance(): don't bother with bvec_iter
copy_page_{to,from}_iter(): switch iovec variants to generic
keep iocb_flags() result cached in struct file
iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC
struct file: use anonymous union member for rcuhead and llist
btrfs: use IOMAP_DIO_NOSYNC
teach iomap_dio_rw() to suppress dsync
No need of likely/unlikely on calls of check_copy_size()
Improve static type checking by using the enum req_op type for variables
that represent a request operation and the new blk_opf_t type for
the combination of a request operation and request flags.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-56-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the address alignment requirements from the block_device for direct
io instead of requiring addresses be aligned to the block size.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220610195830.3574005-12-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
New helper to be used instead of direct checks for IOCB_DSYNC:
iocb_is_dsync(iocb). Checks converted, which allows to avoid
the IS_SYNC(iocb->ki_filp->f_mapping->host) part (4 cache lines)
from iocb_flags() - it's checked in iocb_is_dsync() instead
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
New flag, equivalent to removal of IOCB_DSYNC from iocb flags.
This mimics what btrfs is doing (and that's what btrfs will
switch to). However, I'm not at all sure that we want to
suppress REQ_FUA for those - all btrfs hack really cares about
is suppression of generic_write_sync(). For now let's keep
the existing behaviour, but I really want to hear more detailed
arguments pro or contra.
[folded brain fix from willy]
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmKLxJAACgkQxWXV+ddt
WDvC4BAAnSNwZ15FJKe5Y423f6PS6EXjyMuc5t/fW6UumTTbI+tsS+Glkis+JNBf
BiDZSlVQmiK9WoQSJe04epZgHaK8MaCARyZaRaxjDC4Nvfq4DlD9mbAU9D6e7tZY
Mo8M99D8wDW+SB+P8RBpNjwB/oGCMmE3nKC83g+1ObmA0FVRCyQ1Kazf8RzNT1rZ
DiaJoKTvU1/wDN3/1rw5yG+EfW2m9A14gRCihslhFYaDV7jhpuabl8wLT7MftZtE
MtJ6EOOQbgIDjnp5BEIrPmowW/N0tKDT/gorF7cWgLG2R1cbSlKgqSH1Sq7CjFUE
AKj/DwfqZArPLpqMThWklCwy2B9qDEezrQSy7renP/vkeFLbOp8hQuIY5KRzohdG
oDI8ThlQGtCVjbny6NX/BbCnWRAfTz0TquCgag3Xl8NbkRFgFJtkf/cSxzb+3LW1
tFeiUyTVLXVDS1cZLwgcb29Rrtp4bjd5/v3uECQlVD+or5pcAqSMkQgOBlyQJGbE
Xb0nmPRihzQ8D4vINa63WwRyq0+QczVjvBxKj1daas0VEKGd32PIBS/0Qha+EpGl
uFMiHBMSfqyl8QcShFk0cCbcgPMcNc7I6IAbXCE/WhhFG0ytqm9vpmlLqsTrXmHH
z7/Eye/waqgACNEXoA8C4pyYzduQ4i1CeLDOdcsvBU6XQSuicSM=
=lv6P
-----END PGP SIGNATURE-----
Merge tag 'for-5.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Features:
- subpage:
- support for PAGE_SIZE > 4K (previously only 64K)
- make it work with raid56
- repair super block num_devices automatically if it does not match
the number of device items
- defrag can convert inline extents to regular extents, up to now
inline files were skipped but the setting of mount option
max_inline could affect the decision logic
- zoned:
- minimal accepted zone size is explicitly set to 4MiB
- make zone reclaim less aggressive and don't reclaim if there are
enough free zones
- add per-profile sysfs tunable of the reclaim threshold
- allow automatic block group reclaim for non-zoned filesystems, with
sysfs tunables
- tree-checker: new check, compare extent buffer owner against owner
rootid
Performance:
- avoid blocking on space reservation when doing nowait direct io
writes (+7% throughput for reads and writes)
- NOCOW write throughput improvement due to refined locking (+3%)
- send: reduce pressure to page cache by dropping extent pages right
after they're processed
Core:
- convert all radix trees to xarray
- add iterators for b-tree node items
- support printk message index
- user bulk page allocation for extent buffers
- switch to bio_alloc API, use on-stack bios where convenient, other
bio cleanups
- use rw lock for block groups to favor concurrent reads
- simplify workques, don't allocate high priority threads for all
normal queues as we need only one
- refactor scrub, process chunks based on their constraints and
similarity
- allocate direct io structures on stack and pass around only
pointers, avoids allocation and reduces potential error handling
Fixes:
- fix count of reserved transaction items for various inode
operations
- fix deadlock between concurrent dio writes when low on free data
space
- fix a few cases when zones need to be finished
VFS, iomap:
- add helper to check if sb write has started (usable for assertions)
- new helper iomap_dio_alloc_bio, export iomap_dio_bio_end_io"
* tag 'for-5.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (173 commits)
btrfs: zoned: introduce a minimal zone size 4M and reject mount
btrfs: allow defrag to convert inline extents to regular extents
btrfs: add "0x" prefix for unsupported optional features
btrfs: do not account twice for inode ref when reserving metadata units
btrfs: zoned: fix comparison of alloc_offset vs meta_write_pointer
btrfs: send: avoid trashing the page cache
btrfs: send: keep the current inode open while processing it
btrfs: allocate the btrfs_dio_private as part of the iomap dio bio
btrfs: move struct btrfs_dio_private to inode.c
btrfs: remove the disk_bytenr in struct btrfs_dio_private
btrfs: allocate dio_data on stack
iomap: add per-iomap_iter private data
iomap: allow the file system to provide a bio_set for direct I/O
btrfs: add a btrfs_dio_rw wrapper
btrfs: zoned: zone finish unused block group
btrfs: zoned: properly finish block group on metadata write
btrfs: zoned: finish block group when there are no more allocatable bytes left
btrfs: zoned: consolidate zone finish functions
btrfs: zoned: introduce btrfs_zoned_bg_is_full
btrfs: improve error reporting in lookup_inline_extent_backref
...
Allow the file system to keep state for all iterations. For now only
wire it up for direct I/O as there is an immediate need for it there.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Allow the file system to provide a specific bio_set for allocating
direct I/O bios. This will allow file systems that use the
->submit_io hook to stash away additional information for file system
use.
To make use of this additional space for information in the completion
path, the file system needs to override the ->bi_end_io callback and
then call back into iomap, so export iomap_dio_bio_end_io for that.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
So far bio is marked as REQ_POLLED if RWF_HIPRI/IOCB_HIPRI is passed
from userspace sync io interface, then block layer tries to poll until
the bio is completed. But the current implementation calls
blk_io_schedule() if bio_poll() returns 0, and this way causes io hang or
timeout easily.
But looks no one reports this kind of issue, which should have been
triggered in normal io poll sanity test or blktests block/007 as
observed by Changhui, that means it is very likely that no one uses it
or no one cares it.
Also after io_uring is invented, io poll for sync dio becomes legacy
interface.
So ignore RWF_HIPRI hint for sync dio.
CC: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Reported-by: Changhui Zhong <czhong@redhat.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Changhui Zhong <czhong@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220420143110.2679002-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper to check the FUA flag based on the block_device instead of
having to poke into the block layer internal request_queue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmI1AHwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplPjEACVJzKg5NkxpdkDThvq5tejws9KxB/4mHJg
NoDMcv1TF+Orsd/HNW6XrgYnbU0ObHom3568/xb8BNegRVFe7V4ME/4IYNRyGOmV
qbfciu04L1UkJhI52CIidkOioFABL3r1zgLCIz5vk0Cv9X7Le9x0UabHxJf7u9S+
Z3lNdyxezN0SGx8VT86l/7lSoHtG3VHO9IsQCuNGF02SB+6uGpXBlptbEoQ4nTxd
T7/H9FNOe2Wf7eKvcOOds8UlvZYAfYcY0GcRrIOXdHIy25mKFWwn5cDgFTMOH5ID
xXpm+JFkDkrfSW1o4FFPxbN9Z6RbVXbGCsrXlIragLO2MJQdXiIUxS1OPT5oAado
H9MlX6QtkwziLW9zUWa/N/jmRjc2vzHAxD6JFg/wXxNdtY0kd8TQpaxwTB8mVDPN
VCGutt7lJS1CQInQ+ppzbdqzzuLHC1RHAyWSmfUE9rb8cbjxtJBnSIorYRLUesMT
GRwqVTXW0osxSgCb1iDiBCJANrX1yPZcemv4Wh1gzbT6IE9sWxWXsE5sy9KvswNc
M+E4nu/TYYTfkynItJjLgmDLOoi+V0FBY6ba0mRPBjkriSP4AVlwsZLGVsAHQzuA
o5paW1GjRCCwhIQ6+AzZIoOz6wqvprBlUgUkUneyYAQ2ZKC3pZi8zPnpoVdFucVa
VaTzP71C1Q==
=efaq
-----END PGP SIGNATURE-----
Merge tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block
Pull NVMe write streams removal from Jens Axboe:
"This removes the write streams support in NVMe. No vendor ever really
shipped working support for this, and they are not interested in
supporting it.
With the NVMe support gone, we have nothing in the tree that supports
this. Remove passing around of the hints.
The only discussion point in this patchset imho is the fact that the
file specific write hint setting/getting fcntl helpers will now return
-1/EINVAL like they did before we supported write hints. No known
applications use these functions, I only know of one prototype that I
help do for RocksDB, and that's not used. That said, with a change
like this, it's always a bit controversial. Alternatively, we could
just make them return 0 and pretend it worked. It's placement based
hints after all"
* tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block:
fs: remove fs.f_write_hint
fs: remove kiocb.ki_hint
block: remove the per-bio/request write hint
nvme: remove support or stream based temperature hint
Add support for direct I/O on encrypted files when blk-crypto (inline
encryption) is being used for file contents encryption.
There will be a merge conflict with the block pull request in
fs/iomap/direct-io.c, due to some bio interface cleanups. The merge
resolution is straightforward and can be found in linux-next.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCYjgCfhQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK6vjAQDp8D8OKIyj67KjYwvyHpNy0fZhxeur
RexC0nDfd9BE/AD/fV6zpCglmsuGxqxL0jmqeczKXn2y7nRFmPciCBTi/wY=
=kwNd
-----END PGP SIGNATURE-----
Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt
Pull fscrypt updates from Eric Biggers:
"Add support for direct I/O on encrypted files when blk-crypto (inline
encryption) is being used for file contents encryption"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
fscrypt: update documentation for direct I/O support
f2fs: support direct I/O with fscrypt using blk-crypto
ext4: support direct I/O with fscrypt using blk-crypto
iomap: support direct I/O with fscrypt using blk-crypto
fscrypt: add functions for direct I/O support
With the NVMe support for this gone, there are no consumers of these hints
left, so remove them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220304175556.407719-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Encrypted files traditionally haven't supported DIO, due to the need to
encrypt/decrypt the data. However, when the encryption is implemented
using inline encryption (blk-crypto) instead of the traditional
filesystem-layer encryption, it is straightforward to support DIO.
Add support for this to the iomap DIO implementation by calling
fscrypt_set_bio_crypt_ctx() to set encryption contexts on the bios.
Don't check for the rare case where a DUN (crypto data unit number)
discontiguity creates a boundary that bios must not cross. Instead,
filesystems are expected to handle this in ->iomap_begin() by limiting
the length of the mapping so that iomap doesn't have to worry about it.
Co-developed-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220128233940.79464-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Pass the block_device and operation that we plan to use this bio for to
bio_alloc to optimize the assignment. NULL/0 can be passed, both for the
passthrough case on a raw request_queue and to temporarily avoid
refactoring some nasty code.
Also move the gfp_mask argument after the nr_vecs argument for a much
more logical calling convention matching what most of the kernel does.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No functional changes in this patch, just in preparation for efficiently
calling this light function from the block O_DIRECT handling.
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Functions gfs2_file_read_iter and gfs2_file_write_iter are both
accessing the user buffer to write to or read from while holding the
inode glock. In the most basic scenario, that buffer will not be
resident and it will be mapped to the same file. Accessing the buffer
will trigger a page fault, and gfs2 will deadlock trying to take the
same inode glock again while trying to handle that fault.
Fix that and similar, more complex scenarios by disabling page faults
while accessing user buffers. To make this work, introduce a small
amount of new infrastructure and fix some bugs that didn't trigger so
far, with page faults enabled.
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmGBPisUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTpE6A/7BezUnGuNJxJrR8pC+vcLYA7xAgUU
6STQ6IN7w5UHRlSkNzZxZ2XPxW4uVQ4SxSEeaLqBsHZihepjcLNFZ/8MhQ6UPSD0
8noHOi7CoIcp6IuWQtCpxRM/xjjm2SlMt2XbVJZaiJcdzCV9gB6TU9EkBRq7Zm/X
9WFBbv1xZF0skn9ISCJvNtiiI+VyWKgMDUKxJUiTQjmJcklyyqHcVGmQi9BjqPz4
4s3F+WH6CoGbDKlmNk/6Y9wZ/2+sbvGswVscUxPwJVPoZWsR1xBBUdAeAmEMD1P4
BgE/Y1J8JXyVPYtyvZKq70XUhKdQkxB7RfX87YasOk9mY4Kjd5rIIGEykh+o2vC9
kDhCHvf2Mnw5I6Rum3B7UXyB1vemY+fECIHsXhgBnS+ztabRtcAdpCuWoqb43ymw
yEX1KwXyU4FpRYbrRvdZT42Fmh6ty8TW+N4swg8S2TrffirvgAi5yrcHZ4mPupYv
lyzvsCW7Wv8hPXn/twNObX+okRgJnsxcCdBXARdCnRXfA8tH23xmu88u8RA1Vdxh
nzTvv6Dx2EowwojuDWMx29Mw3fA2IqIfbOV+4FaRU7NZ2ZKtknL8yGl27qQUsMoJ
vYsHTmagasjQr+NDJ3vQRLCw+JQ6B1hENpdkmixFD9moo7X1ZFW3HBi/UL973Bv6
5CmgeXto8FRUFjI=
=WeNd
-----END PGP SIGNATURE-----
Merge tag 'gfs2-v5.15-rc5-mmap-fault' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 mmap + page fault deadlocks fixes from Andreas Gruenbacher:
"Functions gfs2_file_read_iter and gfs2_file_write_iter are both
accessing the user buffer to write to or read from while holding the
inode glock.
In the most basic deadlock scenario, that buffer will not be resident
and it will be mapped to the same file. Accessing the buffer will
trigger a page fault, and gfs2 will deadlock trying to take the same
inode glock again while trying to handle that fault.
Fix that and similar, more complex scenarios by disabling page faults
while accessing user buffers. To make this work, introduce a small
amount of new infrastructure and fix some bugs that didn't trigger so
far, with page faults enabled"
* tag 'gfs2-v5.15-rc5-mmap-fault' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix mmap + page fault deadlocks for direct I/O
iov_iter: Introduce nofault flag to disable page faults
gup: Introduce FOLL_NOFAULT flag to disable page faults
iomap: Add done_before argument to iomap_dio_rw
iomap: Support partial direct I/O on user copy failures
iomap: Fix iomap_dio_rw return value for user copies
gfs2: Fix mmap + page fault deadlocks for buffered I/O
gfs2: Eliminate ip->i_gh
gfs2: Move the inode glock locking to gfs2_file_buffered_write
gfs2: Introduce flag for glock holder auto-demotion
gfs2: Clean up function may_grant
gfs2: Add wrapper for iomap_file_buffered_write
iov_iter: Introduce fault_in_iov_iter_writeable
iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}
powerpc/kvm: Fix kvm_use_magic_page
iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value
The second argument was only used by the USB gadget code, yet everyone
pays the overhead of passing a zero to be passed into aio, where it
ends up being part of the aio res2 value.
Now that everybody is passing in zero, kill off the extra argument.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a done_before argument to iomap_dio_rw that indicates how much of
the request has already been transferred. When the request succeeds, we
report that done_before additional bytes were tranferred. This is
useful for finishing a request asynchronously when part of the request
has already been completed synchronously.
We'll use that to allow iomap_dio_rw to be used with page faults
disabled: when a page fault occurs while submitting a request, we
synchronously complete the part of the request that has already been
submitted. The caller can then take care of the page fault and call
iomap_dio_rw again for the rest of the request, passing in the number of
bytes already tranferred.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
In iomap_dio_rw, when iomap_apply returns an -EFAULT error and the
IOMAP_DIO_PARTIAL flag is set, complete the request synchronously and
return a partial result. This allows the caller to deal with the page
fault and retry the remainder of the request.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
When a user copy fails in one of the helpers of iomap_dio_rw, fail with
-EFAULT instead of returning 0. This matches what iomap_dio_bio_actor
returns when it gets an -EFAULT from bio_iov_iter_get_pages. With these
changes, iomap_dio_actor now consistently fails with -EFAULT when a user
page cannot be faulted in.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
struct io_comp_batch contains a list head and a completion handler, which
will allow completions to more effciently completed batches of IO.
For now, no functional changes in this patch, we just define the
io_comp_batch structure and add the argument to the file_operations iopoll
handler.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace the blk_poll interface that requires the caller to keep a queue
and cookie from the submissions with polling based on the bio.
Polling for the bio itself leads to a few advantages:
- the cookie construction can made entirely private in blk-mq.c
- the caller does not need to remember the request_queue and cookie
separately and thus sidesteps their lifetime issues
- keeping the device and the cookie inside the bio allows to trivially
support polling BIOs remapping by stacking drivers
- a lot of code to propagate the cookie back up the submission path can
be removed entirely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Switch the boolean spin argument to blk_poll to passing a set of flags
instead. This will allow to control polling behavior in a more fine
grained way.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-10-hch@lst.de
[axboe: adapt to changed io_uring iopoll]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If an iocb is split into multiple bios we can't poll for both. So don't
bother to even try to poll in that case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Switch __iomap_dio_rw to use iomap_iter.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
The existing inline data support only works for cases where the entire
file is stored as inline data. For larger files, EROFS stores the
initial blocks separately and the remainder of the file ("file tail")
adjacent to the inode. Generalise inline data to allow reading the
inline file tail. Tails may not cross a page boundary in memory.
We currently have no filesystems that support tails and writing,
so that case is currently disabled (see iomap_write_begin_inline).
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
For reads, use the better variant of checking for the need to call
filemap_write_and_wait_range() when doing O_DIRECT. This avoids falling
back to the slow path for IOCB_NOWAIT, if there are no pages to wait for
(or write out).
Link: https://lkml.kernel.org/r/20210224164455.1096727-4-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ever since the addition of multipage bio_vecs BIO_MAX_PAGES has been
horribly confusingly misnamed. Rename it to BIO_MAX_VECS to stop
confusing users of the bio API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210311110137.1132391-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmAtmIwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplzLEAC5O+3rBM8QuiJdo39Yppmuw4hDJ6hOKynP
EJQLKQQi0VfXgU+MprGvcbpFYmNbgICvUICQkEzJuk++kPCu/BJtJz0yErQeLgS+
RdXiPV6enbF7iRML5TVRTr1q/z7sJMXcIIJ8Pz/rU/JNfGYExVd0WfnEY9mp1jOt
Bl9V+qyTazdP+Ma4+uEPatSayqcdi1rxB5I+7v/sLiOvKZZWkaRZjUZ/mxAjUfvK
dBOOPjMygEo3tCLkIyyA6lpLvr1r+SUZhLuebRLEKa3To3TW6RtoG0qwpKmI2iKw
ylLeVLB60nM9RUxjflVOfBsHxz1bDg5Ve86y5nCjQd4Jo8x1c4DnecyGE5/Tu8Rg
rgbsfD6nFWzhDCvcZT0XrfQ4ZAjIL2IfT+ypQiQ6UlRd3hvIKRmzWMkjuH2svr0u
ey9Kq+lYerI4cM0F3W73gzUKdIQOuCzBCYxQuSQQomscBa7FCInyU192dAI9Aj6l
Yd06mgKu6qCx6zLv6JfpBqaBHZMwyGE4dmZgPQFuuwO+b4N+Ck3Jm5fzEzw/xIxQ
wdo/DlsAl60BXentB6FByGBJaCjVdSymRqN/xNCAbFKCjmr6TLBuXPfg1gYYO7xC
VOcVjWe8iN3wWHZab3t2mxMKH9B9B/KKzIhu6TNHSmgtQ5paZPRCBx995pDyRw26
WC22RGC2MA==
=os1E
-----END PGP SIGNATURE-----
Merge tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
"Another nice round of removing more code than what is added, mostly
due to Christoph's relentless pursuit of tech debt removal/cleanups.
This pull request contains:
- Two series of BFQ improvements (Paolo, Jan, Jia)
- Block iov_iter improvements (Pavel)
- bsg error path fix (Pan)
- blk-mq scheduler improvements (Jan)
- -EBUSY discard fix (Jan)
- bvec allocation improvements (Ming, Christoph)
- bio allocation and init improvements (Christoph)
- Store bdev pointer in bio instead of gendisk + partno (Christoph)
- Block trace point cleanups (Christoph)
- hard read-only vs read-only split (Christoph)
- Block based swap cleanups (Christoph)
- Zoned write granularity support (Damien)
- Various fixes/tweaks (Chunguang, Guoqing, Lei, Lukas, Huhai)"
* tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block: (104 commits)
mm: simplify swapdev_block
sd_zbc: clear zone resources for non-zoned case
block: introduce blk_queue_clear_zone_settings()
zonefs: use zone write granularity as block size
block: introduce zone_write_granularity limit
block: use blk_queue_set_zoned in add_partition()
nullb: use blk_queue_set_zoned() to setup zoned devices
nvme: cleanup zone information initialization
block: document zone_append_max_bytes attribute
block: use bi_max_vecs to find the bvec pool
md/raid10: remove dead code in reshape_request
block: mark the bio as cloned in bio_iov_bvec_set
block: set BIO_NO_PAGE_REF in bio_iov_bvec_set
block: remove a layer of indentation in bio_iov_iter_get_pages
block: turn the nr_iovecs argument to bio_alloc* into an unsigned short
block: remove the 1 and 4 vec bvec_slabs entries
block: streamline bvec_alloc
block: factor out a bvec_alloc_gfp helper
block: move struct biovec_slab to bio.c
block: reuse BIO_INLINE_VECS for integrity bvecs
...
- Adjust the final parameter of iomap_dio_rw.
- Add a new flag to request that iomap directio writes return EAGAIN if
the write is not a pure overwrite within EOF; this will be used to
reduce lock contention with unaligned direct writes on XFS.
- Amend XFS' directio code to eliminate exclusive locking for unaligned
direct writes if the circumstances permit
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmAZgQAACgkQ+H93GTRK
tOtNqw/+KPff1NjQVK2k361R0+LjlEHfe2nxh7+kS10IiR5nbBz4Fu+GwEosZKq+
H9ficBbZ0wIveV+5CEt2xZLEJFC4LZUpNPVVrUf8XPLKiVexP/U3wtKzmv9Z7D5J
5walMWQycVeR+ycomynV36giqekvARL7KCQG5By2ITfSNxfnb/wvKhn1d61ZDOF6
f4xzq7F6+cEOrSZt2LcFzGSfsTl6oakYMAomPU57sqGmw7MHRqoPTErbdh2HnVJy
yQ47eiZgSKWKA+Qm+VvHHePYCYnu0nvA2rbNerjTN70hnO8rK9S0Vle6Sp5CUqAX
sXOy8zxOLYKqyM4S/QkIN2TGIyWg+CHiakVLZGF3Q4AUDDYfpD0cHvAe9N3v9euL
qt8ypT8dz2C3qiTg5E31xy033wlAP0wg3FZiLAqEjL5o3fzD+qbplTiSmYbMV2Fb
xuu7a2T6u1MHaIn1IhaL0cB49Fzn+5EMyp6BlAucAOakyuqJCyJiXokdk0Looy5e
jUshvcwWcmHMpI/YYYY6t56KV6tl2exGq5sySY5U6dr8/r5lwc0SI+TrYFG0jTR8
59DGd5CkKgdBFcuys+eaZDXgr7A4ymkVE+pE0QNDz9UwNP20tLb3dQNlhgxchUgu
NgPaFgQkoNM3HmQNyU2wX/t1aFlC/doqSkb/96UWQSxq6IrajMU=
=AR07
-----END PGP SIGNATURE-----
Merge tag 'iomap-5.12-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap updates from Darrick Wong:
"The big change in this cycle is some new code to make it possible for
XFS to try unaligned directio overwrites without taking locks. If the
block is fully written and within EOF (i.e. doesn't require any
further fs intervention) then we can let the unlocked write proceed.
If not, we fall back to synchronizing direct writes.
Summary:
- Adjust the final parameter of iomap_dio_rw.
- Add a new flag to request that iomap directio writes return EAGAIN
if the write is not a pure overwrite within EOF; this will be used
to reduce lock contention with unaligned direct writes on XFS.
- Amend XFS' directio code to eliminate exclusive locking for
unaligned direct writes if the circumstances permit"
* tag 'iomap-5.12-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: reduce exclusive locking on unaligned dio
xfs: split the unaligned DIO write code out
xfs: improve the reflink_bounce_dio_write tracepoint
xfs: simplify the read/write tracepoints
xfs: remove the buffered I/O fallback assert
xfs: cleanup the read/write helper naming
xfs: make xfs_file_aio_write_checks IOCB_NOWAIT-aware
xfs: factor out a xfs_ilock_iocb helper
iomap: add a IOMAP_DIO_OVERWRITE_ONLY flag
iomap: pass a flags argument to iomap_dio_rw
iomap: rename the flags variable in __iomap_dio_rw
A ZONE_APPEND bio must follow hardware restrictions (e.g. not exceeding
max_zone_append_sectors) not to be split. bio_iov_iter_get_pages builds
such restricted bio using __bio_iov_append_get_pages if bio_op(bio) ==
REQ_OP_ZONE_APPEND.
To utilize it, we need to set the bio_op before calling
bio_iov_iter_get_pages(). This commit introduces IOMAP_F_ZONE_APPEND, so
that iomap user can set the flag to indicate they want REQ_OP_ZONE_APPEND
and restricted bio.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>