The current way of setting and getting posix acls through the generic
xattr interface is error prone and type unsafe. The vfs needs to
interpret and fixup posix acls before storing or reporting it to
userspace. Various hacks exist to make this work. The code is hard to
understand and difficult to maintain in it's current form. Instead of
making this work by hacking posix acls through xattr handlers we are
building a dedicated posix acl api around the get and set inode
operations. This removes a lot of hackiness and makes the codepaths
easier to maintain. A lot of background can be found in [1].
The current inode operation for getting posix acls takes an inode
argument but various filesystems (e.g., 9p, cifs, overlayfs) need access
to the dentry. In contrast to the ->set_acl() inode operation we cannot
simply extend ->get_acl() to take a dentry argument. The ->get_acl()
inode operation is called from:
acl_permission_check()
-> check_acl()
-> get_acl()
which is part of generic_permission() which in turn is part of
inode_permission(). Both generic_permission() and inode_permission() are
called in the ->permission() handler of various filesystems (e.g.,
overlayfs). So simply passing a dentry argument to ->get_acl() would
amount to also having to pass a dentry argument to ->permission(). We
should avoid this unnecessary change.
So instead of extending the existing inode operation rename it from
->get_acl() to ->get_inode_acl() and add a ->get_acl() method later that
passes a dentry argument and which filesystems that need access to the
dentry can implement instead of ->get_inode_acl(). Filesystems like cifs
which allow setting and getting posix acls but not using them for
permission checking during lookup can simply not implement
->get_inode_acl().
This is intended to be a non-functional change.
Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1]
Suggested-by/Inspired-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
s_inodes is superblock-specific resource, which should be
protected by sb's specific lock s_inode_list_lock.
Link: https://lore.kernel.org/r/TYCP286MB23238380DE3B74874E8D78ABCA299@TYCP286MB2323.JPNP286.PROD.OUTLOOK.COM
Fixes: 7d41963759 ("erofs: Support sharing cookies in the same domain")
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Dawei Li <set_pte_at@outlook.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Partial decompression should be checked after updating length.
It's a new regression when introducing multi-reference pclusters.
Fixes: 2bfab9c0ed ("erofs: record the longest decompressed size in this round")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20221014064915.8103-1-hsiangkao@linux.alibaba.com
If other duplicated copies exist in one decompression shot, should
leave the old page as is rather than replace it with the new duplicated
one. Otherwise, the following cold path to deal with duplicated copies
will use the invalid bvec. It impacts compressed data deduplication.
Also, shift the onlinepage EIO bit to avoid touching the signed bit.
Fixes: 267f2492c8 ("erofs: introduce multi-reference pclusters (fully-referenced)")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20221012045056.13421-1-hsiangkao@linux.alibaba.com
Note that we are still accessing 'h_idata_size' and 'h_fragmentoff'
after calling erofs_put_metabuf(), that is not correct. Fix it.
Fixes: ab92184ff8 ("erofs: add on-disk compressed tail-packing inline support")
Fixes: b15b2e307c ("erofs: support on-disk compressed fragments data")
Signed-off-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20221005013528.62977-1-zbestahu@163.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM67XkQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpiHoD/9eN+6YnNRPu5+2zeGnnm1Nlwic6YMZeORr
KFIeC0COMWoFhNBIPFkgAKT+0qIH+uGt5UsHSM3Y5La7wMR8yLxD4PAnvTZ/Ijtt
yxVIOmonJoQ0OrQ2kTbvDXL/9OCUrzwXXyUIEPJnH0Ca1mxeNOgDHbE7VGF6DMul
0D3pI8qs2WLnHlDi1V/8kH5qZ6WoAJSDcb8sTzOUVnyveZPNaZhGQJuHA2XAYMtg
fqKMDJqgmNk6jdTMUgdF5B+rV64PQoCy28I7fXqGkEe+RE5TBy57vAa0XY84V8XR
/a8CEuwMts2ypk1hIcJG8Vv8K6u5war9yPM5MTngKsoMpzNIlhrhaJQVyjKdcs+E
Ixwzexu6xTYcrcq+mUARgeTh79FzTBM/uXEdbCG2G3S6HPd6UZWUJZGfxw/l0Aem
V4xB7lj6SQaJDU1iJCYUaHcekNXhQAPvyVG+R2ED1SO3McTpTPIM1aeigxw6vj7u
bH3Kfdr94Z8HNuoLuiS6YYfjNt2Shf4LEB6GxKJ9TYHtyhdOyO0H64jGHpygrWqN
cSnkWPUqUUNpF7srKM0ZgbliCshvmyJc4aMOFd0gBY/kXf5J/j7IXvh8TFCi9rHH
0KyZH3/3Zsu9geUn3ynznlr4FXU+BcqE6boaa/iWb9sN1m+Rvaahv8cSch/dh44a
vQNj/iOBQA==
=R05e
-----END PGP SIGNATURE-----
Merge tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe pull requests via Christoph:
- handle number of queue changes in the TCP and RDMA drivers
(Daniel Wagner)
- allow changing the number of queues in nvmet (Daniel Wagner)
- also consider host_iface when checking ip options (Daniel
Wagner)
- don't map pages which can't come from HIGHMEM (Fabio M. De
Francesco)
- avoid unnecessary flush bios in nvmet (Guixin Liu)
- shrink and better pack the nvme_iod structure (Keith Busch)
- add comment for unaligned "fake" nqn (Linjun Bao)
- print actual source IP address through sysfs "address" attr
(Martin Belanger)
- various cleanups (Jackie Liu, Wolfram Sang, Genjian Zhang)
- handle effects after freeing the request (Keith Busch)
- copy firmware_rev on each init (Keith Busch)
- restrict management ioctls to admin (Keith Busch)
- ensure subsystem reset is single threaded (Keith Busch)
- report the actual number of tagset maps in nvme-pci (Keith
Busch)
- small fabrics authentication fixups (Christoph Hellwig)
- add common code for tagset allocation and freeing (Christoph
Hellwig)
- stop using the request_queue in nvmet (Christoph Hellwig)
- set min_align_mask before calculating max_hw_sectors (Rishabh
Bhatnagar)
- send a rediscover uevent when a persistent discovery controller
reconnects (Sagi Grimberg)
- misc nvmet-tcp fixes (Varun Prakash, zhenwei pi)
- MD pull request via Song:
- Various raid5 fix and clean up, by Logan Gunthorpe and David
Sloan.
- Raid10 performance optimization, by Yu Kuai.
- sbitmap wakeup hang fixes (Hugh, Keith, Jan, Yu)
- IO scheduler switching quisce fix (Keith)
- s390/dasd block driver updates (Stefan)
- support for recovery for the ublk driver (ZiyangZhang)
- rnbd drivers fixes and updates (Guoqing, Santosh, ye, Christoph)
- blk-mq and null_blk map fixes (Bart)
- various bcache fixes (Coly, Jilin, Jules)
- nbd signal hang fix (Shigeru)
- block writeback throttling fix (Yu)
- optimize the passthrough mapping handling (me)
- prepare block cgroups to being gendisk based (Christoph)
- get rid of an old PSI hack in the block layer, moving it to the
callers instead where it belongs (Christoph)
- blk-throttle fixes and cleanups (Yu)
- misc fixes and cleanups (Liu Shixin, Liu Song, Miaohe, Pankaj,
Ping-Xiang, Wolfram, Saurabh, Li Jinlin, Li Lei, Lin, Li zeming,
Miaohe, Bart, Coly, Gaosheng
* tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux: (162 commits)
sbitmap: fix lockup while swapping
block: add rationale for not using blk_mq_plug() when applicable
block: adapt blk_mq_plug() to not plug for writes that require a zone lock
s390/dasd: use blk_mq_alloc_disk
blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep
nvmet: don't look at the request_queue in nvmet_bdev_set_limits
nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all
blk-mq: use quiesced elevator switch when reinitializing queues
block: replace blk_queue_nowait with bdev_nowait
nvme: remove nvme_ctrl_init_connect_q
nvme-loop: use the tagset alloc/free helpers
nvme-loop: store the generic nvme_ctrl in set->driver_data
nvme-loop: initialize sqsize later
nvme-fc: use the tagset alloc/free helpers
nvme-fc: store the generic nvme_ctrl in set->driver_data
nvme-fc: keep ctrl->sqsize in sync with opts->queue_size
nvme-rdma: use the tagset alloc/free helpers
nvme-rdma: store the generic nvme_ctrl in set->driver_data
nvme-tcp: use the tagset alloc/free helpers
nvme-tcp: store the generic nvme_ctrl in set->driver_data
...
The name of this function looks not very accurate compared to it's
implementation and it's only a wrapper to erofs_read_metabuf(). So,
let's fold it directly instead.
Signed-off-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220927032518.25266-1-zbestahu@gmail.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Due to deduplication for compressed data, pclusters can be partially
referenced with their prefixes.
Together with the user-space implementation, it enables EROFS
variable-length global compressed data deduplication with rolling
hash.
Link: https://lore.kernel.org/r/20220923014915.4362-1-hsiangkao@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Introduce on-disk compressed fragments data feature.
This approach adds a new field called `h_fragmentoff' in the per-file
compression header to indicate the fragment offset of each tail pcluster
or the whole file in the special packed inode.
Similar to ztailpacking, it will also find and record the 'headlcn'
of the tail pcluster when initializing per-inode zmap for making
follow-on requests more easy.
Signed-off-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/YzHKxcFTlHGgXeH9@B-P7TQMD6M-0146.local
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Currently, uncompressed data is all handled in the shifted way, which
means we have to shift the whole on-disk plain pcluster to get the
logical data. However, since we are also using in-place I/O for
uncompressed data, data copy will be reduced a lot if pcluster is
recorded in the interlaced way as illustrated below:
_______________________________________________________________
| | | |_ tail part |_ head part _|
|<- blk0 ->| .. |<- blkn-2 ->|<- blkn-1 ->|
The logical data then becomes:
________________________________________________________
|_ head part _|_ blk0 _| .. |_ blkn-2 _|_ tail part _|
In addition, non-4k plain pclusters are also survived by the
interlaced way, which can be used for non-4k lclusters as well.
However, it's almost impossible to de-duplicate uncompressed data
in the interlaced way, therefore shifted uncompressed data is still
useful.
Signed-off-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/8369112678604fdf4ef796626d59b1fdd0745a53.1663898962.git.huyue2@coolpad.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
The implementation of these two functions in fscache mode is almost the
same. Extract the same part as a generic helper to remove the code
duplication.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com>
Link: https://lore.kernel.org/r/20220922062414.20437-1-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
erofs uses an additional address space for compressed data read from disk
in addition to the one directly associated with the inode. Reading into
the lower address space is open coded using add_to_page_cache_lru instead
of using the filemap.c helper for page allocation micro-optimizations,
which means it is not covered by the MM PSI annotations for ->read_folio
and ->readahead, so add manual ones instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220915094200.139713-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Introduce 'domain_id' mount option to enable shared domain sementics.
In which case, the related cookie is shared if two mountpoints in the
same domain have the same data blob. Users could specify the name of
domain by this mount option.
Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220918043456.147-7-zhujia.zj@bytedance.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Several erofs filesystems can belong to one domain, and data blobs can
be shared among these erofs filesystems of same domain.
Users could specify domain_id mount option to create or join into a
domain.
Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220918110150.6338-1-zhujia.zj@bytedance.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
A new fscache-based shared domain mode is going to be introduced for
erofs. In which case, same data blobs in same domain will be shared
and reused to reduce on-disk space usage.
The implementation of sharing blobs will be introduced in subsequent
patches.
Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220918043456.147-4-zhujia.zj@bytedance.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
During stress testing with CONFIG_SMP disabled, KASAN reports as below:
==================================================================
BUG: KASAN: use-after-free in __mutex_lock+0xe5/0xc30
Read of size 8 at addr ffff8881094223f8 by task stress/7789
CPU: 0 PID: 7789 Comm: stress Not tainted 6.0.0-rc1-00002-g0d53d2e882f9 #3
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<TASK>
..
__mutex_lock+0xe5/0xc30
..
z_erofs_do_read_page+0x8ce/0x1560
..
z_erofs_readahead+0x31c/0x580
..
Freed by task 7787
kasan_save_stack+0x1e/0x40
kasan_set_track+0x20/0x30
kasan_set_free_info+0x20/0x40
__kasan_slab_free+0x10c/0x190
kmem_cache_free+0xed/0x380
rcu_core+0x3d5/0xc90
__do_softirq+0x12d/0x389
Last potentially related work creation:
kasan_save_stack+0x1e/0x40
__kasan_record_aux_stack+0x97/0xb0
call_rcu+0x3d/0x3f0
erofs_shrink_workstation+0x11f/0x210
erofs_shrink_scan+0xdc/0x170
shrink_slab.constprop.0+0x296/0x530
drop_slab+0x1c/0x70
drop_caches_sysctl_handler+0x70/0x80
proc_sys_call_handler+0x20a/0x2f0
vfs_write+0x555/0x6c0
ksys_write+0xbe/0x160
do_syscall_64+0x3b/0x90
The root cause is that erofs_workgroup_unfreeze() doesn't reset to
orig_val thus it causes a race that the pcluster reuses unexpectedly
before freeing.
Since UP platforms are quite rare now, such path becomes unnecessary.
Let's drop such specific-designed path directly instead.
Fixes: 73f5c66df3 ("staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'")
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220902045710.109530-1-hsiangkao@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Actually, 'compressedlcs' stores compressed block count rather than
lcluster count. Therefore, the number of bits for shifting the count
should be 'LOG_BLOCK_SIZE' rather than 'lclusterbits' although current
lcluster size is 4K.
The value of 'm_plen' will be wrong once we enable the non 4K-sized
lcluster.
Signed-off-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220812060150.8510-1-huyue2@coolpad.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
If erofs_fscache_alloc_request fail and then goto out, it will return 0.
it should return a negative error code instead of 0.
Fixes: d435d53228 ("erofs: change to use asynchronous io for fscache readpage/readahead")
Signed-off-by: Sun Ke <sunke32@huawei.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220815034829.3940803-1-sunke32@huawei.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
- Some kmemleak fixes from Patrick Wang and Waiman Long
- DAMON updates from SeongJae Park
- memcg debug/visibility work from Roman Gushchin
- vmalloc speedup from Uladzislau Rezki
- more folio conversion work from Matthew Wilcox
- enhancements for coherent device memory mapping from Alex Sierra
- addition of shared pages tracking and CoW support for fsdax, from
Shiyang Ruan
- hugetlb optimizations from Mike Kravetz
- Mel Gorman has contributed some pagealloc changes to improve latency
and realtime behaviour.
- mprotect soft-dirty checking has been improved by Peter Xu
- Many other singleton patches all over the place
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYuravgAKCRDdBJ7gKXxA
jpqSAQDrXSdII+ht9kSHlaCVYjqRFQz/rRvURQrWQV74f6aeiAD+NHHeDPwZn11/
SPktqEUrF1pxnGQxqLh1kUFUhsVZQgE=
=w/UH
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Most of the MM queue. A few things are still pending.
Liam's maple tree rework didn't make it. This has resulted in a few
other minor patch series being held over for next time.
Multi-gen LRU still isn't merged as we were waiting for mapletree to
stabilize. The current plan is to merge MGLRU into -mm soon and to
later reintroduce mapletree, with a view to hopefully getting both
into 6.1-rc1.
Summary:
- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
- Some kmemleak fixes from Patrick Wang and Waiman Long
- DAMON updates from SeongJae Park
- memcg debug/visibility work from Roman Gushchin
- vmalloc speedup from Uladzislau Rezki
- more folio conversion work from Matthew Wilcox
- enhancements for coherent device memory mapping from Alex Sierra
- addition of shared pages tracking and CoW support for fsdax, from
Shiyang Ruan
- hugetlb optimizations from Mike Kravetz
- Mel Gorman has contributed some pagealloc changes to improve
latency and realtime behaviour.
- mprotect soft-dirty checking has been improved by Peter Xu
- Many other singleton patches all over the place"
[ XFS merge from hell as per Darrick Wong in
https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ]
* tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits)
tools/testing/selftests/vm/hmm-tests.c: fix build
mm: Kconfig: fix typo
mm: memory-failure: convert to pr_fmt()
mm: use is_zone_movable_page() helper
hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
hugetlbfs: cleanup some comments in inode.c
hugetlbfs: remove unneeded header file
hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
hugetlbfs: use helper macro SZ_1{K,M}
mm: cleanup is_highmem()
mm/hmm: add a test for cross device private faults
selftests: add soft-dirty into run_vmtests.sh
selftests: soft-dirty: add test for mprotect
mm/mprotect: fix soft-dirty check in can_change_pte_writable()
mm: memcontrol: fix potential oom_lock recursion deadlock
mm/gup.c: fix formatting in check_and_migrate_movable_page()
xfs: fail dax mount if reflink is enabled on a partition
mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
userfaultfd: don't fail on unrecognized features
hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
...
erofs_readdir update ctx->pos after filling a batch of dentries
and it may cause dir/files duplication for NFS readdirplus which
depends on ctx->pos to fill dir correctly. So update ctx->pos for
every emitted dirent in erofs_fill_dentries to fix it.
Also fix the update of ctx->pos when the initial file position has
exceeded nameoff.
Fixes: 3e917cc305 ("erofs: make filesystem exportable")
Signed-off-by: Hongnan Li <hongnan.li@linux.alibaba.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220722082732.30935-1-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Let's introduce multi-reference pclusters at runtime. In details,
if one pcluster is requested by multiple extents at almost the same
time (even belong to different files), the longest extent will be
decompressed as representative and the other extents are actually
copied from the longest one in one round.
After this patch, fully-referenced extents can be correctly handled
and the full decoding check needs to be bypassed for
partial-referenced extents.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-17-hsiangkao@linux.alibaba.com
Currently, `pcl->length' records the longest decompressed length
as long as the pcluster itself isn't reclaimed. However, such
number is unneeded for the general cases since it doesn't indicate
the exact decompressed size in this round.
Instead, let's record the decompressed size for this round instead,
thus `pcl->nr_pages' can be completely dropped and pageofs_out is
also designed to be kept in sync with `pcl->length'.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-16-hsiangkao@linux.alibaba.com
For the most cases, small pclusters can be decompressed with page
arrays on stack.
Try to leave both (de)compressed_pages on stack if possible as before.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-14-hsiangkao@linux.alibaba.com
Let's introduce struct z_erofs_decompress_backend in order to pass
on the decompression backend context between helper functions more
easier and avoid too many arguments.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-13-hsiangkao@linux.alibaba.com
In order to introduce multi-reference pclusters for compressed data
deduplication, let's get rid of the global page array for now since
it needs to be re-designed then at least.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-12-hsiangkao@linux.alibaba.com
`enum z_erofs_collectmode' is really ambiguous, but I'm not quite
sure if there are better naming, basically it's used to judge whether
inplace I/O can be used due to the current status of pclusters in
the chain.
Rename it as `enum z_erofs_pclustermode' instead.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-11-hsiangkao@linux.alibaba.com
Since all decompressed offsets have been integrated to bvecs[], this
patch avoids all sub-indexes so that page->private only includes a
part count and an eio flag, thus in the future folio->private can have
the same meaning.
In addition, PG_error will not be used anymore after this patch and
we're heading to use page->private (later folio->private) and
page->mapping (later folio->mapping) only.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-9-hsiangkao@linux.alibaba.com
Convert compressed_pages[] to bufvec in order to avoid using
page->private to keep onlinepage_index (decompressed offset)
for inplace I/O pages.
In the future, we only rely on folio->private to keep a countdown
to unlock folios and set folio_uptodate.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-8-hsiangkao@linux.alibaba.com
`z_erofs_decompress_pcluster()' is too long therefore it'd be better
to introduce another helper to parse compressed pages (or laterly,
compressed bvecs.)
BTW, since `compressed_bvecs' is too long as a part of the function
name, `in_bvecs' is used here instead.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-7-hsiangkao@linux.alibaba.com
Remove the old pagevec approach but keep z_erofs_page_type for now.
It will be reworked in the following commits as well.
Also rename Z_EROFS_NR_INLINE_PAGEVECS as Z_EROFS_INLINE_BVECS with
the new value 2 since it's actually enough to bootstrap.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-6-hsiangkao@linux.alibaba.com
For each pcluster, the total compressed buffers are determined in
advance, yet the number of decompressed buffers actually vary. Too
many decompressed pages can be recorded if one pcluster is highly
compressed or its pcluster size is large. That takes extra memory
footprints compared to uncompressed filesystems, especially a lot of
I/O in flight on low-ended devices.
Therefore, similar to inplace I/O, pagevec was introduced to reuse
page cache to store these pointers in the time-sharing way since
these pages are actually unused before decompressing.
In order to make it more flexable, a cleaner bufvec is used to
replace the old pagevec stuffs so that
- Decompressed offsets can be stored inline, thus it can be used
for the upcoming feature like compressed data deduplication.
It's calculated by `page_offset(page) - map->m_la';
- Towards supporting large folios for compressed inodes since
our final goal is to completely avoid page->private but use
folio->private only for all page cache pages.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-5-hsiangkao@linux.alibaba.com
`z_erofs_decompress_pcluster()' is too long therefore it'd be better
to introduce another helper to parse decompressed pages (or laterly,
decompressed bvecs.)
BTW, since `decompressed_bvecs' is too long as a part of the function
name, `out_bvecs' is used instead.
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-4-hsiangkao@linux.alibaba.com
Since commit 5c6dcc57e2 ("erofs: get rid of
`struct z_erofs_collector'"), these arguments can be dropped as well.
No logic changes.
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-2-hsiangkao@linux.alibaba.com
Patch series "v14 fsdax-rmap + v11 fsdax-reflink", v2.
The patchset fsdax-rmap is aimed to support shared pages tracking for
fsdax.
It moves owner tracking from dax_assocaite_entry() to pmem device driver,
by introducing an interface ->memory_failure() for struct pagemap. This
interface is called by memory_failure() in mm, and implemented by pmem
device.
Then call holder operations to find the filesystem which the corrupted
data located in, and call filesystem handler to track files or metadata
associated with this page.
Finally we are able to try to fix the corrupted data in filesystem and do
other necessary processing, such as killing processes who are using the
files affected.
The call trace is like this:
memory_failure()
|* fsdax case
|------------
|pgmap->ops->memory_failure() => pmem_pgmap_memory_failure()
| dax_holder_notify_failure() =>
| dax_device->holder_ops->notify_failure() =>
| - xfs_dax_notify_failure()
| |* xfs_dax_notify_failure()
| |--------------------------
| | xfs_rmap_query_range()
| | xfs_dax_failure_fn()
| | * corrupted on metadata
| | try to recover data, call xfs_force_shutdown()
| | * corrupted on file data
| | try to recover data, call mf_dax_kill_procs()
|* normal case
|-------------
|mf_generic_kill_procs()
The patchset fsdax-reflink attempts to add CoW support for fsdax, and
takes XFS, which has both reflink and fsdax features, as an example.
One of the key mechanisms needed to be implemented in fsdax is CoW. Copy
the data from srcmap before we actually write data to the destination
iomap. And we just copy range in which data won't be changed.
Another mechanism is range comparison. In page cache case, readpage() is
used to load data on disk to page cache in order to be able to compare
data. In fsdax case, readpage() does not work. So, we need another
compare data with direct access support.
With the two mechanisms implemented in fsdax, we are able to make reflink
and fsdax work together in XFS.
This patch (of 14):
To easily track filesystem from a pmem device, we introduce a holder for
dax_device structure, and also its operation. This holder is used to
remember who is using this dax_device:
- When it is the backend of a filesystem, the holder will be the
instance of this filesystem.
- When this pmem device is one of the targets in a mapped device, the
holder will be this mapped device. In this case, the mapped device
has its own dax_device and it will follow the first rule. So that we
can finally track to the filesystem we needed.
The holder and holder_ops will be set when filesystem is being mounted,
or an target device is being activated.
Link: https://lkml.kernel.org/r/20220603053738.1218681-1-ruansy.fnst@fujitsu.com
Link: https://lkml.kernel.org/r/20220603053738.1218681-2-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.wiliams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, vmap()s are avoided if physical addresses are
consecutive for decompressed buffers.
I observed that is very common for 4KiB pclusters since the
numbers of decompressed pages are almost 2 or 3.
However, such detection doesn't work for Highmem pages on
32-bit machines, let's fix it now.
Reported-by: Liu Jinbao <liujinbao1@xiaomi.com>
Fixes: 7fc45dbc93 ("staging: erofs: introduce generic decompression backend")
Link: https://lore.kernel.org/r/20220708101001.21242-1-hsiangkao@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
When the user mounts the erofs second times, the decompression thread
may hung. The problem happens due to a sequence of steps like the
following:
1) Task A called z_erofs_load_lzma_config which obtain all of the node
from the z_erofs_lzma_head.
2) At this time, task B called the z_erofs_lzma_decompress and wanted to
get a node. But the z_erofs_lzma_head was empty, the Task B had to
sleep.
3) Task A release nodes and push nodes into the z_erofs_lzma_head. But
task B was still sleeping.
One example report when the hung happens:
task:kworker/u3:1 state:D stack:14384 pid: 86 ppid: 2 flags:0x00004000
Workqueue: erofs_unzipd z_erofs_decompressqueue_work
Call Trace:
<TASK>
__schedule+0x281/0x760
schedule+0x49/0xb0
z_erofs_lzma_decompress+0x4bc/0x580
? cpu_core_flags+0x10/0x10
z_erofs_decompress_pcluster+0x49b/0xba0
? __update_load_avg_se+0x2b0/0x330
? __update_load_avg_se+0x2b0/0x330
? update_load_avg+0x5f/0x690
? update_load_avg+0x5f/0x690
? set_next_entity+0xbd/0x110
? _raw_spin_unlock+0xd/0x20
z_erofs_decompress_queue.isra.0+0x2e/0x50
z_erofs_decompressqueue_work+0x30/0x60
process_one_work+0x1d3/0x3a0
worker_thread+0x45/0x3a0
? process_one_work+0x3a0/0x3a0
kthread+0xe2/0x110
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x22/0x30
</TASK>
Signed-off-by: Yuwen Chen <chenyuwen1@meizu.com>
Fixes: 622ceaddb7 ("erofs: lzma compression support")
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220626224041.4288-1-chenyuwen1@meizu.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Currently shrinkers are anonymous objects. For debugging purposes they
can be identified by count/scan function names, but it's not always
useful: e.g. for superblock's shrinkers it's nice to have at least an
idea of to which superblock the shrinker belongs.
This commit adds names to shrinkers. register_shrinker() and
prealloc_shrinker() functions are extended to take a format and arguments
to master a name.
In some cases it's not possible to determine a good name at the time when
a shrinker is allocated. For such cases shrinker_debugfs_rename() is
provided.
The expected format is:
<subsystem>-<shrinker_type>[:<instance>]-<id>
For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair.
After this change the shrinker debugfs directory looks like:
$ cd /sys/kernel/debug/shrinker/
$ ls
dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42
mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43
mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44
rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49
sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13
sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36
sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19
sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10
sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9
sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37
sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38
sb-dax-11 sb-proc-45 sb-tmpfs-35
sb-debugfs-7 sb-proc-46 sb-tmpfs-40
[roman.gushchin@linux.dev: fix build warnings]
Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- Leave compressed inodes unsupported in fscache mode for now;
- Avoid crash when using tracepoint cachefiles_prep_read;
- Fix `backmost' behavior due to a recent cleanup;
- Update documentation for better description of recent new features;
- Several decompression cleanups w/o logical change.
-----BEGIN PGP SIGNATURE-----
iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCYpeFXxEceGlhbmdAa2Vy
bmVsLm9yZwAKCRA5NzHcH7XmBC9eAQC8YSePEG+YCGbmOCGadSuBsgU+OXzKGpCV
KxPyy3SmPQEAyNCDk11HoaYDRywS8TbMPntlyRfXvtEGSxbRe+5d1Qc=
=4RnO
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-5.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull more erofs updates from Gao Xiang:
"This is a follow-up to the main updates, including some fixes of
fscache mode related to compressed inodes and a cachefiles tracepoint.
There is also a patch to fix an unexpected decompression strategy
change due to a cleanup in the past. All the fixes are quite small.
Apart from these, documentation is also updated for a better
description of recent new features.
In addition, this has some trivial cleanups without actual code logic
changes, so I could have a more recent codebase to work on folios and
avoiding the PG_error page flag for the next cycle.
Summary:
- Leave compressed inodes unsupported in fscache mode for now
- Avoid crash when using tracepoint cachefiles_prep_read
- Fix `backmost' behavior due to a recent cleanup
- Update documentation for better description of recent new features
- Several decompression cleanups w/o logical change"
* tag 'erofs-for-5.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix 'backmost' member of z_erofs_decompress_frontend
erofs: simplify z_erofs_pcluster_readmore()
erofs: get rid of label `restart_now'
erofs: get rid of `struct z_erofs_collection'
erofs: update documentation
erofs: fix crash when enable tracepoint cachefiles_prep_read
erofs: leave compressed inodes unsupported in fscache mode for now
It was incompletely introduced for deduplication between different
logical extents backed with the same pcluster.
We will have a better in-memory representation in the next release
cycle for this, as well as partial memory folios support. So get rid
of it instead.
No logic changes.
Link: https://lore.kernel.org/r/20220529055425.226363-2-xiang@kernel.org
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
erofs over fscache doesn't support the compressed layout yet. It will
cause NULL crash if there are compressed inodes contained when working
in fscache mode.
So far in the erofs based container image distribution scenarios
(RAFS v6), the compressed RAFS v6 images are downloaded and then
decompressed on demand as an uncompressed erofs image. Then the erofs
image is mounted in fscache mode for containers to use. IOWs, currently
compressed data is decompressed on the userspace side instead and
uncompressed erofs images will be finally cached.
The fscache support for the compressed layout is still under
development and it will be used for runtime decompression feature.
Anyway, to avoid the potential crash, let's leave the compressed inodes
unsupported in fscache mode until we support it later.
Fixes: 1442b02b66 ("erofs: implement fscache-based data read for non-inline layout")
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220526010344.118493-1-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
- Appoint myself page cache maintainer
- Fix how scsicam uses the page cache
- Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS
- Remove the AOP flags entirely
- Remove pagecache_write_begin() and pagecache_write_end()
- Documentation updates
- Convert several address_space operations to use folios:
- is_dirty_writeback
- readpage becomes read_folio
- releasepage becomes release_folio
- freepage becomes free_folio
- Change filler_t to require a struct file pointer be the first argument
like ->read_folio
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmKNMDUACgkQDpNsjXcp
gj4/mwf/bpHhXH4ZoNIvtUpTF6rZbqeffmc0VrbxCZDZ6igRnRPglxZ9H9v6L53O
7B0FBQIfxgNKHZpdqGdOkv8cjg/GMe/HJUbEy5wOakYPo4L9fZpHbDZ9HM2Eankj
xBqLIBgBJ7doKr+Y62DAN19TVD8jfRfVtli5mqXJoNKf65J7BkxljoTH1L3EXD9d
nhLAgyQjR67JQrT/39KMW+17GqLhGefLQ4YnAMONtB6TVwX/lZmigKpzVaCi4r26
bnk5vaR/3PdjtNxIoYvxdc71y2Eg05n2jEq9Wcy1AaDv/5vbyZUlZ2aBSaIVbtKX
WfrhN9O3L0bU5qS7p9PoyfLc9wpq8A==
=djLv
-----END PGP SIGNATURE-----
Merge tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache
Pull page cache updates from Matthew Wilcox:
- Appoint myself page cache maintainer
- Fix how scsicam uses the page cache
- Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS
- Remove the AOP flags entirely
- Remove pagecache_write_begin() and pagecache_write_end()
- Documentation updates
- Convert several address_space operations to use folios:
- is_dirty_writeback
- readpage becomes read_folio
- releasepage becomes release_folio
- freepage becomes free_folio
- Change filler_t to require a struct file pointer be the first
argument like ->read_folio
* tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache: (107 commits)
nilfs2: Fix some kernel-doc comments
Appoint myself page cache maintainer
fs: Remove aops->freepage
secretmem: Convert to free_folio
nfs: Convert to free_folio
orangefs: Convert to free_folio
fs: Add free_folio address space operation
fs: Convert drop_buffers() to use a folio
fs: Change try_to_free_buffers() to take a folio
jbd2: Convert release_buffer_page() to use a folio
jbd2: Convert jbd2_journal_try_to_free_buffers to take a folio
reiserfs: Convert release_buffer_page() to use a folio
fs: Remove last vestiges of releasepage
ubifs: Convert to release_folio
reiserfs: Convert to release_folio
orangefs: Convert to release_folio
ocfs2: Convert to release_folio
nilfs2: Remove comment about releasepage
nfs: Convert to release_folio
jfs: Convert to release_folio
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmKLxJAACgkQxWXV+ddt
WDvC4BAAnSNwZ15FJKe5Y423f6PS6EXjyMuc5t/fW6UumTTbI+tsS+Glkis+JNBf
BiDZSlVQmiK9WoQSJe04epZgHaK8MaCARyZaRaxjDC4Nvfq4DlD9mbAU9D6e7tZY
Mo8M99D8wDW+SB+P8RBpNjwB/oGCMmE3nKC83g+1ObmA0FVRCyQ1Kazf8RzNT1rZ
DiaJoKTvU1/wDN3/1rw5yG+EfW2m9A14gRCihslhFYaDV7jhpuabl8wLT7MftZtE
MtJ6EOOQbgIDjnp5BEIrPmowW/N0tKDT/gorF7cWgLG2R1cbSlKgqSH1Sq7CjFUE
AKj/DwfqZArPLpqMThWklCwy2B9qDEezrQSy7renP/vkeFLbOp8hQuIY5KRzohdG
oDI8ThlQGtCVjbny6NX/BbCnWRAfTz0TquCgag3Xl8NbkRFgFJtkf/cSxzb+3LW1
tFeiUyTVLXVDS1cZLwgcb29Rrtp4bjd5/v3uECQlVD+or5pcAqSMkQgOBlyQJGbE
Xb0nmPRihzQ8D4vINa63WwRyq0+QczVjvBxKj1daas0VEKGd32PIBS/0Qha+EpGl
uFMiHBMSfqyl8QcShFk0cCbcgPMcNc7I6IAbXCE/WhhFG0ytqm9vpmlLqsTrXmHH
z7/Eye/waqgACNEXoA8C4pyYzduQ4i1CeLDOdcsvBU6XQSuicSM=
=lv6P
-----END PGP SIGNATURE-----
Merge tag 'for-5.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Features:
- subpage:
- support for PAGE_SIZE > 4K (previously only 64K)
- make it work with raid56
- repair super block num_devices automatically if it does not match
the number of device items
- defrag can convert inline extents to regular extents, up to now
inline files were skipped but the setting of mount option
max_inline could affect the decision logic
- zoned:
- minimal accepted zone size is explicitly set to 4MiB
- make zone reclaim less aggressive and don't reclaim if there are
enough free zones
- add per-profile sysfs tunable of the reclaim threshold
- allow automatic block group reclaim for non-zoned filesystems, with
sysfs tunables
- tree-checker: new check, compare extent buffer owner against owner
rootid
Performance:
- avoid blocking on space reservation when doing nowait direct io
writes (+7% throughput for reads and writes)
- NOCOW write throughput improvement due to refined locking (+3%)
- send: reduce pressure to page cache by dropping extent pages right
after they're processed
Core:
- convert all radix trees to xarray
- add iterators for b-tree node items
- support printk message index
- user bulk page allocation for extent buffers
- switch to bio_alloc API, use on-stack bios where convenient, other
bio cleanups
- use rw lock for block groups to favor concurrent reads
- simplify workques, don't allocate high priority threads for all
normal queues as we need only one
- refactor scrub, process chunks based on their constraints and
similarity
- allocate direct io structures on stack and pass around only
pointers, avoids allocation and reduces potential error handling
Fixes:
- fix count of reserved transaction items for various inode
operations
- fix deadlock between concurrent dio writes when low on free data
space
- fix a few cases when zones need to be finished
VFS, iomap:
- add helper to check if sb write has started (usable for assertions)
- new helper iomap_dio_alloc_bio, export iomap_dio_bio_end_io"
* tag 'for-5.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (173 commits)
btrfs: zoned: introduce a minimal zone size 4M and reject mount
btrfs: allow defrag to convert inline extents to regular extents
btrfs: add "0x" prefix for unsupported optional features
btrfs: do not account twice for inode ref when reserving metadata units
btrfs: zoned: fix comparison of alloc_offset vs meta_write_pointer
btrfs: send: avoid trashing the page cache
btrfs: send: keep the current inode open while processing it
btrfs: allocate the btrfs_dio_private as part of the iomap dio bio
btrfs: move struct btrfs_dio_private to inode.c
btrfs: remove the disk_bytenr in struct btrfs_dio_private
btrfs: allocate dio_data on stack
iomap: add per-iomap_iter private data
iomap: allow the file system to provide a bio_set for direct I/O
btrfs: add a btrfs_dio_rw wrapper
btrfs: zoned: zone finish unused block group
btrfs: zoned: properly finish block group on metadata write
btrfs: zoned: finish block group when there are no more allocatable bytes left
btrfs: zoned: consolidate zone finish functions
btrfs: zoned: introduce btrfs_zoned_bg_is_full
btrfs: improve error reporting in lookup_inline_extent_backref
...
When "-o device" mount option is not specified, scan the device table
and instantiate the devices if there's any in the device table. In this
case, the tag field of each device slot uniquely specifies a device.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220512055601.106109-1-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Use asynchronous io to read data from fscache may greatly improve IO
bandwidth for sequential buffered read scenario.
Change erofs_fscache_read_folios to erofs_fscache_read_folios_async,
and read data from fscache asynchronously.
Make .readpage()/.readahead() to use this new helper.
Signed-off-by: Xin Yin <yinxin.x@bytedance.com>
Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220509074028.74954-23-jefflexu@linux.alibaba.com
Acked-by: Chao Yu <chao@kernel.org>
[ Gao Xiang: minor styling changes. ]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Introduce 'fsid' mount option to enable on-demand read sementics, in
which case, erofs will be mounted from data blobs. Users could specify
the name of primary data blob by this mount option.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220425122143.56815-22-jefflexu@linux.alibaba.com
Acked-by: Chao Yu <chao@kernel.org>
Tested-by: Zichen Tian <tianzichen@kuaishou.com>
Tested-by: Jia Zhu <zhujia.zj@bytedance.com>
Tested-by: Yan Song <yansong.ys@antgroup.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Implement the data plane of reading data from data blobs over fscache
for inline layout.
For the heading non-inline part, the data plane for non-inline layout is
reused, while only the tail packing part needs special handling.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220425122143.56815-20-jefflexu@linux.alibaba.com
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Implement the data plane of reading data from data blobs over fscache
for non-inline layout.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220425122143.56815-19-jefflexu@linux.alibaba.com
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Similar to the multi-device mode, erofs could be mounted from one
primary data blob (mandatory) and multiple extra data blobs (optional).
Register fscache context for each extra data blob.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220425122143.56815-17-jefflexu@linux.alibaba.com
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Registers fscache context for primary data blob. Also move the
initialization of s_op and related fields forward, since anonymous
inode will be allocated under the super block when registering the
fscache context.
Something worth mentioning about the cleanup routine.
1. The fscache context will instantiate anonymous inodes under the super
block. Release these anonymous inodes when .put_super() is called, or
we'll get "VFS: Busy inodes after unmount." warning.
2. The fscache context is initialized prior to the root inode. If
.kill_sb() is called when mount failed, .put_super() won't be called
when root inode has not been initialized yet. Thus .kill_sb() shall
also contain the cleanup routine.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220425122143.56815-16-jefflexu@linux.alibaba.com
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Add erofs_fscache_read_folios() helper reading from fscache. It supports
on-demand read semantics. That is, it will make the backend prepare for
the data when cache miss. Once data ready, it will read from the cache.
This helper can then be used to implement .readpage()/.readahead() of
on-demand read semantics.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220425122143.56815-15-jefflexu@linux.alibaba.com
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Introduce one anonymous inode for data blobs so that erofs can cache
metadata directly within such anonymous inode.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220425122143.56815-14-jefflexu@linux.alibaba.com
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Introduce a context structure for managing data blobs, and helper
functions for initializing and cleaning up this context structure.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220425122143.56815-13-jefflexu@linux.alibaba.com
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
A new fscache based mode is going to be introduced for erofs, in which
case on-demand read semantics is implemented through fscache.
As the first step, register fscache volume for each erofs filesystem.
That means, data blobs can not be shared among erofs filesystems. In the
following iteration, we are going to introduce the domain semantics, in
which case several erofs filesystems can belong to one domain, and data
blobs can be shared among these erofs filesystems of one domain.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220425122143.56815-12-jefflexu@linux.alibaba.com
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Until then erofs is exactly blockdev based filesystem.
A new fscache-based mode is going to be introduced for erofs to support
scenarios where on-demand read semantics is needed, e.g. container
image distribution. In this case, erofs could be mounted from data blobs
through fscache.
Add a helper checking which mode erofs works in, and twist the code in
preparation for the upcoming fscache mode.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220425122143.56815-11-jefflexu@linux.alibaba.com
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
... so that it can be used in the following introduced fscache mode.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220425122143.56815-10-jefflexu@linux.alibaba.com
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
This patch enables idmapped mounts for erofs, since all dedicated helpers
for this functionality existsm, so, in this patch we just pass down the
user_namespace argument from the VFS methods to the relevant helpers.
Simple idmap example on erofs image:
1. mkdir dir
2. touch dir/file
3. mkfs.erofs erofs.img dir
4. mount -t erofs -o loop erofs.img /mnt/erofs/
5. ls -ln /mnt/erofs/
total 0
-rw-rw-r-- 1 1000 1000 0 May 17 15:26 file
6. mount-idmapped --map-mount b:1000:1001:1 /mnt/erofs/ /mnt/scratch_erofs/
7. ls -ln /mnt/scratch_erofs/
total 0
-rw-rw-r-- 1 1001 1001 0 May 17 15:26 file
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Link: https://lore.kernel.org/r/20220517104103.3570721-1-chao@kernel.org
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Implement export operations in order to make EROFS support accessing
inodes with filehandles so that it can be exported via NFS and used
by overlayfs.
Without this patch, 'exportfs -rv' will report:
exportfs: /root/erofs_mp does not support NFS export
Also tested with unionmount-testsuite and the testcase below passes now:
./run --ov --erofs --verify hard-link
For more details about the testcase, see:
https://github.com/amir73il/unionmount-testsuite/pull/6
Signed-off-by: Hongnan Li <hongnan.li@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220425040712.91685-1-hongnan.li@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
I got some KASAN report as below:
[ 46.959738] ==================================================================
[ 46.960430] BUG: KASAN: use-after-free in z_erofs_shifted_transform+0x2bd/0x370
[ 46.960430] Read of size 4074 at addr ffff8880300c2f8e by task fssum/188
...
[ 46.960430] Call Trace:
[ 46.960430] <TASK>
[ 46.960430] dump_stack_lvl+0x41/0x5e
[ 46.960430] print_report.cold+0xb2/0x6b7
[ 46.960430] ? z_erofs_shifted_transform+0x2bd/0x370
[ 46.960430] kasan_report+0x8a/0x140
[ 46.960430] ? z_erofs_shifted_transform+0x2bd/0x370
[ 46.960430] kasan_check_range+0x14d/0x1d0
[ 46.960430] memcpy+0x20/0x60
[ 46.960430] z_erofs_shifted_transform+0x2bd/0x370
[ 46.960430] z_erofs_decompress_pcluster+0xaae/0x1080
The root cause is that the tail pcluster won't be a complete filesystem
block anymore. So if ztailpacking is used, the second part of an
uncompressed tail pcluster may not be ``rq->pageofs_out``.
Fixes: ab749badf9 ("erofs: support unaligned data decompression")
Fixes: cecf864d3d ("erofs: support inline data decompression")
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220512115833.24175-1-hsiangkao@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Some comments haven't been useful anymore since the code updated.
Let's drop them instead.
Link: https://lore.kernel.org/r/20220506194612.117120-2-hsiangkao@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
The big pcluster feature has been merged for a year, it has been mostly
stable now.
Signed-off-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220407050505.12683-1-huyue2@coolpad.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Allow the file system to keep state for all iterations. For now only
wire it up for direct I/O as there is an immediate need for it there.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use a folio in erofs_managed_cache_release_folio(), but use of folios
should be pushed into erofs_try_to_free_cached_page().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
This is a "weak" conversion which converts straight back to using pages.
A full conversion should be performed at some point, hopefully by
someone familiar with the filesystem.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Primarily this series converts some of the address_space operations
to take a folio instead of a page.
->is_partially_uptodate() takes a folio instead of a page and changes the
type of the 'from' and 'count' arguments to make it obvious they're bytes.
->invalidatepage() becomes ->invalidate_folio() and has a similar type change.
->launder_page() becomes ->launder_folio()
->set_page_dirty() becomes ->dirty_folio() and adds the address_space as
an argument.
There are a couple of other misc changes up front that weren't worth
separating into their own pull request.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmI4hqMACgkQDpNsjXcp
gj7r7Af/fVJ7m8kKqjP/IayX3HiJRuIDQw+vM++BlRNXdjz+IyED6whdmFGxJeOY
BMyT+8ApOAz7ErS4G+7fAv4ScJK/aEgFUsnSeAiCp0PliiEJ5NNJzElp6sVmQ7H5
SX7+Ek444FZUGsQuy0qL7/ELpR3ditnD7x+5U2g0p5TeaHGUQn84crRyfR4xuhNG
EBD9D71BOb7OxUcOHe93pTkK51QsQ0aCrcIsB1tkK5KR0BAthn1HqF7ehL90Rvrr
omx5M7aDWGY4oj7IKrhlAs+55Ah2WaOzrZBp0FXNbr4UENDBKWKyUxErwa4xPkf6
Gm1iQG/CspOHnxN3YWsd5WjtlL3A+A==
=cOiq
-----END PGP SIGNATURE-----
Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache
Pull filesystem folio updates from Matthew Wilcox:
"Primarily this series converts some of the address_space operations to
take a folio instead of a page.
Notably:
- a_ops->is_partially_uptodate() takes a folio instead of a page and
changes the type of the 'from' and 'count' arguments to make it
obvious they're bytes.
- a_ops->invalidatepage() becomes ->invalidate_folio() and has a
similar type change.
- a_ops->launder_page() becomes ->launder_folio()
- a_ops->set_page_dirty() becomes ->dirty_folio() and adds the
address_space as an argument.
There are a couple of other misc changes up front that weren't worth
separating into their own pull request"
* tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits)
fs: Remove aops ->set_page_dirty
fb_defio: Use noop_dirty_folio()
fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio
fs: Convert __set_page_dirty_buffers to block_dirty_folio
nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio()
mm: Convert swap_set_page_dirty() to swap_dirty_folio()
ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio
f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio
f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio
f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio
afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio()
btrfs: Convert extent_range_redirty_for_io() to use folios
fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio
btrfs: Convert from set_page_dirty to dirty_folio
fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio()
fs: Add aops->dirty_folio
fs: Remove aops->launder_page
orangefs: Convert launder_page to launder_folio
nfs: Convert from launder_page to launder_folio
fuse: Convert from launder_page to launder_folio
...
The inode allocation is supposed to use alloc_inode_sb(), so convert
kmem_cache_alloc() of all filesystems to alloc_inode_sb().
Link: https://lkml.kernel.org/r/20220228122126.37293-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Theodore Ts'o <tytso@mit.edu> [ext4]
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Alex Shi <alexs@kernel.org>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kari Argillander <kari.argillander@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Avoid using page structure directly for all uncompressed paths;
- Fix a double-free issue when sysfs initialization fails;
- Complete DAX description for erofs;
- Use mtime instead since there's no (easy) way for users to control
ctime;
- Several code cleanups.
-----BEGIN PGP SIGNATURE-----
iJIEABYIADoWIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCYjfdHxwcaHNpYW5na2Fv
QGxpbnV4LmFsaWJhYmEuY29tAAoJEDk3MdwfteYEvbYBANAWd+wSFLS3XDEzM3Nw
VzdMW7lfnrqog8HgqbcRHm9OAP9zx3KZaCh/frUA1OCKk/H0KD6UFShu6fXgBri+
DJMnCw==
=a5Ns
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"In this cycle, we continue converting to use meta buffers for all
remaining uncompressed paths to prepare for the upcoming subpage,
folio and fscache features.
We also fixed a double-free issue when sysfs initialization fails,
which was reported by syzbot.
Besides, in order for the userspace to control per-file timestamp
easier, we now switch to record mtime instead of ctime with a
compatible feature marked. And there are also some code cleanups and
documentation update as usual.
Summary:
- Avoid using page structure directly for all uncompressed paths
- Fix a double-free issue when sysfs initialization fails
- Complete DAX description for erofs
- Use mtime instead since there's no (easy) way for users to control
ctime
- Several code cleanups"
* tag 'erofs-for-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: rename ctime to mtime
erofs: use meta buffers for inode lookup
erofs: use meta buffers for reading directories
fs: erofs: add sanity check for kobject in erofs_unregister_sysfs
erofs: refine managed inode stuffs
erofs: clean up z_erofs_extent_lookback
erofs: silence warnings related to impossible m_plen
Documentation/filesystem/dax: update DAX description on erofs
erofs: clean up preload_compressed_pages()
erofs: get rid of `struct z_erofs_collector'
erofs: use meta buffers for erofs_read_superblock()
EROFS images should inherit modification time rather than change time,
since users and host tooling have no easy way to control change time.
To reflect the new timestamp meaning, i_ctime and i_ctime_nsec are
renamed to i_mtime and i_mtime_nsec.
Link: https://lore.kernel.org/r/20220311041829.3109511-1-dvander@google.com # v1
Signed-off-by: David Anderson <dvander@google.com>
[ Gao Xiang: update document as well. ]
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220317114959.106787-1-hsiangkao@linux.alibaba.com # v2
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
This converts the remaining inode lookup part by using metabuf in a
straight-forward way. Except that it doesn't use kmap_atomic()
anymore since we now have to maintain two metabufs together.
After this patch, all uncompressed paths are handled with metabuf
instead of page structure.
Link: https://lore.kernel.org/r/20220316012246.95131-2-hsiangkao@linux.alibaba.com
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Previously, directory inodes are directly handled with page cache
interfaces.
In order to support sub-page directory blocks and folios, let's
convert them into the latest metabuf infrastructure as well and
this patch addresses the readdir case first.
Link: https://lore.kernel.org/r/20220316012246.95131-1-hsiangkao@linux.alibaba.com
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Syzkaller hit 'WARNING: kobject bug in erofs_unregister_sysfs'. This bug
is triggered by injecting fault in kobject_init_and_add of
erofs_unregister_sysfs.
Fix this by adding sanity check for kobject in erofs_unregister_sysfs
Note that I've tested the patch and the crash does not occur any more.
Link: https://lore.kernel.org/r/20220315132814.12332-1-dzm91@hust.edu.cn
Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com>
Fixes: 168e9a7620 ("erofs: add sysfs interface")
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Avoid the unnecessary tail recursion since it can be converted into
a loop directly in order to prevent potential stack overflow.
It's a pretty straightforward conversion.
Link: https://lore.kernel.org/r/20220310182743.102365-1-hsiangkao@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Dan reported two smatch warnings [1],
.. warn: should '1 << lclusterbits' be a 64 bit type?
.. warn: should 'm->compressedlcs << lclusterbits' be a 64 bit type?
In practice, m_plen cannot be more than 1MiB due to on-disk constraint
for the compression mode, so we're always safe here.
In order to make static analyzers happy and not report again, let's
silence them instead.
[1] https://lore.kernel.org/r/202203091002.lJVzsX6e-lkp@intel.com
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220310173448.19962-1-hsiangkao@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Rename preload_compressed_pages() as z_erofs_bind_cache()
since we're trying to prepare for adapting folios.
Also, add a comment for the gfp setting. No logic changes.
Link: https://lore.kernel.org/r/20220301194951.106227-2-hsiangkao@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Avoid `struct z_erofs_collector' since there is another context
structure called "struct z_erofs_decompress_frontend".
No logic changes.
Link: https://lore.kernel.org/r/20220301194951.106227-1-hsiangkao@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
The only change is that, meta buffers read cache page without __GFP_FS
flag, which shall not matter.
Link: https://lore.kernel.org/r/20220209060108.43051-7-jefflexu@linux.alibaba.com
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
A straightforward conversion.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs