This reverts commit 9dfbdafda3.
Because it will introduce a defect that sync_thread can be running while
MD_RECOVERY_RUNNING is cleared, which will cause some unexpected problems,
for example:
list_add corruption. prev->next should be next (ffff0001ac1daba0), but was ffff0000ce1a02a0. (prev=ffff0000ce1a02a0).
Call trace:
__list_add_valid+0xfc/0x140
insert_work+0x78/0x1a0
__queue_work+0x500/0xcf4
queue_work_on+0xe8/0x12c
md_check_recovery+0xa34/0xf30
raid10d+0xb8/0x900 [raid10]
md_thread+0x16c/0x2cc
kthread+0x1a4/0x1ec
ret_from_fork+0x10/0x18
This is because work is requeued while it's still inside workqueue:
t1: t2:
action_store
mddev_lock
if (mddev->sync_thread)
mddev_unlock
md_unregister_thread
// first sync_thread is done
md_check_recovery
mddev_try_lock
/*
* once MD_RECOVERY_DONE is set, new sync_thread
* can start.
*/
set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
INIT_WORK(&mddev->del_work, md_start_sync)
queue_work(md_misc_wq, &mddev->del_work)
test_and_set_bit(WORK_STRUCT_PENDING_BIT, ...)
// set pending bit
insert_work
list_add_tail
mddev_unlock
mddev_lock_nointr
md_reap_sync_thread
// MD_RECOVERY_RUNNING is cleared
mddev_unlock
t3:
// before queued work started from t2
md_check_recovery
// MD_RECOVERY_RUNNING is not set, a new sync_thread can be started
INIT_WORK(&mddev->del_work, md_start_sync)
work->data = 0
// work pending bit is cleared
queue_work(md_misc_wq, &mddev->del_work)
insert_work
list_add_tail
// list is corrupted
The above commit is reverted to fix the problem, the deadlock this
commit tries to fix will be fixed in following patches.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-2-yukuai1@huaweicloud.com
When using the cleaner policy to decommission the cache, there is
never any writeback started from the cache as it is constantly delayed
due to normal I/O keeping the device busy. Meaning @idle=false was
always being passed to clean_target_met()
Fix this by adding a specific 'cleaner' flag that is set when the
cleaner policy is configured. This flag serves to always allow the
cleaner's writeback work to be queued until the cache is
decommissioned (even if the cache isn't idle).
Reported-by: David Jeffery <djeffery@redhat.com>
Fixes: b29d4986d0 ("dm cache: significant rework to leverage dm-bio-prison-v2")
Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
__md_stop_writes() and __md_stop() will modify many fields that are
protected by 'reconfig_mutex', and all the callers will grab
'reconfig_mutex' except for md_stop().
Also, update md_stop() to make certain 'reconfig_mutex' is held using
lockdep_assert_held().
Fixes: 9d09e663d5 ("dm: raid456 basic support")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
There are four equivalent goto tags in raid_ctr(), clean them up to
use just one.
There is no functional change and this is preparation to fix
raid_ctr()'s unprotected md_stop().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
In the error paths 'bad_stripe_cache' and 'bad_check_reshape',
'reconfig_mutex' is still held after raid_ctr() returns.
Fixes: 9dbd1aa3a8 ("dm raid: add reshaping support to the target")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
If the statement "recalc_tags = kvmalloc(recalc_tags_size, GFP_NOIO);"
fails, we call "vfree(recalc_buffer)" and we jump to the label "oom".
If the condition "recalc_sectors >= 1U << ic->sb->log2_sectors_per_block"
is false, we jump to the label "free_ret" and call "vfree(recalc_buffer)"
again, on an already released memory block.
Fix the bug by setting "recalc_buffer = NULL" after freeing it.
Fixes: da8b4fc1f6 ("dm integrity: only allocate recalculate buffer when needed")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSjJ2IQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpsQMEACQiUBw81tXvetYhz3P/4KrrjvUobgqMU0w
jtrxqMgPee9FbqCShpj76c+La5wu23DnlrCXoHZxFQuiQLnsX5xFV66NYVi+W1CN
k5MHP7f2e9V0T7qJ9UoHFRV1k22LF4X6T8njEZimxsm/uXfpav/knkhI7nUDnB1K
wxlu9akD2Bo/X9O2NTS+X6qjoawZ6rDWN15THMXlC45VzJPLmIcs07Ev+mvw21KE
XqasoZrxEO0S8dWxmJgJGqnRIOQptTS5U+0OPBZT8H220Qp/1q0pQHPw6iLXNrkc
w1a2W1Bge012gjJt7gCMkdDnZb76sKiyGuMbFME7DoRbLCQeaOtoSfmg7NoRI2gp
74TCSr7dPWZUVUy5Tmsy0DCv0552vIbnlQ69W6Xwx8YkplM3FPiMpWrQ5JWEHdvv
Zl84mLP6Yyo54JVuk9zi8q/2L0HfyfMDj4UM/mNs8hwmcUSbPO2TKdIWDaq8xPuS
Ed+D+kg6XFux8tLnCSDLNbaD5JE+ak9gTVhNdRa/zFE04o/OeidscKEqRSYTkdXL
2p34qtw5kEQocO4Pa3eUGO6KJCDTR36Rms5p6ZFybL4O2oZYrAbRi1TGDxaG2Hag
GCr2vaFbmz1zbGuMpFhLha5B7HeDLs+PHOn+B1iUNjEr9RC0EOHV7moJKqjxlnCh
4mBkK/Nlyg==
=kSeX
-----END PGP SIGNATURE-----
Merge tag 'block-6.5-2023-07-03' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:
"Mostly items that came in a bit late for the initial pull request,
wanted to make sure they had the appropriate amount of linux-next soak
before going upstream.
Outside of stragglers, just generic fixes for either merge window
items, or longer standing bugs"
* tag 'block-6.5-2023-07-03' of git://git.kernel.dk/linux: (25 commits)
md/raid0: add discard support for the 'original' layout
nvme: disable controller on reset state failure
nvme: sync timeout work on failed reset
nvme: ensure unquiesce on teardown
cdrom/gdrom: Fix build error
nvme: improved uring polling
block: add request polling helper
nvme-mpath: fix I/O failure with EAGAIN when failing over I/O
nvme: host: fix command name spelling
blk-sysfs: add a new attr_group for blk_mq
blk-iocost: move wbt_enable/disable_default() out of spinlock
blk-wbt: cleanup rwb_enabled() and wbt_disabled()
blk-wbt: remove dead code to handle wbt enable/disable with io inflight
blk-wbt: don't create wbt sysfs entry if CONFIG_BLK_WBT is disabled
blk-mq: fix two misuses on RQF_USE_SCHED
blk-throttle: Fix io statistics for cgroup v1
bcache: Fix bcache device claiming
bcache: Alloc holder object before async registration
raid10: avoid spin_lock from fastpath from raid10_unplug()
md: fix 'delete_mutex' deadlock
...
API:
- Add linear akcipher/sig API.
- Add tfm cloning (hmac, cmac).
- Add statesize to crypto_ahash.
Algorithms:
- Allow only odd e and restrict value in FIPS mode for RSA.
- Replace LFSR with SHA3-256 in jitter.
- Add interface for gathering of raw entropy in jitter.
Drivers:
- Fix race on data_avail and actual data in hwrng/virtio.
- Add hash and HMAC support in starfive.
- Add RSA algo support in starfive.
- Add support for PCI device 0x156E in ccp.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmSdECcACgkQxycdCkmx
i6dW3g//a4DR6aaqYF8pU4svAzO56a0Plx3DVHUiJ4ygRB7xOzrQqXjCren6wY2a
LFuetwxebAhIAPsC79vI+3j8VAIlU9cNVqOxBIJHGY7wFO4m1AjqBjlealzqLrth
+nEIeUibqLeRw7imOO4adzSsKuSQgyU5rPtKWfrGqqI3RhuMgfWroCtmJ82jmq5l
uMZgB+aGGkzyXztxubHRPeJ3nOFEzo95SscpJ43lOjMcURRBhEa+20jXDhUGwpI7
9ycFV31AW+tfkIprAcliiIzZuwIbzlCkte6AxjAVsN100T/wh9JS1Y+uf1P0oZ9y
AUQQKyc8/QpSkzHZPTncat5P6zta28r8Q5neCvEEEGGuOE8Oc6kb0Os+RE5ANMU4
2A/zrKGOMIWeEWwXGc51xT3gxyl/Rn5wLw1pW7Lm4d5osGT9jiVXx/g66hKLpagJ
jegI6CqgvUajkRNi7JPVnSAauu0Ay8O6pU37/8gLOXNGVZBqONpRimk9qB05LNSF
QYzM2sgYv1tQEmjnG8jLhF5Z8brnqYTv2TZwBX43W10EDQNqUYUDff9Flean5xCb
+2mxJc81rgtUffnMXyYvQwKLhVKoLpeLR6Ts455S5aP06WAfoyEJyYTA/LHG24GX
H2HdS9g5y/K15k9yygMWaXgAx7O7MjM9gEa2VQakhnByj/eQM0s=
=rOLu
-----END PGP SIGNATURE-----
Merge tag 'v6.5-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
"API:
- Add linear akcipher/sig API
- Add tfm cloning (hmac, cmac)
- Add statesize to crypto_ahash
Algorithms:
- Allow only odd e and restrict value in FIPS mode for RSA
- Replace LFSR with SHA3-256 in jitter
- Add interface for gathering of raw entropy in jitter
Drivers:
- Fix race on data_avail and actual data in hwrng/virtio
- Add hash and HMAC support in starfive
- Add RSA algo support in starfive
- Add support for PCI device 0x156E in ccp"
* tag 'v6.5-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (85 commits)
crypto: akcipher - Do not copy dst if it is NULL
crypto: sig - Fix verify call
crypto: akcipher - Set request tfm on sync path
crypto: sm2 - Provide sm2_compute_z_digest when sm2 is disabled
hwrng: imx-rngc - switch to DEFINE_SIMPLE_DEV_PM_OPS
hwrng: st - keep clock enabled while hwrng is registered
hwrng: st - support compile-testing
hwrng: imx-rngc - fix the timeout for init and self check
KEYS: asymmetric: Use new crypto interface without scatterlists
KEYS: asymmetric: Move sm2 code into x509_public_key
KEYS: Add forward declaration in asymmetric-parser.h
crypto: sig - Add interface for sign/verify
crypto: akcipher - Add sync interface without SG lists
crypto: cipher - On clone do crypto_mod_get()
crypto: api - Add __crypto_alloc_tfmgfp
crypto: api - Remove crypto_init_ops()
crypto: rsa - allow only odd e and restrict value in FIPS mode
crypto: geniv - Split geniv out of AEAD Kconfig option
crypto: algboss - Add missing dependency on RNG2
crypto: starfive - Add RSA algo support
...
We've found that using raid0 with the 'original' layout and discard
enabled with different disk sizes (such that at least two zones are
created) can result in data corruption. This is due to the fact that
the discard handling in 'raid0_handle_discard()' assumes the 'alternate'
layout. We've seen this corruption using ext4 but other filesystems are
likely susceptible as well.
More specifically, while multiple zones are necessary to create the
corruption, the corruption may not occur with multiple zones if they
layout in such a way the layout matches what the 'alternate' layout
would have produced. Thus, not all raid0 devices with the 'original'
layout, different size disks and discard enabled will encounter this
corruption.
The 3.14 kernel inadvertently changed the raid0 disk layout for different
size disks. Thus, running a pre-3.14 kernel and post-3.14 kernel on the
same raid0 array could corrupt data. This lead to the creation of the
'original' layout (to match the pre-3.14 layout) and the 'alternate' layout
(to match the post 3.14 layout) in the 5.4 kernel time frame and an option
to tell the kernel which layout to use (since it couldn't be autodetected).
However, when the 'original' layout was added back to 5.4 discard support
for the 'original' layout was not added leading this issue.
I've been able to reliably reproduce the corruption with the following
test case:
1. create raid0 array with different size disks using original layout
2. mkfs
3. mount -o discard
4. create lots of files
5. remove 1/2 the files
6. fstrim -a (or just the mount point for the raid0 array)
7. umount
8. fsck -fn /dev/md0 (spews all sorts of corruptions)
Let's fix this by adding proper discard support to the 'original' layout.
The fix 'maps' the 'original' layout disks to the order in which they are
read/written such that we can compare the disks in the same way that the
current 'alternate' layout does. A 'disk_shift' field is added to
'struct strip_zone'. This could be computed on the fly in
raid0_handle_discard() but by adding this field, we save some computation
in the discard path.
Note we could also potentially fix this by re-ordering the disks in the
zones that follow the first one, and then always read/writing them using
the 'alternate' layout. However, that is seen as a more substantial change,
and we are attempting the least invasive fix at this time to remedy the
corruption.
I've verified the change using the reproducer mentioned above. Typically,
the corruption is seen after less than 3 iterations, while the patch has
run 500+ iterations.
Cc: NeilBrown <neilb@suse.de>
Cc: Song Liu <song@kernel.org>
Fixes: c84a1372df ("md/raid0: avoid RAID0 data corruption due to layout confusion.")
Cc: stable@vger.kernel.org
Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230623180523.1901230-1-jbaron@akamai.com
- Fix DM crypt target's crypt_ctr_cipher_new return value on invalid
AEAD cipher.
- Fix DM flakey testing target's write bio corruption feature to
corrupt the data of a cloned bio instead of the original.
- Add random_read_corrupt and random_write_corrupt features to DM
flakey target.
- Fix ABBA deadlock in DM thin metadata by resetting associated bufio
client rather than destroying and recreating it.
- A couple other small DM thinp cleanups.
- Update DM core to support disabling block core IO stats accounting
and optimize away code that isn't needed if stats are disabled.
- Other small DM core cleanups.
- Improve DM integrity target to not require so much memory on 32 bit
systems. Also only allocate the recalculate buffer as needed (and
increasingly reduce its size on allocation failure).
- Update DM integrity to use %*ph for printing hexdump of a small
buffer. Also update DM integrity documentation.
- Various DM core ioctl interface hardening. Now more careful about
alignment of structures and processing of input passed to the kernel
from userspace. Also disallow the creation of DM devices named
"control", "." or ".."
- Eliminate GFP_NOIO workarounds for __vmalloc and kvmalloc in DM
core's ioctl and bufio code.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmScyjAACgkQxSPxCi2d
A1pilwgAucNIAB6uN4ke4WZrMVxSFntUkqDTkCs2Ycw+W4Tf1Mtrj/4WeBzFdJaA
oZMK04LUGaFFXn+halsCDzB354yT9C7V/KfXW8pCM1c9BRz4e8272i2HSN4WwD5n
BU4gVaOV5BwxynfF3Z5siRraad1AwmdoRGGsqzVRESAKaObXU//1tnO42UhxRVhn
nzFqhIm0xRcLAd8xIBlMsZQGIloicdDP9wZdWzTEDspQiwR2dFRmH9bUF8OmsS+h
KwhtDty7aZO+4gJ1ccBImijzQCmbAo7dmFhDfoLXaA5Jt6UwTXMeBHm4aUPMnvQe
NVXoRZJodDemwM642Q/Tx1SpsX6QmA==
=4R7u
-----END PGP SIGNATURE-----
Merge tag 'for-6.5/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- Update DM crypt to allocate compound pages if possible
- Fix DM crypt target's crypt_ctr_cipher_new return value on invalid
AEAD cipher
- Fix DM flakey testing target's write bio corruption feature to
corrupt the data of a cloned bio instead of the original
- Add random_read_corrupt and random_write_corrupt features to DM
flakey target
- Fix ABBA deadlock in DM thin metadata by resetting associated bufio
client rather than destroying and recreating it
- A couple other small DM thinp cleanups
- Update DM core to support disabling block core IO stats accounting
and optimize away code that isn't needed if stats are disabled
- Other small DM core cleanups
- Improve DM integrity target to not require so much memory on 32 bit
systems. Also only allocate the recalculate buffer as needed (and
increasingly reduce its size on allocation failure)
- Update DM integrity to use %*ph for printing hexdump of a small
buffer. Also update DM integrity documentation
- Various DM core ioctl interface hardening. Now more careful about
alignment of structures and processing of input passed to the kernel
from userspace.
Also disallow the creation of DM devices named "control", "." or ".."
- Eliminate GFP_NOIO workarounds for __vmalloc and kvmalloc in DM
core's ioctl and bufio code
* tag 'for-6.5/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (28 commits)
dm: get rid of GFP_NOIO workarounds for __vmalloc and kvmalloc
dm integrity: scale down the recalculate buffer if memory allocation fails
dm integrity: only allocate recalculate buffer when needed
dm integrity: reduce vmalloc space footprint on 32-bit architectures
dm ioctl: Refuse to create device named "." or ".."
dm ioctl: Refuse to create device named "control"
dm ioctl: Avoid double-fetch of version
dm ioctl: structs and parameter strings must not overlap
dm ioctl: Avoid pointer arithmetic overflow
dm ioctl: Check dm_target_spec is sufficiently aligned
Documentation: dm-integrity: Document an example of how the tunables relate.
Documentation: dm-integrity: Document default values.
Documentation: dm-integrity: Document the meaning of "buffer".
Documentation: dm-integrity: Fix minor grammatical error.
dm integrity: Use %*ph for printing hexdump of a small buffer
dm thin: disable discards for thin-pool if no_discard_passdown
dm: remove stale/redundant dm_internal_{suspend,resume} prototypes in dm.h
dm: skip dm-stats work in alloc_io() unless needed
dm: avoid needless dm_io access if all IO accounting is disabled
dm: support turning off block-core's io stats accounting
...
Updates to the usual drivers (ufs, pm80xx, libata-scsi, smartpqi,
lpfc, qla2xxx). We have a couple of major core changes impacting
other systems: Command Duration Limits, which spills into block and
ATA and block level Persistent Reservation Operations, which touches
block, nvme, target and dm (both of which are added with merge commits
containing a cover letter explaining what's going on).
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCZJ19cSYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishfZpAQCQBuWR
ELcOhsaG5KzO6xLWcH8mjsOoxffKvazZjTKXlAD5ATEv7++E250oKS3t+yfjae5I
Lc195MlDju85ItUQgfk=
=U9ik
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"Updates to the usual drivers (ufs, pm80xx, libata-scsi, smartpqi,
lpfc, qla2xxx).
We have a couple of major core changes impacting other systems:
- Command Duration Limits, which spills into block and ATA
- block level Persistent Reservation Operations, which touches block,
nvme, target and dm
Both of these are added with merge commits containing a cover letter
explaining what's going on"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (187 commits)
scsi: core: Improve warning message in scsi_device_block()
scsi: core: Replace scsi_target_block() with scsi_block_targets()
scsi: core: Don't wait for quiesce in scsi_device_block()
scsi: core: Don't wait for quiesce in scsi_stop_queue()
scsi: core: Merge scsi_internal_device_block() and device_block()
scsi: sg: Increase number of devices
scsi: bsg: Increase number of devices
scsi: qla2xxx: Remove unused nvme_ls_waitq wait queue
scsi: ufs: ufs-pci: Add support for Intel Arrow Lake
scsi: sd: sd_zbc: Use PAGE_SECTORS_SHIFT
scsi: ufs: wb: Add explicit flush_threshold sysfs attribute
scsi: ufs: ufs-qcom: Switch to the new ICE API
scsi: ufs: dt-bindings: qcom: Add ICE phandle
scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_RTC quirk
scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_INTR quirk
scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_RTC
scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_INTR
scsi: ufs: core: Remove dedicated hwq for dev command
scsi: ufs: core: mcq: Fix the incorrect OCS value for the device command
scsi: ufs: dt-bindings: samsung,exynos: Drop unneeded quotes
...
* for-6.5/block-late:
blk-sysfs: add a new attr_group for blk_mq
blk-iocost: move wbt_enable/disable_default() out of spinlock
blk-wbt: cleanup rwb_enabled() and wbt_disabled()
blk-wbt: remove dead code to handle wbt enable/disable with io inflight
blk-wbt: don't create wbt sysfs entry if CONFIG_BLK_WBT is disabled
blk-mq: fix two misuses on RQF_USE_SCHED
blk-throttle: Fix io statistics for cgroup v1
bcache: Fix bcache device claiming
bcache: Alloc holder object before async registration
raid10: avoid spin_lock from fastpath from raid10_unplug()
md: fix 'delete_mutex' deadlock
md: use mddev->external to select holder in export_rdev()
md/raid1-10: fix casting from randomized structure in raid1_submit_write()
md/raid10: fix the condition to call bio_end_io_acct()
Add a NULL check for the 'bdev' parameter of
dm_verity_loadpin_is_bdev_trusted(). The function is called
by loadpin_check(), which passes the block device that
corresponds to the super block of the file system from which
a file is being loaded. Generally a super_block structure has
an associated block device, however that is not always the
case (e.g. tmpfs).
Cc: stable@vger.kernel.org # v6.0+
Fixes: b6c1c5745c ("dm: Add verity helpers for LoadPin")
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Link: https://lore.kernel.org/r/20230627202800.1.Id63f7f59536d20f1ab83e1abdc1fda1471c7d031@changeid
Signed-off-by: Kees Cook <keescook@chromium.org>
- Yosry has also eliminated cgroup's atomic rstat flushing.
- Nhat Pham adds the new cachestat() syscall. It provides userspace
with the ability to query pagecache status - a similar concept to
mincore() but more powerful and with improved usability.
- Mel Gorman provides more optimizations for compaction, reducing the
prevalence of page rescanning.
- Lorenzo Stoakes has done some maintanance work on the get_user_pages()
interface.
- Liam Howlett continues with cleanups and maintenance work to the maple
tree code. Peng Zhang also does some work on maple tree.
- Johannes Weiner has done some cleanup work on the compaction code.
- David Hildenbrand has contributed additional selftests for
get_user_pages().
- Thomas Gleixner has contributed some maintenance and optimization work
for the vmalloc code.
- Baolin Wang has provided some compaction cleanups,
- SeongJae Park continues maintenance work on the DAMON code.
- Huang Ying has done some maintenance on the swap code's usage of
device refcounting.
- Christoph Hellwig has some cleanups for the filemap/directio code.
- Ryan Roberts provides two patch series which yield some
rationalization of the kernel's access to pte entries - use the provided
APIs rather than open-coding accesses.
- Lorenzo Stoakes has some fixes to the interaction between pagecache
and directio access to file mappings.
- John Hubbard has a series of fixes to the MM selftesting code.
- ZhangPeng continues the folio conversion campaign.
- Hugh Dickins has been working on the pagetable handling code, mainly
with a view to reducing the load on the mmap_lock.
- Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from
128 to 8.
- Domenico Cerasuolo has improved the zswap reclaim mechanism by
reorganizing the LRU management.
- Matthew Wilcox provides some fixups to make gfs2 work better with the
buffer_head code.
- Vishal Moola also has done some folio conversion work.
- Matthew Wilcox has removed the remnants of the pagevec code - their
functionality is migrated over to struct folio_batch.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJejewAKCRDdBJ7gKXxA
joggAPwKMfT9lvDBEUnJagY7dbDPky1cSYZdJKxxM2cApGa42gEA6Cl8HRAWqSOh
J0qXCzqaaN8+BuEyLGDVPaXur9KirwY=
=B7yQ
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull mm updates from Andrew Morton:
- Yosry Ahmed brought back some cgroup v1 stats in OOM logs
- Yosry has also eliminated cgroup's atomic rstat flushing
- Nhat Pham adds the new cachestat() syscall. It provides userspace
with the ability to query pagecache status - a similar concept to
mincore() but more powerful and with improved usability
- Mel Gorman provides more optimizations for compaction, reducing the
prevalence of page rescanning
- Lorenzo Stoakes has done some maintanance work on the
get_user_pages() interface
- Liam Howlett continues with cleanups and maintenance work to the
maple tree code. Peng Zhang also does some work on maple tree
- Johannes Weiner has done some cleanup work on the compaction code
- David Hildenbrand has contributed additional selftests for
get_user_pages()
- Thomas Gleixner has contributed some maintenance and optimization
work for the vmalloc code
- Baolin Wang has provided some compaction cleanups,
- SeongJae Park continues maintenance work on the DAMON code
- Huang Ying has done some maintenance on the swap code's usage of
device refcounting
- Christoph Hellwig has some cleanups for the filemap/directio code
- Ryan Roberts provides two patch series which yield some
rationalization of the kernel's access to pte entries - use the
provided APIs rather than open-coding accesses
- Lorenzo Stoakes has some fixes to the interaction between pagecache
and directio access to file mappings
- John Hubbard has a series of fixes to the MM selftesting code
- ZhangPeng continues the folio conversion campaign
- Hugh Dickins has been working on the pagetable handling code, mainly
with a view to reducing the load on the mmap_lock
- Catalin Marinas has reduced the arm64 kmalloc() minimum alignment
from 128 to 8
- Domenico Cerasuolo has improved the zswap reclaim mechanism by
reorganizing the LRU management
- Matthew Wilcox provides some fixups to make gfs2 work better with the
buffer_head code
- Vishal Moola also has done some folio conversion work
- Matthew Wilcox has removed the remnants of the pagevec code - their
functionality is migrated over to struct folio_batch
* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
mm/hugetlb: remove hugetlb_set_page_subpool()
mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
hugetlb: revert use of page_cache_next_miss()
Revert "page cache: fix page_cache_next/prev_miss off by one"
mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
mm: memcg: rename and document global_reclaim()
mm: kill [add|del]_page_to_lru_list()
mm: compaction: convert to use a folio in isolate_migratepages_block()
mm: zswap: fix double invalidate with exclusive loads
mm: remove unnecessary pagevec includes
mm: remove references to pagevec
mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
mm: remove struct pagevec
net: convert sunrpc from pagevec to folio_batch
i915: convert i915_gpu_error to use a folio_batch
pagevec: rename fbatch_count()
mm: remove check_move_unevictable_pages()
drm: convert drm_gem_put_pages() to use a folio_batch
i915: convert shmem_sg_free_table() to use a folio_batch
scatterlist: add sg_set_folio()
...
- Fix KMSAN vs FORTIFY in strlcpy/strlcat (Alexander Potapenko)
- Convert strreplace() to return string start (Andy Shevchenko)
- Flexible array conversions (Arnd Bergmann, Wyes Karny, Kees Cook)
- Add missing function prototypes seen with W=1 (Arnd Bergmann)
- Fix strscpy() kerndoc typo (Arne Welzel)
- Replace strlcpy() with strscpy() across many subsystems which were
either Acked by respective maintainers or were trivial changes that
went ignored for multiple weeks (Azeem Shaikh)
- Remove unneeded cc-option test for UBSAN_TRAP (Nick Desaulniers)
- Add KUnit tests for strcat()-family
- Enable KUnit tests of FORTIFY wrappers under UML
- Add more complete FORTIFY protections for strlcat()
- Add missed disabling of FORTIFY for all arch purgatories.
- Enable -fstrict-flex-arrays=3 globally
- Tightening UBSAN_BOUNDS when using GCC
- Improve checkpatch to check for strcpy, strncpy, and fake flex arrays
- Improve use of const variables in FORTIFY
- Add requested struct_size_t() helper for types not pointers
- Add __counted_by macro for annotating flexible array size members
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmSbftQWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJj0MD/9X9jzJzCmsAU+yNldeoAzC84Sk
GVU3RBxGcTNysL1gZXynkIgigw7DWc4htMGeSABHHwQRVP65JCH1Kw/VqIkyumbx
9LdX6IklMJb4pRT4PVU3azebV4eNmSjlur2UxMeW54Czm91/6I8RHbJOyAPnOUmo
2oomGdP/hpEHtKR7hgy8Axc6w5ySwQixh2V5sVZG3VbvCS5WKTmTXbs6puuRT5hz
iHt7v+7VtEg/Qf1W7J2oxfoghvVBsaRrSLrExWT/oZYh1ZxM7DsCAAoG/IsDgHGA
9LBXiRECgAFThbHVxLvvKZQMXdVk0i8iXLX43XMKC0wTA+NTyH7wlcQQ4RWNMuo8
sfA9Qm9gMArXaf64aymr3Uwn20Zan0391HdlbhOJZAE6v3PPJbleUnM58AzD2d3r
5Lz6AIFBxDImy+3f9iDWgacCT5/PkeiXTHzk9QnKhJyKKtRA58XJxj4q2+rPnGJP
n4haXqoxD5FJbxdXiGKk31RS0U5HBug7wkOcUrTqDHUbc/QNU2b7dxTKUx+zYtCU
uV5emPzpF4H4z+91WpO47n9gkMAfwV0lt9S2dwS8pxsgqctbmIan+Jgip7rsqZ2G
OgLXBsb43eEs+6WgO8tVt/ZHYj9ivGMdrcNcsIfikzNs/xweUJ53k2xSEn2xEa5J
cwANDmkL6QQK7yfeeg==
=s0j1
-----END PGP SIGNATURE-----
Merge tag 'hardening-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
"There are three areas of note:
A bunch of strlcpy()->strscpy() conversions ended up living in my tree
since they were either Acked by maintainers for me to carry, or got
ignored for multiple weeks (and were trivial changes).
The compiler option '-fstrict-flex-arrays=3' has been enabled
globally, and has been in -next for the entire devel cycle. This
changes compiler diagnostics (though mainly just -Warray-bounds which
is disabled) and potential UBSAN_BOUNDS and FORTIFY _warning_
coverage. In other words, there are no new restrictions, just
potentially new warnings. Any new FORTIFY warnings we've seen have
been fixed (usually in their respective subsystem trees). For more
details, see commit df8fc4e934.
The under-development compiler attribute __counted_by has been added
so that we can start annotating flexible array members with their
associated structure member that tracks the count of flexible array
elements at run-time. It is possible (likely?) that the exact syntax
of the attribute will change before it is finalized, but GCC and Clang
are working together to sort it out. Any changes can be made to the
macro while we continue to add annotations.
As an example of that last case, I have a treewide commit waiting with
such annotations found via Coccinelle:
https://git.kernel.org/linus/adc5b3cb48a049563dc673f348eab7b6beba8a9b
Also see commit dd06e72e68 for more details.
Summary:
- Fix KMSAN vs FORTIFY in strlcpy/strlcat (Alexander Potapenko)
- Convert strreplace() to return string start (Andy Shevchenko)
- Flexible array conversions (Arnd Bergmann, Wyes Karny, Kees Cook)
- Add missing function prototypes seen with W=1 (Arnd Bergmann)
- Fix strscpy() kerndoc typo (Arne Welzel)
- Replace strlcpy() with strscpy() across many subsystems which were
either Acked by respective maintainers or were trivial changes that
went ignored for multiple weeks (Azeem Shaikh)
- Remove unneeded cc-option test for UBSAN_TRAP (Nick Desaulniers)
- Add KUnit tests for strcat()-family
- Enable KUnit tests of FORTIFY wrappers under UML
- Add more complete FORTIFY protections for strlcat()
- Add missed disabling of FORTIFY for all arch purgatories.
- Enable -fstrict-flex-arrays=3 globally
- Tightening UBSAN_BOUNDS when using GCC
- Improve checkpatch to check for strcpy, strncpy, and fake flex
arrays
- Improve use of const variables in FORTIFY
- Add requested struct_size_t() helper for types not pointers
- Add __counted_by macro for annotating flexible array size members"
* tag 'hardening-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (54 commits)
netfilter: ipset: Replace strlcpy with strscpy
uml: Replace strlcpy with strscpy
um: Use HOST_DIR for mrproper
kallsyms: Replace all non-returning strlcpy with strscpy
sh: Replace all non-returning strlcpy with strscpy
of/flattree: Replace all non-returning strlcpy with strscpy
sparc64: Replace all non-returning strlcpy with strscpy
Hexagon: Replace all non-returning strlcpy with strscpy
kobject: Use return value of strreplace()
lib/string_helpers: Change returned value of the strreplace()
jbd2: Avoid printing outside the boundary of the buffer
checkpatch: Check for 0-length and 1-element arrays
riscv/purgatory: Do not use fortified string functions
s390/purgatory: Do not use fortified string functions
x86/purgatory: Do not use fortified string functions
acpi: Replace struct acpi_table_slit 1-element array with flex-array
clocksource: Replace all non-returning strlcpy with strscpy
string: use __builtin_memcpy() in strlcpy/strlcat
staging: most: Replace all non-returning strlcpy with strscpy
drm/i2c: tda998x: Replace all non-returning strlcpy with strscpy
...
For historical reasons, unbound workqueues with max concurrency limit of 1
are considered ordered, even though the concurrency limit hasn't been
system-wide for a long time. This creates ambiguity around whether ordered
execution is actually required for correctness, which was actually confusing
for e.g. btrfs (btrfs updates are being routed through the btrfs tree).
There aren't that many users in the tree which use the combination and there
are pending improvements to unbound workqueue affinity handling which will
make inadvertent use of ordered workqueue a bigger loss. This pull request
clarifies the situation for most of them by updating the ones which require
ordered execution to use alloc_ordered_workqueue().
There are some conversions being routed through subsystem-specific trees and
likely a few stragglers. Once they're all converted, workqueue can trigger a
warning on unbound + @max_active==1 usages and eventually drop the implicit
ordered behavior.
-----BEGIN PGP SIGNATURE-----
iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZJoKnA4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGc5SAQDOtjML7Cx9AYzbY5+nYc0wTebRRTXGeOu7A3Xy
j50rVgEAjHgvHLIdmeYmVhCeHOSN4q7Wn5AOwaIqZalOhfLyKQk=
=hs79
-----END PGP SIGNATURE-----
Merge tag 'wq-for-6.5-cleanup-ordered' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull ordered workqueue creation updates from Tejun Heo:
"For historical reasons, unbound workqueues with max concurrency limit
of 1 are considered ordered, even though the concurrency limit hasn't
been system-wide for a long time.
This creates ambiguity around whether ordered execution is actually
required for correctness, which was actually confusing for e.g. btrfs
(btrfs updates are being routed through the btrfs tree).
There aren't that many users in the tree which use the combination and
there are pending improvements to unbound workqueue affinity handling
which will make inadvertent use of ordered workqueue a bigger loss.
This clarifies the situation for most of them by updating the ones
which require ordered execution to use alloc_ordered_workqueue().
There are some conversions being routed through subsystem-specific
trees and likely a few stragglers. Once they're all converted,
workqueue can trigger a warning on unbound + @max_active==1 usages and
eventually drop the implicit ordered behavior"
* tag 'wq-for-6.5-cleanup-ordered' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
rxrpc: Use alloc_ordered_workqueue() to create ordered workqueues
net: qrtr: Use alloc_ordered_workqueue() to create ordered workqueues
net: wwan: t7xx: Use alloc_ordered_workqueue() to create ordered workqueues
dm integrity: Use alloc_ordered_workqueue() to create ordered workqueues
media: amphion: Use alloc_ordered_workqueue() to create ordered workqueues
scsi: NCR5380: Use default @max_active for hostdata->work_q
media: coda: Use alloc_ordered_workqueue() to create ordered workqueues
crypto: octeontx2: Use alloc_ordered_workqueue() to create ordered workqueues
wifi: ath10/11/12k: Use alloc_ordered_workqueue() to create ordered workqueues
wifi: mwifiex: Use default @max_active for workqueues
wifi: iwlwifi: Use default @max_active for trans_pcie->rba.alloc_wq
xen/pvcalls: Use alloc_ordered_workqueue() to create ordered workqueues
virt: acrn: Use alloc_ordered_workqueue() to create ordered workqueues
net: octeontx2: Use alloc_ordered_workqueue() to create ordered workqueues
net: thunderx: Use alloc_ordered_workqueue() to create ordered workqueues
greybus: Use alloc_ordered_workqueue() to create ordered workqueues
powerpc, workqueue: Use alloc_ordered_workqueue() to create ordered workqueues
- Introduce cmpxchg128() -- aka. the demise of cmpxchg_double().
The cmpxchg128() family of functions is basically & functionally
the same as cmpxchg_double(), but with a saner interface: instead
of a 6-parameter horror that forced u128 - u64/u64-halves layout
details on the interface and exposed users to complexity,
fragility & bugs, use a natural 3-parameter interface with u128 types.
- Restructure the generated atomic headers, and add
kerneldoc comments for all of the generic atomic{,64,_long}_t
operations. Generated definitions are much cleaner now,
and come with documentation.
- Implement lock_set_cmp_fn() on lockdep, for defining an ordering
when taking multiple locks of the same type. This gets rid of
one use of lockdep_set_novalidate_class() in the bcache code.
- Fix raw_cpu_generic_try_cmpxchg() bug due to an unintended
variable shadowing generating garbage code on Clang on certain
ARM builds.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmSav3wRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gDyxAAjCHQjpolrre7fRpyiTDwqzIKT27H04vQ
zrQVlVc42WBnn9pe8LthGy43/RvYvqlZvLoLONA4fMkuYriM6nSMsoZjeUmE+6Rs
QAElQC74P5YvEBOa67VNY3/M7sj22ftDe7ODtVV8OrnPjMk1sQNRvaK025Cs3yig
8MAI//hHGNmyVAp1dPYZMJNqxGCvluReLZ4SaUJFCMrg7YgUXgCBj/5Gi07TlKxn
sT8BFCssoEW/B9FXkh59B1t6FBCZoSy4XSZfsZe0uVAUJ4XDEOO+zBgaWFCedNQT
wP323ryBgMrkzUKA8j2/o5d3QnMA1GcBfHNNlvAl/fOfrxWXzDZnOEY26YcaLMa0
YIuRF/JNbPZlt6DCUVBUEvMPpfNYi18dFN0rat1a6xL2L4w+tm55y3mFtSsg76Ka
r7L2nWlRrAGXnuA+VEPqkqbSWRUSWOv5hT2Mcyb5BqqZRsxBETn6G8GVAzIO6j6v
giyfUdA8Z9wmMZ7NtB6usxe3p1lXtnZ/shCE7ZHXm6xstyZrSXaHgOSgAnB9DcuJ
7KpGIhhSODQSwC/h/J0KEpb9Pr/5jCWmXAQ2DWnZK6ndt1jUfFi8pfK58wm0AuAM
o9t8Mx3o8wZjbMdt6up9OIM1HyFiMx2BSaZK+8f/bWemHQ0xwez5g4k5O5AwVOaC
x9Nt+Tp0Ze4=
=DsYj
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
- Introduce cmpxchg128() -- aka. the demise of cmpxchg_double()
The cmpxchg128() family of functions is basically & functionally the
same as cmpxchg_double(), but with a saner interface.
Instead of a 6-parameter horror that forced u128 - u64/u64-halves
layout details on the interface and exposed users to complexity,
fragility & bugs, use a natural 3-parameter interface with u128
types.
- Restructure the generated atomic headers, and add kerneldoc comments
for all of the generic atomic{,64,_long}_t operations.
The generated definitions are much cleaner now, and come with
documentation.
- Implement lock_set_cmp_fn() on lockdep, for defining an ordering when
taking multiple locks of the same type.
This gets rid of one use of lockdep_set_novalidate_class() in the
bcache code.
- Fix raw_cpu_generic_try_cmpxchg() bug due to an unintended variable
shadowing generating garbage code on Clang on certain ARM builds.
* tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
locking/atomic: scripts: fix ${atomic}_dec_if_positive() kerneldoc
percpu: Fix self-assignment of __old in raw_cpu_generic_try_cmpxchg()
locking/atomic: treewide: delete arch_atomic_*() kerneldoc
locking/atomic: docs: Add atomic operations to the driver basic API documentation
locking/atomic: scripts: generate kerneldoc comments
docs: scripts: kernel-doc: accept bitwise negation like ~@var
locking/atomic: scripts: simplify raw_atomic*() definitions
locking/atomic: scripts: simplify raw_atomic_long*() definitions
locking/atomic: scripts: split pfx/name/sfx/order
locking/atomic: scripts: restructure fallback ifdeffery
locking/atomic: scripts: build raw_atomic_long*() directly
locking/atomic: treewide: use raw_atomic*_<op>()
locking/atomic: scripts: add trivial raw_atomic*_<op>()
locking/atomic: scripts: factor out order template generation
locking/atomic: scripts: remove leftover "${mult}"
locking/atomic: scripts: remove bogus order parameter
locking/atomic: xtensa: add preprocessor symbols
locking/atomic: x86: add preprocessor symbols
locking/atomic: sparc: add preprocessor symbols
locking/atomic: sh: add preprocessor symbols
...
In the past, the function __vmalloc didn't respect the GFP flags - it
allocated memory with the provided gfp flags, but it allocated page tables
with GFP_KERNEL. This was fixed in commit 451769ebb7 ("mm/vmalloc:
alloc GFP_NO{FS,IO} for vmalloc") so the memalloc_noio_{save,restore}
workaround is no longer needed.
The function kvmalloc didn't like flags different from GFP_KERNEL. This
was fixed in commit a421ef3030 ("mm: allow !GFP_KERNEL allocations
for kvmalloc"), so kvmalloc can now be called with GFP_NOIO.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
If memory allocation fails, try to reduce the size of the recalculate
buffer and continue with that smaller buffer.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
dm-integrity preallocated 8MiB buffer for recalculating in the
constructor and freed it in the destructor. This wastes memory when
the user has many dm-integrity devices.
Fix dm-integrity so that the buffer is only allocated when
recalculation is in progress; allocate the buffer at the beginning of
integrity_recalc() and free it at the end.
Note that integrity_recalc() doesn't hold any locks when allocating
the buffer, so it shouldn't cause low-memory deadlock.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
It was reported that dm-integrity runs out of vmalloc space on 32-bit
architectures. On x86, there is only 128MiB vmalloc space and dm-integrity
consumes it quickly because it has a 64MiB journal and 8MiB recalculate
buffer.
Fix this by reducing the size of the journal to 4MiB and the size of
the recalculate buffer to 1MiB, so that multiple dm-integrity devices
can be created and activated on 32-bit architectures.
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8dwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpilGD/9Yys1oxIXJpRf00fzrylAlBthRxMjFQVWw
zAut106hAQiBHvU8IkmGA3MvEFVHxtzwYhHI7IR8K3aZBIqscweCqmVI9JyogJw9
U9Twnzel47VmuKdM94FeoN+hbj1fP8EWTjzmy67/zEEfFCdmHvNlMi3lSrGYIpFy
39LxTB99Y4UarM5PtWbes37GYYljzMSWKuo4AfBkvq1eQa+sZ0Vq2xAABKq3UM7f
apqhgHtkJooRePDP0eQp+kAyyVMgW2jIK+oIdJDxNF3CKTu2w40RzaYz6fp+jVSU
H4R/xS59GW4/xql+VBJDh/qJg9K62DPPYjlW8BmSR8+IjvfFpsyH3/MacE50CD3P
20fs/Mnj49H79fDrQEHJI53cOOb2EmUitbwLbvOcColNTPpt8loBtdQxjF2RMU8R
Nyort9DJPFclYCxky1LYg1CNEC2Ln4Zy/jD47wPvqRmOQphOoVlV/hPnOEqvjaZC
49Vn70W2DeE9cXvYI7ha+XIg6/oj+Gs3iusEbV08Ci7EAtXgI+ZUUsQ97K8UNiUh
h2lqSJtuI7lBpYP9sf+BeCch5UCC+xGYyTdoM5f58lehWBBPtbs0g7S9RyRyOYxe
n+yxEUo3dAGzJ/xsKAjinbZfeWIpr0b1TkAh4w3Cq/BKzRr9Bp8lBAxYuancbQ+Y
1ADPteUOTA==
=zP4Y
-----END PGP SIGNATURE-----
Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe pull request via Keith:
- Various cleanups all around (Irvin, Chaitanya, Christophe)
- Better struct packing (Christophe JAILLET)
- Reduce controller error logs for optional commands (Keith)
- Support for >=64KiB block sizes (Daniel Gomez)
- Fabrics fixes and code organization (Max, Chaitanya, Daniel
Wagner)
- bcache updates via Coly:
- Fix a race at init time (Mingzhe Zou)
- Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye)
- use page pinning in the block layer for dio (David)
- convert old block dio code to page pinning (David, Christoph)
- cleanups for pktcdvd (Andy)
- cleanups for rnbd (Guoqing)
- use the unchecked __bio_add_page() for the initial single page
additions (Johannes)
- fix overflows in the Amiga partition handling code (Michael)
- improve mq-deadline zoned device support (Bart)
- keep passthrough requests out of the IO schedulers (Christoph, Ming)
- improve support for flush requests, making them less special to deal
with (Christoph)
- add bdev holder ops and shutdown methods (Christoph)
- fix the name_to_dev_t() situation and use cases (Christoph)
- decouple the block open flags from fmode_t (Christoph)
- ublk updates and cleanups, including adding user copy support (Ming)
- BFQ sanity checking (Bart)
- convert brd from radix to xarray (Pankaj)
- constify various structures (Thomas, Ivan)
- more fine grained persistent reservation ioctl capability checks
(Jingbo)
- misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan,
Jordy, Li, Min, Yu, Zhong, Waiman)
* tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits)
scsi/sg: don't grab scsi host module reference
ext4: Fix warning in blkdev_put()
block: don't return -EINVAL for not found names in devt_from_devname
cdrom: Fix spectre-v1 gadget
block: Improve kernel-doc headers
blk-mq: don't insert passthrough request into sw queue
bsg: make bsg_class a static const structure
ublk: make ublk_chr_class a static const structure
aoe: make aoe_class a static const structure
block/rnbd: make all 'class' structures const
block: fix the exclusive open mask in disk_scan_partitions
block: add overflow checks for Amiga partition support
block: change all __u32 annotations to __be32 in affs_hardblocks.h
block: fix signed int overflow in Amiga partition support
block: add capacity validation in bdev_add_partition()
block: fine-granular CAP_SYS_ADMIN for Persistent Reservation
block: disallow Persistent Reservation on partitions
reiserfs: fix blkdev_put() warning from release_journal_dev()
block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions()
block: document the holder argument to blkdev_get_by_path
...
Commit 2736e8eeb0 ("block: use the holder as indication for exclusive
opens") introduced a change that blkdev_put() has to get exclusive
holder of the bdev as an argument. However it overlooked that
register_bdev() and register_cache() overwrite the bdev->bd_holder field
in the block device to point to the real owning object which was not
available at the time we called blkdev_get_by_path(). Messing with bdev
internals like this is a layering violation and it also causes
blkdev_put() to issue warning about mismatching holders.
Fix bcache to reopen the block device with appropriate holder once it is
available which also restores the behavior that multiple bcache caches
cannot claim the same device which was broken by commit 29499ab060
("bcache: don't pass a stack address to blkdev_get_by_path").
Fixes: 2736e8eeb0 ("block: use the holder as indication for exclusive opens")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Acked-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230622164658.12861-2-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Allocate holder object (cache or cached_dev) before offloading the
rest of the startup to async work. This will allow us to open the block
block device with proper holder.
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://lore.kernel.org/r/20230622164658.12861-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 0c0be98bbe ("md/raid10: prevent unnecessary calls to wake_up()
in fast path") missed one place, for example, with:
fio -direct=1 -rw=write/randwrite -iodepth=1 ...
Plug and unplug are called for each io, then wake_up() from raid10_unplug()
will cause lock contention as well.
Avoid this contention by using wake_up_barrier() instead of wake_up(),
where spin_lock is not held if waitqueue is empty.
Fio test script:
[global]
name=random reads and writes
ioengine=libaio
direct=1
readwrite=randrw
rwmixread=70
iodepth=64
buffered=0
filename=/dev/md0
size=1G
runtime=30
time_based
randrepeat=0
norandommap
refill_buffers
ramp_time=10
bs=4k
numjobs=400
group_reporting=1
[job1]
Test result with ramdisk raid10(By Ali):
Before this patch With this patch
READ IOPS=2033k IOPS=3642k
WRITE IOPS=871k IOPS=1561K
By the way, in this scenario, blk_plug_cb() will be allocated and freed
for each io, this seems need to be optimized as well.
Reported-and-tested-by: Ali Gholami Rudi <aligrudi@gmail.com>
Closes: https://lore.kernel.org/all/20231606122233@laper.mirepesht/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621105728.1268542-1-yukuai1@huaweicloud.com
Commit 3ce94ce5d0 ("md: fix duplicate filename for rdev") introduce a
new lock 'delete_mutex', and trigger a new deadlock:
t1: remove rdev t2: sysfs writer
rdev_attr_store rdev_attr_store
mddev_lock
state_store
md_kick_rdev_from_array
lock delete_mutex
list_add mddev->deleting
unlock delete_mutex
mddev_unlock
mddev_lock
...
lock delete_mutex
kobject_del
// wait for sysfs writers to be done
mddev_unlock
lock delete_mutex
// wait for delete_mutex, deadlock
'delete_mutex' is used to protect the list 'mddev->deleting', turns out
that this list can be protected by 'reconfig_mutex' directly, and this
lock can be removed.
Fix this problem by removing the lock, and use 'reconfig_mutex' to
protect the list. mddev_unlock() will move this list to a local list to
be handled after 'reconfig_mutex' is dropped.
Fixes: 3ce94ce5d0 ("md: fix duplicate filename for rdev")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621142933.1395629-1-yukuai1@huaweicloud.com
Following build error triggered while build with clang version 17.0.0
with W=1(this can't be reporduced with gcc 13.1.0):
drivers/md/raid1-10.c:117:25: error: casting from randomized structure
pointer type 'struct block_device *' to 'struct md_rdev *'
117 | struct md_rdev *rdev = (struct md_rdev *)bio->bi_bdev;
| ^
Fix this by casting 'bio->bi_bdev' to 'void *', as it used to be.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202306142042.fmjfmTF8-lkp@intel.com/
Fixes: 8295efbe68 ("md/raid1-10: factor out a helper to submit normal write")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230616012136.3047071-1-yukuai1@huaweicloud.com
/sys/block/[device]/queue/iostats is used to control whether to count io
stat. Write 0 to it will clear queue_flags QUEUE_FLAG_IO_STAT which means
iostats is disabled. If we disable iostats and later endable it, the io
issued during this period will be counted incorrectly, inflight will be
decreased to -1.
//T1 set iostats
echo 0 > /sys/block/md0/queue/iostats
clear QUEUE_FLAG_IO_STAT
//T2 issue io
if (QUEUE_FLAG_IO_STAT) -> false
bio_start_io_acct
inflight++
echo 1 > /sys/block/md0/queue/iostats
set QUEUE_FLAG_IO_STAT
//T3 io end
if (QUEUE_FLAG_IO_STAT) -> true
bio_end_io_acct
inflight-- -> -1
Also, if iostats is enabled while issuing io but disabled while io end,
inflight will never be decreased.
Fix it by checking start_time when io end. If start_time is not 0, call
bio_end_io_acct().
Fixes: 528bc2cf2f ("md/raid10: enable io accounting")
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230609094320.2397604-1-linan666@huaweicloud.com
Using either of these is going to greatly confuse userspace, as they are
not valid symlink names and so creating the usual /dev/mapper/NAME
symlink will not be possible. As creating a device with either of these
names is almost certainly a userspace bug, just error out.
Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Typical userspace setups create a symlink under /dev/mapper with the
name of the device, but /dev/mapper/control is reserved for DM's control
device. Therefore, trying to create such a device is almost certain to
be a userspace bug.
Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
The version is fetched once in check_version(), which then does some
validation and then overwrites the version in userspace with the API
version supported by the kernel. copy_params() then fetches the version
from userspace *again*, and this time no validation is done. The result
is that the kernel's version number is completely controllable by
userspace, provided that userspace can win a race condition.
Fix this flaw by not copying the version back to the kernel the second
time. This is not exploitable as the version is not further used in the
kernel. However, it could become a problem if future patches start
relying on the version field.
Cc: stable@vger.kernel.org
Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
The NUL terminator for each target parameter string must precede the
following 'struct dm_target_spec'. Otherwise, dm_split_args() might
corrupt this struct. Furthermore, the first 'struct dm_target_spec'
must come after the 'struct dm_ioctl', as if it overlaps too much
dm_split_args() could corrupt the 'struct dm_ioctl'.
Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Especially on 32-bit systems, it is possible for the pointer
arithmetic to overflow and cause a userspace pointer to be
dereferenced in the kernel.
Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Otherwise subsequent code, if given malformed input, could dereference
a misaligned 'struct dm_target_spec *'.
Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> # use %zu
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
The kernel already has a helper to print a hexdump of a small
buffer via pointer extension. Use that instead of open coded
variant.
In long term it helps to kill pr_cont() or at least narrow down
its use.
Note, the format is slightly changed, i.e. the trailing space is
always printed. Also the IV dump is limited by 64 bytes which seems
fine.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
This includes a wholesale reversion of the post-6.4 series "make slab shrink
lockless". After input from Dave Chinner it has been decided that we
should go a different way. Thread starts at
https://lkml.kernel.org/r/ZH6K0McWBeCjaf16@dread.disaster.area.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJH/qAAKCRDdBJ7gKXxA
jq7uAP9AtDGHfvOuW5jlHdYfpUBnbfuQDKjiik71UuIxyhtwQQEAqpOBv7UDuhHj
NbNIGTIi/xM5vkpjV6CBo9ymR7qTKwo=
=uGuc
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2023-06-20-12-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"19 hotfixes. 8 of these are cc:stable.
This includes a wholesale reversion of the post-6.4 series 'make slab
shrink lockless'. After input from Dave Chinner it has been decided
that we should go a different way [1]"
Link: https://lkml.kernel.org/r/ZH6K0McWBeCjaf16@dread.disaster.area [1]
* tag 'mm-hotfixes-stable-2023-06-20-12-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
selftests/mm: fix cross compilation with LLVM
mailmap: add entries for Ben Dooks
nilfs2: prevent general protection fault in nilfs_clear_dirty_page()
Revert "mm: vmscan: make global slab shrink lockless"
Revert "mm: vmscan: make memcg slab shrink lockless"
Revert "mm: vmscan: add shrinker_srcu_generation"
Revert "mm: shrinkers: make count and scan in shrinker debugfs lockless"
Revert "mm: vmscan: hold write lock to reparent shrinker nr_deferred"
Revert "mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()"
Revert "mm: shrinkers: convert shrinker_rwsem to mutex"
nilfs2: fix buffer corruption due to concurrent device reads
scripts/gdb: fix SB_* constants parsing
scripts: fix the gfp flags header path in gfp-translate
udmabuf: revert 'Add support for mapping hugepages (v4)'
mm/khugepaged: fix iteration in collapse_file
memfd: check for non-NULL file_seals in memfd_create() syscall
mm/vmalloc: do not output a spurious warning when huge vmalloc() fails
mm/mprotect: fix do_mprotect_pkey() limit check
writeback: fix dereferencing NULL mapping->host on writeback_page_template
Patch series "revert shrinker_srcu related changes".
This patch (of 7):
This reverts commit cf2e309ebc.
Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec
test case [1], which is caused by commit f95bdb700b ("mm: vmscan: make
global slab shrink lockless"). The root cause is that SRCU has to be
careful to not frequently check for SRCU read-side critical section exits.
Therefore, even if no one is currently in the SRCU read-side critical
section, synchronize_srcu() cannot return quickly. That's why
unregister_shrinker() has become slower.
After discussion, we will try to use the refcount+RCU method [2] proposed
by Dave Chinner to continue to re-implement the lockless slab shrink. So
revert the shrinker_mutex back to shrinker_rwsem first.
[1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
[2]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
Link: https://lkml.kernel.org/r/20230609081518.3039120-1-qi.zheng@linux.dev
Link: https://lkml.kernel.org/r/20230609081518.3039120-2-qi.zheng@linux.dev
Reported-by: kernel test robot <yujie.liu@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202305230837.db2c233f-yujie.liu@intel.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yujie Liu <yujie.liu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Also rename disable_passdown_if_not_supported to
disable_discard_passdown_if_not_supported.
And fold passdown_enabled() into only caller.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Update dm_io_acct() to eliminate most dm_io struct accesses if both
block core's IO stats and dm-stats are disabled.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Commit bc58ba9468 ("block: add sysfs file for controlling io stats
accounting") allowed users to turn off disk stat accounting completely
by checking if queue flag QUEUE_FLAG_IO_STAT is set. In dm, this flag
is neither set nor checked: so block-core's io stats are continuously
counted and cannot be turned off.
Add support for turning off block-core's io stats accounting for dm.
Set QUEUE_FLAG_IO_STAT for dm's request_queue. If QUEUE_FLAG_IO_STAT
is set when an io starts, record the need for block core's io stats by
setting the DM_IO_BLK_STAT dm_io flag to avoid io stats being disabled
in the middle of the io.
DM statistics (dm-stats) is independent of block-core's io stats and
remains unchanged.
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Use bitmap_zalloc()/bitmap_free() instead of hand-writing them.
It is less verbose and it improves the semantic.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
As described in commit 8111964f1b ("dm thin: Fix ABBA deadlock between
shrink_slab and dm_pool_abort_metadata"), ABBA deadlocks will be
triggered because shrinker_rwsem currently needs to held by
dm_pool_abort_metadata() as a side-effect of thin-pool metadata
operation failure.
The following three problem scenarios have been noticed:
1) Described by commit 8111964f1b ("dm thin: Fix ABBA deadlock between
shrink_slab and dm_pool_abort_metadata")
2) shrinker_rwsem and throttle->lock
P1(drop cache) P2(kworker)
drop_caches_sysctl_handler
drop_slab
shrink_slab
down_read(&shrinker_rwsem) - LOCK A
do_shrink_slab
super_cache_scan
prune_icache_sb
dispose_list
evict
ext4_evict_inode
ext4_clear_inode
ext4_discard_preallocations
ext4_mb_load_buddy_gfp
ext4_mb_init_cache
ext4_wait_block_bitmap
__ext4_error
ext4_handle_error
ext4_commit_super
...
dm_submit_bio
do_worker
throttle_work_update
down_write(&t->lock) -- LOCK B
process_deferred_bios
commit
metadata_operation_failed
dm_pool_abort_metadata
dm_block_manager_create
dm_bufio_client_create
register_shrinker
down_write(&shrinker_rwsem)
-- LOCK A
thin_map
thin_bio_map
thin_defer_bio_with_throttle
throttle_lock
down_read(&t->lock) - LOCK B
3) shrinker_rwsem and wait_on_buffer
P1(drop cache) P2(kworker)
drop_caches_sysctl_handler
drop_slab
shrink_slab
down_read(&shrinker_rwsem) - LOCK A
do_shrink_slab
...
ext4_wait_block_bitmap
__ext4_error
ext4_handle_error
jbd2_journal_abort
jbd2_journal_update_sb_errno
jbd2_write_superblock
submit_bh
// LOCK B
// RELEASE B
do_worker
throttle_work_update
down_write(&t->lock) - LOCK B
process_deferred_bios
process_bio
commit
metadata_operation_failed
dm_pool_abort_metadata
dm_block_manager_create
dm_bufio_client_create
register_shrinker
register_shrinker_prepared
down_write(&shrinker_rwsem) - LOCK A
bio_endio
wait_on_buffer
__wait_on_buffer
Fix these by resetting dm_bufio_client without holding shrinker_rwsem.
Fixes: 8111964f1b ("dm thin: Fix ABBA deadlock between shrink_slab and dm_pool_abort_metadata")
Cc: stable@vger.kernel.org
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
If the user specifies invalid AEAD cipher, dm-crypt should return the
error returned from crypt_ctr_auth_spec, not -ENOMEM.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Removes assumptions about what might follow the discard setup code
(previously the code would return early if discards not enabled).
Makes it possible to add more capabilites to the end of each .io_hints
method (which is the natural thing to do when adding new features).
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
The random_read_corrupt and random_write_corrupt options corrupt a
random byte in a bio with the provided probability. The corruption
only happens in the "down" interval.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
dm-flakey has an option to corrupt write bios. It corrupts the memory that
is being written. This can cause system crashes or security bugs - for
example, if the user writes a shared library code with O_DIRECT flag to a
dm-flakey device, it corrupts the memory for all users that have the
shared library mapped.
Fix this bug by cloning the bio and corrupting the clone rather than
the original.
Also drop the test for ZERO_PAGE(0) - it can't happen because we write
the cloned pages.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
It was reported that allocating pages for the write buffer in dm-crypt
causes measurable overhead [1].
Change dm-crypt to allocate compound pages if they are available. If
not, fall back to the mempool.
[1] https://listman.redhat.com/archives/dm-devel/2023-February/053284.html
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
merge; where DM core was splitting large discards every 128K
(max_sectors_kb) rather than every 64M (discard_max_bytes).
- Extend DM core LOCKFS fix, made during 6.4 merge, to also fix race
between do_mount and dm's do_suspend (in addition to the earlier
fix's do_mount race with dm's do_resume).
- Fix DM thin metadata operations to first check if the thin-pool is
in "fail_io" mode; otherwise UAF can occur.
- Fix DM thinp's call to __blkdev_issue_discard to use GFP_NOIO rather
than GFP_NOWAIT (__blkdev_issue_discard cannot handle NULL return
from bio_alloc).
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmSLTh0ACgkQxSPxCi2d
A1rmqggAildPKBjT8nqZmU86lpsy60E03OwvBnGPkMF5pjkOUmTjkb5EWVSAmeuO
ojj0pWlC+1ZvVkiDfkWxt0NL/4ETD4q+5oy1ARBcOawPX6bj0eXLoBr6m10b+KOb
mKAoXgYrESEzQ2qPBe4a4Lj3zIBXzXpMpW9TtF23z4HnDpnwpED5xNPWBgiWc3O/
/6MF1ASLp0DWldoL+gmIp9hEzyQzbzgM4uBOGC4UAYk3U1I55qwX6bWDZ9cQNGMh
AqCSrphuKHvbsb31yb1X3hB3g1XbAeSvvcizgFY0g9ZpncddKm5gx0BWVDO7qGBG
UxLIec19kQ2CIEx/QJZIhEjneLlJ/g==
=mME4
-----END PGP SIGNATURE-----
Merge tag 'for-6.4/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- Fix DM thinp discard performance regression introduced during this
merge window where DM core was splitting large discards every 128K
(max_sectors_kb) rather than every 64M (discard_max_bytes).
- Extend DM core LOCKFS fix, made during 6.4 merge, to also fix race
between do_mount and dm's do_suspend (in addition to the earlier
fix's do_mount race with dm's do_resume).
- Fix DM thin metadata operations to first check if the thin-pool is in
"fail_io" mode; otherwise UAF can occur.
- Fix DM thinp's call to __blkdev_issue_discard to use GFP_NOIO rather
than GFP_NOWAIT (__blkdev_issue_discard cannot handle NULL return
from bio_alloc).
* tag 'for-6.4/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: use op specific max_sectors when splitting abnormal io
dm thin: fix issue_discard to pass GFP_NOIO to __blkdev_issue_discard
dm thin metadata: check fail_io before using data_sm
dm: don't lock fs when the map is NULL during suspend or resume
Split abnormal IO in terms of the corresponding operation specific
max_sectors (max_discard_sectors, max_secure_erase_sectors or
max_write_zeroes_sectors).
This fixes a significant dm-thinp discard performance regression that
was introduced with commit e2dd8aca2d ("dm bio prison v1: improve
concurrent IO performance"). Relative to discard: max_discard_sectors
is used instead of max_sectors; which fixes excessive discard splitting
(e.g. max_sectors=128K vs max_discard_sectors=64M).
Tested by discarding an 1 Petabyte dm-thin device:
lvcreate -V 1125899906842624B -T test/pool -n thin
time blkdiscard /dev/test/thin
Before this fix (splitting discards every 128K): ~116m
After this fix (splitting discards every 64M) : 0m33.460s
Reported-by: Zorro Lang <zlang@redhat.com>
Fixes: 06961c487a ("dm: split discards further if target sets max_discard_granularity")
Requires: 13f6facf3f ("dm: allow targets to require splitting WRITE_ZEROES and SECURE_ERASE")
Fixes: e2dd8aca2d ("dm bio prison v1: improve concurrent IO performance")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
issue_discard() passes GFP_NOWAIT to __blkdev_issue_discard() despite
its code assuming bio_alloc() always succeeds.
Commit 3dba53a958 ("dm thin: use __blkdev_issue_discard for async
discard support") clearly shows where things went bad:
Before commit 3dba53a958, dm-thin.c's open-coded
__blkdev_issue_discard_async() properly handled using GFP_NOWAIT.
Unfortunately __blkdev_issue_discard() doesn't and it was missed
during review.
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Must check pmd->fail_io before using pmd->data_sm since
pmd->data_sm may be destroyed by other processes.
P1(kworker) P2(message)
do_worker
process_prepared
process_prepared_discard_passdown_pt2
dm_pool_dec_data_range
pool_message
commit
dm_pool_commit_metadata
↓
// commit failed
metadata_operation_failed
abort_transaction
dm_pool_abort_metadata
__open_or_format_metadata
↓
dm_sm_disk_open
↓
// open failed
// pmd->data_sm is NULL
dm_sm_dec_blocks
↓
// try to access pmd->data_sm --> UAF
As shown above, if dm_pool_commit_metadata() and
dm_pool_abort_metadata() fail in pool_message process, kworker may
trigger UAF.
Fixes: be500ed721 ("dm space maps: improve performance with inc/dec on ranges of blocks")
Cc: stable@vger.kernel.org
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
As described in commit 38d11da522 ("dm: don't lock fs when the map is
NULL in process of resume"), a deadlock may be triggered between
do_resume() and do_mount().
This commit preserves the fix from commit 38d11da522 but moves it to
where it also serves to fix a similar deadlock between do_suspend()
and do_mount(). It does so, if the active map is NULL, by clearing
DM_SUSPEND_LOCKFS_FLAG in dm_suspend() which is called by both
do_suspend() and do_resume().
Fixes: 38d11da522 ("dm: don't lock fs when the map is NULL in process of resume")
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
We get a kernel crash about "list_add corruption. next->prev should be
prev (ffff9c801bc01210), but was ffff9c77b688237c.
(next=ffffae586d8afe68)."
crash> struct list_head 0xffff9c801bc01210
struct list_head {
next = 0xffffae586d8afe68,
prev = 0xffffae586d8afe68
}
crash> struct list_head 0xffff9c77b688237c
struct list_head {
next = 0x0,
prev = 0x0
}
crash> struct list_head 0xffffae586d8afe68
struct list_head struct: invalid kernel virtual address: ffffae586d8afe68 type: "gdb_readmem_callback"
Cannot access memory at address 0xffffae586d8afe68
[230469.019492] Call Trace:
[230469.032041] prepare_to_wait+0x8a/0xb0
[230469.044363] ? bch_btree_keys_free+0x6c/0xc0 [escache]
[230469.056533] mca_cannibalize_lock+0x72/0x90 [escache]
[230469.068788] mca_alloc+0x2ae/0x450 [escache]
[230469.080790] bch_btree_node_get+0x136/0x2d0 [escache]
[230469.092681] bch_btree_check_thread+0x1e1/0x260 [escache]
[230469.104382] ? finish_wait+0x80/0x80
[230469.115884] ? bch_btree_check_recurse+0x1a0/0x1a0 [escache]
[230469.127259] kthread+0x112/0x130
[230469.138448] ? kthread_flush_work_fn+0x10/0x10
[230469.149477] ret_from_fork+0x35/0x40
bch_btree_check_thread() and bch_dirty_init_thread() may call
mca_cannibalize() to cannibalize other cached btree nodes. Only one thread
can do it at a time, so the op of other threads will be added to the
btree_cache_wait list.
We must call finish_wait() to remove op from btree_cache_wait before free
it's memory address. Otherwise, the list will be damaged. Also should call
bch_cannibalize_unlock() to release the btree_cache_alloc_lock and wake_up
other waiters.
Fixes: 8e7102273f ("bcache: make bch_btree_check() to be multithreaded")
Fixes: b144e45fc5 ("bcache: make bch_sectors_dirty_init() to be multithreaded")
Cc: stable@vger.kernel.org
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-7-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In some specific situations, the return value of __bch_btree_node_alloc
may be NULL. This may lead to a potential NULL pointer dereference in
caller function like a calling chain :
btree_split->bch_btree_node_alloc->__bch_btree_node_alloc.
Fix it by initializing the return value in __bch_btree_node_alloc.
Fixes: cafe563591 ("bcache: A block layer cache")
Cc: stable@vger.kernel.org
Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-6-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Due to the previous fix of __bch_btree_node_alloc, the return value will
never be a NULL pointer. So IS_ERR is enough to handle the failure
situation. Fix it by replacing IS_ERR_OR_NULL check by an IS_ERR check.
Fixes: cafe563591 ("bcache: A block layer cache")
Cc: stable@vger.kernel.org
Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-5-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The cache_readaheads stat counter is not used anymore and should be
removed.
Signed-off-by: Andrea Tomassetti <andrea.tomassetti-opensource@devo.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-4-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since commit ee6d3dd4ed ("driver core: make kobj_type constant.")
the driver core allows the usage of const struct kobj_type.
Take advantage of this to constify the structure definitions to prevent
modification at runtime.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-3-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Follow the advice of the Documentation/filesystems/sysfs.rst and show()
should only use sysfs_emit() or sysfs_emit_at() when formatting the
value to be returned to user space.
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio can be added to plug infinitely, and following writeback test can
trigger huge amount of plugged bio:
Test script:
modprobe brd rd_nr=4 rd_size=10485760
mdadm -CR /dev/md0 -l10 -n4 /dev/ram[0123] --assume-clean --bitmap=internal
echo 0 > /proc/sys/vm/dirty_background_ratio
fio -filename=/dev/md0 -ioengine=libaio -rw=write -bs=4k -numjobs=1 -iodepth=128 -name=test
Test result:
Monitor /sys/block/md0/inflight will found that inflight keep increasing
until fio finish writing, after running for about 2 minutes:
[root@fedora ~]# cat /sys/block/md0/inflight
0 4474191
Fix the problem by limiting the number of plugged bio based on the number
of copies for original bio.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-8-yukuai1@huaweicloud.com
current->bio_list will be set under submit_bio() context, in this case
bitmap io will be added to the list and wait for current io submission to
finish, while current io submission must wait for bitmap io to be done.
commit 874807a831 ("md/raid1{,0}: fix deadlock in bitmap_unplug.") fix
the deadlock by handling plugged bio by daemon thread.
On the one hand, the deadlock won't exist after commit a214b949d8
("blk-mq: only flush requests from the plug in blk_mq_submit_bio"). On
the other hand, current solution makes it impossible to flush plugged bio
in raid1/10_make_request(), because this will cause that all the writes
will goto daemon thread.
In order to limit the number of plugged bio, commit 874807a831
("md/raid1{,0}: fix deadlock in bitmap_unplug.") is reverted, and the
deadlock is fixed by handling bitmap io asynchronously.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-7-yukuai1@huaweicloud.com
If bitmap is enabled, bitmap must update before submitting write io, this
is why unplug callback must move these io to 'conf->pending_io_list' if
'current->bio_list' is not empty, which will suffer performance
degradation.
A new helper md_bitmap_unplug_async() is introduced to submit bitmap io
in a kworker, so that submit bitmap io in raid10_unplug() doesn't require
that 'current->bio_list' is empty.
This patch prepare to limit the number of plugged bio.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-6-yukuai1@huaweicloud.com
Commit 6cce3b23f6 ("[PATCH] md: write intent bitmap support for raid10")
add bitmap support, and it changed that write io is submitted through
daemon thread because bitmap need to be updated before write io. And
later, plug is used to fix performance regression because all the write io
will go to demon thread, which means io can't be issued concurrently.
However, if bitmap is not enabled, the write io should not go to daemon
thread in the first place, and plug is not needed as well.
Fixes: 6cce3b23f6 ("[PATCH] md: write intent bitmap support for raid10")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-5-yukuai1@huaweicloud.com
There are multiple places to do the same thing, factor out a helper to
prevent redundant code, and the helper will be used in following patch
as well.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-4-yukuai1@huaweicloud.com
Currently, there is no limit for raid1/raid10 plugged bio. While flushing
writes, raid1 has cond_resched() while raid10 doesn't, and too many
writes can cause soft lockup.
Follow up soft lockup can be triggered easily with writeback test for
raid10 with ramdisks:
watchdog: BUG: soft lockup - CPU#10 stuck for 27s! [md0_raid10:1293]
Call Trace:
<TASK>
call_rcu+0x16/0x20
put_object+0x41/0x80
__delete_object+0x50/0x90
delete_object_full+0x2b/0x40
kmemleak_free+0x46/0xa0
slab_free_freelist_hook.constprop.0+0xed/0x1a0
kmem_cache_free+0xfd/0x300
mempool_free_slab+0x1f/0x30
mempool_free+0x3a/0x100
bio_free+0x59/0x80
bio_put+0xcf/0x2c0
free_r10bio+0xbf/0xf0
raid_end_bio_io+0x78/0xb0
one_write_done+0x8a/0xa0
raid10_end_write_request+0x1b4/0x430
bio_endio+0x175/0x320
brd_submit_bio+0x3b9/0x9b7 [brd]
__submit_bio+0x69/0xe0
submit_bio_noacct_nocheck+0x1e6/0x5a0
submit_bio_noacct+0x38c/0x7e0
flush_pending_writes+0xf0/0x240
raid10d+0xac/0x1ed0
Fix the problem by adding cond_resched() to raid10 like what raid1 did.
Note that unlimited plugged bio still need to be optimized, for example,
in the case of lots of dirty pages writeback, this will take lots of
memory and io will spend a long time in plug, hence io latency is bad.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-2-yukuai1@huaweicloud.com
When removing a disk with replacement, the replacement will be used to
replace rdev. During this process, there is a brief window in which both
rdev and replacement are read as NULL in raid10_write_request(). This
will result in io not being submitted but it should be.
//remove //write
raid10_remove_disk raid10_write_request
mirror->rdev = NULL
read rdev -> NULL
mirror->rdev = mirror->replacement
mirror->replacement = NULL
read replacement -> NULL
Fix it by reading replacement first and rdev later, meanwhile, use smp_mb()
to prevent memory reordering.
Fixes: 475b0321a4 ("md/raid10: writes should get directed to replacement as well as original.")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230602091839.743798-3-linan666@huaweicloud.com
In raid10_sync_request(), if data cannot be read from any disk for
recovery, it will go to 'giveup' and let 'chunks_skipped' + 1. After
multiple 'giveup', when 'chunks_skipped >= geo.raid_disks', it will
return 'max_sector', indicating that the recovery has been completed.
However, the recovery is just aborted and the data remains inconsistent.
Fix it by setting mirror->recovery_disabled, which will prevent the spare
disk from being added to this mirror. The same issue also exists during
resync, it will be fixed afterwards.
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230602091839.743798-2-linan666@huaweicloud.com
Commit 1a855a0606 ("md: fix bug with re-adding of partially recovered
device.") only add device which is set to In_sync. But it let devices
without metadata cannot be added when they should be.
Commit bf572541ab ("md: fix regression with re-adding devices to arrays
with no metadata") fix the above issue, it set device without metadata to
In_sync when add new disk.
However, after commit f466722ca6 ("md: Change handling of save_raid_disk
and metadata update during recovery.") deletes changes of the first patch,
setting In_sync for devcie without metadata is meanless because the flag
will be cleared soon and will not be used during this period. Clean it up.
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230527101851.3266500-2-linan666@huaweicloud.com
When add a new disk to raid10, it will traverse conf->mirror from start
and find one of the following mirror to add:
1. mirror->rdev is set to WantReplacement and it have no replacement,
set new disk to mirror->replacement.
2. no mirror->rdev, set new disk to mirror->rdev.
There is a array as below (sda is set to WantReplacement):
Number Major Minor RaidDevice State
0 8 0 0 active sync set-A /dev/sda
- 0 0 1 removed
2 8 32 2 active sync set-A /dev/sdc
3 8 48 3 active sync set-B /dev/sdd
Use 'mdadm --add' to add a new disk to this array, the new disk will
become sda's replacement instead of add to removed position, which is
confusing for users. Meanwhile, after new disk recovery success, sda
will be set to Faulty.
Prioritize adding disk to 'removed' mirror is a better choice. In the
above scenario, the behavior is the same as before, except sda will not
be deleted. Before other disks are added, continued use sda is more
reliable.
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230527092007.3008856-1-linan666@huaweicloud.com
'need_recover' and 'mrdev' are equivalent in raid10_sync_request(), and
inc mrdev->nr_pending is unreasonable if don't need recovery. Replace
'need_recover' with 'mrdev', and only inc nr_pending when needed.
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230527072218.2365857-3-linan666@huaweicloud.com
There are two check of 'mreplace' in raid10_sync_request(). In the first
check, 'need_replace' will be set and 'mreplace' will be used later if
no-Faulty 'mreplace' exists, In the second check, 'mreplace' will be
set to NULL if it is Faulty, but 'need_replace' will not be changed
accordingly. null-ptr-deref occurs if Faulty is set between two check.
Fix it by merging two checks into one. And replace 'need_replace' with
'mreplace' because their values are always the same.
Fixes: ee37d7314a ("md/raid10: Fix raid10 replace hang when new added disk faulty")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230527072218.2365857-2-linan666@huaweicloud.com
When recovery is interrupted (reboot, etc.) check for MD_RECOVERY_RUNNING
is not enough to tell recovery is in progress. Also check recovery_cp
before starting reshape.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529133410.2125914-1-yukuai1@huaweicloud.com
Currently, there are many places that md_thread can be accessed without
protection, following are known scenarios that can cause
null-ptr-dereference or uaf:
1) sync_thread that is allocated and started from md_start_sync()
2) mddev->thread can be accessed directly from timeout_store() and
md_bitmap_daemon_work()
3) md_unregister_thread() from action_store().
Currently, a global spinlock 'pers_lock' is borrowed to protect
'mddev->thread' in some places, this problem can be fixed likewise,
however, use a global lock for all the cases is not good.
Fix this problem by protecting all md_thread with rcu.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523021017.3048783-6-yukuai1@huaweicloud.com
Register/unregister 'mddev->thread' are both under 'reconfig_mutex',
however, some context didn't hold the mutex to access mddev->thread,
which can cause null-ptr-deference:
1) md_bitmap_daemon_work() can be called from md_check_recovery() where
'reconfig_mutex' is not held, deference 'mddev->thread' might cause
null-ptr-deference, because md_unregister_thread() reset the pointer
before stopping the thread.
2) timeout_store() access 'mddev->thread' multiple times,
null-ptr-deference can be triggered if 'mddev->thread' is reset in the
middle.
This patch factor out a helper to set timeout, the new helper always
check if 'mddev->thread' is null first, so that problem 1 can be fixed.
Now that this helper only access 'mddev->thread' once, but it's possible
that 'mddev->thread' can be freed while this helper is still in progress,
hence the problem is not fixed yet. Follow up patches will fix this by
protecting md_thread with rcu.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523021017.3048783-5-yukuai1@huaweicloud.com
md_wakeup_thread() can handle the case that pass in md_thread is NULL,
the only difference is that md_wakeup_thread() will be called when
current timeout is 'MAX_SCHEDULE_TIMEOUT', this should not matter
because timeout_store() is not hot path, and the daemon process is
woke up more than demand from other context already.
Prepare to factor out a helper to set timeout.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523021017.3048783-4-yukuai1@huaweicloud.com
md_wakeup_thread() can't wakeup md_thread->tsk if md_thread->run is
still in progress, and in some cases md_thread->tsk need to be woke up
directly, like md_set_readonly() and do_md_stop().
Commit 9dfbdafda3 ("md: unlock mddev before reap sync_thread in
action_store") introduce a new scenario where unregister sync_thread is
not protected by 'reconfig_mutex', this can cause null-ptr-deference in
theroy:
t1: md_set_readonly t2: action_store
md_unregister_thread
// 'reconfig_mutex' is not held
// 'reconfig_mutex' is held by caller
if (mddev->sync_thread)
thread = *threadp
*threadp = NULL
wake_up_process(mddev->sync_thread->tsk)
// null-ptr-deference
Fix this problem by factoring out a helper to wake up md_thread directly,
so that 'sync_thread' won't be accessed multiple times from the reader
side. This helper also prepare to protect md_thread with rcu.
Noted that later patches is going to fix that unregister sync_thread is
not protected by 'reconfig_mutex' from action_store().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523021017.3048783-2-yukuai1@huaweicloud.com
Commit 5792a2856a ("[PATCH] md: avoid a deadlock when removing a device
from an md array via sysfs") delays the deletion of rdev, however, this
introduces a window that rdev can be added again while the deletion is
not done yet, and sysfs will complain about duplicate filename.
Follow up patches try to fix this problem by flushing workqueue, however,
flush_rdev_wq() is just dead code, the progress in
md_kick_rdev_from_array():
1) list_del_rcu(&rdev->same_set);
2) synchronize_rcu();
3) queue_work(md_rdev_misc_wq, &rdev->del_work);
So in flush_rdev_wq(), if rdev is found in the list, work_pending() can
never pass, in the meantime, if work is queued, then rdev can never be
found in the list.
flush_rdev_wq() can be replaced by flush_workqueue() directly, however,
this approach is not good:
- the workqueue is global, this synchronization for all raid disks is
not necessary.
- flush_workqueue can't be called under 'reconfig_mutex', there is still
a small window between flush_workqueue() and mddev_lock() that other
contexts can queue new work, hence the problem is not solved completely.
sysfs already has apis to support delete itself through writer, and
these apis, specifically sysfs_break/unbreak_active_protection(), is used
to support deleting rdev synchronously. Therefore, the above commit can be
reverted, and sysfs duplicate filename can be avoided.
A new mdadm regression test is proposed as well([1]).
[1] https://lore.kernel.org/linux-raid/20230428062845.1975462-1-yukuai1@huaweicloud.com/
Fixes: 5792a2856a ("[PATCH] md: avoid a deadlock when removing a device from an md array via sysfs")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523012727.3042247-1-yukuai1@huaweicloud.com
There is no input check when echo md/max_read_errors and overflow might
occur. Add check of input number.
Fixes: 1e50915fe0 ("raid: improve MD/raid10 handling of correctable read errors.")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230522072535.1523740-3-linan666@huaweicloud.com
There is no input check when echo md/safe_mode_delay in safe_delay_store().
And msec might also overflow when HZ < 1000 in safe_delay_show(), Fix it by
checking overflow in safe_delay_store() and use unsigned long conversion in
safe_delay_show().
Fixes: 72e02075a3 ("md: factor out parsing of fixed-point numbers")
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230522072535.1523740-2-linan666@huaweicloud.com
If reshape is in progress and io across reshape_position is issued, such
io will wait for reshape to make progress(see details in the case that
make_stripe_request() return STRIPE_SCHEDULE_AND_RETRY).
It has been reported several times that if system reboot while growing
raid5 to raid6, array assemble will hang infinitely([1, 2]). This is
because following deadlock is triggered:
1) a normal io is waiting for reshape to progress, this io can be from
system-udevd or mdadm.
2) while assemble, mdadm tries to suspend the array, hence
'reconfig_mutex' is held and mddev_suspend() must wait for normal io
to be done.
3) daemon thread can't start reshape because 'reconfig_mutex' can't be
held.
1) and 3) is unbreakable because they're foundation design. In order to
break 2), following is possible solutions that I can think of:
a) Let mddev_suspend() fail is not a good option, because this will
break many scenarios since mddev_suspend() doesn't fail before.
b) Fail the io that is waiting for reshape to make progress from
mddev_suspend().
c) Return false for the io that is waiting for reshape to make
progress from raid5_make_request(), and these io will wait for
suspend to be done in md_handle_request(), where 'active_io' is
not grabbed.
c) sounds better than b), however, b) is used because it's easy and
straightforward, and it's verified that mdadm can assemble in this case.
On the other hand, c) breaks the logic that mddev_suspend() will wait
for submitted io to be completely handled.
Fix the problem by checking reshape in mddev_suspend(), if reshape can't
make progress and there are still some io waiting for reshape, fail
those io.
[1] https://lore.kernel.org/all/CAFig2csUV2QiomUhj_t3dPOgV300dbQ6XtM9ygKPdXJFSH__Nw@mail.gmail.com/
[2] https://lore.kernel.org/all/CAO2ABipzbw6QL5eNa44CQHjiVa-LTvS696Mh9QaTw+qsUKFUCw@mail.gmail.com/
Reported-by: Jove <jovetoo@gmail.com>
Reported-by: David Gilmour <dgilmour76@gmail.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230512015610.821290-6-yukuai1@huaweicloud.com
If reshape is interrupted(for example, echo frozen to sync_action), then
rdev replacement can be set. It's safe because reshape is always prior to
resync in md_check_recovery(). However, if system reboots, then kernel will
complain cannot handle concurrent replacement and reshape and this array
is not able to assemble anymore.
Fix this problem by don't allow replacement until reshape is done.
Reported-by: Peter Neuwirth <reddunur@online.de>
Link: https://lore.kernel.org/linux-raid/e2f96772-bfbc-f43b-6da1-f520e5164536@online.de/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230512015610.821290-2-yukuai1@huaweicloud.com
If we write a large number to md/bitmap_set_bits, md_bitmap_checkpage()
will return -EINVAL because 'page >= bitmap->pages', but the return value
was not checked immediately in md_bitmap_get_counter() in order to set
*blocks value and slab-out-of-bounds occurs.
Move check of 'page >= bitmap->pages' to md_bitmap_get_counter() and
return directly if true.
Fixes: ef42567335 ("md/bitmap: optimise scanning of empty bitmaps.")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230515134808.3936750-2-linan666@huaweicloud.com
The only overlap between the block open flags mapped into the fmode_t and
other uses of fmode_t are FMODE_READ and FMODE_WRITE. Define a new
blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and
->ioctl and stop abusing fmode_t.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd]
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The current interface for exclusive opens is rather confusing as it
requires both the FMODE_EXCL flag and a holder. Remove the need to pass
FMODE_EXCL and just key off the exclusive open off a non-NULL holder.
For blkdev_put this requires adding the holder argument, which provides
better debug checking that only the holder actually releases the hold,
but at the same time allows removing the now superfluous mode argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
sb is just an on-stack pointer that can easily be reused by other calls.
Switch to use the bcache-wide bcache_kobj instead as there is no need to
claim per-bcache device anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20230608110258.189493-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The mode argument to the ->release block_device_operation is never used,
so remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
->open is only called on the whole device. Make that explicit by
passing a gendisk instead of the block_device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bdev_check_media_change should only ever be called for the whole device.
Pass a gendisk to make that explicit and rename the function to
disk_check_media_change.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
MAX_CIPHER_BLOCKSIZE is an internal implementation detail and should
not be relied on by users of the Crypto API.
Instead of storing the IV on the stack, allocate it together with
the crypto request.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>