commit b55c3cd102 upstream.
We capture a NULL pointer issue when resizing a corrupt ext4 image which
is freshly clear resize_inode feature (not run e2fsck). It could be
simply reproduced by following steps. The problem is because of the
resize_inode feature was cleared, and it will convert the filesystem to
meta_bg mode in ext4_resize_fs(), but the es->s_reserved_gdt_blocks was
not reduced to zero, so could we mistakenly call reserve_backup_gdb()
and passing an uninitialized resize_inode to it when adding new group
descriptors.
mkfs.ext4 /dev/sda 3G
tune2fs -O ^resize_inode /dev/sda #forget to run requested e2fsck
mount /dev/sda /mnt
resize2fs /dev/sda 8G
========
BUG: kernel NULL pointer dereference, address: 0000000000000028
CPU: 19 PID: 3243 Comm: resize2fs Not tainted 5.18.0-rc7-00001-gfde086c5ebfd #748
...
RIP: 0010:ext4_flex_group_add+0xe08/0x2570
...
Call Trace:
<TASK>
ext4_resize_fs+0xbec/0x1660
__ext4_ioctl+0x1749/0x24e0
ext4_ioctl+0x12/0x20
__x64_sys_ioctl+0xa6/0x110
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f2dd739617b
========
The fix is simple, add a check in ext4_resize_begin() to make sure that
the es->s_reserved_gdt_blocks is zero when the resize_inode feature is
disabled.
Cc: stable@kernel.org
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220601092717.763694-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bc75a6eb85 upstream.
Since dx_make_map() may return -EFSCORRUPTED now, so change "count" to
be a signed integer so we can correctly check for an error code returned
by dx_make_map().
Fixes: 46c116b920 ("ext4: verify dir block before splitting it")
Cc: stable@kernel.org
Signed-off-by: Ding Xiang <dingxiang@cmss.chinamobile.com>
Link: https://lore.kernel.org/r/20220530100047.537598-1-dingxiang@cmss.chinamobile.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a08f789d2a upstream.
Hulk Robot reported a BUG_ON:
==================================================================
kernel BUG at fs/ext4/mballoc.c:3211!
[...]
RIP: 0010:ext4_mb_mark_diskspace_used.cold+0x85/0x136f
[...]
Call Trace:
ext4_mb_new_blocks+0x9df/0x5d30
ext4_ext_map_blocks+0x1803/0x4d80
ext4_map_blocks+0x3a4/0x1a10
ext4_writepages+0x126d/0x2c30
do_writepages+0x7f/0x1b0
__filemap_fdatawrite_range+0x285/0x3b0
file_write_and_wait_range+0xb1/0x140
ext4_sync_file+0x1aa/0xca0
vfs_fsync_range+0xfb/0x260
do_fsync+0x48/0xa0
[...]
==================================================================
Above issue may happen as follows:
-------------------------------------
do_fsync
vfs_fsync_range
ext4_sync_file
file_write_and_wait_range
__filemap_fdatawrite_range
do_writepages
ext4_writepages
mpage_map_and_submit_extent
mpage_map_one_extent
ext4_map_blocks
ext4_mb_new_blocks
ext4_mb_normalize_request
>>> start + size <= ac->ac_o_ex.fe_logical
ext4_mb_regular_allocator
ext4_mb_simple_scan_group
ext4_mb_use_best_found
ext4_mb_new_preallocation
ext4_mb_new_inode_pa
ext4_mb_use_inode_pa
>>> set ac->ac_b_ex.fe_len <= 0
ext4_mb_mark_diskspace_used
>>> BUG_ON(ac->ac_b_ex.fe_len <= 0);
we can easily reproduce this problem with the following commands:
`fallocate -l100M disk`
`mkfs.ext4 -b 1024 -g 256 disk`
`mount disk /mnt`
`fsstress -d /mnt -l 0 -n 1000 -p 1`
The size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP.
Therefore, "start + size <= ac->ac_o_ex.fe_logical" may occur
when the size is truncated. So start should be the start position of
the group where ac_o_ex.fe_logical is located after alignment.
In addition, when the value of fe_logical or EXT4_BLOCKS_PER_GROUP
is very large, the value calculated by start_off is more accurate.
Cc: stable@kernel.org
Fixes: cd648b8a8f ("ext4: trim allocation requests to group size")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220528110017.354175-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9b6641dd95 upstream.
We got issue as follows:
[home]# mount /dev/sda test
EXT4-fs (sda): warning: mounting fs with errors, running e2fsck is recommended
[home]# dmesg
EXT4-fs (sda): warning: mounting fs with errors, running e2fsck is recommended
EXT4-fs (sda): Errors on filesystem, clearing orphan list.
EXT4-fs (sda): recovery complete
EXT4-fs (sda): mounted filesystem with ordered data mode. Quota mode: none.
[home]# debugfs /dev/sda
debugfs 1.46.5 (30-Dec-2021)
Checksum errors in superblock! Retrying...
Reason is ext4_orphan_cleanup will reset ‘s_last_orphan’ but not update
super block checksum.
To solve above issue, defer update super block checksum after
ext4_orphan_cleanup.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220525012904.1604737-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5f41fdaea6 upstream.
Make the test_dummy_encryption mount option require that the encrypt
feature flag be already enabled on the filesystem, rather than
automatically enabling it. Practically, this means that "-O encrypt"
will need to be included in MKFS_OPTIONS when running xfstests with the
test_dummy_encryption mount option. (ext4/053 also needs an update.)
Moreover, as long as the preconditions for test_dummy_encryption are
being tightened anyway, take the opportunity to start rejecting it when
!CONFIG_FS_ENCRYPTION rather than ignoring it.
The motivation for requiring the encrypt feature flag is that:
- Having the filesystem auto-enable feature flags is problematic, as it
bypasses the usual sanity checks. The specific issue which came up
recently is that in kernel versions where ext4 supports casefold but
not encrypt+casefold (v5.1 through v5.10), the kernel will happily add
the encrypt flag to a filesystem that has the casefold flag, making it
unmountable -- but only for subsequent mounts, not the initial one.
This confused the casefold support detection in xfstests, causing
generic/556 to fail rather than be skipped.
- The xfstests-bld test runners (kvm-xfstests et al.) already use the
required mkfs flag, so they will not be affected by this change. Only
users of test_dummy_encryption alone will be affected. But, this
option has always been for testing only, so it should be fine to
require that the few users of this option update their test scripts.
- f2fs already requires it (for its equivalent feature flag).
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Link: https://lore.kernel.org/r/20220519204437.61645-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3ba733f879 upstream.
A maliciously corrupted filesystem can contain cycles in the h-tree
stored inside a directory. That can easily lead to the kernel corrupting
tree nodes that were already verified under its hands while doing a node
split and consequently accessing unallocated memory. Fix the problem by
verifying traversed block numbers are unique.
Cc: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220518093332.13986-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 46c116b920 upstream.
Before splitting a directory block verify its directory entries are sane
so that the splitting code does not access memory it should not.
Cc: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220518093332.13986-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d36f6ed761 upstream.
Hulk Robot reported a BUG_ON:
==================================================================
kernel BUG at fs/ext4/extents_status.c:199!
[...]
RIP: 0010:ext4_es_end fs/ext4/extents_status.c:199 [inline]
RIP: 0010:__es_tree_search+0x1e0/0x260 fs/ext4/extents_status.c:217
[...]
Call Trace:
ext4_es_cache_extent+0x109/0x340 fs/ext4/extents_status.c:766
ext4_cache_extents+0x239/0x2e0 fs/ext4/extents.c:561
ext4_find_extent+0x6b7/0xa20 fs/ext4/extents.c:964
ext4_ext_map_blocks+0x16b/0x4b70 fs/ext4/extents.c:4384
ext4_map_blocks+0xe26/0x19f0 fs/ext4/inode.c:567
ext4_getblk+0x320/0x4c0 fs/ext4/inode.c:980
ext4_bread+0x2d/0x170 fs/ext4/inode.c:1031
ext4_quota_read+0x248/0x320 fs/ext4/super.c:6257
v2_read_header+0x78/0x110 fs/quota/quota_v2.c:63
v2_check_quota_file+0x76/0x230 fs/quota/quota_v2.c:82
vfs_load_quota_inode+0x5d1/0x1530 fs/quota/dquot.c:2368
dquot_enable+0x28a/0x330 fs/quota/dquot.c:2490
ext4_quota_enable fs/ext4/super.c:6137 [inline]
ext4_enable_quotas+0x5d7/0x960 fs/ext4/super.c:6163
ext4_fill_super+0xa7c9/0xdc00 fs/ext4/super.c:4754
mount_bdev+0x2e9/0x3b0 fs/super.c:1158
mount_fs+0x4b/0x1e4 fs/super.c:1261
[...]
==================================================================
Above issue may happen as follows:
-------------------------------------
ext4_fill_super
ext4_enable_quotas
ext4_quota_enable
ext4_iget
__ext4_iget
ext4_ext_check_inode
ext4_ext_check
__ext4_ext_check
ext4_valid_extent_entries
Check for overlapping extents does't take effect
dquot_enable
vfs_load_quota_inode
v2_check_quota_file
v2_read_header
ext4_quota_read
ext4_bread
ext4_getblk
ext4_map_blocks
ext4_ext_map_blocks
ext4_find_extent
ext4_cache_extents
ext4_es_cache_extent
ext4_es_cache_extent
__es_tree_search
ext4_es_end
BUG_ON(es->es_lblk + es->es_len < es->es_lblk)
The error ext4 extents is as follows:
0af3 0300 0400 0000 00000000 extent_header
00000000 0100 0000 12000000 extent1
00000000 0100 0000 18000000 extent2
02000000 0400 0000 14000000 extent3
In the ext4_valid_extent_entries function,
if prev is 0, no error is returned even if lblock<=prev.
This was intended to skip the check on the first extent, but
in the error image above, prev=0+1-1=0 when checking the second extent,
so even though lblock<=prev, the function does not return an error.
As a result, bug_ON occurs in __es_tree_search and the system panics.
To solve this problem, we only need to check that:
1. The lblock of the first extent is not less than 0.
2. The lblock of the next extent is not less than
the next block of the previous extent.
The same applies to extent_idx.
Cc: stable@kernel.org
Fixes: 5946d08937 ("ext4: check for overlapping extents in ext4_valid_extent_entries()")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220518120816.1541863-1-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c878bea3c9 upstream.
The EXT4_FC_REPLAY bit in sbi->s_mount_state is used to indicate that
we are in the middle of replay the fast commit journal. This was
actually a mistake, since the sbi->s_mount_info is initialized from
es->s_state. Arguably s_mount_state is misleadingly named, but the
name is historical --- s_mount_state and s_state dates back to ext2.
What should have been used is the ext4_{set,clear,test}_mount_flag()
inline functions, which sets EXT4_MF_* bits in sbi->s_mount_flags.
The problem with using EXT4_FC_REPLAY is that a maliciously corrupted
superblock could result in EXT4_FC_REPLAY getting set in
s_mount_state. This bypasses some sanity checks, and this can trigger
a BUG() in ext4_es_cache_extent(). As a easy-to-backport-fix, filter
out the EXT4_FC_REPLAY bit for now. We should eventually transition
away from EXT4_FC_REPLAY to something like EXT4_MF_REPLAY.
Cc: stable@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20220420192312.1655305-1-phind.uet@gmail.com
Link: https://lore.kernel.org/r/20220517174028.942119-1-tytso@mit.edu
Reported-by: syzbot+c7358a3cd05ee786eb31@syzkaller.appspotmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d63c00ea43 upstream.
Otherwise nonaligned fstrim calls will works inconveniently for iterative
scanners, for example:
// trim [0,16MB] for group-1, but mark full group as trimmed
fstrim -o $((1024*1024*128)) -l $((1024*1024*16)) ./m
// handle [16MB,16MB] for group-1, do nothing because group already has the flag.
fstrim -o $((1024*1024*144)) -l $((1024*1024*16)) ./m
[ Update function documentation for ext4_trim_all_free -- TYT ]
Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Link: https://lore.kernel.org/r/1650214995-860245-1-git-send-email-dmtrmonakhov@yandex-team.ru
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit cb8435dc8b ]
The 'commit' option is only applicable for ext3 and ext4 filesystems,
and has never been accepted by the ext2 filesystem driver, so the ext4
driver shouldn't allow it on ext2 filesystems.
This fixes a failure in xfstest ext4/053.
Fixes: 8dc0aa8cf0 ("ext4: check incompatible mount options while mounting ext2/3")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20220510183232.172615-1-ebiggers@kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 4fdccaa0d1 upstream
Add a done_before argument to iomap_dio_rw that indicates how much of
the request has already been transferred. When the request succeeds, we
report that done_before additional bytes were tranferred. This is
useful for finishing a request asynchronously when part of the request
has already been completed synchronously.
We'll use that to allow iomap_dio_rw to be used with page faults
disabled: when a page fault occurs while submitting a request, we
synchronously complete the part of the request that has already been
submitted. The caller can then take care of the page fault and call
iomap_dio_rw again for the rest of the request, passing in the number of
bytes already tranferred.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 85d825dbf4 upstream.
If the file system does not use bigalloc, calculating the overhead is
cheap, so force the recalculation of the overhead so we don't have to
trust the precalculated overhead in the superblock.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 10b01ee92d upstream.
The kernel calculation was underestimating the overhead by not taking
into account the reserved gdt blocks. With this change, the overhead
calculated by the kernel matches the overhead calculation in mke2fs.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a2b0b205d1 upstream.
We got issue as follows:
[home]# fsck.ext4 -fn ram0yb
e2fsck 1.45.6 (20-Mar-2020)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Symlink /p3/d14/d1a/l3d (inode #3494) is invalid.
Clear? no
Entry 'l3d' in /p3/d14/d1a (3383) has an incorrect filetype (was 7, should be 0).
Fix? no
As the symlink file size does not match the file content. If the writeback
of the symlink data block failed, ext4_finish_bio() handles the end of IO.
However this function fails to mark the buffer with BH_write_io_error and
so when unmount does journal checkpoint it cannot detect the writeback
error and will cleanup the journal. Thus we've lost the correct data in the
journal area. To solve this issue, mark the buffer as BH_write_io_error in
ext4_finish_bio().
Cc: stable@kernel.org
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220321144438.201685-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ad5cd4f4ee upstream.
Since the initial introduction of (posix) fallocate back at the turn of
the century, it has been possible to use this syscall to change the
user-visible contents of files. This can happen by extending the file
size during a preallocation, or through any of the newer modes (punch,
zero, collapse, insert range). Because the call can be used to change
file contents, we should treat it like we do any other modification to a
file -- update the mtime, and drop set[ug]id privileges/capabilities.
The VFS function file_modified() does all this for us if pass it a
locked inode, so let's make fallocate drop permissions correctly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20220308185043.GA117678@magnolia
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit cc5095747e ]
[un]pin_user_pages_remote is dirtying pages without properly warning
the file system in advance. A related race was noted by Jan Kara in
2018[1]; however, more recently instead of it being a very hard-to-hit
race, it could be reliably triggered by process_vm_writev(2) which was
discovered by Syzbot[2].
This is technically a bug in mm/gup.c, but arguably ext4 is fragile in
that if some other kernel subsystem dirty pages without properly
notifying the file system using page_mkwrite(), ext4 will BUG, while
other file systems will not BUG (although data will still be lost).
So instead of crashing with a BUG, issue a warning (since there may be
potential data loss) and just mark the page as clean to avoid
unprivileged denial of service attacks until the problem can be
properly fixed. More discussion and background can be found in the
thread starting at [2].
[1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
[2] https://lore.kernel.org/r/Yg0m6IjcNmfaSokM@google.com
Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a8db@syzkaller.appspotmail.com
Reported-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/YiDS9wVfq4mM2jGK@mit.edu
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bfdc502a4a ]
In case of flex_bg feature (which is by default enabled), extents for
any given inode might span across blocks from two different block group.
ext4_mb_mark_bb() only reads the buffer_head of block bitmap once for the
starting block group, but it fails to read it again when the extent length
boundary overflows to another block group. Then in this below loop it
accesses memory beyond the block group bitmap buffer_head and results
into a data abort.
for (i = 0; i < clen; i++)
if (!mb_test_bit(blkoff + i, bitmap_bh->b_data) == !state)
already++;
This patch adds this functionality for checking block group boundary in
ext4_mb_mark_bb() and update the buffer_head(bitmap_bh) for every different
block group.
w/o this patch, I was easily able to hit a data access abort using Power platform.
<...>
[ 74.327662] EXT4-fs error (device loop3): ext4_mb_generate_buddy:1141: group 11, block bitmap and bg descriptor inconsistent: 21248 vs 23294 free clusters
[ 74.533214] EXT4-fs (loop3): shut down requested (2)
[ 74.536705] Aborting journal on device loop3-8.
[ 74.702705] BUG: Unable to handle kernel data access on read at 0xc00000005e980000
[ 74.703727] Faulting instruction address: 0xc0000000007bffb8
cpu 0xd: Vector: 300 (Data Access) at [c000000015db7060]
pc: c0000000007bffb8: ext4_mb_mark_bb+0x198/0x5a0
lr: c0000000007bfeec: ext4_mb_mark_bb+0xcc/0x5a0
sp: c000000015db7300
msr: 800000000280b033
dar: c00000005e980000
dsisr: 40000000
current = 0xc000000027af6880
paca = 0xc00000003ffd5200 irqmask: 0x03 irq_happened: 0x01
pid = 5167, comm = mount
<...>
enter ? for help
[c000000015db7380] c000000000782708 ext4_ext_clear_bb+0x378/0x410
[c000000015db7400] c000000000813f14 ext4_fc_replay+0x1794/0x2000
[c000000015db7580] c000000000833f7c do_one_pass+0xe9c/0x12a0
[c000000015db7710] c000000000834504 jbd2_journal_recover+0x184/0x2d0
[c000000015db77c0] c000000000841398 jbd2_journal_load+0x188/0x4a0
[c000000015db7880] c000000000804de8 ext4_fill_super+0x2638/0x3e10
[c000000015db7a40] c0000000005f8404 get_tree_bdev+0x2b4/0x350
[c000000015db7ae0] c0000000007ef058 ext4_get_tree+0x28/0x40
[c000000015db7b00] c0000000005f6344 vfs_get_tree+0x44/0x100
[c000000015db7b70] c00000000063c408 path_mount+0xdd8/0xe70
[c000000015db7c40] c00000000063c8f0 sys_mount+0x450/0x550
[c000000015db7d50] c000000000035770 system_call_exception+0x4a0/0x4e0
[c000000015db7e10] c00000000000c74c system_call_common+0xec/0x250
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/2609bc8f66fc15870616ee416a18a3d392a209c4.1644992609.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a5c0e2fdf7 ]
ext4_mb_mark_bb() currently wrongly calculates cluster len (clen) and
flex_group->free_clusters. This patch fixes that.
Identified based on code review of ext4_mb_mark_bb() function.
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/a0b035d536bafa88110b74456853774b64c8ac40.1644992609.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 7aab5c84a0 upstream.
We inject IO error when rmdir non empty direcory, then got issue as follows:
step1: mkfs.ext4 -F /dev/sda
step2: mount /dev/sda test
step3: cd test
step4: mkdir -p 1/2
step5: rmdir 1
[ 110.920551] ext4_empty_dir: inject fault
[ 110.921926] EXT4-fs warning (device sda): ext4_rmdir:3113: inode #12:
comm rmdir: empty directory '1' has too many links (3)
step6: cd ..
step7: umount test
step8: fsck.ext4 -f /dev/sda
e2fsck 1.42.9 (28-Dec-2013)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Entry '..' in .../??? (13) has deleted/unused inode 12. Clear<y>? yes
Pass 3: Checking directory connectivity
Unconnected directory inode 13 (...)
Connect to /lost+found<y>? yes
Pass 4: Checking reference counts
Inode 13 ref count is 3, should be 2. Fix<y>? yes
Pass 5: Checking group summary information
/dev/sda: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda: 12/131072 files (0.0% non-contiguous), 26157/524288 blocks
ext4_rmdir
if (!ext4_empty_dir(inode))
goto end_rmdir;
ext4_empty_dir
bh = ext4_read_dirblock(inode, 0, DIRENT_HTREE);
if (IS_ERR(bh))
return true;
Now if read directory block failed, 'ext4_empty_dir' will return true, assume
directory is empty. Obviously, it will lead to above issue.
To solve this issue, if read directory block failed 'ext4_empty_dir' just
return false. To avoid making things worse when file system is already
corrupted, 'ext4_empty_dir' also return false.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20220228024815.3952506-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit bdc8a53a6f ]
in the follow scenario:
1. jbd start transaction n
2. task A get new handle for transaction n+1
3. task A do some actions and add inode to FC_Q_MAIN fc_q
4. jbd complete transaction n and clear FC_Q_MAIN fc_q
5. task A call fsync
Fast commit will lost the file actions during a full commit.
we should also add updates to staging queue during a full commit.
and in ext4_fc_cleanup(), when reset a inode's fc track range, check
it's i_sync_tid, if it bigger than current transaction tid, do not
rest it, or we will lost the track range.
And EXT4_MF_FC_COMMITTING is not needed anymore, so drop it.
Signed-off-by: Xin Yin <yinxin.x@bytedance.com>
Link: https://lore.kernel.org/r/20220117093655.35160-3-yinxin.x@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e85c81ba88 ]
For the follow scenario:
1. jbd start commit transaction n
2. task A get new handle for transaction n+1
3. task A do some ineligible actions and mark FC_INELIGIBLE
4. jbd complete transaction n and clean FC_INELIGIBLE
5. task A call fsync
In this case fast commit will not fallback to full commit and
transaction n+1 also not handled by jbd.
Make ext4_fc_mark_ineligible() also record transaction tid for
latest ineligible case, when call ext4_fc_cleanup() check
current transaction tid, if small than latest ineligible tid
do not clear the EXT4_MF_FC_INELIGIBLE.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Suggested-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Signed-off-by: Xin Yin <yinxin.x@bytedance.com>
Link: https://lore.kernel.org/r/20220117093655.35160-2-yinxin.x@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0915e464cb ]
Move fast commit stats updating logic to a separate function from
ext4_fc_commit(). This significantly improves readability of
ext4_fc_commit().
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211223202140.2061101-4-harshads@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit cdce59a154 upstream.
Current code does not fully takes care of krealloc() error case, which
could lead to silent memory corruption or a kernel bug. This patch
fixes that.
Also it cleans up some duplicated error handling logic from various
functions in fast_commit.c file.
Reported-by: luo penghao <luo.penghao@zte.com.cn>
Suggested-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/62e8b6a1cce9359682051deb736a3c0953c9d1e9.1642416995.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 897026aaa7 upstream.
While running "./check -I 200 generic/475" it sometimes gives below
kernel BUG(). Ideally we should not call ext4_write_inline_data() if
ext4_create_inline_data() has failed.
<log snip>
[73131.453234] kernel BUG at fs/ext4/inline.c:223!
<code snip>
212 static void ext4_write_inline_data(struct inode *inode, struct ext4_iloc *iloc,
213 void *buffer, loff_t pos, unsigned int len)
214 {
<...>
223 BUG_ON(!EXT4_I(inode)->i_inline_off);
224 BUG_ON(pos + len > EXT4_I(inode)->i_inline_size);
This patch handles the error and prints out a emergency msg saying potential
data loss for the given inode (since we couldn't restore the original
inline_data due to some previous error).
[ 9571.070313] EXT4-fs (dm-0): error restoring inline_data for inode -- potential data loss! (inode 1703982, error -30)
Reported-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/9f4cd7dfd54fa58ff27270881823d94ddf78dd07.1642416995.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 31a074a0c6 upstream.
For now in ext4_mb_new_blocks_simple, if we found a block which
should be excluded then will switch to next group, this may
probably cause 'group' run out of range.
Change to check next block in the same group when get a block should
be excluded. Also change the search range to EXT4_CLUSTERS_PER_GROUP
and add error checking.
Signed-off-by: Xin Yin <yinxin.x@bytedance.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20220110035141.1980-3-yinxin.x@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 599ea31d13 upstream.
During fast commit replay procedure, we clear inode blocks bitmap in
ext4_ext_clear_bb(), this may cause ext4_mb_new_blocks_simple() allocate
blocks still in use.
Make ext4_fc_record_regions() also record physical disk regions used by
inodes during replay procedure. Then ext4_mb_new_blocks_simple() can
excludes these blocks in use.
Signed-off-by: Xin Yin <yinxin.x@bytedance.com>
Link: https://lore.kernel.org/r/20220110035141.1980-2-yinxin.x@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6eeaf88fd5 upstream.
We probably want to remove the indirect block to extents migration
feature after a deprecation window, but until then, let's fix a
potential data loss problem caused by the fact that we put the
tmp_inode on the orphan list. In the unlikely case where we crash and
do a journal recovery, the data blocks belonging to the inode being
migrated are also represented in the tmp_inode on the orphan list ---
and so its data blocks will get marked unallocated, and available for
reuse.
Instead, stop putting the tmp_inode on the oprhan list. So in the
case where we crash while migrating the inode, we'll leak an inode,
which is not a disaster. It will be easily fixed the next time we run
fsck, and it's better than potentially having blocks getting claimed
by two different files, and losing data as a result.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5c48a7df91 upstream.
Our syzkaller report an use-after-free issue that accessing the freed
buffer_head on the writeback page in __ext4_journalled_writepage(). The
problem is that if there was a truncate racing with the data=journalled
writeback procedure, the writeback length could become zero and
bget_one() refuse to get buffer_head's refcount, then the truncate
procedure release buffer once we drop page lock, finally, the last
ext4_walk_page_buffers() trigger the use-after-free problem.
sync truncate
ext4_sync_file()
file_write_and_wait_range()
ext4_setattr(0)
inode->i_size = 0
ext4_writepage()
len = 0
__ext4_journalled_writepage()
page_bufs = page_buffers(page)
ext4_walk_page_buffers(bget_one) <- does not get refcount
do_invalidatepage()
free_buffer_head()
ext4_walk_page_buffers(page_bufs) <- trigger use-after-free
After commit bdf96838ae ("ext4: fix race between truncate and
__ext4_journalled_writepage()"), we have already handled the racing
case, so the bget_one() and bput_one() are not needed. So this patch
simply remove these hunk, and recheck the i_size to make it safe.
Fixes: bdf96838ae ("ext4: fix race between truncate and __ext4_journalled_writepage()")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211225090937.712867-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9725958bb7 upstream.
If use FALLOC_FL_KEEP_SIZE to alloc unwritten range at bottom, the
inode->i_size will not include the unwritten range. When call
ftruncate with fast commit enabled, it will miss to track the
unwritten range.
Change to trace the full range during ftruncate.
Signed-off-by: Xin Yin <yinxin.x@bytedance.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211223032337.5198-3-yinxin.x@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0b5b5a62b9 upstream.
For now ,we use ext4_punch_hole() during fast commit replay delete range
procedure. But it will be affected by inode->i_size, which may not
correct during fast commit replay procedure. The following test will
failed.
-create & write foo (len 1000K)
-falloc FALLOC_FL_ZERO_RANGE foo (range 400K - 600K)
-create & fsync bar
-falloc FALLOC_FL_PUNCH_HOLE foo (range 300K-500K)
-fsync foo
-crash before a full commit
After the fast_commit reply procedure, the range 400K-500K will not be
removed. Because in this case, when calling ext4_punch_hole() the
inode->i_size is 0, and it just retruns with doing nothing.
Change to use ext4_ext_remove_space() instead of ext4_punch_hole()
to remove blocks of inode directly.
Signed-off-by: Xin Yin <yinxin.x@bytedance.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211223032337.5198-2-yinxin.x@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e81c9302a6 upstream.
When migrating to extents, the temporary inode will have it's own checksum
seed. This means that, when swapping the inodes data, the inode checksums
will be incorrect.
This can be fixed by recalculating the extents checksums again. Or simply
by copying the seed into the temporary inode.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=213357
Reported-by: Jeroen van Wolffelaar <jeroen@wolffelaar.nl>
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Link: https://lore.kernel.org/r/20211214175058.19511-1-lhenriques@suse.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5e4d0eba1c upstream.
when call falloc with FALLOC_FL_ZERO_RANGE, to set an range to unwritten,
which has been already initialized. If the range is align to blocksize,
fast commit will not track range for this change.
Also track range for unwritten range in ext4_map_blocks().
Signed-off-by: Xin Yin <yinxin.x@bytedance.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211221022839.374606-1-yinxin.x@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c27c29c6af upstream.
It is not guaranteed that __ext4_get_inode_loc will definitely set
err_blk pointer when it returns EIO. To avoid using uninitialized
variables, let's first set err_blk to 0.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211201163421.2631661-1-harshads@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8c80fb312d upstream.
We found on older kernel (3.10) that in the scenario of insufficient
disk space, system may trigger an ABBA deadlock problem, it seems that
this problem still exists in latest kernel, try to fix it here. The
main process triggered by this problem is that task A occupies the PA
and waits for the jbd2 transaction finish, the jbd2 transaction waits
for the completion of task B's IO (plug_list), but task B waits for
the release of PA by task A to finish discard, which indirectly forms
an ABBA deadlock. The related calltrace is as follows:
Task A
vfs_write
ext4_mb_new_blocks()
ext4_mb_mark_diskspace_used() JBD2
jbd2_journal_get_write_access() -> jbd2_journal_commit_transaction()
->schedule() filemap_fdatawait()
| |
| Task B |
| do_unlinkat() |
| ext4_evict_inode() |
| jbd2_journal_begin_ordered_truncate() |
| filemap_fdatawrite_range() |
| ext4_mb_new_blocks() |
-ext4_mb_discard_group_preallocations() <-----
Here, try to cancel ext4_mb_discard_group_preallocations() internal
retry due to PA busy, and do a limited number of retries inside
ext4_mb_discard_preallocations(), which can circumvent the above
problems, but also has some advantages:
1. Since the PA is in a busy state, if other groups have free PAs,
keeping the current PA may help to reduce fragmentation.
2. Continue to traverse forward instead of waiting for the current
group PA to be released. In most scenarios, the PA discard time
can be reduced.
However, in the case of smaller free space, if only a few groups have
space, then due to multiple traversals of the group, it may increase
CPU overhead. But in contrast, I feel that the overall benefit is
better than the cost.
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/1637630277-23496-1-git-send-email-brookxu.cn@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 15fc69bbbb upstream.
When we hit an error when enabling quotas and setting inode flags, we do
not properly shutdown quota subsystem despite returning error from
Q_QUOTAON quotactl. This can lead to some odd situations like kernel
using quota file while it is still writeable for userspace. Make sure we
properly cleanup the quota subsystem in case of error.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20211007155336.12493-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4013d47a53 upstream.
When we succeed in enabling some quota type but fail to enable another
one with quota feature, we correctly disable all enabled quota types.
However we forget to reset i_data_sem lockdep class. When the inode gets
freed and reused, it will inherit this lockdep class (i_data_sem is
initialized only when a slab is created) and thus eventually lockdep
barfs about possible deadlocks.
Reported-and-tested-by: syzbot+3b6f9218b1301ddda3e2@syzkaller.appspotmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20211007155336.12493-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>