Commit Graph

855 Commits

Author SHA1 Message Date
Christoph Hellwig
5298d4bfe8 unicode: clean up the Kconfig symbol confusion
Turn the CONFIG_UNICODE symbol into a tristate that generates some always
built in code and remove the confusing CONFIG_UNICODE_UTF8_DATA symbol.

Note that a lot of the IS_ENABLED() checks could be turned from cpp
statements into normal ifs, but this change is intended to be fairly
mechanic, so that should be cleaned up later.

Fixes: 2b3d047870 ("unicode: Add utf8-data module")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
2022-01-20 19:57:24 -05:00
Linus Torvalds
3acbdbf42e dax + libnvdimm for v5.17
- Simplify the dax_operations API
   - Eliminate bdev_dax_pgoff() in favor of the filesystem maintaining
     and applying a partition offset to all its DAX iomap operations.
   - Remove wrappers and device-mapper stacked callbacks for
     ->copy_from_iter() and ->copy_to_iter() in favor of moving
     block_device relative offset responsibility to the
     dax_direct_access() caller.
   - Remove the need for an @bdev in filesystem-DAX infrastructure
   - Remove unused uio helpers copy_from_iter_flushcache() and
     copy_mc_to_iter() as only the non-check_copy_size() versions are
     used for DAX.
 - Prepare XFS for the pending (next merge window) DAX+reflink support
 - Remove deprecated DEV_DAX_PMEM_COMPAT support
 - Cleanup a straggling misuse of the GUID api
 
 Tags offered after the branch was cut:
 Reviewed-by: Mike Snitzer <snitzer@redhat.com>
 Link: https://lore.kernel.org/r/Ydb/3P+8nvjCjYfO@redhat.com
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCYd3dTAAKCRDfioYZHlFs
 Z//UAP9zetoTE+O7zJG7CXja4jSopSadbdbh6QKSXaqfKBPvQQD+N4US3wA2bGv8
 f/qCY62j2Hj3hUTGHs9RvTyw3JsSYAA=
 =QvDs
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull dax and libnvdimm updates from Dan Williams:
 "The bulk of this is a rework of the dax_operations API after
  discovering the obstacles it posed to the work-in-progress DAX+reflink
  support for XFS and other copy-on-write filesystem mechanics.

  Primarily the need to plumb a block_device through the API to handle
  partition offsets was a sticking point and Christoph untangled that
  dependency in addition to other cleanups to make landing the
  DAX+reflink support easier.

  The DAX_PMEM_COMPAT option has been around for 4 years and not only
  are distributions shipping userspace that understand the current
  configuration API, but some are not even bothering to turn this option
  on anymore, so it seems a good time to remove it per the deprecation
  schedule. Recall that this was added after the device-dax subsystem
  moved from /sys/class/dax to /sys/bus/dax for its sysfs organization.
  All recent functionality depends on /sys/bus/dax.

  Some other miscellaneous cleanups and reflink prep patches are
  included as well.

  Summary:

   - Simplify the dax_operations API:

      - Eliminate bdev_dax_pgoff() in favor of the filesystem
        maintaining and applying a partition offset to all its DAX iomap
        operations.

      - Remove wrappers and device-mapper stacked callbacks for
        ->copy_from_iter() and ->copy_to_iter() in favor of moving
        block_device relative offset responsibility to the
        dax_direct_access() caller.

      - Remove the need for an @bdev in filesystem-DAX infrastructure

      - Remove unused uio helpers copy_from_iter_flushcache() and
        copy_mc_to_iter() as only the non-check_copy_size() versions are
        used for DAX.

   - Prepare XFS for the pending (next merge window) DAX+reflink support

   - Remove deprecated DEV_DAX_PMEM_COMPAT support

   - Cleanup a straggling misuse of the GUID api"

* tag 'libnvdimm-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (38 commits)
  iomap: Fix error handling in iomap_zero_iter()
  ACPI: NFIT: Import GUID before use
  dax: remove the copy_from_iter and copy_to_iter methods
  dax: remove the DAXDEV_F_SYNC flag
  dax: simplify dax_synchronous and set_dax_synchronous
  uio: remove copy_from_iter_flushcache() and copy_mc_to_iter()
  iomap: turn the byte variable in iomap_zero_iter into a ssize_t
  memremap: remove support for external pgmap refcounts
  fsdax: don't require CONFIG_BLOCK
  iomap: build the block based code conditionally
  dax: fix up some of the block device related ifdefs
  fsdax: shift partition offset handling into the file systems
  dax: return the partition offset from fs_dax_get_by_bdev
  iomap: add a IOMAP_DAX flag
  xfs: pass the mapping flags to xfs_bmbt_to_iomap
  xfs: use xfs_direct_write_iomap_ops for DAX zeroing
  xfs: move dax device handling into xfs_{alloc,free}_buftarg
  ext4: cleanup the dax handling in ext4_fill_super
  ext2: cleanup the dax handling in ext2_fill_super
  fsdax: decouple zeroing from the iomap buffered I/O code
  ...
2022-01-12 15:46:11 -08:00
xu xin
a2e3965df4 ext4: use BUG_ON instead of if condition followed by BUG
BUG_ON would be better.

This issue was detected with the help of Coccinelle.

Reported-by: Zeal robot <zealci@zte.com.cn>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Link: https://lore.kernel.org/r/20211228073252.580296-1-xu.xin16@zte.com.cn
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10 13:25:56 -05:00
Lukas Czerner
2327fb2e23 ext4: change s_last_trim_minblks type to unsigned long
There is no good reason for the s_last_trim_minblks to be atomic. There is
no data integrity needed and there is no real danger in setting and
reading it in a racy manner. Change it to be unsigned long, the same type
as s_clusters_per_group which is the maximum that's allowed.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Suggested-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20211103145122.17338-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10 13:25:55 -05:00
Lukas Czerner
bbc605cdb1 ext4: implement support for get/set fs label
Implement support for FS_IOC_GETFSLABEL and FS_IOC_SETFSLABEL ioctls for
online reading and setting of file system label.

ext4_ioctl_getlabel() is simple, just get the label from the primary
superblock. This might not be the first sb on the file system if
'sb=' mount option is used.

In ext4_ioctl_setlabel() we update what ext4 currently views as a
primary superblock and then proceed to update backup superblocks. There
are two caveats:
 - the primary superblock might not be the first superblock and so it
   might not be the one used by userspace tools if read directly
   off the disk.
 - because the primary superblock might not be the first superblock we
   potentialy have to update it as part of backup superblock update.
   However the first sb location is a bit more complicated than the rest
   so we have to account for that.

The superblock modification is created generic enough so the
infrastructure can be used for other potential superblock modification
operations, such as chaning UUID.

Tested with generic/492 with various configurations. I also checked the
behavior with 'sb=' mount options, including very large file systems
with and without sparse_super/sparse_super2.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20211213135618.43303-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10 13:25:55 -05:00
Sebastian Andrzej Siewior
ab047d516d ext4: destroy ext4_fc_dentry_cachep kmemcache on module removal
The kmemcache for ext4_fc_dentry_cachep remains registered after module
removal.

Destroy ext4_fc_dentry_cachep kmemcache on module removal.

Fixes: aa75f4d3da ("ext4: main fast-commit commit path")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211110134640.lyku5vklvdndw6uk@linutronix.de
Link: https://lore.kernel.org/r/YbiK3JetFFl08bd7@linutronix.de
Link: https://lore.kernel.org/r/20211223164436.2628390-1-bigeasy@linutronix.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2022-01-10 13:25:54 -05:00
Harshad Shirwadkar
0915e464cb ext4: simplify updating of fast commit stats
Move fast commit stats updating logic to a separate function from
ext4_fc_commit(). This significantly improves readability of
ext4_fc_commit().

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211223202140.2061101-4-harshads@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-12-23 18:13:25 -05:00
Harshad Shirwadkar
7bbbe241ec ext4: drop ineligible txn start stop APIs
This patch drops ext4_fc_start_ineligible() and
ext4_fc_stop_ineligible() APIs. Fast commit ineligible transactions
should simply call ext4_fc_mark_ineligible() after starting the
trasaction.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211223202140.2061101-3-harshads@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-12-23 18:13:25 -05:00
Christoph Hellwig
cd913c76f4 dax: return the partition offset from fs_dax_get_by_bdev
Prepare for the removal of the block_device from the DAX I/O path by
returning the partition offset from fs_dax_get_by_bdev so that the file
systems have it at hand for use during I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-26-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04 08:58:54 -08:00
Xiyu Yang
31d21d219b ext4: convert from atomic_t to refcount_t on ext4_io_end->count
refcount_t type and corresponding API can protect refcounters from
accidental underflow and overflow and further use-after-free situations.

Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/1626674355-55795-1-git-send-email-xiyuyang19@fudan.edu.cn
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-11-04 10:33:24 -04:00
Linus Torvalds
ca3cef466f Fix a number of ext4 bugs in fast_commit, inline data, and delayed
allocation.  Also fix error handling code paths in ext4_dx_readdir()
 and ext4_fill_super().  Finally, avoid a grabbing a journal head in
 the delayed allocation write in the common cases where we are
 overwriting an pre-existing block or appending to an inode.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmFZ2SsACgkQ8vlZVpUN
 gaN6DAgAkIeisL1EfQT0VwshEs8y7N6IoX8dydLSRLpNf5oWiJOv2CTY9Qpi6X/C
 qNfuLsbJ2NXChvhIAM2hD82hvX21rYc6iqPxgho02VF4eYIP7NzLjwTFKnKbHPB5
 TiF498nJTnkcmSrJUEXmSAEdLoCwa5THH9+9HVHXZrkLXPULBtOOJ85mDAcIzVhV
 Zqb7yfbpWl0gnF0S0YjNATPtbhcC9EiC4MOVYVesRlgT9B3+k5q4fmVU0euTU9OH
 F2H6TNG+Mg/19gTnDP5acB9+eXHvYEqMpe+CaDifR9iFE9PTG/Edhxr6z9roXhHr
 kBvEVHSFH+YTEJXghnpS9YDd9Lwc9w==
 =WKzd
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix a number of ext4 bugs in fast_commit, inline data, and delayed
  allocation.

  Also fix error handling code paths in ext4_dx_readdir() and
  ext4_fill_super().

  Finally, avoid a grabbing a journal head in the delayed allocation
  write in the common cases where we are overwriting a pre-existing
  block or appending to an inode"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: recheck buffer uptodate bit under buffer lock
  ext4: fix potential infinite loop in ext4_dx_readdir()
  ext4: flush s_error_work before journal destroy in ext4_fill_super
  ext4: fix loff_t overflow in ext4_max_bitmap_size()
  ext4: fix reserved space counter leakage
  ext4: limit the number of blocks in one ADD_RANGE TLV
  ext4: enforce buffer head state assertion in ext4_da_map_blocks
  ext4: remove extent cache entries when truncating inline data
  ext4: drop unnecessary journal handle in delalloc write
  ext4: factor out write end code of inline file
  ext4: correct the error path of ext4_write_inline_data_end()
  ext4: check and update i_disksize properly
  ext4: add error checking to ext4_ext_replay_set_iblocks()
2021-10-03 13:56:53 -07:00
Theodore Ts'o
11ef08c9eb Merge branch 'delalloc-buffer-write' into dev
Fix a bug in how we update i_disksize, and the error path in
inline_data_end.  Finally, drop an unnecessary creation of a journal
handle which was only needed for inline data, which can give us a
large performance gain in delayed allocation writes.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-09-09 10:47:06 -04:00
Zhang Yi
6984aef598 ext4: factor out write end code of inline file
Now that the inline_data file write end procedure are falled into the
common write end functions, it is not clear. Factor them out and do
some cleanup. This patch also drop ext4_da_write_inline_data_end()
and switch to use ext4_write_inline_data_end() instead because we also
need to do the same error processing if we failed to write data into
inline entry.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210716122024.1105856-4-yi.zhang@huawei.com
2021-09-04 23:38:18 -04:00
Linus Torvalds
111c1aa8ca In addition to some ext4 bug fixes and cleanups, this cycle we add the
orphan_file feature, which eliminates bottlenecks when doing a large
 number of parallel truncates and file deletions, and move the discard
 operation out of the jbd2 commit thread when using the discard mount
 option, to better support devices with slow discard operations.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmEw5gEACgkQ8vlZVpUN
 gaMatgf9GKc2H3JUDGJVXrOE1EZWXzyDI+Tv6zt0iTr05zi1ahjGccAbmJXiAwxU
 Zy5TGr53CpPcGUG+sO4NpVzcH8q7cQeG0pVx9OnzJUdVfmv+htoNE0aAqUY5L3vg
 AxV4KPGgxPofRQa3QRE2LDFHIkNs7c0ncprdaAtxNztd09iFo7bIayt614mARK++
 HIO7VOGrH5Wya8SSoqYHmlO0g5viy3ypP6CpysIQw0JifSlHYkmYBUJ0/hwPV/Xl
 WfzmwQ9p43C9EXVmIN4++l674TDzkSn9ebITXOgkq4C8KjnFgyhKQIj5QVj81MvH
 dac5jxsuLTXTLYnRpAQ/duV4jRd+Fw==
 =+NN7
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "In addition to some ext4 bug fixes and cleanups, this cycle we add the
  orphan_file feature, which eliminates bottlenecks when doing a large
  number of parallel truncates and file deletions, and move the discard
  operation out of the jbd2 commit thread when using the discard mount
  option, to better support devices with slow discard operations"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
  ext4: make the updating inode data procedure atomic
  ext4: remove an unnecessary if statement in __ext4_get_inode_loc()
  ext4: move inode eio simulation behind io completeion
  ext4: Improve scalability of ext4 orphan file handling
  ext4: Orphan file documentation
  ext4: Speedup ext4 orphan inode handling
  ext4: Move orphan inode handling into a separate file
  ext4: Support for checksumming from journal triggers
  ext4: fix race writing to an inline_data file while its xattrs are changing
  jbd2: add sparse annotations for add_transaction_credits()
  ext4: fix sparse warnings
  ext4: Make sure quota files are not grabbed accidentally
  ext4: fix e2fsprogs checksum failure for mounted filesystem
  ext4: if zeroout fails fall back to splitting the extent node
  ext4: reduce arguments of ext4_fc_add_dentry_tlv
  ext4: flush background discard kwork when retry allocation
  ext4: get discard out of jbd2 commit kthread contex
  ext4: remove the repeated comment of ext4_trim_all_free
  ext4: add new helper interface ext4_try_to_trim_range()
  ext4: remove the 'group' parameter of ext4_trim_extent
  ...
2021-09-02 09:37:09 -07:00
Jan Kara
4a79a98c7b ext4: Improve scalability of ext4 orphan file handling
Even though the length of the critical section when adding / removing
orphaned inodes was significantly reduced by using orphan file, the
contention of lock protecting orphan file still appears high in profiles
for truncate / unlink intensive workloads with high number of threads.

This patch makes handling of orphan file completely lockless. Also to
reduce conflicts between CPUs different CPUs start searching for empty
slot in orphan file in different blocks.

Performance comparison of locked orphan file handling, lockless orphan
file handling, and completely disabled orphan inode handling
from 80 CPU Xeon Server with 526 GB of RAM, filesystem located on
SAS SSD disk, average of 5 runs:

stress-orphan (microbenchmark truncating files byte-by-byte from N
processes in parallel)

Threads Time            Time            Time
        Orphan locked   Orphan lockless No orphan
  1       0.945600       0.939400        0.891200
  2       1.331800       1.246600        1.174400
  4       1.995000       1.780600        1.713200
  8       6.424200       4.900000        4.106000
 16      14.937600       8.516400        8.138000
 32      33.038200      24.565600       24.002200
 64      60.823600      39.844600       38.440200
128     122.941400      70.950400       69.315000

So we can see that with lockless orphan file handling, addition /
deletion of orphaned inodes got almost completely out of picture even
for a microbenchmark stressing it.

For reaim creat_clo workload on ramdisk there are also noticeable gains
(average of 5 runs):

Clients         Vanilla (ops/s)        Patched (ops/s)
creat_clo-1     14705.88 (   0.00%)    14354.07 *  -2.39%*
creat_clo-3     27108.43 (   0.00%)    28301.89 (   4.40%)
creat_clo-5     37406.48 (   0.00%)    45180.73 *  20.78%*
creat_clo-7     41338.58 (   0.00%)    54687.50 *  32.29%*
creat_clo-9     45226.13 (   0.00%)    62937.07 *  39.16%*
creat_clo-11    44000.00 (   0.00%)    65088.76 *  47.93%*
creat_clo-13    36516.85 (   0.00%)    68661.97 *  88.03%*
creat_clo-15    30864.20 (   0.00%)    69551.78 * 125.35%*
creat_clo-17    27478.45 (   0.00%)    67729.08 * 146.48%*
creat_clo-19    25000.00 (   0.00%)    61621.62 * 146.49%*
creat_clo-21    18772.35 (   0.00%)    63829.79 * 240.02%*
creat_clo-23    16698.94 (   0.00%)    61938.96 * 270.92%*
creat_clo-25    14973.05 (   0.00%)    56947.61 * 280.33%*
creat_clo-27    16436.69 (   0.00%)    65008.03 * 295.51%*
creat_clo-29    13949.01 (   0.00%)    69047.62 * 395.00%*
creat_clo-31    14283.52 (   0.00%)    67982.45 * 375.95%*

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210816095713.16537-5-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-08-30 23:36:51 -04:00
Jan Kara
02f310fcf4 ext4: Speedup ext4 orphan inode handling
Ext4 orphan inode handling is a bottleneck for workloads which heavily
truncate / unlink small files since it contends on the global
s_orphan_mutex lock (and generally it's difficult to improve scalability
of the ondisk linked list of orphaned inodes).

This patch implements new way of handling orphan inodes. Instead of
linking orphaned inode into a linked list, we store it's inode number in
a new special file which we call "orphan file". Only if there's no more
space in the orphan file (too many inodes are currently orphaned) we
fall back to using old style linked list. Currently we protect
operations in the orphan file with a spinlock for simplicity but even in
this setting we can substantially reduce the length of the critical
section and thus speedup some workloads. In the next patch we improve
this by making orphan handling lockless.

Note that the change is backwards compatible when the filesystem is
clean - the existence of the orphan file is a compat feature, we set
another ro-compat feature indicating orphan file needs scanning for
orphaned inodes when mounting filesystem read-write. This ro-compat
feature gets cleared on unmount / remount read-only.

Some performance data from 80 CPU Xeon Server with 512 GB of RAM,
filesystem located on SSD, average of 5 runs:

stress-orphan (microbenchmark truncating files byte-by-byte from N
processes in parallel)

Threads Time            Time
        Vanilla         Patched
  1       1.057200        0.945600
  2       1.680400        1.331800
  4       2.547000        1.995000
  8       7.049400        6.424200
 16      14.827800       14.937600
 32      40.948200       33.038200
 64      87.787400       60.823600
128     206.504000      122.941400

So we can see significant wins all over the board.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210816095713.16537-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-08-30 23:36:51 -04:00
Jan Kara
25c6d98fc4 ext4: Move orphan inode handling into a separate file
Move functions for handling orphan inodes into a new file
fs/ext4/orphan.c to have them in one place and somewhat reduce size of
other files. No code changes.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210816095713.16537-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-08-30 23:36:51 -04:00
Jan Kara
188c299e2a ext4: Support for checksumming from journal triggers
JBD2 layer support triggers which are called when journaling layer moves
buffer to a certain state. We can use the frozen trigger, which gets
called when buffer data is frozen and about to be written out to the
journal, to compute block checksums for some buffer types (similarly as
does ocfs2). This avoids unnecessary repeated recomputation of the
checksum (at the cost of larger window where memory corruption won't be
caught by checksumming) and is even necessary when there are
unsynchronized updaters of the checksummed data.

So add superblock and journal trigger type arguments to
ext4_journal_get_write_access() and ext4_journal_get_create_access() so
that frozen triggers can be set accordingly. Also add inode argument to
ext4_walk_page_buffers() and all the callbacks used with that function
for the same purpose. This patch is mostly only a change of prototype of
the above mentioned functions and a few small helpers. Real checksumming
will come later.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210816095713.16537-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-08-30 23:36:50 -04:00
Wang Jianchao
5036ab8df2 ext4: flush background discard kwork when retry allocation
The background discard kwork tries to mark blocks used and issue
discard. This can make filesystem suffer from NOSPC error, xfstest
generic/371 can fail due to it. Fix it by flushing discard kwork
in ext4_should_retry_alloc. At the same time, give up discard at
the moment.

Signed-off-by: Wang Jianchao <wangjianchao@kuaishou.com>
Link: https://lore.kernel.org/r/20210830075246.12516-6-jianchao.wan9@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-08-30 23:35:53 -04:00
Wang Jianchao
55cdd0af2b ext4: get discard out of jbd2 commit kthread contex
Right now, discard is issued and waited to be completed in jbd2
commit kthread context after the logs are committed. When large
amount of files are deleted and discard is flooding, jbd2 commit
kthread can be blocked for long time. Then all of the metadata
operations can be blocked to wait the log space.

One case is the page fault path with read mm->mmap_sem held, which
wants to update the file time but has to wait for the log space.
When other threads in the task wants to do mmap, then write mmap_sem
is blocked. Finally all of the following read mmap_sem requirements
are blocked, even the ps command which need to read the /proc/pid/
-cmdline. Our monitor service which needs to read /proc/pid/cmdline
used to be blocked for 5 mins.

This patch frees the blocks back to buddy after commit and then do
discard in a async kworker context in fstrim fashion, namely,
 - mark blocks to be discarded as used if they have not been allocated
 - do discard
 - mark them free
After this, jbd2 commit kthread won't be blocked any more by discard
and we won't get NOSPC even if the discard is slow or throttled.

Link: https://marc.info/?l=linux-kernel&m=162143690731901&w=2
Suggested-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wang Jianchao <wangjianchao@kuaishou.com>
Link: https://lore.kernel.org/r/20210830075246.12516-5-jianchao.wan9@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-08-30 23:34:52 -04:00
Jan Kara
d4f5258eae ext4: Convert to use mapping->invalidate_lock
Convert ext4 to use mapping->invalidate_lock instead of its private
EXT4_I(inode)->i_mmap_sem. This is mostly search-and-replace. By this
conversion we fix a long standing race between hole punching and read(2)
/ readahead(2) paths that can lead to stale page cache contents.

CC: <linux-ext4@vger.kernel.org>
CC: Ted Tso <tytso@mit.edu>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-07-13 14:29:00 +02:00
Jonathan Davies
d578b99443 ext4: notify sysfs on errors_count value change
After s_error_count is incremented, signal the change in the
corresponding sysfs attribute via sysfs_notify. This allows userspace to
poll() on changes to /sys/fs/ext4/*/errors_count.

[ Moved call of ext4_notify_error_sysfs() to flush_stashed_error_work()
  to avoid BUG's caused by calling sysfs_notify trying to sleep after
  being called from an invalid context. -- TYT ]

Signed-off-by: Jonathan Davies <jonathan.davies@nutanix.com>
Link: https://lore.kernel.org/r/20210611140209.28903-1-jonathan.davies@nutanix.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-29 21:06:02 -04:00
Eric Biggers
6d2424a845 ext4: fix comment for s_hash_unsigned
Fix the comment for s_hash_unsigned to not be the opposite of what it
actually is.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20210527235557.2377525-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-24 09:42:41 -04:00
Leah Rumancik
351a0a3fbc ext4: add ioctl EXT4_IOC_CHECKPOINT
ioctl EXT4_IOC_CHECKPOINT checkpoints and flushes the journal. This
includes forcing all the transactions to the log, checkpointing the
transactions, and flushing the log to disk. This ioctl takes u32 "flags"
as an argument. Three flags are supported. EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN
can be used to verify input to the ioctl. It returns error if there is any
invalid input, otherwise it returns success without performing
any checkpointing. The other two flags, EXT4_IOC_CHECKPOINT_FLAG_DISCARD
and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT, can be used to issue requests to
discard or zeroout the journal logs blocks, respectively. At this
point, EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT is primarily added to enable
testing of this codepath on devices that don't support discard.
EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
cannot both be set.

Systems that wish to achieve content deletion SLO can set up a daemon
that calls this ioctl at a regular interval such that it matches with the
SLO requirement. Thus, with this patch, the ext4_dir_entry2 wipeout
patch[1], and the Ext4 "-o discard" mount option set, Ext4 can now
guarantee that all file contents, file metatdata, and filenames will not
be accessible through the filesystem and will have had discard or
zeroout requests issued for corresponding device blocks.

The __jbd2_journal_erase function could also be used to discard or
zero-fill the journal during journal load after recovery. This would
provide a potential solution to a journal replay bug reported earlier this
year[2]. After a successful journal recovery, e2fsck can call this ioctl to
discard the journal as well.

[1] https://lore.kernel.org/linux-ext4/YIHknqxngB1sUdie@mit.edu/
[2] https://lore.kernel.org/linux-ext4/YDZoaacIYStFQT8g@mit.edu/

Link: https://lore.kernel.org/r/20210518151327.130198-2-leah.rumancik@gmail.com
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-22 21:34:08 -04:00
Pavel Skripkin
618f003199 ext4: fix memory leak in ext4_fill_super
static int kthread(void *_create) will return -ENOMEM
or -EINTR in case of internal failure or
kthread_stop() call happens before threadfn call.

To prevent fancy error checking and make code
more straightforward we moved all cleanup code out
of kmmpd threadfn.

Also, dropped struct mmpd_data at all. Now struct super_block
is a threadfn data and struct buffer_head embedded into
struct ext4_sb_info.

Reported-by: syzbot+d9e482e303930fa4f6ff@syzkaller.appspotmail.com
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Link: https://lore.kernel.org/r/20210430185046.15742-1-paskripkin@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-17 10:53:19 -04:00
Joseph Qi
5c680150d7 ext4: remove redundant check buffer_uptodate()
Now set_buffer_uptodate() will test first and then set, so we don't have
to check buffer_uptodate() first, remove it to simplify code.

Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://lore.kernel.org/r/1619418587-5580-1-git-send-email-joseph.qi@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-17 10:53:19 -04:00
Linus Torvalds
9f67672a81 New features for ext4 this cycle include support for encrypted
casefold, ensure that deleted file names are cleared in directory
 blocks by zeroing directory entries when they are unlinked or moved as
 part of a hash tree node split.  We also improve the block allocator's
 performance on a freshly mounted file system by prefetching block
 bitmaps.
 
 There are also the usual cleanups and bug fixes, including fixing a
 page cache invalidation race when there is mixed buffered and direct
 I/O and the block size is less than page size, and allow the dax flag
 to be set and cleared on inline directories.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmCLei4ACgkQ8vlZVpUN
 gaPZkgf/VH08xjMf3VthC+BpvVmChQXfV4yjigHbO2pmPyYWZhyJzkEGCQD8u2eB
 b7ShW+B1NCifcTU34xAkKHwEtakzzEv3WIMrT1oZNWrpfo8tt850EkwQggaGGDpd
 /HnP1/wLtziJ5hE6DwutmX7qB4VFghVj898MjDrEPSOBqItOjWps9mn/JWL7SHyI
 Dqzhf5XZTYPaXWuJmSmKw3q8O70JDHnZe/rRWlfX1jLI5KDtqp71Nw1B+gszUB66
 IUdncyZKvInsyjYhkbCQ8U6WFih82MrbKeuGYDp/RFvg5eMELEYkwT9j0ofuDHq8
 zn62sAlbOXv1DiqkPDHKVm9GkHx8/g==
 =UpnH
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "New features for ext4 this cycle include support for encrypted
  casefold, ensure that deleted file names are cleared in directory
  blocks by zeroing directory entries when they are unlinked or moved as
  part of a hash tree node split. We also improve the block allocator's
  performance on a freshly mounted file system by prefetching block
  bitmaps.

  There are also the usual cleanups and bug fixes, including fixing a
  page cache invalidation race when there is mixed buffered and direct
  I/O and the block size is less than page size, and allow the dax flag
  to be set and cleared on inline directories"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (32 commits)
  ext4: wipe ext4_dir_entry2 upon file deletion
  ext4: Fix occasional generic/418 failure
  fs: fix reporting supported extra file attributes for statx()
  ext4: allow the dax flag to be set and cleared on inline directories
  ext4: fix debug format string warning
  ext4: fix trailing whitespace
  ext4: fix various seppling typos
  ext4: fix error return code in ext4_fc_perform_commit()
  ext4: annotate data race in jbd2_journal_dirty_metadata()
  ext4: annotate data race in start_this_handle()
  ext4: fix ext4_error_err save negative errno into superblock
  ext4: fix error code in ext4_commit_super
  ext4: always panic when errors=panic is specified
  ext4: delete redundant uptodate check for buffer
  ext4: do not set SB_ACTIVE in ext4_orphan_cleanup()
  ext4: make prefetch_block_bitmaps default
  ext4: add proc files to monitor new structures
  ext4: improve cr 0 / cr 1 group scanning
  ext4: add MB_NUM_ORDERS macro
  ext4: add mballoc stats proc file
  ...
2021-04-30 15:35:30 -07:00
Miklos Szeredi
4db5c2e623 ext4: convert to fileattr
Use the fileattr API to let the VFS handle locking, permission checking and
conversion.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
2021-04-12 15:04:29 +02:00
Harshad Shirwadkar
21175ca434 ext4: make prefetch_block_bitmaps default
Block bitmap prefetching is needed for these allocator optimization
data structures to get populated and provide better group scanning
order. So, turn it on bu default. prefetch_block_bitmaps mount option
is now marked as removed and a new option no_prefetch_block_bitmaps is
added to disable block bitmap prefetching.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20210401172129.189766-8-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 11:34:59 -04:00
Harshad Shirwadkar
f68f406385 ext4: add proc files to monitor new structures
This patch adds a new file "mb_structs_summary" which allows us to see
the summary of the new allocator structures added in this
series. Here's the sample output of file:

optimize_scan: 1
max_free_order_lists:
        list_order_0_groups: 0
        list_order_1_groups: 0
        list_order_2_groups: 0
        list_order_3_groups: 0
        list_order_4_groups: 0
        list_order_5_groups: 0
        list_order_6_groups: 0
        list_order_7_groups: 0
        list_order_8_groups: 0
        list_order_9_groups: 0
        list_order_10_groups: 0
        list_order_11_groups: 0
        list_order_12_groups: 0
        list_order_13_groups: 40
fragment_size_tree:
        tree_min: 16384
        tree_max: 32768
        tree_nodes: 40

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20210401172129.189766-7-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 11:34:59 -04:00
Harshad Shirwadkar
196e402adf ext4: improve cr 0 / cr 1 group scanning
Instead of traversing through groups linearly, scan groups in specific
orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
largest free order >= the order of the request. So, with this patch,
we maintain lists for each possible order and insert each group into a
list based on the largest free order in its buddy bitmap. During cr 0
allocation, we traverse these lists in the increasing order of largest
free orders. This allows us to find a group with the best available cr
0 match in constant time. If nothing can be found, we fallback to cr 1
immediately.

At CR1, the story is slightly different. We want to traverse in the
order of increasing average fragment size. For CR1, we maintain a rb
tree of groupinfos which is sorted by average fragment size. Instead
of traversing linearly, at CR1, we traverse in the order of increasing
average fragment size, starting at the most optimal group. This brings
down cr 1 search complexity to log(num groups).

For cr >= 2, we just perform the linear search as before. Also, in
case of lock contention, we intermittently fallback to linear search
even in CR 0 and CR 1 cases. This allows us to proceed during the
allocation path even in case of high contention.

There is an opportunity to do optimization at CR2 too. That's because
at CR2 we only consider groups where bb_free counter (number of free
blocks) is greater than the request extent size. That's left as future
work.

All the changes introduced in this patch are protected under a new
mount option "mb_optimize_scan".

With this patchset, following experiment was performed:

Created a highly fragmented disk of size 65TB. The disk had no
contiguous 2M regions. Following command was run consecutively for 3
times:

time dd if=/dev/urandom of=file bs=2M count=10

Here are the results with and without cr 0/1 optimizations introduced
in this patch:

|---------+------------------------------+---------------------------|
|         | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
|---------+------------------------------+---------------------------|
| 1st run | 5m1.871s                     | 2m47.642s                 |
| 2nd run | 2m28.390s                    | 0m0.611s                  |
| 3rd run | 2m26.530s                    | 0m1.255s                  |
|---------+------------------------------+---------------------------|

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20210401172129.189766-6-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 11:34:59 -04:00
Harshad Shirwadkar
a6c75eaf11 ext4: add mballoc stats proc file
Add new stats for measuring the performance of mballoc. This patch is
forked from Artem Blagodarenko's work that can be found here:

https://github.com/lustre/lustre-release/blob/master/ldiskfs/kernel_patches/patches/rhel8/ext4-simple-blockalloc.patch

This patch reorganizes the stats by cr level. This is how the output
looks like:

mballoc:
	reqs: 0
	success: 0
	groups_scanned: 0
	cr0_stats:
		hits: 0
		groups_considered: 0
		useless_loops: 0
		bad_suggestions: 0
	cr1_stats:
		hits: 0
		groups_considered: 0
		useless_loops: 0
		bad_suggestions: 0
	cr2_stats:
		hits: 0
		groups_considered: 0
		useless_loops: 0
	cr3_stats:
		hits: 0
		groups_considered: 0
		useless_loops: 0
	extents_scanned: 0
		goal_hits: 0
		2^n_hits: 0
		breaks: 0
		lost: 0
	buddies_generated: 0/40
	buddies_time_used: 0
	preallocated: 0
	discarded: 0

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20210401172129.189766-4-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 11:34:59 -04:00
Harshad Shirwadkar
67d2518604 ext4: drop s_mb_bal_lock and convert protected fields to atomic
s_mb_buddies_generated gets used later in this patch series to
determine if the cr 0 and cr 1 optimziations should be performed or
not. Currently, s_mb_buddies_generated is protected under a
spin_lock. In the allocation path, it is better if we don't depend on
the lock and instead read the value atomically. In order to do that,
we drop s_bal_lock altogether and we convert the only two protected
fields by it s_mb_buddies_generated and s_mb_generation_time to atomic
type.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20210401172129.189766-2-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 11:34:58 -04:00
Daniel Rosenberg
1ae98e295f ext4: optimize match for casefolded encrypted dirs
Matching names with casefolded encrypting directories requires
decrypting entries to confirm case since we are case preserving. We can
avoid needing to decrypt if our hash values don't match.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Link: https://lore.kernel.org/r/20210319073414.1381041-3-drosen@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-05 22:18:36 -04:00
Daniel Rosenberg
471fbbea7f ext4: handle casefolding with encryption
This adds support for encryption with casefolding.

Since the name on disk is case preserving, and also encrypted, we can no
longer just recompute the hash on the fly. Additionally, to avoid
leaking extra information from the hash of the unencrypted name, we use
siphash via an fscrypt v2 policy.

The hash is stored at the end of the directory entry for all entries
inside of an encrypted and casefolded directory apart from those that
deal with '.' and '..'. This way, the change is backwards compatible
with existing ext4 filesystems.

[ Changed to advertise this feature via the file:
  /sys/fs/ext4/features/encrypted_casefold -- TYT ]

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20210319073414.1381041-2-drosen@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-05 22:04:20 -04:00
Harshad Shirwadkar
8210bb29c1 ext4: fix rename whiteout with fast commit
This patch adds rename whiteout support in fast commits. Note that the
whiteout object that gets created is actually char device. Which
imples, the function ext4_inode_journal_mode(struct inode *inode)
would return "JOURNAL_DATA" for this inode. This has a consequence in
fast commit code that it will make creation of the whiteout object a
fast-commit ineligible behavior and thus will fall back to full
commits. With this patch, this can be observed by running fast commits
with rename whiteout and seeing the stats generated by ext4_fc_stats
tracepoint as follows:

ext4_fc_stats: dev 254:32 fc ineligible reasons:
XATTR:0, CROSS_RENAME:0, JOURNAL_FLAG_CHANGE:0, NO_MEM:0, SWAP_BOOT:0,
RESIZE:0, RENAME_DIR:0, FALLOC_RANGE:0, INODE_JOURNAL_DATA:16;
num_commits:6, ineligible: 6, numblks: 3

So in short, this patch guarantees that in case of rename whiteout, we
fall back to full commits.

Amir mentioned that instead of creating a new whiteout object for
every rename, we can create a static whiteout object with irrelevant
nlink. That will make fast commits to not fall back to full
commit. But until this happens, this patch will ensure correctness by
falling back to full commits.

Fixes: 8016e29f43 ("ext4: fast commit recovery path")
Cc: stable@kernel.org
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20210316221921.1124955-1-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-03-21 00:38:18 -04:00
Eric Whitney
efc6134527 ext4: shrink race window in ext4_should_retry_alloc()
When generic/371 is run on kvm-xfstests using 5.10 and 5.11 kernels, it
fails at significant rates on the two test scenarios that disable
delayed allocation (ext3conv and data_journal) and force actual block
allocation for the fallocate and pwrite functions in the test.  The
failure rate on 5.10 for both ext3conv and data_journal on one test
system typically runs about 85%.  On 5.11, the failure rate on ext3conv
sometimes drops to as low as 1% while the rate on data_journal
increases to nearly 100%.

The observed failures are largely due to ext4_should_retry_alloc()
cutting off block allocation retries when s_mb_free_pending (used to
indicate that a transaction in progress will free blocks) is 0.
However, free space is usually available when this occurs during runs
of generic/371.  It appears that a thread attempting to allocate
blocks is just missing transaction commits in other threads that
increase the free cluster count and reset s_mb_free_pending while
the allocating thread isn't running.  Explicitly testing for free space
availability avoids this race.

The current code uses a post-increment operator in the conditional
expression that determines whether the retry limit has been exceeded.
This means that the conditional expression uses the value of the
retry counter before it's increased, resulting in an extra retry cycle.
The current code actually retries twice before hitting its retry limit
rather than once.

Increasing the retry limit to 3 from the current actual maximum retry
count of 2 in combination with the change described above reduces the
observed failure rate to less that 0.1% on both ext3conv and
data_journal with what should be limited impact on users sensitive to
the overhead caused by retries.

A per filesystem percpu counter exported via sysfs is added to allow
users or developers to track the number of times the retry limit is
exceeded without resorting to debugging methods.  This should provide
some insight into worst case retry behavior.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20210218151132.19678-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-03-06 11:56:10 -05:00
Christian Brauner
14f3db5542
ext4: support idmapped mounts
Enable idmapped mounts for ext4. All dedicated helpers we need for this
exist. So this basically just means we're passing down the
user_namespace argument from the VFS methods to the relevant helpers.

Let's create simple example where we idmap an ext4 filesystem:

 root@f2-vm:~# truncate -s 5G ext4.img

 root@f2-vm:~# mkfs.ext4 ./ext4.img
 mke2fs 1.45.5 (07-Jan-2020)
 Discarding device blocks: done
 Creating filesystem with 1310720 4k blocks and 327680 inodes
 Filesystem UUID: 3fd91794-c6ca-4b0f-9964-289a000919cf
 Superblock backups stored on blocks:
         32768, 98304, 163840, 229376, 294912, 819200, 884736

 Allocating group tables: done
 Writing inode tables: done
 Creating journal (16384 blocks): done
 Writing superblocks and filesystem accounting information: done

 root@f2-vm:~# losetup -f --show ./ext4.img
 /dev/loop0

 root@f2-vm:~# mount /dev/loop0 /mnt

 root@f2-vm:~# ls -al /mnt/
 total 24
 drwxr-xr-x  3 root root  4096 Oct 28 13:34 .
 drwxr-xr-x 30 root root  4096 Oct 28 13:22 ..
 drwx------  2 root root 16384 Oct 28 13:34 lost+found

 # Let's create an idmapped mount at /idmapped1 where we map uid and gid
 # 0 to uid and gid 1000
 root@f2-vm:/# ./mount-idmapped --map-mount b:0:1000:1 /mnt/ /idmapped1/

 root@f2-vm:/# ls -al /idmapped1/
 total 24
 drwxr-xr-x  3 ubuntu ubuntu  4096 Oct 28 13:34 .
 drwxr-xr-x 30 root   root    4096 Oct 28 13:22 ..
 drwx------  2 ubuntu ubuntu 16384 Oct 28 13:34 lost+found

 # Let's create an idmapped mount at /idmapped2 where we map uid and gid
 # 0 to uid and gid 2000
 root@f2-vm:/# ./mount-idmapped --map-mount b:0:2000:1 /mnt/ /idmapped2/

 root@f2-vm:/# ls -al /idmapped2/
 total 24
 drwxr-xr-x  3 2000 2000  4096 Oct 28 13:34 .
 drwxr-xr-x 31 root root  4096 Oct 28 13:39 ..
 drwx------  2 2000 2000 16384 Oct 28 13:34 lost+found

Let's create another example where we idmap the rootfs filesystem
without a mapping for uid 0 and gid 0:

 # Create an idmapped mount of for a full POSIX range of rootfs under
 # /mnt but without a mapping for uid 0 to reduce attack surface

 root@f2-vm:/# ./mount-idmapped --map-mount b:1:1:65536 / /mnt/

 # Since we don't have a mapping for uid and gid 0 all files owned by
 # uid and gid 0 should show up as uid and gid 65534:
 root@f2-vm:/# ls -al /mnt/
 total 664
 drwxr-xr-x 31 nobody nogroup   4096 Oct 28 13:39 .
 drwxr-xr-x 31 root   root      4096 Oct 28 13:39 ..
 lrwxrwxrwx  1 nobody nogroup      7 Aug 25 07:44 bin -> usr/bin
 drwxr-xr-x  4 nobody nogroup   4096 Oct 28 13:17 boot
 drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:48 dev
 drwxr-xr-x 81 nobody nogroup   4096 Oct 28 04:00 etc
 drwxr-xr-x  4 nobody nogroup   4096 Oct 28 04:00 home
 lrwxrwxrwx  1 nobody nogroup      7 Aug 25 07:44 lib -> usr/lib
 lrwxrwxrwx  1 nobody nogroup      9 Aug 25 07:44 lib32 -> usr/lib32
 lrwxrwxrwx  1 nobody nogroup      9 Aug 25 07:44 lib64 -> usr/lib64
 lrwxrwxrwx  1 nobody nogroup     10 Aug 25 07:44 libx32 -> usr/libx32
 drwx------  2 nobody nogroup  16384 Aug 25 07:47 lost+found
 drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:44 media
 drwxr-xr-x 31 nobody nogroup   4096 Oct 28 13:39 mnt
 drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:44 opt
 drwxr-xr-x  2 nobody nogroup   4096 Apr 15  2020 proc
 drwx--x--x  6 nobody nogroup   4096 Oct 28 13:34 root
 drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:46 run
 lrwxrwxrwx  1 nobody nogroup      8 Aug 25 07:44 sbin -> usr/sbin
 drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:44 srv
 drwxr-xr-x  2 nobody nogroup   4096 Apr 15  2020 sys
 drwxrwxrwt 10 nobody nogroup   4096 Oct 28 13:19 tmp
 drwxr-xr-x 14 nobody nogroup   4096 Oct 20 13:00 usr
 drwxr-xr-x 12 nobody nogroup   4096 Aug 25 07:45 var

 # Since we do have a mapping for uid and gid 1000 all files owned by
 # uid and gid 1000 should simply show up as uid and gid 1000:
 root@f2-vm:/# ls -al /mnt/home/ubuntu/
 total 40
 drwxr-xr-x 3 ubuntu ubuntu  4096 Oct 28 00:43 .
 drwxr-xr-x 4 nobody nogroup 4096 Oct 28 04:00 ..
 -rw------- 1 ubuntu ubuntu  2936 Oct 28 12:26 .bash_history
 -rw-r--r-- 1 ubuntu ubuntu   220 Feb 25  2020 .bash_logout
 -rw-r--r-- 1 ubuntu ubuntu  3771 Feb 25  2020 .bashrc
 -rw-r--r-- 1 ubuntu ubuntu   807 Feb 25  2020 .profile
 -rw-r--r-- 1 ubuntu ubuntu     0 Oct 16 16:11 .sudo_as_admin_successful
 -rw------- 1 ubuntu ubuntu  1144 Oct 28 00:43 .viminfo

Link: https://lore.kernel.org/r/20210121131959.646623-39-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-ext4@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:43:46 +01:00
Christian Brauner
549c729771
fs: make helpers idmap mount aware
Extend some inode methods with an additional user namespace argument. A
filesystem that is aware of idmapped mounts will receive the user
namespace the mount has been marked with. This can be used for
additional permission checking and also to enable filesystems to
translate between uids and gids if they need to. We have implemented all
relevant helpers in earlier patches.

As requested we simply extend the exisiting inode method instead of
introducing new ones. This is a little more code churn but it's mostly
mechanical and doesnt't leave us with additional inode methods.

Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:20 +01:00
Linus Torvalds
555a6e8c11 Various bug fixes and cleanups for ext4; no new features this cycle.
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEOrBXt+eNlFyMVZH70292m8EYBPAFAl/iRkQACgkQ0292m8EY
 BPDxnw/+Nyp5/bscuKflyAs8m8BUDE8MSP9TTaPcflDgyV1Z3QiSx5ITOXi4JSy6
 V5W1RXScJL3mG8QrtukY7zP7bSTS+UDQlbZ1BlpiyZDjo3l15gKwWVOmJteGusfI
 U+xRz1RoiXcGTq6vSDYZIKquNVNezAE0UbYVRECm+lEjfka1vrlQz20gEgvh/mvh
 ZQV5wLqr61gGsGDhHSw+HD16KYJQWtwcRirNuzw7VfFtUJNfthXkPy8mIuJcKx2a
 sHuGEnLxhwapl27Os2o8gMTjt05nQZSZ6Fx3p7i3c4NBknR3F8ZI1LVSvDDZU6Rz
 v0v/VfMEP/CEg5Vq8gLL1dgWCwF0GWKYLdhyge/oCQU9r3RbheRKCRmOqyrYw5sA
 e4AqfWw39OmKV8L54XKUV4lXaBd7rCHW0l0uJdm1cdGxg7kewjS6TajhJ30tRqz/
 vvCo5HZ/TgF71AhONu+0WSzbuOOZzRxJC4k+WLRHspOzXVg7f8PzAxkfHMJaEk3t
 TsuM0prorm4r1PAb7TblXJpV7LlI2sjhYqsC/ngAfmZqw59wou3n+hytg8BkKlOz
 Gq9LUdFw4MJqK0OqaRzwWCcXFdc5EEFS8Kq4aVNLbNMmF1gfYy/RzTRc1iZ5RkN+
 stk7OYhs657Kv925nhvZsLCoOwGzp+8dbKmW2TlgQMwYHx30+Jw=
 =a5OE
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Various bug fixes and cleanups for ext4; no new features this cycle"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (29 commits)
  ext4: remove unnecessary wbc parameter from ext4_bio_write_page
  ext4: avoid s_mb_prefetch to be zero in individual scenarios
  ext4: defer saving error info from atomic context
  ext4: simplify ext4 error translation
  ext4: move functions in super.c
  ext4: make ext4_abort() use __ext4_error()
  ext4: standardize error message in ext4_protect_reserved_inode()
  ext4: remove redundant sb checksum recomputation
  ext4: don't remount read-only with errors=continue on reboot
  ext4: fix deadlock with fs freezing and EA inodes
  jbd2: add a helper to find out number of fast commit blocks
  ext4: make fast_commit.h byte identical with e2fsprogs/fast_commit.h
  ext4: fix fall-through warnings for Clang
  ext4: add docs about fast commit idempotence
  ext4: remove the unused EXT4_CURRENT_REV macro
  ext4: fix an IS_ERR() vs NULL check
  ext4: check for invalid block size early when mounting a file system
  ext4: fix a memory leak of ext4_free_data
  ext4: delete nonsensical (commented-out) code inside ext4_xattr_block_set()
  ext4: update ext4_data_block_valid related comments
  ...
2020-12-24 14:16:02 -08:00
Lei Chen
be993933d2 ext4: remove unnecessary wbc parameter from ext4_bio_write_page
ext4_bio_write_page does not need wbc parameter, since its parameter
io contains the io_wbc field. The io::io_wbc is initialized by
ext4_io_submit_init which is called in ext4_writepages and
ext4_writepage functions prior to ext4_bio_write_page.
Therefor, when ext4_bio_write_page is called, wbc info
has already been included in io parameter.

Signed-off-by: Lei Chen <lennychen@tencent.com>
Link: https://lore.kernel.org/r/1607669664-25656-1-git-send-email-lennychen@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-22 13:08:45 -05:00
Jan Kara
c92dc85684 ext4: defer saving error info from atomic context
When filesystem inconsistency is detected with group locked, we
currently try to modify superblock to store error there without
blocking. However this can cause superblock checksum failures (or
DIF/DIX failure) when the superblock is just being written out.

Make error handling code just store error information in ext4_sb_info
structure and copy it to on-disk superblock only in ext4_commit_super().
In case of error happening with group locked, we just postpone the
superblock flushing to a workqueue.

[ Added fixup so that s_first_error_* does not get updated after
  the file system is remounted.
  Also added fix for syzbot failure.  - Ted ]

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201127113405.26867-8-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Hillf Danton <hdanton@sina.com>
Reported-by: syzbot+9043030c040ce1849a60@syzkaller.appspotmail.com
2020-12-22 13:07:49 -05:00
Linus Torvalds
ff49c86f27 f2fs-for-5.11-rc1
In this round, we've made more work into per-file compression support. For
 example, F2FS_IOC_GET|SET_COMPRESS_OPTION provides a way to change the
 algorithm or cluster size per file. F2FS_IOC_COMPRESS|DECOMPRESS_FILE provides
 a way to compress and decompress the existing normal files manually along with
 a new mount option, compress_mode=fs|user, which can control who compresses the
 data. Chao also added a checksum feature with a mount option so that we are able
 to detect any corrupted cluster. In addition, Daniel contributed casefolding
 with encryption patch, which will be used for Android devices.
 
 Enhancement:
  - add ioctls and mount option to manage per-file compression feature
  - support casefolding with encryption
  - support checksum for compressed cluster
  - avoid IO starvation by replacing mutex with rwsem
  - add sysfs, max_io_bytes, to control max bio size
 
 Bug fix:
  - fix use-after-free issue when compression and fsverity are enabled
  - fix consistency corruption during fault injection test
  - fix data offset for lseek
  - get rid of buffer_head which has 32bits limit in fiemap
  - fix some bugs in multi-partitions support
  - fix nat entry count calculation in shrinker
  - fix some stat information
 
 And, we've refactored some logics and fix minor bugs as well.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAl/a8ywACgkQQBSofoJI
 UNLa2RAAjK+6tOs+NuYx2w9SegghKxwCg4Mb362BMdaAGx6GzMqAkCiVdujuoz/r
 +wy8sdqO9QE7723ZDNsebNMLRnkNPHnpneSL2p6OsSLJrD3ORTELVRrzNlkemvnK
 rRHZyYnNJvQQnD4uU7ABvROKsIDw/nCfcFvzHmLIgEw8EHO0W4n6fTtBdTwXv1qi
 N3qXhGuQldonR9XICuGjzj7wh17n9ua6Mr12XX3Ok38giMcZb9KFBwgvlhl35cxt
 htEmUpxWD3NTSw6zJmV4VAiajpiIkW6QRQuVA1nzdLZK644gaJMhM1EUsOnZhfDl
 wX0ZtKoNkXxb0glD34O3aYqeHJ3tHWgPmmpVm9TECJP9A/X7kmEHgQYpH/eJ9I7d
 tk51Uz28Mz1RShXU4i5RyKZeeoNTLiVlqiC95E2cnq4C1tLOJyI00N9AinrLzvR+
 fqUrAwCrBpiYX63mWKYwq7GWxWwp4+PY09kyIZxxJiWhTE/St0bRx2bQL8zA8C6J
 Rtxl+QWyQhkFbNu8fAukLFAhC6mqX/FKpXvUqRehBnHRvMWBiVZG0//eOPQLk71u
 qsdCgYuEVcg3itDQrZvmsjxi4Pb5E9mNr0s5oC4I2WvBPMheD4esSyG7cKDN0qfS
 3FFHlRYLOvnjPMLnKTmZXjFvFyHR8mwsD4Z83MeSrqYnWC14tFY=
 =KneU
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've made more work into per-file compression support.

  For example, F2FS_IOC_GET | SET_COMPRESS_OPTION provides a way to
  change the algorithm or cluster size per file. F2FS_IOC_COMPRESS |
  DECOMPRESS_FILE provides a way to compress and decompress the existing
  normal files manually.

  There is also a new mount option, compress_mode=fs|user, which can
  control who compresses the data.

  Chao also added a checksum feature with a mount option so that
  we are able to detect any corrupted cluster.

  In addition, Daniel contributed casefolding with encryption patch,
  which will be used for Android devices.

  Summary:

  Enhancements:
   - add ioctls and mount option to manage per-file compression feature
   - support casefolding with encryption
   - support checksum for compressed cluster
   - avoid IO starvation by replacing mutex with rwsem
   - add sysfs, max_io_bytes, to control max bio size

  Bug fixes:
   - fix use-after-free issue when compression and fsverity are enabled
   - fix consistency corruption during fault injection test
   - fix data offset for lseek
   - get rid of buffer_head which has 32bits limit in fiemap
   - fix some bugs in multi-partitions support
   - fix nat entry count calculation in shrinker
   - fix some stat information

  And, we've refactored some logics and fix minor bugs as well"

* tag 'f2fs-for-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (36 commits)
  f2fs: compress: fix compression chksum
  f2fs: fix shift-out-of-bounds in sanity_check_raw_super()
  f2fs: fix race of pending_pages in decompression
  f2fs: fix to account inline xattr correctly during recovery
  f2fs: inline: fix wrong inline inode stat
  f2fs: inline: correct comment in f2fs_recover_inline_data
  f2fs: don't check PAGE_SIZE again in sanity_check_raw_super()
  f2fs: convert to F2FS_*_INO macro
  f2fs: introduce max_io_bytes, a sysfs entry, to limit bio size
  f2fs: don't allow any writes on readonly mount
  f2fs: avoid race condition for shrinker count
  f2fs: add F2FS_IOC_DECOMPRESS_FILE and F2FS_IOC_COMPRESS_FILE
  f2fs: add compress_mode mount option
  f2fs: Remove unnecessary unlikely()
  f2fs: init dirty_secmap incorrectly
  f2fs: remove buffer_head which has 32bits limit
  f2fs: fix wrong block count instead of bytes
  f2fs: use new conversion functions between blks and bytes
  f2fs: rename logical_to_blk and blk_to_logical
  f2fs: fix kbytes written stat for multi-device case
  ...
2020-12-17 11:18:00 -08:00
Jan Kara
014c9caa29 ext4: make ext4_abort() use __ext4_error()
The only difference between __ext4_abort() and __ext4_error() is that
the former one ignores errors=continue mount option. Unify the code to
reduce duplication.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201127113405.26867-5-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17 13:30:55 -05:00
Kaixu Xia
03505c58b8 ext4: remove the unused EXT4_CURRENT_REV macro
There are no callers of the EXT4_CURRENT_REV macro, so remove it.

Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/1605164202-31120-1-git-send-email-kaixuxia@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17 13:30:44 -05:00
Chunguang Xu
837c23fbc1 ext4: use ASSERT() to replace J_ASSERT()
There are currently multiple forms of assertion, such as J_ASSERT().
J_ASEERT() is provided for the jbd module, which is a public module.
Maybe we should use custom ASSERT() like other file systems, such as
xfs, which would be better.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/1604764698-4269-1-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-03 09:36:57 -05:00
Roman Anufriev
f177ee0882 ext4: add helpers for checking whether quota can be enabled/is journalled
Right now, there are several places, where we check whether fs is
capable of enabling quota or if quota is journalled with quite long
and non-self-descriptive condition statements.

This patch wraps these statements into helpers for better readability
and easier usage.

Link: https://lore.kernel.org/r/1603336860-16153-1-git-send-email-dotdot@yandex-team.ru
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Roman Anufriev <dotdot@yandex-team.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-03 09:18:48 -05:00
Daniel Rosenberg
bb9cd9106b fscrypt: Have filesystems handle their d_ops
This shifts the responsibility of setting up dentry operations from
fscrypt to the individual filesystems, allowing them to have their own
operations while still setting fscrypt's d_revalidate as appropriate.

Most filesystems can just use generic_set_encrypted_ci_d_ops, unless
they have their own specific dentry operations as well. That operation
will set the minimal d_ops required under the circumstances.

Since the fscrypt d_ops are set later on, we must set all d_ops there,
since we cannot adjust those later on. This should not result in any
change in behavior.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-02 22:00:21 -08:00
Jan Kara
f902b21650 ext4: fix bogus warning in ext4_update_dx_flag()
The idea of the warning in ext4_update_dx_flag() is that we should warn
when we are clearing EXT4_INODE_INDEX on a filesystem with metadata
checksums enabled since after clearing the flag, checksums for internal
htree nodes will become invalid. So there's no need to warn (or actually
do anything) when EXT4_INODE_INDEX is not set.

Link: https://lore.kernel.org/r/20201118153032.17281-1-jack@suse.cz
Fixes: 48a3431195 ("ext4: fix checksum errors with indexed dirs")
Reported-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2020-11-19 22:41:10 -05:00
Harshad Shirwadkar
a72b38eebe ext4: handle dax mount option collision
Mount options dax=inode and dax=never collided with fast_commit and
journal checksum. Redefine the mount flags to remove the collision.

Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Fixes: 9cb20f94af ("fs/ext4: Make DAX mount option a tri-state")
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201111183209.447175-1-harshads@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-11 14:23:29 -05:00
Harshad Shirwadkar
9b5f6c9b83 ext4: make s_mount_flags modifications atomic
Fast commit file system states are recorded in
sbi->s_mount_flags. Fast commit expects these bit manipulations to be
atomic. This patch adds helpers to make those modifications atomic.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201106035911.1942128-21-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-06 23:01:05 -05:00
Harshad Shirwadkar
1ceecb537f ext4: fix inode dirty check in case of fast commits
In case of fast commits, determine if the inode is dirty by checking
if the inode is on fast commit list. This also helps us get rid of
ext4_inode_info.i_fc_committed_subtid field.

Reported-by: Andrea Righi <andrea.righi@canonical.com>
Tested-by: Andrea Righi <andrea.righi@canonical.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201106035911.1942128-18-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-06 23:01:05 -05:00
Harshad Shirwadkar
a80f7fcf18 ext4: fixup ext4_fc_track_* functions' signature
Firstly, pass handle to all ext4_fc_track_* functions and use
transaction id found in handle->h_transaction->h_tid for tracking fast
commit updates. Secondly, don't pass inode to
ext4_fc_track_link/create/unlink functions. inode can be found inside
these functions as d_inode(dentry). However, rename path is an
exeception. That's because in that case, we need inode that's not same
as d_inode(dentry). To handle that, add a couple of low-level wrapper
functions that take inode and dentry as arguments.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201106035911.1942128-5-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-06 23:01:02 -05:00
Harshad Shirwadkar
a44ad6835d ext4: describe fast_commit feature flags
Fast commit feature has flags in the file system as well in JBD2. The
meaning of fast commit feature flags can get confusing. Update docs
and code to add more documentation about it.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201106035911.1942128-2-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-06 23:01:01 -05:00
Daniel Rosenberg
f8f4acb6cd ext4: use generic casefolding support
This switches ext4 over to the generic support provided in libfs.

Since casefolded dentries behave the same in ext4 and f2fs, we decrease
the maintenance burden by unifying them, and any optimizations will
immediately apply to both.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20201028050820.1636571-1-drosen@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-28 13:43:13 -04:00
Harshad Shirwadkar
ababea77bc ext4: use s_mount_flags instead of s_mount_state for fast commit state
Ext4's fast commit related transient states should use
sb->s_mount_flags instead of persistent sb->s_mount_state.

Fixes: 8016e29f43 ("ext4: fast commit recovery path")
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201027044915.2553163-3-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-28 13:42:10 -04:00
Linus Torvalds
96485e4462 The siginificant new ext4 feature this time around is Harshad's new
fast_commit mode.  In addition, thanks to Mauricio for fixing a race
 where mmap'ed pages that are being changed in parallel with a
 data=journal transaction commit could result in bad checksums in the
 failure that could cause journal replays to fail.  Also notable is
 Ritesh's buffered write optimization which can result in significant
 improvements on parallel write workloads.  (The kernel test robot
 reported a 330.6% improvement on fio.write_iops on a 96 core system
 using DAX[1].)
 
 Besides that, we have the usual miscellaneous cleanups and bug fixes.
 
 [1] https://lore.kernel.org/r/20200925071217.GO28663@shao2-debian
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl+RuCQACgkQ8vlZVpUN
 gaNebgf/dUnQp5SG2/2zczSDqr+f8DOiuAdn9I54BAr2HwdkMbbiktKfenfpu41k
 SMGNV6rYSs248dWFtkzM7C2T1dpGrdAe2OCYrU6HPR/xoZlx/RcDz39u7nXBDeup
 NV7RnPgIzCAGZXCOY/Zu1k88T1eosLRTIWvIcNOspt75MC0vJ8GSmkx1bVEUsv8w
 Uq6T0OREfDiLJpEZxtfbl3o+8Rfs82t3Soj4pwN8ESL/RWBTT8PlwAGhIcdjnHy/
 lsgT35IrY4OL6Eas9msUmFYrWhO6cW21kWOugYALQXZ3ny4A+r5nZZcY/wCq01NX
 J2Z02ZiMTZUmFFREbtc0eJukXWEVvA==
 =14K9
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "The siginificant new ext4 feature this time around is Harshad's new
  fast_commit mode.

  In addition, thanks to Mauricio for fixing a race where mmap'ed pages
  that are being changed in parallel with a data=journal transaction
  commit could result in bad checksums in the failure that could cause
  journal replays to fail.

  Also notable is Ritesh's buffered write optimization which can result
  in significant improvements on parallel write workloads. (The kernel
  test robot reported a 330.6% improvement on fio.write_iops on a 96
  core system using DAX)

  Besides that, we have the usual miscellaneous cleanups and bug fixes"

Link: https://lore.kernel.org/r/20200925071217.GO28663@shao2-debian

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (46 commits)
  ext4: fix invalid inode checksum
  ext4: add fast commit stats in procfs
  ext4: add a mount opt to forcefully turn fast commits on
  ext4: fast commit recovery path
  jbd2: fast commit recovery path
  ext4: main fast-commit commit path
  jbd2: add fast commit machinery
  ext4 / jbd2: add fast commit initialization
  ext4: add fast_commit feature and handling for extended mount options
  doc: update ext4 and journalling docs to include fast commit feature
  ext4: Detect already used quota file early
  jbd2: avoid transaction reuse after reformatting
  ext4: use the normal helper to get the actual inode
  ext4: fix bs < ps issue reported with dioread_nolock mount opt
  ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()
  ext4: data=journal: fixes for ext4_page_mkwrite()
  jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers()
  jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers()
  ext4: introduce ext4_sb_bread_unmovable() to replace sb_bread_unmovable()
  ext4: use ext4_sb_bread() instead of sb_bread()
  ...
2020-10-22 10:31:08 -07:00
Harshad Shirwadkar
ce8c59d197 ext4: add fast commit stats in procfs
This commit adds a file in procfs that tracks fast commit related
statistics.

root@kvm-xfstests:/mnt# cat /proc/fs/ext4/vdc/fc_info
fc stats:
7772 commits
15 ineligible
4083 numblks
2242us avg_commit_time
Ineligible reasons:
"Extended attributes changed":  0
"Cross rename": 0
"Journal flag changed": 0
"Insufficient memory":  0
"Swap boot":    0
"Resize":       0
"Dir renamed":  0
"Falloc range op":      0
"FC Commit Failed":     15

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201015203802.3597742-10-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21 23:22:38 -04:00
Harshad Shirwadkar
8016e29f43 ext4: fast commit recovery path
This patch adds fast commit recovery path support for Ext4 file
system. We add several helper functions that are similar in spirit to
e2fsprogs journal recovery path handlers. Example of such functions
include - a simple block allocator, idempotent block bitmap update
function etc. Using these routines and the fast commit log in the fast
commit area, the recovery path (ext4_fc_replay()) performs fast commit
log recovery.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201015203802.3597742-8-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21 23:22:38 -04:00
Harshad Shirwadkar
aa75f4d3da ext4: main fast-commit commit path
This patch adds main fast commit commit path handlers. The overall
patch can be divided into two inter-related parts:

(A) Metadata updates tracking

    This part consists of helper functions to track changes that need
    to be committed during a commit operation. These updates are
    maintained by Ext4 in different in-memory queues. Following are
    the APIs and their short description that are implemented in this
    patch:

    - ext4_fc_track_link/unlink/creat() - Track unlink. link and creat
      operations
    - ext4_fc_track_range() - Track changed logical block offsets
      inodes
    - ext4_fc_track_inode() - Track inodes
    - ext4_fc_mark_ineligible() - Mark file system fast commit
      ineligible()
    - ext4_fc_start_update() / ext4_fc_stop_update() /
      ext4_fc_start_ineligible() / ext4_fc_stop_ineligible() These
      functions are useful for co-ordinating inode updates with
      commits.

(B) Main commit Path

    This part consists of functions to convert updates tracked in
    in-memory data structures into on-disk commits. Function
    ext4_fc_commit() is the main entry point to commit path.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201015203802.3597742-6-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21 23:22:37 -04:00
Harshad Shirwadkar
6866d7b3f2 ext4 / jbd2: add fast commit initialization
This patch adds fast commit area trackers in the journal_t
structure. These are initialized via the jbd2_fc_init() routine that
this patch adds. This patch also adds ext4/fast_commit.c and
ext4/fast_commit.h files for fast commit code that will be added in
subsequent patches in this series.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201015203802.3597742-4-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21 23:22:26 -04:00
Harshad Shirwadkar
995a3ed67f ext4: add fast_commit feature and handling for extended mount options
We are running out of mount option bits. Add handling for using
s_mount_opt2. Add ext4 and jbd2 fast commit feature flag and also add
ability to turn off the fast commit feature in Ext4.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201015203802.3597742-3-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21 23:22:26 -04:00
zhangyi (F)
8394a6abf3 ext4: introduce ext4_sb_bread_unmovable() to replace sb_bread_unmovable()
Now we only use sb_bread_unmovable() to read superblock and descriptor
block at mount time, so there is no opportunity that we need to clear
buffer verified bit and also handle buffer write_io error bit. But for
the sake of unification, let's introduce ext4_sb_bread_unmovable() to
replace all sb_bread_unmovable(). After this patch, we stop using read
helpers in fs/buffer.c.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20200924073337.861472-8-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:37:14 -04:00
zhangyi (F)
5df1d4123d ext4: introduce ext4_sb_breadahead_unmovable() to replace sb_breadahead_unmovable()
If we readahead inode tables in __ext4_get_inode_loc(), it may bypass
buffer_write_io_error() check, so introduce ext4_sb_breadahead_unmovable()
to handle this special case.

This patch also replace sb_breadahead_unmovable() in ext4_fill_super()
for the sake of unification.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20200924073337.861472-6-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:37:14 -04:00
zhangyi (F)
fa491b14cd ext4: introduce new metadata buffer read helpers
The previous patch add clear_buffer_verified() before we read metadata
block from disk again, but it's rather easy to miss clearing of this bit
because currently we read metadata buffer through different open codes
(e.g. ll_rw_block(), bh_submit_read() and invoke submit_bh() directly).
So, it's time to add common helpers to unify in all the places reading
metadata buffers instead. This patch add 3 helpers:

 - ext4_read_bh_nowait(): async read metadata buffer if it's actually
   not uptodate, clear buffer_verified bit before read from disk.
 - ext4_read_bh(): sync version of read metadata buffer, it will wait
   until the read operation return and check the return status.
 - ext4_read_bh_lock(): try to lock the buffer before read buffer, it
   will skip reading if the buffer is already locked.

After this patch, we need to use these helpers in all the places reading
metadata buffer instead of different open codes.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200924073337.861472-3-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:37:13 -04:00
Chunguang Xu
addd752cff ext4: make mb_check_counter per group
Make bb_check_counter per group, so each group has the same chance
to be checked, which can expose errors more easily.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/1601292995-32205-2-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:37:13 -04:00
Chunguang Xu
dd0db94f30 ext4: rename system_blks to s_system_blks inside ext4_sb_info
Rename system_blks to s_system_blks inside ext4_sb_info, keep
the naming rules consistent with other variables, which is
convenient for code reading and writing.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/1600916623-544-2-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:36:59 -04:00
Chunguang Xu
ee7ed3aa0f ext4: rename journal_dev to s_journal_dev inside ext4_sb_info
Rename journal_dev to s_journal_dev inside ext4_sb_info, keep
the naming rules consistent with other variables, which is
convenient for code reading and writing.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/1600916623-544-1-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:36:59 -04:00
Tian Tao
7eb90a2d6a ext4: remove unused including <linux/version.h>
Remove including <linux/version.h> that don't need it.

Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Link: https://lore.kernel.org/r/1600397165-42873-1-git-send-email-tiantao6@hisilicon.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:36:58 -04:00
Xiao Yang
aa2f77920b ext4: disallow modifying DAX inode flag if inline_data has been set
inline_data is mutually exclusive to DAX so enabling both of them triggers
the following issue:
------------------------------------------
# mkfs.ext4 -F -O inline_data /dev/pmem1
...
# mount /dev/pmem1 /mnt
# echo 'test' >/mnt/file
# lsattr -l /mnt/file
/mnt/file                    Inline_Data
# xfs_io -c "chattr +x" /mnt/file
# xfs_io -c "lsattr -v" /mnt/file
[dax] /mnt/file
# umount /mnt
# mount /dev/pmem1 /mnt
# cat /mnt/file
cat: /mnt/file: Numerical result out of range
------------------------------------------

Fixes: b383a73f2b ("fs/ext4: Introduce DAX inode flag")
Signed-off-by: Xiao Yang <yangx.jy@cn.fujitsu.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20200828084330.15776-1-yangx.jy@cn.fujitsu.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:36:13 -04:00
Petr Malat
81e8c3c503 ext4: do not interpret high bytes if 64bit feature is disabled
Fields s_free_blocks_count_hi, s_r_blocks_count_hi and s_blocks_count_hi
are not valid if EXT4_FEATURE_INCOMPAT_64BIT is not enabled and should be
treated as zeroes.

Signed-off-by: Petr Malat <oss@malat.biz>
Link: https://lore.kernel.org/r/20200825150016.3363-1-oss@malat.biz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:36:13 -04:00
Eric Biggers
ac4acb1f4b fscrypt: handle test_dummy_encryption in more logical way
The behavior of the test_dummy_encryption mount option is that when a
new file (or directory or symlink) is created in an unencrypted
directory, it's automatically encrypted using a dummy encryption policy.
That's it; in particular, the encryption (or lack thereof) of existing
files (or directories or symlinks) doesn't change.

Unfortunately the implementation of test_dummy_encryption is a bit weird
and confusing.  When test_dummy_encryption is enabled and a file is
being created in an unencrypted directory, we set up an encryption key
(->i_crypt_info) for the directory.  This isn't actually used to do any
encryption, however, since the directory is still unencrypted!  Instead,
->i_crypt_info is only used for inheriting the encryption policy.

One consequence of this is that the filesystem ends up providing a
"dummy context" (policy + nonce) instead of a "dummy policy".  In
commit ed318a6cc0 ("fscrypt: support test_dummy_encryption=v2"), I
mistakenly thought this was required.  However, actually the nonce only
ends up being used to derive a key that is never used.

Another consequence of this implementation is that it allows for
'inode->i_crypt_info != NULL && !IS_ENCRYPTED(inode)', which is an edge
case that can be forgotten about.  For example, currently
FS_IOC_GET_ENCRYPTION_POLICY on an unencrypted directory may return the
dummy encryption policy when the filesystem is mounted with
test_dummy_encryption.  That seems like the wrong thing to do, since
again, the directory itself is not actually encrypted.

Therefore, switch to a more logical and maintainable implementation
where the dummy encryption policy inheritance is done without setting up
keys for unencrypted directories.  This involves:

- Adding a function fscrypt_policy_to_inherit() which returns the
  encryption policy to inherit from a directory.  This can be a real
  policy, a dummy policy, or no policy.

- Replacing struct fscrypt_dummy_context, ->get_dummy_context(), etc.
  with struct fscrypt_dummy_policy, ->get_dummy_policy(), etc.

- Making fscrypt_fname_encrypted_size() take an fscrypt_policy instead
  of an inode.

Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-13-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-22 06:48:49 -07:00
brookxu
27bc446e2d ext4: limit the length of per-inode prealloc list
In the scenario of writing sparse files, the per-inode prealloc list may
be very long, resulting in high overhead for ext4_mb_use_preallocated().
To circumvent this problem, we limit the maximum length of per-inode
prealloc list to 512 and allow users to modify it.

After patching, we observed that the sys ratio of cpu has dropped, and
the system throughput has increased significantly. We created a process
to write the sparse file, and the running time of the process on the
fixed kernel was significantly reduced, as follows:

Running time on unfixed kernel:
[root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
real    0m2.051s
user    0m0.008s
sys     0m2.026s

Running time on fixed kernel:
[root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
real    0m0.471s
user    0m0.004s
sys     0m0.395s

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/d7a98178-056b-6db5-6bce-4ead23f4a257@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-19 12:04:36 -04:00
Kyoungho Koo
2fe34d2938 ext4: remove unused parameter of ext4_generic_delete_entry function
The ext4_generic_delete_entry function does not use the parameter
handle, so it can be removed.

Signed-off-by: Kyoungho Koo <rnrudgh@gmail.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200810080701.GA14160@koo-Z370-HD3
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-18 14:25:54 -04:00
Jan Kara
ce9f24cccd ext4: check journal inode extents more carefully
Currently, system zones just track ranges of block, that are "important"
fs metadata (bitmaps, group descriptors, journal blocks, etc.). This
however complicates how extent tree (or indirect blocks) can be checked
for inodes that actually track such metadata - currently the journal
inode but arguably we should be treating quota files or resize inode
similarly. We cannot run __ext4_ext_check() on such metadata inodes when
loading their extents as that would immediately trigger the validity
checks and so we just hack around that and special-case the journal
inode. This however leads to a situation that a journal inode which has
extent tree of depth at least one can have invalid extent tree that gets
unnoticed until ext4_cache_extents() crashes.

To overcome this limitation, track inode number each system zone belongs
to (0 is used for zones not belonging to any inode). We can then verify
inode number matches the expected one when verifying extent tree and
thus avoid the false errors. With this there's no need to to
special-case journal inode during extent tree checking anymore so remove
it.

Fixes: 0a944e8a6c ("ext4: don't perform block validity checks on the journal inode")
Reported-by: Wolfgang Frisch <wolfgang.frisch@suse.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200728130437.7804-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-07 14:12:36 -04:00
Dmitry Monakhov
1cf006ed19 ext4: export msg_count and warning_count via sysfs
This numbers can be analized by system automation similar to errors_count.
In ideal world it would be nice to have separate counters for different
log-levels, but this makes this patch too intrusive.

Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Link: https://lore.kernel.org/r/20200725123313.4467-1-dmtrmonakhov@yandex-team.ru
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-07 14:12:36 -04:00
Shijie Luo
6dbd300129 ext4: remove some redundant function declarations
ext4 update feature functions do not exist now, remove these useless
function declarations.

Signed-off-by: Shijie Luo <luoshijie1@huawei.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200724032954.22097-1-luoshijie1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-07 14:12:35 -04:00
Theodore Ts'o
3d392b2676 ext4: add prefetch_block_bitmaps mount option
For file systems where we can afford to keep the buddy bitmaps cached,
we can speed up initial writes to large file systems by starting to
load the block allocation bitmaps as soon as the file system is
mounted.  This won't work well for _super_ large file systems, or
memory constrained systems, so we only enable this when it is
requested via a mount option.

Addresses-Google-Bug: 159488342
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2020-08-07 14:12:35 -04:00
zhangyi (F)
bc71726c72 ext4: abort the filesystem if failed to async write metadata buffer
There is a risk of filesystem inconsistency if we failed to async write
back metadata buffer in the background. Because of current buffer's end
io procedure is handled by end_buffer_async_write() in the block layer,
and it only clear the buffer's uptodate flag and mark the write_io_error
flag, so ext4 cannot detect such failure immediately. In most cases of
getting metadata buffer (e.g. ext4_read_inode_bitmap()), although the
buffer's data is actually uptodate, it may still read data from disk
because the buffer's uptodate flag has been cleared. Finally, it may
lead to on-disk filesystem inconsistency if reading old data from the
disk successfully and write them out again.

This patch detect bdev mapping->wb_err when getting journal's write
access and mark the filesystem error if bdev's mapping->wb_err was
increased, this could prevent further writing and potential
inconsistency.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200620025427.1756360-2-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-07 14:12:34 -04:00
Alex Zhuravlev
cfd7323772 ext4: add prefetching for block allocation bitmaps
This should significantly improve bitmap loading, especially for flex
groups as it tries to load all bitmaps within a flex.group instead of
one by one synchronously.

Prefetching is done in 8 * flex_bg groups, so it should be 8
read-ahead reads for a single allocating thread. At the end of
allocation the thread waits for read-ahead completion and initializes
buddy information so that read-aheads are not lost in case of memory
pressure.

At cr=0 the number of prefetching IOs is limited per allocation
context to prevent a situation when mballoc loads thousands of bitmaps
looking for a perfect group and ignoring groups with good chunks.

Together with the patch "ext4: limit scanning of uninitialized groups"
the mount time (which includes few tiny allocations) of a 1PB
filesystem is reduced significantly:

               0% full    50%-full unpatched    patched
  mount time       33s                9279s       563s

[ Restructured by tytso; removed the state flags in the allocation
  context, so it can be used to lazily prefetch the allocation bitmaps
  immediately after the file system is mounted.  Skip prefetching
  block groups which are uninitialized.  Finally pass in the
  REQ_RAHEAD flag to the block layer while prefetching. ]

Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-06 01:44:48 -04:00
Eric Biggers
cb29a02d3a ext4: use generic names for generic ioctls
Don't define EXT4_IOC_* aliases to ioctls that already have a generic
FS_IOC_* name.  These aliases are unnecessary, and they make it unclear
which ioctls are ext4-specific and which are generic.

Exception: leave EXT4_IOC_GETVERSION_OLD and EXT4_IOC_SETVERSION_OLD
as-is for now, since renaming them to FS_IOC_GETVERSION and
FS_IOC_SETVERSION would probably make them more likely to be confused
with EXT4_IOC_GETVERSION and EXT4_IOC_SETVERSION which also exist.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200714230909.56349-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-06 01:35:05 -04:00
Eric Biggers
2a12e147da ext4: don't hardcode bit values in EXT4_FL_USER_*
Define the EXT4_FL_USER_* constants by OR-ing together the appropriate
flags, rather than hard-coding a numeric value.  This makes it much
easier to see which flags are listed.

No change in the actual values.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200713031012.192440-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-06 01:17:25 -04:00
Linus Torvalds
3be20b6fc1 This is the second round of ext4 commits for 5.8 merge window. It
includes the per-inode DAX support, which was dependant on the DAX
 infrastructure which came in via the XFS tree, and a number of
 regression and bug fixes; most notably the "BUG: using
 smp_processor_id() in preemptible code in ext4_mb_new_blocks" reported
 by syzkaller.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl7mgCcACgkQ8vlZVpUN
 gaPftwf8C4w/7SG+CYLdwg0d2u9TKk77yDuWaioFHOcMSjZvG4TCSgtMhZxQnyty
 9t4yqacILx12pCj/mZnrZp5BOSn9O2ZbuDoXNKNrFXU0BF+CsbnhvJvrrh1j/MUa
 PPtcqyGFdOLSDvHSD9xPVT76juwh79aR8vB7qnQXaEO5wcLodZWoqBEFSKCl6Bo8
 hjXs1EXidusKsoarQxW6mEITmnhU2S2fuCVDgVcoM/LmKwzbgqvlWrentq9u8qLH
 W+XbjWgUtCM1byeDZWqe5FYyyJ8x+dTv7H5an3KR92EN6hKo5AOvzA0I41pZscq/
 bJ9p2THDxJQX4rJBevGAS5mZ6hTkRw==
 =z6eO
 -----END PGP SIGNATURE-----

Merge tag 'ext4-for-linus-5.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull more ext4 updates from Ted Ts'o:
 "This is the second round of ext4 commits for 5.8 merge window [1].

  It includes the per-inode DAX support, which was dependant on the DAX
  infrastructure which came in via the XFS tree, and a number of
  regression and bug fixes; most notably the "BUG: using
  smp_processor_id() in preemptible code in ext4_mb_new_blocks" reported
  by syzkaller"

[1] The pull request actually came in 15 minutes after I had tagged the
    rc1 release. Tssk, tssk, late..   - Linus

* tag 'ext4-for-linus-5.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4, jbd2: ensure panic by fix a race between jbd2 abort and ext4 error handlers
  ext4: support xattr gnu.* namespace for the Hurd
  ext4: mballoc: Use this_cpu_read instead of this_cpu_ptr
  ext4: avoid utf8_strncasecmp() with unstable name
  ext4: stop overwrite the errcode in ext4_setup_super
  ext4: fix partial cluster initialization when splitting extent
  ext4: avoid race conditions when remounting with options that change dax
  Documentation/dax: Update DAX enablement for ext4
  fs/ext4: Introduce DAX inode flag
  fs/ext4: Remove jflag variable
  fs/ext4: Make DAX mount option a tri-state
  fs/ext4: Only change S_DAX on inode load
  fs/ext4: Update ext4_should_use_dax()
  fs/ext4: Change EXT4_MOUNT_DAX to EXT4_MOUNT_DAX_ALWAYS
  fs/ext4: Disallow verity if inode is DAX
  fs/ext4: Narrow scope of DAX check in setflags
2020-06-15 09:32:10 -07:00
Theodore Ts'o
68cd44920d Enable ext4 support for per-file/directory dax operations
This adds the same per-file/per-directory DAX support for ext4 as was
done for xfs, now that we finally have consensus over what the
interface should be.
2020-06-11 10:51:44 -04:00
Linus Torvalds
0b166a57e6 A lot of bug fixes and cleanups for ext4, including:
* Fix performance problems found in dioread_nolock now that it is the
   default, caused by transaction leaks.
 * Clean up fiemap handling in ext4
 * Clean up and refactor multiple block allocator (mballoc) code
 * Fix a problem with mballoc with a smaller file systems running out
   of blocks because they couldn't properly use blocks that had been
   reserved by inode preallocation.
 * Fixed a race in ext4_sync_parent() versus rename()
 * Simplify the error handling in the extent manipulation code
 * Make sure all metadata I/O errors are felected to ext4_ext_dirty()'s and
   ext4_make_inode_dirty()'s callers.
 * Avoid passing an error pointer to brelse in ext4_xattr_set()
 * Fix race which could result to freeing an inode on the dirty last
   in data=journal mode.
 * Fix refcount handling if ext4_iget() fails
 * Fix a crash in generic/019 caused by a corrupted extent node
 -----BEGIN PGP SIGNATURE-----
 
 iQEyBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl7Ze8kACgkQ8vlZVpUN
 gaNChAf4xn0ytFSrweI/S2Sp05G/2L/ocZ2TZZk2ZdGeN1E+ABdSIv/zIF9zuFgZ
 /pY/C+fyEZWt4E3FlNO8gJzoEedkzMCMnUhSIfI+wZbcclyTOSNMJtnrnJKAEtVH
 HOvGZJmg357jy407RCGhZpJ773nwU2xhBTr5OFxvSf9mt/vzebxIOnw5D7HPlC1V
 Fgm6Du8q+tRrPsyjv1Yu4pUEVXMJ7qUcvt326AXVM3kCZO1Aa5GrURX0w3J4mzW1
 tc1tKmtbLcVVYTo9CwHXhk/edbxrhAydSP2iACand3tK6IJuI6j9x+bBJnxXitnr
 vsxsfTYMG18+2SxrJ9LwmagqmrRq
 =HMTs
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "A lot of bug fixes and cleanups for ext4, including:

   - Fix performance problems found in dioread_nolock now that it is the
     default, caused by transaction leaks.

   - Clean up fiemap handling in ext4

   - Clean up and refactor multiple block allocator (mballoc) code

   - Fix a problem with mballoc with a smaller file systems running out
     of blocks because they couldn't properly use blocks that had been
     reserved by inode preallocation.

   - Fixed a race in ext4_sync_parent() versus rename()

   - Simplify the error handling in the extent manipulation code

   - Make sure all metadata I/O errors are felected to
     ext4_ext_dirty()'s and ext4_make_inode_dirty()'s callers.

   - Avoid passing an error pointer to brelse in ext4_xattr_set()

   - Fix race which could result to freeing an inode on the dirty last
     in data=journal mode.

   - Fix refcount handling if ext4_iget() fails

   - Fix a crash in generic/019 caused by a corrupted extent node"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (58 commits)
  ext4: avoid unnecessary transaction starts during writeback
  ext4: don't block for O_DIRECT if IOCB_NOWAIT is set
  ext4: remove the access_ok() check in ext4_ioctl_get_es_cache
  fs: remove the access_ok() check in ioctl_fiemap
  fs: handle FIEMAP_FLAG_SYNC in fiemap_prep
  fs: move fiemap range validation into the file systems instances
  iomap: fix the iomap_fiemap prototype
  fs: move the fiemap definitions out of fs.h
  fs: mark __generic_block_fiemap static
  ext4: remove the call to fiemap_check_flags in ext4_fiemap
  ext4: split _ext4_fiemap
  ext4: fix fiemap size checks for bitmap files
  ext4: fix EXT4_MAX_LOGICAL_BLOCK macro
  add comment for ext4_dir_entry_2 file_type member
  jbd2: avoid leaking transaction credits when unreserving handle
  ext4: drop ext4_journal_free_reserved()
  ext4: mballoc: use lock for checking free blocks while retrying
  ext4: mballoc: refactor ext4_mb_good_group()
  ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling
  ext4: mballoc: refactor ext4_mb_discard_preallocations()
  ...
2020-06-05 16:19:28 -07:00
Christoph Hellwig
10c5db2864 fs: move the fiemap definitions out of fs.h
No need to pull the fiemap definitions into almost every file in the
kernel build.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-5-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:55 -04:00
Ritesh Harjani
175efa81fe ext4: fix EXT4_MAX_LOGICAL_BLOCK macro
ext4 supports max number of logical blocks in a file to be 0xffffffff.
(This is since ext4_extent's ee_block is __le32).
This means that EXT4_MAX_LOGICAL_BLOCK should be 0xfffffffe (starting
from 0 logical offset). This patch fixes this.

The issue was seen when ext4 moved to iomap_fiemap API and when
overlayfs was mounted on top of ext4. Since overlayfs was missing
filemap_check_ranges(), so it could pass a arbitrary huge length which
lead to overflow of map.m_len logic.

This patch fixes that.

Fixes: d3b6f23f71 ("ext4: move ext4_fiemap to use iomap framework")
Reported-by: syzbot+77fa5bdb65cc39711820@syzkaller.appspotmail.com
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20200505154324.3226743-2-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:54 -04:00
Jonathan Grant
9f364e1d95 add comment for ext4_dir_entry_2 file_type member
Signed-off-by: Jonathan Grant <jg@jguk.org>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/ad3290d5-86af-99c1-f9d5-cd1bab710429@jguk.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:54 -04:00
Ritesh Harjani
993778306e ext4: mballoc: use lock for checking free blocks while retrying
Currently while doing block allocation grp->bb_free may be getting
modified if discard is happening in parallel.
For e.g. consider a case where there are lot of threads who have
preallocated lot of blocks and there is a thread which is trying
to discard all of this group's PA. Now it could happen that
we see all of those group's bb_free is zero and fail the allocation
while there is sufficient space if we free up all the PA.

So this patch adds another flag "EXT4_MB_STRICT_CHECK" which will be set
if we are unable to allocate any blocks in the first try (since we may
not have considered blocks about to be discarded from PA lists).
So during retry attempt to allocate blocks we will use ext4_lock_group()
for checking if the group is good or not.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/9cb740a117c958c36596f167b12af1beae9a68b7.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:53 -04:00
Eric Biggers
de8ff14cab ext4: add casefold flag to EXT4_INODE_* flags
No one currently needs EXT4_INODE_CASEFOLD, but add it to keep the
EXT4_INODE_* definitions in sync with the EXT4_*_FL definitions.

Also make it clearer that the casefold flag is only for directories.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20200510215252.87833-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:53 -04:00
Ritesh Harjani
70aa1554b0 ext4: make ext_debug() implementation to use pr_debug()
ext_debug() msgs could be helpful, provided those could be enabled
without recompiling kernel and also if we could selectively enable
only required prints for case by case debugging.

So make ext_debug() implementation use pr_debug().
Also change ext_debug() to be defined with CONFIG_EXT4_DEBUG.
So EXT_DEBUG macro now mostly remain for below 3 functions.
ext4_ext_show_path/leaf/move() (whose print msgs use ext_debug()
which again could be dynamically enabled using pr_debug())

This also changes the ext_debug() to take inode as a parameter
to add inode no. in all of it's msgs.
Prints additional info like process name / pid, superblock id etc.
This also removes any explicit function names passed in ext_debug().
Since ext_debug() on it's own prints file, func and line no.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/d31dc189b0aeda9384fe7665e36da7cd8c61571f.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:52 -04:00
Ritesh Harjani
6db0746189 ext4: use BIT() macro for BH_** state bits
Simply use BIT() macro for all BH_** state bits instead of open
coding it.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/57667689f51a3f9dba2fcef7d3425187fa3ba69f.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:52 -04:00
Theodore Ts'o
73c384c0cd ext4: avoid ext4_error()'s caused by ENOMEM in the truncate path
We can't fail in the truncate path without requiring an fsck.
Add work around for this by using a combination of retry loops
and the __GFP_NOFAIL flag.

From: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Anna Pendleton <pendleton@google.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20200507175028.15061-1-pendleton@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:51 -04:00
Harshad Shirwadkar
4209ae12b1 ext4: handle ext4_mark_inode_dirty errors
ext4_mark_inode_dirty() can fail for real reasons. Ignoring its return
value may lead ext4 to ignore real failures that would result in
corruption / crashes. Harden ext4_mark_inode_dirty error paths to fail
as soon as possible and return errors to the caller whenever
appropriate.

One of the possible scnearios when this bug could affected is that
while creating a new inode, its directory entry gets added
successfully but while writing the inode itself mark_inode_dirty
returns error which is ignored. This would result in inconsistency
that the directory entry points to a non-existent inode.

Ran gce-xfstests smoke tests and verified that there were no
regressions.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20200427013438.219117-1-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:50 -04:00
Eric Whitney
9e52484c71 ext4: remove EXT4_GET_BLOCKS_KEEP_SIZE flag
The eofblocks code was removed in the 5.7 release by "ext4: remove
EOFBLOCKS_FL and associated code" (4337ecd1fe).  The ext4_map_blocks()
flag used to trigger it can now be removed as well.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200415203140.30349-2-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:48 -04:00
Matthew Wilcox (Oracle)
a07f624bd6 ext4: pass the inode to ext4_mpage_readpages
This function now only uses the mapping argument to look up the inode, and
both callers already have the inode, so just pass the inode instead of the
mapping.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-22-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Matthew Wilcox (Oracle)
6311f91f76 ext4: convert from readpages to readahead
Use the new readahead operation in ext4

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-21-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Linus Torvalds
afdb0f2ec5 fscrypt updates for 5.8
- Add the IV_INO_LBLK_32 encryption policy flag which modifies the
   encryption to be optimized for eMMC inline encryption hardware.
 
 - Make the test_dummy_encryption mount option for ext4 and f2fs support
   v2 encryption policies.
 
 - Fix kerneldoc warnings and some coding style inconsistencies.
 
 There will be merge conflicts with the ext4 and f2fs trees due to the
 test_dummy_encryption change, but the resolutions are straightforward.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCXtScMBQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOKxC6AP0eOEkMrc9e10YftdN6xsyRjvqiPyFg
 oMjuU+SvQ+/sVgEAo0mBFITnl75ZGb8PyqXCNMDAy6uHaxcEjVGufx5q2QE=
 =dbxy
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt

Pull fscrypt updates from Eric Biggers:

 - Add the IV_INO_LBLK_32 encryption policy flag which modifies the
   encryption to be optimized for eMMC inline encryption hardware.

 - Make the test_dummy_encryption mount option for ext4 and f2fs support
   v2 encryption policies.

 - Fix kerneldoc warnings and some coding style inconsistencies.

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
  fscrypt: add support for IV_INO_LBLK_32 policies
  fscrypt: make test_dummy_encryption use v2 by default
  fscrypt: support test_dummy_encryption=v2
  fscrypt: add fscrypt_add_test_dummy_key()
  linux/parser.h: add include guards
  fscrypt: remove unnecessary extern keywords
  fscrypt: name all function parameters
  fscrypt: fix all kerneldoc warnings
2020-06-01 12:10:17 -07:00
Ira Weiny
b383a73f2b fs/ext4: Introduce DAX inode flag
Add a flag ([EXT4|FS]_DAX_FL) to preserve FS_XFLAG_DAX in the ext4
inode.

Set the flag to be user visible and changeable.  Set the flag to be
inherited.  Allow applications to change the flag at any time except if
it conflicts with the set of mutually exclusive flags (Currently VERITY,
ENCRYPT, JOURNAL_DATA).

Furthermore, restrict setting any of the exclusive flags if DAX is set.

While conceptually possible, we do not allow setting EXT4_DAX_FL while
at the same time clearing exclusion flags (or vice versa) for 2 reasons:

	1) The DAX flag does not take effect immediately which
	   introduces quite a bit of complexity
	2) There is no clear use case for being this flexible

Finally, on regular files, flag the inode to not be cached to facilitate
changing S_DAX on the next creation of the inode.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Link: https://lore.kernel.org/r/20200528150003.828793-9-ira.weiny@intel.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-05-28 22:09:47 -04:00
Ira Weiny
9cb20f94af fs/ext4: Make DAX mount option a tri-state
We add 'always', 'never', and 'inode' (default).  '-o dax' continues to
operate the same which is equivalent to 'always'.  This new
functionality is limited to ext4 only.

Specifically we introduce a 2nd DAX mount flag EXT4_MOUNT2_DAX_NEVER and set
it and EXT4_MOUNT_DAX_ALWAYS appropriately for the mode.

We also force EXT4_MOUNT2_DAX_NEVER if !CONFIG_FS_DAX.

Finally, EXT4_MOUNT2_DAX_INODE is used solely to detect if the user
specified that option for printing.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Link: https://lore.kernel.org/r/20200528150003.828793-7-ira.weiny@intel.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-05-28 22:09:47 -04:00
Ira Weiny
043546e46d fs/ext4: Only change S_DAX on inode load
To prevent complications with in memory inodes we only set S_DAX on
inode load.  FS_XFLAG_DAX can be changed at any time and S_DAX will
change after inode eviction and reload.

Add init bool to ext4_set_inode_flags() to indicate if the inode is
being newly initialized.

Assert that S_DAX is not set on an inode which is just being loaded.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Link: https://lore.kernel.org/r/20200528150003.828793-6-ira.weiny@intel.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-05-28 22:09:47 -04:00
Ira Weiny
a8ab6d3885 fs/ext4: Update ext4_should_use_dax()
S_DAX should only be enabled when the underlying block device supports
dax.

Cache the underlying support for DAX in the super block and modify
ext4_should_use_dax() to check for device support prior to the over
riding mount option.

While we are at it change the function to ext4_should_enable_dax() as
this better reflects the ask as well as matches xfs.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Link: https://lore.kernel.org/r/20200528150003.828793-5-ira.weiny@intel.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-05-28 22:09:47 -04:00
Ira Weiny
fc626fe322 fs/ext4: Change EXT4_MOUNT_DAX to EXT4_MOUNT_DAX_ALWAYS
In prep for the new tri-state mount option which then introduces
EXT4_MOUNT_DAX_NEVER.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Link: https://lore.kernel.org/r/20200528150003.828793-4-ira.weiny@intel.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-05-28 22:09:47 -04:00
Ritesh Harjani
9f44eda195 ext4: fix EXT4_MAX_LOGICAL_BLOCK macro
ext4 supports max number of logical blocks in a file to be 0xffffffff.
(This is since ext4_extent's ee_block is __le32).
This means that EXT4_MAX_LOGICAL_BLOCK should be 0xfffffffe (starting
from 0 logical offset). This patch fixes this.

The issue was seen when ext4 moved to iomap_fiemap API and when
overlayfs was mounted on top of ext4. Since overlayfs was missing
filemap_check_ranges(), so it could pass a arbitrary huge length which
lead to overflow of map.m_len logic.

This patch fixes that.

Fixes: d3b6f23f71 ("ext4: move ext4_fiemap to use iomap framework")
Reported-by: syzbot+77fa5bdb65cc39711820@syzkaller.appspotmail.com
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20200505154324.3226743-2-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-05-19 15:03:37 -04:00
Eric Biggers
ed318a6cc0 fscrypt: support test_dummy_encryption=v2
v1 encryption policies are deprecated in favor of v2, and some new
features (e.g. encryption+casefolding) are only being added for v2.

Therefore, the "test_dummy_encryption" mount option (which is used for
encryption I/O testing with xfstests) needs to support v2 policies.

To do this, extend its syntax to be "test_dummy_encryption=v1" or
"test_dummy_encryption=v2".  The existing "test_dummy_encryption" (no
argument) also continues to be accepted, to specify the default setting
-- currently v1, but the next patch changes it to v2.

To cleanly support both v1 and v2 while also making it easy to support
specifying other encryption settings in the future (say, accepting
"$contents_mode:$filenames_mode:v2"), make ext4 and f2fs maintain a
pointer to the dummy fscrypt_context rather than using mount flags.

To avoid concurrency issues, don't allow test_dummy_encryption to be set
or changed during a remount.  (The former restriction is new, but
xfstests doesn't run into it, so no one should notice.)

Tested with 'gce-xfstests -c {ext4,f2fs}/encrypt -g auto'.  On ext4,
there are two regressions, both of which are test bugs: ext4/023 and
ext4/028 fail because they set an xattr and expect it to be stored
inline, but the increase in size of the fscrypt_context from
24 to 40 bytes causes this xattr to be spilled into an external block.

Link: https://lore.kernel.org/r/20200512233251.118314-4-ebiggers@kernel.org
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-05-18 20:21:48 -07:00
Theodore Ts'o
54d3adbc29 ext4: save all error info in save_error_info() and drop ext4_set_errno()
Using a separate function, ext4_set_errno() to set the errno is
problematic because it doesn't do the right thing once
s_last_error_errorcode is non-zero.  It's also less racy to set all of
the error information all at once.  (Also, as a bonus, it shrinks code
size slightly.)

Link: https://lore.kernel.org/r/20200329020404.686965-1-tytso@mit.edu
Fixes: 878520ac45 ("ext4: save the error code which triggered...")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-04-01 17:29:06 -04:00
Eric Whitney
4337ecd1fe ext4: remove EXT4_EOFBLOCKS_FL and associated code
The EXT4_EOFBLOCKS_FL inode flag is used to indicate whether a file
contains unwritten blocks past i_size.  It's set when ext4_fallocate
is called with the KEEP_SIZE flag to extend a file with an unwritten
extent.  However, this flag hasn't been useful functionally since
March, 2012, when a decision was made to remove it from ext4.

All traces of EXT4_EOFBLOCKS_FL were removed from e2fsprogs version
1.42.2 by commit 010dc7b90d97 ("e2fsck: remove EXT4_EOFBLOCKS_FL flag
handling") at that time.  Now that enough time has passed to make
e2fsprogs versions containing this modification common, this patch now
removes the code associated with EXT4_EOFBLOCKS_FL from the kernel as
well.

This change has two implications.  First, because pre-1.42.2 e2fsck
versions only look for a problem if EXT4_EOFBLOCKS_FL is set, and
because that bit will never be set by newer kernels containing this
patch, old versions of e2fsck won't have a compatibility problem with
files created by newer kernels.

Second, newer kernels will not clear EXT4_EOFBLOCKS_FL inode flag bits
belonging to a file written by an older kernel.  If set, it will remain
in that state until the file is deleted.  Because e2fsck versions since
1.42.2 don't check the flag at all, no adverse effect is expected.
However, pre-1.42.2 e2fsck versions that do check the flag may report
that it is set when it ought not to be after a file has been truncated
or had its unwritten blocks written.  In this case, the old version of
e2fsck will offer to clear the flag.  No adverse effect would then
occur whether the user chooses to clear the flag or not.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200211210216.24960-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-03-05 15:55:30 -05:00
Eric Biggers
cb85f4d23f ext4: fix race between writepages and enabling EXT4_EXTENTS_FL
If EXT4_EXTENTS_FL is set on an inode while ext4_writepages() is running
on it, the following warning in ext4_add_complete_io() can be hit:

WARNING: CPU: 1 PID: 0 at fs/ext4/page-io.c:234 ext4_put_io_end_defer+0xf0/0x120

Here's a minimal reproducer (not 100% reliable) (root isn't required):

        while true; do
                sync
        done &
        while true; do
                rm -f file
                touch file
                chattr -e file
                echo X >> file
                chattr +e file
        done

The problem is that in ext4_writepages(), ext4_should_dioread_nolock()
(which only returns true on extent-based files) is checked once to set
the number of reserved journal credits, and also again later to select
the flags for ext4_map_blocks() and copy the reserved journal handle to
ext4_io_end::handle.  But if EXT4_EXTENTS_FL is being concurrently set,
the first check can see dioread_nolock disabled while the later one can
see it enabled, causing the reserved handle to unexpectedly be NULL.

Since changing EXT4_EXTENTS_FL is uncommon, and there may be other races
related to doing so as well, fix this by synchronizing changing
EXT4_EXTENTS_FL with ext4_writepages() via the existing
s_writepages_rwsem (previously called s_journal_flag_rwsem).

This was originally reported by syzbot without a reproducer at
https://syzkaller.appspot.com/bug?extid=2202a584a00fffd19fbf,
but now that dioread_nolock is the default I also started seeing this
when running syzkaller locally.

Link: https://lore.kernel.org/r/20200219183047.47417-3-ebiggers@kernel.org
Reported-by: syzbot+2202a584a00fffd19fbf@syzkaller.appspotmail.com
Fixes: 6b523df4fb ("ext4: use transaction reservation for extent conversion in ext4_end_io")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
2020-02-21 19:32:07 -05:00
Eric Biggers
bbd55937de ext4: rename s_journal_flag_rwsem to s_writepages_rwsem
In preparation for making s_journal_flag_rwsem synchronize
ext4_writepages() with changes to both the EXTENTS and JOURNAL_DATA
flags (rather than just JOURNAL_DATA as it does currently), rename it to
s_writepages_rwsem.

Link: https://lore.kernel.org/r/20200219183047.47417-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
2020-02-21 19:32:07 -05:00
Suraj Jitindar Singh
7c990728b9 ext4: fix potential race between s_flex_groups online resizing and access
During an online resize an array of s_flex_groups structures gets replaced
so it can get enlarged. If there is a concurrent access to the array and
this memory has been reused then this can lead to an invalid memory access.

The s_flex_group array has been converted into an array of pointers rather
than an array of structures. This is to ensure that the information
contained in the structures cannot get out of sync during a resize due to
an accessor updating the value in the old structure after it has been
copied but before the array pointer is updated. Since the structures them-
selves are no longer copied but only the pointers to them this case is
mitigated.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
Link: https://lore.kernel.org/r/20200221053458.730016-4-tytso@mit.edu
Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2020-02-21 19:31:46 -05:00
Suraj Jitindar Singh
df3da4ea5a ext4: fix potential race between s_group_info online resizing and access
During an online resize an array of pointers to s_group_info gets replaced
so it can get enlarged. If there is a concurrent access to the array in
ext4_get_group_info() and this memory has been reused then this can lead to
an invalid memory access.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
Link: https://lore.kernel.org/r/20200221053458.730016-3-tytso@mit.edu
Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Balbir Singh <sblbir@amazon.com>
Cc: stable@kernel.org
2020-02-21 00:38:12 -05:00
Theodore Ts'o
1d0c3924a9 ext4: fix potential race between online resizing and write operations
During an online resize an array of pointers to buffer heads gets
replaced so it can get enlarged.  If there is a racing block
allocation or deallocation which uses the old array, and the old array
has gotten reused this can lead to a GPF or some other random kernel
memory getting modified.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
Link: https://lore.kernel.org/r/20200221053458.730016-2-tytso@mit.edu
Reported-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2020-02-21 00:37:09 -05:00
Qian Cai
35df4299a6 ext4: fix a data race in EXT4_I(inode)->i_disksize
EXT4_I(inode)->i_disksize could be accessed concurrently as noticed by
KCSAN,

 BUG: KCSAN: data-race in ext4_write_end [ext4] / ext4_writepages [ext4]

 write to 0xffff91c6713b00f8 of 8 bytes by task 49268 on cpu 127:
  ext4_write_end+0x4e3/0x750 [ext4]
  ext4_update_i_disksize at fs/ext4/ext4.h:3032
  (inlined by) ext4_update_inode_size at fs/ext4/ext4.h:3046
  (inlined by) ext4_write_end at fs/ext4/inode.c:1287
  generic_perform_write+0x208/0x2a0
  ext4_buffered_write_iter+0x11f/0x210 [ext4]
  ext4_file_write_iter+0xce/0x9e0 [ext4]
  new_sync_write+0x29c/0x3b0
  __vfs_write+0x92/0xa0
  vfs_write+0x103/0x260
  ksys_write+0x9d/0x130
  __x64_sys_write+0x4c/0x60
  do_syscall_64+0x91/0xb47
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 read to 0xffff91c6713b00f8 of 8 bytes by task 24872 on cpu 37:
  ext4_writepages+0x10ac/0x1d00 [ext4]
  mpage_map_and_submit_extent at fs/ext4/inode.c:2468
  (inlined by) ext4_writepages at fs/ext4/inode.c:2772
  do_writepages+0x5e/0x130
  __writeback_single_inode+0xeb/0xb20
  writeback_sb_inodes+0x429/0x900
  __writeback_inodes_wb+0xc4/0x150
  wb_writeback+0x4bd/0x870
  wb_workfn+0x6b4/0x960
  process_one_work+0x54c/0xbe0
  worker_thread+0x80/0x650
  kthread+0x1e0/0x200
  ret_from_fork+0x27/0x50

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 37 PID: 24872 Comm: kworker/u261:2 Tainted: G        W  O L 5.5.0-next-20200204+ #5
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
 Workqueue: writeback wb_workfn (flush-7:0)

Since only the read is operating as lockless (outside of the
"i_data_sem"), load tearing could introduce a logic bug. Fix it by
adding READ_ONCE() for the read and WRITE_ONCE() for the write.

Signed-off-by: Qian Cai <cai@lca.pw>
Link: https://lore.kernel.org/r/1581085751-31793-1-git-send-email-cai@lca.pw
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2020-02-19 23:17:02 -05:00
Jan Kara
48a3431195 ext4: fix checksum errors with indexed dirs
DIR_INDEX has been introduced as a compat ext4 feature. That means that
even kernels / tools that don't understand the feature may modify the
filesystem. This works because for kernels not understanding indexed dir
format, internal htree nodes appear just as empty directory entries.
Index dir aware kernels then check the htree structure is still
consistent before using the data. This all worked reasonably well until
metadata checksums were introduced. The problem is that these
effectively made DIR_INDEX only ro-compatible because internal htree
nodes store checksums in a different place than normal directory blocks.
Thus any modification ignorant to DIR_INDEX (or just clearing
EXT4_INDEX_FL from the inode) will effectively cause checksum mismatch
and trigger kernel errors. So we have to be more careful when dealing
with indexed directories on filesystems with checksumming enabled.

1) We just disallow loading any directory inodes with EXT4_INDEX_FL when
DIR_INDEX is not enabled. This is harsh but it should be very rare (it
means someone disabled DIR_INDEX on existing filesystem and didn't run
e2fsck), e2fsck can fix the problem, and we don't want to answer the
difficult question: "Should we rather corrupt the directory more or
should we ignore that DIR_INDEX feature is not set?"

2) When we find out htree structure is corrupted (but the filesystem and
the directory should in support htrees), we continue just ignoring htree
information for reading but we refuse to add new entries to the
directory to avoid corrupting it more.

Link: https://lore.kernel.org/r/20200210144316.22081-1-jack@suse.cz
Fixes: dbe8944404 ("ext4: Calculate and verify checksums for htree nodes")
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2020-02-13 11:56:19 -05:00
Theodore Ts'o
71b565ceff ext4: drop ext4_kvmalloc()
As Jan pointed out[1], as of commit 81378da64d ("jbd2: mark the
transaction context with the scope GFP_NOFS context") we use
memalloc_nofs_{save,restore}() while a jbd2 handle is active.  So
ext4_kvmalloc() so we can call allocate using GFP_NOFS is no longer
necessary.

[1] https://lore.kernel.org/r/20200109100007.GC27035@quack2.suse.cz

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20200116155031.266620-1-tytso@mit.edu
Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17 16:24:55 -05:00
Eric Biggers
43f816772f ext4: make some functions static in extents.c
Make the following functions static since they're only used in
extents.c:

	__ext4_ext_dirty()
	ext4_can_extents_be_merged()
	ext4_collapse_range()
	ext4_insert_range()

Also remove the prototype for ext4_ext_writepage_trans_blocks(), as this
function is not defined anywhere.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20191231180444.46586-5-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17 16:24:54 -05:00
Eric Biggers
dd6683e6ef ext4: remove ext4_{ind,ext}_calc_metadata_amount()
Remove the ext4_ind_calc_metadata_amount() and
ext4_ext_calc_metadata_amount() functions, which have been unused since
commit 71d4f7d032 ("ext4: remove metadata reservation checks").

Also remove the i_da_metadata_calc_last_lblock and
i_da_metadata_calc_len fields from struct ext4_inode_info, as these were
only used by these removed functions.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20191231180444.46586-2-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17 16:24:54 -05:00
Naoto Kobayashi
8f27fd0ab5 ext4: Delete ext4_kvzvalloc()
Since we're not using ext4_kvzalloc(), delete this function.

Signed-off-by: Naoto Kobayashi <naoto.kobayashi4c@gmail.com>
Link: https://lore.kernel.org/r/20191227080523.31808-2-naoto.kobayashi4c@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17 16:24:53 -05:00
Jan Kara
8cd115bdda ext4: Optimize ext4 DIO overwrites
Currently we start transaction for mapping every extent for writing
using direct IO. This is unnecessary when we know we are overwriting
already allocated blocks and the overhead of starting a transaction can
be significant especially for multithreaded workloads doing small writes.
Use iomap operations that avoid starting a transaction for direct IO
overwrites.

This improves throughput of 4k random writes - fio jobfile:
[global]
rw=randrw
norandommap=1
invalidate=0
bs=4k
numjobs=16
time_based=1
ramp_time=30
runtime=120
group_reporting=1
ioengine=psync
direct=1
size=16G
filename=file1.0.0:file1.0.1:file1.0.2:file1.0.3:file1.0.4:file1.0.5:file1.0.6:file1.0.7:file1.0.8:file1.0.9:file1.0.10:file1.0.11:file1.0.12:file1.0.13:file1.0.14:file1.0.15:file1.0.16:file1.0.17:file1.0.18:file1.0.19:file1.0.20:file1.0.21:file1.0.22:file1.0.23:file1.0.24:file1.0.25:file1.0.26:file1.0.27:file1.0.28:file1.0.29:file1.0.30:file1.0.31
file_service_type=random
nrfiles=32

from 3018MB/s to 4059MB/s in my test VM running test against simulated
pmem device (note that before iomap conversion, this workload was able
to achieve 3708MB/s because old direct IO path avoided transaction start
for overwrites as well). For dax, the win is even larger improving
throughput from 3042MB/s to 4311MB/s.

Reported-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191218174433.19380-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-26 11:57:18 -05:00
Theodore Ts'o
46f870d690 ext4: simulate various I/O and checksum errors when reading metadata
This allows us to test various error handling code paths

Link: https://lore.kernel.org/r/20191209012317.59398-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-26 11:28:31 -05:00
Theodore Ts'o
878520ac45 ext4: save the error code which triggered an ext4_error() in the superblock
This allows the cause of an ext4_error() report to be categorized
based on whether it was triggered due to an I/O error, or an memory
allocation error, or other possible causes.  Most errors are caused by
a detected file system inconsistency, so the default code stored in
the superblock will be EXT4_ERR_EFSCORRUPTED.

Link: https://lore.kernel.org/r/20191204032335.7683-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-26 11:28:23 -05:00
Linus Torvalds
50b8b3f85a This merge window saw the the following new featuers added to ext4:
* Direct I/O via iomap (required the iomap-for-next branch from Darrick
    as a prereq).
  * Support for using dioread-nolock where the block size < page size.
  * Support for encryption for file systems where the block size < page size.
  * Rework of journal credits handling so a revoke-heavy workload will
    not cause the journal to run out of space.
  * Replace bit-spinlocks with spinlocks in jbd2
 
 Also included were some bug fixes and cleanups, mostly to clean up
 corner cases from fuzzed file systems and error path handling.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl3dHxoACgkQ8vlZVpUN
 gaMZswf5AbtQhTEJDXO7Pc1ull38GIGFgAv7uAth0TymLC3h1/FEYWW0crEPFsDr
 1Eei55UUVOYrMMUKQ4P7wlLX0cIh3XDPMWnRFuqBoV5/ZOsH/ZSbkY//TG2Xze/v
 9wXIH/RKQnzbRtXffJ1+DnvmXJk+HFm1R1gjl0nfyUXGrnlSfqJxhLSczyd6bJJq
 ehi/tso5UC/4EQsAIdWp7VWsAdaHcZ7ogHqDoy8dXpM1equ408iml7VlKr8R+Nr7
 5ANpCISXChSlLLYm0NYN5vhO8upF5uDxWLdCtxVPL5kFdM2m/ELjXw9h9C+78l7C
 EWJGlGlxvx07Px+e+bfStEsoixpWBg==
 =0eko
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "This merge window saw the the following new featuers added to ext4:

   - Direct I/O via iomap (required the iomap-for-next branch from
     Darrick as a prereq).

   - Support for using dioread-nolock where the block size < page size.

   - Support for encryption for file systems where the block size < page
     size.

   - Rework of journal credits handling so a revoke-heavy workload will
     not cause the journal to run out of space.

   - Replace bit-spinlocks with spinlocks in jbd2

  Also included were some bug fixes and cleanups, mostly to clean up
  corner cases from fuzzed file systems and error path handling"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (59 commits)
  ext4: work around deleting a file with i_nlink == 0 safely
  ext4: add more paranoia checking in ext4_expand_extra_isize handling
  jbd2: make jbd2_handle_buffer_credits() handle reserved handles
  ext4: fix a bug in ext4_wait_for_tail_page_commit
  ext4: bio_alloc with __GFP_DIRECT_RECLAIM never fails
  ext4: code cleanup for get_next_id
  ext4: fix leak of quota reservations
  ext4: remove unused variable warning in parse_options()
  ext4: Enable encryption for subpage-sized blocks
  fs/buffer.c: support fscrypt in block_read_full_page()
  ext4: Add error handling for io_end_vec struct allocation
  jbd2: Fine tune estimate of necessary descriptor blocks
  jbd2: Provide trace event for handle restarts
  ext4: Reserve revoke credits for freed blocks
  jbd2: Make credit checking more strict
  jbd2: Rename h_buffer_credits to h_total_credits
  jbd2: Reserve space for revoke descriptor blocks
  jbd2: Drop jbd2_space_needed()
  jbd2: Account descriptor blocks into t_outstanding_credits
  jbd2: Factor out common parts of stopping and restarting a handle
  ...
2019-11-30 10:53:02 -08:00
Eric Biggers
b925acb8f8 ext4: add support for IV_INO_LBLK_64 encryption policies
IV_INO_LBLK_64 encryption policies have special requirements from the
filesystem beyond those of the existing encryption policies:

- Inode numbers must never change, even if the filesystem is resized.
- Inode numbers must be <= 32 bits.
- File logical block numbers must be <= 32 bits.

ext4 has 32-bit inode and file logical block numbers.  However,
resize2fs can re-number inodes when shrinking an ext4 filesystem.

However, typically the people who would want to use this format don't
care about filesystem shrinking.  They'd be fine with a solution that
just prevents the filesystem from being shrunk.

Therefore, add a new feature flag EXT4_FEATURE_COMPAT_STABLE_INODES that
will do exactly that.  Then wire up the fscrypt_operations to expose
this flag to fs/crypto/, so that it allows IV_INO_LBLK_64 policies when
this flag is set.

Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-11-06 12:34:42 -08:00
Theodore Ts'o
8d0d47ea16 Merge branch 'mb/dio' into master 2019-11-05 16:21:09 -05:00
Theodore Ts'o
a6d4040846 Merge branch 'jk/jbd2-revoke-overflow' 2019-11-05 16:02:20 -05:00
Jan Kara
83448bdfb5 ext4: Reserve revoke credits for freed blocks
So far we have reserved only relatively high fixed amount of revoke
credits for each transaction. We over-reserved by large amount for most
cases but when freeing large directories or files with data journalling,
the fixed amount is not enough. In fact the worst case estimate is
inconveniently large (maximum extent size) for freeing of one extent.

We fix this by doing proper estimate of the amount of blocks that need
to be revoked when removing blocks from the inode due to truncate or
hole punching and otherwise reserve just a small amount of revoke
credits for each transaction to accommodate freeing of xattrs block or
so.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-23-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 16:00:49 -05:00
Jan Kara
a413036791 ext4: Provide function to handle transaction restarts
Provide ext4_journal_ensure_credits_fn() function to ensure transaction
has given amount of credits and call helper function to prepare for
restarting a transaction. This allows to remove some boilerplate code
from various places, add proper error handling for the case where
transaction extension or restart fails, and reduces following changes
needed for proper revoke record reservation tracking.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-10-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 16:00:48 -05:00
Matthew Bobrowski
378f32bab3 ext4: introduce direct I/O write using iomap infrastructure
This patch introduces a new direct I/O write path which makes use of
the iomap infrastructure.

All direct I/O writes are now passed from the ->write_iter() callback
through to the new direct I/O handler ext4_dio_write_iter(). This
function is responsible for calling into the iomap infrastructure via
iomap_dio_rw().

Code snippets from the existing direct I/O write code within
ext4_file_write_iter() such as, checking whether the I/O request is
unaligned asynchronous I/O, or whether the write will result in an
overwrite have effectively been moved out and into the new direct I/O
->write_iter() handler.
The block mapping flags that are eventually passed down to
ext4_map_blocks() from the *_get_block_*() suite of routines have been
taken out and introduced within ext4_iomap_alloc().

For inode extension cases, ext4_handle_inode_extension() is
effectively the function responsible for performing such metadata
updates. This is called after iomap_dio_rw() has returned so that we
can safely determine whether we need to potentially truncate any
allocated blocks that may have been prepared for this direct I/O
write. We don't perform the inode extension, or truncate operations
from the ->end_io() handler as we don't have the original I/O 'length'
available there. The ->end_io() however is responsible fo converting
allocated unwritten extents to written extents.

In the instance of a short write, we fallback and complete the
remainder of the I/O using buffered I/O via
ext4_buffered_write_iter().

The existing buffer_head direct I/O implementation has been removed as
it's now redundant.

[ Fix up ext4_dio_write_iter() per Jan's comments at
  https://lore.kernel.org/r/20191105135932.GN22379@quack2.suse.cz -- TYT ]

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/e55db6f12ae6ff017f36774135e79f3e7b0333da.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 15:53:28 -05:00
Matthew Bobrowski
09edf4d381 ext4: introduce new callback for IOMAP_REPORT
As part of the ext4_iomap_begin() cleanups that precede this patch, we
also split up the IOMAP_REPORT branch into a completely separate
->iomap_begin() callback named ext4_iomap_begin_report(). Again, the
raionale for this change is to reduce the overall clutter within
ext4_iomap_begin().

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/5c97a569e26ddb6696e3d3ac9fbde41317e029a0.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05 11:31:40 -05:00
Ritesh Harjani
c8cc88163f ext4: Add support for blocksize < pagesize in dioread_nolock
This patch adds the support for blocksize < pagesize for
dioread_nolock feature.

Since in case of blocksize < pagesize, we can have multiple
small buffers of page as unwritten extents, we need to
maintain a vector of these unwritten extents which needs
the conversion after the IO is complete. Thus, we maintain
a list of tuple <offset, size> pair (io_end_vec) for this &
traverse this list to do the unwritten to written conversion.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20191016073711.4141-5-riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-10-22 15:32:53 -04:00
Ritesh Harjani
a00713ea98 ext4: Add API to bring in support for unwritten io_end_vec conversion
This patch just brings in the API for conversion of unwritten io_end_vec
extents which will be required for blocksize < pagesize support
for dioread_nolock feature.

No functional changes in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20191016073711.4141-3-riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-10-22 15:32:53 -04:00
Linus Torvalds
70cb0d02b5 Added new ext4 debugging ioctls to allow userspace to get information
about the state of the extent status cache.
 
 Dropped workaround for pre-1970 dates which were encoded incorrectly
 in pre-4.4 kernels.  Since both the kernel correctly generates, and
 e2fsck detects and fixes this issue for the past four years, it'e time
 to drop the workaround.  (Also, it's not like files with dates in the
 distant past were all that common in the first place.)
 
 A lot of miscellaneous bug fixes and cleanups, including some ext4
 Documentation fixes.  Also included are two minor bug fixes in
 fs/unicode.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl2D5ZIACgkQ8vlZVpUN
 gaO8NQf+ONLK5nu8KUk14uh8MOXMisiT+g1iqhynZcqtuZzTr4nKqUbHLmPDHrCC
 RiD/gkLhp6u+UlzYRJq6nudunid1be2/1bjoUm6lddE4XLtbeGHhZsGn1+9K/wy+
 l8UFMXd8fCOlXNzajS85Hb0KSuzlrGYEjSrNecSa3KLxrv1kM1+FyKFcqQ7Ejs5/
 VZYNtWo69R4wSEIawGkEZuNu/wFeLOzqJgxFJLo6zFxTAp449bbEduz12ssmkUhl
 QbXH9cXLR4pAZykzMRqHC8UFFTKmpLnc5EiT1Ajxzu4EAzB1SzqRJvbz/3CF3d/Z
 gBKDrDlasv75VJqVtqw4mCxmEoEYjw==
 =Iwrf
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Added new ext4 debugging ioctls to allow userspace to get information
  about the state of the extent status cache.

  Dropped workaround for pre-1970 dates which were encoded incorrectly
  in pre-4.4 kernels. Since both the kernel correctly generates, and
  e2fsck detects and fixes this issue for the past four years, it'e time
  to drop the workaround. (Also, it's not like files with dates in the
  distant past were all that common in the first place.)

  A lot of miscellaneous bug fixes and cleanups, including some ext4
  Documentation fixes. Also included are two minor bug fixes in
  fs/unicode"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits)
  unicode: make array 'token' static const, makes object smaller
  unicode: Move static keyword to the front of declarations
  ext4: add missing bigalloc documentation.
  ext4: fix kernel oops caused by spurious casefold flag
  ext4: fix integer overflow when calculating commit interval
  ext4: use percpu_counters for extent_status cache hits/misses
  ext4: fix potential use after free after remounting with noblock_validity
  jbd2: add missing tracepoint for reserved handle
  ext4: fix punch hole for inline_data file systems
  ext4: rework reserved cluster accounting when invalidating pages
  ext4: documentation fixes
  ext4: treat buffers with write errors as containing valid data
  ext4: fix warning inside ext4_convert_unwritten_extents_endio
  ext4: set error return correctly when ext4_htree_store_dirent fails
  ext4: drop legacy pre-1970 encoding workaround
  ext4: add new ioctl EXT4_IOC_GET_ES_CACHE
  ext4: add a new ioctl EXT4_IOC_GETSTATE
  ext4: add a new ioctl EXT4_IOC_CLEAR_ES_CACHE
  jbd2: flush_descriptor(): Do not decrease buffer head's ref count
  ext4: remove unnecessary error check
  ...
2019-09-21 13:37:39 -07:00
Linus Torvalds
cfb82e1df8 y2038: add inode timestamp clamping
This series from Deepa Dinamani adds a per-superblock minimum/maximum
 timestamp limit for a file system, and clamps timestamps as they are
 written, to avoid random behavior from integer overflow as well as having
 different time stamps on disk vs in memory.
 
 At mount time, a warning is now printed for any file system that can
 represent current timestamps but not future timestamps more than 30
 years into the future, similar to the arbitrary 30 year limit that was
 added to settimeofday().
 
 This was picked as a compromise to warn users to migrate to other file
 systems (e.g. ext4 instead of ext3) when they need the file system to
 survive beyond 2038 (or similar limits in other file systems), but not
 get in the way of normal usage.
 
 Signed-off-by: Arnd Bergmann <arnd@arndb.de>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJdcs20AAoJEJpsee/mABjZaOwQALl3lBEhg0aV6a0ZZ1uYehtd
 vcjZ6OpehfiOAxYJu0wfLPATo4T0FuBxZKz3+trkJDICcxyc68AJ2wijwInIQnZW
 MrSKnPyv/fSGp8Jr5w/0CLdp6yT6Dh7z4j2UxhwusR1bQh4cCYSswDg29/nmxgKp
 Nu8m7jMvJQ2Q0r4Zy0sT/MaycUcSH5yvpyTcsYFixGOz1niNy91ISs1+aq6HZ3i3
 +cuYTUy13y40iNUHzFBTcJItBnikwZOQ/zjNfJFXZ3bVEUPg8ZTLPYQ0OZz+pM0Z
 AlXCKghb2EOKgq729LtA6oaY+Nom/1Gm1p80q3G+nGRVOqRgC+dfAVPZQoiER5Y1
 zNPEDf2Sf7J9xktvfC+Qqa9QEUPLKs22ZIccG+vYBW65sS8IAiEDH3LAt444GGls
 yB/Cx/Qw7BftpR5Om27Mhm5jDQzr43iTkZaPQWq7ydJXpfxnjlg9L19yS1omDFyV
 hdbBXY6FikUICPKUW6I49z5BhjL+kmK9M2DVljImmdKNDTrfr0xY5M/EWjJZ7X+I
 rnSe9qTY+iQ5/AXANn5wfj1Y6L5IxkmdWI/zDIbKhYMZLCqqFLd3mJERbs+CMDJq
 qNrYyFPReFrg50oSduBPAByMTR4x9hus7iIC7r77kpoz5i60DPmIJoTfFm3844Gv
 sBEyvWV08CpE9mSzXuv6
 =em9y
 -----END PGP SIGNATURE-----

Merge tag 'y2038-vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground

Pull y2038 vfs updates from Arnd Bergmann:
 "Add inode timestamp clamping.

  This series from Deepa Dinamani adds a per-superblock minimum/maximum
  timestamp limit for a file system, and clamps timestamps as they are
  written, to avoid random behavior from integer overflow as well as
  having different time stamps on disk vs in memory.

  At mount time, a warning is now printed for any file system that can
  represent current timestamps but not future timestamps more than 30
  years into the future, similar to the arbitrary 30 year limit that was
  added to settimeofday().

  This was picked as a compromise to warn users to migrate to other file
  systems (e.g. ext4 instead of ext3) when they need the file system to
  survive beyond 2038 (or similar limits in other file systems), but not
  get in the way of normal usage"

* tag 'y2038-vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
  ext4: Reduce ext4 timestamp warnings
  isofs: Initialize filesystem timestamp ranges
  pstore: fs superblock limits
  fs: omfs: Initialize filesystem timestamp ranges
  fs: hpfs: Initialize filesystem timestamp ranges
  fs: ceph: Initialize filesystem timestamp ranges
  fs: sysv: Initialize filesystem timestamp ranges
  fs: affs: Initialize filesystem timestamp ranges
  fs: fat: Initialize filesystem timestamp ranges
  fs: cifs: Initialize filesystem timestamp ranges
  fs: nfs: Initialize filesystem timestamp ranges
  ext4: Initialize timestamps limits
  9p: Fill min and max timestamps in sb
  fs: Fill in max and min timestamps in superblock
  utimes: Clamp the timestamps before update
  mount: Add mount warning for impending timestamp expiry
  timestamp_truncate: Replace users of timespec64_trunc
  vfs: Add timestamp_truncate() api
  vfs: Add file timestamp range support
2019-09-19 09:42:37 -07:00
Deepa Dinamani
cba465b4f9 ext4: Reduce ext4 timestamp warnings
When ext4 file systems were created intentionally with 128 byte inodes,
the rate-limited warning of eventual possible timestamp overflow are
still emitted rather frequently.  Remove the warning for now.

Discussion for whether any warning is needed,
and where it should be emitted, can be found at
https://lore.kernel.org/lkml/1567523922.5576.57.camel@lca.pw/.
I can post a separate follow-up patch after the conclusion.

Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-09-04 22:54:53 +02:00
Deepa Dinamani
4881c4971d ext4: Initialize timestamps limits
ext4 has different overflow limits for max filesystem
timestamps based on the extra bytes available.

The timestamp limits are calculated according to the
encoding table in
a4dad1ae24f85i(ext4: Fix handling of extended tv_sec):

* extra  msb of                         adjust for signed
* epoch  32-bit                         32-bit tv_sec to
* bits   time    decoded 64-bit tv_sec  64-bit tv_sec      valid time range
* 0 0    1    -0x80000000..-0x00000001  0x000000000   1901-12-13..1969-12-31
* 0 0    0    0x000000000..0x07fffffff  0x000000000   1970-01-01..2038-01-19
* 0 1    1    0x080000000..0x0ffffffff  0x100000000   2038-01-19..2106-02-07
* 0 1    0    0x100000000..0x17fffffff  0x100000000   2106-02-07..2174-02-25
* 1 0    1    0x180000000..0x1ffffffff  0x200000000   2174-02-25..2242-03-16
* 1 0    0    0x200000000..0x27fffffff  0x200000000   2242-03-16..2310-04-04
* 1 1    1    0x280000000..0x2ffffffff  0x300000000   2310-04-04..2378-04-22
* 1 1    0    0x300000000..0x37fffffff  0x300000000   2378-04-22..2446-05-10

Note that the time limits are not correct for deletion times.

Added a warn when an inode cannot be extended to incorporate an
extended timestamp.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Acked-by: Jeff Layton <jlayton@kernel.org>
Cc: tytso@mit.edu
Cc: adilger.kernel@dilger.ca
Cc: linux-ext4@vger.kernel.org
2019-08-30 07:27:18 -07:00
zhangyi (F)
7727ae5297 ext4: fix potential use after free after remounting with noblock_validity
Remount process will release system zone which was allocated before if
"noblock_validity" is specified. If we mount an ext4 file system to two
mountpoints with default mount options, and then remount one of them
with "noblock_validity", it may trigger a use after free problem when
someone accessing the other one.

 # mount /dev/sda foo
 # mount /dev/sda bar

User access mountpoint "foo"   |   Remount mountpoint "bar"
                               |
ext4_map_blocks()              |   ext4_remount()
check_block_validity()         |   ext4_setup_system_zone()
ext4_data_block_valid()        |   ext4_release_system_zone()
                               |   free system_blks rb nodes
access system_blks rb nodes    |
trigger use after free         |

This problem can also be reproduced by one mountpint, At the same time,
add_system_zone() can get called during remount as well so there can be
racing ext4_data_block_valid() reading the rbtree at the same time.

This patch add RCU to protect system zone from releasing or building
when doing a remount which inverse current "noblock_validity" mount
option. It assign the rbtree after the whole tree was complete and
do actual freeing after rcu grace period, avoid any intermediate state.

Reported-by: syzbot+1e470567330b7ad711d5@syzkaller.appspotmail.com
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2019-08-28 11:13:24 -04:00
Eric Whitney
8fcc3a5806 ext4: rework reserved cluster accounting when invalidating pages
The goal of this patch is to remove two references to the buffer delay
bit in ext4_da_page_release_reservation() as part of a larger effort
to remove all such references from ext4.  These two references are
principally used to reduce the reserved block/cluster count when pages
are invalidated as a result of truncating, punching holes, or
collapsing a block range in a file.  The entire function is removed
and replaced with code in ext4_es_remove_extent() that reduces the
reserved count as a side effect of removing a block range from delayed
and not unwritten extents in the extent status tree as is done when
truncating, punching holes, or collapsing ranges.

The code is written to minimize the number of searches descending from
rb tree roots for scalability.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-08-22 23:22:14 -04:00
ZhangXiaoxu
7963e5ac90 ext4: treat buffers with write errors as containing valid data
I got some errors when I repair an ext4 volume which stacked by an
iscsi target:
    Entry 'test60' in / (2) has deleted/unused inode 73750.  Clear?
It can be reproduced when the network not good enough.

When I debug this I found ext4 will read entry buffer from disk and
the buffer is marked with write_io_error.

If the buffer is marked with write_io_error, it means it already
wroten to journal, and not checked out to disk. IOW, the journal
is newer than the data in disk.
If this journal record 'delete test60', it means the 'test60' still
on the disk metadata.

In this case, if we read the buffer from disk successfully and create
file continue, the new journal record will overwrite the journal
which record 'delete test60', then the entry corruptioned.

So, use the buffer rather than read from disk if the buffer is marked
with write_io_error.

Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-08-22 23:00:32 -04:00
Eric Biggers
22cfe4b48c ext4: add fs-verity read support
Make ext4_mpage_readpages() verify data as it is read from fs-verity
files, using the helper functions from fs/verity/.

To support both encryption and verity simultaneously, this required
refactoring the decryption workflow into a generic "post-read
processing" workflow which can do decryption, verification, or both.

The case where the ext4 block size is not equal to the PAGE_SIZE is not
supported yet, since in that case ext4_mpage_readpages() sometimes falls
back to block_read_full_page(), which does not support fs-verity yet.

Co-developed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-12 19:33:51 -07:00
Eric Biggers
c93d8f8858 ext4: add basic fs-verity support
Add most of fs-verity support to ext4.  fs-verity is a filesystem
feature that enables transparent integrity protection and authentication
of read-only files.  It uses a dm-verity like mechanism at the file
level: a Merkle tree is used to verify any block in the file in
log(filesize) time.  It is implemented mainly by helper functions in
fs/verity/.  See Documentation/filesystems/fsverity.rst for the full
documentation.

This commit adds all of ext4 fs-verity support except for the actual
data verification, including:

- Adding a filesystem feature flag and an inode flag for fs-verity.

- Implementing the fsverity_operations to support enabling verity on an
  inode and reading/writing the verity metadata.

- Updating ->write_begin(), ->write_end(), and ->writepages() to support
  writing verity metadata pages.

- Calling the fs-verity hooks for ->open(), ->setattr(), and ->ioctl().

ext4 stores the verity metadata (Merkle tree and fsverity_descriptor)
past the end of the file, starting at the first 64K boundary beyond
i_size.  This approach works because (a) verity files are readonly, and
(b) pages fully beyond i_size aren't visible to userspace but can be
read/written internally by ext4 with only some relatively small changes
to ext4.  This approach avoids having to depend on the EA_INODE feature
and on rearchitecturing ext4's xattr support to support paging
multi-gigabyte xattrs into memory, and to support encrypting xattrs.
Note that the verity metadata *must* be encrypted when the file is,
since it contains hashes of the plaintext data.

This patch incorporates work by Theodore Ts'o and Chandan Rajendra.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-12 19:33:50 -07:00
Theodore Ts'o
cd2d99229d ext4: drop legacy pre-1970 encoding workaround
Originally, support for expanded timestamps had a bug in that pre-1970
times were erroneously encoded as being in the the 24th century.  This
was fixed in commit a4dad1ae24 ("ext4: Fix handling of extended
tv_sec") which landed in 4.4.  Starting with 4.4, pre-1970 timestamps
were correctly encoded, but for backwards compatibility those
incorrectly encoded timestamps were mapped back to the pre-1970 dates.

Given that backwards compatibility workaround has been around for 4
years, and given that running e2fsck from e2fsprogs 1.43.2 and later
will offer to fix these timestamps (which has been released for 3
years), it's past time to drop the legacy workaround from the kernel.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-08-12 13:44:49 -04:00
Theodore Ts'o
bb5835edcd ext4: add new ioctl EXT4_IOC_GET_ES_CACHE
For debugging reasons, it's useful to know the contents of the extent
cache.  Since the extent cache contains much of what is in the fiemap
ioctl, use an fiemap-style interface to return this information.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-08-11 16:32:41 -04:00
Theodore Ts'o
1ad3ea6e0a ext4: add a new ioctl EXT4_IOC_GETSTATE
The new ioctl EXT4_IOC_GETSTATE returns some of the dynamic state of
an ext4 inode for debugging purposes.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-08-11 16:31:41 -04:00
Theodore Ts'o
b0c013e292 ext4: add a new ioctl EXT4_IOC_CLEAR_ES_CACHE
The new ioctl EXT4_IOC_CLEAR_ES_CACHE will force an inode's extent
status cache to be cleared out.  This is intended for use for
debugging.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-08-11 16:30:41 -04:00
Theodore Ts'o
7633b08b27 ext4: rename htree_inline_dir_to_tree() to ext4_inlinedir_to_tree()
Clean up namespace pollution by the inline_data code.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-06-21 21:57:00 -04:00
Theodore Ts'o
ddce3b9471 ext4: refactor initialize_dirent_tail()
Move the calculation of the location of the dirent tail into
initialize_dirent_tail().  Also prefix the function with ext4_ to fix
kernel namepsace polution.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-06-21 16:31:47 -04:00
Theodore Ts'o
f036adb399 ext4: rename "dirent_csum" functions to use "dirblock"
Functions such as ext4_dirent_csum_verify() and ext4_dirent_csum_set()
don't actually operate on a directory entry, but a directory block.
And while they take a struct ext4_dir_entry *dirent as an argument, it
had better be the first directory at the beginning of the direct
block, or things will go very wrong.

Rename the following functions so that things make more sense, and
remove a lot of confusing casts along the way:

   ext4_dirent_csum_verify	 -> ext4_dirblock_csum_verify
   ext4_dirent_csum_set		 -> ext4_dirblock_csum_set
   ext4_dirent_csum		 -> ext4_dirblock_csum
   ext4_handle_dirty_dirent_node -> ext4_handle_dirty_dirblock

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-06-21 15:49:26 -04:00
Gabriel Krisman Bertazi
3ae72562ad ext4: optimize case-insensitive lookups
Temporarily cache a casefolded version of the file name under lookup in
ext4_filename, to avoid repeatedly casefolding it.  I got up to 30%
speedup on lookups of large directories (>100k entries), depending on
the length of the string under lookup.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-06-19 23:45:09 -04:00
Wang Shilong
7ddf79a103 ext4: only set project inherit bit for directory
It doesn't make any sense to have project inherit bits
for regular files, even though this won't cause any
problem, but it is better fix this.

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2019-06-10 00:13:32 -04:00
Linus Torvalds
a9fbcd6728 Clean up fscrypt's dcache revalidation support, and other
miscellaneous cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlzSEfQACgkQ8vlZVpUN
 gaNKrQf+O4JCCc8jqhpvUcNr8+DJNhWYpvRo7yDXoWbAyA6eZHV2fTRX5Vw6T8bW
 iQAj9ofkRnakOq6JvnaUyW8eAuRcqellF7HnwFwTxGOpZ1x3UPAV/roKutAhe8sT
 9dA0VxjugBAISbL2AMQKRPYNuzV07D9As6wZRlPuliFVLLnuPG5SseHRhdn3tm1n
 Jwyipu8P6BjomFtfHT25amISaWRx/uGpjTa1fmjwUxIC8EI6V9K6hKNCAUPsk/3g
 m8zEBpBKSmPK66sFPGxddPNGYAyyFluUboQxB7DuSCF7J3cULO8TxRZbsW/5jaio
 ZR8utWezuXnrI80vG/VtCMhqG3398Q==
 =0Bak
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt

Pull fscrypt updates from Ted Ts'o:
 "Clean up fscrypt's dcache revalidation support, and other
  miscellaneous cleanups"

* tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
  fscrypt: cache decrypted symlink target in ->i_link
  vfs: use READ_ONCE() to access ->i_link
  fscrypt: fix race where ->lookup() marks plaintext dentry as ciphertext
  fscrypt: only set dentry_operations on ciphertext dentries
  fs, fscrypt: clear DCACHE_ENCRYPTED_NAME when unaliasing directory
  fscrypt: fix race allowing rename() and link() of ciphertext dentries
  fscrypt: clean up and improve dentry revalidation
  fscrypt: use READ_ONCE() to access ->i_crypt_info
  fscrypt: remove WARN_ON_ONCE() when decryption fails
  fscrypt: drop inode argument from fscrypt_get_ctx()
2019-05-07 21:28:04 -07:00