linux/fs
Dave Chinner 32baa63d82 xfs: logging the on disk inode LSN can make it go backwards
When we log an inode, we format the "log inode" core and set an LSN
in that inode core. We do that via xfs_inode_item_format_core(),
which calls:

	xfs_inode_to_log_dinode(ip, dic, ip->i_itemp->ili_item.li_lsn);

to format the log inode. It writes the LSN from the inode item into
the log inode, and if recovery decides the inode item needs to be
replayed, it recovers the log inode LSN field and writes it into the
on disk inode LSN field.

Now this might seem like a reasonable thing to do, but it is wrong
on multiple levels. Firstly, if the item is not yet in the AIL,
item->li_lsn is zero. i.e. the first time the inode is logged and
formatted, the LSN we write into the log inode will be zero. If we
only log it once, recovery will run and can write this zero LSN into
the inode.

This means that the next time the inode is logged and log recovery
runs, it will *always* replay changes to the inode, regardless of
whether the inode is newer on disk than the version in the log. That
violates the entire purpose of recording the LSN in the inode at
writeback time (i.e. to stop it going backwards in time on disk
during recovery).
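
As a rough illustration of why a zero on-disk LSN defeats that check,
here is a minimal sketch. It is a simplified model, not the kernel
code: lsn_t, recovery_should_replay() and check_zero_lsn_cases() are
hypothetical names, and LSNs are compared as plain integers where the
kernel uses XFS_LSN_CMP().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t lsn_t;	/* simplified stand-in for xfs_lsn_t */

/*
 * Hypothetical model of the recovery decision: skip the replay only if
 * the on-disk inode carries a valid (non-zero) LSN that is newer than
 * the transaction being recovered.
 */
bool recovery_should_replay(lsn_t disk_lsn, lsn_t trans_lsn)
{
	if (disk_lsn != 0 && disk_lsn > trans_lsn)
		return false;	/* disk is newer, do not go backwards */
	return true;
}

/* Walk through the cases discussed above. */
void check_zero_lsn_cases(void)
{
	/* Disk newer than the log: the replay is correctly skipped. */
	assert(!recovery_should_replay(10, 5));
	/* Disk older than the log: the replay is correctly applied. */
	assert(recovery_should_replay(3, 5));
	/* Zero LSN on disk: always replayed, whatever the disk holds. */
	assert(recovery_should_replay(0, 5));
}
```

Once a zero has been written into the on-disk inode, the last case
applies to every later recovery of that inode, so the ordering
protection is gone.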

Secondly, if we commit the CIL to the journal so the inode item
moves to the AIL, and then relog the inode, the LSN that gets
stamped into the log inode will be the LSN of the inode's current
location in the AIL, not its age on disk. And it's not the LSN that
will be associated with the current change. That means when log
recovery replays this inode item, the LSN that ends up on disk is
the LSN for the previous changes in the log, not the current
changes being replayed. IOWs, after recovery the LSN on disk is not
in sync with the LSN of the modifications that were replayed into
the inode. This, again, violates the recovery ordering semantics
that on-disk writeback LSNs provide.

Hence the inode LSN in the log dinode is -always- invalid.
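
The first two cases can be sketched with a toy model of the item
lifecycle. Everything here is an assumption made for illustration:
the structures are cut down to a single field, and
format_log_inode(), commit_checkpoint() and relog_stamped_lsn() are
hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t lsn_t;	/* simplified stand-in for xfs_lsn_t */

struct log_item {
	lsn_t	li_lsn;		/* 0 until the item enters the AIL */
};

struct log_dinode {
	lsn_t	di_lsn;		/* LSN recorded in the logged inode */
};

/* Format: copy the item's current AIL LSN, as the broken code did. */
void format_log_inode(const struct log_item *item, struct log_dinode *dic)
{
	dic->di_lsn = item->li_lsn;
}

/* Checkpoint commit: the item is placed in the AIL at commit_lsn. */
void commit_checkpoint(struct log_item *item, lsn_t commit_lsn)
{
	item->li_lsn = commit_lsn;
}

/* Log the inode twice and return the LSN stamped by the relog. */
lsn_t relog_stamped_lsn(void)
{
	struct log_item		item = { .li_lsn = 0 };
	struct log_dinode	first, second;

	format_log_inode(&item, &first);	/* not in the AIL yet */
	commit_checkpoint(&item, 100);
	format_log_inode(&item, &second);	/* relogged */
	commit_checkpoint(&item, 200);

	assert(first.di_lsn == 0);	/* first format records zero */
	return second.di_lsn;		/* previous commit, not 200 */
}
```

In this model the second change is committed at LSN 200 but the log
inode carries 100, so neither formatted copy ever holds the LSN of
the change it describes.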

Thirdly, recovery actually has the LSN of the log transaction it is
replaying right at hand - it uses it to determine if it should
replay the inode by comparing it to the on-disk inode's LSN. But it
doesn't use that LSN to stamp the LSN into the inode which will be
written back when the transaction is fully replayed. It uses the one
in the log dinode, which we know is always going to be incorrect.

Looking back at the change history, the inode logging was broken by
commit 93f958f9c4 ("xfs: cull unnecessary icdinode fields") way
back in 2016 by a stupid idiot who thought he knew how this code
worked. i.e. me. That commit replaced an in memory di_lsn field that
was updated only at inode writeback time from the inode item.li_lsn
value - and hence always contained the same LSN that appeared in the
on-disk inode - with a read of the inode item LSN at inode format
time. Clearly these are not the same thing.

Before 93f958f9c4, the log recovery behaviour was irrelevant,
because the LSN in the log inode always matched the on-disk LSN at
the time the inode was logged, hence recovery of the transaction
would never make the on-disk LSN in the inode go backwards or get
out of sync.

A symptom of the problem is this, caught from a failure of
generic/482. Before log recovery, the inode has been allocated but
never used:

xfs_db> inode 393388
xfs_db> p
core.magic = 0x494e
core.mode = 0
....
v3.crc = 0x99126961 (correct)
v3.change_count = 0
v3.lsn = 0
v3.flags2 = 0
v3.cowextsize = 0
v3.crtime.sec = Thu Jan  1 10:00:00 1970
v3.crtime.nsec = 0

After log recovery:

xfs_db> p
core.magic = 0x494e
core.mode = 020444
....
v3.crc = 0x23e68f23 (correct)
v3.change_count = 2
v3.lsn = 0
v3.flags2 = 0
v3.cowextsize = 0
v3.crtime.sec = Thu Jul 22 17:03:03 2021
v3.crtime.nsec = 751000000
...

You can see that the LSN of the on-disk inode is 0, even though it
clearly has been written to disk. I point out this inode, because
the generic/482 failure occurred because several adjacent inodes in
this specific inode cluster were not replayed correctly and still
appeared to be zero on disk when all the other metadata (inobt,
finobt, directories, etc) indicated they should be allocated and
written back.

The fix for this is two-fold. The first is that we need to either
revert the LSN changes in 93f958f9c4 or stop logging the inode LSN
altogether. If we do the former, log recovery does not need to
change but we add 8 bytes of memory per inode to store what is
largely a write-only inode field. If we do the latter, log recovery
needs to stamp the on-disk inode in the same manner that inode
writeback does.

I prefer the latter, because we shouldn't really be trying to log
and replay changes to the on disk LSN as the on-disk value is the
canonical source of the on-disk version of the inode. It also
matches the way we recover buffer items - we create a buf_log_item
that carries the current recovery transaction LSN that gets stamped
into the buffer by the write verifier when it gets written back
when the transaction is fully recovered.

However, this might break log recovery on older kernels even more,
so I'm going to simply ignore the logged value in recovery and stamp
the on-disk inode with the LSN of the transaction being recovered
that will trigger writeback on transaction recovery completion. This
will ensure that the on-disk inode LSN always reflects the LSN of
the last change that was written to disk, regardless of whether it
comes from log recovery or runtime writeback.
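
A minimal sketch of the difference, under the same caveats as any toy
model: the structures are reduced to two fields, and
recover_inode_broken(), recover_inode_fixed() and
check_recovery_stamp() are hypothetical names that only mimic the
behaviour described above. They are not the recovery functions in the
kernel.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t lsn_t;	/* simplified stand-in for xfs_lsn_t */

struct log_dinode {
	uint64_t	change_count;
	lsn_t		di_lsn;		/* logged value, never trustworthy */
};

struct disk_inode {
	uint64_t	change_count;
	lsn_t		di_lsn;
};

/* Old behaviour: copy the logged LSN into the on-disk inode. */
void recover_inode_broken(struct disk_inode *dip,
			  const struct log_dinode *ldip, lsn_t trans_lsn)
{
	(void)trans_lsn;	/* available, but not used for stamping */
	dip->change_count = ldip->change_count;
	dip->di_lsn = ldip->di_lsn;
}

/* New behaviour: ignore the logged LSN, stamp the transaction LSN. */
void recover_inode_fixed(struct disk_inode *dip,
			 const struct log_dinode *ldip, lsn_t trans_lsn)
{
	dip->change_count = ldip->change_count;
	dip->di_lsn = trans_lsn;
}

/* Replay the same logged inode through both variants. */
void check_recovery_stamp(void)
{
	const struct log_dinode	logged = { .change_count = 2, .di_lsn = 0 };
	struct disk_inode	a = { 0 }, b = { 0 };

	recover_inode_broken(&a, &logged, 200);
	recover_inode_fixed(&b, &logged, 200);

	assert(a.change_count == 2 && b.change_count == 2);
	assert(a.di_lsn == 0);		/* matches the xfs_db symptom */
	assert(b.di_lsn == 200);	/* LSN of the replayed change */
}
```

Both variants replay the same modifications; only the fixed one
leaves an LSN on disk that later recovery passes can order against.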

Fixes: 93f958f9c4 ("xfs: cull unnecessary icdinode fields")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-29 09:27:29 -07:00
9p 9p for 5.13-rc1 2021-05-07 11:18:52 -07:00
adfs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
affs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
afs netfslib fixes 2021-06-25 09:41:29 -07:00
autofs autofs: should_expire() argument is guaranteed to be positive 2021-03-24 14:14:27 -04:00
befs fs/befs: Delete obsolete TODO file 2021-03-30 16:54:49 -07:00
bfs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
btrfs Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-07-03 11:30:04 -07:00
cachefiles fscache, cachefiles: Add alternate API to use kiocb for read/write to cache 2021-04-23 10:14:32 +01:00
ceph ceph: take reference to req->r_parent at point of assignment 2021-06-29 00:15:52 +02:00
cifs cifs: update internal version number 2021-07-09 13:03:36 -05:00
coda coda: fix reference counting in coda_file_mmap error path 2021-04-23 14:42:39 -07:00
configfs configfs updates for Linux 5.13 2021-07-02 14:13:21 -07:00
cramfs cramfs: use %pD instead of messing with file_dentry()->d_name 2021-01-05 23:02:47 -05:00
crypto fscrypt: fix derivation of SipHash keys on big endian CPUs 2021-06-05 00:52:52 -07:00
debugfs Linux 5.13-rc6 2021-06-14 09:07:45 +02:00
devpts
dlm fs: dlm: invalid buffer access in lookup error 2021-06-11 12:44:47 -05:00
ecryptfs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
efivarfs efivars: convert to fileattr 2021-04-12 15:04:29 +02:00
efs
erofs erofs: clean up file headers & footers 2021-06-08 00:41:24 +08:00
exfat Description for this pull request: 2021-07-06 11:06:04 -07:00
exportfs exportfs: Add a function to return the raw output from fh_to_dentry() 2020-12-09 09:39:38 -05:00
ext2 fs: remove noop_set_page_dirty() 2021-06-29 10:53:48 -07:00
ext4 Ext4 regression and bug fixes for v5.14-rc1 2021-07-09 09:57:27 -07:00
f2fs f2fs: drop dirty node pages when cp is in error status 2021-07-06 22:05:06 -07:00
fat mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
freevxfs
fscache fscache, cachefiles: Add alternate API to use kiocb for read/write to cache 2021-04-23 10:14:32 +01:00
fuse fuse update for 5.14 2021-07-06 11:17:41 -07:00
gfs2 Various minor gfs2 cleanups and fixes 2021-06-29 20:23:08 -07:00
hfs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
hfsplus hfsplus: report create_date to kstat.btime 2021-07-01 11:06:06 -07:00
hostfs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-05-02 09:14:01 -07:00
hpfs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
hugetlbfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2021-06-28 20:39:26 -07:00
iomap Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-07-03 11:30:04 -07:00
isofs isofs: remove redundant continue statement 2021-06-17 17:11:42 +02:00
jbd2 ext4: inline jbd2_journal_[un]register_shrinker() 2021-07-08 08:37:31 -04:00
jffs2 This pull request contains changes for JFFS2, UBI and UBIFS 2021-05-04 18:08:40 -07:00
jfs JFS fixes for 5.14 2021-07-02 14:25:17 -07:00
kernfs Driver core changes for 5.14-rc1 2021-07-05 13:51:41 -07:00
lockd lockd: Update the NLMv4 SHARE results encoder to use struct xdr_stream 2021-07-06 20:14:44 -04:00
minix mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
netfs netfs: fix test for whether we can skip read when writing beyond EOF 2021-06-21 21:24:07 +01:00
nfs NFS client updates for Linux 5.14 2021-07-09 09:43:57 -07:00
nfs_common nfs_common: fix doc warning 2021-07-06 20:14:41 -04:00
nfsd block-5.14-2021-07-08 2021-07-09 12:05:33 -07:00
nilfs2 Merge branch 'akpm' (patches from Andrew) 2021-07-02 12:08:10 -07:00
nls
notify fanotify: fix copy_event_to_user() fid error clean up 2021-06-14 12:16:37 +02:00
ntfs Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-07-03 11:30:04 -07:00
ocfs2 In addition to bug fixes and cleanups, there are two new features for 2021-06-30 19:37:39 -07:00
omfs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
openpromfs openpromfs: don't do unlock_new_inode() until the new inode is set up 2021-03-12 22:15:22 -05:00
orangefs orangefs: fix orangefs df output. 2021-06-28 08:40:08 -04:00
overlayfs overlayfs update for 5.13 2021-04-30 15:17:08 -07:00
proc Merge branch 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-07-03 11:41:14 -07:00
pstore for-5.14/drivers-2021-06-29 2021-06-30 12:21:16 -07:00
qnx4
qnx6
quota quota: remove unnecessary oom message 2021-06-22 10:40:52 +02:00
ramfs fs: move ramfs_aops to libfs 2021-06-29 10:53:48 -07:00
reiserfs \n 2021-07-01 12:06:39 -07:00
romfs
squashfs squashfs: add option to panic on errors 2021-06-29 10:53:46 -07:00
sysfs sysfs: Support zapping of binary attr mmaps 2021-01-12 14:26:31 +01:00
sysv mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
tracefs tracing: Fix various typos in comments 2021-03-23 14:08:18 -04:00
ubifs ubifs: Set/Clear I_LINKABLE under i_lock for whiteout inode 2021-06-22 09:21:39 +02:00
udf \n 2021-07-01 12:06:39 -07:00
ufs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
unicode .gitignore: prefix local generated files with a slash 2021-05-02 00:43:35 +09:00
vboxsf vboxsf: don't allow to change the inode type 2021-03-12 22:15:00 -05:00
verity fsverity: relax build time dependency on CRYPTO_SHA256 2021-04-22 17:31:32 +10:00
xfs xfs: logging the on disk inode LSN can make it go backwards 2021-07-29 09:27:29 -07:00
zonefs mm: move page dirtying prototypes from mm.h 2021-06-29 10:53:48 -07:00
aio.c Revert "mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio" 2021-04-30 11:20:39 -07:00
anon_inodes.c fs: anon_inodes: rephrase to appropriate kernel-doc 2021-01-15 12:17:25 -05:00
attr.c ima: handle idmapped mounts 2021-01-24 14:27:20 +01:00
bad_inode.c fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
binfmt_aout.c binfmt: remove in-tree usage of MAP_EXECUTABLE 2021-06-29 10:53:50 -07:00
binfmt_elf_fdpic.c Merge branch 'akpm' (patches from Andrew) 2021-06-29 17:29:11 -07:00
binfmt_elf.c Merge branch 'akpm' (patches from Andrew) 2021-06-29 17:29:11 -07:00
binfmt_em86.c
binfmt_flat.c binfmt: remove in-tree usage of MAP_EXECUTABLE 2021-06-29 10:53:50 -07:00
binfmt_misc.c binfmt_misc: fix possible deadlock in bm_register_write 2021-03-13 11:27:30 -08:00
binfmt_script.c
block_dev.c Char / Misc driver updates for 5.14-rc1 2021-07-05 13:42:16 -07:00
buffer.c mm/writeback: move __set_page_dirty() to core mm 2021-06-29 10:53:48 -07:00
char_dev.c
compat_binfmt_elf.c get rid of COMPAT_ELF_EXEC_PAGESIZE 2021-01-06 08:42:51 -05:00
coredump.c Merge branch 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-07-03 11:41:14 -07:00
d_path.c getcwd(2): clean up error handling 2021-05-18 20:15:58 -04:00
dax.c dax: fix ENOMEM handling in grab_mapping_entry() 2021-06-29 10:53:47 -07:00
dcache.c useful constants: struct qstr for ".." 2021-04-15 22:36:45 -04:00
direct-io.c fs: direct-io: fix missing sdio->boundary 2021-04-09 14:54:23 -07:00
drop_caches.c
eventfd.c
eventpoll.c fs/epoll: restore waking from ep_done_scan() 2021-05-06 19:24:13 -07:00
exec.c Merge branch 'akpm' (patches from Andrew) 2021-07-02 12:08:10 -07:00
fcntl.c idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
fhandle.c switch file_open_root() to struct path 2021-04-07 13:56:43 -04:00
file_table.c
file.c Merge branch 'work.file' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-05-03 11:05:28 -07:00
filesystems.c
fs_context.c
fs_parser.c vfs: fs_parser: clean up kernel-doc warnings 2021-04-30 11:20:35 -07:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c \n 2021-07-01 12:06:39 -07:00
fsopen.c
init.c init: handle idmapped mounts 2021-01-24 14:27:19 +01:00
inode.c mm: remove nrexceptional from inode: remove BUG_ON 2021-05-05 11:27:20 -07:00
internal.h switch file_open_root() to struct path 2021-04-07 13:56:43 -04:00
io_uring.c io_uring: remove dead non-zero 'poll' check 2021-07-09 08:20:28 -06:00
io-wq.c io_uring: fix false WARN_ONCE 2021-06-18 09:22:02 -06:00
io-wq.h io_uring: move creds from io-wq work to io_kiocb 2021-06-18 09:22:02 -06:00
ioctl.c vfs: add fileattr ops 2021-04-12 15:04:23 +02:00
Kconfig mm: hugetlb: introduce CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON 2021-06-30 20:47:26 -07:00
Kconfig.binfmt binfmt_flat: allow not offsetting data start 2021-04-19 09:56:37 +10:00
kernel_read_file.c switch file_open_root() to struct path 2021-04-07 13:56:43 -04:00
libfs.c fs: remove noop_set_page_dirty() 2021-06-29 10:53:48 -07:00
locks.c Additional fixes and clean-ups for NFSD since tags/nfsd-5.13, 2021-05-05 13:44:19 -07:00
Makefile netfs: Provide readahead and readpage netfs helpers 2021-04-23 10:14:32 +01:00
mbcache.c
mount.h mount: make {lock,unlock}_mount_hash() static 2021-01-24 14:29:34 +01:00
mpage.c block: rename BIO_MAX_PAGES to BIO_MAX_VECS 2021-03-11 07:47:48 -07:00
namei.c Merge branch 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-07-03 11:41:14 -07:00
namespace.c mount: Support "nosymfollow" in new mount api 2021-06-01 12:09:27 +02:00
no-block.c
nsfs.c
open.c Merge branch 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-07-03 11:41:14 -07:00
pipe.c fs: delete repeated words in comments 2021-02-24 13:38:26 -08:00
pnode.c
pnode.h mount: fix mounting of detached mounts onto targets that reside on shared mounts 2021-03-08 15:18:43 +01:00
posix_acl.c fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
proc_namespace.c fs: introduce MOUNT_ATTR_IDMAP 2021-01-24 14:43:45 +01:00
read_write.c teach sendfile(2) to handle send-to-pipe directly 2021-01-25 23:29:36 -05:00
readdir.c readdir: make sure to verify directory entry for legacy interfaces too 2021-04-17 11:39:49 -07:00
remap_range.c ioctl: handle idmapped mounts 2021-01-24 14:27:19 +01:00
select.c kernel, fs: Introduce and use set_restart_fn() and arch_set_restart_data() 2021-03-16 22:13:10 +01:00
seq_file.c seq_file: drop unused *_escape_mem_ascii() 2021-07-01 11:06:05 -07:00
signalfd.c signalfd: Remove SIL_PERF_EVENT fields from signalfd_siginfo 2021-05-18 16:20:54 -05:00
splice.c for-5.12/block-2021-02-17 2021-02-21 11:02:48 -08:00
stack.c
stat.c fs: fix reporting supported extra file attributes for statx() 2021-04-17 23:03:50 -04:00
statfs.c s390,alpha: switch to 64-bit ino_t 2021-02-13 17:17:53 +01:00
super.c block: move bd_mutex to struct gendisk 2021-06-01 07:44:32 -06:00
sync.c
timerfd.c
userfaultfd.c Merge branch 'akpm' (patches from Andrew) 2021-07-02 12:08:10 -07:00
utimes.c utimes: handle idmapped mounts 2021-01-24 14:27:18 +01:00
xattr.c xattr: fix kernel-doc for mnt_userns and vfs xattr helpers 2021-03-23 11:20:26 +01:00