btrfs-progs/mkfs
Qu Wenruo b2a1be83b8 btrfs-progs: mkfs: keep file descriptors open during whole time
[BUG]
There is an internal bug report that, after mkfs.btrfs there is a chance
that no /dev/disk/by-uuid/<uuid> symlink is not created at all.

[CAUSE]
That uuid symlink is created by udev, which listens to inotify
IN_CLOSE_WRITE events from all block devices.

After such IN_CLOSE_WRITE event is triggered, udev would *disable*
inotify for that block device, and do a blkid scan on it.
After the blkid scan is done, re-enables the inotify listening.

This means normally mkfs tools should open the fd, do all the writes,
and close the fd after everything is done.

But unfortunately for mkfs.btrfs, it's not the case, we have a lot of
phases separated by different close() calls:

  open_ctree() would open fds of each involved device
  and close them at close_ctree()
  Only after close_ctree() we have a valid superblock -\
                                                       |
|<------- A -------->|<--------- B --------->|<------- C ------->|
          |                      |
          |                      `- open a new fd for make_btrfs()
          |                         and close it before open_ctree()
          |                         The device contains invalid sb.
          |
          `- open a new fd for each device, then call
             btrfs_prepare_device(), then close the fd.
             The device would contain no valid superblock.

If at the close() of phase A udev event is triggered, while doing udev
scan we go into phase C (but before the new valid super blocks written),
udev would only see no superblock or invalid superblock.

Then phase C finished, udev resumes its inotify listening, but at this
time mkfs is finished, while udev only sees the premature data from
phase A, and misses the IN_CLOSE_WRITE events from phase C.

[FIX]
Instead of opening and closing a new fd for each device, re-use the fd
opened during prepare_one_device(), and close all the fds until
close_ctree() is called.

By this, although we may still have race between close_ctree() and
explicit close() calls, at least udev can always see the properly
written super blocks.

To compensate the change, some extra cleanups are made:

- Do not touch @device_count
  Which makes later prepare_ctx iteration much easier.

- Remove top-level @fd variable
  Instead go with prepare_ctx[i].fd.

- Do not open with O_RDWR in test_dev_for_mkfs()
  as test_dev_for_mkfs() would close the fd, if we go O_RDWR, it can
  cause the udev race.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-25 16:59:41 +02:00
..
common.c btrfs-progs: mkfs: keep file descriptors open during whole time 2023-04-25 16:59:41 +02:00
common.h btrfs-progs: fsfeatures: properly merge -O and -R options 2022-10-11 09:08:11 +02:00
main.c btrfs-progs: mkfs: keep file descriptors open during whole time 2023-04-25 16:59:41 +02:00
Makefile btrfs-progs: build: add stub makefile to image and mkfs 2019-07-04 15:36:01 +02:00
rootdir.c btrfs-progs: mkfs: offset inode numbers of the source filesystem 2022-10-11 09:08:10 +02:00
rootdir.h btrfs-progs: mkfs: update include lists 2022-10-11 09:06:12 +02:00