When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.
Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.
Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
The inode_owner_or_capable() helper determines whether the caller is the
owner of the inode or is capable with respect to that inode. Allow it to
handle idmapped mounts. If the inode is accessed through an idmapped
mount it according to the mount's user namespace. Afterwards the checks
are identical to non-idmapped mounts. If the initial user namespace is
passed nothing changes so non-idmapped mounts will see identical
behavior as before.
Similarly, allow the inode_init_owner() helper to handle idmapped
mounts. It initializes a new inode on idmapped mounts by mapping the
fsuid and fsgid of the caller from the mount's user namespace. If the
initial user namespace is passed nothing changes so non-idmapped mounts
will see identical behavior as before.
Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
The two helpers inode_permission() and generic_permission() are used by
the vfs to perform basic permission checking by verifying that the
caller is privileged over an inode. In order to handle idmapped mounts
we extend the two helpers with an additional user namespace argument.
On idmapped mounts the two helpers will make sure to map the inode
according to the mount's user namespace and then peform identical
permission checks to inode_permission() and generic_permission(). If the
initial user namespace is passed nothing changes so non-idmapped mounts
will see identical behavior as before.
Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
In order to determine whether a caller holds privilege over a given
inode the capability framework exposes the two helpers
privileged_wrt_inode_uidgid() and capable_wrt_inode_uidgid(). The former
verifies that the inode has a mapping in the caller's user namespace and
the latter additionally verifies that the caller has the requested
capability in their current user namespace.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped inodes. If the initial user namespace is passed all
operations are a nop so non-idmapped mounts will not see a change in
behavior.
Link: https://lore.kernel.org/r/20210121131959.646623-5-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Enable unprivileged user namespace mounts of overlayfs. Overlayfs's
permission model (*) ensures that the mounter itself cannot gain additional
privileges by the act of creating an overlayfs mount.
This feature request is coming from the "rootless" container crowd.
(*) Documentation/filesystems/overlayfs.txt#Permission model
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When looking up an inode on the lower layer for which the mounter lacks
read permisison the metacopy check will fail. This causes the lookup to
fail as well, even though the directory is readable.
So ignore EACCES for the "userxattr" case and assume no metacopy for the
unreadable file.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In case the file cannot be opened with O_NOATIME because of lack of
capabilities, then clear O_NOATIME instead of failing.
Remove WARN_ON(), since it would now trigger if O_NOATIME was cleared.
Noticed by Amir Goldstein.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Comment above call already says this, but only EOPNOTSUPP is ignored, other
failures are not.
For example setting "user.*" will fail with EPERM on symlink/special.
Ignore this error as well.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Optionally allow using "user.overlay." namespace instead of
"trusted.overlay."
This is necessary for overlayfs to be able to be mounted in an unprivileged
namepsace.
Make the option explicit, since it makes the filesystem format be
incompatible.
Disable redirect_dir and metacopy options, because these would allow
privilege escalation through direct manipulation of the
"user.overlay.redirect" or "user.overlay.metacopy" xattrs.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
generic_file_splice_read() and iter_file_splice_write() will call back into
f_op->iter_read() and f_op->iter_write() respectively. These already do
the real file lookup and cred override. So the code in ovl_splice_read()
and ovl_splice_write() is redundant.
In addition the ovl_file_accessed() call in ovl_splice_write() is
incorrect, though probably harmless.
Fix by calling generic_file_splice_read() and iter_file_splice_write()
directly.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
ovl_ioctl_set_flags() does a capability check using flags, but then the
real ioctl double-fetches flags and uses potentially different value.
The "Check the capability before cred override" comment misleading: user
can skip this check by presenting benign flags first and then overwriting
them to non-benign flags.
Just remove the cred override for now, hoping this doesn't cause a
regression.
The proper solution is to create a new setxflags i_op (patches are in the
works).
Xfstests don't show a regression.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Fixes: dab5ca8fd9 ("ovl: add lsattr/chattr support")
Cc: <stable@vger.kernel.org> # v4.19
CAP_DAC_READ_SEARCH is required by open_by_handle_at(2) so check it in
ovl_decode_real_fh() as well to prevent privilege escalation for
unprivileged overlay mounts.
[Amir] If the mounter is not capable in init ns, ovl_check_origin() and
ovl_verify_index() will not function as expected and this will break index
and nfs export features. So check capability in ovl_can_decode_fh(), to
auto disable those features.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In metacopy case, we should use ovl_inode_realdata() instead of
ovl_inode_real() to get real inode which has data, so that
we can get correct information of extentes in ->fiemap operation.
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When the lower file of a metacopy is inaccessible, -EIO is returned. For
users not familiar with overlayfs internals, such as myself, the meaning of
this error may not be apparent or easy to determine, since the (metacopy)
file is present and open/stat succeed when accessed outside of the overlay.
Add a rate-limited warning for orphan metacopy to give users a hint when
investigating such errors.
Link: https://lore.kernel.org/linux-unionfs/CAOQ4uxi23Zsmfb4rCed1n=On0NNA5KZD74jjjeyz+et32sk-gg@mail.gmail.com/
Signed-off-by: Kevin Locke <kevin@kevinlocke.name>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This replaces uuid with null in overlayfs file handles and thus relaxes
uuid checks for overlay index feature. It is only possible in case there is
only one filesystem for all the work/upper/lower directories and bare file
handles from this backing filesystem are unique. In other case when we have
multiple filesystems lets just fallback to "uuid=on" which is and
equivalent of how it worked before with all uuid checks.
This is needed when overlayfs is/was mounted in a container with index
enabled (e.g.: to be able to resolve inotify watch file handles on it to
paths in CRIU), and this container is copied and started alongside with the
original one. This way the "copy" container can't have the same uuid on the
superblock and mounting the overlayfs from it later would fail.
That is an example of the problem on top of loop+ext4:
dd if=/dev/zero of=loopbackfile.img bs=100M count=10
losetup -fP loopbackfile.img
losetup -a
#/dev/loop0: [64768]:35 (/loop-test/loopbackfile.img)
mkfs.ext4 loopbackfile.img
mkdir loop-mp
mount -o loop /dev/loop0 loop-mp
mkdir loop-mp/{lower,upper,work,merged}
mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\
upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
umount loop-mp/merged
umount loop-mp
e2fsck -f /dev/loop0
tune2fs -U random /dev/loop0
mount -o loop /dev/loop0 loop-mp
mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\
upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
#mount: /loop-test/loop-mp/merged:
#mount(2) system call failed: Stale file handle.
If you just change the uuid of the backing filesystem, overlay is not
mounting any more. In Virtuozzo we copy container disks (ploops) when
create the copy of container and we require fs uuid to be unique for a new
container.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This will be used in next patch to be able to change uuid checks and add
uuid nullification based on ofs->config.index for a new "uuid=off" mode.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Canonalize to ioctl FS_* flags instead of inode S_* flags.
Note that we do not call the helper vfs_ioc_fssetxattr_check()
for FS_IOC_FSSETXATTR ioctl. The reason is that underlying filesystem
will perform all the checks. We only need to perform the capability
check before overriding credentials.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Xiao Yang <yangx.jy@cn.fujitsu.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
[S|G]ETFLAGS and FS[S|G]ETXATTR ioctls are applicable to both files and
directories, so add ioctl operations to dir as well.
We teach ovl_real_fdget() to get the realfile of directories which use
a different type of file->private_data.
Ifdef away compat ioctl implementation to conform to standard practice.
With this change, xfstest generic/079 which tests these ioctls on files
and directories passes.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Xiao Yang <yangx.jy@cn.fujitsu.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
ovl_can_list() should return false for overlay private xattrs. Since
currently these use the "trusted.overlay." prefix, they will always match
the "trusted." prefix as well, hence the test for being non-trusted will
not trigger.
Prepare for using the "user.overlay." namespace by moving the test for
private xattr before the test for non-trusted.
This patch doesn't change behavior.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Instead of passing the xattr name down to the ovl_do_*xattr() accessor
functions, pass an enumerated value. The enum can use the same names as
the the previous #define for each xattr name.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Call ovl_do_*xattr() when accessing an overlay private xattr, vfs_*xattr()
otherwise.
This has an effect on debug output, which is made more consistent by this
patch.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Use the convention of calling ovl_do_foo() for operations which are overlay
specific.
This patch is a no-op, and will have significance for supporting
"user.overlay." xattr namespace.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This is a partial revert (with some cleanups) of commit 993a0b2aec ("ovl:
Do not lose security.capability xattr over metadata file copy-up"), which
introduced ovl_getxattr() in the first place.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Lose the padding and the failure message (in line with other parts of the
copy up process). Return zero for both nonexistent or empty xattr.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
ovl_getattr() returns the value of an xattr in a kmalloced buffer. There
are two callers:
ovl_copy_up_meta_inode_data() (copy_up.c)
ovl_get_redirect_xattr() (util.c)
This patch just copies ovl_getxattr() to copy_up.c, the following patches
will deal with the differences in idividual callers.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Container folks are complaining that dnf/yum issues too many sync while
installing packages and this slows down the image build. Build requirement
is such that they don't care if a node goes down while build was still
going on. In that case, they will simply throw away unfinished layer and
start new build. So they don't care about syncing intermediate state to the
disk and hence don't want to pay the price associated with sync.
So they are asking for mount options where they can disable sync on overlay
mount point.
They primarily seem to have two use cases.
- For building images, they will mount overlay with nosync and then sync
upper layer after unmounting overlay and reuse upper as lower for next
layer.
- For running containers, they don't seem to care about syncing upper layer
because if node goes down, they will simply throw away upper layer and
create a fresh one.
So this patch provides a mount option "volatile" which disables all forms
of sync. Now it is caller's responsibility to throw away upper if system
crashes or shuts down and start fresh.
With "volatile", I am seeing roughly 20% speed up in my VM where I am just
installing emacs in an image. Installation time drops from 31 seconds to 25
seconds when nosync option is used. This is for the case of building on top
of an image where all packages are already cached. That way I take out the
network operations latency out of the measurement.
Giuseppe is also looking to cut down on number of iops done on the disk. He
is complaining that often in cloud their VMs are throttled if they cross
the limit. This option can help them where they reduce number of iops (by
cutting down on frequent sync and writebacks).
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
An incompatible feature is marked by a non-empty directory nested
2 levels deep under "work" dir, e.g.:
workdir/work/incompat/volatile.
This commit checks for marked incompat features, warns about them
and fails to mount the overlay, for example:
overlayfs: overlay with incompat feature 'volatile' cannot be mounted
Very old kernels (i.e. v3.18) will fail to remove a non-empty "work"
dir and fail the mount. Newer kernels will fail to remove a "work"
dir with entries nested 3 levels and fall back to read-only mount.
User mounting with old kernel will see a warning like these in dmesg:
overlayfs: cleanup of 'incompat/...' failed (-39)
overlayfs: cleanup of 'work/incompat' failed (-39)
overlayfs: cleanup of 'ovl-work/work' failed (-39)
overlayfs: failed to create directory /vdf/ovl-work/work (errno: 17);
mounting read-only
These warnings should give the hint to the user that:
1. mount failure is caused by backward incompatible features
2. mount failure can be resolved by manually removing the "work" directory
There is nothing preventing users on old kernels from manually removing
workdir entirely or mounting overlay with a new workdir, so this is in
no way a full proof backward compatibility enforcement, but only a best
effort.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.
In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:
git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'
drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.
No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.
[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/
Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
We recently moved setting inode flag OVL_UPPERDATA to ovl_lookup().
When looking up an overlay dentry, upperdentry may be found by index
and not by name. In that case, we fail to read the metacopy xattr
and falsly set the OVL_UPPERDATA on the overlay inode.
This caused a regression in xfstest overlay/033 when run with
OVERLAY_MOUNT_OPTIONS="-o metacopy=on".
Fixes: 28166ab3c8 ("ovl: initialize OVL_UPPERDATA in ovl_lookup()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The check if user has changed the overlay file was wrong, causing unneeded
call to ovl_change_flags() including taking f_lock on every file access.
Fixes: d989903058 ("ovl: do not generate duplicate fsnotify events for "fake" path")
Cc: <stable@vger.kernel.org> # v4.19+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Without upperdir mount option, there is no index dir and the dependency
checks nfs_export => index for mount options parsing are incorrect.
Allow the combination nfs_export=on,index=off with no upperdir and move
the check for dependency redirect_dir=nofollow for non-upper mount case
to mount options parsing.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
With index feature enabled, on failure to create index dir, overlay is
being mounted read-only. However, we do not forbid user to remount overlay
read-write. Fix that by setting ofs->workdir to NULL, which prevents
remount read-write.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Commit 9df085f3c9 ("ovl: relax requirement for non null uuid of lower
fs") relaxed the requirement for non null uuid with single lower layer to
allow enabling index and nfs_export features with single lower squashfs.
Fabian reported a regression in a setup when overlay re-uses an existing
upper layer and re-formats the lower squashfs image. Because squashfs
has no uuid, the origin xattr in upper layer are decoded from the new
lower layer where they may resolve to a wrong origin file and user may
get an ESTALE or EIO error on lookup.
To avoid the reported regression while still allowing the new features
with single lower squashfs, do not allow decoding origin with lower null
uuid unless user opted-in to one of the new features that require
following the lower inode of non-dir upper (index, xino, metacopy).
Reported-by: Fabian <godi.beat@gmx.net>
Link: https://lore.kernel.org/linux-unionfs/32532923.JtPX5UtSzP@fgdesktop/
Fixes: 9df085f3c9 ("ovl: relax requirement for non null uuid of lower fs")
Cc: stable@vger.kernel.org # v4.20+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Mounting with nfs_export=on, xfstests overlay/031 triggers a kernel panic
since v5.8-rc1 overlayfs updates.
overlayfs: orphan index entry (index/00fb1..., ftype=4000, nlink=2)
BUG: kernel NULL pointer dereference, address: 0000000000000030
RIP: 0010:ovl_cleanup_and_whiteout+0x28/0x220 [overlay]
Bisect point at commit c21c839b84 ("ovl: whiteout inode sharing")
Minimal reproducer:
--------------------------------------------------
rm -rf l u w m
mkdir -p l u w m
mkdir -p l/testdir
touch l/testdir/testfile
mount -t overlay -o lowerdir=l,upperdir=u,workdir=w,nfs_export=on overlay m
echo 1 > m/testdir/testfile
umount m
rm -rf u/testdir
mount -t overlay -o lowerdir=l,upperdir=u,workdir=w,nfs_export=on overlay m
umount m
--------------------------------------------------
When mount with nfs_export=on, and fail to verify an orphan index, we're
cleaning this index from indexdir by calling ovl_cleanup_and_whiteout().
This dereferences ofs->workdir, that was earlier set to NULL.
The design was that ovl->workdir will point at ovl->indexdir, but we are
assigning ofs->indexdir to ofs->workdir only after ovl_indexdir_cleanup().
There is no reason not to do it sooner, because once we get success from
ofs->indexdir = ovl_workdir_create(... there is no turning back.
Reported-and-tested-by: Murphy Zhou <jencce.kernel@gmail.com>
Fixes: c21c839b84 ("ovl: whiteout inode sharing")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Decoding a lower directory file handle to overlay path with cold
inode/dentry cache may go as follows:
1. Decode real lower file handle to lower dir path
2. Check if lower dir is indexed (was copied up)
3. If indexed, get the upper dir path from index
4. Lookup upper dir path in overlay
5. If overlay path found, verify that overlay lower is the lower dir
from step 1
On failure to verify step 5 above, user will get an ESTALE error and a
WARN_ON will be printed.
A mismatch in step 5 could be a result of lower directory that was renamed
while overlay was offline, after that lower directory has been copied up
and indexed.
This is a scripted reproducer based on xfstest overlay/052:
# Create lower subdir
create_dirs
create_test_files $lower/lowertestdir/subdir
mount_dirs
# Copy up lower dir and encode lower subdir file handle
touch $SCRATCH_MNT/lowertestdir
test_file_handles $SCRATCH_MNT/lowertestdir/subdir -p -o $tmp.fhandle
# Rename lower dir offline
unmount_dirs
mv $lower/lowertestdir $lower/lowertestdir.new/
mount_dirs
# Attempt to decode lower subdir file handle
test_file_handles $SCRATCH_MNT -p -i $tmp.fhandle
Since this WARN_ON() can be triggered by user we need to relax it.
Fixes: 4b91c30a5a ("ovl: lookup connected ancestor of dir in inode cache")
Cc: <stable@vger.kernel.org> # v4.16+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
ovl_check_origin outparam 'ctrp' argument not used by caller. So remove
this argument.
Signed-off-by: youngjun <her0gyugyu@gmail.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
"ovl_copy_up_flags" is used in copy_up.c.
so, change it static.
Signed-off-by: youngjun <her0gyugyu@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXt9klAAKCRDh3BK/laaZ
PBeeAP9GRI0yajPzBzz2ZK9KkDc6A7wPiaAec+86Q+c02VncVwEAvq5Pi4um5RTZ
7SVv56ggKO3Cqx779zVyZTRYDs3+YA4=
=bpKI
-----END PGP SIGNATURE-----
Merge tag 'ovl-update-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
"Fixes:
- Resolve mount option conflicts consistently
- Sync before remount R/O
- Fix file handle encoding corner cases
- Fix metacopy related issues
- Fix an unintialized return value
- Add missing permission checks for underlying layers
Optimizations:
- Allow multipe whiteouts to share an inode
- Optimize small writes by inheriting SB_NOSEC from upper layer
- Do not call ->syncfs() multiple times for sync(2)
- Do not cache negative lookups on upper layer
- Make private internal mounts longterm"
* tag 'ovl-update-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (27 commits)
ovl: remove unnecessary lock check
ovl: make oip->index bool
ovl: only pass ->ki_flags to ovl_iocb_to_rwf()
ovl: make private mounts longterm
ovl: get rid of redundant members in struct ovl_fs
ovl: add accessor for ofs->upper_mnt
ovl: initialize error in ovl_copy_xattr
ovl: drop negative dentry in upper layer
ovl: check permission to open real file
ovl: call secutiry hook in ovl_real_ioctl()
ovl: verify permissions in ovl_path_open()
ovl: switch to mounter creds in readdir
ovl: pass correct flags for opening real directory
ovl: fix redirect traversal on metacopy dentries
ovl: initialize OVL_UPPERDATA in ovl_lookup()
ovl: use only uppermetacopy state in ovl_lookup()
ovl: simplify setting of origin for index lookup
ovl: fix out of bounds access warning in ovl_check_fb_len()
ovl: return required buffer size for file handles
ovl: sync dirty data when remounting to ro mode
...
Directory is always locked until "out_unlock" label. So lock check is not
needed.
Signed-off-by: youngjun <her0gyugyu@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* Fix performance problems found in dioread_nolock now that it is the
default, caused by transaction leaks.
* Clean up fiemap handling in ext4
* Clean up and refactor multiple block allocator (mballoc) code
* Fix a problem with mballoc with a smaller file systems running out
of blocks because they couldn't properly use blocks that had been
reserved by inode preallocation.
* Fixed a race in ext4_sync_parent() versus rename()
* Simplify the error handling in the extent manipulation code
* Make sure all metadata I/O errors are felected to ext4_ext_dirty()'s and
ext4_make_inode_dirty()'s callers.
* Avoid passing an error pointer to brelse in ext4_xattr_set()
* Fix race which could result to freeing an inode on the dirty last
in data=journal mode.
* Fix refcount handling if ext4_iget() fails
* Fix a crash in generic/019 caused by a corrupted extent node
-----BEGIN PGP SIGNATURE-----
iQEyBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl7Ze8kACgkQ8vlZVpUN
gaNChAf4xn0ytFSrweI/S2Sp05G/2L/ocZ2TZZk2ZdGeN1E+ABdSIv/zIF9zuFgZ
/pY/C+fyEZWt4E3FlNO8gJzoEedkzMCMnUhSIfI+wZbcclyTOSNMJtnrnJKAEtVH
HOvGZJmg357jy407RCGhZpJ773nwU2xhBTr5OFxvSf9mt/vzebxIOnw5D7HPlC1V
Fgm6Du8q+tRrPsyjv1Yu4pUEVXMJ7qUcvt326AXVM3kCZO1Aa5GrURX0w3J4mzW1
tc1tKmtbLcVVYTo9CwHXhk/edbxrhAydSP2iACand3tK6IJuI6j9x+bBJnxXitnr
vsxsfTYMG18+2SxrJ9LwmagqmrRq
=HMTs
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"A lot of bug fixes and cleanups for ext4, including:
- Fix performance problems found in dioread_nolock now that it is the
default, caused by transaction leaks.
- Clean up fiemap handling in ext4
- Clean up and refactor multiple block allocator (mballoc) code
- Fix a problem with mballoc with a smaller file systems running out
of blocks because they couldn't properly use blocks that had been
reserved by inode preallocation.
- Fixed a race in ext4_sync_parent() versus rename()
- Simplify the error handling in the extent manipulation code
- Make sure all metadata I/O errors are felected to
ext4_ext_dirty()'s and ext4_make_inode_dirty()'s callers.
- Avoid passing an error pointer to brelse in ext4_xattr_set()
- Fix race which could result to freeing an inode on the dirty last
in data=journal mode.
- Fix refcount handling if ext4_iget() fails
- Fix a crash in generic/019 caused by a corrupted extent node"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (58 commits)
ext4: avoid unnecessary transaction starts during writeback
ext4: don't block for O_DIRECT if IOCB_NOWAIT is set
ext4: remove the access_ok() check in ext4_ioctl_get_es_cache
fs: remove the access_ok() check in ioctl_fiemap
fs: handle FIEMAP_FLAG_SYNC in fiemap_prep
fs: move fiemap range validation into the file systems instances
iomap: fix the iomap_fiemap prototype
fs: move the fiemap definitions out of fs.h
fs: mark __generic_block_fiemap static
ext4: remove the call to fiemap_check_flags in ext4_fiemap
ext4: split _ext4_fiemap
ext4: fix fiemap size checks for bitmap files
ext4: fix EXT4_MAX_LOGICAL_BLOCK macro
add comment for ext4_dir_entry_2 file_type member
jbd2: avoid leaking transaction credits when unreserving handle
ext4: drop ext4_journal_free_reserved()
ext4: mballoc: use lock for checking free blocks while retrying
ext4: mballoc: refactor ext4_mb_good_group()
ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling
ext4: mballoc: refactor ext4_mb_discard_preallocations()
...
Overlayfs is using clone_private_mount() to create internal mounts for
underlying layers. These are used for operations requiring a path, such as
dentry_open().
Since these private mounts are not in any namespace they are treated as
short term, "detached" mounts and mntput() involves taking the global
mount_lock, which can result in serious cacheline pingpong.
Make these private mounts longterm instead, which trade the penalty on
mntput() for a slightly longer shutdown time due to an added RCU grace
period when putting these mounts.
Introduce a new helper kern_unmount_many() that can take care of multiple
longterm mounts with a single RCU grace period.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
ofs->upper_mnt is copied to ->layers[0].mnt and ->layers[0].trap could be
used instead of a separate ->upperdir_trap.
Split the lowerdir option early to get the number of layers, then allocate
the ->layers array, and finally fill the upper and lower layers, as before.
Get rid of path_put_init() in ovl_lower_dir(), since the only caller will
take care of that.
[Colin Ian King] Fix null pointer dereference on null stack pointer on
error return found by Coverity.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In ovl_copy_xattr, if all the xattrs to be copied are overlayfs private
xattrs, the copy loop will terminate without assigning anything to the
error variable, thus returning an uninitialized value.
If ovl_copy_xattr is called from ovl_clear_empty, this uninitialized error
value is put into a pointer by ERR_PTR(), causing potential invalid memory
accesses down the line.
This commit initialize error with 0. This is the correct value because when
there's no xattr to copy, because all xattrs are private, ovl_copy_xattr
should succeed.
This bug is discovered with the help of INIT_STACK_ALL and clang.
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Link: https://bugs.chromium.org/p/chromium/issues/detail?id=1050405
Fixes: 0956254a2d ("ovl: don't copy up opaqueness")
Cc: stable@vger.kernel.org # v4.8
Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
By moving FIEMAP_FLAG_SYNC handling to fiemap_prep we ensure it is
handled once instead of duplicated, but can still be done under fs locks,
like xfs/iomap intended with its duplicate handling. Also make sure the
error value of filemap_write_and_wait is propagated to user space.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-8-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
No need to pull the fiemap definitions into almost every file in the
kernel build.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-5-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Negative dentries of upper layer are useless after construction of
overlayfs' own dentry and may keep in the memory long time even after
unmount of overlayfs instance. This patch tries to drop unnecessary
negative dentry of upper layer to effectively reclaim memory.
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Call inode_permission() on real inode before opening regular file on one of
the underlying layers.
In some cases ovl_permission() already checks access to an underlying file,
but it misses the metacopy case, and possibly other ones as well.
Removing the redundant permission check from ovl_permission() should be
considered later.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Verify LSM permissions for underlying file, since vfs_ioctl() doesn't do
it.
[Stephen Rothwell] export security_file_ioctl
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Check permission before opening a real file.
ovl_path_open() is used by readdir and copy-up routines.
ovl_permission() theoretically already checked copy up permissions, but it
doesn't hurt to re-do these checks during the actual copy-up.
For directory reading ovl_permission() only checks access to topmost
underlying layer. Readdir on a merged directory accesses layers below the
topmost one as well. Permission wasn't checked for these layers.
Note: modifying ovl_permission() to perform this check would be far more
complex and hence more bug prone. The result is less precise permissions
returned in access(2). If this turns out to be an issue, we can revisit
this bug.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In preparation for more permission checking, override credentials for
directory operations on the underlying filesystems.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The three instances of ovl_path_open() in overlayfs/readdir.c do three
different things:
- pass f_flags from overlay file
- pass O_RDONLY | O_DIRECTORY
- pass just O_RDONLY
The value of f_flags can be (other than O_RDONLY):
O_WRONLY - not possible for a directory
O_RDWR - not possible for a directory
O_CREAT - masked out by dentry_open()
O_EXCL - masked out by dentry_open()
O_NOCTTY - masked out by dentry_open()
O_TRUNC - masked out by dentry_open()
O_APPEND - no effect on directory ops
O_NDELAY - no effect on directory ops
O_NONBLOCK - no effect on directory ops
__O_SYNC - no effect on directory ops
O_DSYNC - no effect on directory ops
FASYNC - no effect on directory ops
O_DIRECT - no effect on directory ops
O_LARGEFILE - ?
O_DIRECTORY - only affects lookup
O_NOFOLLOW - only affects lookup
O_NOATIME - overlay sets this unconditionally in ovl_path_open()
O_CLOEXEC - only affects fd allocation
O_PATH - no effect on directory ops
__O_TMPFILE - not possible for a directory
Fon non-merge directories we use the underlying filesystem's iterate; in
this case honor O_LARGEFILE from the original file to make sure that open
doesn't get rejected.
For merge directories it's safe to pass O_LARGEFILE unconditionally since
userspace will only see the artificial offsets created by overlayfs.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Amir pointed me to metacopy test cases in unionmount-testsuite and I
decided to run "./run --ov=10 --meta" and it failed while running test
"rename-mass-5.py".
Problem is w.r.t absolute redirect traversal on intermediate metacopy
dentry. We do not store intermediate metacopy dentries and also skip
current loop/layer and move onto lookup in next layer. But at the end of
loop, we have logic to reset "poe" and layer index if currnently looked up
dentry has absolute redirect. We skip all that and that means lookup in
next layer will fail.
Following is simple test case to reproduce this.
- mkdir -p lower upper work merged lower/a lower/b
- touch lower/a/foo.txt
- mount -t overlay -o lowerdir=lower,upperdir=upper,workdir=work,metacopy=on none merged
# Following will create absolute redirect "/a/foo.txt" on upper/b/bar.txt.
- mv merged/a/foo.txt merged/b/bar.txt
# unmount overlay and use upper as lower layer (lower2) for next mount.
- umount merged
- mv upper lower2
- rm -rf work; mkdir -p upper work
- mount -t overlay -o lowerdir=lower2:lower,upperdir=upper,workdir=work,metacopy=on none merged
# Force a metacopy copy-up
- chown bin:bin merged/b/bar.txt
# unmount overlay and use upper as lower layer (lower3) for next mount.
- umount merged
- mv upper lower3
- rm -rf work; mkdir -p upper work
- mount -t overlay -o lowerdir=lower3:lower2:lower,upperdir=upper,workdir=work,metacopy=on none merged
# ls merged/b/bar.txt
ls: cannot access 'bar.txt': Input/output error
Intermediate lower layer (lower2) has metacopy dentry b/bar.txt with
absolute redirect "/a/foo.txt". We skipped redirect processing at the end
of loop which sets poe to roe and sets the appropriate next lower layer
index. And that means lookup failed in next layer.
Fix this by continuing the loop for any intermediate dentries. We still do
not save these at lower stack. With this fix applied unionmount-testsuite,
"./run --ov-10 --meta" now passes.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Currently ovl_get_inode() initializes OVL_UPPERDATA flag and for that it
has to call ovl_check_metacopy_xattr() and check if metacopy xattr is
present or not.
yangerkun reported sometimes underlying filesystem might return -EIO and in
that case error handling path does not cleanup properly leading to various
warnings.
Run generic/461 with ext4 upper/lower layer sometimes may trigger the bug
as below(linux 4.19):
[ 551.001349] overlayfs: failed to get metacopy (-5)
[ 551.003464] overlayfs: failed to get inode (-5)
[ 551.004243] overlayfs: cleanup of 'd44/fd51' failed (-5)
[ 551.004941] overlayfs: failed to get origin (-5)
[ 551.005199] ------------[ cut here ]------------
[ 551.006697] WARNING: CPU: 3 PID: 24674 at fs/inode.c:1528 iput+0x33b/0x400
...
[ 551.027219] Call Trace:
[ 551.027623] ovl_create_object+0x13f/0x170
[ 551.028268] ovl_create+0x27/0x30
[ 551.028799] path_openat+0x1a35/0x1ea0
[ 551.029377] do_filp_open+0xad/0x160
[ 551.029944] ? vfs_writev+0xe9/0x170
[ 551.030499] ? page_counter_try_charge+0x77/0x120
[ 551.031245] ? __alloc_fd+0x160/0x2a0
[ 551.031832] ? do_sys_open+0x189/0x340
[ 551.032417] ? get_unused_fd_flags+0x34/0x40
[ 551.033081] do_sys_open+0x189/0x340
[ 551.033632] __x64_sys_creat+0x24/0x30
[ 551.034219] do_syscall_64+0xd5/0x430
[ 551.034800] entry_SYSCALL_64_after_hwframe+0x44/0xa9
One solution is to improve error handling and call iget_failed() if error
is encountered. Amir thinks that this path is little intricate and there
is not real need to check and initialize OVL_UPPERDATA in ovl_get_inode().
Instead caller of ovl_get_inode() can initialize this state. And this will
avoid double checking of metacopy xattr lookup in ovl_lookup() and
ovl_get_inode().
OVL_UPPERDATA is inode flag. So I was little concerned that initializing
it outside ovl_get_inode() might have some races. But this is one way
transition. That is once a file has been fully copied up, it can't go back
to metacopy file again. And that seems to help avoid races. So as of now
I can't see any races w.r.t OVL_UPPERDATA being set wrongly. So move
settingof OVL_UPPERDATA inside the callers of ovl_get_inode().
ovl_obtain_alias() already does it. So only two callers now left are
ovl_lookup() and ovl_instantiate().
Reported-by: yangerkun <yangerkun@huawei.com>
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Currently we use a variable "metacopy" which signifies that dentry could be
either uppermetacopy or lowermetacopy. Amir suggested that we can move
code around and use d.metacopy in such a way that we don't need
lowermetacopy and just can do away with uppermetacopy.
So this patch replaces "metacopy" with "uppermetacopy".
It also moves some code little higher to keep reading little simpler.
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
overlayfs can keep index of copied up files and directories and it seems to
serve two primary puroposes. For regular files, it avoids breaking lower
hardlinks over copy up. For directories it seems to be used for various
error checks.
During ovl_lookup(), we lookup for index using lower dentry in many a
cases. That lower dentry is called "origin" and following is a summary of
current logic.
If there is no upperdentry, always lookup for index using lower dentry.
For regular files it helps avoiding breaking hard links over copyup and for
directories it seems to be just error checks.
If there is an upperdentry, then there are 3 possible cases.
- For directories, lower dentry is found using two ways. One is regular
path based lookup in lower layers and second is using ORIGIN xattr on
upper dentry. First verify that path based lookup lower dentry matches
the one pointed by upper ORIGIN xattr. If yes, use this verified origin
for index lookup.
- For regular files (non-metacopy), there is no path based lookup in lower
layers as lookup stops once we find upper dentry. So there is no origin
verification. If there is ORIGIN xattr present on upper, use that to
lookup index otherwise don't.
- For regular metacopy files, again lower dentry is found using path based
lookup as well as ORIGIN xattr on upper. Path based lookup is continued
in this case to find lower data dentry for metacopy upper. So like
directories we only use verified origin. If ORIGIN xattr is not present
(Either because lower did not support file handles or because this is
hardlink copied up with index=off), then don't use path lookup based
lower dentry as origin. This is same as regular non-metacopy file case.
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
syzbot reported out of bounds memory access from open_by_handle_at()
with a crafted file handle that looks like this:
{ .handle_bytes = 2, .handle_type = OVL_FILEID_V1 }
handle_bytes gets rounded down to 0 and we end up calling:
ovl_check_fh_len(fh, 0) => ovl_check_fb_len(fh + 3, -3)
But fh buffer is only 2 bytes long, so accessing struct ovl_fb at
fh + 3 is illegal.
Fixes: cbe7fba8ed ("ovl: make sure that real fid is 32bit aligned in memory")
Reported-and-tested-by: syzbot+61958888b1c60361a791@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org> # v5.5
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
set from Mauro toward the completion of the RST conversion. I *really*
hope we are getting close to the end of this. Meanwhile, those patches
reach pretty far afield to update document references around the tree;
there should be no actual code changes there. There will be, alas, more of
the usual trivial merge conflicts.
Beyond that we have more translations, improvements to the sphinx
scripting, a number of additions to the sysctl documentation, and lots of
fixes.
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl7VId8PHGNvcmJldEBs
d24ubmV0AAoJEBdDWhNsDH5Yq/gH/iaDgirQZV6UZ2v9sfwQNYolNpf2sKAuOZjd
bPFB7WJoMQbKwQEvYrAUL2+5zPOcLYuIfzyOfo1BV1py+EyKbACcKjI4AedxfJF7
+NchmOBhlEqmEhzx2U08HRc4/8J223WG17fJRVsV3p+opJySexSFeQucfOciX5NR
RUCxweWWyg/FgyqjkyMMTtsePqZPmcT5dWTlVXISlbWzcv5NFhuJXnSrw8Sfzcmm
SJMzqItv3O+CabnKQ8kMLV2PozXTMfjeWH47ZUK0Y8/8PP9+cvqwFzZ0UDQJ1Xaz
oyW/TqmunaXhfMsMFeFGSwtfgwRHvXdxkQdtwNHvo1dV4dzTvDw=
=fDC/
-----END PGP SIGNATURE-----
Merge tag 'docs-5.8' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"A fair amount of stuff this time around, dominated by yet another
massive set from Mauro toward the completion of the RST conversion. I
*really* hope we are getting close to the end of this. Meanwhile,
those patches reach pretty far afield to update document references
around the tree; there should be no actual code changes there. There
will be, alas, more of the usual trivial merge conflicts.
Beyond that we have more translations, improvements to the sphinx
scripting, a number of additions to the sysctl documentation, and lots
of fixes"
* tag 'docs-5.8' of git://git.lwn.net/linux: (130 commits)
Documentation: fixes to the maintainer-entry-profile template
zswap: docs/vm: Fix typo accept_threshold_percent in zswap.rst
tracing: Fix events.rst section numbering
docs: acpi: fix old http link and improve document format
docs: filesystems: add info about efivars content
Documentation: LSM: Correct the basic LSM description
mailmap: change email for Ricardo Ribalda
docs: sysctl/kernel: document unaligned controls
Documentation: admin-guide: update bug-hunting.rst
docs: sysctl/kernel: document ngroups_max
nvdimm: fixes to maintainter-entry-profile
Documentation/features: Correct RISC-V kprobes support entry
Documentation/features: Refresh the arch support status files
Revert "docs: sysctl/kernel: document ngroups_max"
docs: move locking-specific documents to locking/
docs: move digsig docs to the security book
docs: move the kref doc into the core-api book
docs: add IRQ documentation at the core-api book
docs: debugging-via-ohci1394.txt: add it to the core-api book
docs: fix references for ipmi.rst file
...
Overlayfs doesn't work well with the fanotify mechanism.
Fanotify first probes for the required buffer size for the file handle,
but overlayfs currently bails out without passing the size back.
That results in errors in the kernel log, such as:
[527944.485384] overlayfs: failed to encode file handle (/, err=-75, buflen=0, len=29, type=1)
[527944.485386] fanotify: failed to encode fid (fsid=ae521e68.a434d95f, type=255, bytes=0, err=-2)
Signed-off-by: Lubos Dolezel <lubos@dolezel.info>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
sync_filesystem() does not sync dirty data for readonly filesystem during
umount, so before changing to readonly filesystem we should sync dirty data
for data integrity.
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Share inode with different whiteout files for saving inode and speeding up
delete operation.
If EMLINK is encountered when linking a shared whiteout, create a new one.
In case of any other error, disable sharing for this super block.
Note: ofs->whiteout is protected by inode lock on workdir.
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Since the stacking of regular file operations [1], the overlayfs edition of
write_iter() is called when writing regular files.
Since then, xattr lookup is needed on every write since file_remove_privs()
is called from ovl_write_iter(), which would become the performance
bottleneck when writing small chunks of data. In my test case,
file_remove_privs() would consume ~15% CPU when running fstime of unixbench
(the workload is repeadly writing 1 KB to the same file) [2].
Inherit the SB_NOSEC flag from upperdir. Since then xattr lookup would be
done only once on the first write. Unixbench fstime gets a ~20% performance
gain with this patch.
[1] https://lore.kernel.org/lkml/20180606150905.GC9426@magnolia/T/
[2] https://www.spinics.net/lists/linux-unionfs/msg07153.html
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Stacked filesystems like overlayfs has no own writeback, but they have to
forward syncfs() requests to backend for keeping data integrity.
During global sync() each overlayfs instance calls method ->sync_fs() for
backend although it itself is in global list of superblocks too. As a
result one syscall sync() could write one superblock several times and send
multiple disk barriers.
This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that.
Reported-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
With index=on, let index dir act as the work dir for copy up and cleanups.
This will help implementing whiteout inode sharing.
We still create the "work" dir on mount regardless of index=on and it is
used to test the features supported by upper fs. One reason is that before
the feature tests, we do not know if index could be enabled or not.
The reason we do not use "index" directory also as workdir with index=off
is because the existence of the "index" directory acts as a simple
persistent signal that index was enabled on this filesystem and tools may
want to use that signal.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
With index=on, we copy up lower hardlinks to work dir and move them into
index dir. Fix locking to allow work dir and index dir to be the same
directory.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Teach ovl_indexdir_cleanup() to remove temp directories containing
whiteouts to prepare for using index dir instead of work dir for removing
merge directories.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Similar to the way that a conflict between metacopy=on,redirect_dir=off is
resolved, also resolve conflicts between nfs_export=on,index=off and
nfs_export=on,metacopy=on.
An explicit mount option wins over a default config value. Both explicit
mount options result in an error.
Without this change the xfstests group overlay/exportfs are skipped if
metacopy is enabled by default.
Reported-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The "buflen" value comes from the user and there is a potential that it
could be zero. In do_handle_to_path() we know that "handle->handle_bytes"
is non-zero and we do:
handle_dwords = handle->handle_bytes >> 2;
So values 1-3 become zero. Then in ovl_fh_to_dentry() we do:
int len = fh_len << 2;
So now len is in the "0,4-128" range and a multiple of 4. But if
"buflen" is zero it will try to copy negative bytes when we do the
memcpy in ovl_fid_to_fh().
memcpy(&fh->fb, fid, buflen - OVL_FH_WIRE_OFFSET);
And that will lead to a crash. Thanks to Amir Goldstein for his help
with this patch.
Fixes: cbe7fba8ed ("ovl: make sure that real fid is 32bit aligned in memory")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Cc: <stable@vger.kernel.org> # v5.5
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
As of now during open(), we don't pass bunch of flags to underlying
filesystem. O_TRUNC is one of these. Normally this is not a problem as VFS
calls ->setattr() with zero size and underlying filesystem sets file size
to 0.
But when overlayfs is running on top of virtiofs, it has an optimization
where it does not send setattr request to server if dectects that
truncation is part of open(O_TRUNC). It assumes that server already zeroed
file size as part of open(O_TRUNC).
fuse_do_setattr() {
if (attr->ia_valid & ATTR_OPEN) {
/*
* No need to send request to userspace, since actual
* truncation has already been done by OPEN. But still
* need to truncate page cache.
*/
}
}
IOW, fuse expects O_TRUNC to be passed to it as part of open flags.
But currently overlayfs does not pass O_TRUNC to underlying filesystem
hence fuse/virtiofs breaks. Setup overlayfs on top of virtiofs and
following does not zero the file size of a file is either upper only or has
already been copied up.
fd = open(foo.txt, O_TRUNC | O_WRONLY);
There are two ways to fix this. Either pass O_TRUNC to underlying
filesystem or clear ATTR_OPEN from attr->ia_valid so that fuse ends up
sending a SETATTR request to server. Miklos is concerned that O_TRUNC might
have side affects so it is better to clear ATTR_OPEN for now. Hence this
patch clears ATTR_OPEN from attr->ia_valid.
I found this problem while running unionmount-testsuite. With this patch,
unionmount-testsuite passes with overlayfs on top of virtiofs.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Fixes: bccece1ead ("ovl: allow remote upper")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
ovl_setattr() can be passed an attr which has ATTR_FILE set and
attr->ia_file is a file pointer to overlay file. This is done in
open(O_TRUNC) path.
We should either replace with attr->ia_file with underlying file object or
clear ATTR_FILE so that underlying filesystem does not end up using
overlayfs file object pointer.
There are no good use cases yet so for now clear ATTR_FILE. fuse seems to
be one user which can use this. But it can work even without this. So it
is not mandatory to pass ATTR_FILE to fuse.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Fixes: bccece1ead ("ovl: allow remote upper")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Several references got broken due to txt to ReST conversion.
Several of them can be automatically fixed with:
scripts/documentation-file-ref-check --fix
Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org> # hwtracing/coresight/Kconfig
Reviewed-by: Paul E. McKenney <paulmck@kernel.org> # memory-barrier.txt
Acked-by: Alex Shi <alex.shi@linux.alibaba.com> # translations/zh_CN
Acked-by: Federico Vaga <federico.vaga@vaga.pv.it> # translations/it_IT
Acked-by: Marc Zyngier <maz@kernel.org> # kvm/arm64
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/6f919ddb83a33b5f2a63b6b5f0575737bb2b36aa.1586881715.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
So far, with xino=auto, we only enable xino if we know that all
underlying filesystem use 32bit inode numbers.
When users configure overlay with xino=auto, they already declare that
they are ready to handle 64bit inode number from overlay.
It is a very common case, that underlying filesystem uses 64bit ino,
but rarely or never uses the high inode number bits (e.g. tmpfs, xfs).
Leaving it for the users to declare high ino bits are unused with
xino=on is not a recipe for many users to enjoy the benefits of xino.
There appears to be very little reason not to enable xino when users
declare xino=auto even if we do not know how many bits underlying
filesystem uses for inode numbers.
In the worst case of xino bits overflow by real inode number, we
already fall back to the non-xino behavior - real inode number with
unique pseudo dev or to non persistent inode number and overlay st_dev
(for directories).
The only annoyance from auto enabling xino is that xino bits overflow
emits a warning to kmsg. Suppress those warnings unless users explicitly
asked for xino=on, suggesting that they expected high ino bits to be
unused by underlying filesystem.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When xino feature is enabled and a real directory inode number overflows
the lower xino bits, we cannot map this directory inode number to a unique
and persistent inode number and we fall back to the real inode st_ino and
overlay st_dev.
The real inode st_ino with high bits may collide with a lower inode number
on overlay st_dev that was mapped using xino.
To avoid possible collision with legitimate xino values, map a non
persistent inode number to a dedicated range in the xino address space.
The dedicated range is created by adding one more bit to the number of
reserved high xino bits. We could have added just one more fsid, but that
would have had the undesired effect of changing persistent overlay inode
numbers on kernel or require more complex xino mapping code.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
There is no reason to deplete the system's global get_next_ino() pool for
overlay non-persistent inode numbers and there is no reason at all to
allocate non-persistent inode numbers for non-directories.
For non-directories, it is much better to leave i_ino the same as real
i_ino, to be consistent with st_ino/d_ino.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Changes to underlying layers should not cause WARN_ON(), but this repro
does:
mkdir w l u mnt
sudo mount -t overlay -o workdir=w,lowerdir=l,upperdir=u overlay mnt
touch mnt/h
ln u/h u/k
rm -rf mnt/k
rm -rf mnt/h
dmesg
------------[ cut here ]------------
WARNING: CPU: 1 PID: 116244 at fs/inode.c:302 drop_nlink+0x28/0x40
After upper hardlinks were added while overlay is mounted, unlinking all
overlay hardlinks drops overlay nlink to zero before all upper inodes
are unlinked.
After unlink/rename prevent i_nlink from going to zero if there are still
hashed aliases (i.e. cached hard links to the victim) remaining.
Reported-by: Phasip <phasip@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The current codebase makes use of the zero-length array language extension
to the C90 standard, but the preferred mechanism to declare variable-length
types such as these ones is a flexible array member[1][2], introduced in
C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning in
case the flexible array does not occur last in the structure, which will
help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by this
change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Fixes: cbe7fba8ed ("ovl: make sure that real fid is 32bit aligned in memory")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The situation is the same as for __d_obtain_alias() (which is what that
thing is parallel to) - if we find a preexisting alias, we want to grab it,
drop the inode and return the alias we'd found.
The only thing d_instantiate_anon() does compared to that is spurious
security_d_instiate() that has already been done to that dentry with exact
same arguments.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Overlayfs works sub-optimally with upper fs that has no xattr/d_type/
RENAME_WHITEOUT support. We should basically deprecate support for those
filesystems, but so far, we only issue a warning and don't fail the mount
for the sake of backward compat. Some features are already being disabled
with no xattr support.
For newly supported remote upper fs, we do not need to worry about backward
compatibility, so we can fail the mount if upper fs is a sub-optimal
filesystem.
This reduces the in-tree remote filesystems supported as upper to just
FUSE, for which the remote upper fs support was added.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
As with other required upper fs features, we only warn if support is
missing to avoid breaking existing sub-optimal setups.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
No reason to prevent upper layer being a remote filesystem. Do the
revalidation in that case, just as we already do for lower layers.
This lets virtiofs be used as upper layer, which appears to be a real use
case.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Allow completely skipping ->revalidate() on a per-dentry basis, in case the
underlying layers used for a dentry do not themselves have ->revalidate().
E.g. negative overlay dentry has no underlying layers, hence revalidate is
unnecessary. Or if lower layer is remote but overlay dentry is pure-upper,
then can skip revalidate.
The following places need to update whether the dentry needs revalidate or
not:
- fill-super (root dentry)
- lookup
- create
- fh_to_dentry
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Following patch will allow remote as upper layer, but not overlay stacked
on upper layer. Separate the two concepts.
This patch is doesn't change behavior.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Use a common loop for plain and weak revalidation. This will aid doing
revalidation on upper layer.
This patch doesn't change behavior.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This issue came up with NFSv4 as the lower layer, which generates
"system.nfs4_acl" xattrs (even for plain old unix permissions). Prior to
this patch this prevented copy-up from succeeding.
The overlayfs permission model mandates that permissions are checked
locally for the task and remotely for the mounter(*). NFS4 ACLs are not
supported by the Linux kernel currently, hence they cannot be enforced
locally. Which means it is indifferent whether this attribute is copied or
not.
Generalize this to any xattr that is not used in access checking (i.e. it's
not a POSIX ACL and not in the "security." namespace).
Incidentally, best effort copying of xattrs seems to also be the behavior
of "cp -a", which is what overlayfs tries to mimic.
(*) Documentation/filesystems/overlayfs.txt#Permission model
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Move i_ino initialization to ovl_inode_init() to avoid the dance of setting
i_ino in ovl_fill_inode() sometimes on the first call and sometimes on the
seconds call.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Allocates and initializes the root dentry and inode.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
ovl_inode_update() is no longer called from create object code path.
Fixes: 01b39dcc95 ("ovl: use inode_insert5() to hash a newly...")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Commit 6dde1e42f4 ("ovl: make i_ino consistent with st_ino in more
cases"), relaxed the condition nfs_export=on in order to set the value of
i_ino to xino map of real ino.
Specifically, it also relaxed the pre-condition that index=on for
consistent i_ino. This opened the corner case of lower hardlink in
ovl_get_inode(), which calls ovl_fill_inode() with ino=0 and then
ovl_init_inode() is called to set i_ino to lower real ino without the xino
mapping.
Pass the correct values of ino;fsid in this case to ovl_fill_inode(), so it
can initialize i_ino correctly.
Fixes: 6dde1e42f4 ("ovl: make i_ino consistent with st_ino in more ...")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Lockdep reports "WARNING: lock held when returning to user space!" due to
async write holding freeze lock over the write. Apparently aio.c already
deals with this by lying to lockdep about the state of the lock.
Do the same here. No need to check for S_IFREG() here since these file ops
are regular-only.
Reported-by: syzbot+9331a354f4f624a52a55@syzkaller.appspotmail.com
Fixes: 2406a307ac ("ovl: implement async IO routines")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fix up two bugs in the coversion to xino_mode:
1. xino=off does not always end up in disabled mode
2. xino=auto on 32bit arch should end up in disabled mode
Take a proactive approach to disabling xino on 32bit kernel:
1. Disable XINO_AUTO config during build time
2. Disable xino with a warning on mount time
As a by product, xino=on on 32bit arch also ends up in disabled mode.
We never intended to enable xino on 32bit arch and this will make the
rest of the logic simpler.
Fixes: 0f831ec85e ("ovl: simplify ovl_same_sb() helper")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
ovl_inode_lock() is interruptible. When inode_lock() in ovl_llseek()
was replaced with ovl_inode_lock(), we did not add a check for error.
Fix this by making ovl_inode_lock() uninterruptible and change the
existing call sites to use an _interruptible variant.
Reported-by: syzbot+66a9752fa927f745385e@syzkaller.appspotmail.com
Fixes: b1f9d3858f ("ovl: use ovl_inode_lock in ovl_llseek()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
ovl_lseek() is using ssize_t to return the value from vfs_llseek(). On a
32-bit kernel ssize_t is a 32-bit signed int, which overflows above 2 GB.
Assign the return value of vfs_llseek() to loff_t to fix this.
Reported-by: Boris Gjenero <boris.gjenero@gmail.com>
Fixes: 9e46b840c7 ("ovl: support stacked SEEK_HOLE/SEEK_DATA")
Cc: <stable@vger.kernel.org> # v4.19
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>