License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier should be applied
to a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
  SPDX license identifier                               # files
  -----------------------------------------------------|-------
  GPL-2.0                                                 11139
and resulted in the first patch in this series.
If that file was under a */uapi/* path, it was "GPL-2.0 WITH
Linux-syscall-note", otherwise it was "GPL-2.0". The results were:
  SPDX license identifier                               # files
  -----------------------------------------------------|-------
  GPL-2.0 WITH Linux-syscall-note                           930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
  SPDX license identifier                               # files
  -----------------------------------------------------|-------
  GPL-2.0 WITH Linux-syscall-note                           270
  GPL-2.0+ WITH Linux-syscall-note                          169
  ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)        21
  ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)        17
  LGPL-2.1+ WITH Linux-syscall-note                          15
  GPL-1.0+ WITH Linux-syscall-note                           14
  ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)        5
  LGPL-2.0+ WITH Linux-syscall-note                           4
  LGPL-2.1 WITH Linux-syscall-note                            3
  ((GPL-2.0 WITH Linux-syscall-note) OR MIT)                  3
  ((GPL-2.0 WITH Linux-syscall-note) AND MIT)                 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were any new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
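The tagging rule described above (uapi files get the syscall note, and headers and .c files need different comment types) can be sketched as follows. This is only an illustration, not the script Thomas and Greg used: the helper names are hypothetical, the `/uapi/` substring match is a simplification of the */uapi/* pattern, and the `/* ... */` style for headers is assumed from the kernel's license rules documentation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: default identifier for a file with no license text. */
static const char *default_spdx_id(const char *path)
{
	/* Simplified match for the uapi pattern used in the text. */
	if (strstr(path, "/uapi/"))
		return "GPL-2.0 WITH Linux-syscall-note";
	return "GPL-2.0";
}

/* Returns 1 if path ends with suffix, 0 otherwise. */
static int has_suffix(const char *path, const char *suffix)
{
	size_t plen = strlen(path);
	size_t slen = strlen(suffix);

	return plen >= slen && strcmp(path + plen - slen, suffix) == 0;
}

/*
 * Hypothetical helper: write the SPDX tag line for path into buf.
 * Headers get a block comment, everything else a line comment
 * (assumed convention, see the lead-in).
 */
static int format_spdx_line(const char *path, char *buf, size_t len)
{
	const char *id = default_spdx_id(path);

	assert(buf != NULL && len > 0);
	if (has_suffix(path, ".h"))
		return snprintf(buf, len, "/* SPDX-License-Identifier: %s */", id);
	return snprintf(buf, len, "// SPDX-License-Identifier: %s", id);
}
```

Calling `format_spdx_line()` on a .c file outside uapi yields the same line that now opens the file shown further down.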
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 22:07:57 +08:00
// SPDX-License-Identifier: GPL-2.0
2009-12-18 10:24:29 +08:00
#include <linux/fanotify.h>
2009-12-18 10:24:25 +08:00
#include <linux/fcntl.h>
2009-12-18 10:24:26 +08:00
#include <linux/file.h>
2009-12-18 10:24:25 +08:00
#include <linux/fs.h>
2009-12-18 10:24:26 +08:00
#include <linux/anon_inodes.h>
2009-12-18 10:24:25 +08:00
#include <linux/fsnotify_backend.h>
2009-12-18 10:24:26 +08:00
#include <linux/init.h>
2009-12-18 10:24:26 +08:00
#include <linux/mount.h>
2009-12-18 10:24:26 +08:00
#include <linux/namei.h>
2009-12-18 10:24:26 +08:00
#include <linux/poll.h>
2009-12-18 10:24:25 +08:00
#include <linux/security.h>
#include <linux/syscalls.h>
2010-05-19 23:36:28 +08:00
#include <linux/slab.h>
2009-12-18 10:24:26 +08:00
#include <linux/types.h>
2009-12-18 10:24:26 +08:00
#include <linux/uaccess.h>
2013-03-06 09:10:59 +08:00
#include <linux/compat.h>
2017-02-03 02:15:33 +08:00
#include <linux/sched/signal.h>
fs: fsnotify: account fsnotify metadata to kmemcg
Patch series "Directed kmem charging", v8.
The Linux kernel's memory cgroup allows limiting the memory usage of the
jobs running on the system to provide isolation between the jobs. All
the kernel memory allocated in the context of the job and marked with
__GFP_ACCOUNT will also be included in the memory usage and be limited
by the job's limit.
The kernel memory can only be charged to the memcg of the process in
whose context kernel memory was allocated. However there are cases
where the allocated kernel memory should be charged to a memcg
different from the current process's memcg. This patch series
contains two such concrete use-cases, i.e. fsnotify and buffer_head.
The fsnotify event objects can consume a lot of system memory for large
or unlimited queues if there is either no or slow listener. The events
are allocated in the context of the event producer. However they should
be charged to the event consumer. Similarly the buffer_head objects can
be allocated in a memcg different from the memcg of the page for which
buffer_head objects are being allocated.
To solve this issue, this patch series introduces a mechanism to charge
kernel memory to a given memcg. In case of fsnotify events, the memcg
of the consumer can be used for charging and for buffer_head, the memcg
of the page can be charged. For directed charging, the caller can use
the scope API memalloc_[un]use_memcg() to specify the memcg to charge
for all the __GFP_ACCOUNT allocations within the scope.
This patch (of 2):
A lot of memory can be consumed by the events generated for the huge or
unlimited queues if there is either no or slow listener. This can cause
system level memory pressure or OOMs. So, it's better to account the
fsnotify kmem caches to the memcg of the listener.
However the listener can be in a different memcg than the memcg of the
producer and these allocations happen in the context of the event
producer. This patch introduces a remote memcg charging API which the
producer can use to charge the allocations to the memcg of the listener.
There are seven fsnotify kmem caches and among them allocations from
dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
inotify_inode_mark_cachep happen in the context of a syscall from the
listener. So, SLAB_ACCOUNT is enough for these caches.
The objects from fsnotify_mark_connector_cachep are not accounted as
they are small compared to the notification mark or events and it is
unclear whom to account connector to since it is shared by all events
attached to the inode.
The allocations from the event caches happen in the context of the event
producer. For such caches we will need to remote charge the allocations
to the listener's memcg. Thus we save the memcg reference in the
fsnotify_group structure of the listener.
This patch also moves the members of fsnotify_group to fill existing
holes, keeping the structure size the same, at least for 64-bit builds,
even with the additional member.
[shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-18 06:46:39 +08:00
#include <linux/memcontrol.h>
2019-01-11 01:04:36 +08:00
#include <linux/statfs.h>
#include <linux/exportfs.h>
2009-12-18 10:24:26 +08:00

#include <asm/ioctls.h>
2009-12-18 10:24:25 +08:00

2011-11-25 15:35:16 +08:00
#include "../../mount.h"
2012-12-18 08:05:12 +08:00
#include "../fdinfo.h"
2014-01-22 07:48:14 +08:00
#include "fanotify.h"
2011-11-25 15:35:16 +08:00

2010-10-29 05:21:57 +08:00
#define FANOTIFY_DEFAULT_MAX_EVENTS 16384
2021-03-04 19:29:20 +08:00
#define FANOTIFY_OLD_DEFAULT_MAX_MARKS 8192
#define FANOTIFY_DEFAULT_MAX_GROUPS 128

/*
 * Legacy fanotify marks limits (8192) is per group and we introduced a tunable
 * limit of marks per user, similar to inotify. Effectively, the legacy limit
 * of fanotify marks per user is <max marks per group> * <max groups per user>.
 * This default limit (1M) also happens to match the increased limit of inotify
 * max_user_watches since v5.10.
 */
#define FANOTIFY_DEFAULT_MAX_USER_MARKS \
	(FANOTIFY_OLD_DEFAULT_MAX_MARKS * FANOTIFY_DEFAULT_MAX_GROUPS)
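The "1M" figure in the comment above follows directly from the two defaults: 8192 marks per group times 128 groups per user. A minimal standalone sketch of that arithmetic, with the values copied from the defines and a hypothetical helper name that is not part of the kernel source:

```c
#include <assert.h>

/* Values copied from the defines above. */
#define SKETCH_OLD_DEFAULT_MAX_MARKS	8192	/* legacy per-group limit */
#define SKETCH_DEFAULT_MAX_GROUPS	128	/* default groups per user */

/* Hypothetical helper, not part of the kernel source. */
static unsigned long sketch_default_max_user_marks(void)
{
	unsigned long marks = (unsigned long)SKETCH_OLD_DEFAULT_MAX_MARKS *
			      SKETCH_DEFAULT_MAX_GROUPS;

	/* 8192 * 128 = 2^13 * 2^7 = 2^20 */
	assert(marks == 1UL << 20);
	return marks;
}
```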

/*
 * Most of the memory cost of adding an inode mark is pinning the marked inode.
 * The size of the filesystem inode struct is not uniform across filesystems,
 * so double the size of a VFS inode is used as a conservative approximation.
 */
#define INODE_MARK_COST (2 * sizeof(struct inode))

/* configurable via /proc/sys/fs/fanotify/ */
static int fanotify_max_queued_events __read_mostly;

#ifdef CONFIG_SYSCTL

#include <linux/sysctl.h>

2021-07-30 14:28:54 +08:00
static long ft_zero = 0;
static long ft_int_max = INT_MAX;

2021-03-04 19:29:20 +08:00
struct ctl_table fanotify_table[] = {
	{
		.procname = "max_user_groups",
		.data = &init_user_ns.ucount_max[UCOUNT_FANOTIFY_GROUPS],
2021-07-30 14:28:54 +08:00
		.maxlen = sizeof(long),
2021-03-04 19:29:20 +08:00
		.mode = 0644,
2021-07-30 14:28:54 +08:00
		.proc_handler = proc_doulongvec_minmax,
		.extra1 = &ft_zero,
		.extra2 = &ft_int_max,
2021-03-04 19:29:20 +08:00
	},
	{
		.procname = "max_user_marks",
		.data = &init_user_ns.ucount_max[UCOUNT_FANOTIFY_MARKS],
2021-07-30 14:28:54 +08:00
		.maxlen = sizeof(long),
2021-03-04 19:29:20 +08:00
		.mode = 0644,
2021-07-30 14:28:54 +08:00
		.proc_handler = proc_doulongvec_minmax,
		.extra1 = &ft_zero,
		.extra2 = &ft_int_max,
2021-03-04 19:29:20 +08:00
	},
	{
		.procname = "max_queued_events",
		.data = &fanotify_max_queued_events,
		.maxlen = sizeof(int),
		.mode = 0644,
		.proc_handler = proc_dointvec_minmax,
		.extra1 = SYSCTL_ZERO
	},
	{ }
};
#endif /* CONFIG_SYSCTL */
2010-10-29 05:21:57 +08:00

fanotify: check file flags passed in fanotify_init
Without this patch fanotify_init does not validate the value passed in
event_f_flags.
When a fanotify event is read from the fanotify file descriptor a new
file descriptor is created where file.f_flags = event_f_flags.
Internal and external open flags are stored together in field f_flags of
struct file. Hence, an application might create file descriptors with
internal flags like FMODE_EXEC, FMODE_NOCMTIME set.
Jan Kara and Eric Paris both agreed that this is a bug and the value of
event_f_flags should be checked:
https://lkml.org/lkml/2014/4/29/522
https://lkml.org/lkml/2014/4/29/539
This updated patch version considers the comments by Michael Kerrisk in
https://lkml.org/lkml/2014/5/4/10
With the patch the value of event_f_flags is checked.
When specifying an invalid value error EINVAL is returned.
Internal flags are disallowed.
File creation flags are disallowed:
O_CREAT, O_DIRECTORY, O_EXCL, O_NOCTTY, O_NOFOLLOW, O_TRUNC, and O_TTY_INIT.
Flags which do not make sense with fanotify are disallowed:
__O_TMPFILE, O_PATH, FASYNC, and O_DIRECT.
This leaves us with the following allowed values:
O_RDONLY, O_WRONLY, O_RDWR are basic functionality. They are stored in the
bits given by O_ACCMODE.
O_APPEND is working as expected. The value might be useful in a logging
application which appends the current status each time the log is opened.
O_LARGEFILE is needed for files exceeding 4GB on 32bit systems.
O_NONBLOCK may be useful when monitoring slow devices like tapes.
O_NDELAY is equal to O_NONBLOCK except for platform parisc.
To avoid code breaking on parisc either both flags should be
allowed or none. The patch allows both.
__O_SYNC and O_DSYNC may be used to avoid data loss on power disruption.
O_NOATIME may be useful to reduce disk activity.
O_CLOEXEC may be useful, if separate processes shall be used to scan files.
Once this patch is accepted, the fanotify_init.2 manpage has to be updated.
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:05:44 +08:00
/*
 * All flags that may be specified in parameter event_f_flags of fanotify_init.
 *
 * Internal and external open flags are stored together in field f_flags of
 * struct file. Only external open flags shall be allowed in event_f_flags.
 * Internal flags like FMODE_NONOTIFY, FMODE_EXEC, FMODE_NOCMTIME shall be
 * excluded.
 */
#define FANOTIFY_INIT_ALL_EVENT_F_BITS ( \
		O_ACCMODE | O_APPEND | O_NONBLOCK | \
		__O_SYNC | O_DSYNC | O_CLOEXEC | \
		O_LARGEFILE | O_NOATIME )
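A mask like this is typically applied by rejecting any value that has a bit set outside it. The following userspace sketch illustrates that check under stated assumptions: the helper name and the `SKETCH_` mask are hypothetical, `O_SYNC` stands in for the kernel-internal `__O_SYNC` (in userspace `O_SYNC` already includes the `O_DSYNC` bit), and the code is not the kernel's fanotify_init() implementation.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for O_NOATIME and O_LARGEFILE */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

/* Userspace approximation of FANOTIFY_INIT_ALL_EVENT_F_BITS (assumption). */
#define SKETCH_EVENT_F_BITS ( \
		O_ACCMODE | O_APPEND | O_NONBLOCK | \
		O_SYNC | O_DSYNC | O_CLOEXEC | \
		O_LARGEFILE | O_NOATIME)

/* Hypothetical helper: 0 if every set bit is allowed, -EINVAL otherwise. */
static int sketch_check_event_f_flags(unsigned int event_f_flags)
{
	/* File creation flags must never be part of the allowed mask. */
	assert((SKETCH_EVENT_F_BITS & (O_CREAT | O_TRUNC)) == 0);

	if (event_f_flags & ~(unsigned int)SKETCH_EVENT_F_BITS)
		return -EINVAL;
	return 0;
}
```

With this check, a value such as `O_RDONLY | O_CLOEXEC | O_LARGEFILE` passes, while anything that includes `O_CREAT` or `O_TRUNC` is rejected with `-EINVAL`, matching the behaviour the commit message describes.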

2009-12-18 10:24:29 +08:00
extern const struct fsnotify_ops fanotify_fsnotify_ops;
2009-12-18 10:24:25 +08:00

2016-12-22 01:06:12 +08:00
struct kmem_cache *fanotify_mark_cache __read_mostly;
2020-03-25 00:04:20 +08:00
struct kmem_cache *fanotify_fid_event_cachep __read_mostly;
struct kmem_cache *fanotify_path_event_cachep __read_mostly;
2014-04-04 05:46:33 +08:00
struct kmem_cache *fanotify_perm_event_cachep __read_mostly;
2009-12-18 10:24:26 +08:00

2019-01-11 01:04:35 +08:00
#define FANOTIFY_EVENT_ALIGN 4
fanotify: report name info for FAN_DIR_MODIFY event
Report event FAN_DIR_MODIFY with name in a variable length record similar
to how fid's are reported. With name info reporting implemented, setting
FAN_DIR_MODIFY in mark mask is now allowed.
When events are reported with name, the reported fid identifies the
directory and the name follows the fid. The info record type for this
event info is FAN_EVENT_INFO_TYPE_DFID_NAME.
For now, all reported events have at most one info record which is
either FAN_EVENT_INFO_TYPE_FID or FAN_EVENT_INFO_TYPE_DFID_NAME (for
FAN_DIR_MODIFY). Later on, events "on child" will report both records.
There are several ways that an application can use this information:
1. When watching a single directory, the name is always relative to
the watched directory, so the application needs to fstatat(2) the name
relative to the watched directory.
2. When watching a set of directories, the application could keep a map
of dirfd for all watched directories and hash the map by fid obtained
with name_to_handle_at(2). When getting a name event, the fid in the
event info could be used to lookup the base dirfd in the map and then
call fstatat(2) with that dirfd.
3. When watching a filesystem (FAN_MARK_FILESYSTEM) or a large set of
directories, the application could use open_by_handle_at(2) with the fid
in event info to obtain dirfd for the directory where event happened and
call fstatat(2) with this dirfd.
The last option scales better for a large number of watched directories.
The first two options may be available in the future also for non
privileged fanotify watchers, because open_by_handle_at(2) requires
the CAP_DAC_READ_SEARCH capability.
Link: https://lore.kernel.org/r/20200319151022.31456-15-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-03-19 23:10:22 +08:00
#define FANOTIFY_INFO_HDR_LEN \
	(sizeof(struct fanotify_event_info_fid) + sizeof(struct file_handle))
2019-01-11 01:04:35 +08:00

2020-03-19 23:10:22 +08:00
static int fanotify_fid_info_len(int fh_len, int name_len)
2020-03-19 23:10:20 +08:00
{
2020-03-19 23:10:22 +08:00
	int info_len = fh_len;

	if (name_len)
		info_len += name_len + 1;

	return roundup(FANOTIFY_INFO_HDR_LEN + info_len, FANOTIFY_EVENT_ALIGN);
2020-03-19 23:10:20 +08:00
}
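The length computation above can be tried out in userspace with a small sketch. The names below are hypothetical, `hdr_len` stands in for FANOTIFY_INFO_HDR_LEN (whose real value depends on kernel struct sizes), and the extra byte added for a name is presumably its terminating NUL; that reading is an assumption, not something the source states.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_EVENT_ALIGN 4	/* mirrors FANOTIFY_EVENT_ALIGN */

/* Round x up to the next multiple of align (align must be positive). */
static size_t sketch_round_up(size_t x, size_t align)
{
	assert(align > 0);
	return ((x + align - 1) / align) * align;
}

/* Hypothetical userspace mirror of fanotify_fid_info_len(). */
static size_t sketch_fid_info_len(size_t hdr_len, size_t fh_len,
				  size_t name_len)
{
	size_t info_len = fh_len;

	if (name_len)
		info_len += name_len + 1;	/* name plus one extra byte */

	return sketch_round_up(hdr_len + info_len, SKETCH_EVENT_ALIGN);
}
```

For example, with an assumed 20-byte header and an 8-byte file handle, a record without a name needs 28 bytes, while a 1-byte name brings the raw length to 30, which is padded up to 32.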

2020-07-16 16:42:28 +08:00
static int fanotify_event_info_len(unsigned int fid_mode,
				   struct fanotify_event *event)
2019-01-11 01:04:35 +08:00
{
2020-07-16 16:42:17 +08:00
	struct fanotify_info *info = fanotify_event_info(event);
	int dir_fh_len = fanotify_event_dir_fh_len(event);
2020-03-24 23:55:37 +08:00
	int fh_len = fanotify_event_object_fh_len(event);
2020-07-16 16:42:17 +08:00
	int info_len = 0;
2020-07-16 16:42:28 +08:00
	int dot_len = 0;
2020-07-16 16:42:17 +08:00

2020-07-16 16:42:28 +08:00
	if (dir_fh_len) {
2020-07-16 16:42:17 +08:00
		info_len += fanotify_fid_info_len(dir_fh_len, info->name_len);
2020-07-16 16:42:28 +08:00
	} else if ((fid_mode & FAN_REPORT_NAME) && (event->mask & FAN_ONDIR)) {
		/*
		 * With group flag FAN_REPORT_NAME, if name was not recorded in
		 * event on a directory, we will report the name ".".
		 */
		dot_len = 1;
	}
2020-03-24 23:55:37 +08:00

2020-03-19 23:10:22 +08:00
	if (fh_len)
2020-07-16 16:42:28 +08:00
		info_len += fanotify_fid_info_len(fh_len, dot_len);
2020-03-19 23:10:22 +08:00

	return info_len;
2019-01-11 01:04:35 +08:00
}

2021-03-04 18:48:25 +08:00
/*
 * Remove a hashed event from merge hash table.
 */
static void fanotify_unhash_event(struct fsnotify_group *group,
				  struct fanotify_event *event)
{
	assert_spin_locked(&group->notification_lock);

	pr_debug("%s: group=%p event=%p bucket=%u\n", __func__,
		 group, event, fanotify_event_hash_bucket(group, event));

	if (WARN_ON_ONCE(hlist_unhashed(&event->merge_list)))
		return;

	hlist_del_init(&event->merge_list);
}

2009-12-18 10:24:26 +08:00
/*
2020-03-25 00:04:20 +08:00
 * Get an fanotify notification event if one exists and is small
2009-12-18 10:24:26 +08:00
 * enough to fit in "count". Return an error pointer if the count
2019-01-08 21:02:44 +08:00
 * is not large enough. When permission event is dequeued, its state is
 * updated accordingly.
2009-12-18 10:24:26 +08:00
 */
2020-03-25 00:04:20 +08:00
static struct fanotify_event *get_one_event(struct fsnotify_group *group,
2009-12-18 10:24:26 +08:00
					    size_t count)
{
2019-01-11 01:04:35 +08:00
	size_t event_size = FAN_EVENT_METADATA_LEN;
2020-03-25 00:04:20 +08:00
	struct fanotify_event *event = NULL;
2021-03-04 18:48:22 +08:00
	struct fsnotify_event *fsn_event;
2020-07-16 16:42:28 +08:00
	unsigned int fid_mode = FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS);
2009-12-18 10:24:26 +08:00

	pr_debug("%s: group=%p count=%zd\n", __func__, group, count);

2019-01-08 20:52:31 +08:00
	spin_lock(&group->notification_lock);
2021-03-04 18:48:22 +08:00
	fsn_event = fsnotify_peek_first_event(group);
	if (!fsn_event)
2019-01-08 20:52:31 +08:00
		goto out;
2009-12-18 10:24:26 +08:00

2021-03-04 18:48:22 +08:00
	event = FANOTIFY_E(fsn_event);
	if (fid_mode)
		event_size += fanotify_event_info_len(fid_mode, event);
2019-01-11 01:04:35 +08:00

2019-01-08 20:52:31 +08:00
	if (event_size > count) {
2020-03-25 00:04:20 +08:00
		event = ERR_PTR(-EINVAL);
2019-01-08 20:52:31 +08:00
		goto out;
	}
2021-03-04 18:48:22 +08:00

	/*
	 * Held the notification_lock the whole time, so this is the
	 * same event we peeked above.
	 */
	fsnotify_remove_first_event(group);
2020-03-25 00:04:20 +08:00
	if (fanotify_is_perm_event(event->mask))
		FANOTIFY_PERM(event)->state = FAN_EVENT_REPORTED;
2021-03-04 18:48:25 +08:00
	if (fanotify_is_hashed_event(event->mask))
		fanotify_unhash_event(group, event);
2019-01-08 20:52:31 +08:00
out:
	spin_unlock(&group->notification_lock);
2020-03-25 00:04:20 +08:00
	return event;
2009-12-18 10:24:26 +08:00
}
|
|
|
|
|
2020-03-24 22:27:52 +08:00
|
|
|
static int create_fd(struct fsnotify_group *group, struct path *path,
		     struct file **file)
{
	int client_fd;
	struct file *new_file;

fanotify: enable close-on-exec on events' fd when requested in fanotify_init()
According to commit 80af258867648 ("fanotify: groups can specify their
f_flags for new fd"), file descriptors created as part of file access
notification events inherit flags from the event_f_flags argument passed
to syscall fanotify_init(2)[1].
Unfortunately O_CLOEXEC is currently silently ignored.
Indeed, event_f_flags are only given to dentry_open(), which only seems to
care about O_ACCMODE and O_PATH in do_dentry_open(), O_DIRECT in
open_check_o_direct() and O_LARGEFILE in generic_file_open().
It's a pity since, according to searches on various search engines and
http://codesearch.debian.net/, there is already some userspace code which
uses O_CLOEXEC:
- in systemd's readahead[2]:
fanotify_fd = fanotify_init(FAN_CLOEXEC|FAN_NONBLOCK, O_RDONLY|O_LARGEFILE|O_CLOEXEC|O_NOATIME);
- in clsync[3]:
#define FANOTIFY_EVFLAGS (O_LARGEFILE|O_RDONLY|O_CLOEXEC)
int fanotify_d = fanotify_init(FANOTIFY_FLAGS, FANOTIFY_EVFLAGS);
- in examples [4] from "Filesystem monitoring in the Linux
kernel" article[5] by Aleksander Morgado:
if ((fanotify_fd = fanotify_init (FAN_CLOEXEC,
O_RDONLY | O_CLOEXEC | O_LARGEFILE)) < 0)
Additionally, since commit 48149e9d3a7e ("fanotify: check file flags
passed in fanotify_init"), having O_CLOEXEC as part of the second
argument of fanotify_init() is expressly allowed.
So it seems expected to set close-on-exec flag on the file descriptors if
userspace is allowed to request it with O_CLOEXEC.
But Andrew Morton raised[6] the concern that enabling close-on-exec now
might break existing applications which ask for O_CLOEXEC but expect the
file descriptor to be inherited across exec().
On the other hand, as reported by Mihai Dontu[7], a missing close-on-exec on
the file descriptor returned as part of file access notify can break
applications due to deadlock. So close-on-exec is needed for most applications.
Moreover, applications asking for close-on-exec are likely expecting it to be
enabled, relying on O_CLOEXEC being effective. If not, it might weaken
their security, as noted by Jan Kara[8].
So this patch replaces the call to the macro get_unused_fd() with a call to
the function get_unused_fd_flags(), passing the event_f_flags value as
argument. This way the O_CLOEXEC flag in the second argument of the
fanotify_init(2) syscall is interpreted and close-on-exec gets enabled when
requested.
[1] http://man7.org/linux/man-pages/man2/fanotify_init.2.html
[2] http://cgit.freedesktop.org/systemd/systemd/tree/src/readahead/readahead-collect.c?id=v208#n294
[3] https://github.com/xaionaro/clsync/blob/v0.2.1/sync.c#L1631
https://github.com/xaionaro/clsync/blob/v0.2.1/configuration.h#L38
[4] http://www.lanedo.com/~aleksander/fanotify/fanotify-example.c
[5] http://www.lanedo.com/2013/filesystem-monitoring-linux-kernel/
[6] http://lkml.kernel.org/r/20141001153621.65e9258e65a6167bf2e4cb50@linux-foundation.org
[7] http://lkml.kernel.org/r/20141002095046.3715eb69@mdontu-l
[8] http://lkml.kernel.org/r/20141002104410.GB19748@quack.suse.cz
Link: http://lkml.kernel.org/r/cover.1411562410.git.ydroneaud@opteya.com
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Tested-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Mihai Donțu <mihai.dontu@gmail.com>
Cc: Pádraig Brady <P@draigBrady.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Michael Kerrisk-manpages <mtk.manpages@gmail.com>
Cc: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: Richard Guy Briggs <rgb@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-10 06:24:40 +08:00
	client_fd = get_unused_fd_flags(group->fanotify_data.f_flags);
	if (client_fd < 0)
		return client_fd;

	/*
	 * we need a new file handle for the userspace program so it can read even if it was
	 * originally opened O_WRONLY.
	 */
	new_file = dentry_open(path,
			       group->fanotify_data.f_flags | FMODE_NONOTIFY,
			       current_cred());
	if (IS_ERR(new_file)) {
		/*
		 * we still send an event even if we can't open the file. this
		 * can happen when say tasks are gone and we try to open their
		 * /proc files or we try to open a WRONLY file like in sysfs
		 * we just send the errno to userspace since there isn't much
		 * else we can do.
		 */
		put_unused_fd(client_fd);
		client_fd = PTR_ERR(new_file);
	} else {
		*file = new_file;
	}

	return client_fd;
}
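
The close-on-exec behaviour discussed in the commit message above can be observed from userspace without any fanotify privileges. The following is a minimal sketch, not kernel code: it uses a plain open(2) of /dev/null as a stand-in for an event file descriptor, and the helper names fd_has_cloexec() and open_and_check() are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: 1 if fd has FD_CLOEXEC set, 0 if not, -1 on error. */
int fd_has_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags < 0)
		return -1;
	return (flags & FD_CLOEXEC) ? 1 : 0;
}

/*
 * Open /dev/null (assumed to exist) with the given extra open flags and
 * report whether the resulting descriptor is close-on-exec.
 */
int open_and_check(int extra_flags)
{
	int fd = open("/dev/null", O_RDONLY | extra_flags);
	int ret;

	if (fd < 0)
		return -1;
	ret = fd_has_cloexec(fd);
	close(fd);
	return ret;
}
```

Calling open_and_check(O_CLOEXEC) and open_and_check(0) shows the two cases side by side. The same fcntl(F_GETFD) check could be applied to the fd field of a fanotify event to see whether the flag requested in fanotify_init() took effect.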

/*
 * Finish processing of permission event by setting it to ANSWERED state and
 * drop group->notification_lock.
 */
static void finish_permission_event(struct fsnotify_group *group,
				    struct fanotify_perm_event *event,
				    unsigned int response)
				    __releases(&group->notification_lock)
{
	bool destroy = false;

	assert_spin_locked(&group->notification_lock);
	event->response = response;
	if (event->state == FAN_EVENT_CANCELED)
		destroy = true;
	else
		event->state = FAN_EVENT_ANSWERED;
	spin_unlock(&group->notification_lock);
	if (destroy)
		fsnotify_destroy_event(group, &event->fae.fse);
}

static int process_access_response(struct fsnotify_group *group,
				   struct fanotify_response *response_struct)
{
	struct fanotify_perm_event *event;
	int fd = response_struct->fd;
	int response = response_struct->response;

	pr_debug("%s: group=%p fd=%d response=%d\n", __func__, group,
		 fd, response);
	/*
	 * make sure the response is valid, if invalid we do nothing and either
	 * userspace can send a valid response or we will clean it up after the
	 * timeout
	 */
	switch (response & ~FAN_AUDIT) {
	case FAN_ALLOW:
	case FAN_DENY:
		break;
	default:
		return -EINVAL;
	}

	if (fd < 0)
		return -EINVAL;

	if ((response & FAN_AUDIT) && !FAN_GROUP_FLAG(group, FAN_ENABLE_AUDIT))
		return -EINVAL;

	spin_lock(&group->notification_lock);
	list_for_each_entry(event, &group->fanotify_data.access_list,
			    fae.fse.list) {
		if (event->fd != fd)
			continue;

		list_del_init(&event->fae.fse.list);
		finish_permission_event(group, event, response);
		wake_up(&group->fanotify_data.access_waitq);
		return 0;
	}
	spin_unlock(&group->notification_lock);

	return -ENOENT;
}
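
The validation order at the top of process_access_response() can be tried out in isolation. The sketch below is a userspace approximation under stated assumptions: the SKETCH_* constants are local stand-ins for the kernel's response bits (their values are illustrative), and sketch_validate_response() with its audit_enabled parameter is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <errno.h>

/* Local stand-ins for the response bits; values are illustrative only. */
#define SKETCH_ALLOW 0x01u
#define SKETCH_DENY  0x02u
#define SKETCH_AUDIT 0x10u

/*
 * Hypothetical helper mirroring the checks above: returns 0 if the
 * response would be accepted, -EINVAL otherwise.
 */
int sketch_validate_response(unsigned int response, int fd, int audit_enabled)
{
	/* Strip the audit modifier before checking the decision itself. */
	switch (response & ~SKETCH_AUDIT) {
	case SKETCH_ALLOW:
	case SKETCH_DENY:
		break;
	default:
		return -EINVAL;
	}

	if (fd < 0)
		return -EINVAL;

	/* The audit bit is only accepted when the group opted in. */
	if ((response & SKETCH_AUDIT) && !audit_enabled)
		return -EINVAL;

	return 0;
}
```

Masking the modifier bit first keeps the switch limited to the two real decisions, so a response that sets both decision bits, or neither, is rejected before any list lookup happens.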
fanotify: report name info for FAN_DIR_MODIFY event
Report event FAN_DIR_MODIFY with name in a variable length record similar
to how fid's are reported. With name info reporting implemented, setting
FAN_DIR_MODIFY in mark mask is now allowed.
When events are reported with name, the reported fid identifies the
directory and the name follows the fid. The info record type for this
event info is FAN_EVENT_INFO_TYPE_DFID_NAME.
For now, all reported events have at most one info record which is
either FAN_EVENT_INFO_TYPE_FID or FAN_EVENT_INFO_TYPE_DFID_NAME (for
FAN_DIR_MODIFY). Later on, events "on child" will report both records.
There are several ways that an application can use this information:
1. When watching a single directory, the name is always relative to
the watched directory, so the application needs to fstatat(2) the name
relative to the watched directory.
2. When watching a set of directories, the application could keep a map
of dirfd for all watched directories and hash the map by fid obtained
with name_to_handle_at(2). When getting a name event, the fid in the
event info could be used to look up the base dirfd in the map and then
call fstatat(2) with that dirfd.
3. When watching a filesystem (FAN_MARK_FILESYSTEM) or a large set of
directories, the application could use open_by_handle_at(2) with the fid
in event info to obtain dirfd for the directory where the event happened
and call fstatat(2) with this dirfd.
The last option scales better for a large number of watched directories.
The first two options may be available in the future also for
non-privileged fanotify watchers, because open_by_handle_at(2) requires
the CAP_DAC_READ_SEARCH capability.
Link: https://lore.kernel.org/r/20200319151022.31456-15-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-03-19 23:10:22 +08:00
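
Option 1 from the commit message above comes down to one fstatat(2) call relative to the watched directory. The following is a minimal userspace sketch, assuming the application already holds an open directory fd and has extracted the name from the event record; stat_reported_name() is a hypothetical helper name.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: stat "name" relative to an already open directory
 * fd. Returns 0 on success, -1 on failure with errno set by fstatat(2).
 */
int stat_reported_name(int dirfd, const char *name, struct stat *st)
{
	/* Do not follow symlinks: the event names the directory entry itself. */
	return fstatat(dirfd, name, st, AT_SYMLINK_NOFOLLOW);
}
```

A failing call is a normal outcome here: the entry may already have been renamed or removed by the time the event is read, so the caller should treat ENOENT as "entry is gone" and not as a fatal error.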
|
|
|
static int copy_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
|
2020-07-16 16:42:26 +08:00
|
|
|
int info_type, const char *name, size_t name_len,
|
fanotify: report name info for FAN_DIR_MODIFY event
Report event FAN_DIR_MODIFY with name in a variable length record similar
to how fid's are reported. With name info reporting implemented, setting
FAN_DIR_MODIFY in mark mask is now allowed.
When events are reported with name, the reported fid identifies the
directory and the name follows the fid. The info record type for this
event info is FAN_EVENT_INFO_TYPE_DFID_NAME.
For now, all reported events have at most one info record which is
either FAN_EVENT_INFO_TYPE_FID or FAN_EVENT_INFO_TYPE_DFID_NAME (for
FAN_DIR_MODIFY). Later on, events "on child" will report both records.
There are several ways that an application can use this information:
1. When watching a single directory, the name is always relative to
the watched directory, so application need to fstatat(2) the name
relative to the watched directory.
2. When watching a set of directories, the application could keep a map
of dirfd for all watched directories and hash the map by fid obtained
with name_to_handle_at(2). When getting a name event, the fid in the
event info could be used to lookup the base dirfd in the map and then
call fstatat(2) with that dirfd.
3. When watching a filesystem (FAN_MARK_FILESYSTEM) or a large set of
directories, the application could use open_by_handle_at(2) with the fid
in event info to obtain dirfd for the directory where event happened and
call fstatat(2) with this dirfd.
The last option scales better for a large number of watched directories.
The first two options may be available in the future also for non
privileged fanotify watchers, because open_by_handle_at(2) requires
the CAP_DAC_READ_SEARCH capability.
Link: https://lore.kernel.org/r/20200319151022.31456-15-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-03-19 23:10:22 +08:00
|
|
|
char __user *buf, size_t count)
|
2019-01-11 01:04:35 +08:00
|
|
|
{
|
|
|
|
struct fanotify_event_info_fid info = { };
|
|
|
|
struct file_handle handle = { };
|
2020-03-24 23:55:37 +08:00
|
|
|
unsigned char bounce[FANOTIFY_INLINE_FH_LEN], *fh_buf;
|
2020-03-19 23:10:21 +08:00
|
|
|
size_t fh_len = fh ? fh->len : 0;
|
fanotify: report name info for FAN_DIR_MODIFY event
Report event FAN_DIR_MODIFY with name in a variable length record similar
to how fid's are reported. With name info reporting implemented, setting
FAN_DIR_MODIFY in mark mask is now allowed.
When events are reported with name, the reported fid identifies the
directory and the name follows the fid. The info record type for this
event info is FAN_EVENT_INFO_TYPE_DFID_NAME.
For now, all reported events have at most one info record which is
either FAN_EVENT_INFO_TYPE_FID or FAN_EVENT_INFO_TYPE_DFID_NAME (for
FAN_DIR_MODIFY). Later on, events "on child" will report both records.
There are several ways that an application can use this information:
1. When watching a single directory, the name is always relative to
the watched directory, so application need to fstatat(2) the name
relative to the watched directory.
2. When watching a set of directories, the application could keep a map
of dirfd for all watched directories and hash the map by fid obtained
with name_to_handle_at(2). When getting a name event, the fid in the
event info could be used to lookup the base dirfd in the map and then
call fstatat(2) with that dirfd.
3. When watching a filesystem (FAN_MARK_FILESYSTEM) or a large set of
directories, the application could use open_by_handle_at(2) with the fid
in event info to obtain dirfd for the directory where event happened and
call fstatat(2) with this dirfd.
The last option scales better for a large number of watched directories.
The first two options may be available in the future also for non
privileged fanotify watchers, because open_by_handle_at(2) requires
the CAP_DAC_READ_SEARCH capability.
Link: https://lore.kernel.org/r/20200319151022.31456-15-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-03-19 23:10:22 +08:00
|
|
|
size_t info_len = fanotify_fid_info_len(fh_len, name_len);
|
|
|
|
size_t len = info_len;
|
2019-01-11 01:04:35 +08:00
|
|
|
|
fanotify: report name info for FAN_DIR_MODIFY event
Report event FAN_DIR_MODIFY with name in a variable length record similar
to how fid's are reported. With name info reporting implemented, setting
FAN_DIR_MODIFY in mark mask is now allowed.
When events are reported with name, the reported fid identifies the
directory and the name follows the fid. The info record type for this
event info is FAN_EVENT_INFO_TYPE_DFID_NAME.
For now, all reported events have at most one info record which is
either FAN_EVENT_INFO_TYPE_FID or FAN_EVENT_INFO_TYPE_DFID_NAME (for
FAN_DIR_MODIFY). Later on, events "on child" will report both records.
There are several ways that an application can use this information:
1. When watching a single directory, the name is always relative to
the watched directory, so application need to fstatat(2) the name
relative to the watched directory.
2. When watching a set of directories, the application could keep a map
of dirfd for all watched directories and hash the map by fid obtained
with name_to_handle_at(2). When getting a name event, the fid in the
event info could be used to lookup the base dirfd in the map and then
call fstatat(2) with that dirfd.
3. When watching a filesystem (FAN_MARK_FILESYSTEM) or a large set of
directories, the application could use open_by_handle_at(2) with the fid
in event info to obtain dirfd for the directory where event happened and
call fstatat(2) with this dirfd.
The last option scales better for a large number of watched directories.
The first two options may be available in the future also for non
privileged fanotify watchers, because open_by_handle_at(2) requires
the CAP_DAC_READ_SEARCH capability.
Link: https://lore.kernel.org/r/20200319151022.31456-15-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-03-19 23:10:22 +08:00
|
|
|
pr_debug("%s: fh_len=%zu name_len=%zu, info_len=%zu, count=%zu\n",
|
|
|
|
__func__, fh_len, name_len, info_len, count);
|
|
|
|
|
2020-07-16 16:42:26 +08:00
|
|
|
if (!fh_len)
|
2019-01-11 01:04:35 +08:00
|
|
|
return 0;
|
|
|
|
|
fanotify: report name info for FAN_DIR_MODIFY event
Report event FAN_DIR_MODIFY with name in a variable length record similar
to how fid's are reported. With name info reporting implemented, setting
FAN_DIR_MODIFY in mark mask is now allowed.
When events are reported with name, the reported fid identifies the
directory and the name follows the fid. The info record type for this
event info is FAN_EVENT_INFO_TYPE_DFID_NAME.
For now, all reported events have at most one info record which is
either FAN_EVENT_INFO_TYPE_FID or FAN_EVENT_INFO_TYPE_DFID_NAME (for
FAN_DIR_MODIFY). Later on, events "on child" will report both records.
There are several ways that an application can use this information:
1. When watching a single directory, the name is always relative to
the watched directory, so application need to fstatat(2) the name
relative to the watched directory.
2. When watching a set of directories, the application could keep a map
of dirfd for all watched directories and hash the map by fid obtained
with name_to_handle_at(2). When getting a name event, the fid in the
event info could be used to lookup the base dirfd in the map and then
call fstatat(2) with that dirfd.
3. When watching a filesystem (FAN_MARK_FILESYSTEM) or a large set of
directories, the application could use open_by_handle_at(2) with the fid
in event info to obtain dirfd for the directory where event happened and
call fstatat(2) with this dirfd.
The last option scales better for a large number of watched directories.
The first two options may be available in the future also for non
privileged fanotify watchers, because open_by_handle_at(2) requires
the CAP_DAC_READ_SEARCH capability.
Link: https://lore.kernel.org/r/20200319151022.31456-15-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-03-19 23:10:22 +08:00
|
|
|
if (WARN_ON_ONCE(len < sizeof(info) || len > count))
|
2019-01-11 01:04:35 +08:00
|
|
|
return -EFAULT;
|
|
|
|
|
fanotify: report name info for FAN_DIR_MODIFY event
Report event FAN_DIR_MODIFY with name in a variable length record similar
to how fid's are reported. With name info reporting implemented, setting
FAN_DIR_MODIFY in mark mask is now allowed.
When events are reported with name, the reported fid identifies the
directory and the name follows the fid. The info record type for this
event info is FAN_EVENT_INFO_TYPE_DFID_NAME.
For now, all reported events have at most one info record which is
either FAN_EVENT_INFO_TYPE_FID or FAN_EVENT_INFO_TYPE_DFID_NAME (for
FAN_DIR_MODIFY). Later on, events "on child" will report both records.
There are several ways that an application can use this information:
1. When watching a single directory, the name is always relative to
the watched directory, so application need to fstatat(2) the name
relative to the watched directory.
2. When watching a set of directories, the application could keep a map
of dirfd for all watched directories and hash the map by fid obtained
with name_to_handle_at(2). When getting a name event, the fid in the
event info could be used to lookup the base dirfd in the map and then
call fstatat(2) with that dirfd.
3. When watching a filesystem (FAN_MARK_FILESYSTEM) or a large set of
directories, the application could use open_by_handle_at(2) with the fid
in event info to obtain dirfd for the directory where event happened and
call fstatat(2) with this dirfd.
The last option scales better for a large number of watched directories.
The first two options may be available in the future also for non
privileged fanotify watchers, because open_by_handle_at(2) requires
the CAP_DAC_READ_SEARCH capability.
Link: https://lore.kernel.org/r/20200319151022.31456-15-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-03-19 23:10:22 +08:00
|
|
|
/*
|
|
|
|
* Copy event info fid header followed by variable sized file handle
|
|
|
|
* and optionally followed by variable sized filename.
|
|
|
|
*/
|
2020-07-16 16:42:26 +08:00
|
|
|
switch (info_type) {
|
|
|
|
case FAN_EVENT_INFO_TYPE_FID:
|
|
|
|
case FAN_EVENT_INFO_TYPE_DFID:
|
|
|
|
if (WARN_ON_ONCE(name_len))
|
|
|
|
return -EFAULT;
|
|
|
|
break;
|
|
|
|
case FAN_EVENT_INFO_TYPE_DFID_NAME:
|
|
|
|
if (WARN_ON_ONCE(!name || !name_len))
|
|
|
|
return -EFAULT;
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
return -EFAULT;
|
|
|
|
}
|
|
|
|
|
|
|
|
info.hdr.info_type = info_type;
|
2019-01-11 01:04:35 +08:00
|
|
|
info.hdr.len = len;
|
2020-03-19 23:10:20 +08:00
|
|
|
info.fsid = *fsid;
|
2019-01-11 01:04:35 +08:00
|
|
|
if (copy_to_user(buf, &info, sizeof(info)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
buf += sizeof(info);
|
|
|
|
len -= sizeof(info);
|
fanotify: report name info for FAN_DIR_MODIFY event
Report event FAN_DIR_MODIFY with name in a variable length record similar
to how fid's are reported. With name info reporting implemented, setting
FAN_DIR_MODIFY in mark mask is now allowed.
When events are reported with name, the reported fid identifies the
directory and the name follows the fid. The info record type for this
event info is FAN_EVENT_INFO_TYPE_DFID_NAME.
For now, all reported events have at most one info record which is
either FAN_EVENT_INFO_TYPE_FID or FAN_EVENT_INFO_TYPE_DFID_NAME (for
FAN_DIR_MODIFY). Later on, events "on child" will report both records.
There are several ways that an application can use this information:
1. When watching a single directory, the name is always relative to
the watched directory, so application need to fstatat(2) the name
relative to the watched directory.
2. When watching a set of directories, the application could keep a map
of dirfd for all watched directories and hash the map by fid obtained
with name_to_handle_at(2). When getting a name event, the fid in the
event info could be used to lookup the base dirfd in the map and then
call fstatat(2) with that dirfd.
3. When watching a filesystem (FAN_MARK_FILESYSTEM) or a large set of
directories, the application could use open_by_handle_at(2) with the fid
in event info to obtain dirfd for the directory where event happened and
call fstatat(2) with this dirfd.
The last option scales better for a large number of watched directories.
The first two options may be available in the future also for non
privileged fanotify watchers, because open_by_handle_at(2) requires
the CAP_DAC_READ_SEARCH capability.
Link: https://lore.kernel.org/r/20200319151022.31456-15-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-03-19 23:10:22 +08:00
|
|
|
if (WARN_ON_ONCE(len < sizeof(handle)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
2020-03-24 23:55:37 +08:00
|
|
|
handle.handle_type = fh->type;
|
2019-01-11 01:04:35 +08:00
|
|
|
handle.handle_bytes = fh_len;
|
|
|
|
if (copy_to_user(buf, &handle, sizeof(handle)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
buf += sizeof(handle);
|
|
|
|
len -= sizeof(handle);
|
fanotify: report name info for FAN_DIR_MODIFY event
Report event FAN_DIR_MODIFY with name in a variable length record similar
to how fid's are reported. With name info reporting implemented, setting
FAN_DIR_MODIFY in mark mask is now allowed.
When events are reported with name, the reported fid identifies the
directory and the name follows the fid. The info record type for this
event info is FAN_EVENT_INFO_TYPE_DFID_NAME.
For now, all reported events have at most one info record which is
either FAN_EVENT_INFO_TYPE_FID or FAN_EVENT_INFO_TYPE_DFID_NAME (for
FAN_DIR_MODIFY). Later on, events "on child" will report both records.
There are several ways that an application can use this information:
1. When watching a single directory, the name is always relative to
the watched directory, so application need to fstatat(2) the name
relative to the watched directory.
2. When watching a set of directories, the application could keep a map
of dirfd for all watched directories and hash the map by fid obtained
with name_to_handle_at(2). When getting a name event, the fid in the
event info could be used to lookup the base dirfd in the map and then
call fstatat(2) with that dirfd.
3. When watching a filesystem (FAN_MARK_FILESYSTEM) or a large set of
directories, the application could use open_by_handle_at(2) with the fid
in event info to obtain dirfd for the directory where event happened and
call fstatat(2) with this dirfd.
The last option scales better for a large number of watched directories.
The first two options may be available in the future also for non
privileged fanotify watchers, because open_by_handle_at(2) requires
the CAP_DAC_READ_SEARCH capability.
Link: https://lore.kernel.org/r/20200319151022.31456-15-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-03-19 23:10:22 +08:00
|
|
|
if (WARN_ON_ONCE(len < fh_len))
|
|
|
|
return -EFAULT;
|
|
|
|
|
2019-03-12 19:42:37 +08:00
|
|
|
/*
|
fanotify: report name info for FAN_DIR_MODIFY event
Report event FAN_DIR_MODIFY with name in a variable length record similar
to how fid's are reported. With name info reporting implemented, setting
FAN_DIR_MODIFY in mark mask is now allowed.
When events are reported with name, the reported fid identifies the
directory and the name follows the fid. The info record type for this
event info is FAN_EVENT_INFO_TYPE_DFID_NAME.
For now, all reported events have at most one info record which is
either FAN_EVENT_INFO_TYPE_FID or FAN_EVENT_INFO_TYPE_DFID_NAME (for
FAN_DIR_MODIFY). Later on, events "on child" will report both records.
There are several ways that an application can use this information:
1. When watching a single directory, the name is always relative to
the watched directory, so application need to fstatat(2) the name
relative to the watched directory.
2. When watching a set of directories, the application could keep a map
of dirfd for all watched directories and hash the map by fid obtained
with name_to_handle_at(2). When getting a name event, the fid in the
event info could be used to lookup the base dirfd in the map and then
call fstatat(2) with that dirfd.
3. When watching a filesystem (FAN_MARK_FILESYSTEM) or a large set of
directories, the application could use open_by_handle_at(2) with the fid
in event info to obtain dirfd for the directory where event happened and
call fstatat(2) with this dirfd.
The last option scales better for a large number of watched directories.
The first two options may be available in the future also for non
privileged fanotify watchers, because open_by_handle_at(2) requires
the CAP_DAC_READ_SEARCH capability.
Link: https://lore.kernel.org/r/20200319151022.31456-15-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-03-19 23:10:22 +08:00
|
|
|
* For an inline fh and inline file name, copy through stack to exclude
|
|
|
|
* the copy from usercopy hardening protections.
|
2019-03-12 19:42:37 +08:00
|
|
|
*/
|
2020-03-24 23:55:37 +08:00
|
|
|
fh_buf = fanotify_fh_buf(fh);
|
2019-03-12 19:42:37 +08:00
|
|
|
if (fh_len <= FANOTIFY_INLINE_FH_LEN) {
|
2020-03-24 23:55:37 +08:00
|
|
|
memcpy(bounce, fh_buf, fh_len);
|
|
|
|
fh_buf = bounce;
|
2019-03-12 19:42:37 +08:00
|
|
|
}
|
2020-03-24 23:55:37 +08:00
|
|
|
if (copy_to_user(buf, fh_buf, fh_len))
|
2019-01-11 01:04:35 +08:00
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
buf += fh_len;
|
|
|
|
len -= fh_len;
|
fanotify: report name info for FAN_DIR_MODIFY event
Report event FAN_DIR_MODIFY with name in a variable length record similar
to how fid's are reported. With name info reporting implemented, setting
FAN_DIR_MODIFY in mark mask is now allowed.
When events are reported with name, the reported fid identifies the
directory and the name follows the fid. The info record type for this
event info is FAN_EVENT_INFO_TYPE_DFID_NAME.
For now, all reported events have at most one info record which is
either FAN_EVENT_INFO_TYPE_FID or FAN_EVENT_INFO_TYPE_DFID_NAME (for
FAN_DIR_MODIFY). Later on, events "on child" will report both records.
There are several ways that an application can use this information:
1. When watching a single directory, the name is always relative to
the watched directory, so application need to fstatat(2) the name
relative to the watched directory.
2. When watching a set of directories, the application could keep a map
of dirfd for all watched directories and hash the map by fid obtained
with name_to_handle_at(2). When getting a name event, the fid in the
event info could be used to lookup the base dirfd in the map and then
call fstatat(2) with that dirfd.
3. When watching a filesystem (FAN_MARK_FILESYSTEM) or a large set of
directories, the application could use open_by_handle_at(2) with the fid
in event info to obtain dirfd for the directory where event happened and
call fstatat(2) with this dirfd.
The last option scales better for a large number of watched directories.
The first two options may be available in the future also for non
privileged fanotify watchers, because open_by_handle_at(2) requires
the CAP_DAC_READ_SEARCH capability.
Link: https://lore.kernel.org/r/20200319151022.31456-15-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-03-19 23:10:22 +08:00
|
|
|
|
|
|
|
if (name_len) {
|
|
|
|
/* Copy the filename with terminating null */
|
|
|
|
name_len++;
|
|
|
|
if (WARN_ON_ONCE(len < name_len))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
if (copy_to_user(buf, name, name_len))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
buf += name_len;
|
|
|
|
len -= name_len;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Pad with 0's */
|
2019-01-11 01:04:35 +08:00
|
|
|
WARN_ON_ONCE(len < 0 || len >= FANOTIFY_EVENT_ALIGN);
|
|
|
|
if (len > 0 && clear_user(buf, len))
|
|
|
|
return -EFAULT;
|
|
|
|
|
fanotify: report name info for FAN_DIR_MODIFY event
Report event FAN_DIR_MODIFY with name in a variable length record similar
to how fid's are reported. With name info reporting implemented, setting
FAN_DIR_MODIFY in mark mask is now allowed.
When events are reported with name, the reported fid identifies the
directory and the name follows the fid. The info record type for this
event info is FAN_EVENT_INFO_TYPE_DFID_NAME.
For now, all reported events have at most one info record which is
either FAN_EVENT_INFO_TYPE_FID or FAN_EVENT_INFO_TYPE_DFID_NAME (for
FAN_DIR_MODIFY). Later on, events "on child" will report both records.
There are several ways that an application can use this information:
1. When watching a single directory, the name is always relative to
the watched directory, so application need to fstatat(2) the name
relative to the watched directory.
2. When watching a set of directories, the application could keep a map
of dirfd for all watched directories and hash the map by fid obtained
with name_to_handle_at(2). When getting a name event, the fid in the
event info could be used to lookup the base dirfd in the map and then
call fstatat(2) with that dirfd.
3. When watching a filesystem (FAN_MARK_FILESYSTEM) or a large set of
directories, the application could use open_by_handle_at(2) with the fid
in event info to obtain dirfd for the directory where event happened and
call fstatat(2) with this dirfd.
The last option scales better for a large number of watched directories.
The first two options may be available in the future also for non
privileged fanotify watchers, because open_by_handle_at(2) requires
the CAP_DAC_READ_SEARCH capability.
Link: https://lore.kernel.org/r/20200319151022.31456-15-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
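The record layout described in the commit message above can be consumed from userspace roughly as follows. This is a minimal sketch and not code from the patch: the `ex_` structures are local, hypothetical mirrors of the uapi info record and `file_handle` layouts, and the value 2 is assumed to match `FAN_EVENT_INFO_TYPE_DFID_NAME` in `<linux/fanotify.h>`. It only locates the name inside one record; a caller would then pass that name to fstatat(2) together with a dirfd obtained by one of the three options listed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical local mirrors of the uapi layouts (assumptions, see lead-in). */
struct ex_info_header {
	uint8_t  info_type;
	uint8_t  pad;
	uint16_t len;		/* total record length, padding included */
};

struct ex_fsid {
	int32_t val[2];
};

struct ex_file_handle {
	uint32_t handle_bytes;
	int32_t  handle_type;
	/* handle_bytes of opaque data follow, then the NUL-terminated name */
};

/* Assumed to equal FAN_EVENT_INFO_TYPE_DFID_NAME. */
enum { EX_INFO_TYPE_DFID_NAME = 2 };

/*
 * Return a pointer to the name stored in a DFID_NAME record, or NULL if
 * the record has another type or its lengths do not add up.
 */
static const char *ex_dfid_name(const unsigned char *rec, size_t avail)
{
	struct ex_info_header hdr;
	struct ex_file_handle fh;
	size_t off = sizeof(hdr) + sizeof(struct ex_fsid);
	size_t name_off;

	if (avail < off + sizeof(fh))
		return NULL;
	memcpy(&hdr, rec, sizeof(hdr));
	if (hdr.info_type != EX_INFO_TYPE_DFID_NAME || hdr.len > avail)
		return NULL;

	memcpy(&fh, rec + off, sizeof(fh));
	if (fh.handle_bytes > hdr.len)
		return NULL;
	name_off = off + sizeof(fh) + fh.handle_bytes;
	if (name_off >= hdr.len)
		return NULL;

	/* The name must be terminated inside the record. */
	if (!memchr(rec + name_off, '\0', hdr.len - name_off))
		return NULL;
	return (const char *)(rec + name_off);
}
```

The bytes between the info header and the name (fsid plus file handle) are what option 2 would use as a hash key and what option 3 would hand to open_by_handle_at(2).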
2020-03-19 23:10:22 +08:00
|
|
|
return info_len;
|
2019-01-11 01:04:35 +08:00
|
|
|
}
|
|
|
|
|
2009-12-18 10:24:26 +08:00
|
|
|
static ssize_t copy_event_to_user(struct fsnotify_group *group,
|
2020-03-25 00:04:20 +08:00
|
|
|
struct fanotify_event *event,
|
2018-12-05 07:44:46 +08:00
|
|
|
char __user *buf, size_t count)
|
2009-12-18 10:24:26 +08:00
|
|
|
{
|
2019-01-11 01:04:33 +08:00
|
|
|
struct fanotify_event_metadata metadata;
|
2020-03-25 00:04:20 +08:00
|
|
|
struct path *path = fanotify_event_path(event);
|
2020-07-16 16:42:17 +08:00
|
|
|
struct fanotify_info *info = fanotify_event_info(event);
|
2020-07-16 16:42:26 +08:00
|
|
|
unsigned int fid_mode = FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS);
|
2019-01-11 01:04:33 +08:00
|
|
|
struct file *f = NULL;
|
2019-01-11 01:04:34 +08:00
|
|
|
int ret, fd = FAN_NOFD;
|
2020-07-16 16:42:26 +08:00
|
|
|
int info_type = 0;
|
2009-12-18 10:24:26 +08:00
|
|
|
|
2020-03-25 00:04:20 +08:00
|
|
|
pr_debug("%s: group=%p event=%p\n", __func__, group, event);
|
2009-12-18 10:24:26 +08:00
|
|
|
|
2020-03-19 23:10:22 +08:00
|
|
|
metadata.event_len = FAN_EVENT_METADATA_LEN +
|
2020-07-16 16:42:28 +08:00
|
|
|
fanotify_event_info_len(fid_mode, event);
|
2019-01-11 01:04:33 +08:00
|
|
|
metadata.metadata_len = FAN_EVENT_METADATA_LEN;
|
|
|
|
metadata.vers = FANOTIFY_METADATA_VERSION;
|
|
|
|
metadata.reserved = 0;
|
|
|
|
metadata.mask = event->mask & FANOTIFY_OUTGOING_EVENTS;
|
|
|
|
metadata.pid = pid_vnr(event->pid);
|
2021-03-04 19:29:21 +08:00
|
|
|
/*
|
|
|
|
* For an unprivileged listener, event->pid can be used to identify the
|
|
|
|
* events generated by the listener process itself, without disclosing
|
|
|
|
* the pids of other processes.
|
|
|
|
*/
|
2021-05-24 21:53:21 +08:00
|
|
|
if (FAN_GROUP_FLAG(group, FANOTIFY_UNPRIV) &&
|
2021-03-04 19:29:21 +08:00
|
|
|
task_tgid(current) != event->pid)
|
|
|
|
metadata.pid = 0;
|
2019-01-11 01:04:33 +08:00
|
|
|
|
2021-05-24 21:53:21 +08:00
|
|
|
/*
|
|
|
|
* For now, fid mode is required for an unprivileged listener and
|
|
|
|
* fid mode does not report fd in events. Keep this check anyway
|
|
|
|
* for safety in case fid mode requirement is relaxed in the future
|
|
|
|
* to allow unprivileged listener to get events with no fd and no fid.
|
|
|
|
*/
|
|
|
|
if (!FAN_GROUP_FLAG(group, FANOTIFY_UNPRIV) &&
|
|
|
|
path && path->mnt && path->dentry) {
|
2020-03-24 23:55:37 +08:00
|
|
|
fd = create_fd(group, path, &f);
|
|
|
|
if (fd < 0)
|
|
|
|
return fd;
|
2019-01-11 01:04:33 +08:00
|
|
|
}
|
|
|
|
metadata.fd = fd;
|
2009-12-18 10:24:34 +08:00
|
|
|
|
|
|
|
ret = -EFAULT;
|
2018-12-05 07:44:46 +08:00
|
|
|
/*
|
|
|
|
* Sanity check copy size in case get_one_event() and
|
2020-05-13 02:18:36 +08:00
|
|
|
* event_len sizes ever get out of sync.
|
2018-12-05 07:44:46 +08:00
|
|
|
*/
|
2019-01-11 01:04:33 +08:00
|
|
|
if (WARN_ON_ONCE(metadata.event_len > count))
|
2018-12-05 07:44:46 +08:00
|
|
|
goto out_close_fd;
|
2019-01-11 01:04:33 +08:00
|
|
|
|
2019-01-11 01:04:35 +08:00
|
|
|
if (copy_to_user(buf, &metadata, FAN_EVENT_METADATA_LEN))
|
2012-08-20 00:30:45 +08:00
|
|
|
goto out_close_fd;
|
|
|
|
|
2020-03-19 23:10:22 +08:00
|
|
|
buf += FAN_EVENT_METADATA_LEN;
|
|
|
|
count -= FAN_EVENT_METADATA_LEN;
|
|
|
|
|
2019-01-11 01:04:33 +08:00
|
|
|
if (fanotify_is_perm_event(event->mask))
|
2020-03-25 00:04:20 +08:00
|
|
|
FANOTIFY_PERM(event)->fd = fd;
|
2009-12-18 10:24:26 +08:00
|
|
|
|
2020-03-19 23:10:22 +08:00
|
|
|
if (f)
|
2012-11-19 03:19:00 +08:00
|
|
|
fd_install(fd, f);
|
2020-03-19 23:10:22 +08:00
|
|
|
|
|
|
|
/* Event info records order is: dir fid + name, child fid */
|
2020-07-16 16:42:17 +08:00
|
|
|
if (fanotify_event_dir_fh_len(event)) {
|
2020-07-16 16:42:30 +08:00
|
|
|
info_type = info->name_len ? FAN_EVENT_INFO_TYPE_DFID_NAME :
|
|
|
|
FAN_EVENT_INFO_TYPE_DFID;
|
2020-03-19 23:10:22 +08:00
|
|
|
ret = copy_info_to_user(fanotify_event_fsid(event),
|
2020-07-16 16:42:17 +08:00
|
|
|
fanotify_info_dir_fh(info),
|
2020-07-16 16:42:26 +08:00
|
|
|
info_type, fanotify_info_name(info),
|
2020-07-16 16:42:17 +08:00
|
|
|
info->name_len, buf, count);
|
2019-01-11 01:04:35 +08:00
|
|
|
if (ret < 0)
|
2021-06-11 11:32:06 +08:00
|
|
|
goto out_close_fd;
|
2020-03-19 23:10:22 +08:00
|
|
|
|
|
|
|
buf += ret;
|
|
|
|
count -= ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (fanotify_event_object_fh_len(event)) {
|
2020-07-16 16:42:28 +08:00
|
|
|
const char *dot = NULL;
|
|
|
|
int dot_len = 0;
|
|
|
|
|
2020-07-16 16:42:26 +08:00
|
|
|
if (fid_mode == FAN_REPORT_FID || info_type) {
|
|
|
|
/*
|
|
|
|
* With only group flag FAN_REPORT_FID only type FID is
|
|
|
|
* reported. Second info record type is always FID.
|
|
|
|
*/
|
|
|
|
info_type = FAN_EVENT_INFO_TYPE_FID;
|
2020-07-16 16:42:28 +08:00
|
|
|
} else if ((fid_mode & FAN_REPORT_NAME) &&
|
|
|
|
(event->mask & FAN_ONDIR)) {
|
|
|
|
/*
|
|
|
|
* With group flag FAN_REPORT_NAME, if name was not
|
|
|
|
* recorded in an event on a directory, report the
|
|
|
|
* name "." with info type DFID_NAME.
|
|
|
|
*/
|
|
|
|
info_type = FAN_EVENT_INFO_TYPE_DFID_NAME;
|
|
|
|
dot = ".";
|
|
|
|
dot_len = 1;
|
2020-07-16 16:42:26 +08:00
|
|
|
} else if ((event->mask & ALL_FSNOTIFY_DIRENT_EVENTS) ||
|
|
|
|
(event->mask & FAN_ONDIR)) {
|
|
|
|
/*
|
|
|
|
* With group flag FAN_REPORT_DIR_FID, a single info
|
|
|
|
* record has type DFID for directory entry modification
|
|
|
|
* event and for event on a directory.
|
|
|
|
*/
|
|
|
|
info_type = FAN_EVENT_INFO_TYPE_DFID;
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* With group flags FAN_REPORT_DIR_FID|FAN_REPORT_FID,
|
|
|
|
* a single info record has type FID for event on a
|
|
|
|
* non-directory, when there is no directory to report.
|
|
|
|
* For example, on FAN_DELETE_SELF event.
|
|
|
|
*/
|
|
|
|
info_type = FAN_EVENT_INFO_TYPE_FID;
|
|
|
|
}
|
|
|
|
|
2020-03-19 23:10:22 +08:00
|
|
|
ret = copy_info_to_user(fanotify_event_fsid(event),
|
|
|
|
fanotify_event_object_fh(event),
|
2020-07-16 16:42:28 +08:00
|
|
|
info_type, dot, dot_len, buf, count);
|
2020-03-19 23:10:22 +08:00
|
|
|
if (ret < 0)
|
2021-06-11 11:32:06 +08:00
|
|
|
goto out_close_fd;
|
2020-03-19 23:10:22 +08:00
|
|
|
|
|
|
|
buf += ret;
|
|
|
|
count -= ret;
|
2019-01-11 01:04:35 +08:00
|
|
|
}
|
|
|
|
|
2019-01-11 01:04:33 +08:00
|
|
|
return metadata.event_len;
|
2009-12-18 10:24:34 +08:00
|
|
|
|
|
|
|
out_close_fd:
|
2012-08-20 00:30:45 +08:00
|
|
|
if (fd != FAN_NOFD) {
|
|
|
|
put_unused_fd(fd);
|
|
|
|
fput(f);
|
|
|
|
}
|
2009-12-18 10:24:34 +08:00
|
|
|
return ret;
|
2009-12-18 10:24:26 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* fanotify userspace file descriptor functions */
|
2017-07-03 13:02:18 +08:00
|
|
|
static __poll_t fanotify_poll(struct file *file, poll_table *wait)
|
2009-12-18 10:24:26 +08:00
|
|
|
{
|
|
|
|
struct fsnotify_group *group = file->private_data;
|
2017-07-03 13:02:18 +08:00
|
|
|
__poll_t ret = 0;
|
2009-12-18 10:24:26 +08:00
|
|
|
|
|
|
|
poll_wait(file, &group->notification_waitq, wait);
|
2016-10-08 07:56:52 +08:00
|
|
|
spin_lock(&group->notification_lock);
|
2009-12-18 10:24:26 +08:00
|
|
|
if (!fsnotify_notify_queue_is_empty(group))
|
2018-02-12 06:34:03 +08:00
|
|
|
ret = EPOLLIN | EPOLLRDNORM;
|
2016-10-08 07:56:52 +08:00
|
|
|
spin_unlock(&group->notification_lock);
|
2009-12-18 10:24:26 +08:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t fanotify_read(struct file *file, char __user *buf,
|
|
|
|
size_t count, loff_t *pos)
|
|
|
|
{
|
|
|
|
struct fsnotify_group *group;
|
2020-03-25 00:04:20 +08:00
|
|
|
struct fanotify_event *event;
|
2009-12-18 10:24:26 +08:00
|
|
|
char __user *start;
|
|
|
|
int ret;
|
2014-12-16 23:28:38 +08:00
|
|
|
DEFINE_WAIT_FUNC(wait, woken_wake_function);
|
2009-12-18 10:24:26 +08:00
|
|
|
|
|
|
|
start = buf;
|
|
|
|
group = file->private_data;
|
|
|
|
|
|
|
|
pr_debug("%s: group=%p\n", __func__, group);
|
|
|
|
|
2014-12-16 23:28:38 +08:00
|
|
|
add_wait_queue(&group->notification_waitq, &wait);
|
2009-12-18 10:24:26 +08:00
|
|
|
while (1) {
|
2020-07-15 20:06:21 +08:00
|
|
|
/*
|
|
|
|
* User can supply arbitrarily large buffer. Avoid softlockups
|
|
|
|
* in case there are lots of available events.
|
|
|
|
*/
|
|
|
|
cond_resched();
|
2020-03-25 00:04:20 +08:00
|
|
|
event = get_one_event(group, count);
|
|
|
|
if (IS_ERR(event)) {
|
|
|
|
ret = PTR_ERR(event);
|
2014-04-04 05:46:35 +08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2020-03-25 00:04:20 +08:00
|
|
|
if (!event) {
|
2014-04-04 05:46:35 +08:00
|
|
|
ret = -EAGAIN;
|
|
|
|
if (file->f_flags & O_NONBLOCK)
|
2009-12-18 10:24:26 +08:00
|
|
|
break;
|
2014-04-04 05:46:35 +08:00
|
|
|
|
|
|
|
ret = -ERESTARTSYS;
|
|
|
|
if (signal_pending(current))
|
|
|
|
break;
|
|
|
|
|
|
|
|
if (start != buf)
|
2009-12-18 10:24:26 +08:00
|
|
|
break;
|
2014-12-16 23:28:38 +08:00
|
|
|
|
|
|
|
wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
|
2009-12-18 10:24:26 +08:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2020-03-25 00:04:20 +08:00
|
|
|
ret = copy_event_to_user(group, event, buf, count);
|
2017-04-25 19:29:35 +08:00
|
|
|
if (unlikely(ret == -EOPENSTALE)) {
|
|
|
|
/*
|
|
|
|
* We cannot report events with stale fd so drop it.
|
|
|
|
* Setting ret to 0 will continue the event loop and
|
|
|
|
* do the right thing if there are no more events to
|
|
|
|
* read (i.e. return bytes read, -EAGAIN or wait).
|
|
|
|
*/
|
|
|
|
ret = 0;
|
|
|
|
}
|
|
|
|
|
2014-04-04 05:46:35 +08:00
|
|
|
/*
|
|
|
|
* Permission events get queued to wait for response. Other
|
|
|
|
* events can be destroyed now.
|
|
|
|
*/
|
2020-03-25 00:04:20 +08:00
|
|
|
if (!fanotify_is_perm_event(event->mask)) {
|
|
|
|
fsnotify_destroy_event(group, &event->fse);
|
2014-04-04 05:46:36 +08:00
|
|
|
} else {
|
2017-04-25 19:29:35 +08:00
|
|
|
if (ret <= 0) {
|
2019-01-08 21:02:44 +08:00
|
|
|
spin_lock(&group->notification_lock);
|
|
|
|
finish_permission_event(group,
|
2020-03-25 00:04:20 +08:00
|
|
|
FANOTIFY_PERM(event), FAN_DENY);
|
2014-04-04 05:46:36 +08:00
|
|
|
wake_up(&group->fanotify_data.access_waitq);
|
2017-04-25 19:29:35 +08:00
|
|
|
} else {
|
|
|
|
spin_lock(&group->notification_lock);
|
2020-03-25 00:04:20 +08:00
|
|
|
list_add_tail(&event->fse.list,
|
2017-04-25 19:29:35 +08:00
|
|
|
&group->fanotify_data.access_list);
|
|
|
|
spin_unlock(&group->notification_lock);
|
2014-04-04 05:46:36 +08:00
|
|
|
}
|
|
|
|
}
|
2017-04-25 19:29:35 +08:00
|
|
|
if (ret < 0)
|
|
|
|
break;
|
2014-04-04 05:46:35 +08:00
|
|
|
buf += ret;
|
|
|
|
count -= ret;
|
2009-12-18 10:24:26 +08:00
|
|
|
}
|
2014-12-16 23:28:38 +08:00
|
|
|
remove_wait_queue(&group->notification_waitq, &wait);
|
2009-12-18 10:24:26 +08:00
|
|
|
|
|
|
|
if (start != buf && ret != -EFAULT)
|
|
|
|
ret = buf - start;
|
|
|
|
return ret;
|
|
|
|
}
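Seen from the reader's side, fanotify_read() above fills the buffer with whole events placed back to back, and each event starts with metadata whose event_len gives the distance to the next one. The following is a minimal userspace sketch of walking such a buffer. `ex_event_metadata` is a hypothetical local mirror of `struct fanotify_event_metadata`, and the loop is assumed to do the same job as the `FAN_EVENT_OK()` / `FAN_EVENT_NEXT()` macros from `<linux/fanotify.h>`, which real code would normally use instead.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of struct fanotify_event_metadata (24 bytes). */
struct ex_event_metadata {
	uint32_t event_len;	/* metadata plus any info records */
	uint8_t  vers;
	uint8_t  reserved;
	uint16_t metadata_len;
	uint64_t mask;
	int32_t  fd;
	int32_t  pid;
};

/*
 * Count the complete events in buf[0..len). Stops at the first event
 * whose event_len is too small or runs past the end of the buffer.
 */
static size_t ex_count_events(const unsigned char *buf, size_t len)
{
	size_t n = 0;

	while (len >= sizeof(struct ex_event_metadata)) {
		struct ex_event_metadata md;

		/* memcpy avoids alignment assumptions about buf. */
		memcpy(&md, buf, sizeof(md));
		if (md.event_len < sizeof(md) || md.event_len > len)
			break;

		buf += md.event_len;
		len -= md.event_len;
		n++;
	}
	return n;
}
```

A real consumer would inspect md.mask and md.fd inside the loop and close any fd it receives; the counting here only illustrates the stepping.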
|
|
|
|
|
2009-12-18 10:24:34 +08:00
|
|
|
static ssize_t fanotify_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
|
|
|
|
{
|
|
|
|
struct fanotify_response response = { .fd = -1, .response = -1 };
|
|
|
|
struct fsnotify_group *group;
|
|
|
|
int ret;
|
|
|
|
|
2017-10-31 04:14:56 +08:00
|
|
|
if (!IS_ENABLED(CONFIG_FANOTIFY_ACCESS_PERMISSIONS))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2009-12-18 10:24:34 +08:00
|
|
|
group = file->private_data;
|
|
|
|
|
2020-05-13 02:19:21 +08:00
|
|
|
if (count < sizeof(response))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
count = sizeof(response);
|
2009-12-18 10:24:34 +08:00
|
|
|
|
|
|
|
pr_debug("%s: group=%p count=%zu\n", __func__, group, count);
|
|
|
|
|
|
|
|
if (copy_from_user(&response, buf, count))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
ret = process_access_response(group, &response);
|
|
|
|
if (ret < 0)
|
|
|
|
count = ret;
|
|
|
|
|
|
|
|
return count;
|
|
|
|
}
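fanotify_write() above rejects writes shorter than a response structure, so a listener answers a permission event by writing one complete response to the fanotify fd. The sketch below shows that from userspace under stated assumptions: `ex_response` is a hypothetical mirror of `struct fanotify_response`, and the values 0x01 and 0x02 are assumed to match `FAN_ALLOW` and `FAN_DENY` in `<linux/fanotify.h>`.

```c
#include <stdint.h>
#include <unistd.h>

/* Hypothetical mirror of struct fanotify_response. */
struct ex_response {
	int32_t  fd;		/* fd delivered with the permission event */
	uint32_t response;	/* allow or deny decision */
};

/* Assumed to equal FAN_ALLOW and FAN_DENY. */
enum { EX_ALLOW = 0x01, EX_DENY = 0x02 };

/*
 * Write one decision for event_fd to notify_fd.
 * Returns 0 if the whole response was written, -1 otherwise.
 */
static int ex_reply(int notify_fd, int event_fd, int allow)
{
	struct ex_response r = {
		.fd = event_fd,
		.response = allow ? EX_ALLOW : EX_DENY,
	};
	ssize_t n = write(notify_fd, &r, sizeof(r));

	return n == (ssize_t)sizeof(r) ? 0 : -1;
}
```

The listener is still expected to close event_fd after replying; that step is left out here.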
|
|
|
|
|
2009-12-18 10:24:26 +08:00
|
|
|
static int fanotify_release(struct inode *ignored, struct file *file)
|
|
|
|
{
|
|
|
|
struct fsnotify_group *group = file->private_data;
|
2021-03-04 18:48:22 +08:00
|
|
|
struct fsnotify_event *fsn_event;
|
2010-10-29 05:21:59 +08:00
|
|
|
|
2014-08-07 07:03:28 +08:00
|
|
|
/*
|
2016-09-20 05:44:30 +08:00
|
|
|
* Stop new events from arriving in the notification queue. Since
|
|
|
|
* userspace cannot use fanotify fd anymore, no event can enter or
|
|
|
|
* leave access_list by now either.
|
2014-08-07 07:03:28 +08:00
|
|
|
*/
|
2016-09-20 05:44:30 +08:00
|
|
|
fsnotify_group_stop_queueing(group);
|
2010-08-19 00:25:50 +08:00
|
|
|
|
2016-09-20 05:44:30 +08:00
|
|
|
/*
|
|
|
|
* Process all permission events on access_list and notification queue
|
|
|
|
* and simulate reply from userspace.
|
|
|
|
*/
|
2016-10-08 07:56:55 +08:00
|
|
|
spin_lock(&group->notification_lock);
|
2019-01-09 20:21:01 +08:00
|
|
|
while (!list_empty(&group->fanotify_data.access_list)) {
|
2020-03-25 00:04:20 +08:00
|
|
|
struct fanotify_perm_event *event;
|
|
|
|
|
2019-01-09 20:21:01 +08:00
|
|
|
event = list_first_entry(&group->fanotify_data.access_list,
|
|
|
|
struct fanotify_perm_event, fae.fse.list);
|
2014-04-04 05:46:33 +08:00
|
|
|
list_del_init(&event->fae.fse.list);
|
2019-01-08 21:02:44 +08:00
|
|
|
finish_permission_event(group, event, FAN_ALLOW);
|
|
|
|
spin_lock(&group->notification_lock);
|
2010-08-19 00:25:50 +08:00
|
|
|
}
|
|
|
|
|
2014-08-07 07:03:28 +08:00
|
|
|
/*
|
2016-09-20 05:44:30 +08:00
|
|
|
* Destroy all non-permission events. For permission events just
|
|
|
|
* dequeue them and set the response. They will be freed once the
|
|
|
|
* response is consumed and fanotify_get_response() returns.
|
2014-08-07 07:03:28 +08:00
|
|
|
*/
|
2021-03-04 18:48:22 +08:00
|
|
|
while ((fsn_event = fsnotify_remove_first_event(group))) {
|
|
|
|
struct fanotify_event *event = FANOTIFY_E(fsn_event);
|
2020-03-25 00:04:20 +08:00
|
|
|
|
|
|
|
if (!(event->mask & FANOTIFY_PERM_EVENTS)) {
|
2016-10-08 07:56:52 +08:00
|
|
|
spin_unlock(&group->notification_lock);
|
2021-03-04 18:48:22 +08:00
|
|
|
fsnotify_destroy_event(group, fsn_event);
|
2017-10-31 04:14:56 +08:00
|
|
|
} else {
|
2020-03-25 00:04:20 +08:00
|
|
|
finish_permission_event(group, FANOTIFY_PERM(event),
|
2019-01-08 21:02:44 +08:00
|
|
|
FAN_ALLOW);
|
2017-10-31 04:14:56 +08:00
|
|
|
}
|
2019-01-08 21:02:44 +08:00
|
|
|
spin_lock(&group->notification_lock);
|
2016-09-20 05:44:30 +08:00
|
|
|
}
|
2016-10-08 07:56:52 +08:00
|
|
|
spin_unlock(&group->notification_lock);
|
2016-09-20 05:44:30 +08:00
|
|
|
|
|
|
|
/* Response for all permission events is set, wake up waiters */
|
2010-08-19 00:25:50 +08:00
|
|
|
wake_up(&group->fanotify_data.access_waitq);
|
2011-10-15 05:43:39 +08:00
|
|
|
|
2009-12-18 10:24:26 +08:00
|
|
|
/* matches the fanotify_init->fsnotify_alloc_group */
|
2011-06-14 23:29:45 +08:00
|
|
|
fsnotify_destroy_group(group);
|
2009-12-18 10:24:26 +08:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2009-12-18 10:24:26 +08:00
|
|
|
static long fanotify_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
|
|
|
|
{
|
|
|
|
struct fsnotify_group *group;
|
2014-01-22 07:48:14 +08:00
|
|
|
struct fsnotify_event *fsn_event;
|
2009-12-18 10:24:26 +08:00
|
|
|
void __user *p;
|
|
|
|
int ret = -ENOTTY;
|
|
|
|
size_t send_len = 0;
|
|
|
|
|
|
|
|
group = file->private_data;
|
|
|
|
|
|
|
|
p = (void __user *) arg;
|
|
|
|
|
|
|
|
switch (cmd) {
|
|
|
|
case FIONREAD:
|
2016-10-08 07:56:52 +08:00
|
|
|
spin_lock(&group->notification_lock);
|
2014-01-22 07:48:14 +08:00
|
|
|
list_for_each_entry(fsn_event, &group->notification_list, list)
|
2009-12-18 10:24:26 +08:00
|
|
|
send_len += FAN_EVENT_METADATA_LEN;
|
2016-10-08 07:56:52 +08:00
|
|
|
spin_unlock(&group->notification_lock);
|
2009-12-18 10:24:26 +08:00
|
|
|
ret = put_user(send_len, (int __user *) p);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2009-12-18 10:24:26 +08:00
|
|
|
static const struct file_operations fanotify_fops = {
|
2012-12-18 08:05:12 +08:00
|
|
|
.show_fdinfo = fanotify_show_fdinfo,
|
2009-12-18 10:24:26 +08:00
|
|
|
.poll = fanotify_poll,
|
|
|
|
.read = fanotify_read,
|
2009-12-18 10:24:34 +08:00
|
|
|
.write = fanotify_write,
|
2009-12-18 10:24:26 +08:00
|
|
|
.fasync = NULL,
|
|
|
|
.release = fanotify_release,
|
2009-12-18 10:24:26 +08:00
|
|
|
.unlocked_ioctl = fanotify_ioctl,
|
2018-09-12 03:59:08 +08:00
|
|
|
.compat_ioctl = compat_ptr_ioctl,
|
llseek: automatically add .llseek fop
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.
The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.
New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.
The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.
Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.
Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.
===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}
@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}
@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}
@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}
@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};
@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};
@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};
@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};
@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};
// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};
@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};
// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};
// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};
// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};
@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};
// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////
@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// neither read nor write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};
@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};
@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};
@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====
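The precedence that the rules above encode can be read as a plain decision chain. The following is a minimal userspace sketch of that chain, not kernel code and not part of the original patch: `struct fops_traits`, `enum llseek_choice` and `choose_llseek()` are hypothetical names introduced for illustration, and each boolean stands in for what the corresponding Coccinelle rule detects. Where more than one rule could match the same structure, the sketch simply takes the first, which is a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one struct file_operations instance. */
struct fops_traits {
	bool has_llseek;          /* .llseek is already set */
	bool open_is_nonseekable; /* .open is, or calls, nonseekable_open() */
	bool read_is_seq_read;    /* .read is seq_read */
	bool has_readdir;         /* .readdir is present */
	bool read_uses_pos;       /* .read touches its loff_t argument */
	bool write_uses_pos;      /* .write touches its loff_t argument */
};

/* Hypothetical result type: which helper would be assigned to .llseek. */
enum llseek_choice {
	LLSEEK_KEEP,    /* leave the existing .llseek alone */
	LLSEEK_NO,      /* no_llseek */
	LLSEEK_SEQ,     /* seq_lseek */
	LLSEEK_DEFAULT, /* default_llseek */
	LLSEEK_NOOP     /* noop_llseek */
};

static enum llseek_choice choose_llseek(const struct fops_traits *t)
{
	assert(t != NULL);

	if (t->has_llseek)
		return LLSEEK_KEEP;
	if (t->open_is_nonseekable)
		return LLSEEK_NO;
	if (t->read_is_seq_read)
		return LLSEEK_SEQ;
	/* readdir, or a read/write that uses the file position */
	if (t->has_readdir || t->read_uses_pos || t->write_uses_pos)
		return LLSEEK_DEFAULT;
	/* nothing looks at the file position: seeking stays a harmless no-op */
	return LLSEEK_NOOP;
}
```

A structure with no `.open`, no `.readdir` and read/write handlers that ignore the offset falls through to the last branch, which is the case in which the patch adds `noop_llseek`.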
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
2010-08-16 00:52:59 +08:00
|
|
|
.llseek = noop_llseek,
|
2009-12-18 10:24:26 +08:00
|
|
|
};
|
|
|
|
|
2009-12-18 10:24:26 +08:00
|
|
|
static int fanotify_find_path(int dfd, const char __user *filename,
|
fanotify, inotify, dnotify, security: add security hook for fs notifications
As of now, setting watches on filesystem objects has, at most, applied a
check for read access to the inode, and in the case of fanotify, requires
CAP_SYS_ADMIN. No specific security hook or permission check has been
provided to control the setting of watches. Using any of inotify, dnotify,
or fanotify, it is possible to observe, not only write-like operations, but
even read access to a file. Modeling the watch as being merely a read from
the file is insufficient for the needs of SELinux. This is due to the fact
that read access should not necessarily imply access to information about
when another process reads from a file. Furthermore, fanotify watches grant
more power to an application in the form of permission events. While
notification events are solely unidirectional (i.e. they only pass
information to the receiving application), permission events are blocking.
Permission events make a request to the receiving application which will
then reply with a decision as to whether or not that action may be
completed. This causes the issue of the watching application having the
ability to exercise control over the triggering process. Without drawing a
distinction within the permission check, the ability to read would imply
the greater ability to control an application. Additionally, mount and
superblock watches apply to all files within the same mount or superblock.
Read access to one file should not necessarily imply the ability to watch
all files accessed within a given mount or superblock.
In order to solve these issues, a new LSM hook is implemented and has been
placed within the system calls for marking filesystem objects with inotify,
fanotify, and dnotify watches. These calls to the hook are placed at the
point at which the target path has been resolved and are provided with the
path struct, the mask of requested notification events, and the type of
object on which the mark is being set (inode, superblock, or mount). The
mask and obj_type have already been translated into common FS_* values
shared by the entirety of the fs notification infrastructure. The path
struct is passed rather than just the inode so that the mount is available,
particularly for mount watches. This also allows for use of the hook by
pathname-based security modules. However, since the hook is intended for
use even by inode based security modules, it is not placed under the
CONFIG_SECURITY_PATH conditional. Otherwise, the inode-based security
modules would need to enable all of the path hooks, even though they do not
use any of them.
This only provides a hook at the point of setting a watch, and presumes
that permission to set a particular watch implies the ability to receive
all notifications about that object that match the mask. This is all that
is required for SELinux. If other security modules require additional hooks
or infrastructure to control delivery of notification, these can be added
by them. It does not make sense for us to propose hooks for which we have
no implementation. The understanding that all notifications received by the
requesting application are all strictly of a type for which the application
has been granted permission shows that this implementation is sufficient in
its coverage.
Security modules wishing to provide complete control over fanotify must
also implement a security_file_open hook that validates that the access
requested by the watching application is authorized. Fanotify has the issue
that it returns a file descriptor with the file mode specified during
fanotify_init() to the watching process on event. This is already covered
by the LSM security_file_open hook if the security module implements
checking of the requested file mode there. Otherwise, a watching process
can obtain escalated access to a file for which it has not been authorized.
The selinux_path_notify hook implementation works by adding five new file
permissions: watch, watch_mount, watch_sb, watch_reads, and watch_with_perm
(descriptions about which will follow), and one new filesystem permission:
watch (which is applied to superblock checks). The hook then decides which
subset of these permissions must be held by the requesting application
based on the contents of the provided mask and the obj_type. The
selinux_file_open hook already checks the requested file mode and therefore
ensures that a watching process cannot escalate its access through
fanotify.
The watch, watch_mount, and watch_sb permissions are the baseline
permissions for setting a watch on an object, and each is a requirement for
any watch to be set on a file, mount, or superblock respectively. It should
be noted that having either of the other two permissions (watch_reads and
watch_with_perm) does not imply the watch, watch_mount, or watch_sb
permission. Superblock watches further require the filesystem watch
permission to the superblock. As there is no labeled object in view for
mounts, there is no specific check for mount watches beyond watch_mount to
the inode. Such a check could be added in the future, if a suitable labeled
object existed representing the mount.
The watch_reads permission is required to receive notifications from
read-exclusive events on filesystem objects. These events include accessing
a file for the purpose of reading and closing a file which has been opened
read-only. This distinction has been drawn in order to provide a direct
indication in the policy for this otherwise not obvious capability. Read
access to a file should not necessarily imply the ability to observe read
events on a file.
Finally, watch_with_perm only applies to fanotify masks since it is the
only way to set a mask which allows for the blocking, permission event.
This permission is needed for any watch which is of this type. Though
fanotify requires CAP_SYS_ADMIN, this is insufficient as it gives implicit
trust to root, which we do not do, and does not support least privilege.
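The way the hook derives the required permission set from the mask and the object type can be sketched as follows. This is a minimal userspace illustration under stated assumptions, not the selinux_path_notify implementation: every `DEMO_*` constant and the function `demo_required_perms()` are hypothetical placeholders, and the additional filesystem-class watch check for superblocks is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical object types, standing in for inode, mount and superblock. */
enum demo_obj_type { DEMO_OBJ_INODE, DEMO_OBJ_MOUNT, DEMO_OBJ_SB };

/* Hypothetical event bits. */
#define DEMO_EV_MODIFY 0x1u /* a write-like event */
#define DEMO_EV_READ   0x2u /* stands in for the read-exclusive events */
#define DEMO_EV_PERM   0x4u /* stands in for blocking permission events */

/* Hypothetical bits for the five file permissions named above. */
#define DEMO_PERM_WATCH           0x01u
#define DEMO_PERM_WATCH_MOUNT     0x02u
#define DEMO_PERM_WATCH_SB        0x04u
#define DEMO_PERM_WATCH_READS     0x08u
#define DEMO_PERM_WATCH_WITH_PERM 0x10u

static uint32_t demo_required_perms(uint32_t mask, enum demo_obj_type type)
{
	uint32_t perms;

	/* Baseline permission: exactly one, chosen by object type. */
	switch (type) {
	case DEMO_OBJ_MOUNT:
		perms = DEMO_PERM_WATCH_MOUNT;
		break;
	case DEMO_OBJ_SB:
		perms = DEMO_PERM_WATCH_SB;
		break;
	default:
		assert(type == DEMO_OBJ_INODE);
		perms = DEMO_PERM_WATCH;
		break;
	}

	/* Extra permissions are added on top of the baseline, never instead of it. */
	if (mask & DEMO_EV_READ)
		perms |= DEMO_PERM_WATCH_READS;
	if (mask & DEMO_EV_PERM)
		perms |= DEMO_PERM_WATCH_WITH_PERM;

	return perms;
}
```

The requesting application would then need every bit in the returned set, which reflects the statement that watch_reads and watch_with_perm do not imply any of the baseline permissions.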
Signed-off-by: Aaron Goidel <acgoide@tycho.nsa.gov>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-08-12 23:20:00 +08:00
|
|
|
struct path *path, unsigned int flags, __u64 mask,
|
|
|
|
unsigned int obj_type)
|
2009-12-18 10:24:26 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
pr_debug("%s: dfd=%d filename=%p flags=%x\n", __func__,
|
|
|
|
dfd, filename, flags);
|
|
|
|
|
|
|
|
if (filename == NULL) {
|
2012-08-29 00:52:22 +08:00
|
|
|
struct fd f = fdget(dfd);
|
2009-12-18 10:24:26 +08:00
|
|
|
|
|
|
|
ret = -EBADF;
|
2012-08-29 00:52:22 +08:00
|
|
|
if (!f.file)
|
2009-12-18 10:24:26 +08:00
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = -ENOTDIR;
|
|
|
|
if ((flags & FAN_MARK_ONLYDIR) &&
|
2013-01-24 06:07:38 +08:00
|
|
|
!(S_ISDIR(file_inode(f.file)->i_mode))) {
|
2012-08-29 00:52:22 +08:00
|
|
|
fdput(f);
|
2009-12-18 10:24:26 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2012-08-29 00:52:22 +08:00
|
|
|
*path = f.file->f_path;
|
2009-12-18 10:24:26 +08:00
|
|
|
path_get(path);
|
2012-08-29 00:52:22 +08:00
|
|
|
fdput(f);
|
2009-12-18 10:24:26 +08:00
|
|
|
} else {
|
|
|
|
unsigned int lookup_flags = 0;
|
|
|
|
|
|
|
|
if (!(flags & FAN_MARK_DONT_FOLLOW))
|
|
|
|
lookup_flags |= LOOKUP_FOLLOW;
|
|
|
|
if (flags & FAN_MARK_ONLYDIR)
|
|
|
|
lookup_flags |= LOOKUP_DIRECTORY;
|
|
|
|
|
|
|
|
ret = user_path_at(dfd, filename, lookup_flags, path);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* you can only watch an inode if you have read permissions on it */
|
2021-01-21 21:19:22 +08:00
|
|
|
ret = path_permission(path, MAY_READ);
|
2019-08-12 23:20:00 +08:00
|
|
|
if (ret) {
|
|
|
|
path_put(path);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = security_path_notify(path, mask, obj_type);
|
2009-12-18 10:24:26 +08:00
|
|
|
if (ret)
|
|
|
|
path_put(path);
|
2019-08-12 23:20:00 +08:00
|
|
|
|
2009-12-18 10:24:26 +08:00
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2009-12-18 10:24:33 +08:00
|
|
|
static __u32 fanotify_mark_remove_from_mask(struct fsnotify_mark *fsn_mark,
|
2020-07-16 16:42:14 +08:00
|
|
|
__u32 mask, unsigned int flags,
|
|
|
|
__u32 umask, int *destroy)
|
2009-12-18 10:24:28 +08:00
|
|
|
{
|
2015-02-11 06:08:24 +08:00
|
|
|
__u32 oldmask = 0;
|
2009-12-18 10:24:28 +08:00
|
|
|
|
2020-07-16 16:42:14 +08:00
|
|
|
/* umask bits cannot be removed by user */
|
|
|
|
mask &= ~umask;
|
2009-12-18 10:24:28 +08:00
|
|
|
spin_lock(&fsn_mark->lock);
|
2009-12-18 10:24:33 +08:00
|
|
|
if (!(flags & FAN_MARK_IGNORED_MASK)) {
|
|
|
|
oldmask = fsn_mark->mask;
|
2018-10-04 05:25:34 +08:00
|
|
|
fsn_mark->mask &= ~mask;
|
2009-12-18 10:24:33 +08:00
|
|
|
} else {
|
2018-10-04 05:25:34 +08:00
|
|
|
fsn_mark->ignored_mask &= ~mask;
|
2009-12-18 10:24:33 +08:00
|
|
|
}
|
2020-07-16 16:42:14 +08:00
|
|
|
/*
|
|
|
|
* We need to keep the mark around even if remaining mask cannot
|
|
|
|
* result in any events (e.g. mask == FAN_ONDIR) to support incremental
|
|
|
|
* changes to the mask.
|
|
|
|
* Destroy mark when only umask bits remain.
|
|
|
|
*/
|
|
|
|
*destroy = !((fsn_mark->mask | fsn_mark->ignored_mask) & ~umask);
|
2009-12-18 10:24:28 +08:00
|
|
|
spin_unlock(&fsn_mark->lock);
|
|
|
|
|
|
|
|
return mask & oldmask;
|
|
|
|
}
|
|
|
|
|
2018-06-23 22:54:51 +08:00
|
|
|
static int fanotify_remove_mark(struct fsnotify_group *group,
|
|
|
|
fsnotify_connp_t *connp, __u32 mask,
|
2020-07-16 16:42:14 +08:00
|
|
|
unsigned int flags, __u32 umask)
|
2009-12-18 10:24:28 +08:00
|
|
|
{
|
|
|
|
struct fsnotify_mark *fsn_mark = NULL;
|
2009-12-18 10:24:28 +08:00
|
|
|
__u32 removed;
|
2011-06-14 23:29:49 +08:00
|
|
|
int destroy_mark;
|
2009-12-18 10:24:28 +08:00
|
|
|
|
2013-07-09 06:59:42 +08:00
|
|
|
mutex_lock(&group->mark_mutex);
|
2018-06-23 22:54:51 +08:00
|
|
|
fsn_mark = fsnotify_find_mark(connp, group);
|
2013-07-09 06:59:42 +08:00
|
|
|
if (!fsn_mark) {
|
|
|
|
mutex_unlock(&group->mark_mutex);
|
2009-12-18 10:24:29 +08:00
|
|
|
return -ENOENT;
|
2013-07-09 06:59:42 +08:00
|
|
|
}
|
2009-12-18 10:24:28 +08:00
|
|
|
|
2011-06-14 23:29:49 +08:00
|
|
|
removed = fanotify_mark_remove_from_mask(fsn_mark, mask, flags,
|
2020-07-16 16:42:14 +08:00
|
|
|
umask, &destroy_mark);
|
2018-06-23 22:54:50 +08:00
|
|
|
if (removed & fsnotify_conn_mask(fsn_mark->connector))
|
|
|
|
fsnotify_recalc_mask(fsn_mark->connector);
|
2011-06-14 23:29:49 +08:00
|
|
|
if (destroy_mark)
|
2015-09-05 06:43:12 +08:00
|
|
|
fsnotify_detach_mark(fsn_mark);
|
2013-07-09 06:59:42 +08:00
|
|
|
mutex_unlock(&group->mark_mutex);
|
2015-09-05 06:43:12 +08:00
|
|
|
if (destroy_mark)
|
|
|
|
fsnotify_free_mark(fsn_mark);
|
2011-06-14 23:29:49 +08:00
|
|
|
|
2018-06-23 22:54:51 +08:00
|
|
|
/* matches the fsnotify_find_mark() */
|
2009-12-18 10:24:29 +08:00
|
|
|
fsnotify_put_mark(fsn_mark);
|
|
|
|
return 0;
|
|
|
|
}
|
2009-12-18 10:24:26 +08:00
|
|
|
|
2018-06-23 22:54:51 +08:00
|
|
|
static int fanotify_remove_vfsmount_mark(struct fsnotify_group *group,
|
|
|
|
struct vfsmount *mnt, __u32 mask,
|
2020-07-16 16:42:14 +08:00
|
|
|
unsigned int flags, __u32 umask)
|
2018-06-23 22:54:51 +08:00
|
|
|
{
|
|
|
|
return fanotify_remove_mark(group, &real_mount(mnt)->mnt_fsnotify_marks,
|
2020-07-16 16:42:14 +08:00
|
|
|
mask, flags, umask);
|
2018-06-23 22:54:51 +08:00
|
|
|
}
|
|
|
|
|
2018-09-01 15:41:13 +08:00
|
|
|
static int fanotify_remove_sb_mark(struct fsnotify_group *group,
|
2020-07-16 16:42:14 +08:00
|
|
|
struct super_block *sb, __u32 mask,
|
|
|
|
unsigned int flags, __u32 umask)
|
2018-09-01 15:41:13 +08:00
|
|
|
{
|
2020-07-16 16:42:14 +08:00
|
|
|
return fanotify_remove_mark(group, &sb->s_fsnotify_marks, mask,
|
|
|
|
flags, umask);
|
2018-09-01 15:41:13 +08:00
|
|
|
}
|
|
|
|
|
2009-12-18 10:24:29 +08:00
|
|
|
static int fanotify_remove_inode_mark(struct fsnotify_group *group,
|
2009-12-18 10:24:33 +08:00
|
|
|
struct inode *inode, __u32 mask,
|
2020-07-16 16:42:14 +08:00
|
|
|
unsigned int flags, __u32 umask)
|
2009-12-18 10:24:29 +08:00
|
|
|
{
|
2018-06-23 22:54:51 +08:00
|
|
|
return fanotify_remove_mark(group, &inode->i_fsnotify_marks, mask,
|
2020-07-16 16:42:14 +08:00
|
|
|
flags, umask);
|
2009-12-18 10:24:26 +08:00
|
|
|
}
|
|
|
|
|
2009-12-18 10:24:33 +08:00
|
|
|
static __u32 fanotify_mark_add_to_mask(struct fsnotify_mark *fsn_mark,
|
|
|
|
__u32 mask,
|
|
|
|
unsigned int flags)
|
2009-12-18 10:24:28 +08:00
|
|
|
{
|
2010-10-29 05:21:59 +08:00
|
|
|
__u32 oldmask = -1;
|
2009-12-18 10:24:28 +08:00
|
|
|
|
|
|
|
spin_lock(&fsn_mark->lock);
|
2009-12-18 10:24:33 +08:00
|
|
|
if (!(flags & FAN_MARK_IGNORED_MASK)) {
|
|
|
|
oldmask = fsn_mark->mask;
|
2018-10-04 05:25:34 +08:00
|
|
|
fsn_mark->mask |= mask;
|
2009-12-18 10:24:33 +08:00
|
|
|
} else {
|
2018-10-04 05:25:34 +08:00
|
|
|
fsn_mark->ignored_mask |= mask;
|
2009-12-18 10:24:33 +08:00
|
|
|
if (flags & FAN_MARK_IGNORED_SURV_MODIFY)
|
|
|
|
fsn_mark->flags |= FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY;
|
2009-12-18 10:24:33 +08:00
|
|
|
}
|
2009-12-18 10:24:28 +08:00
|
|
|
spin_unlock(&fsn_mark->lock);
|
|
|
|
|
|
|
|
return mask & ~oldmask;
|
|
|
|
}
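The value returned by fanotify_mark_add_to_mask() above is the set of event bits that were not already present in the mark's mask. When the ignored mask is updated instead, oldmask keeps its all-ones initial value, so no bits are reported as new. The following is a minimal userspace sketch of that bit arithmetic only. The names `demo_mark`, `DEMO_IGNORED_MASK` and `demo_add_to_mask` are hypothetical stand-ins rather than kernel symbols, and the spinlock is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for FAN_MARK_IGNORED_MASK. */
#define DEMO_IGNORED_MASK 0x1u

/* Hypothetical stand-in for the two mask fields of a mark. */
struct demo_mark {
	uint32_t mask;          /* events the mark listens for */
	uint32_t ignored_mask;  /* events the mark suppresses */
};

/* Returns the bits of 'mask' that were newly added to mark->mask. */
static uint32_t demo_add_to_mask(struct demo_mark *mark, uint32_t mask,
				 unsigned int flags)
{
	uint32_t oldmask = UINT32_MAX; /* all ones: nothing counts as new */

	assert(mark != NULL);

	if (!(flags & DEMO_IGNORED_MASK)) {
		oldmask = mark->mask;
		mark->mask |= mask;
	} else {
		mark->ignored_mask |= mask;
	}

	return mask & ~oldmask;
}
```

A caller can compare the returned bits with the mask already accumulated on the object and skip the recalculation when nothing new was added, which is the shape of the check in fanotify_add_mark().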
|
|
|
|
|
2013-07-09 06:59:43 +08:00
|
|
|
static struct fsnotify_mark *fanotify_add_new_mark(struct fsnotify_group *group,
|
2018-06-23 22:54:48 +08:00
|
|
|
fsnotify_connp_t *connp,
|
2019-01-11 01:04:37 +08:00
|
|
|
unsigned int type,
|
|
|
|
__kernel_fsid_t *fsid)
|
2013-07-09 06:59:43 +08:00
|
|
|
{
|
2021-03-04 19:29:20 +08:00
|
|
|
struct ucounts *ucounts = group->fanotify_data.ucounts;
|
2013-07-09 06:59:43 +08:00
|
|
|
struct fsnotify_mark *mark;
|
|
|
|
int ret;
|
|
|
|
|
2021-03-04 19:29:20 +08:00
|
|
|
/*
|
|
|
|
* Enforce per-user mark limits in all containing user ns.
|
|
|
|
* A group with FAN_UNLIMITED_MARKS does not contribute to mark count
|
|
|
|
* in the limited groups account.
|
|
|
|
*/
|
|
|
|
if (!FAN_GROUP_FLAG(group, FAN_UNLIMITED_MARKS) &&
|
|
|
|
!inc_ucount(ucounts->ns, ucounts->uid, UCOUNT_FANOTIFY_MARKS))
|
2013-07-09 06:59:43 +08:00
|
|
|
return ERR_PTR(-ENOSPC);
|
|
|
|
|
|
|
|
mark = kmem_cache_alloc(fanotify_mark_cache, GFP_KERNEL);
|
2021-03-04 19:29:20 +08:00
|
|
|
if (!mark) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out_dec_ucounts;
|
|
|
|
}
|
2013-07-09 06:59:43 +08:00
|
|
|
|
2016-12-22 01:06:12 +08:00
|
|
|
fsnotify_init_mark(mark, group);
|
2019-01-11 01:04:37 +08:00
|
|
|
ret = fsnotify_add_mark_locked(mark, connp, type, 0, fsid);
|
2013-07-09 06:59:43 +08:00
|
|
|
if (ret) {
|
|
|
|
fsnotify_put_mark(mark);
|
2021-03-04 19:29:20 +08:00
|
|
|
goto out_dec_ucounts;
|
2013-07-09 06:59:43 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
return mark;
|
2021-03-04 19:29:20 +08:00
|
|
|
|
|
|
|
out_dec_ucounts:
|
|
|
|
if (!FAN_GROUP_FLAG(group, FAN_UNLIMITED_MARKS))
|
|
|
|
dec_ucount(ucounts, UCOUNT_FANOTIFY_MARKS);
|
|
|
|
return ERR_PTR(ret);
|
2013-07-09 06:59:43 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
|
2018-06-23 22:54:51 +08:00
|
|
|
static int fanotify_add_mark(struct fsnotify_group *group,
|
|
|
|
fsnotify_connp_t *connp, unsigned int type,
|
2019-01-11 01:04:37 +08:00
|
|
|
__u32 mask, unsigned int flags,
|
|
|
|
__kernel_fsid_t *fsid)
|
2009-12-18 10:24:26 +08:00
|
|
|
{
|
|
|
|
struct fsnotify_mark *fsn_mark;
|
2009-12-18 10:24:28 +08:00
|
|
|
__u32 added;
|
2009-12-18 10:24:26 +08:00
|
|
|
|
2013-07-09 06:59:42 +08:00
|
|
|
mutex_lock(&group->mark_mutex);
|
2018-06-23 22:54:48 +08:00
|
|
|
fsn_mark = fsnotify_find_mark(connp, group);
|
2009-12-18 10:24:28 +08:00
|
|
|
if (!fsn_mark) {
|
2019-01-11 01:04:37 +08:00
|
|
|
fsn_mark = fanotify_add_new_mark(group, connp, type, fsid);
|
2013-07-09 06:59:43 +08:00
|
|
|
if (IS_ERR(fsn_mark)) {
|
2013-07-09 06:59:42 +08:00
|
|
|
mutex_unlock(&group->mark_mutex);
|
2013-07-09 06:59:43 +08:00
|
|
|
return PTR_ERR(fsn_mark);
|
2013-07-09 06:59:42 +08:00
|
|
|
}
|
2009-12-18 10:24:28 +08:00
|
|
|
}
|
2009-12-18 10:24:33 +08:00
|
|
|
added = fanotify_mark_add_to_mask(fsn_mark, mask, flags);
|
2018-06-23 22:54:50 +08:00
|
|
|
if (added & ~fsnotify_conn_mask(fsn_mark->connector))
|
|
|
|
fsnotify_recalc_mask(fsn_mark->connector);
|
2016-12-14 20:53:46 +08:00
|
|
|
mutex_unlock(&group->mark_mutex);
|
2013-07-09 06:59:43 +08:00
|
|
|
|
2010-11-10 01:18:16 +08:00
|
|
|
fsnotify_put_mark(fsn_mark);
|
2013-07-09 06:59:43 +08:00
|
|
|
return 0;
|
2009-12-18 10:24:28 +08:00
|
|
|
}
|
|
|
|
|
2018-06-23 22:54:51 +08:00
|
|
|
static int fanotify_add_vfsmount_mark(struct fsnotify_group *group,
|
|
|
|
struct vfsmount *mnt, __u32 mask,
|
2019-01-11 01:04:37 +08:00
|
|
|
unsigned int flags, __kernel_fsid_t *fsid)
|
2018-06-23 22:54:51 +08:00
|
|
|
{
|
|
|
|
return fanotify_add_mark(group, &real_mount(mnt)->mnt_fsnotify_marks,
|
2019-01-11 01:04:37 +08:00
|
|
|
FSNOTIFY_OBJ_TYPE_VFSMOUNT, mask, flags, fsid);
|
2018-06-23 22:54:51 +08:00
|
|
|
}
|
|
|
|
|
2018-09-01 15:41:13 +08:00
|
|
|
static int fanotify_add_sb_mark(struct fsnotify_group *group,
|
2019-01-11 01:04:37 +08:00
|
|
|
struct super_block *sb, __u32 mask,
|
|
|
|
unsigned int flags, __kernel_fsid_t *fsid)
|
2018-09-01 15:41:13 +08:00
|
|
|
{
|
|
|
|
return fanotify_add_mark(group, &sb->s_fsnotify_marks,
|
2019-01-11 01:04:37 +08:00
|
|
|
FSNOTIFY_OBJ_TYPE_SB, mask, flags, fsid);
|
2018-09-01 15:41:13 +08:00
|
|
|
}
|
|
|
|
|
2009-12-18 10:24:28 +08:00
|
|
|
static int fanotify_add_inode_mark(struct fsnotify_group *group,
|
2009-12-18 10:24:33 +08:00
|
|
|
struct inode *inode, __u32 mask,
|
2019-01-11 01:04:37 +08:00
|
|
|
unsigned int flags, __kernel_fsid_t *fsid)
|
2009-12-18 10:24:28 +08:00
|
|
|
{
|
|
|
|
pr_debug("%s: group=%p inode=%p\n", __func__, group, inode);
|
2009-12-18 10:24:26 +08:00
|
|
|
|
2010-10-29 05:21:57 +08:00
|
|
|
/*
|
|
|
|
* If some other task has this inode open for write we should not add
|
|
|
|
* an ignored mark, unless that ignored mark is supposed to survive
|
|
|
|
* modification changes anyway.
|
|
|
|
*/
|
|
|
|
if ((flags & FAN_MARK_IGNORED_MASK) &&
|
|
|
|
!(flags & FAN_MARK_IGNORED_SURV_MODIFY) &&
|
2018-12-11 16:27:23 +08:00
|
|
|
inode_is_open_for_write(inode))
|
2010-10-29 05:21:57 +08:00
|
|
|
return 0;
|
|
|
|
|
2018-06-23 22:54:51 +08:00
|
|
|
return fanotify_add_mark(group, &inode->i_fsnotify_marks,
|
2019-01-11 01:04:37 +08:00
|
|
|
FSNOTIFY_OBJ_TYPE_INODE, mask, flags, fsid);
|
2009-12-18 10:24:28 +08:00
|
|
|
}
|
2009-12-18 10:24:26 +08:00
|
|
|
|
2020-07-08 19:11:42 +08:00
|
|
|
static struct fsnotify_event *fanotify_alloc_overflow_event(void)
|
|
|
|
{
|
|
|
|
struct fanotify_event *oevent;
|
|
|
|
|
|
|
|
oevent = kmalloc(sizeof(*oevent), GFP_KERNEL_ACCOUNT);
|
|
|
|
if (!oevent)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
fanotify_init_event(oevent, 0, FS_Q_OVERFLOW);
|
|
|
|
oevent->type = FANOTIFY_EVENT_TYPE_OVERFLOW;
|
|
|
|
|
|
|
|
return &oevent->fse;
|
|
|
|
}
|
|
|
|
|
2021-03-04 18:48:25 +08:00
|
|
|
static struct hlist_head *fanotify_alloc_merge_hash(void)
|
|
|
|
{
|
|
|
|
struct hlist_head *hash;
|
|
|
|
|
|
|
|
hash = kmalloc(sizeof(struct hlist_head) << FANOTIFY_HTABLE_BITS,
|
|
|
|
GFP_KERNEL_ACCOUNT);
|
|
|
|
if (!hash)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
__hash_init(hash, FANOTIFY_HTABLE_SIZE);
|
|
|
|
|
|
|
|
return hash;
|
|
|
|
}
|
|
|
|
|
2009-12-18 10:24:26 +08:00
|
|
|
/* fanotify syscalls */
|
2010-05-27 21:41:40 +08:00
|
|
|
SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
|
2009-12-18 10:24:25 +08:00
|
|
|
{
|
2009-12-18 10:24:26 +08:00
|
|
|
struct fsnotify_group *group;
|
|
|
|
int f_flags, fd;
|
2020-07-16 16:42:26 +08:00
|
|
|
unsigned int fid_mode = flags & FANOTIFY_FID_BITS;
|
|
|
|
unsigned int class = flags & FANOTIFY_CLASS_BITS;
|
2021-05-24 21:53:21 +08:00
|
|
|
unsigned int internal_flags = 0;
|
2009-12-18 10:24:26 +08:00
|
|
|
|
2018-09-22 02:20:30 +08:00
|
|
|
pr_debug("%s: flags=%x event_f_flags=%x\n",
|
|
|
|
__func__, flags, event_f_flags);
|
2009-12-18 10:24:26 +08:00
|
|
|
|
2021-03-04 19:29:21 +08:00
|
|
|
if (!capable(CAP_SYS_ADMIN)) {
|
|
|
|
/*
|
|
|
|
* An unprivileged user can set up an fanotify group with
|
|
|
|
* limited functionality - an unprivileged group is limited to
|
|
|
|
* notification events with file handles and it cannot use
|
|
|
|
* unlimited queue/marks.
|
|
|
|
*/
|
|
|
|
if ((flags & FANOTIFY_ADMIN_INIT_FLAGS) || !fid_mode)
|
|
|
|
return -EPERM;
|
2021-05-24 21:53:21 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Setting the internal flag FANOTIFY_UNPRIV on the group
|
|
|
|
* prevents setting mount/filesystem marks on this group and
|
|
|
|
* prevents reporting pid and open fd in events.
|
|
|
|
*/
|
|
|
|
internal_flags |= FANOTIFY_UNPRIV;
|
2021-03-04 19:29:21 +08:00
|
|
|
}
|
2009-12-18 10:24:26 +08:00
|
|
|
|
2017-10-03 08:21:39 +08:00
|
|
|
#ifdef CONFIG_AUDITSYSCALL
|
2018-10-04 05:25:35 +08:00
|
|
|
if (flags & ~(FANOTIFY_INIT_FLAGS | FAN_ENABLE_AUDIT))
|
2017-10-03 08:21:39 +08:00
|
|
|
#else
|
2018-10-04 05:25:35 +08:00
|
|
|
if (flags & ~FANOTIFY_INIT_FLAGS)
|
2017-10-03 08:21:39 +08:00
|
|
|
#endif
|
2009-12-18 10:24:26 +08:00
|
|
|
return -EINVAL;
|
|
|
|
|
fanotify: check file flags passed in fanotify_init
Without this patch fanotify_init does not validate the value passed in
event_f_flags.
When a fanotify event is read from the fanotify file descriptor a new
file descriptor is created where file.f_flags = event_f_flags.
Internal and external open flags are stored together in field f_flags of
struct file. Hence, an application might create file descriptors with
internal flags like FMODE_EXEC, FMODE_NOCMTIME set.
Jan Kara and Eric Paris both agreed that this is a bug and the value of
event_f_flags should be checked:
https://lkml.org/lkml/2014/4/29/522
https://lkml.org/lkml/2014/4/29/539
This updated patch version considers the comments by Michael Kerrisk in
https://lkml.org/lkml/2014/5/4/10
With the patch the value of event_f_flags is checked.
When specifying an invalid value error EINVAL is returned.
Internal flags are disallowed.
File creation flags are disallowed:
O_CREAT, O_DIRECTORY, O_EXCL, O_NOCTTY, O_NOFOLLOW, O_TRUNC, and O_TTY_INIT.
Flags which do not make sense with fanotify are disallowed:
__O_TMPFILE, O_PATH, FASYNC, and O_DIRECT.
This leaves us with the following allowed values:
O_RDONLY, O_WRONLY, O_RDWR are basic functionality. They are stored in the
bits given by O_ACCMODE.
O_APPEND is working as expected. The value might be useful in a logging
application which appends the current status each time the log is opened.
O_LARGEFILE is needed for files exceeding 4 GB on 32-bit systems.
O_NONBLOCK may be useful when monitoring slow devices like tapes.
O_NDELAY is equal to O_NONBLOCK except for platform parisc.
To avoid code breaking on parisc either both flags should be
allowed or none. The patch allows both.
__O_SYNC and O_DSYNC may be used to avoid data loss on power disruption.
O_NOATIME may be useful to reduce disk activity.
O_CLOEXEC may be useful, if separate processes shall be used to scan files.
Once this patch is accepted, the fanotify_init.2 manpage has to be updated.
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:05:44 +08:00
|
|
|
if (event_f_flags & ~FANOTIFY_INIT_ALL_EVENT_F_BITS)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
switch (event_f_flags & O_ACCMODE) {
|
|
|
|
case O_RDONLY:
|
|
|
|
case O_RDWR:
|
|
|
|
case O_WRONLY:
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2020-07-16 16:42:26 +08:00
|
|
|
if (fid_mode && class != FAN_CLASS_NOTIF)
|
2019-01-11 01:04:36 +08:00
|
|
|
return -EINVAL;
|
|
|
|
|
2020-07-16 16:42:28 +08:00
|
|
|
/*
|
|
|
|
* Child name is reported with parent fid so requires dir fid.
|
2020-07-16 16:42:30 +08:00
|
|
|
* We can report both child fid and dir fid with or without name.
|
2020-07-16 16:42:28 +08:00
|
|
|
*/
|
2020-07-16 16:42:30 +08:00
|
|
|
if ((fid_mode & FAN_REPORT_NAME) && !(fid_mode & FAN_REPORT_DIR_FID))
|
2020-07-16 16:42:26 +08:00
|
|
|
return -EINVAL;
|
|
|
|
|
2009-12-18 10:24:34 +08:00
|
|
|
f_flags = O_RDWR | FMODE_NONOTIFY;
|
2009-12-18 10:24:26 +08:00
|
|
|
if (flags & FAN_CLOEXEC)
|
|
|
|
f_flags |= O_CLOEXEC;
|
|
|
|
if (flags & FAN_NONBLOCK)
|
|
|
|
f_flags |= O_NONBLOCK;
|
|
|
|
|
|
|
|
/* fsnotify_alloc_user_group takes a ref. Dropped in fanotify_release */
|
2020-12-20 12:46:08 +08:00
|
|
|
group = fsnotify_alloc_user_group(&fanotify_fsnotify_ops);
|
2010-11-24 12:48:26 +08:00
|
|
|
if (IS_ERR(group)) {
|
2009-12-18 10:24:26 +08:00
|
|
|
return PTR_ERR(group);
|
2010-11-24 12:48:26 +08:00
|
|
|
}
|
2009-12-18 10:24:26 +08:00
|
|
|
|
2021-03-04 19:29:20 +08:00
|
|
|
/* Enforce groups limits per user in all containing user ns */
|
|
|
|
group->fanotify_data.ucounts = inc_ucount(current_user_ns(),
|
|
|
|
current_euid(),
|
|
|
|
UCOUNT_FANOTIFY_GROUPS);
|
|
|
|
if (!group->fanotify_data.ucounts) {
|
|
|
|
fd = -EMFILE;
|
|
|
|
goto out_destroy_group;
|
|
|
|
}
|
|
|
|
|
2021-05-24 21:53:21 +08:00
|
|
|
group->fanotify_data.flags = flags | internal_flags;
|
fs: fsnotify: account fsnotify metadata to kmemcg
Patch series "Directed kmem charging", v8.
The Linux kernel's memory cgroup allows limiting the memory usage of the
jobs running on the system to provide isolation between the jobs. All
the kernel memory allocated in the context of the job and marked with
__GFP_ACCOUNT will also be included in the memory usage and be limited
by the job's limit.
The kernel memory can only be charged to the memcg of the process in
whose context kernel memory was allocated. However there are cases
where the allocated kernel memory should be charged to a memcg
different from the current process's memcg. This patch series
contains two such concrete use-cases i.e. fsnotify and buffer_head.
The fsnotify event objects can consume a lot of system memory for large
or unlimited queues if the listener is absent or slow. The events
are allocated in the context of the event producer. However they should
be charged to the event consumer. Similarly the buffer_head objects can
be allocated in a memcg different from the memcg of the page for which
buffer_head objects are being allocated.
To solve this issue, this patch series introduces a mechanism to charge
kernel memory to a given memcg. In case of fsnotify events, the memcg
of the consumer can be used for charging and for buffer_head, the memcg
of the page can be charged. For directed charging, the caller can use
the scope API memalloc_[un]use_memcg() to specify the memcg to charge
for all the __GFP_ACCOUNT allocations within the scope.
This patch (of 2):
A lot of memory can be consumed by the events generated for huge or
unlimited queues if the listener is absent or slow. This can cause
system level memory pressure or OOMs. So, it's better to account the
fsnotify kmem caches to the memcg of the listener.
However the listener can be in a different memcg than the memcg of the
producer and these allocations happen in the context of the event
producer. This patch introduces a remote memcg charging API which the
producer can use to charge the allocations to the memcg of the listener.
There are seven fsnotify kmem caches and among them allocations from
dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
inotify_inode_mark_cachep happen in the context of a syscall from the
listener. So, SLAB_ACCOUNT is enough for these caches.
The objects from fsnotify_mark_connector_cachep are not accounted as
they are small compared to the notification mark or events and it is
unclear whom to account the connector to since it is shared by all events
attached to the inode.
The allocations from the event caches happen in the context of the event
producer. For such caches we will need to remote charge the allocations
to the listener's memcg. Thus we save the memcg reference in the
fsnotify_group structure of the listener.
This patch also moves the members of fsnotify_group to fill the holes, so
that the size stays the same, at least for 64-bit builds, even with the
additional member.
[shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-18 06:46:39 +08:00
|
|
|
group->memcg = get_mem_cgroup_from_mm(current->mm);
|
2010-10-29 05:21:58 +08:00
|
|
|
|
2021-03-04 18:48:25 +08:00
|
|
|
group->fanotify_data.merge_hash = fanotify_alloc_merge_hash();
|
|
|
|
if (!group->fanotify_data.merge_hash) {
|
|
|
|
fd = -ENOMEM;
|
|
|
|
goto out_destroy_group;
|
|
|
|
}
|
|
|
|
|
2020-07-08 19:11:42 +08:00
|
|
|
group->overflow_event = fanotify_alloc_overflow_event();
|
|
|
|
if (unlikely(!group->overflow_event)) {
|
2014-02-22 02:14:11 +08:00
|
|
|
fd = -ENOMEM;
|
|
|
|
goto out_destroy_group;
|
|
|
|
}
|
|
|
|
|
2014-05-07 03:50:10 +08:00
|
|
|
if (force_o_largefile())
|
|
|
|
event_f_flags |= O_LARGEFILE;
|
2010-07-28 22:18:37 +08:00
|
|
|
group->fanotify_data.f_flags = event_f_flags;
|
2009-12-18 10:24:34 +08:00
|
|
|
init_waitqueue_head(&group->fanotify_data.access_waitq);
|
|
|
|
INIT_LIST_HEAD(&group->fanotify_data.access_list);
|
2020-07-16 16:42:26 +08:00
|
|
|
switch (class) {
|
2010-10-29 05:21:56 +08:00
|
|
|
case FAN_CLASS_NOTIF:
|
|
|
|
group->priority = FS_PRIO_0;
|
|
|
|
break;
|
|
|
|
case FAN_CLASS_CONTENT:
|
|
|
|
group->priority = FS_PRIO_1;
|
|
|
|
break;
|
|
|
|
case FAN_CLASS_PRE_CONTENT:
|
|
|
|
group->priority = FS_PRIO_2;
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
fd = -EINVAL;
|
2011-06-14 23:29:45 +08:00
|
|
|
goto out_destroy_group;
|
2010-10-29 05:21:56 +08:00
|
|
|
}
|
2009-12-18 10:24:34 +08:00
|
|
|
|
2010-10-29 05:21:57 +08:00
|
|
|
if (flags & FAN_UNLIMITED_QUEUE) {
|
|
|
|
fd = -EPERM;
|
|
|
|
if (!capable(CAP_SYS_ADMIN))
|
2011-06-14 23:29:45 +08:00
|
|
|
goto out_destroy_group;
|
2010-10-29 05:21:57 +08:00
|
|
|
group->max_events = UINT_MAX;
|
|
|
|
} else {
|
2021-03-04 19:29:20 +08:00
|
|
|
group->max_events = fanotify_max_queued_events;
|
2010-10-29 05:21:57 +08:00
|
|
|
}
|
2010-10-29 05:21:57 +08:00
|
|
|
|
2010-10-29 05:21:58 +08:00
|
|
|
if (flags & FAN_UNLIMITED_MARKS) {
|
|
|
|
fd = -EPERM;
|
|
|
|
if (!capable(CAP_SYS_ADMIN))
|
2011-06-14 23:29:45 +08:00
|
|
|
goto out_destroy_group;
|
2010-10-29 05:21:58 +08:00
|
|
|
}
|
2010-10-29 05:21:57 +08:00
|
|
|
|
2017-10-03 08:21:39 +08:00
|
|
|
if (flags & FAN_ENABLE_AUDIT) {
|
|
|
|
fd = -EPERM;
|
|
|
|
if (!capable(CAP_AUDIT_WRITE))
|
|
|
|
goto out_destroy_group;
|
|
|
|
}
|
|
|
|
|
2009-12-18 10:24:26 +08:00
|
|
|
fd = anon_inode_getfd("[fanotify]", &fanotify_fops, group, f_flags);
|
|
|
|
if (fd < 0)
|
2011-06-14 23:29:45 +08:00
|
|
|
goto out_destroy_group;
|
2009-12-18 10:24:26 +08:00
|
|
|
|
|
|
|
return fd;
|
|
|
|
|
2011-06-14 23:29:45 +08:00
|
|
|
out_destroy_group:
|
|
|
|
fsnotify_destroy_group(group);
|
2009-12-18 10:24:26 +08:00
|
|
|
return fd;
|
2009-12-18 10:24:25 +08:00
|
|
|
}
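The event_f_flags validation in fanotify_init() above happens in two steps: any bit outside an allow-list is rejected, and then the access-mode field selected by O_ACCMODE must be exactly one of O_RDONLY, O_RDWR or O_WRONLY. The following is a minimal userspace sketch of that two-step check. `DEMO_ALLOWED_EVENT_F_BITS` and `demo_check_event_f_flags` are hypothetical names, and the allow-list is deliberately shorter than the kernel's FANOTIFY_INIT_ALL_EVENT_F_BITS, whose definition is not shown in this file.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

/*
 * Hypothetical allow-list standing in for FANOTIFY_INIT_ALL_EVENT_F_BITS.
 * The kernel's actual list contains more flags than these.
 */
#define DEMO_ALLOWED_EVENT_F_BITS (O_ACCMODE | O_APPEND | O_NONBLOCK)

/* Returns 0 if the flags are acceptable, -EINVAL otherwise. */
static int demo_check_event_f_flags(unsigned int event_f_flags)
{
	/* Step 1: reject any bit outside the allow-list. */
	if (event_f_flags & ~(unsigned int)DEMO_ALLOWED_EVENT_F_BITS)
		return -EINVAL;

	/* Step 2: the access mode must be one of the three valid values. */
	switch (event_f_flags & O_ACCMODE) {
	case O_RDONLY:
	case O_RDWR:
	case O_WRONLY:
		return 0;
	default:
		/* e.g. both access-mode bits set at once */
		return -EINVAL;
	}
}
```

The second step is needed because the access mode is an enumerated field rather than a set of independent flag bits, so a mask test alone would accept a value with both bits set.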
|
2009-12-18 10:24:26 +08:00
|
|
|
|
2019-01-11 01:04:36 +08:00
|
|
|
/* Check if filesystem can encode a unique fid */
|
2019-01-11 01:04:39 +08:00
|
|
|
static int fanotify_test_fid(struct path *path, __kernel_fsid_t *fsid)
|
2019-01-11 01:04:36 +08:00
|
|
|
{
|
2019-01-11 01:04:39 +08:00
|
|
|
__kernel_fsid_t root_fsid;
|
2019-01-11 01:04:36 +08:00
|
|
|
int err;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Make sure path is not in filesystem with zero fsid (e.g. tmpfs).
|
|
|
|
*/
|
2019-01-11 01:04:39 +08:00
|
|
|
err = vfs_get_fsid(path->dentry, fsid);
|
2019-01-11 01:04:36 +08:00
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
2019-01-11 01:04:39 +08:00
|
|
|
if (!fsid->val[0] && !fsid->val[1])
|
2019-01-11 01:04:36 +08:00
|
|
|
return -ENODEV;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Make sure path is not inside a filesystem subvolume (e.g. btrfs)
|
|
|
|
* which uses a different fsid than sb root.
|
|
|
|
*/
|
2019-01-11 01:04:39 +08:00
|
|
|
err = vfs_get_fsid(path->dentry->d_sb->s_root, &root_fsid);
|
2019-01-11 01:04:36 +08:00
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
2019-01-11 01:04:39 +08:00
|
|
|
if (root_fsid.val[0] != fsid->val[0] ||
|
|
|
|
root_fsid.val[1] != fsid->val[1])
|
2019-01-11 01:04:36 +08:00
|
|
|
return -EXDEV;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We need to make sure that the file system supports at least
|
|
|
|
* encoding a file handle so user can use name_to_handle_at() to
|
|
|
|
* compare fid returned with event to the file handle of watched
|
|
|
|
* objects. However, name_to_handle_at() requires that the
|
|
|
|
* filesystem also supports decoding file handles.
|
|
|
|
*/
|
|
|
|
if (!path->dentry->d_sb->s_export_op ||
|
|
|
|
!path->dentry->d_sb->s_export_op->fh_to_dentry)
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
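The two fsid checks in fanotify_test_fid() above can be read in isolation: a zero fsid cannot identify a filesystem (-ENODEV), and an fsid that differs from that of the superblock root indicates a subvolume (-EXDEV). The following is a minimal userspace sketch of just that comparison. `demo_fsid` and `demo_check_fsid` are hypothetical names standing in for `__kernel_fsid_t` and the kernel logic. The vfs_get_fsid() calls and the export-operations check are not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's __kernel_fsid_t. */
struct demo_fsid {
	int val[2];
};

/*
 * Mirrors the two fsid comparisons: a zero fsid is unusable, and an fsid
 * that differs from the root's is treated as belonging to a subvolume.
 */
static int demo_check_fsid(const struct demo_fsid *fsid,
			   const struct demo_fsid *root_fsid)
{
	assert(fsid != NULL && root_fsid != NULL);

	if (!fsid->val[0] && !fsid->val[1])
		return -ENODEV;

	if (root_fsid->val[0] != fsid->val[0] ||
	    root_fsid->val[1] != fsid->val[1])
		return -EXDEV;

	return 0;
}
```

The zero check comes first, so a filesystem whose path and root both report a zero fsid yields -ENODEV rather than passing the equality test.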
|
|
|
|
|
2019-05-15 22:28:34 +08:00
|
|
|
static int fanotify_events_supported(struct path *path, __u64 mask)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Some filesystems such as 'proc' acquire unusual locks when opening
|
|
|
|
* files. For them fanotify permission events have high chances of
|
|
|
|
* deadlocking the system - open done when reporting fanotify event
|
|
|
|
* blocks on this "unusual" lock while another process holding the lock
|
|
|
|
* waits for fanotify permission event to be answered. Just disallow
|
|
|
|
* permission events for such filesystems.
|
|
|
|
*/
|
|
|
|
if (mask & FANOTIFY_PERM_EVENTS &&
|
|
|
|
path->mnt->mnt_sb->s_type->fs_flags & FS_DISALLOW_NOTIFY_PERM)
|
|
|
|
return -EINVAL;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2018-03-17 22:06:11 +08:00
|
|
|
static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
|
|
|
|
int dfd, const char __user *pathname)
|
2009-12-18 10:24:26 +08:00
|
|
|
{
|
2009-12-18 10:24:29 +08:00
|
|
|
struct inode *inode = NULL;
|
|
|
|
struct vfsmount *mnt = NULL;
|
2009-12-18 10:24:26 +08:00
|
|
|
struct fsnotify_group *group;
|
2012-08-29 00:52:22 +08:00
|
|
|
struct fd f;
|
2009-12-18 10:24:26 +08:00
|
|
|
struct path path;
|
2019-01-11 01:04:39 +08:00
|
|
|
__kernel_fsid_t __fsid, *fsid = NULL;
|
2018-10-04 05:25:37 +08:00
|
|
|
u32 valid_mask = FANOTIFY_EVENTS | FANOTIFY_EVENT_FLAGS;
|
2018-10-04 05:25:35 +08:00
|
|
|
unsigned int mark_type = flags & FANOTIFY_MARK_TYPE_BITS;
|
2020-07-16 16:42:13 +08:00
|
|
|
bool ignored = flags & FAN_MARK_IGNORED_MASK;
|
2020-07-16 16:42:12 +08:00
|
|
|
unsigned int obj_type, fid_mode;
|
2020-07-16 16:42:15 +08:00
|
|
|
u32 umask = 0;
|
2012-08-29 00:52:22 +08:00
|
|
|
int ret;
|
2009-12-18 10:24:26 +08:00
|
|
|
|
|
|
|
pr_debug("%s: fanotify_fd=%d flags=%x dfd=%d pathname=%p mask=%llx\n",
|
|
|
|
__func__, fanotify_fd, flags, dfd, pathname, mask);
|
|
|
|
|
|
|
|
/* we only use the lower 32 bits as of right now. */
|
2021-03-25 16:37:43 +08:00
|
|
|
if (upper_32_bits(mask))
|
2009-12-18 10:24:26 +08:00
|
|
|
return -EINVAL;
|
|
|
|
|
2018-10-04 05:25:35 +08:00
|
|
|
if (flags & ~FANOTIFY_MARK_FLAGS)
|
2009-12-18 10:24:29 +08:00
|
|
|
return -EINVAL;
|
2018-09-01 15:41:13 +08:00
|
|
|
|
|
|
|
switch (mark_type) {
|
|
|
|
case FAN_MARK_INODE:
|
fanotify, inotify, dnotify, security: add security hook for fs notifications
As of now, setting watches on filesystem objects has, at most, applied a
check for read access to the inode, and in the case of fanotify, requires
CAP_SYS_ADMIN. No specific security hook or permission check has been
provided to control the setting of watches. Using any of inotify, dnotify,
or fanotify, it is possible to observe, not only write-like operations, but
even read access to a file. Modeling the watch as being merely a read from
the file is insufficient for the needs of SELinux. This is due to the fact
that read access should not necessarily imply access to information about
when another process reads from a file. Furthermore, fanotify watches grant
more power to an application in the form of permission events. While
notification events are solely unidirectional (i.e. they only pass
information to the receiving application), permission events are blocking.
Permission events make a request to the receiving application which will
then reply with a decision as to whether or not that action may be
completed. This causes the issue of the watching application having the
ability to exercise control over the triggering process. Without drawing a
distinction within the permission check, the ability to read would imply
the greater ability to control an application. Additionally, mount and
superblock watches apply to all files within the same mount or superblock.
Read access to one file should not necessarily imply the ability to watch
all files accessed within a given mount or superblock.
In order to solve these issues, a new LSM hook is implemented and has been
placed within the system calls for marking filesystem objects with inotify,
fanotify, and dnotify watches. These calls to the hook are placed at the
point at which the target path has been resolved and are provided with the
path struct, the mask of requested notification events, and the type of
object on which the mark is being set (inode, superblock, or mount). The
mask and obj_type have already been translated into common FS_* values
shared by the entirety of the fs notification infrastructure. The path
struct is passed rather than just the inode so that the mount is available,
particularly for mount watches. This also allows for use of the hook by
pathname-based security modules. However, since the hook is intended for
use even by inode based security modules, it is not placed under the
CONFIG_SECURITY_PATH conditional. Otherwise, the inode-based security
modules would need to enable all of the path hooks, even though they do not
use any of them.
This only provides a hook at the point of setting a watch, and presumes
that permission to set a particular watch implies the ability to receive
all notification about that object which match the mask. This is all that
is required for SELinux. If other security modules require additional hooks
or infrastructure to control delivery of notification, these can be added
by them. It does not make sense for us to propose hooks for which we have
no implementation. The understanding that all notifications received by the
requesting application are all strictly of a type for which the application
has been granted permission shows that this implementation is sufficient in
its coverage.
Security modules wishing to provide complete control over fanotify must
also implement a security_file_open hook that validates that the access
requested by the watching application is authorized. Fanotify has the issue
that it returns a file descriptor with the file mode specified during
fanotify_init() to the watching process on event. This is already covered
by the LSM security_file_open hook if the security module implements
checking of the requested file mode there. Otherwise, a watching process
can obtain escalated access to a file for which it has not been authorized.
The selinux_path_notify hook implementation works by adding five new file
permissions: watch, watch_mount, watch_sb, watch_reads, and watch_with_perm
(descriptions about which will follow), and one new filesystem permission:
watch (which is applied to superblock checks). The hook then decides which
subset of these permissions must be held by the requesting application
based on the contents of the provided mask and the obj_type. The
selinux_file_open hook already checks the requested file mode and therefore
ensures that a watching process cannot escalate its access through
fanotify.
The watch, watch_mount, and watch_sb permissions are the baseline
permissions for setting a watch on an object and each are a requirement for
any watch to be set on a file, mount, or superblock respectively. It should
be noted that having either of the other two permissions (watch_reads and
watch_with_perm) does not imply the watch, watch_mount, or watch_sb
permission. Superblock watches further require the filesystem watch
permission to the superblock. As there is no labeled object in view for
mounts, there is no specific check for mount watches beyond watch_mount to
the inode. Such a check could be added in the future, if a suitable labeled
object existed representing the mount.
The watch_reads permission is required to receive notifications from
read-exclusive events on filesystem objects. These events include accessing
a file for the purpose of reading and closing a file which has been opened
read-only. This distinction has been drawn in order to provide a direct
indication in the policy for this otherwise not obvious capability. Read
access to a file should not necessarily imply the ability to observe read
events on a file.
Finally, watch_with_perm only applies to fanotify masks since it is the
only way to set a mask which allows for the blocking, permission event.
This permission is needed for any watch which is of this type. Though
fanotify requires CAP_SYS_ADMIN, this is insufficient as it gives implicit
trust to root, which we do not do, and does not support least privilege.
Signed-off-by: Aaron Goidel <acgoide@tycho.nsa.gov>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-08-12 23:20:00 +08:00
|
|
|
obj_type = FSNOTIFY_OBJ_TYPE_INODE;
|
|
|
|
break;
|
2018-09-01 15:41:13 +08:00
|
|
|
case FAN_MARK_MOUNT:
|
fanotify, inotify, dnotify, security: add security hook for fs notifications
As of now, setting watches on filesystem objects has, at most, applied a
check for read access to the inode, and in the case of fanotify, requires
CAP_SYS_ADMIN. No specific security hook or permission check has been
provided to control the setting of watches. Using any of inotify, dnotify,
or fanotify, it is possible to observe, not only write-like operations, but
even read access to a file. Modeling the watch as being merely a read from
the file is insufficient for the needs of SELinux. This is due to the fact
that read access should not necessarily imply access to information about
when another process reads from a file. Furthermore, fanotify watches grant
more power to an application in the form of permission events. While
notification events are solely unidirectional (i.e. they only pass
information to the receiving application), permission events are blocking.
Permission events make a request to the receiving application which will
then reply with a decision as to whether or not that action may be
completed. This causes the issue of the watching application having the
ability to exercise control over the triggering process. Without drawing a
distinction within the permission check, the ability to read would imply
the greater ability to control an application. Additionally, mount and
superblock watches apply to all files within the same mount or superblock.
Read access to one file should not necessarily imply the ability to watch
all files accessed within a given mount or superblock.
In order to solve these issues, a new LSM hook is implemented and has been
placed within the system calls for marking filesystem objects with inotify,
fanotify, and dnotify watches. These calls to the hook are placed at the
point at which the target path has been resolved and are provided with the
path struct, the mask of requested notification events, and the type of
object on which the mark is being set (inode, superblock, or mount). The
mask and obj_type have already been translated into common FS_* values
shared by the entirety of the fs notification infrastructure. The path
struct is passed rather than just the inode so that the mount is available,
particularly for mount watches. This also allows for use of the hook by
pathname-based security modules. However, since the hook is intended for
use even by inode-based security modules, it is not placed under the
CONFIG_SECURITY_PATH conditional. Otherwise, the inode-based security
modules would need to enable all of the path hooks, even though they do not
use any of them.
This only provides a hook at the point of setting a watch, and presumes
that permission to set a particular watch implies the ability to receive
all notifications about that object which match the mask. This is all that
is required for SELinux. If other security modules require additional hooks
or infrastructure to control delivery of notification, these can be added
by them. It does not make sense for us to propose hooks for which we have
no implementation. The understanding that all notifications received by the
requesting application are all strictly of a type for which the application
has been granted permission shows that this implementation is sufficient in
its coverage.
Security modules wishing to provide complete control over fanotify must
also implement a security_file_open hook that validates that the access
requested by the watching application is authorized. Fanotify has the issue
that it returns a file descriptor with the file mode specified during
fanotify_init() to the watching process on event. This is already covered
by the LSM security_file_open hook if the security module implements
checking of the requested file mode there. Otherwise, a watching process
can obtain escalated access to a file for which it has not been authorized.
The selinux_path_notify hook implementation works by adding five new file
permissions: watch, watch_mount, watch_sb, watch_reads, and watch_with_perm
(descriptions about which will follow), and one new filesystem permission:
watch (which is applied to superblock checks). The hook then decides which
subset of these permissions must be held by the requesting application
based on the contents of the provided mask and the obj_type. The
selinux_file_open hook already checks the requested file mode and therefore
ensures that a watching process cannot escalate its access through
fanotify.
The watch, watch_mount, and watch_sb permissions are the baseline
permissions for setting a watch on an object and each is a requirement for
any watch to be set on a file, mount, or superblock respectively. It should
be noted that having either of the other two permissions (watch_reads and
watch_with_perm) does not imply the watch, watch_mount, or watch_sb
permission. Superblock watches further require the filesystem watch
permission to the superblock. As there is no labeled object in view for
mounts, there is no specific check for mount watches beyond watch_mount to
the inode. Such a check could be added in the future, if a suitable labeled
object existed representing the mount.
The watch_reads permission is required to receive notifications from
read-exclusive events on filesystem objects. These events include accessing
a file for the purpose of reading and closing a file which has been opened
read-only. This distinction has been drawn in order to provide a direct
indication in the policy for this otherwise not obvious capability. Read
access to a file should not necessarily imply the ability to observe read
events on a file.
Finally, watch_with_perm only applies to fanotify masks, since fanotify is
the only way to set a mask which allows for blocking permission events.
This permission is needed for any watch which is of this type. Though
fanotify requires CAP_SYS_ADMIN, this is insufficient as it gives implicit
trust to root, which we do not do, and does not support least privilege.
Signed-off-by: Aaron Goidel <acgoide@tycho.nsa.gov>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-08-12 23:20:00 +08:00
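The permission selection described in the commit message above can be sketched in user space as follows. This is a minimal illustration, not the SELinux implementation: the `EV_*` and `PERM_*` constants, the `obj_type` enum, and `required_watch_perms()` are hypothetical names with made-up values, and the additional filesystem watch check for superblocks is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's object types and event bits. */
enum obj_type { OBJ_INODE, OBJ_VFSMOUNT, OBJ_SB };

#define EV_ACCESS		0x00001u	/* file was read */
#define EV_MODIFY		0x00002u	/* file was written */
#define EV_CLOSE_NOWRITE	0x00010u	/* read-only file was closed */
#define EV_OPEN_PERM		0x10000u	/* blocking: open awaits a decision */

/* Hypothetical permission bits named after the commit message. */
#define PERM_WATCH		0x01u
#define PERM_WATCH_MOUNT	0x02u
#define PERM_WATCH_SB		0x04u
#define PERM_WATCH_READS	0x08u
#define PERM_WATCH_WITH_PERM	0x10u

/* Return the permissions a watcher must hold; 0 means unknown object type. */
static uint32_t required_watch_perms(enum obj_type type, uint32_t mask)
{
	uint32_t perms;

	/* Baseline permission depends only on the kind of object marked. */
	switch (type) {
	case OBJ_INODE:
		perms = PERM_WATCH;
		break;
	case OBJ_VFSMOUNT:
		perms = PERM_WATCH_MOUNT;
		break;
	case OBJ_SB:
		perms = PERM_WATCH_SB;
		break;
	default:
		return 0;
	}

	/* Blocking (permission) events need an extra permission. */
	if (mask & EV_OPEN_PERM)
		perms |= PERM_WATCH_WITH_PERM;

	/* Observing read-only activity needs an extra permission as well. */
	if (mask & (EV_ACCESS | EV_CLOSE_NOWRITE))
		perms |= PERM_WATCH_READS;

	assert(perms != 0);
	return perms;
}
```

The extra permissions are added to the baseline rather than replacing it, which mirrors the statement that holding watch_reads or watch_with_perm does not imply watch, watch_mount, or watch_sb.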
		obj_type = FSNOTIFY_OBJ_TYPE_VFSMOUNT;
		break;
2018-09-01 15:41:13 +08:00
	case FAN_MARK_FILESYSTEM:
2019-08-12 23:20:00 +08:00
		obj_type = FSNOTIFY_OBJ_TYPE_SB;
2018-09-01 15:41:13 +08:00
		break;
	default:
		return -EINVAL;
	}

2009-12-18 10:24:34 +08:00
	switch (flags & (FAN_MARK_ADD | FAN_MARK_REMOVE | FAN_MARK_FLUSH)) {
2020-08-24 06:36:59 +08:00
	case FAN_MARK_ADD:
2009-12-18 10:24:29 +08:00
	case FAN_MARK_REMOVE:
2010-11-23 01:46:33 +08:00
		if (!mask)
			return -EINVAL;
2014-06-05 07:05:43 +08:00
		break;
2009-12-18 10:24:34 +08:00
	case FAN_MARK_FLUSH:
2018-10-04 05:25:35 +08:00
		if (flags & ~(FANOTIFY_MARK_TYPE_BITS | FAN_MARK_FLUSH))
2014-06-05 07:05:43 +08:00
			return -EINVAL;
2009-12-18 10:24:29 +08:00
		break;
	default:
		return -EINVAL;
	}
2010-10-29 05:21:59 +08:00

2017-10-31 04:14:56 +08:00
	if (IS_ENABLED(CONFIG_FANOTIFY_ACCESS_PERMISSIONS))
2018-10-04 05:25:35 +08:00
		valid_mask |= FANOTIFY_PERM_EVENTS;
2017-10-31 04:14:56 +08:00

	if (mask & ~valid_mask)
2009-12-18 10:24:26 +08:00
		return -EINVAL;

2020-07-16 16:42:13 +08:00
	/* Event flags (ONDIR, ON_CHILD) are meaningless in ignored mask */
	if (ignored)
		mask &= ~FANOTIFY_EVENT_FLAGS;

2012-08-29 00:52:22 +08:00
	f = fdget(fanotify_fd);
	if (unlikely(!f.file))
2009-12-18 10:24:26 +08:00
		return -EBADF;

	/* verify that this is indeed an fanotify instance */
	ret = -EINVAL;
2012-08-29 00:52:22 +08:00
	if (unlikely(f.file->f_op != &fanotify_fops))
2009-12-18 10:24:26 +08:00
		goto fput_and_out;
2012-08-29 00:52:22 +08:00
	group = f.file->private_data;
2010-10-29 05:21:56 +08:00

2021-03-04 19:29:21 +08:00
	/*
2021-05-24 21:53:21 +08:00
	 * An unprivileged user is not allowed to setup mount nor filesystem
	 * marks. This also includes setting up such marks by a group that
	 * was initialized by an unprivileged user.
2021-03-04 19:29:21 +08:00
	 */
	ret = -EPERM;
2021-05-24 21:53:21 +08:00
	if ((!capable(CAP_SYS_ADMIN) ||
	     FAN_GROUP_FLAG(group, FANOTIFY_UNPRIV)) &&
2021-03-04 19:29:21 +08:00
	    mark_type != FAN_MARK_INODE)
		goto fput_and_out;

2010-10-29 05:21:56 +08:00
	/*
	 * group->priority == FS_PRIO_0 == FAN_CLASS_NOTIF. These are not
	 * allowed to set permissions events.
	 */
	ret = -EINVAL;
2018-10-04 05:25:35 +08:00
	if (mask & FANOTIFY_PERM_EVENTS &&
2010-10-29 05:21:56 +08:00
	    group->priority == FS_PRIO_0)
		goto fput_and_out;
2009-12-18 10:24:26 +08:00

2019-01-11 01:04:43 +08:00
	/*
	 * Events with data type inode do not carry enough information to report
	 * event->fd, so we do not allow setting a mask for inode events unless
	 * group supports reporting fid.
	 * inode events are not supported on a mount mark, because they do not
	 * carry enough information (i.e. path) to be filtered by mount point.
	 */
2020-07-16 16:42:12 +08:00
	fid_mode = FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS);
2019-01-11 01:04:43 +08:00
	if (mask & FANOTIFY_INODE_EVENTS &&
2020-07-16 16:42:12 +08:00
	    (!fid_mode || mark_type == FAN_MARK_MOUNT))
2019-01-11 01:04:43 +08:00
		goto fput_and_out;

2014-06-05 07:05:40 +08:00
	if (flags & FAN_MARK_FLUSH) {
		ret = 0;
2018-09-01 15:41:13 +08:00
		if (mark_type == FAN_MARK_MOUNT)
2014-06-05 07:05:40 +08:00
			fsnotify_clear_vfsmount_marks_by_group(group);
2018-09-01 15:41:13 +08:00
		else if (mark_type == FAN_MARK_FILESYSTEM)
			fsnotify_clear_sb_marks_by_group(group);
2014-06-05 07:05:40 +08:00
		else
			fsnotify_clear_inode_marks_by_group(group);
		goto fput_and_out;
	}

2019-08-12 23:20:00 +08:00
	ret = fanotify_find_path(dfd, pathname, &path, flags,
			(mask & ALL_FSNOTIFY_EVENTS), obj_type);
2009-12-18 10:24:26 +08:00
	if (ret)
		goto fput_and_out;

2019-05-15 22:28:34 +08:00
	if (flags & FAN_MARK_ADD) {
		ret = fanotify_events_supported(&path, mask);
		if (ret)
			goto path_put_and_out;
	}

2020-07-16 16:42:12 +08:00
	if (fid_mode) {
2019-01-11 01:04:39 +08:00
		ret = fanotify_test_fid(&path, &__fsid);
2019-01-11 01:04:36 +08:00
		if (ret)
			goto path_put_and_out;
2019-01-11 01:04:37 +08:00

2019-01-11 01:04:39 +08:00
		fsid = &__fsid;
2019-01-11 01:04:36 +08:00
	}

2009-12-18 10:24:26 +08:00
	/* inode held in place by reference to path; group by fget on fd */
2018-09-01 15:41:13 +08:00
	if (mark_type == FAN_MARK_INODE)
2009-12-18 10:24:29 +08:00
		inode = path.dentry->d_inode;
	else
		mnt = path.mnt;
2009-12-18 10:24:26 +08:00

2020-07-16 16:42:15 +08:00
	/* Mask out FAN_EVENT_ON_CHILD flag for sb/mount/non-dir marks */
	if (mnt || !S_ISDIR(inode->i_mode)) {
		mask &= ~FAN_EVENT_ON_CHILD;
		umask = FAN_EVENT_ON_CHILD;
2020-07-16 16:42:27 +08:00
		/*
		 * If group needs to report parent fid, register for getting
		 * events with parent/name info for non-directory.
		 */
		if ((fid_mode & FAN_REPORT_DIR_FID) &&
		    (flags & FAN_MARK_ADD) && !ignored)
			mask |= FAN_EVENT_ON_CHILD;
2020-07-16 16:42:15 +08:00
	}

2009-12-18 10:24:26 +08:00
	/* create/update an inode mark */
2014-06-05 07:05:40 +08:00
	switch (flags & (FAN_MARK_ADD | FAN_MARK_REMOVE)) {
2009-12-18 10:24:28 +08:00
	case FAN_MARK_ADD:
2018-09-01 15:41:13 +08:00
		if (mark_type == FAN_MARK_MOUNT)
2019-01-11 01:04:37 +08:00
			ret = fanotify_add_vfsmount_mark(group, mnt, mask,
							 flags, fsid);
2018-09-01 15:41:13 +08:00
		else if (mark_type == FAN_MARK_FILESYSTEM)
2019-01-11 01:04:37 +08:00
			ret = fanotify_add_sb_mark(group, mnt->mnt_sb, mask,
						   flags, fsid);
2009-12-18 10:24:29 +08:00
		else
2019-01-11 01:04:37 +08:00
			ret = fanotify_add_inode_mark(group, inode, mask,
						      flags, fsid);
2009-12-18 10:24:28 +08:00
		break;
	case FAN_MARK_REMOVE:
2018-09-01 15:41:13 +08:00
		if (mark_type == FAN_MARK_MOUNT)
2019-01-11 01:04:37 +08:00
			ret = fanotify_remove_vfsmount_mark(group, mnt, mask,
2020-07-16 16:42:15 +08:00
							    flags, umask);
2018-09-01 15:41:13 +08:00
		else if (mark_type == FAN_MARK_FILESYSTEM)
2019-01-11 01:04:37 +08:00
			ret = fanotify_remove_sb_mark(group, mnt->mnt_sb, mask,
2020-07-16 16:42:15 +08:00
						      flags, umask);
2009-12-18 10:24:29 +08:00
		else
2019-01-11 01:04:37 +08:00
			ret = fanotify_remove_inode_mark(group, inode, mask,
2020-07-16 16:42:15 +08:00
							 flags, umask);
2009-12-18 10:24:28 +08:00
		break;
	default:
		ret = -EINVAL;
	}
2009-12-18 10:24:26 +08:00

2019-01-11 01:04:36 +08:00
path_put_and_out:
2009-12-18 10:24:26 +08:00
	path_put(&path);
fput_and_out:
2012-08-29 00:52:22 +08:00
	fdput(f);
2009-12-18 10:24:26 +08:00
	return ret;
}

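The checks that do_fanotify_mark() applies to a request before it resolves the path (privilege, notification class, inode events) can be modelled as a pure function. The following is a minimal sketch under stated assumptions: `struct mark_request`, `enum mark_type`, and `validate_mark_request()` are hypothetical, and the boolean fields stand in for the kernel's capable(), FAN_GROUP_FLAG(), and mask tests rather than reproducing them.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the three mark types handled above. */
enum mark_type { MARK_INODE, MARK_MOUNT, MARK_FILESYSTEM };

/* Hypothetical summary of everything the checks look at. */
struct mark_request {
	bool caller_is_admin;		/* stands in for capable(CAP_SYS_ADMIN) */
	bool group_is_unpriv;		/* group was created by an unprivileged user */
	bool group_is_notif_class;	/* group priority is the notification class */
	bool group_reports_fid;		/* group was set up to report file ids */
	bool wants_perm_events;		/* mask contains permission events */
	bool wants_inode_events;	/* mask contains inode events */
	enum mark_type type;
};

/* Return 0 if the request may proceed, or a negative errno value. */
static int validate_mark_request(const struct mark_request *req)
{
	assert(req != NULL);

	/* Only privileged callers on privileged groups may mark mounts or filesystems. */
	if ((!req->caller_is_admin || req->group_is_unpriv) &&
	    req->type != MARK_INODE)
		return -EPERM;

	/* Notification-class groups may not ask for permission events. */
	if (req->wants_perm_events && req->group_is_notif_class)
		return -EINVAL;

	/* Inode events need fid reporting and are not allowed on mount marks. */
	if (req->wants_inode_events &&
	    (!req->group_reports_fid || req->type == MARK_MOUNT))
		return -EINVAL;

	return 0;
}
```

The order of the checks follows the function above, so a request that fails more than one check reports the privilege error first.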
2020-12-01 06:30:59 +08:00
#ifndef CONFIG_ARCH_SPLIT_ARG64
2018-03-17 22:06:11 +08:00
SYSCALL_DEFINE5(fanotify_mark, int, fanotify_fd, unsigned int, flags,
			      __u64, mask, int, dfd,
			      const char __user *, pathname)
{
	return do_fanotify_mark(fanotify_fd, flags, mask, dfd, pathname);
}
2020-12-01 06:30:59 +08:00
#endif
2018-03-17 22:06:11 +08:00

2020-12-01 06:30:59 +08:00
#if defined(CONFIG_ARCH_SPLIT_ARG64) || defined(CONFIG_COMPAT)
SYSCALL32_DEFINE6(fanotify_mark,
2013-03-06 09:10:59 +08:00
				int, fanotify_fd, unsigned int, flags,
2020-12-01 06:30:59 +08:00
				SC_ARG64(mask), int, dfd,
2013-03-06 09:10:59 +08:00
				const char __user *, pathname)
{
2020-12-01 06:30:59 +08:00
	return do_fanotify_mark(fanotify_fd, flags, SC_VAL64(__u64, mask),
				dfd, pathname);
2013-03-06 09:10:59 +08:00
}
#endif

2009-12-18 10:24:26 +08:00
/*
2011-03-01 22:06:02 +08:00
 * fanotify_user_setup - Our initialization function. Note that we cannot return
2009-12-18 10:24:26 +08:00
 * error because we have compiled-in VFS hooks. So an (unlikely) failure here
 * must result in panic().
 */
static int __init fanotify_user_setup(void)
{
2021-03-04 19:29:20 +08:00
	struct sysinfo si;
	int max_marks;

	si_meminfo(&si);
	/*
	 * Allow up to 1% of addressable memory to be accounted for per user
	 * marks limited to the range [8192, 1048576]. mount and sb marks are
	 * a lot cheaper than inode marks, but there is no reason for a user
	 * to have many of those, so calculate by the cost of inode marks.
	 */
	max_marks = (((si.totalram - si.totalhigh) / 100) << PAGE_SHIFT) /
		    INODE_MARK_COST;
	max_marks = clamp(max_marks, FANOTIFY_OLD_DEFAULT_MAX_MARKS,
				     FANOTIFY_DEFAULT_MAX_USER_MARKS);

2021-05-24 21:53:21 +08:00
	BUILD_BUG_ON(FANOTIFY_INIT_FLAGS & FANOTIFY_INTERNAL_GROUP_FLAGS);
2020-07-16 16:42:28 +08:00
	BUILD_BUG_ON(HWEIGHT32(FANOTIFY_INIT_FLAGS) != 10);
2018-10-04 05:25:37 +08:00
	BUILD_BUG_ON(HWEIGHT32(FANOTIFY_MARK_FLAGS) != 9);

fs: fsnotify: account fsnotify metadata to kmemcg
Patch series "Directed kmem charging", v8.
The Linux kernel's memory cgroup allows limiting the memory usage of the
jobs running on the system to provide isolation between the jobs. All
the kernel memory allocated in the context of the job and marked with
__GFP_ACCOUNT will also be included in the memory usage and be limited
by the job's limit.
The kernel memory can only be charged to the memcg of the process in
whose context kernel memory was allocated. However there are cases
where the allocated kernel memory should be charged to the memcg
different from the current process's memcg. This patch series
contains two such concrete use-cases i.e. fsnotify and buffer_head.
The fsnotify event objects can consume a lot of system memory for large
or unlimited queues if there is either no or slow listener. The events
are allocated in the context of the event producer. However they should
be charged to the event consumer. Similarly the buffer_head objects can
be allocated in a memcg different from the memcg of the page for which
buffer_head objects are being allocated.
To solve this issue, this patch series introduces a mechanism to charge
kernel memory to a given memcg. In case of fsnotify events, the memcg
of the consumer can be used for charging and for buffer_head, the memcg
of the page can be charged. For directed charging, the caller can use
the scope API memalloc_[un]use_memcg() to specify the memcg to charge
for all the __GFP_ACCOUNT allocations within the scope.
This patch (of 2):
A lot of memory can be consumed by the events generated for the huge or
unlimited queues if there is either no or slow listener. This can cause
system level memory pressure or OOMs. So, it's better to account the
fsnotify kmem caches to the memcg of the listener.
However the listener can be in a different memcg than the memcg of the
producer and these allocations happen in the context of the event
producer. This patch introduces a remote memcg charging API which the
producer can use to charge the allocations to the memcg of the listener.
There are seven fsnotify kmem caches and among them allocations from
dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
inotify_inode_mark_cachep happen in the context of a syscall from the
listener. So, SLAB_ACCOUNT is enough for these caches.
The objects from fsnotify_mark_connector_cachep are not accounted as
they are small compared to the notification mark or events and it is
unclear whom to account connector to since it is shared by all events
attached to the inode.
The allocations from the event caches happen in the context of the event
producer. For such caches we will need to remote charge the allocations
to the listener's memcg. Thus we save the memcg reference in the
fsnotify_group structure of the listener.
This patch also moves the members of fsnotify_group to fill existing holes,
so that the structure size stays the same, at least for 64-bit builds, even
with the additional member.
[shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-18 06:46:39 +08:00
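The directed charging idea in the commit message above can be illustrated with a small user-space model. This is only a sketch of the concept: `struct memcg`, `use_memcg()`, `unuse_memcg()`, `alloc_accounted()`, and `queue_event()` are hypothetical names, and plain malloc() stands in for a __GFP_ACCOUNT allocation. The kernel's memalloc_[un]use_memcg() scope API is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of a memory cgroup: just a name and a byte counter. */
struct memcg {
	const char *name;
	size_t charged;
};

/* The cgroup of the task that is currently running (the event producer). */
static struct memcg *current_task_memcg;
/* Optional override set for the duration of a directed-charging scope. */
static struct memcg *active_memcg;

static void use_memcg(struct memcg *target)
{
	active_memcg = target;
}

static void unuse_memcg(void)
{
	active_memcg = NULL;
}

/* Allocate and charge the override cgroup if one is set, else the current one. */
static void *alloc_accounted(size_t size)
{
	struct memcg *target = active_memcg ? active_memcg : current_task_memcg;
	void *p;

	assert(size > 0);
	p = malloc(size);
	if (p && target)
		target->charged += size;
	return p;
}

/* The producer allocates an event, but the listener's cgroup pays for it. */
static void *queue_event(struct memcg *listener, size_t size)
{
	void *event;

	use_memcg(listener);
	event = alloc_accounted(size);
	unuse_memcg();
	return event;
}
```

In this model an allocation made through `queue_event()` increases the listener's counter, while a direct `alloc_accounted()` call outside the scope is charged to `current_task_memcg`, which is the behaviour the commit message describes for event caches versus caches filled from the listener's own syscalls.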
	fanotify_mark_cache = KMEM_CACHE(fsnotify_mark,
					 SLAB_PANIC|SLAB_ACCOUNT);
2020-03-25 00:04:20 +08:00
	fanotify_fid_event_cachep = KMEM_CACHE(fanotify_fid_event,
					       SLAB_PANIC);
	fanotify_path_event_cachep = KMEM_CACHE(fanotify_path_event,
						SLAB_PANIC);
2017-10-31 04:14:56 +08:00
	if (IS_ENABLED(CONFIG_FANOTIFY_ACCESS_PERMISSIONS)) {
		fanotify_perm_event_cachep =
2019-01-11 01:04:32 +08:00
			KMEM_CACHE(fanotify_perm_event, SLAB_PANIC);
2017-10-31 04:14:56 +08:00
	}
2009-12-18 10:24:26 +08:00

2021-03-04 19:29:20 +08:00
	fanotify_max_queued_events = FANOTIFY_DEFAULT_MAX_EVENTS;
	init_user_ns.ucount_max[UCOUNT_FANOTIFY_GROUPS] =
					FANOTIFY_DEFAULT_MAX_GROUPS;
	init_user_ns.ucount_max[UCOUNT_FANOTIFY_MARKS] = max_marks;

2009-12-18 10:24:26 +08:00
	return 0;
2009-12-18 10:24:26 +08:00
}
2009-12-18 10:24:26 +08:00
device_initcall(fanotify_user_setup);
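The per-user mark limit computed in fanotify_user_setup() (1% of low memory divided by the cost of an inode mark, clamped to [8192, 1048576]) can be reproduced as a worked example. This is a sketch under assumptions: `estimate_max_marks()` is a hypothetical helper, and because the definition of INODE_MARK_COST is not shown in this section, the cost is passed in as a parameter.

```c
#include <assert.h>
#include <stdint.h>

#define MIN_USER_MARKS	8192u		/* lower bound quoted in the comment */
#define MAX_USER_MARKS	1048576u	/* upper bound quoted in the comment */

/*
 * lowmem_pages: pages of directly addressable memory (totalram - totalhigh)
 * page_shift:   log2 of the page size, e.g. 12 for 4 KiB pages
 * mark_cost:    assumed bytes accounted per inode mark (caller-supplied)
 */
static uint64_t estimate_max_marks(uint64_t lowmem_pages, unsigned int page_shift,
				   uint64_t mark_cost)
{
	uint64_t budget_bytes, marks;

	assert(mark_cost > 0);

	/* 1% of the pages, converted to bytes. */
	budget_bytes = (lowmem_pages / 100) << page_shift;
	marks = budget_bytes / mark_cost;

	/* Clamp to the documented range. */
	if (marks < MIN_USER_MARKS)
		marks = MIN_USER_MARKS;
	if (marks > MAX_USER_MARKS)
		marks = MAX_USER_MARKS;
	return marks;
}
```

With 4 KiB pages, 1048576 low-memory pages (4 GiB), and an assumed cost of 1280 bytes per mark, the budget is 10485 pages, or 42946560 bytes, which gives 33552 marks. A 256 MiB machine would fall below the lower bound and get 8192, and a 1 TiB machine would be capped at 1048576.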