License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier should be applied
to a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
The criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
   SPDX license identifier                               # files
   -----------------------------------------------------|-------
   GPL-2.0                                                 11139
and resulted in the first patch in this series.
If that file was under a */uapi/* path, it was "GPL-2.0 WITH
Linux-syscall-note"; otherwise it was "GPL-2.0". The results were:
   SPDX license identifier                               # files
   -----------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                           930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
   SPDX license identifier                               # files
   -----------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                           270
   GPL-2.0+ WITH Linux-syscall-note                          169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)        21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)        17
   LGPL-2.1+ WITH Linux-syscall-note                          15
   GPL-1.0+ WITH Linux-syscall-note                           14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)        5
   LGPL-2.0+ WITH Linux-syscall-note                           4
   LGPL-2.1 WITH Linux-syscall-note                            3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)                  3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)                 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were any new insights.
The Windriver scanner is based in part on an older version of FOSSology,
so they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 22:07:57 +08:00
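The commit message above notes that header and source files need different comment types for the SPDX tag. A minimal sketch of that choice follows, assuming the kernel convention of a `//` comment for `.c` files and a `/* */` comment for `.h` files. The helper name `spdx_line()` is hypothetical and is not the script that Thomas and Greg wrote.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format the SPDX line for a file, picking the
 * comment style from the file extension. A ".h" file gets a block
 * comment; anything else in this sketch gets a "//" comment.
 * Returns the number of characters written, or -1 if the buffer is
 * too small.
 */
static int spdx_line(const char *path, const char *license,
		     char *out, size_t out_len)
{
	const char *ext;
	int n;

	assert(path && license && out);

	ext = strrchr(path, '.');
	if (ext && strcmp(ext, ".h") == 0)
		n = snprintf(out, out_len,
			     "/* SPDX-License-Identifier: %s */", license);
	else
		n = snprintf(out, out_len,
			     "// SPDX-License-Identifier: %s", license);

	return (n < 0 || (size_t)n >= out_len) ? -1 : n;
}
```

Calling `spdx_line("fs/notify/fanotify/fanotify.c", "GPL-2.0", buf, sizeof(buf))` fills `buf` with the same first line that this file carries.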
// SPDX-License-Identifier: GPL-2.0
#include <linux/fanotify.h>
#include <linux/fdtable.h>
#include <linux/fsnotify_backend.h>
#include <linux/init.h>
#include <linux/jiffies.h>
#include <linux/kernel.h> /* UINT_MAX */
#include <linux/mount.h>
#include <linux/sched.h>
#include <linux/sched/user.h>
#include <linux/sched/signal.h>
#include <linux/types.h>
#include <linux/wait.h>
#include <linux/audit.h>
fs: fsnotify: account fsnotify metadata to kmemcg
Patch series "Directed kmem charging", v8.
The Linux kernel's memory cgroup allows limiting the memory usage of the
jobs running on the system to provide isolation between the jobs. All
the kernel memory allocated in the context of the job and marked with
__GFP_ACCOUNT will also be included in the memory usage and be limited
by the job's limit.
The kernel memory can only be charged to the memcg of the process in
whose context kernel memory was allocated. However there are cases
where the allocated kernel memory should be charged to a memcg
different from the current process's memcg. This patch series
contains two such concrete use-cases, i.e. fsnotify and buffer_head.
The fsnotify event objects can consume a lot of system memory for large
or unlimited queues if there is either no or slow listener. The events
are allocated in the context of the event producer. However they should
be charged to the event consumer. Similarly the buffer_head objects can
be allocated in a memcg different from the memcg of the page for which
buffer_head objects are being allocated.
To solve this issue, this patch series introduces a mechanism to charge
kernel memory to a given memcg. In case of fsnotify events, the memcg
of the consumer can be used for charging and for buffer_head, the memcg
of the page can be charged. For directed charging, the caller can use
the scope API memalloc_[un]use_memcg() to specify the memcg to charge
for all the __GFP_ACCOUNT allocations within the scope.
This patch (of 2):
A lot of memory can be consumed by the events generated for the huge or
unlimited queues if there is either no or slow listener. This can cause
system level memory pressure or OOMs. So, it's better to account the
fsnotify kmem caches to the memcg of the listener.
However the listener can be in a different memcg than the memcg of the
producer and these allocations happen in the context of the event
producer. This patch introduces a remote memcg charging API which the
producer can use to charge the allocations to the memcg of the listener.
There are seven fsnotify kmem caches and among them allocations from
dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
inotify_inode_mark_cachep happen in the context of a syscall from the
listener. So, SLAB_ACCOUNT is enough for these caches.
The objects from fsnotify_mark_connector_cachep are not accounted as
they are small compared to the notification mark or events and it is
unclear whom to account connector to since it is shared by all events
attached to the inode.
The allocations from the event caches happen in the context of the event
producer. For such caches we will need to remote charge the allocations
to the listener's memcg. Thus we save the memcg reference in the
fsnotify_group structure of the listener.
This patch has also moved the members of fsnotify_group to keep the size
same, at least for 64 bit build, even with additional member by filling
the holes.
[shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-18 06:46:39 +08:00
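The directed charging idea described in the commit message above can be pictured with a small userspace toy model. This is not kernel code: `toy_memcg`, `toy_task`, `toy_use_memcg()`, `toy_unuse_memcg()` and `toy_charge()` are hypothetical names that only mimic the shape of a scope API, in which accounted allocations made while a scope is open are charged to the chosen group instead of the running task's own group.

```c
#include <assert.h>
#include <stddef.h>

struct toy_memcg {
	size_t charged;			/* bytes accounted to this group */
};

/* Hypothetical per-task state: the task's own group plus an override. */
struct toy_task {
	struct toy_memcg *own;
	struct toy_memcg *active;	/* NULL when no scope is open */
};

/* Open a scope: later charges go to target until the scope is closed. */
static void toy_use_memcg(struct toy_task *t, struct toy_memcg *target)
{
	assert(t->active == NULL);	/* this toy does not nest scopes */
	t->active = target;
}

/* Close the scope: charges go back to the task's own group. */
static void toy_unuse_memcg(struct toy_task *t)
{
	t->active = NULL;
}

/* Charge to the scoped group if one is set, else to the task's own group. */
static void toy_charge(struct toy_task *t, size_t bytes)
{
	struct toy_memcg *cg = t->active ? t->active : t->own;

	cg->charged += bytes;
}
```

In the fsnotify case, the event producer would play the role of the task and the listener's group would be the scope target, so event memory lands on the listener's account.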
#include <linux/sched/mm.h>
#include <linux/statfs.h>

#include "fanotify.h"

static bool should_merge(struct fsnotify_event *old_fsn,
			 struct fsnotify_event *new_fsn)
{
	struct fanotify_event *old, *new;

	pr_debug("%s: old=%p new=%p\n", __func__, old_fsn, new_fsn);
	old = FANOTIFY_E(old_fsn);
	new = FANOTIFY_E(new_fsn);

	if (old_fsn->objectid != new_fsn->objectid || old->pid != new->pid ||
	    old->fh_type != new->fh_type || old->fh_len != new->fh_len)
		return false;

	if (fanotify_event_has_path(old)) {
		return old->path.mnt == new->path.mnt &&
			old->path.dentry == new->path.dentry;
	} else if (fanotify_event_has_fid(old)) {
		/*
		 * We want to merge many dirent events in the same dir (i.e.
		 * creates/unlinks/renames), but we do not want to merge dirent
		 * events referring to subdirs with dirent events referring to
		 * non subdirs, otherwise, user won't be able to tell from a
		 * mask FAN_CREATE|FAN_DELETE|FAN_ONDIR if it describes mkdir+
		 * unlink pair or rmdir+create pair of events.
		 */
		return (old->mask & FS_ISDIR) == (new->mask & FS_ISDIR) &&
			fanotify_fid_equal(&old->fid, &new->fid, old->fh_len);
	}

	/* Do not merge events if we failed to encode fid */
	return false;
}

/* and the list better be locked by something too! */
static int fanotify_merge(struct list_head *list, struct fsnotify_event *event)
{
	struct fsnotify_event *test_event;
	struct fanotify_event *new;

	pr_debug("%s: list=%p event=%p\n", __func__, list, event);
	new = FANOTIFY_E(event);

	/*
	 * Don't merge a permission event with any other event so that we know
	 * the event structure we have created in fanotify_handle_event() is the
	 * one we should check for permission response.
	 */
	if (fanotify_is_perm_event(new->mask))
		return 0;

	list_for_each_entry_reverse(test_event, list, list) {
		if (should_merge(test_event, event)) {
			FANOTIFY_E(test_event)->mask |= new->mask;
			return 1;
		}
	}

	return 0;
}

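The merge step in fanotify_merge() can be illustrated outside the kernel with a plain array in place of the linked list. The sketch below assumes a simplified event (`toy_event`, a hypothetical type) that is identified only by an object id. The real should_merge() compares more fields than that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_event {
	int object_id;
	uint32_t mask;
};

/*
 * Scan the queue from newest to oldest. If an event for the same object
 * is already queued, fold the new mask into it and return 1. Otherwise
 * return 0 and leave the queue untouched so the caller can append.
 */
static int toy_merge(struct toy_event *queue, size_t n,
		     const struct toy_event *ev)
{
	size_t i;

	for (i = n; i-- > 0; ) {
		if (queue[i].object_id == ev->object_id) {
			queue[i].mask |= ev->mask;
			return 1;
		}
	}
	return 0;
}
```

Scanning in reverse means the most recently queued matching event absorbs the new mask, which mirrors the use of list_for_each_entry_reverse() in the kernel function.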
/*
 * Wait for response to permission event. The function also takes care of
 * freeing the permission event (or offloads that in case the wait is canceled
 * by a signal). The function returns 0 in case access got allowed by userspace,
 * -EPERM in case userspace disallowed the access, and -ERESTARTSYS in case
 * the wait got interrupted by a signal.
 */
static int fanotify_get_response(struct fsnotify_group *group,
				 struct fanotify_perm_event *event,
				 struct fsnotify_iter_info *iter_info)
{
	int ret;

	pr_debug("%s: group=%p event=%p\n", __func__, group, event);

	ret = wait_event_killable(group->fanotify_data.access_waitq,
				  event->state == FAN_EVENT_ANSWERED);
	/* Signal pending? */
	if (ret < 0) {
		spin_lock(&group->notification_lock);
		/* Event reported to userspace and no answer yet? */
		if (event->state == FAN_EVENT_REPORTED) {
			/* Event will get freed once userspace answers to it */
			event->state = FAN_EVENT_CANCELED;
			spin_unlock(&group->notification_lock);
			return ret;
		}
		/* Event not yet reported? Just remove it. */
		if (event->state == FAN_EVENT_INIT)
			fsnotify_remove_queued_event(group, &event->fae.fse);
		/*
		 * Event may be also answered in case signal delivery raced
		 * with wakeup. In that case we have nothing to do besides
		 * freeing the event and reporting error.
		 */
		spin_unlock(&group->notification_lock);
		goto out;
	}

	/* userspace responded, convert to something usable */
	switch (event->response & ~FAN_AUDIT) {
	case FAN_ALLOW:
		ret = 0;
		break;
	case FAN_DENY:
	default:
		ret = -EPERM;
	}

	/* Check if the response should be audited */
	if (event->response & FAN_AUDIT)
		audit_fanotify(event->response & ~FAN_AUDIT);

	pr_debug("%s: group=%p event=%p about to return ret=%d\n", __func__,
		 group, event, ret);
out:
	fsnotify_destroy_event(group, &event->fae.fse);

	return ret;
}

fanotify: return only user requested event types in event mask
Modify fanotify_should_send_event() so that it now returns a mask for
an event that contains ONLY flags for the event types that have been
specifically requested by the user. Flags that may have been included
within the event mask, but have not been explicitly requested by the
user will not be present in the returned value.
As an example, consider a user who requests events of type
FAN_OPEN. Traditionally, the event mask returned within an event that
occurred on a filesystem object that has been marked for monitoring and is
opened, will only ever have the FAN_OPEN bit set. With the introduction of
the new flags like FAN_OPEN_EXEC, and perhaps any other future event
flags, there is a possibility of the returned event mask containing more
than a single bit set, despite having only requested the single event type.
Prior to these modifications performed to fanotify_should_send_event(), a
user would have received a bundled event mask containing flags FAN_OPEN
and FAN_OPEN_EXEC in the instance that a file was opened for execution via
execve(), for example. This means that a user would receive event types
in the returned event mask that have not been requested. This runs the
possibility of breaking existing systems and causing other unforeseen
issues.
To mitigate this possibility, fanotify_should_send_event() has been
modified to return the event mask containing ONLY event types explicitly
requested by the user. This means that we will NOT report events that the
user did not set a mask for, and we will NOT report events that the user
has set an ignore mask for.
The function name fanotify_should_send_event() has also been updated so
that it's more relevant to what it has been designed to do.
Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-11-08 11:05:49 +08:00
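The filtering rule described in the commit message above reduces to one bitwise expression. The sketch below uses illustrative bit values (`EV_OPEN`, `EV_OPEN_EXEC`, `EV_MODIFY` are hypothetical and not taken from the kernel headers) to show how an event that carries two flags is trimmed to the one the user asked for.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit values only; not the kernel's FAN_* definitions. */
#define EV_OPEN       0x01u
#define EV_OPEN_EXEC  0x02u
#define EV_MODIFY     0x04u

/*
 * Keep only the event bits that some mark requested and that no mark
 * asked to ignore.
 */
static uint32_t requested_events(uint32_t event_mask, uint32_t marks_mask,
				 uint32_t marks_ignored_mask)
{
	return event_mask & marks_mask & ~marks_ignored_mask;
}
```

With a mark that requests only `EV_OPEN`, an incoming `EV_OPEN | EV_OPEN_EXEC` event is reported as `EV_OPEN` alone, which is the behavior the commit sets out to guarantee.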
/*
 * This function returns a mask for an event that only contains the flags
 * that have been specifically requested by the user. Flags that may have
 * been included within the event mask, but have not been explicitly
 * requested by the user, will not be present in the returned mask.
 */
static u32 fanotify_group_event_mask(struct fsnotify_group *group,
				     struct fsnotify_iter_info *iter_info,
				     u32 event_mask, const void *data,
				     int data_type)
{
	__u32 marks_mask = 0, marks_ignored_mask = 0;
	__u32 test_mask, user_mask = FANOTIFY_OUTGOING_EVENTS;
	const struct path *path = fsnotify_data_path(data, data_type);
	struct fsnotify_mark *mark;
	int type;

	pr_debug("%s: report_mask=%x mask=%x data=%p data_type=%d\n",
		 __func__, iter_info->report_mask, event_mask, data, data_type);

	if (!FAN_GROUP_FLAG(group, FAN_REPORT_FID)) {
		/* Do we have path to open a file descriptor? */
		if (!path)
			return 0;
		/* Path type events are only relevant for files and dirs */
		if (!d_is_reg(path->dentry) && !d_can_lookup(path->dentry))
			return 0;
	}

	fsnotify_foreach_obj_type(type) {
		if (!fsnotify_iter_should_report_type(iter_info, type))
			continue;
		mark = iter_info->marks[type];
		/*
		 * If the event is for a child and this mark doesn't care about
		 * events on a child, don't send it!
		 */
		if (event_mask & FS_EVENT_ON_CHILD &&
		    (type != FSNOTIFY_OBJ_TYPE_INODE ||
		     !(mark->mask & FS_EVENT_ON_CHILD)))
			continue;

		marks_mask |= mark->mask;
		marks_ignored_mask |= mark->ignored_mask;
	}

	test_mask = event_mask & marks_mask & ~marks_ignored_mask;

	/*
	 * dirent modification events (create/delete/move) do not carry the
	 * child entry name/inode information. Instead, we report FAN_ONDIR
	 * for mkdir/rmdir so user can differentiate them from creat/unlink.
	 *
	 * For backward compatibility and consistency, do not report FAN_ONDIR
	 * to user in legacy fanotify mode (reporting fd) and report FAN_ONDIR
	 * to user in FAN_REPORT_FID mode for all event types.
	 */
	if (FAN_GROUP_FLAG(group, FAN_REPORT_FID)) {
		/* Do not report FAN_ONDIR without any event */
		if (!(test_mask & ~FAN_ONDIR))
			return 0;
	} else {
		user_mask &= ~FAN_ONDIR;
	}

	if (event_mask & FS_ISDIR &&
	    !(marks_mask & FS_ISDIR & ~marks_ignored_mask))
		return 0;

	return test_mask & user_mask;
}

static int fanotify_encode_fid(struct fanotify_event *event,
			       struct inode *inode, gfp_t gfp,
			       __kernel_fsid_t *fsid)
{
	struct fanotify_fid *fid = &event->fid;
	int dwords, bytes = 0;
	int err, type;

	fid->ext_fh = NULL;
	dwords = 0;
	err = -ENOENT;
	type = exportfs_encode_inode_fh(inode, NULL, &dwords, NULL);
	if (!dwords)
		goto out_err;

	bytes = dwords << 2;
	if (bytes > FANOTIFY_INLINE_FH_LEN) {
		/* Treat failure to allocate fh as failure to allocate event */
		err = -ENOMEM;
		fid->ext_fh = kmalloc(bytes, gfp);
		if (!fid->ext_fh)
			goto out_err;
	}

	type = exportfs_encode_inode_fh(inode, fanotify_fid_fh(fid, bytes),
					&dwords, NULL);
	err = -EINVAL;
	if (!type || type == FILEID_INVALID || bytes != dwords << 2)
		goto out_err;

	fid->fsid = *fsid;
	event->fh_len = bytes;

	return type;

out_err:
	pr_warn_ratelimited("fanotify: failed to encode fid (fsid=%x.%x, "
			    "type=%d, bytes=%d, err=%i)\n",
			    fsid->val[0], fsid->val[1], type, bytes, err);
	kfree(fid->ext_fh);
	fid->ext_fh = NULL;
	event->fh_len = 0;

	return FILEID_INVALID;
}

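fanotify_encode_fid() follows a two-pass pattern: a first call with a NULL buffer learns the required size, then a second call fills either inline storage or a heap buffer. The userspace sketch below imitates that shape only. `encode()`, `struct handle`, `INLINE_LEN` and the helper names are hypothetical and stand in for the exportfs and kmalloc calls; they are not kernel APIs.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define INLINE_LEN 8	/* hypothetical inline capacity, in bytes */

struct handle {
	unsigned char inline_buf[INLINE_LEN];
	unsigned char *ext_buf;	/* heap storage when data does not fit inline */
	size_t len;		/* 0 means "no valid handle" */
};

/*
 * Hypothetical encoder with a query mode: when buf is NULL or too small
 * it only reports the size it needs via *len and returns -1. Otherwise
 * it copies src into buf and returns 0.
 */
static int encode(const char *src, unsigned char *buf, size_t *len)
{
	size_t need = strlen(src);

	if (!buf || *len < need) {
		*len = need;
		return -1;
	}
	memcpy(buf, src, need);
	*len = need;
	return 0;
}

/* Pass 1 sizes the handle, pass 2 fills inline or heap storage. */
static int handle_init(struct handle *h, const char *src)
{
	size_t len = 0;
	unsigned char *dst;

	assert(h && src);
	h->ext_buf = NULL;
	h->len = 0;

	encode(src, NULL, &len);
	if (len == 0)
		return -1;

	dst = h->inline_buf;
	if (len > INLINE_LEN) {
		h->ext_buf = malloc(len);
		if (!h->ext_buf)
			return -1;
		dst = h->ext_buf;
	}

	if (encode(src, dst, &len) != 0) {
		free(h->ext_buf);
		h->ext_buf = NULL;
		return -1;
	}
	h->len = len;
	return 0;
}

/* Release heap storage, if any, and mark the handle invalid. */
static void handle_free(struct handle *h)
{
	free(h->ext_buf);
	h->ext_buf = NULL;
	h->len = 0;
}
```

As in the kernel function, every failure path leaves the handle with no external buffer and a zero length, so callers can treat a zero length as "no fid".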
/*
 * The inode to use as identifier when reporting fid depends on the event.
 * Report the modified directory inode on dirent modification events.
 * Report the "victim" inode otherwise.
 * For example:
 * FS_ATTRIB reports the child inode even if reported on a watched parent.
 * FS_CREATE reports the modified dir inode and not the created inode.
 */
static struct inode *fanotify_fid_inode(struct inode *to_tell, u32 event_mask,
					const void *data, int data_type)
{
	if (event_mask & ALL_FSNOTIFY_DIRENT_EVENTS)
		return to_tell;

	return (struct inode *)fsnotify_data_inode(data, data_type);
}

struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group,
					    struct inode *inode, u32 mask,
					    const void *data, int data_type,
					    __kernel_fsid_t *fsid)
{
	struct fanotify_event *event = NULL;
	gfp_t gfp = GFP_KERNEL_ACCOUNT;
	struct inode *id = fanotify_fid_inode(inode, mask, data, data_type);
	const struct path *path = fsnotify_data_path(data, data_type);

	/*
	 * For queues with unlimited length lost events are not expected and
	 * can possibly have security implications. Avoid losing events when
	 * memory is short. For the limited size queues, avoid OOM killer in the
	 * target monitoring memcg as it may have security repercussion.
	 */
	if (group->max_events == UINT_MAX)
		gfp |= __GFP_NOFAIL;
	else
		gfp |= __GFP_RETRY_MAYFAIL;

fs: fsnotify: account fsnotify metadata to kmemcg
Patch series "Directed kmem charging", v8.
The Linux kernel's memory cgroup allows limiting the memory usage of the
jobs running on the system to provide isolation between the jobs. All
the kernel memory allocated in the context of the job and marked with
__GFP_ACCOUNT will also be included in the memory usage and be limited
by the job's limit.
The kernel memory can only be charged to the memcg of the process in
whose context kernel memory was allocated. However, there are cases
where the allocated kernel memory should be charged to a memcg
different from the current process's memcg. This patch series
contains two such concrete use cases: fsnotify and buffer_head.
The fsnotify event objects can consume a lot of system memory for large
or unlimited queues if there is either no or slow listener. The events
are allocated in the context of the event producer. However they should
be charged to the event consumer. Similarly the buffer_head objects can
be allocated in a memcg different from the memcg of the page for which
buffer_head objects are being allocated.
To solve this issue, this patch series introduces a mechanism to charge
kernel memory to a given memcg. In case of fsnotify events, the memcg
of the consumer can be used for charging and for buffer_head, the memcg
of the page can be charged. For directed charging, the caller can use
the scope API memalloc_[un]use_memcg() to specify the memcg to charge
for all the __GFP_ACCOUNT allocations within the scope.
This patch (of 2):
A lot of memory can be consumed by the events generated for the huge or
unlimited queues if there is either no or slow listener. This can cause
system level memory pressure or OOMs. So, it's better to account the
fsnotify kmem caches to the memcg of the listener.
However, the listener can be in a different memcg than the memcg of the
producer, and these allocations happen in the context of the event
producer. This patch introduces a remote memcg charging API which the
producer can use to charge the allocations to the memcg of the listener.
There are seven fsnotify kmem caches, and among them allocations from
dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
inotify_inode_mark_cachep happen in the context of a syscall from the
listener. So, SLAB_ACCOUNT is enough for these caches.
The objects from fsnotify_mark_connector_cachep are not accounted as
they are small compared to the notification mark or events and it is
unclear whom to account connector to since it is shared by all events
attached to the inode.
The allocations from the event caches happen in the context of the event
producer. For such caches we will need to remote charge the allocations
to the listener's memcg. Thus we save the memcg reference in the
fsnotify_group structure of the listener.
This patch has also moved the members of fsnotify_group to keep the size
same, at least for 64 bit build, even with additional member by filling
the holes.
[shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-18 06:46:39 +08:00
	/* Whoever is interested in the event, pays for the allocation. */
	memalloc_use_memcg(group->memcg);

2017-10-31 04:14:56 +08:00
	if (fanotify_is_perm_event(mask)) {
2019-01-11 01:04:32 +08:00
		struct fanotify_perm_event *pevent;
2014-04-04 05:46:33 +08:00

2018-02-21 21:10:59 +08:00
		pevent = kmem_cache_alloc(fanotify_perm_event_cachep, gfp);
2014-04-04 05:46:33 +08:00
		if (!pevent)
2018-08-18 06:46:39 +08:00
			goto out;
2014-04-04 05:46:33 +08:00
		event = &pevent->fae;
		pevent->response = 0;
2019-01-08 21:02:44 +08:00
		pevent->state = FAN_EVENT_INIT;
2014-04-04 05:46:33 +08:00
		goto init;
	}
2018-02-21 21:10:59 +08:00
	event = kmem_cache_alloc(fanotify_event_cachep, gfp);
2014-04-04 05:46:33 +08:00
	if (!event)
2018-08-18 06:46:39 +08:00
		goto out;
2014-04-04 05:46:33 +08:00
init: __maybe_unused
2020-03-19 23:10:15 +08:00
	fsnotify_init_event(&event->fse, (unsigned long)inode);
2019-01-11 01:04:31 +08:00
	event->mask = mask;
2018-10-04 05:25:38 +08:00
	if (FAN_GROUP_FLAG(group, FAN_REPORT_TID))
		event->pid = get_pid(task_pid(current));
	else
		event->pid = get_pid(task_tgid(current));
2019-01-11 01:04:34 +08:00
	event->fh_len = 0;
2019-01-11 01:04:42 +08:00
	if (id && FAN_GROUP_FLAG(group, FAN_REPORT_FID)) {
2019-01-11 01:04:34 +08:00
		/* Report the event without a file identifier on encode error */
2019-01-11 01:04:42 +08:00
		event->fh_type = fanotify_encode_fid(event, id, gfp, fsid);
2020-03-19 23:10:12 +08:00
	} else if (path) {
2019-01-11 01:04:34 +08:00
		event->fh_type = FILEID_ROOT;
2020-03-19 23:10:12 +08:00
		event->path = *path;
		path_get(path);
2014-04-04 05:46:33 +08:00
	} else {
2019-01-11 01:04:34 +08:00
		event->fh_type = FILEID_INVALID;
2014-04-04 05:46:33 +08:00
		event->path.mnt = NULL;
		event->path.dentry = NULL;
	}
2018-08-18 06:46:39 +08:00
out:
	memalloc_unuse_memcg();
2014-04-04 05:46:33 +08:00
	return event;
}
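
The scoped charging used by the allocation path above (memalloc_use_memcg() before the allocations, memalloc_unuse_memcg() at the out label) can be illustrated with a small userspace sketch. Everything in it is hypothetical: the toy_* names, the byte counter standing in for a memcg, and the globals standing in for per-task state are assumptions made for illustration, not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a memory cgroup: only a byte counter. */
struct toy_memcg {
	size_t charged;
};

/*
 * The kernel tracks this state per task; plain globals keep this
 * single-threaded sketch short.
 */
static struct toy_memcg *current_memcg; /* cgroup of the running task */
static struct toy_memcg *active_memcg;  /* override set by the scope, or NULL */

/* Open a scope: accounted allocations are charged to @memcg. */
static void toy_use_memcg(struct toy_memcg *memcg)
{
	assert(active_memcg == NULL); /* this sketch does not support nesting */
	active_memcg = memcg;
}

/* Close the scope: charging falls back to the current task's cgroup. */
static void toy_unuse_memcg(void)
{
	active_memcg = NULL;
}

/* Allocate @size bytes and charge them to the scope target, if any. */
static void *toy_alloc_accounted(size_t size)
{
	struct toy_memcg *target = active_memcg ? active_memcg : current_memcg;
	void *obj = malloc(size);

	if (obj && target)
		target->charged += size;
	return obj;
}
```

In this model, a producer that calls toy_use_memcg(&listener) before allocating an event sees the bytes charged to the listener's counter, and allocations made after toy_unuse_memcg() are charged to the producer's own counter again.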

2019-01-11 01:04:37 +08:00
/*
 * Get cached fsid of the filesystem containing the object from any connector.
 * All connectors are supposed to have the same fsid, but we do not verify that
 * here.
 */
static __kernel_fsid_t fanotify_get_fsid(struct fsnotify_iter_info *iter_info)
{
	int type;
	__kernel_fsid_t fsid = {};

	fsnotify_foreach_obj_type(type) {
2019-04-25 00:39:57 +08:00
		struct fsnotify_mark_connector *conn;

2019-01-11 01:04:37 +08:00
		if (!fsnotify_iter_should_report_type(iter_info, type))
			continue;

2019-04-25 00:39:57 +08:00
		conn = READ_ONCE(iter_info->marks[type]->connector);
		/* Mark is just getting destroyed or created? */
		if (!conn)
			continue;
2019-06-19 18:34:44 +08:00
		if (!(conn->flags & FSNOTIFY_CONN_FLAG_HAS_FSID))
			continue;
		/* Pairs with smp_wmb() in fsnotify_add_mark_list() */
		smp_rmb();
2019-04-25 00:39:57 +08:00
		fsid = conn->fsid;
2019-01-11 01:04:37 +08:00
		if (WARN_ON_ONCE(!fsid.val[0] && !fsid.val[1]))
			continue;
		return fsid;
	}

	return fsid;
}
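
The flag check followed by smp_rmb() in fanotify_get_fsid() above is the reader half of a publish-then-read pattern: the fsid should only be read once the HAS_FSID flag is visible. A rough userspace analogue is sketched below. It swaps the kernel's smp_wmb()/smp_rmb() pair for C11 release/acquire atomics, which is a different (though related) primitive, and it assumes the writer stores the fsid before setting the flag, as the comment in the source implies. The toy_* types and functions are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for __kernel_fsid_t and the mark connector. */
struct toy_fsid {
	int val[2];
};

struct toy_connector {
	struct toy_fsid fsid;
	atomic_bool has_fsid; /* plays the role of FSNOTIFY_CONN_FLAG_HAS_FSID */
};

/* Writer: store the payload first, then publish the flag with release order. */
static void toy_publish_fsid(struct toy_connector *conn, struct toy_fsid fsid)
{
	conn->fsid = fsid;
	atomic_store_explicit(&conn->has_fsid, true, memory_order_release);
}

/*
 * Reader: check the flag with acquire order before touching the payload.
 * Returns false when no fsid has been published yet.
 */
static bool toy_read_fsid(struct toy_connector *conn, struct toy_fsid *out)
{
	assert(out != NULL);
	if (!atomic_load_explicit(&conn->has_fsid, memory_order_acquire))
		return false;
	*out = conn->fsid;
	return true;
}
```

A reader that sees the flag set is guaranteed to also see the fsid stored before it; a reader that does not see the flag simply skips the connector, much like the continue in the loop above.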

2014-01-22 07:48:14 +08:00
static int fanotify_handle_event(struct fsnotify_group *group,
				 struct inode *inode,
2016-11-21 09:19:09 +08:00
				 u32 mask, const void *data, int data_type,
2019-04-27 01:51:03 +08:00
				 const struct qstr *file_name, u32 cookie,
2016-11-11 00:51:50 +08:00
				 struct fsnotify_iter_info *iter_info)
2014-01-22 07:48:14 +08:00
{
	int ret = 0;
2019-01-11 01:04:32 +08:00
	struct fanotify_event *event;
2014-01-22 07:48:14 +08:00
	struct fsnotify_event *fsn_event;
2019-01-11 01:04:37 +08:00
	__kernel_fsid_t fsid = {};
2014-01-22 07:48:14 +08:00

	BUILD_BUG_ON(FAN_ACCESS != FS_ACCESS);
	BUILD_BUG_ON(FAN_MODIFY != FS_MODIFY);
2019-01-11 01:04:43 +08:00
	BUILD_BUG_ON(FAN_ATTRIB != FS_ATTRIB);
2014-01-22 07:48:14 +08:00
	BUILD_BUG_ON(FAN_CLOSE_NOWRITE != FS_CLOSE_NOWRITE);
	BUILD_BUG_ON(FAN_CLOSE_WRITE != FS_CLOSE_WRITE);
	BUILD_BUG_ON(FAN_OPEN != FS_OPEN);
2019-01-11 01:04:43 +08:00
	BUILD_BUG_ON(FAN_MOVED_TO != FS_MOVED_TO);
	BUILD_BUG_ON(FAN_MOVED_FROM != FS_MOVED_FROM);
	BUILD_BUG_ON(FAN_CREATE != FS_CREATE);
	BUILD_BUG_ON(FAN_DELETE != FS_DELETE);
	BUILD_BUG_ON(FAN_DELETE_SELF != FS_DELETE_SELF);
	BUILD_BUG_ON(FAN_MOVE_SELF != FS_MOVE_SELF);
2014-01-22 07:48:14 +08:00
	BUILD_BUG_ON(FAN_EVENT_ON_CHILD != FS_EVENT_ON_CHILD);
	BUILD_BUG_ON(FAN_Q_OVERFLOW != FS_Q_OVERFLOW);
	BUILD_BUG_ON(FAN_OPEN_PERM != FS_OPEN_PERM);
	BUILD_BUG_ON(FAN_ACCESS_PERM != FS_ACCESS_PERM);
	BUILD_BUG_ON(FAN_ONDIR != FS_ISDIR);
2018-11-08 11:07:14 +08:00
	BUILD_BUG_ON(FAN_OPEN_EXEC != FS_OPEN_EXEC);
2018-11-08 11:12:44 +08:00
	BUILD_BUG_ON(FAN_OPEN_EXEC_PERM != FS_OPEN_EXEC_PERM);
2014-01-22 07:48:14 +08:00

2019-01-11 01:04:43 +08:00
	BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 19);
2018-10-04 05:25:37 +08:00

2019-01-11 01:04:42 +08:00
	mask = fanotify_group_event_mask(group, iter_info, mask, data,
					 data_type);
fanotify: return only user requested event types in event mask
Modify fanotify_should_send_event() so that it now returns a mask for
an event that contains ONLY flags for the event types that have been
specifically requested by the user. Flags that may have been included
within the event mask, but have not been explicitly requested by the
user will not be present in the returned value.
As an example, given the situation where a user requests events of type
FAN_OPEN. Traditionally, the event mask returned within an event that
occurred on a filesystem object that has been marked for monitoring and is
opened, will only ever have the FAN_OPEN bit set. With the introduction of
the new flags like FAN_OPEN_EXEC, and perhaps any other future event
flags, there is a possibility of the returned event mask containing more
than a single bit set, despite having only requested the single event type.
Prior to these modifications performed to fanotify_should_send_event(), a
user would have received a bundled event mask containing flags FAN_OPEN
and FAN_OPEN_EXEC in the instance that a file was opened for execution via
execve(), for example. This means that a user would receive event types
in the returned event mask that have not been requested. This runs the
possibility of breaking existing systems and causing other unforeseen
issues.
To mitigate this possibility, fanotify_should_send_event() has been
modified to return the event mask containing ONLY event types explicitly
requested by the user. This means that we will NOT report events that the
user did not set a mask for, and we will NOT report events that the user
has set an ignore mask for.
The function name fanotify_should_send_event() has also been updated so
that it's more relevant to what it has been designed to do.
Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-11-08 11:05:49 +08:00
	if (!mask)
2014-01-22 07:48:15 +08:00
		return 0;

2014-01-22 07:48:14 +08:00
	pr_debug("%s: group=%p inode=%p mask=%x\n", __func__, group, inode,
		 mask);

2017-10-31 04:14:56 +08:00
	if (fanotify_is_perm_event(mask)) {
2017-10-31 04:14:56 +08:00
		/*
		 * fsnotify_prepare_user_wait() fails if we race with mark
		 * deletion. Just let the operation pass in that case.
		 */
		if (!fsnotify_prepare_user_wait(iter_info))
			return 0;
	}

2019-04-25 00:39:57 +08:00
	if (FAN_GROUP_FLAG(group, FAN_REPORT_FID)) {
2019-01-11 01:04:37 +08:00
		fsid = fanotify_get_fsid(iter_info);
2019-04-25 00:39:57 +08:00
		/* Racing with mark destruction or creation? */
		if (!fsid.val[0] && !fsid.val[1])
			return 0;
	}
2019-01-11 01:04:37 +08:00

2019-01-11 01:04:42 +08:00
	event = fanotify_alloc_event(group, inode, mask, data, data_type,
				     &fsid);
2017-10-31 04:14:56 +08:00
	ret = -ENOMEM;
2018-02-21 22:07:52 +08:00
	if (unlikely(!event)) {
		/*
		 * We don't queue overflow events for permission events as
		 * there the access is denied and so no event is in fact lost.
		 */
		if (!fanotify_is_perm_event(mask))
			fsnotify_queue_overflow(group);
2017-10-31 04:14:56 +08:00
		goto finish;
2018-02-21 22:07:52 +08:00
	}
2014-01-22 07:48:14 +08:00

	fsn_event = &event->fse;
2014-08-07 07:03:26 +08:00
	ret = fsnotify_add_event(group, fsn_event, fanotify_merge);
2014-01-29 01:53:22 +08:00
	if (ret) {
2014-02-22 02:07:54 +08:00
		/* Permission events shouldn't be merged */
2018-10-04 05:25:35 +08:00
		BUG_ON(ret == 1 && mask & FANOTIFY_PERM_EVENTS);
2014-01-22 07:48:14 +08:00
		/* Our event wasn't used in the end. Free it. */
		fsnotify_destroy_event(group, fsn_event);
2014-02-22 02:07:54 +08:00

2017-10-31 04:14:56 +08:00
		ret = 0;
2017-10-31 04:14:56 +08:00
	} else if (fanotify_is_perm_event(mask)) {
2016-11-11 00:45:16 +08:00
		ret = fanotify_get_response(group, FANOTIFY_PE(fsn_event),
					    iter_info);
2014-01-29 04:38:06 +08:00
	}
2017-10-31 04:14:56 +08:00
finish:
2017-10-31 04:14:56 +08:00
	if (fanotify_is_perm_event(mask))
2017-10-31 04:14:56 +08:00
		fsnotify_finish_user_wait(iter_info);
2017-10-31 04:14:56 +08:00

2014-01-22 07:48:14 +08:00
	return ret;
}
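
The commit message quoted inside the function above describes reporting only the event types the user explicitly requested, excluding anything covered by an ignore mask. The core of that rule can be sketched as a single bitwise expression. This is a simplified model rather than the body of fanotify_group_event_mask(): the toy_* names are hypothetical, the constants are redefined locally with the values FAN_OPEN and FAN_OPEN_EXEC have in <linux/fanotify.h>, and the per-mark iteration the kernel performs is collapsed into two precomputed masks.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies so the sketch builds without kernel headers. */
#define TOY_FAN_OPEN      0x00000020u
#define TOY_FAN_OPEN_EXEC 0x00001000u

/* Each event type needs its own bit for the masking below to be meaningful. */
static_assert((TOY_FAN_OPEN & TOY_FAN_OPEN_EXEC) == 0,
	      "event bits must not overlap");

/*
 * Keep only the bits of @event_mask that the user asked for in
 * @marks_mask and did not suppress through @ignored_mask.
 */
static uint32_t toy_reported_mask(uint32_t event_mask, uint32_t marks_mask,
				  uint32_t ignored_mask)
{
	return event_mask & marks_mask & ~ignored_mask;
}
```

With this rule, an execve() that raises both the open and open-for-exec bits is reported as a plain open to a listener that asked only for open events, and a zero result corresponds to the early return 0 taken when the filtered mask is empty.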

2010-10-29 05:21:58 +08:00
static void fanotify_free_group_priv(struct fsnotify_group *group)
{
	struct user_struct *user;

	user = group->fanotify_data.user;
	atomic_dec(&user->fanotify_listeners);
	free_uid(user);
}

2014-01-22 07:48:14 +08:00
static void fanotify_free_event(struct fsnotify_event *fsn_event)
{
2019-01-11 01:04:32 +08:00
	struct fanotify_event *event;
2014-01-22 07:48:14 +08:00

	event = FANOTIFY_E(fsn_event);
2019-01-11 01:04:34 +08:00
	if (fanotify_event_has_path(event))
		path_put(&event->path);
	else if (fanotify_event_has_ext_fh(event))
		kfree(event->fid.ext_fh);
2018-10-04 05:25:38 +08:00
	put_pid(event->pid);
2019-01-11 01:04:31 +08:00
	if (fanotify_is_perm_event(event->mask)) {
2014-04-04 05:46:33 +08:00
		kmem_cache_free(fanotify_perm_event_cachep,
				FANOTIFY_PE(fsn_event));
		return;
	}
2014-01-22 07:48:14 +08:00
	kmem_cache_free(fanotify_event_cachep, event);
}

2016-12-22 01:06:12 +08:00
static void fanotify_free_mark(struct fsnotify_mark *fsn_mark)
{
	kmem_cache_free(fanotify_mark_cache, fsn_mark);
}

2009-12-18 10:24:25 +08:00
const struct fsnotify_ops fanotify_fsnotify_ops = {
	.handle_event = fanotify_handle_event,
2010-10-29 05:21:58 +08:00
	.free_group_priv = fanotify_free_group_priv,
2014-01-22 07:48:14 +08:00
	.free_event = fanotify_free_event,
2016-12-22 01:06:12 +08:00
	.free_mark = fanotify_free_mark,
2009-12-18 10:24:25 +08:00
};