Mostly supporting filesystems outside of init_user_ns is
s/&init_usre_ns/dquot->dq_sb->s_user_ns/. An actual need for
supporting quotas on filesystems outside of s_user_ns is quite a ways
away and to be done responsibily needs an audit on what can happen
with hostile quota files. Until that audit is complete don't attempt
to support quota files on filesystems outside of s_user_ns.
Cc: Jan Kara <jack@suse.cz>
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In Q_XSETQLIMIT use sb->s_user_ns to detect when we are dealing with
the filesystems notion of id 0.
Cc: Jan Kara <jack@suse.cz>
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Inspired-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Introduce the helper qid_has_mapping and use it to ensure that the
quota system only considers qids that map to the filesystems
s_user_ns.
In practice for quota supporting filesystems today this is the exact
same check as qid_valid. As only 0xffffffff aka (qid_t)-1 does not
map into init_user_ns.
Replace the qid_valid calls with qid_has_mapping as values come in
from userspace. This is harmless today and it prepares the quota
system to work on filesystems with quotas but mounted by unprivileged
users.
Call qid_has_mapping from dqget. This ensures the passed in qid has a
prepresentation on the underlying filesystem. Previously this was
unnecessary as filesystesm never had qids that could not map. With
the introduction of filesystems outside of s_user_ns this will not
remain true.
All of this ensures the quota code never has to deal with qids that
don't map to the underlying filesystem.
Cc: Jan Kara <jack@suse.cz>
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
It is expected that filesystems can not represent uids and gids from
outside of their user namespace. Keep things simple by not even
trying to create filesystem nodes with non-sense uids and gids.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
When a filesystem outside of init_user_ns is mounted it could have
uids and gids stored in it that do not map to init_user_ns.
The plan is to allow those filesystems to set i_uid to INVALID_UID and
i_gid to INVALID_GID for unmapped uids and gids and then to handle
that strange case in the vfs to ensure there is consistent robust
handling of the weirdness.
Upon a careful review of the vfs and filesystems about the only case
where there is any possibility of confusion or trouble is when the
inode is written back to disk. In that case filesystems typically
read the inode->i_uid and inode->i_gid and write them to disk even
when just an inode timestamp is being updated.
Which leads to a rule that is very simple to implement and understand
inodes whose i_uid or i_gid is not valid may not be written.
In dealing with access times this means treat those inodes as if the
inode flag S_NOATIME was set. Reads of the inodes appear safe and
useful, but any write or modification is disallowed. The only inode
write that is allowed is a chown that sets the uid and gid on the
inode to valid values. After such a chown the inode is normal and may
be treated as such.
Denying all writes to inodes with uids or gids unknown to the vfs also
prevents several oddball cases where corruption would have occurred
because the vfs does not have complete information.
One problem case that is prevented is attempting to use the gid of a
directory for new inodes where the directories sgid bit is set but the
directories gid is not mapped.
Another problem case avoided is attempting to update the evm hash
after setxattr, removexattr, and setattr. As the evm hash includeds
the inode->i_uid or inode->i_gid not knowning the uid or gid prevents
a correct evm hash from being computed. evm hash verification also
fails when i_uid or i_gid is unknown but that is essentially harmless
as it does not cause filesystem corruption.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Using INVALID_[UG]ID for the LSM file creation context doesn't
make sense, so return an error if the inode passed to
set_create_file_as() has an invalid id.
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Filesystem uids which don't map into a user namespace may result
in inode->i_uid being INVALID_UID. A symlink and its parent
could have different owners in the filesystem can both get
mapped to INVALID_UID, which may result in following a symlink
when this would not have otherwise been permitted when protected
symlinks are enabled.
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Update posix_acl_valid to verify that an acl is within a user namespace.
Update the callers of posix_acl_valid to pass in an appropriate
user namespace. For posix_acl_xattr_set and v9fs_xattr_set_acl pass in
inode->i_sb->s_user_ns to posix_acl_valid. For md_unpack_acl pass in
&init_user_ns as no inode or superblock is in sight.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Refuse to admit any user namespace has a mapping of the INVALID_UID
and the INVALID_GID when !CONFIG_USER_NS.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Add checks to notify_change to verify that uid and gid changes
will map into the superblock's user namespace. If they do not
fail with -EOVERFLOW.
This is mandatory so that fileystems don't have to even think
of dealing with ia_uid and ia_gid that
--EWB Moved the test from inode_change_ok to notify_change
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Security labels from unprivileged mounts in user namespaces must
be ignored. Force superblocks from user namespaces whose labeling
behavior is to use xattrs to use mountpoint labeling instead.
For the mountpoint label, default to converting the current task
context into a form suitable for file objects, but also allow the
policy writer to specify a different label through policy
transition rules.
Pieced together from code snippets provided by Stephen Smalley.
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
The SMACK64, SMACK64EXEC, and SMACK64MMAP labels are all handled
differently in untrusted mounts. This is confusing and
potentically problematic. Change this to handle them all the same
way that SMACK64 is currently handled; that is, read the label
from disk and check it at use time. For SMACK64 and SMACK64MMAP
access is denied if the label does not match smk_root. To be
consistent with suid, a SMACK64EXEC label which does not match
smk_root will still allow execution of the file but will not run
with the label supplied in the xattr.
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Security labels from unprivileged mounts cannot be trusted.
Ideally for these mounts we would assign the objects in the
filesystem the same label as the inode for the backing device
passed to mount. Unfortunately it's currently impossible to
determine which inode this is from the LSM mount hooks, so we
settle for the label of the process doing the mount.
This label is assigned to s_root, and also to smk_default to
ensure that new inodes receive this label. The transmute property
is also set on s_root to make this behavior more explicit, even
though it is technically not necessary.
If a filesystem has existing security labels, access to inodes is
permitted if the label is the same as smk_root, otherwise access
is denied. The SMACK64EXEC xattr is completely ignored.
Explicit setting of security labels continues to require
CAP_MAC_ADMIN in init_user_ns.
Altogether, this ensures that filesystem objects are not
accessible to subjects which cannot already access the backing
store, that MAC is not violated for any objects in the fileystem
which are already labeled, and that a user cannot use an
unprivileged mount to gain elevated MAC privileges.
sysfs, tmpfs, and ramfs are already mountable from user
namespaces and support security labels. We can't rule out the
possibility that these filesystems may already be used in mounts
from user namespaces with security lables set from the init
namespace, so failing to trust lables in these filesystems may
introduce regressions. It is safe to trust labels from these
filesystems, since the unprivileged user does not control the
backing store and thus cannot supply security labels, so an
explicit exception is made to trust labels from these
filesystems.
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
If a process gets access to a mount from a different user
namespace, that process should not be able to take advantage of
setuid files or selinux entrypoints from that filesystem. Prevent
this by treating mounts from other mount namespaces and those not
owned by current_user_ns() or an ancestor as nosuid.
This will make it safer to allow more complex filesystems to be
mounted in non-root user namespaces.
This does not remove the need for MNT_LOCK_NOSUID. The setuid,
setgid, and file capability bits can no longer be abused if code in
a user namespace were to clear nosuid on an untrusted filesystem,
but this patch, by itself, is insufficient to protect the system
from abuse of files that, when execed, would increase MAC privilege.
As a more concrete explanation, any task that can manipulate a
vfsmount associated with a given user namespace already has
capabilities in that namespace and all of its descendents. If they
can cause a malicious setuid, setgid, or file-caps executable to
appear in that mount, then that executable will only allow them to
elevate privileges in exactly the set of namespaces in which they
are already privileges.
On the other hand, if they can cause a malicious executable to
appear with a dangerous MAC label, running it could change the
caller's security context in a way that should not have been
possible, even inside the namespace in which the task is confined.
As a hardening measure, this would have made CVE-2014-5207 much
more difficult to exploit.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: James Morris <james.l.morris@oracle.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Capability sets attached to files must be ignored except in the
user namespaces where the mounter is privileged, i.e. s_user_ns
and its descendants. Otherwise a vector exists for gaining
privileges in namespaces where a user is not already privileged.
Add a new helper function, current_in_user_ns(), to test whether a user
namespace is the same as or a descendant of another namespace.
Use this helper to determine whether a file's capability set
should be applied to the caps constructed during exec.
--EWB Replaced in_userns with the simpler current_in_userns.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Now that SB_I_NODEV controls the nodev behavior devpts can just clear
this flag during mount. Simplifying the code and making it easier
to audit how the code works. While still preserving the invariant
that s_iflags is only modified during mount.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Replace the implict setting of MNT_NODEV on mounts that happen with
just user namespace permissions with an implicit setting of SB_I_NODEV
in s_iflags. The visibility of the implicit MNT_NODEV has caused
problems in the past.
With this change the fragile case where an implicit MNT_NODEV needs to
be preserved in do_remount is removed. Using SB_I_NODEV is much less
fragile as s_iflags are set during the original mount and never
changed.
In do_new_mount with the implicit setting of MNT_NODEV gone, the only
code that can affect mnt_flags is fs_fully_visible so simplify the if
statement and reduce the indentation of the code to make that clear.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Verify all filesystems that we check in mount_too_revealing set
SB_I_NOEXEC and SB_I_NODEV in sb->s_iflags. That is true for today
and it should remain true in the future.
Remove the now unnecessary checks from mnt_already_visibile that
ensure MNT_LOCK_NOSUID, MNT_LOCK_NOEXEC, and MNT_LOCK_NODEV are
preserved. Making the code shorter and easier to read.
Relying on SB_I_NOEXEC and SB_I_NODEV instead of the user visible
MNT_NOSUID, MNT_NOEXEC, and MNT_NODEV ensures the many current
systems where proc and sysfs are mounted with "nosuid, nodev, noexec"
and several slightly buggy container applications don't bother to
set those flags continue to work.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Introduce a function may_open_dev that tests MNT_NODEV and a new
superblock flab SB_I_NODEV. Use this new function in all of the
places where MNT_NODEV was previously tested.
Add the new SB_I_NODEV s_iflag to proc, sysfs, and mqueuefs as those
filesystems should never support device nodes, and a simple superblock
flags makes that very hard to get wrong. With SB_I_NODEV set if any
device nodes somehow manage to show up on on a filesystem those
device nodes will be unopenable.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Set SB_I_NOEXEC on mqueuefs to ensure small implementation mistakes
do not result in executable on mqueuefs by accident.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The cgroup filesystem is in the same boat as sysfs. No one ever
permits executables of any kind on the cgroup filesystem, and there is
no reasonable future case to support executables in the future.
Therefore move the setting of SB_I_NOEXEC which makes the code proof
against future mistakes of accidentally creating executables from
sysfs to kernfs itself. Making the code simpler and covering the
sysfs, cgroup, and cgroup2 filesystems.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Allowing a filesystem to be mounted by other than root in the initial
user namespace is a filesystem property not a mount namespace property
and as such should be checked in filesystem specific code. Move the
FS_USERNS_MOUNT test into super.c:sget_userns().
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Start marking filesystems with a user namespace owner, s_user_ns. In
this change this is only used for permission checks of who may mount a
filesystem. Ultimately s_user_ns will be used for translating ids and
checking capabilities for filesystems mounted from user namespaces.
The default policy for setting s_user_ns is implemented in sget(),
which arranges for s_user_ns to be set to current_user_ns() and to
ensure that the mounter of the filesystem has CAP_SYS_ADMIN in that
user_ns.
The guts of sget are split out into another function sget_userns().
The function sget_userns calls alloc_super with the specified user
namespace or it verifies the existing superblock that was found
has the expected user namespace, and fails with EBUSY when it is not.
This failing prevents users with the wrong privileges mounting a
filesystem.
The reason for the split of sget_userns from sget is that in some
cases such as mount_ns and kernfs_mount_ns a different policy for
permission checking of mounts and setting s_user_ns is necessary, and
the existence of sget_userns() allows those policies to be
implemented.
The helper mount_ns is expected to be used for filesystems such as
proc and mqueuefs which present per namespace information. The
function mount_ns is modified to call sget_userns instead of sget to
ensure the user namespace owner of the namespace whose information is
presented by the filesystem is used on the superblock.
For sysfs and cgroup the appropriate permission checks are already in
place, and kernfs_mount_ns is modified to call sget_userns so that
the init_user_ns is the only user namespace used.
For the cgroup filesystem cgroup namespace mounts are bind mounts of a
subset of the full cgroup filesystem and as such s_user_ns must be the
same for all of them as there is only a single superblock.
Mounts of sysfs that vary based on the network namespace could in principle
change s_user_ns but it keeps the analysis and implementation of kernfs
simpler if that is not supported, and at present there appear to be no
benefits from supporting a different s_user_ns on any sysfs mount.
Getting the details of setting s_user_ns correct has been
a long process. Thanks to Pavel Tikhorirorv who spotted a leak
in sget_userns. Thanks to Seth Forshee who has kept the work alive.
Thanks-to: Seth Forshee <seth.forshee@canonical.com>
Thanks-to: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Move the call of get_pid_ns, the call of proc_parse_options, and
the setting of s_iflags into proc_fill_super so that mount_ns
can be used.
Convert proc_mount to call mount_ns and remove the now unnecessary
code.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Reviewed-by: Djalal Harouni <tixxdz@gmail.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Today what is normally called data (the mount options) is not passed
to fill_super through mount_ns.
Pass the mount options and the namespace separately to mount_ns so
that filesystems such as proc that have mount options, can use
mount_ns.
Pass the user namespace to mount_ns so that the standard permission
check that verifies the mounter has permissions over the namespace can
be performed in mount_ns instead of in each filesystems .mount method.
Thus removing the duplication between mqueuefs and proc in terms of
permission checks. The extra permission check does not currently
affect the rpc_pipefs filesystem and the nfsd filesystem as those
filesystems do not currently allow unprivileged mounts. Without
unpvileged mounts it is guaranteed that the caller has already passed
capable(CAP_SYS_ADMIN) which guarantees extra permission check will
pass.
Update rpc_pipefs and the nfsd filesystem to ensure that the network
namespace reference is always taken in fill_super and always put in kill_sb
so that the logic is simpler and so that errors originating inside of
fill_super do not cause a network namespace leak.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Allow the ipc namespace initialization code to depend on ns->user_ns
being set during initialization.
In particular this allows mq_init_ns to use ns->user_ns for permission
checks and initializating s_user_ns while the the mq filesystem is
being mounted.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Suggested-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Replace the call of fs_fully_visible in do_new_mount from before the
new superblock is allocated with a call of mount_too_revealing after
the superblock is allocated. This winds up being a much better location
for maintainability of the code.
The first change this enables is the replacement of FS_USERNS_VISIBLE
with SB_I_USERNS_VISIBLE. Moving the flag from struct filesystem_type
to sb_iflags on the superblock.
Unfortunately mount_too_revealing fundamentally needs to touch
mnt_flags adding several MNT_LOCKED_XXX flags at the appropriate
times. If the mnt_flags did not need to be touched the code
could be easily moved into the filesystem specific mount code.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In rare cases it is possible for s_flags & MS_RDONLY to be set but
MNT_READONLY to be clear. This starting combination can cause
fs_fully_visible to fail to ensure that the new mount is readonly.
Therefore force MNT_LOCK_READONLY in the new mount if MS_RDONLY
is set on the source filesystem of the mount.
In general both MS_RDONLY and MNT_READONLY are set at the same for
mounts so I don't expect any programs to care. Nor do I expect
MS_RDONLY to be set on proc or sysfs in the initial user namespace,
which further decreases the likelyhood of problems.
Which means this change should only affect system configurations by
paranoid sysadmins who should welcome the additional protection
as it keeps people from wriggling out of their policies.
Cc: stable@vger.kernel.org
Fixes: 8c6cf9cc82 ("mnt: Modify fs_fully_visible to deal with locked ro nodev and atime")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
MNT_LOCKED implies on a child mount implies the child is locked to the
parent. So while looping through the children the children should be
tested (not their parent).
Typically an unshare of a mount namespace locks all mounts together
making both the parent and the slave as locked but there are a few
corner cases where other things work.
Cc: stable@vger.kernel.org
Fixes: ceeb0e5d39 ("vfs: Ignore unlocked mounts in fs_fully_visible")
Reported-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Add this trivial missing error handling.
Cc: stable@vger.kernel.org
Fixes: 1b852bceb0 ("mnt: Refactor the logic for mounting sysfs and proc in a user namespace")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull parisc fixes from Helge Deller:
- Fix printk time stamps on SMP systems which got wrong due to a patch
which was added during the merge window
- Fix two bugs in the stack backtrace code: Races in module unloading
and possible invalid accesses to memory due to wrong instruction
decoding (Mikulas Patocka)
- Fix userspace crash when syscalls access invalid unaligned userspace
addresses. Those syscalls will now return EFAULT as expected.
(tagged for stable kernel series)
* 'parisc-4.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Move die_if_kernel() prototype into traps.h header
parisc: Fix pagefault crash in unaligned __get_user() call
parisc: Fix printk time during boot
parisc: Fix backtrace on PA-RISC
Pull key handling update from James Morris:
"This alters a new keyctl function added in the current merge window to
allow for a future extension planned for the next merge window"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
KEYS: Add placeholder for KDF usage with DH
The /dev/ptmx device node is changed to lookup the directory entry "pts"
in the same directory as the /dev/ptmx device node was opened in. If
there is a "pts" entry and that entry is a devpts filesystem /dev/ptmx
uses that filesystem. Otherwise the open of /dev/ptmx fails.
The DEVPTS_MULTIPLE_INSTANCES configuration option is removed, so that
userspace can now safely depend on each mount of devpts creating a new
instance of the filesystem.
Each mount of devpts is now a separate and equal filesystem.
Reserved ttys are now available to all instances of devpts where the
mounter is in the initial mount namespace.
A new vfs helper path_pts is introduced that finds a directory entry
named "pts" in the directory of the passed in path, and changes the
passed in path to point to it. The helper path_pts uses a function
path_parent_directory that was factored out of follow_dotdot.
In the implementation of devpts:
- devpts_mnt is killed as it is no longer meaningful if all mounts of
devpts are equal.
- pts_sb_from_inode is replaced by just inode->i_sb as all cached
inodes in the tty layer are now from the devpts filesystem.
- devpts_add_ref is rolled into the new function devpts_ptmx. And the
unnecessary inode hold is removed.
- devpts_del_ref is renamed devpts_release and reduced to just a
deacrivate_super.
- The newinstance mount option continues to be accepted but is now
ignored.
In devpts_fs.h definitions for when !CONFIG_UNIX98_PTYS are removed as
they are never used.
Documentation/filesystems/devices.txt is updated to describe the current
situation.
This has been verified to work properly on openwrt-15.05, centos5,
centos6, centos7, debian-6.0.2, debian-7.9, debian-8.2, ubuntu-14.04.3,
ubuntu-15.10, fedora23, magia-5, mint-17.3, opensuse-42.1,
slackware-14.1, gentoo-20151225 (13.0?), archlinux-2015-12-01. With the
caveat that on centos6 and on slackware-14.1 that there wind up being
two instances of the devpts filesystem mounted on /dev/pts, the lower
copy does not end up getting used.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg KH <greg@kroah.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Aurelien Jarno <aurelien@aurel32.net>
Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
Cc: Jann Horn <jann@thejh.net>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Florian Weimer <fw@deneb.enyo.de>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
One of the debian buildd servers had this crash in the syslog without
any other information:
Unaligned handler failed, ret = -2
clock_adjtime (pid 22578): Unaligned data reference (code 28)
CPU: 1 PID: 22578 Comm: clock_adjtime Tainted: G E 4.5.0-2-parisc64-smp #1 Debian 4.5.4-1
task: 000000007d9960f8 ti: 00000001bde7c000 task.ti: 00000001bde7c000
YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
PSW: 00001000000001001111100000001111 Tainted: G E
r00-03 000000ff0804f80f 00000001bde7c2b0 00000000402d2be8 00000001bde7c2b0
r04-07 00000000409e1fd0 00000000fa6f7fff 00000001bde7c148 00000000fa6f7fff
r08-11 0000000000000000 00000000ffffffff 00000000fac9bb7b 000000000002b4d4
r12-15 000000000015241c 000000000015242c 000000000000002d 00000000fac9bb7b
r16-19 0000000000028800 0000000000000001 0000000000000070 00000001bde7c218
r20-23 0000000000000000 00000001bde7c210 0000000000000002 0000000000000000
r24-27 0000000000000000 0000000000000000 00000001bde7c148 00000000409e1fd0
r28-31 0000000000000001 00000001bde7c320 00000001bde7c350 00000001bde7c218
sr00-03 0000000001200000 0000000001200000 0000000000000000 0000000001200000
sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000
IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000402d2e84 00000000402d2e88
IIR: 0ca0d089 ISR: 0000000001200000 IOR: 00000000fa6f7fff
CPU: 1 CR30: 00000001bde7c000 CR31: ffffffffffffffff
ORIG_R28: 00000002369fe628
IAOQ[0]: compat_get_timex+0x2dc/0x3c0
IAOQ[1]: compat_get_timex+0x2e0/0x3c0
RP(r2): compat_get_timex+0x40/0x3c0
Backtrace:
[<00000000402d4608>] compat_SyS_clock_adjtime+0x40/0xc0
[<0000000040205024>] syscall_exit+0x0/0x14
This means the userspace program clock_adjtime called the clock_adjtime()
syscall and then crashed inside the compat_get_timex() function.
Syscalls should never crash programs, but instead return EFAULT.
The IIR register contains the executed instruction, which disassebles
into "ldw 0(sr3,r5),r9".
This load-word instruction is part of __get_user() which tried to read the word
at %r5/IOR (0xfa6f7fff). This means the unaligned handler jumped in. The
unaligned handler is able to emulate all ldw instructions, but it fails if it
fails to read the source e.g. because of page fault.
The following program reproduces the problem:
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/mman.h>
int main(void) {
/* allocate 8k */
char *ptr = mmap(NULL, 2*4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
/* free second half (upper 4k) and make it invalid. */
munmap(ptr+4096, 4096);
/* syscall where first int is unaligned and clobbers into invalid memory region */
/* syscall should return EFAULT */
return syscall(__NR_clock_adjtime, 0, ptr+4095);
}
To fix this issue we simply need to check if the faulting instruction address
is in the exception fixup table when the unaligned handler failed. If it
is, call the fixup routine instead of crashing.
While looking at the unaligned handler I found another issue as well: The
target register should not be modified if the handler was unsuccessful.
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: stable@vger.kernel.org
This patch fixes backtrace on PA-RISC
There were several problems:
1) The code that decodes instructions handles instructions that subtract
from the stack pointer incorrectly. If the instruction subtracts the
number X from the stack pointer the code increases the frame size by
(0x100000000-X). This results in invalid accesses to memory and
recursive page faults.
2) Because gcc reorders blocks, handling instructions that subtract from
the frame pointer is incorrect. For example, this function
int f(int a)
{
if (__builtin_expect(a, 1))
return a;
g();
return a;
}
is compiled in such a way, that the code that decreases the stack
pointer for the first "return a" is placed before the code for "g" call.
If we recognize this decrement, we mistakenly believe that the frame
size for the "g" call is zero.
To fix problems 1) and 2), the patch doesn't recognize instructions that
decrease the stack pointer at all. To further safeguard the unwind code
against nonsense values, we don't allow frame size larger than
Total_frame_size.
3) The backtrace is not locked. If stack dump races with module unload,
invalid table can be accessed.
This patch adds a spinlock when processing module tables.
Note, that for correct backtrace, you need recent binutils.
Binutils 2.18 from Debian 5 produce garbage unwind tables.
Binutils 2.21 work better (it sometimes forgets function frames, but at
least it doesn't generate garbage).
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Helge Deller <deller@gmx.de>
Pull drm fixes from Dave Airlie:
"A bunch of ARM drivers got into the fixes vibe this time around, so
this contains a bunch of fixes for imx, atmel hlcdc, arm hdlcd (only
so many combos of hlcd), mediatek and omap drm.
Other than that there is one mgag200 fix and a few core drm regression
fixes"
* tag 'drm-fixes-for-v4.7-rc2' of git://people.freedesktop.org/~airlied/linux: (34 commits)
drm/omap: fix unused variable warning.
drm: hdlcd: Add information about the underlying framebuffers in debugfs
drm: hdlcd: Cleanup the atomic plane operations
drm/hdlcd: Fix up crtc_state->event handling
drm: hdlcd: Revamp runtime power management
drm/mediatek: mtk_dsi: Remove spurious drm_connector_unregister
drm/mediatek: mtk_dpi: remove invalid error message
drm: atmel-hlcdc: fix a NULL check
drm: atmel-hlcdc: fix atmel_hlcdc_crtc_reset() implementation
drm/mgag200: Black screen fix for G200e rev 4
drm: Wrap direct calls to driver->gem_free_object from CMA
drm: fix fb refcount issue with atomic modesetting
drm: make drm_atomic_set_mode_prop_for_crtc() more reliable
drm/sti: remove extra mode fixup
drm: add missing drm_mode_set_crtcinfo call
drm/omap: include gpio/consumer.h where needed
drm/omap: include linux/seq_file.h where needed
Revert "drm/omap: no need to select OMAP2_DSS"
drm/omap: Remove regulator API abuse
OMAPDSS: HDMI5: Change DDC timings
...
Pull btrfs fixes from Chris Mason:
"The important part of this pull is Filipe's set of fixes for btrfs
device replacement. Filipe fixed a few issues seen on the list and a
number he found on his own"
* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent
Btrfs: fix race between device replace and read repair
Btrfs: fix race between device replace and discard
Btrfs: fix race between device replace and chunk allocation
Btrfs: fix race setting block group back to RW mode during device replace
Btrfs: fix unprotected assignment of the left cursor for device replace
Btrfs: fix race setting block group readonly during device replace
Btrfs: fix race between device replace and block group removal
Btrfs: fix race between readahead and device replace/removal
Pull Ceph fixes from Sage Weil:
"We have a few follow-up fixes for the libceph refactor from Ilya, and
then some cephfs + fscache fixes from Zheng.
The first two FS-Cache patches are acked by David Howells and deemed
trivial enough to go through our tree. The rest fix some issues with
the ceph fscache handling (disable cache for inodes opened for write,
and simplify the revalidation logic accordingly, dropping the
now-unnecessary work queue)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: use i_version to check validity of fscache
ceph: improve fscache revalidation
ceph: disable fscache when inode is opened for write
ceph: avoid unnecessary fscache invalidation/revlidation
ceph: call __fscache_uncache_page() if readpages fails
FS-Cache: make check_consistency callback return int
FS-Cache: wake write waiter after invalidating writes
libceph: use %s instead of %pE in dout()s
libceph: put request only if it's done in handle_reply()
libceph: change ceph_osdmap_flag() to take osdc
- Fix an incorrect check introduced by recent ACPICA changes which
causes problems with booting KVM guests to happen, among other
things (Lv Zheng).
- Fix a backlight issue introduced by recent changes to the ACPI
video driver (Aaron Lu).
- Fix the ACPI processor initialization which attempts to register
an IO region without checking if that really is necessary and
sometimes prevents drivers loaded subsequently from registering
their resources which leads to boot issues (Rafael Wysocki).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJXUfDtAAoJEILEb/54YlRx/MoP/3qNp0alfSUO0za5WbWoCruX
UmX8fUuSATOZ0nQdsi3ALqC51ZhD/1dxZtHLVlzUYPg7tKkb4Lf3MratcKex/IxT
xIxcvKeMOZL04j6xjU24DaMqEtHfcPTJgLQKDWo3Ek+sEYhNrXaOBMiDeuEmOZUX
VuDQbVGpV05VaouC4jHdNcuq2xNIHPvY8tcFf+11JvIXv+mJITRt6PgP7kLRqjqG
jJrUe5l/SjgMqqb+9BodfoHC3EQw9+dptlwCKjkhEccjvveB19n6ZUW9b/l4ZdQS
OSHNZdP/y402v6IocWvvyVF0AdMOaoBlecSKO5zmv1hyb9ewKPEoSWaqCCwaNvAf
GHgev/E6Uum3nE6hajXvFpQBsWc4toUnqoAwN6D7O4YlA+gmdHAtNrt5S+UhozMQ
0Yk7Nen/ko4Y8ba+Y4suM1La8u/UvNeTuxv0xJv2+r94SzaeGtTTLAR2Md/3uMTm
W2+m9hfFE74Vh9Itf6a5gtWZLDep3CjEhT0p6PDks82mU9GRp43flztpPWiMmeKP
KSTGNe9xwtpMyU+HmWoN5pKdB5WLkv1o7xO+jXYL2+3L1+PiTinm+8d2UZpdDAD2
YK3t6Z5HmjFTxR4TBenftVo7xcXZ+AerymeuJ+uf1wJbvgwZSzCxmJBIJF9qFdYA
xUeYFRhzLH3gUuyEWyaf
=WcEx
-----END PGP SIGNATURE-----
Merge tag 'acpi-4.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"Two fixes for problems introduced recently (ACPICA and the ACPI
backlight driver) and one fix for an older issue that prevents at
least one system from booting.
Specifics:
- Fix an incorrect check introduced by recent ACPICA changes which
causes problems with booting KVM guests to happen, among other
things (Lv Zheng).
- Fix a backlight issue introduced by recent changes to the ACPI
video driver (Aaron Lu).
- Fix the ACPI processor initialization which attempts to register an
IO region without checking if that really is necessary and
sometimes prevents drivers loaded subsequently from registering
their resources which leads to boot issues (Rafael Wysocki)"
* tag 'acpi-4.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / processor: Avoid reserving IO regions too early
ACPICA / Hardware: Fix old register check in acpi_hw_get_access_bit_width()
ACPI / Thermal / video: fix max_level incorrect value
- Fix a silly mistake related to the clamp_val() usage in
a function added by a recent commit (Rafael Wysocki).
- Reduce the log level of an annoying message added to intel_pstate
during the recent merge window (Srinivas Pandruvada).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJXUfCsAAoJEILEb/54YlRx7JAQAK1i3njQlTPQlPeikVcZQT4Y
/yVjQGbEVtAk8okMuD/qINXjuFmy85sjoNgf/JVuFKN/vuBtQQfkT/VcXHoKcgHV
DcoMgGxY6CDLZtGp7GNUlnrOzYURblnu0TUorMBEtd/j1YfG0VFvMf+w/k1vb/I2
daH4op2i434Ha6gw5VkS2WBn7CONvnJM1pbG4ptddAJlU+SgEcqA0w4ihLAGqDq6
JI9laFofZxuI8Vaj6pfriCawDGa5OAxUcSUAKBqyLbgtdTE4YImzvCU1i364tdqY
ssbIiA5jIFc61VDOGGOEJxUTUfh9SrgA1OrYnfTdkkOjpZTCmH8JOpoWJk5loa4R
YSIl6Mh6vvYFike1N0dgEld+RTj6L7lyPTj0Kw6BVT4ozS15dmBgQdezUJpI678q
W1KaZ/8c3p9oa3rjy7lIdDYgdW3IgDOfM5Gz6WALx9oePVmPU0p0OEwhmIY9QflK
yOEJhGB/FEvGfPj4dmDqVzsGeWKQ8Zb/k57km0S86oesENzsFzBucaNcA878HpSH
WvZhvZPcg9uMMyGJB6aAaY2nMm0b3kQrzzSRfwbki1WoaNr/ByukZ+uzfNZgBmEk
ox81cssbIHydJFGCfXTvN+TAhe00ahPIXGFl2wTWqgb6FwJvX+kaUPtVyZiCaTIO
BI4qQcWH+6Gxdh/XDF7r
=TK09
-----END PGP SIGNATURE-----
Merge tag 'pm-4.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"Two fixes for problems introduced recently in the cpufreq core and the
intel_pstate driver.
Specifics:
- Fix a silly mistake related to the clamp_val() usage in a function
added by a recent commit (Rafael Wysocki).
- Reduce the log level of an annoying message added to intel_pstate
during the recent merge window (Srinivas Pandruvada)"
* tag 'pm-4.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: Fix clamp_val() usage in cpufreq_driver_fast_switch()
cpufreq: intel_pstate: Downgrade print level for _PPC
Merge various fixes from Andrew Morton:
"10 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm, page_alloc: recalculate the preferred zoneref if the context can ignore memory policies
mm, page_alloc: reset zonelist iterator after resetting fair zone allocation policy
mm, oom_reaper: do not use siglock in try_oom_reaper()
mm, page_alloc: prevent infinite loop in buffered_rmqueue()
checkpatch: reduce git commit description style false positives
mm/z3fold.c: avoid modifying HEADLESS page and minor cleanup
memcg: add RCU locking around css_for_each_descendant_pre() in memcg_offline_kmem()
mm: check the return value of lookup_page_ext for all call sites
kdump: fix dmesg gdbmacro to work with record based printk
mm: fix overflow in vm_map_ram()
Pull irq fixes from Thomas Gleixner:
- a few simple fixes for fallout from the recent gic-v3 changes
- a workaround for a Cavium thunderX erratum
- a bugfix for the pic32 irqchip to make external interrupts work proper
- a missing return value in the generic IPI management code
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/irq-pic32-evic: Fix bug with external interrupts.
irqchip/gicv3-its: numa: Enable workaround for Cavium thunderx erratum 23144
irqchip/gic-v3: Fix quiescence check in gic_enable_redist
irqchip/gic-v3: Fix copy+paste mistakes in defines
irqchip/gic-v3: Fix ICC_SGI1R_EL1.INTID decoding mask
genirq: Fix missing return value in irq_destroy_ipi()
The optimistic fast path may use cpuset_current_mems_allowed instead of
of a NULL nodemask supplied by the caller for cpuset allocations. The
preferred zone is calculated on this basis for statistic purposes and as
a starting point in the zonelist iterator.
However, if the context can ignore memory policies due to being atomic
or being able to ignore watermarks then the starting point in the
zonelist iterator is no longer correct. This patch resets the zonelist
iterator in the allocator slowpath if the context can ignore memory
policies. This will alter the zone used for statistics but only after
it is known that it makes sense for that context. Resetting it before
entering the slowpath would potentially allow an ALLOC_CPUSET allocation
to be accounted for against the wrong zone. Note that while nodemask is
not explicitly set to the original nodemask, it would only have been
overwritten if cpuset_enabled() and it was reset before the slowpath was
entered.
Link: http://lkml.kernel.org/r/20160602103936.GU2527@techsingularity.net
Fixes: c33d6c06f6 ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Geert Uytterhoeven reported the following problem that bisected to
commit c33d6c06f6 ("mm, page_alloc: avoid looking up the first zone
in a zonelist twice") on m68k/ARAnyM
BUG: scheduling while atomic: cron/668/0x10c9a0c0
Modules linked in:
CPU: 0 PID: 668 Comm: cron Not tainted 4.6.0-atari-05133-gc33d6c06f60f710f #364
Call Trace: [<0003d7d0>] __schedule_bug+0x40/0x54
__schedule+0x312/0x388
__schedule+0x0/0x388
prepare_to_wait+0x0/0x52
schedule+0x64/0x82
schedule_timeout+0xda/0x104
set_next_entity+0x18/0x40
pick_next_task_fair+0x78/0xda
io_schedule_timeout+0x36/0x4a
bit_wait_io+0x0/0x40
bit_wait_io+0x12/0x40
__wait_on_bit+0x46/0x76
wait_on_page_bit_killable+0x64/0x6c
bit_wait_io+0x0/0x40
wake_bit_function+0x0/0x4e
__lock_page_or_retry+0xde/0x124
do_scan_async+0x114/0x17c
lookup_swap_cache+0x24/0x4e
handle_mm_fault+0x626/0x7de
find_vma+0x0/0x66
down_read+0x0/0xe
wait_on_page_bit_killable_timeout+0x77/0x7c
find_vma+0x16/0x66
do_page_fault+0xe6/0x23a
res_func+0xa3c/0x141a
buserr_c+0x190/0x6d4
res_func+0xa3c/0x141a
buserr+0x20/0x28
res_func+0xa3c/0x141a
buserr+0x20/0x28
The relationship is not obvious but it's due to a failure to rescan the
full zonelist after the fair zone allocation policy exhausts the batch
count. While this is a functional problem, it's also a performance
issue. A page allocator microbenchmark showed the following
4.7.0-rc1 4.7.0-rc1
vanilla reset-v1r2
Min alloc-odr0-1 327.00 ( 0.00%) 326.00 ( 0.31%)
Min alloc-odr0-2 235.00 ( 0.00%) 235.00 ( 0.00%)
Min alloc-odr0-4 198.00 ( 0.00%) 198.00 ( 0.00%)
Min alloc-odr0-8 170.00 ( 0.00%) 170.00 ( 0.00%)
Min alloc-odr0-16 156.00 ( 0.00%) 156.00 ( 0.00%)
Min alloc-odr0-32 150.00 ( 0.00%) 150.00 ( 0.00%)
Min alloc-odr0-64 146.00 ( 0.00%) 146.00 ( 0.00%)
Min alloc-odr0-128 145.00 ( 0.00%) 145.00 ( 0.00%)
Min alloc-odr0-256 155.00 ( 0.00%) 155.00 ( 0.00%)
Min alloc-odr0-512 168.00 ( 0.00%) 165.00 ( 1.79%)
Min alloc-odr0-1024 175.00 ( 0.00%) 174.00 ( 0.57%)
Min alloc-odr0-2048 180.00 ( 0.00%) 180.00 ( 0.00%)
Min alloc-odr0-4096 187.00 ( 0.00%) 186.00 ( 0.53%)
Min alloc-odr0-8192 190.00 ( 0.00%) 190.00 ( 0.00%)
Min alloc-odr0-16384 191.00 ( 0.00%) 191.00 ( 0.00%)
Min alloc-odr1-1 736.00 ( 0.00%) 445.00 ( 39.54%)
Min alloc-odr1-2 343.00 ( 0.00%) 335.00 ( 2.33%)
Min alloc-odr1-4 277.00 ( 0.00%) 270.00 ( 2.53%)
Min alloc-odr1-8 238.00 ( 0.00%) 233.00 ( 2.10%)
Min alloc-odr1-16 224.00 ( 0.00%) 218.00 ( 2.68%)
Min alloc-odr1-32 210.00 ( 0.00%) 208.00 ( 0.95%)
Min alloc-odr1-64 207.00 ( 0.00%) 203.00 ( 1.93%)
Min alloc-odr1-128 276.00 ( 0.00%) 202.00 ( 26.81%)
Min alloc-odr1-256 206.00 ( 0.00%) 202.00 ( 1.94%)
Min alloc-odr1-512 207.00 ( 0.00%) 202.00 ( 2.42%)
Min alloc-odr1-1024 208.00 ( 0.00%) 205.00 ( 1.44%)
Min alloc-odr1-2048 213.00 ( 0.00%) 212.00 ( 0.47%)
Min alloc-odr1-4096 218.00 ( 0.00%) 216.00 ( 0.92%)
Min alloc-odr1-8192 341.00 ( 0.00%) 219.00 ( 35.78%)
Note that order-0 allocations are unaffected but higher orders get a
small boost from this patch and a large reduction in system CPU usage
overall as can be seen here:
4.7.0-rc1 4.7.0-rc1
vanilla reset-v1r2
User 85.32 86.31
System 2221.39 2053.36
Elapsed 2368.89 2202.47
Fixes: c33d6c06f6 ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
Link: http://lkml.kernel.org/r/20160531100848.GR2527@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg has noted that siglock usage in try_oom_reaper is both pointless
and dangerous. signal_group_exit can be checked lockless. The problem
is that sighand becomes NULL in __exit_signal so we can crash.
Fixes: 3ef22dfff2 ("oom, oom_reaper: try to reap tasks which skip regular OOM killer path")
Link: http://lkml.kernel.org/r/1464679423-30218-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>