We need a stable value of memcg_nr_cache_ids in kmem_cache_create()
(memcg_alloc_cache_params() wants it for root caches), where we only
hold the slab_mutex and no memcg-related locks. As a result, we have to
update memcg_nr_cache_ids under the slab_mutex, which we can only take
on the slab's side (see memcg_update_array_size). This looks awkward
and will become even worse when per-memcg list_lru is introduced, which
also wants stable access to memcg_nr_cache_ids.
To get rid of this dependency between the memcg_nr_cache_ids and the
slab_mutex, this patch introduces a special rwsem. The rwsem is held
for writing during memcg_caches arrays relocation and memcg_nr_cache_ids
updates. Therefore one can take it for reading to get a stable access
to memcg_caches arrays and/or memcg_nr_cache_ids.
Currently the semaphore is taken for reading only from
kmem_cache_create, right before taking the slab_mutex, so right now
there's no much point in using rwsem instead of mutex. However, once
list_lru is made per-memcg it will allow list_lru initializations to
proceed concurrently.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memcg_limited_groups_array_size, which defines the size of memcg_caches
arrays, sounds rather cumbersome. Also it doesn't point anyhow that
it's related to kmem/caches stuff. So let's rename it to
memcg_nr_cache_ids. It's concise and points us directly to
memcg_cache_id.
Also, rename kmem_limited_groups to memcg_cache_ida.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds SHRINKER_MEMCG_AWARE flag. If a shrinker has this flag
set, it will be called per memory cgroup. The memory cgroup to scan
objects from is passed in shrink_control->memcg. If the memory cgroup
is NULL, a memcg aware shrinker is supposed to scan objects from the
global list. Unaware shrinkers are only called on global pressure with
memcg=NULL.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We are going to make FS shrinkers memcg-aware. To achieve that, we will
have to pass the memcg to scan to the nr_cached_objects and
free_cached_objects VFS methods, which currently take only the NUMA node
to scan. Since the shrink_control structure already holds the node, and
the memcg to scan will be added to it when we introduce memcg-aware
vmscan, let us consolidate the methods' arguments in this structure to
keep things clean.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kmem accounting of memcg is unusable now, because it lacks slab shrinker
support. That means when we hit the limit we will get ENOMEM w/o any
chance to recover. What we should do then is to call shrink_slab, which
would reclaim old inode/dentry caches from this cgroup. This is what
this patch set is intended to do.
Basically, it does two things. First, it introduces the notion of
per-memcg slab shrinker. A shrinker that wants to reclaim objects per
cgroup should mark itself as SHRINKER_MEMCG_AWARE. Then it will be
passed the memory cgroup to scan from in shrink_control->memcg. For
such shrinkers shrink_slab iterates over the whole cgroup subtree under
the target cgroup and calls the shrinker for each kmem-active memory
cgroup.
Secondly, this patch set makes the list_lru structure per-memcg. It's
done transparently to list_lru users - everything they have to do is to
tell list_lru_init that they want memcg-aware list_lru. Then the
list_lru will automatically distribute objects among per-memcg lists
basing on which cgroup the object is accounted to. This way to make FS
shrinkers (icache, dcache) memcg-aware we only need to make them use
memcg-aware list_lru, and this is what this patch set does.
As before, this patch set only enables per-memcg kmem reclaim when the
pressure goes from memory.limit, not from memory.kmem.limit. Handling
memory.kmem.limit is going to be tricky due to GFP_NOFS allocations, and
it is still unclear whether we will have this knob in the unified
hierarchy.
This patch (of 9):
NUMA aware slab shrinkers use the list_lru structure to distribute
objects coming from different NUMA nodes to different lists. Whenever
such a shrinker needs to count or scan objects from a particular node,
it issues commands like this:
count = list_lru_count_node(lru, sc->nid);
freed = list_lru_walk_node(lru, sc->nid, isolate_func,
isolate_arg, &sc->nr_to_scan);
where sc is an instance of the shrink_control structure passed to it
from vmscan.
To simplify this, let's add special list_lru functions to be used by
shrinkers, list_lru_shrink_count() and list_lru_shrink_walk(), which
consolidate the nid and nr_to_scan arguments in the shrink_control
structure.
This will also allow us to avoid patching shrinkers that use list_lru
when we make shrink_slab() per-memcg - all we will have to do is extend
the shrink_control structure to include the target memcg and make
list_lru_shrink_{count,walk} handle this appropriately.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Faults on the huge zero page are pointless and there is a BUG_ON to catch
them during fault time. This patch reintroduces a check that avoids
marking the zero page PAGE_NONE.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch removes the NUMA PTE bits and associated helpers. As a
side-effect it increases the maximum possible swap space on x86-64.
One potential source of problems is races between the marking of PTEs
PROT_NONE, NUMA hinting faults and migration. It must be guaranteed that
a PTE being protected is not faulted in parallel, seen as a pte_none and
corrupting memory. The base case is safe but transhuge has problems in
the past due to an different migration mechanism and a dependance on page
lock to serialise migrations and warrants a closer look.
task_work hinting update parallel fault
------------------------ --------------
change_pmd_range
change_huge_pmd
__pmd_trans_huge_lock
pmdp_get_and_clear
__handle_mm_fault
pmd_none
do_huge_pmd_anonymous_page
read? pmd_lock blocks until hinting complete, fail !pmd_none test
write? __do_huge_pmd_anonymous_page acquires pmd_lock, checks pmd_none
pmd_modify
set_pmd_at
task_work hinting update parallel migration
------------------------ ------------------
change_pmd_range
change_huge_pmd
__pmd_trans_huge_lock
pmdp_get_and_clear
__handle_mm_fault
do_huge_pmd_numa_page
migrate_misplaced_transhuge_page
pmd_lock waits for updates to complete, recheck pmd_same
pmd_modify
set_pmd_at
Both of those are safe and the case where a transhuge page is inserted
during a protection update is unchanged. The case where two processes try
migrating at the same time is unchanged by this series so should still be
ok. I could not find a case where we are accidentally depending on the
PTE not being cleared and flushed. If one is missed, it'll manifest as
corruption problems that start triggering shortly after this series is
merged and only happen when NUMA balancing is enabled.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert existing users of pte_numa and friends to the new helper. Note
that the kernel is broken after this patch is applied until the other page
table modifiers are also altered. This patch layout is to make review
easier.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Automatic NUMA balancing depends on being able to protect PTEs to trap a
fault and gather reference locality information. Very broadly speaking
it would mark PTEs as not present and use another bit to distinguish
between NUMA hinting faults and other types of faults. It was
universally loved by everybody and caused no problems whatsoever. That
last sentence might be a lie.
This series is very heavily based on patches from Linus and Aneesh to
replace the existing PTE/PMD NUMA helper functions with normal change
protections. I did alter and add parts of it but I consider them
relatively minor contributions. At their suggestion, acked-bys are in
there but I've no problem converting them to Signed-off-by if requested.
AFAIK, this has received no testing on ppc64 and I'm depending on Aneesh
for that. I tested trinity under kvm-tool and passed and ran a few
other basic tests. At the time of writing, only the short-lived tests
have completed but testing of V2 indicated that long-term testing had no
surprises. In most cases I'm leaving out detail as it's not that
interesting.
specjbb single JVM: There was negligible performance difference in the
benchmark itself for short runs. However, system activity is
higher and interrupts are much higher over time -- possibly TLB
flushes. Migrations are also higher. Overall, this is more overhead
but considering the problems faced with the old approach I think
we just have to suck it up and find another way of reducing the
overhead.
specjbb multi JVM: Negligible performance difference to the actual benchmark
but like the single JVM case, the system overhead is noticeably
higher. Again, interrupts are a major factor.
autonumabench: This was all over the place and about all that can be
reasonably concluded is that it's different but not necessarily
better or worse.
autonumabench
3.18.0-rc5 3.18.0-rc5
mmotm-20141119 protnone-v3r3
User NUMA01 32380.24 ( 0.00%) 21642.92 ( 33.16%)
User NUMA01_THEADLOCAL 22481.02 ( 0.00%) 22283.22 ( 0.88%)
User NUMA02 3137.00 ( 0.00%) 3116.54 ( 0.65%)
User NUMA02_SMT 1614.03 ( 0.00%) 1543.53 ( 4.37%)
System NUMA01 322.97 ( 0.00%) 1465.89 (-353.88%)
System NUMA01_THEADLOCAL 91.87 ( 0.00%) 49.32 ( 46.32%)
System NUMA02 37.83 ( 0.00%) 14.61 ( 61.38%)
System NUMA02_SMT 7.36 ( 0.00%) 7.45 ( -1.22%)
Elapsed NUMA01 716.63 ( 0.00%) 599.29 ( 16.37%)
Elapsed NUMA01_THEADLOCAL 553.98 ( 0.00%) 539.94 ( 2.53%)
Elapsed NUMA02 83.85 ( 0.00%) 83.04 ( 0.97%)
Elapsed NUMA02_SMT 86.57 ( 0.00%) 79.15 ( 8.57%)
CPU NUMA01 4563.00 ( 0.00%) 3855.00 ( 15.52%)
CPU NUMA01_THEADLOCAL 4074.00 ( 0.00%) 4136.00 ( -1.52%)
CPU NUMA02 3785.00 ( 0.00%) 3770.00 ( 0.40%)
CPU NUMA02_SMT 1872.00 ( 0.00%) 1959.00 ( -4.65%)
System CPU usage of NUMA01 is worse but it's an adverse workload on this
machine so I'm reluctant to conclude that it's a problem that matters. On
the other workloads that are sensible on this machine, system CPU usage is
great. Overall time to complete the benchmark is comparable
3.18.0-rc5 3.18.0-rc5
mmotm-20141119protnone-v3r3
User 59612.50 48586.44
System 460.22 1537.45
Elapsed 1442.20 1304.29
NUMA alloc hit 5075182 5743353
NUMA alloc miss 0 0
NUMA interleave hit 0 0
NUMA alloc local 5075174 5743339
NUMA base PTE updates 637061448 443106883
NUMA huge PMD updates 1243434 864747
NUMA page range updates 1273699656 885857347
NUMA hint faults 1658116 1214277
NUMA hint local faults 959487 754113
NUMA hint local percent 57 62
NUMA pages migrated 5467056 61676398
The NUMA pages migrated look terrible but when I looked at a graph of the
activity over time I see that the massive spike in migration activity was
during NUMA01. This correlates with high system CPU usage and could be
simply down to bad luck but any modifications that affect that workload
would be related to scan rates and migrations, not the protection
mechanism. For all other workloads, migration activity was comparable.
Overall, headline performance figures are comparable but the overhead is
higher, mostly in interrupts. To some extent, higher overhead from this
approach was anticipated but not to this degree. It's going to be
necessary to reduce this again with a separate series in the future. It's
still worth going ahead with this series though as it's likely to avoid
constant headaches with Xen and is probably easier to maintain.
This patch (of 10):
A transhuge NUMA hinting fault may find the page is migrating and should
wait until migration completes. The check is race-prone because the pmd
is deferenced outside of the page lock and while the race is tiny, it'll
be larger if the PMD is cleared while marking PMDs for hinting fault.
This patch closes the race.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
stacking ontop of blk-mq devices. This blk-mq support changes the
model request-based DM uses for cloning a request to relying on
calling blk_get_request() directly from the underlying blk-mq device.
Early consumer of this code is Intel's emerging NVMe hardware; thanks
to Keith Busch for working on, and pushing for, these changes.
- A few other small fixes and cleanups across other DM targets.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJU3NRnAAoJEMUj8QotnQNavG0H/3yogMcHvKg9H+w0WmUQdwhN
w99Wj3nkquAw2sm9yahKlAMBNY53iu/LHmC6/PaTpJetgdH7y1foTrRa0qjyeB2D
DgNr8mOzxSxzX6CX9V8JMwqzky9XoG2IOt/7FeQQOpMqp4T1M2zgvbZtpl0lK/f3
lNaNBFpl+47NbGssD/WbtfI4Yy3hX0u406yGmQN5DxRyGTWD2AFqpA76g2mp8vrp
wmw259gPr4oLhj3pDc0GkuiVn59ZR2Zp+2gs0jD5uKlDL84VP/nE+WNB+ny1Mnmt
cOg8Q+W6/OosL66MKBHNsF0QS6DXNo5UvsN9fHGa5IUJw7Tsa11ZEPKHZGEbQw4=
=RiN2
-----END PGP SIGNATURE-----
Merge tag 'dm-3.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper changes from Mike Snitzer:
- The most significant change this cycle is request-based DM now
supports stacking ontop of blk-mq devices. This blk-mq support
changes the model request-based DM uses for cloning a request to
relying on calling blk_get_request() directly from the underlying
blk-mq device.
An early consumer of this code is Intel's emerging NVMe hardware;
thanks to Keith Busch for working on, and pushing for, these changes.
- A few other small fixes and cleanups across other DM targets.
* tag 'dm-3.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: inherit QUEUE_FLAG_SG_GAPS flags from underlying queues
dm snapshot: remove unnecessary NULL checks before vfree() calls
dm mpath: simplify failure path of dm_multipath_init()
dm thin metadata: remove unused dm_pool_get_data_block_size()
dm ioctl: fix stale comment above dm_get_inactive_table()
dm crypt: update url in CONFIG_DM_CRYPT help text
dm bufio: fix time comparison to use time_after_eq()
dm: use time_in_range() and time_after()
dm raid: fix a couple integer overflows
dm table: train hybrid target type detection to select blk-mq if appropriate
dm: allocate requests in target when stacking on blk-mq devices
dm: prepare for allocating blk-mq clone requests in target
dm: submit stacked requests in irq enabled context
dm: split request structure out from dm_rq_target_io structure
dm: remove exports for request-based interfaces without external callers
Pull block driver changes from Jens Axboe:
"This contains:
- The 4k/partition fixes for brd from Boaz/Matthew.
- A few xen front/back block fixes from David Vrabel and Roger Pau
Monne.
- Floppy changes from Takashi, cleaning the device file creation.
- Switching libata to use the new blk-mq tagging policy, removing
code (and a suboptimal implementation) from libata. This will
throw you a merge conflict, since a bug in the original libata
tagging code was fixed since this code was branched. Trivial.
From Shaohua.
- Conversion of loop to blk-mq, from Ming Lei.
- Cleanup of the io_schedule() handling in bsg from Peter Zijlstra.
He claims it improves on unreadable code, which will cost him a
beer.
- Maintainer update or NDB, now handled by Markus Pargmann.
- NVMe:
- Optimization from me that avoids a kmalloc/kfree per IO for
smaller (<= 8KB) IO. This cuts about 1% of high IOPS CPU
overhead.
- Removal of (now) dead RCU code, a relic from before NVMe was
converted to blk-mq"
* 'for-3.20/drivers' of git://git.kernel.dk/linux-block:
xen-blkback: default to X86_32 ABI on x86
xen-blkfront: fix accounting of reqs when migrating
xen-blkback,xen-blkfront: add myself as maintainer
block: Simplify bsg complete all
floppy: Avoid manual call of device_create_file()
NVMe: avoid kmalloc/kfree for smaller IO
MAINTAINERS: Update NBD maintainer
libata: make sata_sil24 use fifo tag allocator
libata: move sas ata tag allocation to libata-scsi.c
libata: use blk taging
NVMe: within nvme_free_queues(), delete RCU sychro/deferred free
null_blk: suppress invalid partition info
brd: Request from fdisk 4k alignment
brd: Fix all partitions BUGs
axonram: Fix bug in direct_access
loop: add blk-mq.h include
block: loop: don't handle REQ_FUA explicitly
block: loop: introduce lo_discard() and lo_req_flush()
block: loop: say goodby to bio
block: loop: improve performance via blk-mq
Pull core block IO changes from Jens Axboe:
"This contains:
- A series from Christoph that cleans up and refactors various parts
of the REQ_BLOCK_PC handling. Contributions in that series from
Dongsu Park and Kent Overstreet as well.
- CFQ:
- A bug fix for cfq for realtime IO scheduling from Jeff Moyer.
- A stable patch fixing a potential crash in CFQ in OOM
situations. From Konstantin Khlebnikov.
- blk-mq:
- Add support for tag allocation policies, from Shaohua. This is
a prep patch enabling libata (and other SCSI parts) to use the
blk-mq tagging, instead of rolling their own.
- Various little tweaks from Keith and Mike, in preparation for
DM blk-mq support.
- Minor little fixes or tweaks from me.
- A double free error fix from Tony Battersby.
- The partition 4k issue fixes from Matthew and Boaz.
- Add support for zero+unprovision for blkdev_issue_zeroout() from
Martin"
* 'for-3.20/core' of git://git.kernel.dk/linux-block: (27 commits)
block: remove unused function blk_bio_map_sg
block: handle the null_mapped flag correctly in blk_rq_map_user_iov
blk-mq: fix double-free in error path
block: prevent request-to-request merging with gaps if not allowed
blk-mq: make blk_mq_run_queues() static
dm: fix multipath regression due to initializing wrong request
cfq-iosched: handle failure of cfq group allocation
block: Quiesce zeroout wrapper
block: rewrite and split __bio_copy_iov()
block: merge __bio_map_user_iov into bio_map_user_iov
block: merge __bio_map_kern into bio_map_kern
block: pass iov_iter to the BLOCK_PC mapping functions
block: add a helper to free bio bounce buffer pages
block: use blk_rq_map_user_iov to implement blk_rq_map_user
block: simplify bio_map_kern
block: mark blk-mq devices as stackable
block: keep established cmd_flags when cloning into a blk-mq request
block: add blk-mq support to blk_insert_cloned_request()
block: require blk_rq_prep_clone() be given an initialized clone request
blk-mq: add tag allocation policy
...
Pull backing device changes from Jens Axboe:
"This contains a cleanup of how the backing device is handled, in
preparation for a rework of the life time rules. In this part, the
most important change is to split the unrelated nommu mmap flags from
it, but also removing a backing_dev_info pointer from the
address_space (and inode), and a cleanup of other various minor bits.
Christoph did all the work here, I just fixed an oops with pages that
have a swap backing. Arnd fixed a missing export, and Oleg killed the
lustre backing_dev_info from staging. Last patch was from Al,
unexporting parts that are now no longer needed outside"
* 'for-3.20/bdi' of git://git.kernel.dk/linux-block:
Make super_blocks and sb_lock static
mtd: export new mtd_mmap_capabilities
fs: make inode_to_bdi() handle NULL inode
staging/lustre/llite: get rid of backing_dev_info
fs: remove default_backing_dev_info
fs: don't reassign dirty inodes to default_backing_dev_info
nfs: don't call bdi_unregister
ceph: remove call to bdi_unregister
fs: remove mapping->backing_dev_info
fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
nilfs2: set up s_bdi like the generic mount_bdev code
block_dev: get bdev inode bdi directly from the block device
block_dev: only write bdev inode on close
fs: introduce f_op->mmap_capabilities for nommu mmap support
fs: kill BDI_CAP_SWAP_BACKED
fs: deduplicate noop_backing_dev_info
This patch addresses the original PR_APTPL_BUF_LEN = 8k limitiation
for write-out of PR APTPL metadata that Martin has recently been
running into.
It changes core_scsi3_update_and_write_aptpl() to use vzalloc'ed
memory instead of kzalloc, and increases the default hardcoded
length to 256k.
It also adds logic in core_scsi3_update_and_write_aptpl() to double
the original length upon core_scsi3_update_aptpl_buf() failure, and
retries until the vzalloc'ed buffer is large enough to accommodate
the outgoing APTPL metadata.
Reported-by: Martin Svec <martin.svec@zoner.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
In case sendtargets response is larger than initiator MRDSL, we
send a partial sendtargets response (setting F=0, C=1, TTT!=0xffffffff),
accept a consecutive empty text message and send the rest of the payload.
In case we are done, we set F=1, C=0, TTT=0xffffffff.
We do that by storing the sendtargets response bytes done under
the session.
This patch also makes iscsit_find_cmd_from_itt public for isert.
(Re-add cmd->maxcmdsn_inc and clear in iscsit_build_text_rsp - nab)
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Pull nfsd updates from Bruce Fields:
"The main change is the pNFS block server support from Christoph, which
allows an NFS client connected to shared disk to do block IO to the
shared disk in place of NFS reads and writes. This also requires xfs
patches, which should arrive soon through the xfs tree, barring
unexpected problems. Support for other filesystems is also possible
if there's interest.
Thanks also to Chuck Lever for continuing work to get NFS/RDMA into
shape"
* 'for-3.20' of git://linux-nfs.org/~bfields/linux: (32 commits)
nfsd: default NFSv4.2 to on
nfsd: pNFS block layout driver
exportfs: add methods for block layout exports
nfsd: add trace events
nfsd: update documentation for pNFS support
nfsd: implement pNFS layout recalls
nfsd: implement pNFS operations
nfsd: make find_any_file available outside nfs4state.c
nfsd: make find/get/put file available outside nfs4state.c
nfsd: make lookup/alloc/unhash_stid available outside nfs4state.c
nfsd: add fh_fsid_match helper
nfsd: move nfsd_fh_match to nfsfh.h
fs: add FL_LAYOUT lease type
fs: track fl_owner for leases
nfs: add LAYOUT_TYPE_MAX enum value
nfsd: factor out a helper to decode nfstime4 values
sunrpc/lockd: fix references to the BKL
nfsd: fix year-2038 nfs4 state problem
svcrdma: Handle additional inline content
svcrdma: Move read list XDR round-up logic
...
This time with:
* Generic page-table framework for ARM IOMMUs using the LPAE page-table
format, ARM-SMMU and Renesas IPMMU make use of it already.
* Break out of the IO virtual address allocator from the Intel IOMMU so
that it can be used by other DMA-API implementations too. The first
user will be the ARM64 common DMA-API implementation for IOMMUs
* Device tree support for Renesas IPMMU
* Various fixes and cleanups all over the place
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJU3MJOAAoJECvwRC2XARrjopUP+wachFx8vb00M4hlnlwL6FCn
DyIFkA1n4wL0muPhjcBI+LViEXrSxjr2TYoJEaBg+fiByWWQ1Hefg+KPz331Lo1D
+uo7WiOa1AB3pfkQiUN9IN6xx+o6ivhb3UQPiL4FHjggB/qz+KVxMM9nx0j8o0fQ
D9q6HLFiOIsFkra3xZaSuDGvYUBpcwyfn8FP1HVfvLlg1uxIGDcUJX3qU5UBpj9q
al/lPZ4A7rp+JLApV6WyouPiyVOZKikb5x920KeRNBem7a9fNBdgf+x7QbKpNXa1
5MaT5MarwGe8lJE4wtjOqRtsllhia+A1rg/6JbROPrlGetRFiuIh2sCKLvwOCko/
IjBHSutpaRT1lFoAG0TAnXQlvHRG/58XxOlP3eF613X/p8/cezuUaTyTIwZam9X3
j2GWwbUcBiHTxlu7bQDPz6a7cTf4w6wEALzYl18QrAFv+2LqlCfOo/LSlpStmjrF
kRN8DYaohlTULvmFneSr8rfGsnp5yPgIPvdmqiSwTz/Ih7kYPgfLy6+v6IAHUqZj
0n9oGs8eMqVvSzM2qqmyA9WGuQZRyhNjj4iDwn/he5YMw2kqxUQYGMpLnSu0Oi48
n4PqodtVol64jKLwaHZwyU8u71iyjUC5K9TDot/I2wlSRcTELJhxGh6c1sfDLyrO
u/htIszgKCgFvVrQoEZB
=dwrA
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull IOMMU updates from Joerg Roedel:
"This time with:
- Generic page-table framework for ARM IOMMUs using the LPAE
page-table format, ARM-SMMU and Renesas IPMMU make use of it
already.
- Break out the IO virtual address allocator from the Intel IOMMU so
that it can be used by other DMA-API implementations too. The
first user will be the ARM64 common DMA-API implementation for
IOMMUs
- Device tree support for Renesas IPMMU
- Various fixes and cleanups all over the place"
* tag 'iommu-updates-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (36 commits)
iommu/amd: Convert non-returned local variable to boolean when relevant
iommu: Update my email address
iommu/amd: Use wait_event in put_pasid_state_wait
iommu/amd: Fix amd_iommu_free_device()
iommu/arm-smmu: Avoid build warning
iommu/fsl: Various cleanups
iommu/fsl: Use %pa to print phys_addr_t
iommu/omap: Print phys_addr_t using %pa
iommu: Make more drivers depend on COMPILE_TEST
iommu/ipmmu-vmsa: Fix IOMMU lookup when multiple IOMMUs are registered
iommu: Disable on !MMU builds
iommu/fsl: Remove unused fsl_of_pamu_ids[]
iommu/fsl: Fix section mismatch
iommu/ipmmu-vmsa: Use the ARM LPAE page table allocator
iommu: Fix trace_map() to report original iova and original size
iommu/arm-smmu: add support for iova_to_phys through ATS1PR
iopoll: Introduce memory-mapped IO polling macros
iommu/arm-smmu: don't touch the secure STLBIALL register
iommu/arm-smmu: make use of generic LPAE allocator
iommu: io-pgtable-arm: add non-secure quirk
...
- DT unittests for I2C probing and overlays from Pantelis Antoniou
- Remove DT unittest dependency on OF_DYNAMIC from Gaurav Minocha
- Add Tegra compatible strings missing for newer parts from Paul
Walmsley
- Various vendor prefix additions
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJU3CJVAAoJEMhvYp4jgsXieMgIAKlpr8gcMq/ORRRbVJ9jrL64
A0gPZZEBBVJ0BX7b6mvz15/6Zt70naoE23tMgaCQpR620ox9xFshmwhzHct9npiQ
KRode+9QhFRvA3Pc5LXhfD+bnyJ3Z4pWPrbY6sDDL9txqolpUhU4fz8Y3InwN5YB
GSD6NG3UKDmrTOvkR1j2WrCIkSeXYAEKtnuQlN/+eZXM6kzZYDcdskHv6o18mf4b
Ys6mwkfJdN3UZVQE8ZxUSi3wdC9U7mErNOZuc2rgL9Qb+q0RHtgE2GTI2Zxw0Sj1
BlCO1Fs0sYhOunZIazLJht7cenGbBMf+ed2DB4VLNiEmPhavqdv9wjNt9jOjh5k=
=Aviy
-----END PGP SIGNATURE-----
Merge tag 'devicetree-for-3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull DeviceTree changes from Rob Herring:
- DT unittests for I2C probing and overlays from Pantelis Antoniou
- Remove DT unittest dependency on OF_DYNAMIC from Gaurav Minocha
- Add Tegra compatible strings missing for newer parts from Paul
Walmsley
- Various vendor prefix additions
* tag 'devicetree-for-3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
of: Add vendor prefix for OmniVision Technologies
of: Use ovti for Omnivision
of: Add vendor prefix for Truly Semiconductors Limited
of: Add vendor prefix for Himax Technologies Inc.
of/fdt: fix sparse warning
of: unitest: Add I2C overlay unit tests.
Documentation: DT: document compatible string existence requirement
Documentation: DT bindings: add nvidia, tegra132-denver compatible string
Documentation: DT bindings: add more Tegra chip compatible strings
of: EXPORT_SYMBOL_GPL of_property_read_u64_array
of: Fix brace position for struct of_device_id definition
of/unittest: Remove obsolete code
dt-bindings: use isil prefix for Intersil in vendor-prefixes.txt
Add AD Holdings Plc. to vendor-prefixes.
dt-bindings: Add Silicon Mitus vendor prefix
Removes OF_UNITTEST dependency on OF_DYNAMIC config symbol
pinctrl: fix up device tree bindings
DT: Vendors: Add Everspin
doc: add bindings document for altera fpga manager
drivers: of: Export of_reserved_mem_device_{init,release}
Pull ARM updates from Russell King:
- clang assembly fixes from Ard
- optimisations and cleanups for Aurora L2 cache support
- efficient L2 cache support for secure monitor API on Exynos SoCs
- debug menu cleanup from Daniel Thompson to allow better behaviour for
multiplatform kernels
- StrongARM SA11x0 conversion to irq domains, and pxa_timer
- kprobes updates for older ARM CPUs
- move probes support out of arch/arm/kernel to arch/arm/probes
- add inline asm support for the rbit (reverse bits) instruction
- provide an ARM mode secondary CPU entry point (for Qualcomm CPUs)
- remove the unused ARMv3 user access code
- add driver_override support to AMBA Primecell bus
* 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm: (55 commits)
ARM: 8256/1: driver coamba: add device binding path 'driver_override'
ARM: 8301/1: qcom: Use secondary_startup_arm()
ARM: 8302/1: Add a secondary_startup that assumes ARM mode
ARM: 8300/1: teach __asmeq that r11 == fp and r12 == ip
ARM: kprobes: Fix compilation error caused by superfluous '*'
ARM: 8297/1: cache-l2x0: optimize aurora range operations
ARM: 8296/1: cache-l2x0: clean up aurora cache handling
ARM: 8284/1: sa1100: clear RCSR_SMR on resume
ARM: 8283/1: sa1100: collie: clear PWER register on machine init
ARM: 8282/1: sa1100: use handle_domain_irq
ARM: 8281/1: sa1100: move GPIO-related IRQ code to gpio driver
ARM: 8280/1: sa1100: switch to irq_domain_add_simple()
ARM: 8279/1: sa1100: merge both GPIO irqdomains
ARM: 8278/1: sa1100: split irq handling for low GPIOs
ARM: 8291/1: replace magic number with PAGE_SHIFT macro in fixup_pv code
ARM: 8290/1: decompressor: fix a wrong comment
ARM: 8286/1: mm: Fix dma_contiguous_reserve comment
ARM: 8248/1: pm: remove outdated comment
ARM: 8274/1: Fix DEBUG_LL for multi-platform kernels (without PL01X)
ARM: 8273/1: Seperate DEBUG_UART_PHYS from DEBUG_LL on EP93XX
...
o Several clean ups to the code
One such clean up was to convert to 64 bit time keeping, in the
ring buffer benchmark code.
o Adding of __print_array() helper macro for TRACE_EVENT()
o Updating the sample/trace_events/ to add samples of different ways to
make trace events. Lots of features have been added since the sample
code was made, and these features are mostly unknown. Developers
have been making their own hacks to do things that are already available.
o Performance improvements. Most notably, I found a performance bug where
a waiter that is waiting for a full page from the ring buffer will
see that a full page is not available, and go to sleep. The sched
event caused by it going to sleep would cause it to wake up again.
It would see that there was still not a full page, and go back to sleep
again, and that would wake it up again, until finally it would see a
full page. This change has been marked for stable.
Other improvements include removing global locks from fast paths.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJU3M+GAAoJEEjnJuOKh9ldpWQIAJTUzeVXlU0cf3bVn768VW7e
XS41WHF34l1tNevmKTh6fCPiw8+U0UMGLQt5WKtyaaARsZn2MlefLVuvHPKFlK2w
+qcI4OEVHH97Qgf9HWJSsYgnZaOnOE+TENqnokEgXMimRMuVcd/S4QaGxwJVDcjm
iBF5j2TaG4aGbx4a3J7KueoZ3K+39r3ut15hIGi/IZBZldQ1pt26ytafD/KA3CU3
BLRM2HLttAMsV1ds0EDLgZjSGICVetFcdOmI5Gwj7Qr3KrOTRPYJMNc8NdDL7Js9
v8VhujhFGvcCrhO/IKpVvd9yluz3RCF+Z7ihc+D/+1B3Nsm0PTwN3Fl5J+f89AA=
=u2Mm
-----END PGP SIGNATURE-----
Merge tag 'trace-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"The updates included in this pull request for ftrace are:
o Several clean ups to the code
One such clean up was to convert to 64 bit time keeping, in the
ring buffer benchmark code.
o Adding of __print_array() helper macro for TRACE_EVENT()
o Updating the sample/trace_events/ to add samples of different ways
to make trace events. Lots of features have been added since the
sample code was made, and these features are mostly unknown.
Developers have been making their own hacks to do things that are
already available.
o Performance improvements. Most notably, I found a performance bug
where a waiter that is waiting for a full page from the ring buffer
will see that a full page is not available, and go to sleep. The
sched event caused by it going to sleep would cause it to wake up
again. It would see that there was still not a full page, and go
back to sleep again, and that would wake it up again, until finally
it would see a full page. This change has been marked for stable.
Other improvements include removing global locks from fast paths"
* tag 'trace-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ring-buffer: Do not wake up a splice waiter when page is not full
tracing: Fix unmapping loop in tracing_mark_write
tracing: Add samples of DECLARE_EVENT_CLASS() and DEFINE_EVENT()
tracing: Add TRACE_EVENT_FN example
tracing: Add TRACE_EVENT_CONDITION sample
tracing: Update the TRACE_EVENT fields available in the sample code
tracing: Separate out initializing top level dir from instances
tracing: Make tracing_init_dentry_tr() static
trace: Use 64-bit timekeeping
tracing: Add array printing helper
tracing: Remove newline from trace_printk warning banner
tracing: Use IS_ERR() check for return value of tracing_init_dentry()
tracing: Remove unneeded includes of debugfs.h and fs.h
tracing: Remove taking of trace_types_lock in pipe files
tracing: Add ref count to tracer for when they are being read by pipe
The definition of rpc_count_iostats_metrics() is borked.
Reported by: Jim Davis <jim.epost@gmail.com>
Fixes: d67ae825a5 ("pnfs/flexfiles: Add the FlexFile Layout Driver")
Cc: Tom Haynes <thomas.haynes@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Userland code may be built using an ABI which permits linking to objects
that have more restrictive floating point requirements. For example,
userland code may be built to target the O32 FPXX ABI. Such code may be
linked with other FPXX code, or code built for either one of the more
restrictive FP32 or FP64. When linking with more restrictive code, the
overall requirement of the process becomes that of the more restrictive
code. The kernel has no way to know in advance which mode the process
will need to be executed in, and indeed it may need to change during
execution. The dynamic loader is the only code which will know the
overall required mode, and so it needs to have a means to instruct the
kernel to switch the FP mode of the process.
This patch introduces 2 new options to the prctl syscall which provide
such a capability. The FP mode of the process is represented as a
simple bitmask combining a number of mode bits mirroring those present
in the hardware. Userland can either retrieve the current FP mode of
the process:
mode = prctl(PR_GET_FP_MODE);
or modify the current FP mode of the process:
err = prctl(PR_SET_FP_MODE, new_mode);
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: Matthew Fortune <matthew.fortune@imgtec.com>
Cc: Markos Chandras <markos.chandras@imgtec.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/8899/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Pull security layer updates from James Morris:
"Highlights:
- Smack adds secmark support for Netfilter
- /proc/keys is now mandatory if CONFIG_KEYS=y
- TPM gets its own device class
- Added TPM 2.0 support
- Smack file hook rework (all Smack users should review this!)"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (64 commits)
cipso: don't use IPCB() to locate the CIPSO IP option
SELinux: fix error code in policydb_init()
selinux: add security in-core xattr support for pstore and debugfs
selinux: quiet the filesystem labeling behavior message
selinux: Remove unused function avc_sidcmp()
ima: /proc/keys is now mandatory
Smack: Repair netfilter dependency
X.509: silence asn1 compiler debug output
X.509: shut up about included cert for silent build
KEYS: Make /proc/keys unconditional if CONFIG_KEYS=y
MAINTAINERS: email update
tpm/tpm_tis: Add missing ifdef CONFIG_ACPI for pnp_acpi_device
smack: fix possible use after frees in task_security() callers
smack: Add missing logging in bidirectional UDS connect check
Smack: secmark support for netfilter
Smack: Rework file hooks
tpm: fix format string error in tpm-chip.c
char/tpm/tpm_crb: fix build error
smack: Fix a bidirectional UDS connect check typo
smack: introduce a special case for tmpfs in smack_d_instantiate()
...
Pull audit fix from Paul Moore:
"Just one patch from the audit tree for v3.20, and a very minor one at
that.
The patch simply removes an old, unused field from the audit_krule
structure, a private audit-only struct. In audit related news, we did
a proper overhaul of the audit pathname code and removed the nasty
getname()/putname() hacks for audit, you should see those patches in
Al's vfs tree if you haven't already.
That's it for audit this time, let's hope for a quiet -rcX series"
* 'upstream' of git://git.infradead.org/users/pcmoore/audit:
audit: remove vestiges of vers_ops
Merge second set of updates from Andrew Morton:
"More of MM"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (83 commits)
mm/nommu.c: fix arithmetic overflow in __vm_enough_memory()
mm/mmap.c: fix arithmetic overflow in __vm_enough_memory()
vmstat: Reduce time interval to stat update on idle cpu
mm/page_owner.c: remove unnecessary stack_trace field
Documentation/filesystems/proc.txt: describe /proc/<pid>/map_files
mm: incorporate read-only pages into transparent huge pages
vmstat: do not use deferrable delayed work for vmstat_update
mm: more aggressive page stealing for UNMOVABLE allocations
mm: always steal split buddies in fallback allocations
mm: when stealing freepages, also take pages created by splitting buddy page
mincore: apply page table walker on do_mincore()
mm: /proc/pid/clear_refs: avoid split_huge_page()
mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP)
mempolicy: apply page table walker on queue_pages_range()
arch/powerpc/mm/subpage-prot.c: use walk->vma and walk_page_vma()
memcg: cleanup preparation for page table walk
numa_maps: remove numa_maps->vma
numa_maps: fix typo in gather_hugetbl_stats
pagemap: use walk->vma instead of calling find_vma()
clear_refs: remove clear_refs_private->vma and introduce clear_refs_test_walk()
...
Including:
- Update of all defconfigs
- Addition of a bunch of config options to modernise our defconfigs
- Some PS3 updates from Geoff
- Optimised memcmp for 64 bit from Anton
- Fix for kprobes that allows 'perf probe' to work from Naveen
- Several cxl updates from Ian & Ryan
- Expanded support for the '24x7' PMU from Cody & Sukadev
- Freescale updates from Scott:
"Highlights include 8xx optimizations, some more work on datapath device
tree content, e300 machine check support, t1040 corenet error reporting,
and various cleanups and fixes."
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJU2/LSAAoJEFHr6jzI4aWATDAQAKPU6v2Mq0sLnGst69waHU/Q
vvpIq9hqVeSr6znHhrnazc3iQTLk0acqIdxUl/dT+5ADhi9+FxGD5Ckk+BH1DDve
g6mQelSMlVZF9hKonHsbr4iUuTUyZyx2vj2qjdgOaRiv9Xubq6vUFNeolq3AeHxv
J33vqRTmowj3VJ52u+V1dmzXQGfUye7DG2jHpjXoBieZsroTvyuYm5GoIPblWFO6
zbYRh6IitALnQRtXfwIManPyWMkJti9JX8PwDkmvacr+V+MXbrksHpIOITMhNlo1
WsVnFMpxuk80XuUfhaKZgISgBSfCqBckvKDn2QwztF2/kBnV6Su5xiOKVgouzM6B
myy+maiMZlNJlNjqdMK5v2bqMXICP048zgfMbDN2e1K25jSSlRawt0RngoCQO2EP
7aWmEDAlL3shgzkl68pj1fevQokxC/40C1yExIgAa9C31+bjtMz4Xb1SfN1SSveW
7uWEY/eG9eLsrSE1CeBDvh6B8BRdyuIHgPhux4Tgc/bUtBGFQ29NuXwKh3QCeEy9
9wWrRGx3U69eP06Ey7P5js3jPTQs80bjJewyGaiPQF5XHB89To8Dg8VfXjEV49Dx
Pa3OLL5QsQloKfEBiEhQeGfKYImC00pVYAxc0qpmnr9T+25Ri1TLdF1EBAwriSYE
5p9kSW+ZIht0lvzsdPNm
=xDU3
-----END PGP SIGNATURE-----
Merge tag 'powerpc-3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux
Pull powerpc updates from Michael Ellerman:
- Update of all defconfigs
- Addition of a bunch of config options to modernise our defconfigs
- Some PS3 updates from Geoff
- Optimised memcmp for 64 bit from Anton
- Fix for kprobes that allows 'perf probe' to work from Naveen
- Several cxl updates from Ian & Ryan
- Expanded support for the '24x7' PMU from Cody & Sukadev
- Freescale updates from Scott:
"Highlights include 8xx optimizations, some more work on datapath
device tree content, e300 machine check support, t1040 corenet
error reporting, and various cleanups and fixes"
* tag 'powerpc-3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (102 commits)
cxl: Add missing return statement after handling AFU errror
cxl: Fail AFU initialisation if an invalid configuration record is found
cxl: Export optional AFU configuration record in sysfs
powerpc/mm: Warn on flushing tlb page in kernel context
powerpc/powernv: Add OPAL soft-poweroff routine
powerpc/perf/hv-24x7: Document sysfs event description entries
powerpc/perf/hv-gpci: add the remaining gpci requests
powerpc/perf/{hv-gpci, hv-common}: generate requests with counters annotated
powerpc/perf/hv-24x7: parse catalog and populate sysfs with events
perf: define EVENT_DEFINE_RANGE_FORMAT_LITE helper
perf: add PMU_EVENT_ATTR_STRING() helper
perf: provide sysfs_show for struct perf_pmu_events_attr
powerpc/kernel: Avoid initializing device-tree pointer twice
powerpc: Remove old compile time disabled syscall tracing code
powerpc/kernel: Make syscall_exit a local label
cxl: Fix device_node reference counting
powerpc/mm: bail out early when flushing TLB page
powerpc: defconfigs: add MTD_SPI_NOR (new dependency for M25P80)
perf/powerpc: reset event hw state when adding it to the PMU
powerpc/qe: Use strlcpy()
...
- reimplementation of the virtual remapping of UEFI Runtime Services in
a way that is stable across kexec
- emulation of the "setend" instruction for 32-bit tasks (user
endianness switching trapped in the kernel, SCTLR_EL1.E0E bit set
accordingly)
- compat_sys_call_table implemented in C (from asm) and made it a
constant array together with sys_call_table
- export CPU cache information via /sys (like other architectures)
- DMA API implementation clean-up in preparation for IOMMU support
- macros clean-up for KVM
- dropped some unnecessary cache+tlb maintenance
- CONFIG_ARM64_CPU_SUSPEND clean-up
- defconfig update (CPU_IDLE)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJU25v3AAoJEGvWsS0AyF7xYjcP/j8ESvs+z0BPgeJ6XREfOnCh
cp+w/1rJ5BafJ5RRkibrciwTNOIJS4FGMivWyURtoh430lS0Rh7fxZ3Ouna3xjrT
Nf7AxenWoA8Lo6wHh+FlNUeGk3iWfX6WwA2tYrbKudK+LBJ1wHjwpE7cWQO0FgwJ
aFDahu+QD5/u45p/VcVctMtiEDvOxBdO8gfat6r+YkLm7pbRxQkZnpA/JE4Gps1p
Td5jvMNH9pXI5pffSbeR9Q+vs/r0yqKLXQg01Eb2bZgGDgwf9yzADrHuaKamZt35
X5flmLiTGC6swJCJvUkZC1Nuue33bXcvW5+vgvar+MNGyXsxv+B/wARLqGhiWhQZ
nLGwFpuNu6wdY9tGHb/XR8khcewkw1/lRH1hHKhchrmRyUqHvXcPgC5tamjLrY8C
BV3BAeQvRho8OKwWUmbXIlyON1vPux6CJdj4D/A5NL+qph2WHeVWJCXg6nVFx0Wc
Eb3bXbI4QRwTFL7pGRF8RyZJBAQtgYhQMKWMW2GHgUgn+r1EixG73BZoSwvpHrrw
FOR9AVNfVBqmNON8xiIb3DN4EViq76EF0jrsZh5I9EoWS2w5qtk60kJQgXE+M4EE
vOlmh3dhEVfCN2SxOn0bgoQmTulyjqGauTSSJKQbIBuinPFveukrJfGNFIWt0SZs
f38FBMo6sgU4VG85B+Fr
=X5x/
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"arm64 updates for 3.20:
- reimplementation of the virtual remapping of UEFI Runtime Services
in a way that is stable across kexec
- emulation of the "setend" instruction for 32-bit tasks (user
endianness switching trapped in the kernel, SCTLR_EL1.E0E bit set
accordingly)
- compat_sys_call_table implemented in C (from asm) and made it a
constant array together with sys_call_table
- export CPU cache information via /sys (like other architectures)
- DMA API implementation clean-up in preparation for IOMMU support
- macros clean-up for KVM
- dropped some unnecessary cache+tlb maintenance
- CONFIG_ARM64_CPU_SUSPEND clean-up
- defconfig update (CPU_IDLE)
The EFI changes going via the arm64 tree have been acked by Matt
Fleming. There is also a patch adding sys_*stat64 prototypes to
include/linux/syscalls.h, acked by Andrew Morton"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (47 commits)
arm64: compat: Remove incorrect comment in compat_siginfo
arm64: Fix section mismatch on alloc_init_p[mu]d()
arm64: Avoid breakage caused by .altmacro in fpsimd save/restore macros
arm64: mm: use *_sect to check for section maps
arm64: drop unnecessary cache+tlb maintenance
arm64:mm: free the useless initial page table
arm64: Enable CPU_IDLE in defconfig
arm64: kernel: remove ARM64_CPU_SUSPEND config option
arm64: make sys_call_table const
arm64: Remove asm/syscalls.h
arm64: Implement the compat_sys_call_table in C
syscalls: Declare sys_*stat64 prototypes if __ARCH_WANT_(COMPAT_)STAT64
compat: Declare compat_sys_sigpending and compat_sys_sigprocmask prototypes
arm64: uapi: expose our struct ucontext to the uapi headers
smp, ARM64: Kill SMP single function call interrupt
arm64: Emulate SETEND for AArch32 tasks
arm64: Consolidate hotplug notifier for instruction emulation
arm64: Track system support for mixed endian EL0
arm64: implement generic IOMMU configuration
arm64: Combine coherent and non-coherent swiotlb dma_ops
...
Pull s390 updates from Martin Schwidefsky:
- The remaining patches for the z13 machine support: kernel build
option for z13, the cache synonym avoidance, SMT support,
compare-and-delay for spinloops and the CES5S crypto adapater.
- The ftrace support for function tracing with the gcc hotpatch option.
This touches common code Makefiles, Steven is ok with the changes.
- The hypfs file system gets an extension to access diagnose 0x0c data
in user space for performance analysis for Linux running under z/VM.
- The iucv hvc console gets wildcard spport for the user id filtering.
- The cacheinfo code is converted to use the generic infrastructure.
- Cleanup and bug fixes.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (42 commits)
s390/process: free vx save area when releasing tasks
s390/hypfs: Eliminate hypfs interval
s390/hypfs: Add diagnose 0c support
s390/cacheinfo: don't use smp_processor_id() in preemptible context
s390/zcrypt: fixed domain scanning problem (again)
s390/smp: increase maximum value of NR_CPUS to 512
s390/jump label: use different nop instruction
s390/jump label: add sanity checks
s390/mm: correct missing space when reporting user process faults
s390/dasd: cleanup profiling
s390/dasd: add locking for global_profile access
s390/ftrace: hotpatch support for function tracing
ftrace: let notrace function attribute disable hotpatching if necessary
ftrace: allow architectures to specify ftrace compile options
s390: reintroduce diag 44 calls for cpu_relax()
s390/zcrypt: Add support for new crypto express (CEX5S) adapter.
s390/zcrypt: Number of supported ap domains is not retrievable.
s390/spinlock: add compare-and-delay to lock wait loops
s390/tape: remove redundant if statement
s390/hvc_iucv: add simple wildcard matches to the iucv allow filter
...
Highlights incluse:
Features:
- Removing the forced serialisation of open()/close() calls in NFSv4.x (x>0)
makes for a significant performance improvement in metadata intensive
workloads.
- Full support for the pNFS "flexible files" layout type
- Further RPC/RDMA client improvements from Chuck
Bugfixes:
- Stable fix: NFSv4.1 backchannel calls blocking operations with !TASK_RUNNING
- Stable fix: pnfs_generic_pg_init_read/write can be called with lseg == NULL
- Stable fix: Fix an Oopsable condition when nsm_mon_unmon is called as part
of the namespace cleanup,
- Stable fix: Ensure we reference the inode for return-on-close in delegreturn
- Use SO_REUSEPORT to ensure that NFSv3 TCP connections can rebind to the
same source address/port combination during a disconnect/reconnect event.
This is a requirement imposed by most NFSv3 server duplicate reply cache
implementations.
Optimisations:
- Ask for no NFSv4.1 delegations on OPEN if using O_DIRECT
Other:
- Add Anna Schumaker as co-maintainer for the NFS client
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJU2swgAAoJEGcL54qWCgDyCWoP/1bxN8PesqaiwsBm3fsEqcra
WZtMirDIpJYpHwgysdv9t5otBQrb7GrLlNyGZ9NBOVNakifoyj2tHe+/XGDx7Qny
iYxXam0QdyjLU+bi4QoG4bdFncwQ/NmC6fqoG0rc25Il96Oggnc6LeSwL6Koc3CD
QitRLLi/PaU5qtuaV80+tYMJiqZbpBdVjB+xfSpu7rhyWzm1QNdEeQYor5CozzMi
6cRJuvHgjoZ1xriCWdxQHjqOiEaKNLwfm3uZ3XVaaUAIDhStXugdhIihj3J6Wi7k
MKNuY+AKJiy3yOdFfhYALyq+TPundDbYoM9x1foigjgP8zxXVfIU3VS6l33TSlzX
zH+/lcnXAHFWjFYoAijG1gv1H+OYcTuDlKaYAShQ/cOkTfWFrmlWv+pZs3SSkmPY
4Aeu97YYOkB5ZZ7wTWKksQMeAu/LYNRSA3h+ANvEIR+NLlTSQTcaChlvBmS0IY5D
qMmko1Xgmsxv+B8UeIY7PLfGBGrUdFho1JiDTfL8Xk7fGOfM7iBtMeaMAfdyOSUq
AMqH9EDUUOWaFDggO2iisLtMCY6kJ0iFGKRTwzR38jAqm3bjWaIDitUqshNrNbC+
mbwvAVxn0IFSCJGFsVd3kD2rTLGDElZ25GLFW9JMalarE6nlLG7e4p65g209Q9bT
HYKiyinJJM2Zji07kmG/
=c47U
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights incluse:
Features:
- Removing the forced serialisation of open()/close() calls in
NFSv4.x (x>0) makes for a significant performance improvement in
metadata intensive workloads.
- Full support for the pNFS "flexible files" layout type
- Further RPC/RDMA client improvements from Chuck
Bugfixes:
- Stable fix: NFSv4.1 backchannel calls blocking operations with !TASK_RUNNING
- Stable fix: pnfs_generic_pg_init_read/write can be called with lseg == NULL
- Stable fix: Fix an Oopsable condition when nsm_mon_unmon is called
as part of the namespace cleanup,
- Stable fix: Ensure we reference the inode for return-on-close in
delegreturn
- Use SO_REUSEPORT to ensure that NFSv3 TCP connections can rebind to
the same source address/port combination during a disconnect/
reconnect event. This is a requirement imposed by most NFSv3
server duplicate reply cache implementations.
Optimisations:
- Ask for no NFSv4.1 delegations on OPEN if using O_DIRECT
Other:
- Add Anna Schumaker as co-maintainer for the NFS client"
* tag 'nfs-for-3.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (119 commits)
SUNRPC: Cleanup to remove xs_tcp_close()
pnfs: delete an unintended goto
pnfs/flexfiles: Do not dprintk after the free
SUNRPC: Fix stupid typo in xs_sock_set_reuseport
SUNRPC: Define xs_tcp_fin_timeout only if CONFIG_SUNRPC_DEBUG
SUNRPC: Handle connection reset more efficiently.
SUNRPC: Remove the redundant XPRT_CONNECTION_CLOSE flag
SUNRPC: Make xs_tcp_close() do a socket shutdown rather than a sock_release
SUNRPC: Ensure xs_tcp_shutdown() requests a full close of the connection
SUNRPC: Cleanup to remove remaining uses of XPRT_CONNECTION_ABORT
SUNRPC: Remove TCP socket linger code
SUNRPC: Remove TCP client connection reset hack
SUNRPC: TCP/UDP always close the old socket before reconnecting
SUNRPC: Add helpers to prevent socket create from racing
SUNRPC: Ensure xs_reset_transport() resets the close connection flags
SUNRPC: Do not clear the source port in xs_reset_transport
SUNRPC: Handle EADDRINUSE on connect
SUNRPC: Set SO_REUSEPORT socket option for TCP connections
NFSv4.1: Fix pnfs_put_lseg races
NFSv4.1: pnfs_send_layoutreturn should use GFP_NOFS
...
Page owner uses the page_ext structure to keep meta-information for every
page in the system. The structure also contains a field of type 'struct
stack_trace', page owner uses this field during invocation of the function
save_stack_trace. It is easy to notice that keeping a copy of this
structure for every page in the system is very inefficiently in terms of
memory.
The patch removes this unnecessary field of page_ext and forces page owner
to use a stack_trace structure allocated on the stack.
[akpm@linux-foundation.org: use struct initializers]
Signed-off-by: Sergei Rogachev <rogachevsergei@gmail.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When studying page stealing, I noticed some weird looking decisions in
try_to_steal_freepages(). The first I assume is a bug (Patch 1), the
following two patches were driven by evaluation.
Testing was done with stress-highalloc of mmtests, using the
mm_page_alloc_extfrag tracepoint and postprocessing to get counts of how
often page stealing occurs for individual migratetypes, and what
migratetypes are used for fallbacks. Arguably, the worst case of page
stealing is when UNMOVABLE allocation steals from MOVABLE pageblock.
RECLAIMABLE allocation stealing from MOVABLE allocation is also not ideal,
so the goal is to minimize these two cases.
The evaluation of v2 wasn't always clear win and Joonsoo questioned the
results. Here I used different baseline which includes RFC compaction
improvements from [1]. I found that the compaction improvements reduce
variability of stress-highalloc, so there's less noise in the data.
First, let's look at stress-highalloc configured to do sync compaction,
and how these patches reduce page stealing events during the test. First
column is after fresh reboot, other two are reiterations of test without
reboot. That was all accumulater over 5 re-iterations (so the benchmark
was run 5x3 times with 5 fresh restarts).
Baseline:
3.19-rc4 3.19-rc4 3.19-rc4
5-nothp-1 5-nothp-2 5-nothp-3
Page alloc extfrag event 10264225 8702233 10244125
Extfrag fragmenting 10263271 8701552 10243473
Extfrag fragmenting for unmovable 13595 17616 15960
Extfrag fragmenting unmovable placed with movable 7989 12193 8447
Extfrag fragmenting for reclaimable 658 1840 1817
Extfrag fragmenting reclaimable placed with movable 558 1677 1679
Extfrag fragmenting for movable 10249018 8682096 10225696
With Patch 1:
3.19-rc4 3.19-rc4 3.19-rc4
6-nothp-1 6-nothp-2 6-nothp-3
Page alloc extfrag event 11834954 9877523 9774860
Extfrag fragmenting 11833993 9876880 9774245
Extfrag fragmenting for unmovable 7342 16129 11712
Extfrag fragmenting unmovable placed with movable 4191 10547 6270
Extfrag fragmenting for reclaimable 373 1130 923
Extfrag fragmenting reclaimable placed with movable 302 906 738
Extfrag fragmenting for movable 11826278 9859621 9761610
With Patch 2:
3.19-rc4 3.19-rc4 3.19-rc4
7-nothp-1 7-nothp-2 7-nothp-3
Page alloc extfrag event 4725990 3668793 3807436
Extfrag fragmenting 4725104 3668252 3806898
Extfrag fragmenting for unmovable 6678 7974 7281
Extfrag fragmenting unmovable placed with movable 2051 3829 4017
Extfrag fragmenting for reclaimable 429 1208 1278
Extfrag fragmenting reclaimable placed with movable 369 976 1034
Extfrag fragmenting for movable 4717997 3659070 3798339
With Patch 3:
3.19-rc4 3.19-rc4 3.19-rc4
8-nothp-1 8-nothp-2 8-nothp-3
Page alloc extfrag event 5016183 4700142 3850633
Extfrag fragmenting 5015325 4699613 3850072
Extfrag fragmenting for unmovable 1312 3154 3088
Extfrag fragmenting unmovable placed with movable 1115 2777 2714
Extfrag fragmenting for reclaimable 437 1193 1097
Extfrag fragmenting reclaimable placed with movable 330 969 879
Extfrag fragmenting for movable 5013576 4695266 3845887
In v2 we've seen apparent regression with Patch 1 for unmovable events,
this is now gone, suggesting it was indeed noise. Here, each patch
improves the situation for unmovable events. Reclaimable is improved by
patch 1 and then either the same modulo noise, or perhaps sligtly worse -
a small price for unmovable improvements, IMHO. The number of movable
allocations falling back to other migratetypes is most noisy, but it's
reduced to half at Patch 2 nevertheless. These are least critical as
compaction can move them around.
If we look at success rates, the patches don't affect them, that didn't change.
Baseline:
3.19-rc4 3.19-rc4 3.19-rc4
5-nothp-1 5-nothp-2 5-nothp-3
Success 1 Min 49.00 ( 0.00%) 42.00 ( 14.29%) 41.00 ( 16.33%)
Success 1 Mean 51.00 ( 0.00%) 45.00 ( 11.76%) 42.60 ( 16.47%)
Success 1 Max 55.00 ( 0.00%) 51.00 ( 7.27%) 46.00 ( 16.36%)
Success 2 Min 53.00 ( 0.00%) 47.00 ( 11.32%) 44.00 ( 16.98%)
Success 2 Mean 59.60 ( 0.00%) 50.80 ( 14.77%) 48.20 ( 19.13%)
Success 2 Max 64.00 ( 0.00%) 56.00 ( 12.50%) 52.00 ( 18.75%)
Success 3 Min 84.00 ( 0.00%) 82.00 ( 2.38%) 78.00 ( 7.14%)
Success 3 Mean 85.60 ( 0.00%) 82.80 ( 3.27%) 79.40 ( 7.24%)
Success 3 Max 86.00 ( 0.00%) 83.00 ( 3.49%) 80.00 ( 6.98%)
Patch 1:
3.19-rc4 3.19-rc4 3.19-rc4
6-nothp-1 6-nothp-2 6-nothp-3
Success 1 Min 49.00 ( 0.00%) 44.00 ( 10.20%) 44.00 ( 10.20%)
Success 1 Mean 51.80 ( 0.00%) 46.00 ( 11.20%) 45.80 ( 11.58%)
Success 1 Max 54.00 ( 0.00%) 49.00 ( 9.26%) 49.00 ( 9.26%)
Success 2 Min 58.00 ( 0.00%) 49.00 ( 15.52%) 48.00 ( 17.24%)
Success 2 Mean 60.40 ( 0.00%) 51.80 ( 14.24%) 50.80 ( 15.89%)
Success 2 Max 63.00 ( 0.00%) 54.00 ( 14.29%) 55.00 ( 12.70%)
Success 3 Min 84.00 ( 0.00%) 81.00 ( 3.57%) 79.00 ( 5.95%)
Success 3 Mean 85.00 ( 0.00%) 81.60 ( 4.00%) 79.80 ( 6.12%)
Success 3 Max 86.00 ( 0.00%) 82.00 ( 4.65%) 82.00 ( 4.65%)
Patch 2:
3.19-rc4 3.19-rc4 3.19-rc4
7-nothp-1 7-nothp-2 7-nothp-3
Success 1 Min 50.00 ( 0.00%) 44.00 ( 12.00%) 39.00 ( 22.00%)
Success 1 Mean 52.80 ( 0.00%) 45.60 ( 13.64%) 42.40 ( 19.70%)
Success 1 Max 55.00 ( 0.00%) 46.00 ( 16.36%) 47.00 ( 14.55%)
Success 2 Min 52.00 ( 0.00%) 48.00 ( 7.69%) 45.00 ( 13.46%)
Success 2 Mean 53.40 ( 0.00%) 49.80 ( 6.74%) 48.80 ( 8.61%)
Success 2 Max 57.00 ( 0.00%) 52.00 ( 8.77%) 52.00 ( 8.77%)
Success 3 Min 84.00 ( 0.00%) 81.00 ( 3.57%) 79.00 ( 5.95%)
Success 3 Mean 85.00 ( 0.00%) 82.40 ( 3.06%) 79.60 ( 6.35%)
Success 3 Max 86.00 ( 0.00%) 83.00 ( 3.49%) 80.00 ( 6.98%)
Patch 3:
3.19-rc4 3.19-rc4 3.19-rc4
8-nothp-1 8-nothp-2 8-nothp-3
Success 1 Min 46.00 ( 0.00%) 44.00 ( 4.35%) 42.00 ( 8.70%)
Success 1 Mean 50.20 ( 0.00%) 45.60 ( 9.16%) 44.00 ( 12.35%)
Success 1 Max 52.00 ( 0.00%) 47.00 ( 9.62%) 47.00 ( 9.62%)
Success 2 Min 53.00 ( 0.00%) 49.00 ( 7.55%) 48.00 ( 9.43%)
Success 2 Mean 55.80 ( 0.00%) 50.60 ( 9.32%) 49.00 ( 12.19%)
Success 2 Max 59.00 ( 0.00%) 52.00 ( 11.86%) 51.00 ( 13.56%)
Success 3 Min 84.00 ( 0.00%) 80.00 ( 4.76%) 79.00 ( 5.95%)
Success 3 Mean 85.40 ( 0.00%) 81.60 ( 4.45%) 80.40 ( 5.85%)
Success 3 Max 87.00 ( 0.00%) 83.00 ( 4.60%) 82.00 ( 5.75%)
While there's no improvement here, I consider reduced fragmentation events
to be worth on its own. Patch 2 also seems to reduce scanning for free
pages, and migrations in compaction, suggesting it has somewhat less work
to do:
Patch 1:
Compaction stalls 4153 3959 3978
Compaction success 1523 1441 1446
Compaction failures 2630 2517 2531
Page migrate success 4600827 4943120 5104348
Page migrate failure 19763 16656 17806
Compaction pages isolated 9597640 10305617 10653541
Compaction migrate scanned 77828948 86533283 87137064
Compaction free scanned 517758295 521312840 521462251
Compaction cost 5503 5932 6110
Patch 2:
Compaction stalls 3800 3450 3518
Compaction success 1421 1316 1317
Compaction failures 2379 2134 2201
Page migrate success 4160421 4502708 4752148
Page migrate failure 19705 14340 14911
Compaction pages isolated 8731983 9382374 9910043
Compaction migrate scanned 98362797 96349194 98609686
Compaction free scanned 496512560 469502017 480442545
Compaction cost 5173 5526 5811
As with v2, /proc/pagetypeinfo appears unaffected with respect to numbers
of unmovable and reclaimable pageblocks.
Configuring the benchmark to allocate like THP page fault (i.e. no sync
compaction) gives much noisier results for iterations 2 and 3 after
reboot. This is not so surprising given how [1] offers lower improvements
in this scenario due to less restarts after deferred compaction which
would change compaction pivot.
Baseline:
3.19-rc4 3.19-rc4 3.19-rc4
5-thp-1 5-thp-2 5-thp-3
Page alloc extfrag event 8148965 6227815 6646741
Extfrag fragmenting 8147872 6227130 6646117
Extfrag fragmenting for unmovable 10324 12942 15975
Extfrag fragmenting unmovable placed with movable 5972 8495 10907
Extfrag fragmenting for reclaimable 601 1707 2210
Extfrag fragmenting reclaimable placed with movable 520 1570 2000
Extfrag fragmenting for movable 8136947 6212481 6627932
Patch 1:
3.19-rc4 3.19-rc4 3.19-rc4
6-thp-1 6-thp-2 6-thp-3
Page alloc extfrag event 8345457 7574471 7020419
Extfrag fragmenting 8343546 7573777 7019718
Extfrag fragmenting for unmovable 10256 18535 30716
Extfrag fragmenting unmovable placed with movable 6893 11726 22181
Extfrag fragmenting for reclaimable 465 1208 1023
Extfrag fragmenting reclaimable placed with movable 353 996 843
Extfrag fragmenting for movable 8332825 7554034 6987979
Patch 2:
3.19-rc4 3.19-rc4 3.19-rc4
7-thp-1 7-thp-2 7-thp-3
Page alloc extfrag event 3512847 3020756 2891625
Extfrag fragmenting 3511940 3020185 2891059
Extfrag fragmenting for unmovable 9017 6892 6191
Extfrag fragmenting unmovable placed with movable 1524 3053 2435
Extfrag fragmenting for reclaimable 445 1081 1160
Extfrag fragmenting reclaimable placed with movable 375 918 986
Extfrag fragmenting for movable 3502478 3012212 2883708
Patch 3:
3.19-rc4 3.19-rc4 3.19-rc4
8-thp-1 8-thp-2 8-thp-3
Page alloc extfrag event 3181699 3082881 2674164
Extfrag fragmenting 3180812 3082303 2673611
Extfrag fragmenting for unmovable 1201 4031 4040
Extfrag fragmenting unmovable placed with movable 974 3611 3645
Extfrag fragmenting for reclaimable 478 1165 1294
Extfrag fragmenting reclaimable placed with movable 387 985 1030
Extfrag fragmenting for movable 3179133 3077107 2668277
The improvements for first iteration are clear, the rest is much noisier
and can appear like regression for Patch 1. Anyway, patch 2 rectifies it.
Allocation success rates are again unaffected so there's no point in
making this e-mail any longer.
[1] http://marc.info/?l=linux-mm&m=142166196321125&w=2
This patch (of 3):
When __rmqueue_fallback() is called to allocate a page of order X, it will
find a page of order Y >= X of a fallback migratetype, which is different
from the desired migratetype. With the help of try_to_steal_freepages(),
it may change the migratetype (to the desired one) also of:
1) all currently free pages in the pageblock containing the fallback page
2) the fallback pageblock itself
3) buddy pages created by splitting the fallback page (when Y > X)
These decisions take the order Y into account, as well as the desired
migratetype, with the goal of preventing multiple fallback allocations
that could e.g. distribute UNMOVABLE allocations among multiple
pageblocks.
Originally, decision for 1) has implied the decision for 3). Commit
47118af076 ("mm: mmzone: MIGRATE_CMA migration type added") changed that
(probably unintentionally) so that the buddy pages in case 3) are always
changed to the desired migratetype, except for CMA pageblocks.
Commit fef903efcf ("mm/page_allo.c: restructure free-page stealing code
and fix a bug") did some refactoring and added a comment that the case of
3) is intended. Commit 0cbef29a78 ("mm: __rmqueue_fallback() should
respect pageblock type") removed the comment and tried to restore the
original behavior where 1) implies 3), but due to the previous
refactoring, the result is instead that only 2) implies 3) - and the
conditions for 2) are less frequently met than conditions for 1). This
may increase fragmentation in situations where the code decides to steal
all free pages from the pageblock (case 1)), but then gives back the buddy
pages produced by splitting.
This patch restores the original intended logic where 1) implies 3).
During testing with stress-highalloc from mmtests, this has shown to
decrease the number of events where UNMOVABLE and RECLAIMABLE allocations
steal from MOVABLE pageblocks, which can lead to permanent fragmentation.
In some cases it has increased the number of events when MOVABLE
allocations steal from UNMOVABLE or RECLAIMABLE pageblocks, but these are
fixable by sync compaction and thus less harmful.
Note that evaluation has shown that the behavior introduced by
47118af076 for buddy pages in case 3) is actually even better than the
original logic, so the following patch will introduce it properly once
again. For stable backports of this patch it makes thus sense to only fix
versions containing 0cbef29a78.
[iamjoonsoo.kim@lge.com: tracepoint fix]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@vger.kernel.org> [3.13+ containing 0cbef29a78]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce walk_page_vma(), which is useful for the callers which want to
walk over a given vma. It's used by later patches.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current implementation of page table walker has a fundamental problem in
vma handling, which started when we tried to handle vma(VM_HUGETLB).
Because it's done in pgd loop, considering vma boundary makes code
complicated and bug-prone.
From the users viewpoint, some user checks some vma-related condition to
determine whether the user really does page walk over the vma.
In order to solve these, this patch moves vma check outside pgd loop and
introduce a new callback ->test_walk().
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently no user of page table walker sets ->pgd_entry() or
->pud_entry(), so checking their existence in each loop is just wasting
CPU cycle. So let's remove it to reduce overhead.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the more generic get_user_pages_unlocked which has the additional
benefit of passing FAULT_FLAG_ALLOW_RETRY at the very first page fault
(which allows the first page fault in an unmapped area to be always able
to block indefinitely by being allowed to release the mmap_sem).
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Peter Feiner <pfeiner@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some callers (like KVM) may want to set the gup_flags like FOLL_HWPOSION
to get a proper -EHWPOSION retval instead of -EFAULT to take a more
appropriate action if get_user_pages runs into a memory failure.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
FAULT_FOLL_ALLOW_RETRY allows the page fault to drop the mmap_sem for
reading to reduce the mmap_sem contention (for writing), like while
waiting for I/O completion. The problem is that right now practically no
get_user_pages call uses FAULT_FOLL_ALLOW_RETRY, so we're not leveraging
that nifty feature.
Andres fixed it for the KVM page fault. However get_user_pages_fast
remains uncovered, and 99% of other get_user_pages aren't using it either
(the only exception being FOLL_NOWAIT in KVM which is really nonblocking
and in fact it doesn't even release the mmap_sem).
So this patchsets extends the optimization Andres did in the KVM page
fault to the whole kernel. It makes most important places (including
gup_fast) to use FAULT_FOLL_ALLOW_RETRY to reduce the mmap_sem hold times
during I/O.
The only few places that remains uncovered are drivers like v4l and other
exceptions that tends to work on their own memory and they're not working
on random user memory (for example like O_DIRECT that uses gup_fast and is
fully covered by this patch).
A follow up patch should probably also add a printk_once warning to
get_user_pages that should go obsolete and be phased out eventually. The
"vmas" parameter of get_user_pages makes it fundamentally incompatible
with FAULT_FOLL_ALLOW_RETRY (vmas array becomes meaningless the moment the
mmap_sem is released).
While this is just an optimization, this becomes an absolute requirement
for the userfaultfd feature http://lwn.net/Articles/615086/ .
The userfaultfd allows to block the page fault, and in order to do so I
need to drop the mmap_sem first. So this patch also ensures that all
memory where userfaultfd could be registered by KVM, the very first fault
(no matter if it is a regular page fault, or a get_user_pages) always has
FAULT_FOLL_ALLOW_RETRY set. Then the userfaultfd blocks and it is waken
only when the pagetable is already mapped. The second fault attempt after
the wakeup doesn't need FAULT_FOLL_ALLOW_RETRY, so it's ok to retry
without it.
This patch (of 5):
We can leverage the VM_FAULT_RETRY functionality in the page fault paths
better by using either get_user_pages_locked or get_user_pages_unlocked.
The former allows conversion of get_user_pages invocations that will have
to pass a "&locked" parameter to know if the mmap_sem was dropped during
the call. Example from:
down_read(&mm->mmap_sem);
do_something()
get_user_pages(tsk, mm, ..., pages, NULL);
up_read(&mm->mmap_sem);
to:
int locked = 1;
down_read(&mm->mmap_sem);
do_something()
get_user_pages_locked(tsk, mm, ..., pages, &locked);
if (locked)
up_read(&mm->mmap_sem);
The latter is suitable only as a drop in replacement of the form:
down_read(&mm->mmap_sem);
get_user_pages(tsk, mm, ..., pages, NULL);
up_read(&mm->mmap_sem);
into:
get_user_pages_unlocked(tsk, mm, ..., pages);
Where tsk, mm, the intermediate "..." paramters and "pages" can be any
value as before. Just the last parameter of get_user_pages (vmas) must be
NULL for get_user_pages_locked|unlocked to be usable (the latter original
form wouldn't have been safe anyway if vmas wasn't null, for the former we
just make it explicit by dropping the parameter).
If vmas is not NULL these two methods cannot be used.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
Reviewed-by: Peter Feiner <pfeiner@google.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The previous commit ("mm/thp: Allocate transparent hugepages on local
node") introduced alloc_hugepage_vma() to mm/mempolicy.c to perform a
special policy for THP allocations. The function has the same interface
as alloc_pages_vma(), shares a lot of boilerplate code and a long
comment.
This patch merges the hugepage special case into alloc_pages_vma. The
extra if condition should be cheap enough price to pay. We also prevent
a (however unlikely) race with parallel mems_allowed update, which could
make hugepage allocation restart only within the fallback call to
alloc_hugepage_vma() and not reconsider the special rule in
alloc_hugepage_vma().
Also by making sure mpol_cond_put(pol) is always called before actual
allocation attempt, we can use a single exit path within the function.
Also update the comment for missing node parameter and obsolete reference
to mm_sem.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This make sure that we try to allocate hugepages from local node if
allowed by mempolicy. If we can't, we fallback to small page allocation
based on mempolicy. This is based on the observation that allocating
pages on local node is more beneficial than allocating hugepages on remote
node.
With this patch applied we may find transparent huge page allocation
failures if the current node doesn't have enough freee hugepages. Before
this patch such failures result in us retrying the allocation on other
nodes in the numa node mask.
[akpm@linux-foundation.org: fix comment, add CONFIG_TRANSPARENT_HUGEPAGE dependency]
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Compaction deferring logic is heavy hammer that block the way to the
compaction. It doesn't consider overall system state, so it could prevent
user from doing compaction falsely. In other words, even if system has
enough range of memory to compact, compaction would be skipped due to
compaction deferring logic. This patch add new tracepoint to understand
work of deferring logic. This will also help to check compaction success
and fail.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is not well analyzed that when/why compaction start/finish or not.
With these new tracepoints, we can know much more about start/finish
reason of compaction. I can find following bug with these tracepoint.
http://www.spinics.net/lists/linux-mm/msg81582.html
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It'd be useful to know current range where compaction work for detailed
analysis. With it, we can know pageblock where we actually scan and
isolate, and, how much pages we try in that pageblock and can guess why it
doesn't become freepage with pageblock order roughly.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We now have tracepoint for begin event of compaction and it prints start
position of both scanners, but, tracepoint for end event of compaction
doesn't print finish position of both scanners. It'd be also useful to
know finish position of both scanners so this patch add it. It will help
to find odd behavior or problem on compaction internal logic.
And mode is added to both begin/end tracepoint output, since according to
mode, compaction behavior is quite different.
And lastly, status format is changed to string rather than status number
for readability.
[akpm@linux-foundation.org: fix sparse warning]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To check the range that compaction is working, tracepoint print
start/end pfn of zone and start pfn of both scanner with decimal format.
Since we manage all pages in order of 2 and it is well represented by
hexadecimal, this patch change the tracepoint format from decimal to
hexadecimal. This would improve readability. For example, it makes us
easily notice whether current scanner try to compact previously
attempted pageblock or not.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave noticed that unprivileged process can allocate significant amount of
memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and
memory cgroup. The trick is to allocate a lot of PMD page tables. Linux
kernel doesn't account PMD tables to the process, only PTE.
The use-cases below use few tricks to allocate a lot of PMD page tables
while keeping VmRSS and VmPTE low. oom_score for the process will be 0.
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#define PUD_SIZE (1UL << 30)
#define PMD_SIZE (1UL << 21)
#define NR_PUD 130000
int main(void)
{
char *addr = NULL;
unsigned long i;
prctl(PR_SET_THP_DISABLE);
for (i = 0; i < NR_PUD ; i++) {
addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
if (addr == MAP_FAILED) {
perror("mmap");
break;
}
*addr = 'x';
munmap(addr, PMD_SIZE);
mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
if (addr == MAP_FAILED)
perror("re-mmap"), exit(1);
}
printf("PID %d consumed %lu KiB in PMD page tables\n",
getpid(), i * 4096 >> 10);
return pause();
}
The patch addresses the issue by account PMD tables to the process the
same way we account PTE.
The main place where PMD tables is accounted is __pmd_alloc() and
free_pmd_range(). But there're few corner cases:
- HugeTLB can share PMD page tables. The patch handles by accounting
the table to all processes who share it.
- x86 PAE pre-allocates few PMD tables on fork.
- Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity
check on exit(2).
Accounting only happens on configuration where PMD page table's level is
present (PMD is not folded). As with nr_ptes we use per-mm counter. The
counter value is used to calculate baseline for badness score by
oom-killer.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: David Rientjes <rientjes@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If an architecure uses <asm-generic/4level-fixup.h>, build fails if we
try to use PUD_SHIFT in generic code:
In file included from arch/microblaze/include/asm/bug.h:1:0,
from include/linux/bug.h:4,
from include/linux/thread_info.h:11,
from include/asm-generic/preempt.h:4,
from arch/microblaze/include/generated/asm/preempt.h:1,
from include/linux/preempt.h:18,
from include/linux/spinlock.h:50,
from include/linux/mmzone.h:7,
from include/linux/gfp.h:5,
from include/linux/slab.h:14,
from mm/mmap.c:12:
mm/mmap.c: In function 'exit_mmap':
>> mm/mmap.c:2858:46: error: 'PUD_SHIFT' undeclared (first use in this function)
round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT);
^
include/asm-generic/bug.h:86:25: note: in definition of macro 'WARN_ON'
int __ret_warn_on = !!(condition); \
^
mm/mmap.c:2858:46: note: each undeclared identifier is reported only once for each function it appears in
round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT);
^
include/asm-generic/bug.h:86:25: note: in definition of macro 'WARN_ON'
int __ret_warn_on = !!(condition); \
^
As with <asm-generic/pgtable-nopud.h>, let's define PUD_SHIFT to
PGDIR_SHIFT.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 5695be142e ("OOM, PM: OOM killed task shouldn't escape PM
suspend") has left a race window when OOM killer manages to
note_oom_kill after freeze_processes checks the counter. The race
window is quite small and really unlikely and partial solution deemed
sufficient at the time of submission.
Tejun wasn't happy about this partial solution though and insisted on a
full solution. That requires the full OOM and freezer's task freezing
exclusion, though. This is done by this patch which introduces oom_sem
RW lock and turns oom_killer_disable() into a full OOM barrier.
oom_killer_disabled check is moved from the allocation path to the OOM
level and we take oom_sem for reading for both the check and the whole
OOM invocation.
oom_killer_disable() takes oom_sem for writing so it waits for all
currently running OOM killer invocations. Then it disable all the further
OOMs by setting oom_killer_disabled and checks for any oom victims.
Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The
last victim wakes up all waiters enqueued by oom_killer_disable().
Therefore this function acts as the full OOM barrier.
The page fault path is covered now as well although it was assumed to be
safe before. As per Tejun, "We used to have freezing points deep in file
system code which may be reacheable from page fault." so it would be
better and more robust to not rely on freezing points here. Same applies
to the memcg OOM killer.
out_of_memory tells the caller whether the OOM was allowed to trigger and
the callers are supposed to handle the situation. The page allocation
path simply fails the allocation same as before. The page fault path will
retry the fault (more on that later) and Sysrq OOM trigger will simply
complain to the log.
Normally there wouldn't be any unfrozen user tasks after
try_to_freeze_tasks so the function will not block. But if there was an
OOM killer racing with try_to_freeze_tasks and the OOM victim didn't
finish yet then we have to wait for it. This should complete in a finite
time, though, because
- the victim cannot loop in the page fault handler (it would die
on the way out from the exception)
- it cannot loop in the page allocator because all the further
allocation would fail and __GFP_NOFAIL allocations are not
acceptable at this stage
- it shouldn't be blocked on any locks held by frozen tasks
(try_to_freeze expects lockless context) and kernel threads and
work queues are not frozen yet
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Suggested-by: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset addresses a race which was described in the changelog for
5695be142e ("OOM, PM: OOM killed task shouldn't escape PM suspend"):
: PM freezer relies on having all tasks frozen by the time devices are
: getting frozen so that no task will touch them while they are getting
: frozen. But OOM killer is allowed to kill an already frozen task in order
: to handle OOM situtation. In order to protect from late wake ups OOM
: killer is disabled after all tasks are frozen. This, however, still keeps
: a window open when a killed task didn't manage to die by the time
: freeze_processes finishes.
The original patch hasn't closed the race window completely because that
would require a more complex solution as it can be seen by this patchset.
The primary motivation was to close the race condition between OOM killer
and PM freezer _completely_. As Tejun pointed out, even though the race
condition is unlikely the harder it would be to debug weird bugs deep in
the PM freezer when the debugging options are reduced considerably. I can
only speculate what might happen when a task is still runnable
unexpectedly.
On a plus side and as a side effect the oom enable/disable has a better
(full barrier) semantic without polluting hot paths.
I have tested the series in KVM with 100M RAM:
- many small tasks (20M anon mmap) which are triggering OOM continually
- s2ram which resumes automatically is triggered in a loop
echo processors > /sys/power/pm_test
while true
do
echo mem > /sys/power/state
sleep 1s
done
- simple module which allocates and frees 20M in 8K chunks. If it sees
freezing(current) then it tries another round of allocation before calling
try_to_freeze
- debugging messages of PM stages and OOM killer enable/disable/fail added
and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before
it wakes up waiters.
- rebased on top of the current mmotm which means some necessary updates
in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but
I think this should be OK because __thaw_task shouldn't interfere with any
locking down wake_up_process. Oleg?
As expected there are no OOM killed tasks after oom is disabled and
allocations requested by the kernel thread are failing after all the tasks
are frozen and OOM disabled. I wasn't able to catch a race where
oom_killer_disable would really have to wait but I kinda expected the race
is really unlikely.
[ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB
[ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1
[ 243.636072] (elapsed 2.837 seconds) done.
[ 243.641985] Trying to disable OOM killer
[ 243.643032] Waiting for concurent OOM victims
[ 243.644342] OOM killer disabled
[ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done.
[ 243.652983] Suspending console(s) (use no_console_suspend to debug)
[ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010
[...]
[ 243.992600] PM: suspend of devices complete after 336.667 msecs
[ 243.993264] PM: late suspend of devices complete after 0.660 msecs
[ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs
[ 243.994717] ACPI: Preparing to enter system sleep state S3
[ 243.994795] PM: Saving platform NVS memory
[ 243.994796] Disabling non-boot CPUs ...
The first 2 patches are simple cleanups for OOM. They should go in
regardless the rest IMO.
Patches 3 and 4 are trivial printk -> pr_info conversion and they should
go in ditto.
The main patch is the last one and I would appreciate acks from Tejun and
Rafael. I think the OOM part should be OK (except for __thaw_task vs.
task_lock where a look from Oleg would appreciated) but I am not so sure I
haven't screwed anything in the freezer code. I have found several
surprises there.
This patch (of 5):
This patch is just a preparatory and it doesn't introduce any functional
change.
Note:
I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to
wait for the oom victim and to prevent from new killing. This is
just a side effect of the flag. The primary meaning is to give the oom
victim access to the memory reserves and that shouldn't be necessary
here.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce the basic control files to account, partition, and limit
memory using cgroups in default hierarchy mode.
This interface versioning allows us to address fundamental design
issues in the existing memory cgroup interface, further explained
below. The old interface will be maintained indefinitely, but a
clearer model and improved workload performance should encourage
existing users to switch over to the new one eventually.
The control files are thus:
- memory.current shows the current consumption of the cgroup and its
descendants, in bytes.
- memory.low configures the lower end of the cgroup's expected
memory consumption range. The kernel considers memory below that
boundary to be a reserve - the minimum that the workload needs in
order to make forward progress - and generally avoids reclaiming
it, unless there is an imminent risk of entering an OOM situation.
- memory.high configures the upper end of the cgroup's expected
memory consumption range. A cgroup whose consumption grows beyond
this threshold is forced into direct reclaim, to work off the
excess and to throttle new allocations heavily, but is generally
allowed to continue and the OOM killer is not invoked.
- memory.max configures the hard maximum amount of memory that the
cgroup is allowed to consume before the OOM killer is invoked.
- memory.events shows event counters that indicate how often the
cgroup was reclaimed while below memory.low, how often it was
forced to reclaim excess beyond memory.high, how often it hit
memory.max, and how often it entered OOM due to memory.max. This
allows users to identify configuration problems when observing a
degradation in workload performance. An overcommitted system will
have an increased rate of low boundary breaches, whereas increased
rates of high limit breaches, maximum hits, or even OOM situations
will indicate internally overcommitted cgroups.
For existing users of memory cgroups, the following deviations from
the current interface are worth pointing out and explaining:
- The original lower boundary, the soft limit, is defined as a limit
that is per default unset. As a result, the set of cgroups that
global reclaim prefers is opt-in, rather than opt-out. The costs
for optimizing these mostly negative lookups are so high that the
implementation, despite its enormous size, does not even provide
the basic desirable behavior. First off, the soft limit has no
hierarchical meaning. All configured groups are organized in a
global rbtree and treated like equal peers, regardless where they
are located in the hierarchy. This makes subtree delegation
impossible. Second, the soft limit reclaim pass is so aggressive
that it not just introduces high allocation latencies into the
system, but also impacts system performance due to overreclaim, to
the point where the feature becomes self-defeating.
The memory.low boundary on the other hand is a top-down allocated
reserve. A cgroup enjoys reclaim protection when it and all its
ancestors are below their low boundaries, which makes delegation
of subtrees possible. Secondly, new cgroups have no reserve per
default and in the common case most cgroups are eligible for the
preferred reclaim pass. This allows the new low boundary to be
efficiently implemented with just a minor addition to the generic
reclaim code, without the need for out-of-band data structures and
reclaim passes. Because the generic reclaim code considers all
cgroups except for the ones running low in the preferred first
reclaim pass, overreclaim of individual groups is eliminated as
well, resulting in much better overall workload performance.
- The original high boundary, the hard limit, is defined as a strict
limit that can not budge, even if the OOM killer has to be called.
But this generally goes against the goal of making the most out of
the available memory. The memory consumption of workloads varies
during runtime, and that requires users to overcommit. But doing
that with a strict upper limit requires either a fairly accurate
prediction of the working set size or adding slack to the limit.
Since working set size estimation is hard and error prone, and
getting it wrong results in OOM kills, most users tend to err on
the side of a looser limit and end up wasting precious resources.
The memory.high boundary on the other hand can be set much more
conservatively. When hit, it throttles allocations by forcing
them into direct reclaim to work off the excess, but it never
invokes the OOM killer. As a result, a high boundary that is
chosen too aggressively will not terminate the processes, but
instead it will lead to gradual performance degradation. The user
can monitor this and make corrections until the minimal memory
footprint that still gives acceptable performance is found.
In extreme cases, with many concurrent allocations and a complete
breakdown of reclaim progress within the group, the high boundary
can be exceeded. But even then it's mostly better to satisfy the
allocation from the slack available in other groups or the rest of
the system than killing the group. Otherwise, memory.max is there
to limit this type of spillover and ultimately contain buggy or
even malicious applications.
- The original control file names are unwieldy and inconsistent in
many different ways. For example, the upper boundary hit count is
exported in the memory.failcnt file, but an OOM event count has to
be manually counted by listening to memory.oom_control events, and
lower boundary / soft limit events have to be counted by first
setting a threshold for that value and then counting those events.
Also, usage and limit files encode their units in the filename.
That makes the filenames very long, even though this is not
information that a user needs to be reminded of every time they
type out those names.
To address these naming issues, as well as to signal clearly that
the new interface carries a new configuration model, the naming
conventions in it necessarily differ from the old interface.
- The original limit files indicate the state of an unset limit with
a very high number, and a configured limit can be unset by echoing
-1 into those files. But that very high number is implementation
and architecture dependent and not very descriptive. And while -1
can be understood as an underflow into the highest possible value,
-2 or -10M etc. do not work, so it's not inconsistent.
memory.low, memory.high, and memory.max will use the string
"infinity" to indicate and set the highest possible value.
[akpm@linux-foundation.org: use seq_puts() for basic strings]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The unified hierarchy interface for memory cgroups will no longer use "-1"
to mean maximum possible resource value. In preparation for this, make
the string an argument and let the caller supply it.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit b2052564e6 ("mm: memcontrol: continue cache reclaim from
offlined groups") pages charged to a memory cgroup are not reparented when
the cgroup is removed. Instead, they are supposed to be reclaimed in a
regular way, along with pages accounted to online memory cgroups.
However, an lruvec of an offline memory cgroup will sooner or later get so
small that it will be scanned only at low scan priorities (see
get_scan_count()). Therefore, if there are enough reclaimable pages in
big lruvecs, pages accounted to offline memory cgroups will never be
scanned at all, wasting memory.
Fix this by unconditionally forcing scanning dead lruvecs from kswapd.
[akpm@linux-foundation.org: fix build]
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
next_zones_zonelist() returns a zoneref pointer, as well as a zone pointer
via extra parameter. Since the latter can be trivially obtained by
dereferencing the former, the overhead of the extra parameter is
unjustified.
This patch thus removes the zone parameter from next_zones_zonelist().
Both callers happen to be in the same header file, so it's simple to add
the zoneref dereference inline. We save some bytes of code size.
add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-105 (-105)
function old new delta
nr_free_zone_pages 129 115 -14
__alloc_pages_nodemask 2300 2285 -15
get_page_from_freelist 2652 2576 -76
add/remove: 0/0 grow/shrink: 1/0 up/down: 10/0 (10)
function old new delta
try_to_compact_pages 569 579 +10
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Expand the usage of the struct alloc_context introduced in the previous
patch also for calling try_to_compact_pages(), to reduce the number of its
parameters. Since the function is in different compilation unit, we need
to move alloc_context definition in the shared mm/internal.h header.
With this change we get simpler code and small savings of code size and stack
usage:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-27 (-27)
function old new delta
__alloc_pages_direct_compact 283 256 -27
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-13 (-13)
function old new delta
try_to_compact_pages 582 569 -13
Stack usage of __alloc_pages_direct_compact goes from 24 to none (per
scripts/checkstack.pl).
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have a race condition between move_pages() and freeing hugepages, where
move_pages() calls follow_page(FOLL_GET) for hugepages internally and
tries to get its refcount without preventing concurrent freeing. This
race crashes the kernel, so this patch fixes it by moving FOLL_GET code
for hugepages into follow_huge_pmd() with taking the page table lock.
This patch intentionally removes page==NULL check after pte_page.
This is justified because pte_page() never returns NULL for any
architectures or configurations.
This patch changes the behavior of follow_huge_pmd() for tail pages and
then tail pages can be pinned/returned. So the caller must be changed to
properly handle the returned tail pages.
We could have a choice to add the similar locking to
follow_huge_(addr|pud) for consistency, but it's not necessary because
currently these functions don't support FOLL_GET flag, so let's leave it
for future development.
Here is the reproducer:
$ cat movepages.c
#include <stdio.h>
#include <stdlib.h>
#include <numaif.h>
#define ADDR_INPUT 0x700000000000UL
#define HPS 0x200000
#define PS 0x1000
int main(int argc, char *argv[]) {
int i;
int nr_hp = strtol(argv[1], NULL, 0);
int nr_p = nr_hp * HPS / PS;
int ret;
void **addrs;
int *status;
int *nodes;
pid_t pid;
pid = strtol(argv[2], NULL, 0);
addrs = malloc(sizeof(char *) * nr_p + 1);
status = malloc(sizeof(char *) * nr_p + 1);
nodes = malloc(sizeof(char *) * nr_p + 1);
while (1) {
for (i = 0; i < nr_p; i++) {
addrs[i] = (void *)ADDR_INPUT + i * PS;
nodes[i] = 1;
status[i] = 0;
}
ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
MPOL_MF_MOVE_ALL);
if (ret == -1)
err("move_pages");
for (i = 0; i < nr_p; i++) {
addrs[i] = (void *)ADDR_INPUT + i * PS;
nodes[i] = 0;
status[i] = 0;
}
ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
MPOL_MF_MOVE_ALL);
if (ret == -1)
err("move_pages");
}
return 0;
}
$ cat hugepage.c
#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#define ADDR_INPUT 0x700000000000UL
#define HPS 0x200000
int main(int argc, char *argv[]) {
int nr_hp = strtol(argv[1], NULL, 0);
char *p;
while (1) {
p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
if (p != (void *)ADDR_INPUT) {
perror("mmap");
break;
}
memset(p, 0, nr_hp * HPS);
munmap(p, nr_hp * HPS);
}
}
$ sysctl vm.nr_hugepages=40
$ ./hugepage 10 &
$ ./movepages 10 $(pgrep -f hugepage)
Fixes: e632a938d9 ("mm: migrate: add hugepage migration code to move_pages()")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: <stable@vger.kernel.org> [3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Found it when I want to jump to the definition of MIGRATE_RESERVE ctags.
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The complexity of memcg page stat synchronization is currently leaking
into the callsites, forcing them to keep track of the move_lock state and
the IRQ flags. Simplify the API by tracking it in the memcg.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The body of this function was removed by commit 0a31bc97c8 ("mm:
memcontrol: rewrite uncharge API").
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add KPF_ZERO_PAGE flag for zero_page, so that userspace processes can
detect zero_page in /proc/kpageflags, and then do memory analysis more
accurately.
Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add VM_BUG_ON_PAGE() for slab pages. _mapcount is an union with slab
struct in struct page, so we must avoid accessing _mapcount if this page
is a slab page. Also remove the unneeded bracket.
Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, we use lru.next/lru.prev plus cast to access or set
destructor and order of compound page.
Let's replace it with explicit fields in struct page.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch resolves the following warnings.
include/trace/events/f2fs.h:150:1: warning: expression using sizeof bool
include/trace/events/f2fs.h:180:1: warning: expression using sizeof bool
include/trace/events/f2fs.h:990:1: warning: expression using sizeof bool
include/trace/events/f2fs.h:990:1: warning: expression using sizeof bool
include/trace/events/f2fs.h:150:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)
include/trace/events/f2fs.h:180:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)
include/trace/events/f2fs.h:990:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)
include/trace/events/f2fs.h:990:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)
fs/f2fs/checkpoint.c:27:19: warning: symbol 'inode_entry_slab' was not declared. Should it be static?
fs/f2fs/checkpoint.c:577:15: warning: cast to restricted __le32
fs/f2fs/checkpoint.c:592:15: warning: cast to restricted __le32
fs/f2fs/trace.c:19:1: warning: symbol 'pids' was not declared. Should it be static?
fs/f2fs/trace.c:21:21: warning: symbol 'last_io' was not declared. Should it be static?
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds two macros for transition between byte and block offsets.
Currently, f2fs only supports 4KB blocks, so use the default size for now.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds FASTBOOT flag into checkpoint as follows.
- CP_UMOUNT_FLAG is set when system is umounted.
- CP_FASTBOOT_FLAG is set when intermediate checkpoint having node summaries
was done.
So, if you get CP_UMOUNT_FLAG from checkpoint, the system was umounted cleanly.
Instead, if there was sudden-power-off, you can get CP_FASTBOOT_FLAG or nothing.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Currently, there are several variables with Boolean type as below:
struct f2fs_sb_info {
...
int s_dirty;
bool need_fsck;
bool s_closing;
...
bool por_doing;
...
}
For this there are some issues:
1. there are some space of f2fs_sb_info is wasted due to aligning after Boolean
type variables by compiler.
2. if we continuously add new flag into f2fs_sb_info, structure will be messed
up.
So in this patch, we try to:
1. switch s_dirty to Boolean type variable since it has two status 0/1.
2. merge s_dirty/need_fsck/s_closing/por_doing variables into s_flag.
3. introduce an enum type which can indicate different states of sbi.
4. use new introduced universal interfaces is_sbi_flag_set/{set,clear}_sbi_flag
to operate flags for sbi.
After that, above issues will be fixed.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Change remote checksum handling to set checksum partial as default
behavior. Added an iflink parameter to configure not using
checksum partial (calling csum_partial to update checksum).
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change remote checksum handling to set checksum partial as default
behavior. Added an iflink parameter to configure not using
checksum partial (calling csum_partial to update checksum).
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds infrastructure so that remote checksum offload can
set CHECKSUM_PARTIAL instead of calling csum_partial and writing
the modfied checksum field.
Add skb_remcsum_adjust_partial function to set an skb for using
CHECKSUM_PARTIAL with remote checksum offload. Changed
skb_remcsum_process and skb_gro_remcsum_process to take a boolean
argument to indicate if checksum partial can be set or the
checksum needs to be modified using the normal algorithm.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves the free and same_flow fields to be bit fields
(2 and 1 bit sized respectively). This frees up some space for u16's.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current meaning of CHECKSUM_PARTIAL for validating checksums
is that _all_ checksums in the packet are considered valid.
However, in the manner that CHECKSUM_PARTIAL is set only the checksum
at csum_start+csum_offset and any preceding checksums may
be considered valid. If there are checksums in the packet after
csum_offset it is possible they have not been verfied.
This patch changes CHECKSUM_PARTIAL logic in skb_csum_unnecessary and
__skb_gro_checksum_validate_needed to only considered checksums
referring to csum_start and any preceding checksums (with starting
offset before csum_start) to be verified.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remote checksum offload processing is currently the same for both
the GRO and non-GRO path. When the remote checksum offload option
is encountered, the checksum field referred to is modified in
the packet. So in the GRO case, the packet is modified in the
GRO path and then the operation is skipped when the packet goes
through the normal path based on skb->remcsum_offload. There is
a problem in that the packet may be modified in the GRO path, but
then forwarded off host still containing the remote checksum option.
A remote host will again perform RCO but now the checksum verification
will fail since GRO RCO already modified the checksum.
To fix this, we ensure that GRO restores a packet to it's original
state before returning. In this model, when GRO processes a remote
checksum option it still changes the checksum per the algorithm
but on return from lower layer processing the checksum is restored
to its original value.
In this patch we add define gro_remcsum structure which is passed
to skb_gro_remcsum_process to save offset and delta for the checksum
being changed. After lower layer processing, skb_gro_remcsum_cleanup
is called to restore the checksum before returning from GRO.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the normal {} instead of a macro to terminate an array.
Remove the macro too.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the normal {} instead of a macro to terminate an array.
Remove the macro too.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using the IPCB() macro to get the IPv4 options is convenient, but
unfortunately NetLabel often needs to examine the CIPSO option outside
of the scope of the IP layer in the stack. While historically IPCB()
worked above the IP layer, due to the inclusion of the inet_skb_param
struct at the head of the {tcp,udp}_skb_cb structs, recent commit
971f10ec ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
reordered the tcp_skb_cb struct and invalidated this IPCB() trick.
This patch fixes the problem by creating a new function,
cipso_v4_optptr(), which locates the CIPSO option inside the IP header
without calling IPCB(). Unfortunately, this isn't as fast as a simple
lookup so some additional tweaks were made to limit the use of this
new function.
Cc: <stable@vger.kernel.org> # 3.18
Reported-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Tested-by: Casey Schaufler <casey@schaufler-ca.com>
- Framework changes and enhancements:
- Passing -DDEBUG recursively to subdir drivers so we get
debug messages properly turned on.
- Infer map type from DT property in the groups parsing code
in the generic pinconfig code.
- Support for custom parameter passing in generic pin config.
This is used when you are using the generic pin config, but
want to add a few custom properties that no other driver
will use.
- New drivers:
- Driver for the Xilinx Zynq
- Driver for the AmLogic Meson SoCs
- New features in drivers:
- Sleep support (suspend/resume) for the Cherryview driver
- mvebeu a38x can now mux a UART on pins MPP19 and MPP20
- Migrated the qualcomm driver to generic pin config handling
of extended config options in the core code.
- Support BUS1 and AUDIO in the Exynos pin controller.
- Add some missing functions in the sun6i driver.
- Add support for the A31S variant in the sun6i driver.
- EMEv2 support in the Renesas PFC driver.
- Ass support for Qualcomm MSM8916 in the qcom driver.
- Deleted features
- Drop support for the SiRF Marco that was never released to
the market.
- Drop SH7372 support as the support for this platform is
removed from the kernel.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJU2w2GAAoJEEEQszewGV1z7kYQAKw4Y4SY9Mq9O97GBq0JWvzv
uLK4P8NvegHkgX0IDc/etAtHzBN6L+4axh7rDAsaDhug+42CbZxVZjXCfLGFClZP
kJ/gz4II28AGWiP7TZPNHspIJYgKUdWcVfg0cTZwpM22/AEdBAo9HQ2a/FltvrCn
eYwzFlAOKUUygmDdbfXHk5Z+ndrJw0ahLjXn8zjBe1HkD2QVaigM9ecA2aQHiG9a
QvABUJ2qVVs9rqTIxoVzSIGTLeLzrv8cezDLQhZ4KaEasAkxtWKM4kYQSMx/PoTB
mg+FZ5B8IXqlksnSljT+wOcSP1nmtRdjnED/MpsSLbo9RfJgHkA4Lu4Q8iqt7rZL
+k/kKi3+p9pTE2pIi56nSpHnfgF8JHgdRAYIXBea5Ug0YnBp83/5jLrU7Fmcjr7s
l0PH0Fk0iRFRdfn6crcs+SLrhQKtuP+Douwg+3ujVOQiKIW6m+b161GwEVYkvWlq
1JRWPSjncpsmyg5O8dEwZDwgtzPU65UMEsLgRk9wMNJYw0TqGPugEy4+2rBdJWLy
CYzJo2At9OcHbB2rT8UKwtErQF85IcWmfnMyfo2PANTLGaj5EFruhmSc2J0m7kOe
ExPXtWOWxGCtWG53ZDeJUYBg9ySMyleY10LYBP9fPQnotvNB3vfyAkBwV74fnBXs
ijxO6/Uamd4Rrs5LDRBL
=Nbzh
-----END PGP SIGNATURE-----
Merge tag 'pinctrl-v3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pincontrol updates from Linus Walleij:
:This is the bulk of pin control changes for the v3.20 cycle:
Framework changes and enhancements:
- Passing -DDEBUG recursively to subdir drivers so we get debug
messages properly turned on.
- Infer map type from DT property in the groups parsing code in the
generic pinconfig code.
- Support for custom parameter passing in generic pin config. This
is used when you are using the generic pin config, but want to add
a few custom properties that no other driver will use.
New drivers:
- Driver for the Xilinx Zynq
- Driver for the AmLogic Meson SoCs
New features in drivers:
- Sleep support (suspend/resume) for the Cherryview driver
- mvebeu a38x can now mux a UART on pins MPP19 and MPP20
- Migrated the qualcomm driver to generic pin config handling of
extended config options in the core code.
- Support BUS1 and AUDIO in the Exynos pin controller.
- Add some missing functions in the sun6i driver.
- Add support for the A31S variant in the sun6i driver.
- EMEv2 support in the Renesas PFC driver.
- Add support for Qualcomm MSM8916 in the qcom driver.
Deleted features
- Drop support for the SiRF Marco that was never released to the
market.
- Drop SH7372 support as the support for this platform is removed
from the kernel"
* tag 'pinctrl-v3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (40 commits)
sh-pfc: emev2 - Fix mangled author name
pinctrl: cherryview: Configure HiZ pins to be input when requested as GPIOs
pinctrl: imx25: fix numbering for pins
pinctrl: pinctrl-imx: don't use invalid value of conf_reg
pinctrl: qcom: delete pin_config_get/set pinconf operations
pinctrl: qcom: Add msm8916 pinctrl driver
DT: pinctrl: Document Qualcomm MSM8916 pinctrl binding
pinctrl: qcom: increase variable size for register offsets
pinctrl: hide PCONFDUMP in #ifdef
pinctrl: rockchip: Only mask interrupts; never disable
pinctrl: zynq: Fix usb0 pins
pinctrl: sh-pfc: sh7372: Remove DT binding documentation
pinctrl: sh-pfc: sh7372: Remove PFC support
sh-pfc: Add emev2 pinmux support
sh-pfc: add macro to define pinmux without function
pinctrl: add driver for Amlogic Meson SoCs
staging: drivers: pinctrl: Fixed checkpatch.pl warnings
pinctrl: exynos: Add AUDIO pin controller for exynos7
sh-pfc: r8a7790: add MLB+ pin group
sh-pfc: r8a7791: add MLB+ pin group
...
- GPIOLIB core changes:
- Create and use of_mm_gpiochip_remove() for removing
memory-mapped OF GPIO chips
- GPIO MMIO library suppports bgpio_set_multiple for
switching several lines at once, a feature merged in
the last cycle.
- New drivers:
- New driver for the APM X-gene standby GPIO controller
- New driver for the Fujitsu MB86S7x GPIO controller
- Cleanups:
- Moved rcar driver to use gpiolib irqchip
- Moxart converted to the GPIO MMIO library
- GE driver converted to GPIO MMIO library
- Move sx150x to irqdomain
- Move max732x to irqdomain
- Move vx855 to use managed resources
- Move dwapb to use managed resources
- Clean tc3589x from platform data
- Clean stmpe driver to use device tree only probe
- New subtypes:
- sx1506 support in the sx150x driver
- Quark 1000 SoC support in the SCH driver
- Support X86 in the Xilinx driver
- Support PXA1928 in the PXA driver
- Extended drivers:
- max732x supports device tree probe
- sx150x supports device tree probe
- Various minor cleanups and bug fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJU2W1HAAoJEEEQszewGV1zuRkQAKBukQwx46zQuFH8oalSuxmy
H2/ESMnDmRlBI0zX+dAGDg4HlkO9VwIrdcyzMMe7uO8EFUu9d4/mu2E1f8cY2mxu
kwUSntVaYGVPVj/Vok3kzq1wW/pUQ9E2iMbNeDVZJH/cqEvylPPa2LZ3hZVri57J
orJzaYtZWf640y3McGTUDwIQokgxxdMMyWfm26P1iZkByjofUaYRHS1NIxhKVSVC
Y6uA/Ivvh56ezlPQykc7m6YEjoUS91AMllJca1A3KF3+qvQ3Hnc+i9mB7hAQbJ8L
6Iv2qCD4TgtWT5YXgP0eZsfSV/19kvlVlGX7QQT6o7uIFOLVNOmX9FJAc4n3SGNI
HbxsDdlF9XHVFh0XyKcBQfg0uUmRWUDQ/LsXn9KL6Yrpyx3QZGH/PTYl/XoCj1IO
IL6Bw51FCBXvCvLP5alXMouosAxrc19YpljcAuAMmjVTXe6RoULOZJs+lhTHMhvj
S6FWr6l0XDr4yKb0SXTOGI4hvRJvL8SWnIt5ezZrNrTKLsklL3Bi/o3BoKzzxmqS
6l/rxvcVUov455IrfqJ0ZEM6ZxC9/cvGec0RFyglopNSlKeTQsgWRNa6Pw2xZVk8
orT+G3L4jZ1+0hcjcSwfvHng5Nv6DuhIfUNPbLF/ZzmJcBPqGc/DuhGkFqOrBaXg
ey7Lua6+l+8zcDtugRRG
=yWrB
-----END PGP SIGNATURE-----
Merge tag 'gpio-v3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
Pull GPIO changes from Linus Walleij:
"This is the GPIO bulk changes for the v3.20 series:
GPIOLIB core changes:
- Create and use of_mm_gpiochip_remove() for removing memory-mapped
OF GPIO chips
- GPIO MMIO library suppports bgpio_set_multiple for switching
several lines at once, a feature merged in the last cycle.
New drivers:
- New driver for the APM X-gene standby GPIO controller
- New driver for the Fujitsu MB86S7x GPIO controller
Cleanups:
- Moved rcar driver to use gpiolib irqchip
- Moxart converted to the GPIO MMIO library
- GE driver converted to GPIO MMIO library
- Move sx150x to irqdomain
- Move max732x to irqdomain
- Move vx855 to use managed resources
- Move dwapb to use managed resources
- Clean tc3589x from platform data
- Clean stmpe driver to use device tree only probe
New subtypes:
- sx1506 support in the sx150x driver
- Quark 1000 SoC support in the SCH driver
- Support X86 in the Xilinx driver
- Support PXA1928 in the PXA driver
Extended drivers:
- max732x supports device tree probe
- sx150x supports device tree probe
Various minor cleanups and bug fixes"
* tag 'gpio-v3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (61 commits)
gpio: kconfig: replace PPC_OF with PPC
gpio: pxa: add PXA1928 gpio type support
dt/bindings: gpio: add compatible string for marvell,pxa1928-gpio
gpio: pxa: remove mach IRQ includes
gpio: max732x: use an inline function for container cast
gpio: use sizeof() instead of hardcoded values
gpio: max732x: add set_multiple function
gpio: sch: Consolidate similar algorithms
gpio: tz1090-pdc: Use resource_size to fix off-by-one resource size calculation
gpio: ge: Convert to use devm_kstrdup
gpio: correctly use const char * const
gpio: sx150x: fixup OF support
gpio: mpc8xxx: Use of_mm_gpiochip_remove
gpio: Add Fujitsu MB86S7x GPIO driver
gpio: mpc8xxx: Convert to platform device interface.
gpio: zevio: Use of_mm_gpiochip_remove
gpio: gpio-mm-lantiq: Use of_mm_gpiochip_remove
gpio: gpio-mm-lantiq: Use of_property_read_u32
gpio: gpio-mm-lantiq: Do not replicate code
gpio :gpio-mm-lantiq: Use devm_kzalloc
...
- Support for MMC power sequences.
- SDIO function devicetree subnode parsing.
- Refactor the hardware reset routines and enable it for SD cards.
- Various code quality improvements, especially for slot-gpio.
MMC host:
- dw_mmc: Various fixes and cleanups.
- dw_mmc: Convert to mmc_send_tuning().
- moxart: Fix probe logic.
- sdhci: Various fixes and cleanups
- sdhci: Asynchronous request handling support.
- sdhci-pxav3: Various fixes and cleanups.
- sdhci-tegra: Fixes for T114, T124 and T132.
- rtsx: Various fixes and cleanups.
- rtsx: Support for SDIO.
- sdhi/tmio: Refactor and cleanup of header files.
- omap_hsmmc: Use slot-gpio and common MMC DT parser.
- Make all hosts to deal with errors from mmc_of_parse().
- sunxi: Various fixes and cleanups.
- sdhci: Support for Fujitsu SDHCI controller f_sdh30.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJU2b9MAAoJEP4mhCVzWIwp678P/2Hjoo17FDnCQT2qXCRWmMmx
98n7mrkPw20cVm6dlXyVxHFxrgRWan1eATiu1vBdnNmXkeUmThMbuGpATDi40fIT
C2g9wPDM1/naJ+Qg8mPGy0vEDQYHEzxHHlAyfOaeXdhxhll1iHqhk+Jb6cFQN5DP
/CvNmuL/7m9uuFhHlGJnqSNMyenLAFFXthIiVJrQeZeYq9NZ1ZZfW7+esHDmu2lP
EFkrZf+xYFmFWAqccyTR58QZsYKlDv4NS/0UMU941DkO7x7R8ZsQG8xFu9bIN5Wn
EJfgP7EfEXHlD5a1/QQ918IT1ifxhPGiCbBXpdfAUt7Xte6zYyASpTyAm8v7vT2I
2hot1T1BZgADALE2EHAP4kzK49ipfhQmlVZgFeYVsTpPKk8Nvczio7Y3LYlzNmBo
V0jaTUTtU7u7ICtGbo7OqOybW/Sm5E00xsq22txIXObURa7bPbZ4CnxJpstSaU2Z
nweZaa79HaHZE7xyUNh9kAbxfGC0pOT0oPoPYcTxcpk2vva+atULEYnLEHUULrgs
D4+m8tnbuwoZoGanlMKqgPXP8Xkau/meEdz4WaYrXQEIafrVIR2/kcXGQjhD8ucO
VkjUaZDKxNXTkwOzM/siOxJwj75Ka6GDHM7JGx4F30QHqgRTtg2wzInU9nsViuiA
02698dNk9CdP3JirDtbm
=ojsj
-----END PGP SIGNATURE-----
Merge tag 'mmc-v3.20-1' of git://git.linaro.org/people/ulf.hansson/mmc
Pull MMC updates from Ulf Hansson:
"MMC core:
- Support for MMC power sequences.
- SDIO function devicetree subnode parsing.
- Refactor the hardware reset routines and enable it for SD cards.
- Various code quality improvements, especially for slot-gpio.
MMC host:
- dw_mmc: Various fixes and cleanups.
- dw_mmc: Convert to mmc_send_tuning().
- moxart: Fix probe logic.
- sdhci: Various fixes and cleanups
- sdhci: Asynchronous request handling support.
- sdhci-pxav3: Various fixes and cleanups.
- sdhci-tegra: Fixes for T114, T124 and T132.
- rtsx: Various fixes and cleanups.
- rtsx: Support for SDIO.
- sdhi/tmio: Refactor and cleanup of header files.
- omap_hsmmc: Use slot-gpio and common MMC DT parser.
- Make all hosts to deal with errors from mmc_of_parse().
- sunxi: Various fixes and cleanups.
- sdhci: Support for Fujitsu SDHCI controller f_sdh30"
* tag 'mmc-v3.20-1' of git://git.linaro.org/people/ulf.hansson/mmc: (117 commits)
mmc: sdhci-s3c: solve problem with sleeping in atomic context
mmc: pwrseq: add driver for emmc hardware reset
mmc: moxart: fix probe logic
mmc: core: Invoke mmc_pwrseq_post_power_on() prior MMC_POWER_ON state
mmc: pwrseq_simple: Add optional reference clock support
mmc: pwrseq: Document optional clock for the simple power sequence
mmc: pwrseq_simple: Extend to support more pins
mmc: pwrseq: Document that simple sequence support more than one GPIO
mmc: Add hardware dependencies for sdhci-pxav3 and sdhci-pxav2
mmc: sdhci-pxav3: Modify clock settings for the SDR50 and DDR50 modes
mmc: sdhci-pxav3: Extend binding with SDIO3 conf reg for the Armada 38x
mmc: sdhci-pxav3: Fix Armada 38x controller's caps according to erratum ERR-7878951
mmc: sdhci-pxav3: Fix SDR50 and DDR50 capabilities for the Armada 38x flavor
mmc: sdhci: switch voltage before sdhci_set_ios in runtime resume
mmc: tegra: Write xfer_mode, CMD regs in together
mmc: Resolve BKOPS compatability issue
mmc: sdhci-pxav3: fix setting of pdata->clk_delay_cycles
mmc: dw_mmc: rockchip: remove incorrect __exit_p()
mmc: dw_mmc: exynos: remove incorrect __exit_p()
mmc: Fix menuconfig alignment of MMC_SDHCI_* options
...
This is the usual grab bag of driver updates (hpsa, storvsc, mp2sas,
megaraid_sas, ses) plus an assortment of minor updates. There's also an
update to ufs which adds new phy drivers and finally a new logging
infrastructure for SCSI.
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABAgAGBQJU2Ty5AAoJEDeqqVYsXL0M9rAH/1xNpAxXuxQq+dW5Z+uOaX60
5RRIu7/xA1HEfzkT5FTHrolmogDjVqawu4PZS66iHDeo05RBVUlbTA8qCK+MlRcN
U6s0cLEw59eH3EaCfOGuYp/MnbhuV0eNxe0btmqJIQwuW3+gwZKGJdOq6LS2YasJ
k/DyIBVmkJAVsN56vm9q2vbtcZp+Bg+ngqBS+SC4TF7vV1WCtFmS6yaUf62PYW3D
+Irx37qHZntDR5wdw3dsuKDi5U8bl6myPjaVLnVJqg/WIF9RlCkjk5xpWT99AmVO
NmtYQxLLBlAQ5K+sIlBUwxZe+8q1l+Aj4TTmJHAfFtyfp25s7JR9I6/QtOyC5Kw=
=odol
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull first round of SCSI updates from James Bottomley:
"This is the usual grab bag of driver updates (hpsa, storvsc, mp2sas,
megaraid_sas, ses) plus an assortment of minor updates.
There's also an update to ufs which adds new phy drivers and finally a
new logging infrastructure for SCSI"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (114 commits)
scsi_logging: return void for dev_printk() functions
scsi: print single-character strings with seq_putc
scsi: merge consecutive seq_puts calls
scsi: replace seq_printf with seq_puts
aha152x: replace seq_printf with seq_puts
advansys: replace seq_printf with seq_puts
scsi: remove SPRINTF macro
sg: remove an unused variable
hpsa: Use local workqueues instead of system workqueues
hpsa: add in P840ar controller model name
hpsa: add in gen9 controller model names
hpsa: detect and report failures changing controller transport modes
hpsa: shorten the wait for the CISS doorbell mode change ack
hpsa: refactor duplicated scan completion code into a new routine
hpsa: move SG descriptor set-up out of hpsa_scatter_gather()
hpsa: do not use function pointers in fast path command submission
hpsa: print CDBs instead of kernel virtual addresses for uncommon errors
hpsa: do not use a void pointer for scsi_cmd field of struct CommandList
hpsa: return failed from device reset/abort handlers
hpsa: check for ctlr lockup after command allocation in main io path
...
Pull input updates from Dmitry Torokhov:
"The first round of updates for the input subsystem.
A few new drivers (power button handler for AXP20x PMIC, tps65218
power button driver, sun4i keys driver, regulator haptic driver, NI
Ettus Research USRP E3x0 button, Alwinner A10/A20 PS/2 controller).
Updates to Synaptics and ALPS touchpad drivers (with more to come
later), brand new Focaltech PS/2 support, update to Cypress driver to
handle Gen5 (in addition to Gen3) devices, and number of other fixups
to various drivers as well as input core"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (54 commits)
Input: elan_i2c - fix wrong %p extension
Input: evdev - do not queue SYN_DROPPED if queue is empty
Input: gscps2 - fix MODULE_DEVICE_TABLE invocation
Input: synaptics - use dmax in input_mt_assign_slots
Input: pxa27x_keypad - remove unnecessary ARM includes
Input: ti_am335x_tsc - replace delta filtering with median filtering
ARM: dts: AM335x: Make charge delay a DT parameter for TSC
Input: ti_am335x_tsc - read charge delay from DT
Input: ti_am335x_tsc - remove udelay in interrupt handler
Input: ti_am335x_tsc - interchange touchscreen and ADC steps
Input: MT - add support for balanced slot assignment
Input: drv2667 - remove wrong and unneeded drv2667-haptics modalias
Input: drv260x - remove wrong and unneeded drv260x-haptics modalias
Input: cap11xx - remove wrong and unneeded cap11xx modalias
Input: sun4i-ts - add support for touchpanel controller on A31
Input: serio - add support for Alwinner A10/A20 PS/2 controller
Input: gtco - use sign_extend32() for sign extension
Input: elan_i2c - verify firmware signature applying it
Input: elantech - remove stale comment from Kconfig
Input: cyapa - off by one in cyapa_update_fw_store()
...
In this batch, you can find lots of cleanups through the whole
subsystem, as our good New Year's resolution. Lots of LOCs and
commits are about LINE6 driver that was promoted finally from staging
tree, and as usual, there've been widely spread ASoC changes.
Here some highlights:
ALSA core changes
- Embedding struct device into ALSA core structures
- sequencer core cleanups / fixes
- PCM msbits constraints cleanups / fixes
- New SNDRV_PCM_TRIGGER_DRAIN command
- PCM kerneldoc fixes, header cleanups
- PCM code cleanups using more standard codes
- Control notification ID fixes
Driver cleanups
- Cleanups of PCI PM callbacks
- Timer helper usages cleanups
- Simplification (e.g. argument reduction) of many driver codes
HD-audio
- Hotkey and LED support on HP laptops with Realtek codecs
- Dock station support on HP laptops
- Toshiba Satellite S50D fixup
- Enhanced wallclock timestamp handling for HD-audio
- Componentization to simplify the linkage between i915 and hd-audio
drivers for Intel HDMI/DP
USB-audio
- Akai MPC Element support
- Enhanced timestamp handling
ASoC
- Lots of refactoringin ASoC core, moving drivers to more data
driven initialization and rationalizing a lot of DAPM usage
- Much improved handling of CDCLK clocks on Samsung I2S controllers
- Lots of driver specific cleanups and feature improvements
- CODEC support for TI PCM514x and TLV320AIC3104 devices
- Board support for Tegra systems with Realtek RT5677
- New driver for Maxim max98357a
- More enhancements / fixes for Intel SST driver
Others
- Promotion of LINE6 driver from staging along with lots of rewrites
and cleanups
- DT support for old non-ASoC atmel driver
- oxygen cleanups, XIO2001 init, Studio Evolution SE6x support
- Emu8000 DRAM size detection fix on ISA(!!) AWE64 boards
- A few more ak411x fixes for ice1724 boards
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJU2ySIAAoJEGwxgFQ9KSmk/YMP/1v0r60aDt6VxlTbadt008R/
jTEIzD4oGEMkhFQdlmN8MegZlx+05vxQUCGrVy8PLelhfy/mnj6z/iUt9ohE1PqK
530eVr5FnlAbHs1JzP8Tm8Xbbtk8RXt5uvgohJvt7HBrc0Def9N/w37fUQ0ytO+s
Ot/0Xm8BNsdJ90DfMVLc0Ok9cAFn4Z70gylE/PuGxbBBzxQh8PYPXtJ6Q/s5lKLV
QC7VitJa0H7vsFYb+Ve7GU4cKMTt8uEPw8CdnQbDwb63ia93iWJJrlqKVUWYF2Gu
K+mX5Igdb88ToXbMPrLKXe73IfFcdpWNTbj8IAv+Rp9fArylzz+3GAYmrqTAdare
JEE5qAZTtJZEeD2vgNCnA4JpSbRzL0bHrEow21LnPONq3V9FB044NAeMSx3dI4j1
fk+SnqrpJMtlCtgj2PuWzIcqRiJ25F/Qax/xFeZHo7FwLIBF7z5pLu9DP4CfUSXj
fDEcB9aNF2VirJkQdbhHaPqTYVf2rHQ/ebDpDHBwkqFe865IHlJ8g8MrHnAFInKN
jQlSTOqi9V3two53U1JIKcB6QcBH3vh60w2JsWsQadsr45YYQ/bvBHGYNpQ00C3U
rbDBANhAHaF/hFncNnOQDsH65FqHj/ZlBQRhzX0LqxN4K0DM1FqGcLf2k6u/pzZU
09+QlcIOOzN8lbvHR8Qx
=/84r
-----END PGP SIGNATURE-----
Merge tag 'sound-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound updates from Takashi Iwai:
"In this batch, you can find lots of cleanups through the whole
subsystem, as our good New Year's resolution. Lots of LOCs and
commits are about LINE6 driver that was promoted finally from staging
tree, and as usual, there've been widely spread ASoC changes.
Here some highlights:
ALSA core changes
- Embedding struct device into ALSA core structures
- sequencer core cleanups / fixes
- PCM msbits constraints cleanups / fixes
- New SNDRV_PCM_TRIGGER_DRAIN command
- PCM kerneldoc fixes, header cleanups
- PCM code cleanups using more standard codes
- Control notification ID fixes
Driver cleanups
- Cleanups of PCI PM callbacks
- Timer helper usages cleanups
- Simplification (e.g. argument reduction) of many driver codes
HD-audio
- Hotkey and LED support on HP laptops with Realtek codecs
- Dock station support on HP laptops
- Toshiba Satellite S50D fixup
- Enhanced wallclock timestamp handling for HD-audio
- Componentization to simplify the linkage between i915 and hd-audio
drivers for Intel HDMI/DP
USB-audio
- Akai MPC Element support
- Enhanced timestamp handling
ASoC
- Lots of refactoringin ASoC core, moving drivers to more data driven
initialization and rationalizing a lot of DAPM usage
- Much improved handling of CDCLK clocks on Samsung I2S controllers
- Lots of driver specific cleanups and feature improvements
- CODEC support for TI PCM514x and TLV320AIC3104 devices
- Board support for Tegra systems with Realtek RT5677
- New driver for Maxim max98357a
- More enhancements / fixes for Intel SST driver
Others
- Promotion of LINE6 driver from staging along with lots of rewrites
and cleanups
- DT support for old non-ASoC atmel driver
- oxygen cleanups, XIO2001 init, Studio Evolution SE6x support
- Emu8000 DRAM size detection fix on ISA(!!) AWE64 boards
- A few more ak411x fixes for ice1724 boards"
* tag 'sound-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (542 commits)
ALSA: line6: toneport: Use explicit type for firmware version
ALSA: line6: Use explicit type for serial number
ALSA: line6: Return EIO if read/write not successful
ALSA: line6: Return error if device not responding
ALSA: line6: Add delay before reading status
ASoC: Intel: Clean data after SST fw fetch
ALSA: hda - Add docking station support for another HP machine
ALSA: control: fix failure to return new numerical ID in 'replace' event data
ALSA: usb: update trigger timestamp on first non-zero URB submitted
ALSA: hda: read trigger_timestamp immediately after starting DMA
ALSA: pcm: allow for trigger_tstamp snapshot in .trigger
ALSA: pcm: don't override timestamp unconditionally
ALSA: off by one bug in snd_riptide_joystick_probe()
ASoC: rt5670: Set use_single_rw flag for regmap
ASoC: rt286: Add rt288 codec support
ASoC: max98357a: Fix build in !CONFIG_OF case
ASoC: Intel: fix platform_no_drv_owner.cocci warnings
ARM: dts: Switch Odroid X2/U2 to simple-audio-card
ARM: dts: Exynos4 and Odroid X2/U3 sound device nodes update
ALSA: control: fix failure to return numerical ID in 'add' event
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJU2NQQAAoJEAhfPr2O5OEV5BgQAIja/XsIgpeNhfN8kJ3GrdhL
Z+QRTcHNc6AWGm1dkI+YTl4B38/xLlmxhUYPKsDl19N7n1oKkqdUxYtLe1mLdecW
dvqMXMVBKQSCgyDP5sgZNHKlavEX1ZPTTtkrY8zYWaXbkcf4dOZyisbNQrmFdO3T
wt4zwaO8+ziCEYbotLsaI1VpEDKFZV6AVhKnLsWxc4ZoCnAqJbmA31jtANxrQ0tw
UgXRjJmf1uWrS+MWM5xFDi+v+FmZiUAHMJ5iksqWhp2pKj41geIqy7lAueytEN+Q
vQHZ9cfhnoF/7VrqDtqq5CaJZPKfA80PSxml9mbjc4wytvWLevoc4UxFtU+lohOf
YbM3nB5J3nAcq0bNF/cSpuYUoiGnK86FazuM6YAQy2CaucrVKALKHHmziWbK6gBv
1yA4qnDuRYKps3SQSQQKuNlv8dmcVTD/sVhf8EIx62son6xxeXf21nas61lw8k5P
lrUVH9nJxkwTkRJ7wMjlAZeh0pTyB/Ag1bSn81myziv0r4AsNyWJT5qxN8szmZDe
nXGIdQ1h5JkMQ0kCfhhLqgdIUwhx7dMXIlXcCfR/8a9uYm4StegPNCEZDybIi6co
8Ok3rPYt15PlrCyfMjXFOG/TYi/cZ/xIbffLbSFMOqnCUZElaA7RNpOnswNc9fc6
2WsY54Lb4ftC4bQ7hM90
=VH6m
-----END PGP SIGNATURE-----
Merge tag 'media/v3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media updates from Mauro Carvalho Chehab:
- Some documentation updates and a few new pixel formats
- Stop btcx-risc abuse by cx88 and move it to bt8xx driver
- New platform driver: am437x
- New webcam driver: toptek
- New remote controller hardware protocols added to img-ir driver
- Removal of a few very old drivers that relies on old kABIs and are
for very hard to find hardware: parallel port webcam drivers
(bw-qcam, c-cam, pms and w9966), tlg2300, Video In/Out for SGI (vino)
- Removal of the USB Telegent driver (tlg2300). The company that
developed this driver has long gone and the hardware is hard to find.
As it relies on a legacy set of kABI symbols and nobody seems to care
about it, remove it.
- several improvements at rtl2832 driver
- conversion on cx28521 and au0828 to use videobuf2 (VB2)
- several improvements, fixups and board additions
* tag 'media/v3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (321 commits)
[media] dvb_net: Convert local hex dump to print_hex_dump_debug
[media] dvb_net: Use standard debugging facilities
[media] dvb_net: Use vsprintf %pM extension to print Ethernet addresses
[media] staging: lirc_serial: adjust boolean assignments
[media] stb0899: use sign_extend32() for sign extension
[media] si2168: add support for 1.7MHz bandwidth
[media] si2168: return error if set_frontend is called with invalid parameters
[media] lirc_dev: avoid potential null-dereference
[media] mn88472: simplify bandwidth registers setting code
[media] dvb: tc90522: re-add symbol-rate report
[media] lmedm04: add read snr, signal strength and ber call backs
[media] lmedm04: Create frontend call back for read status
[media] lmedm04: create frontend callbacks for signal/snr/ber/ucblocks
[media] lmedm04: Fix usb_submit_urb BOGUS urb xfer, pipe 1 != type 3 in interrupt urb
[media] lmedm04: Increase Interupt due time to 200 msec
[media] cx88-dvb: whitespace cleanup
[media] rtl28xxu: properly initialize pdata
[media] rtl2832: declare functions as static
[media] rtl2830: declare functions as static
[media] rtl2832_sdr: add kernel-doc comments for platform_data
...
This patch is based on exynos-drm-next branch of Inki Dae's tree at:
git://git.kernel.org/pub/scm/linux/kernel/git/daeinki/drm-exynos.git
DECON(Display and Enhancement Controller) is the new IP
in exynos7 SOC for generating video signals using pixel data.
DECON driver can be used to drive 2 different interfaces on Exynos7:
DECON-INT(video controller) and DECON-EXT(Mixer for HDMI)
The existing FIMD driver code was used as a template to create
DECON driver. Only DECON-INT is supported as of now, and
DECON-EXT support will be added later.
The current version of the driver supports video mode displays.
Changelog v2:
- Change config name, DRM_EXYNOS_DECON to DRM_EXYNOS7_DECON.
Signed-off-by: Akshu Agrawal <akshua@gmail.com>
Signed-off-by: Ajay Kumar <ajaykumar.rs@samsung.com>
Signed-off-by: Inki Dae <inki.dae@samsung.com>
Disappointing, as this was kind of neat (especially getting to use RCU
to manage the address -> eventfd mapping). But now the devices are PCI
handled in userspace, we get rid of both the NOTIFY hypercall and
the interface to connect an eventfd.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We use the ptrace API struct, and we currently don't let them set
anything but the normal registers (we'd have to filter the others).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Flushing out my drm-misc queue with a few oddball things all over.
* tag 'topic/drm-misc-2015-02-06' of git://anongit.freedesktop.org/drm-intel:
drm: Use static attribute groups for managing connector sysfs entries
drm: remove DRM_FORMAT_NV12MT
drm/modes: Print the mode status in human readable form
drm/irq: Don't disable vblank interrupts when already disabled
The VIRTIO_F_ANY_LAYOUT and VIRTIO_F_NOTIFY_ON_EMPTY features are pre-1.0
only.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
This allows modern implementations to ensure they don't use legacy
feature bits or SCSI commands (which are not used in v1.0 non-legacy).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
This provides backdoor access to the device MMIOs, and every device should
have one. From the virtio 1.0 spec (CS03):
4.1.4.7.1 Device Requirements: PCI configuration access capability
The device MUST present at least one VIRTIO_PCI_CAP_PCI_CFG capability.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Pull networking updates from David Miller:
1) More iov_iter conversion work from Al Viro.
[ The "crypto: switch af_alg_make_sg() to iov_iter" commit was
wrong, and this pull actually adds an extra commit on top of the
branch I'm pulling to fix that up, so that the pre-merge state is
ok. - Linus ]
2) Various optimizations to the ipv4 forwarding information base trie
lookup implementation. From Alexander Duyck.
3) Remove sock_iocb altogether, from CHristoph Hellwig.
4) Allow congestion control algorithm selection via routing metrics.
From Daniel Borkmann.
5) Make ipv4 uncached route list per-cpu, from Eric Dumazet.
6) Handle rfs hash collisions more gracefully, also from Eric Dumazet.
7) Add xmit_more support to r8169, e1000, and e1000e drivers. From
Florian Westphal.
8) Transparent Ethernet Bridging support for GRO, from Jesse Gross.
9) Add BPF packet actions to packet scheduler, from Jiri Pirko.
10) Add support for uniqu flow IDs to openvswitch, from Joe Stringer.
11) New NetCP ethernet driver, from Muralidharan Karicheri and Wingman
Kwok.
12) More sanely handle out-of-window dupacks, which can result in
serious ACK storms. From Neal Cardwell.
13) Various rhashtable bug fixes and enhancements, from Herbert Xu,
Patrick McHardy, and Thomas Graf.
14) Support xmit_more in be2net, from Sathya Perla.
15) Group Policy extensions for vxlan, from Thomas Graf.
16) Remove Checksum Offload support for vxlan, from Tom Herbert.
17) Like ipv4, support lockless transmit over ipv6 UDP sockets. From
Vlad Yasevich.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1494+1 commits)
crypto: fix af_alg_make_sg() conversion to iov_iter
ipv4: Namespecify TCP PMTU mechanism
i40e: Fix for stats init function call in Rx setup
tcp: don't include Fast Open option in SYN-ACK on pure SYN-data
openvswitch: Only set TUNNEL_VXLAN_OPT if VXLAN-GBP metadata is set
ipv6: Make __ipv6_select_ident static
ipv6: Fix fragment id assignment on LE arches.
bridge: Fix inability to add non-vlan fdb entry
net: Mellanox: Delete unnecessary checks before the function call "vunmap"
cxgb4: Add support in cxgb4 to get expansion rom version via ethtool
ethtool: rename reserved1 memeber in ethtool_drvinfo for expansion ROM version
net: dsa: Remove redundant phy_attach()
IB/mlx4: Reset flow support for IB kernel ULPs
IB/mlx4: Always use the correct port for mirrored multicast attachments
net/bonding: Fix potential bad memory access during bonding events
tipc: remove tipc_snprintf
tipc: nl compat add noop and remove legacy nl framework
tipc: convert legacy nl stats show to nl compat
tipc: convert legacy nl net id get to nl compat
tipc: convert legacy nl net id set to nl compat
...
Pull trivial tree changes from Jiri Kosina:
"Patches from trivial.git that keep the world turning around.
Mostly documentation and comment fixes, and a two corner-case code
fixes from Alan Cox"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
kexec, Kconfig: spell "architecture" properly
mm: fix cleancache debugfs directory path
blackfin: mach-common: ints-priority: remove unused function
doubletalk: probe failure causes OOPS
ARM: cache-l2x0.c: Make it clear that cache-l2x0 handles L310 cache controller
msdos_fs.h: fix 'fields' in comment
scsi: aic7xxx: fix comment
ARM: l2c: fix comment
ibmraid: fix writeable attribute with no store method
dynamic_debug: fix comment
doc: usbmon: fix spelling s/unpriviledged/unprivileged/
x86: init_mem_mapping(): use capital BIOS in comment
Pull live patching infrastructure from Jiri Kosina:
"Let me provide a bit of history first, before describing what is in
this pile.
Originally, there was kSplice as a standalone project that implemented
stop_machine()-based patching for the linux kernel. This project got
later acquired, and the current owner is providing live patching as a
proprietary service, without any intentions to have their
implementation merged.
Then, due to rising user/customer demand, both Red Hat and SUSE
started working on their own implementation (not knowing about each
other), and announced first versions roughly at the same time [1] [2].
The principle difference between the two solutions is how they are
making sure that the patching is performed in a consistent way when it
comes to different execution threads with respect to the semantic
nature of the change that is being introduced.
In a nutshell, kPatch is issuing stop_machine(), then looking at
stacks of all existing processess, and if it decides that the system
is in a state that can be patched safely, it proceeds insterting code
redirection machinery to the patched functions.
On the other hand, kGraft provides a per-thread consistency during one
single pass of a process through the kernel and performs a lazy
contignuous migration of threads from "unpatched" universe to the
"patched" one at safe checkpoints.
If interested in a more detailed discussion about the consistency
models and its possible combinations, please see the thread that
evolved around [3].
It pretty quickly became obvious to the interested parties that it's
absolutely impractical in this case to have several isolated solutions
for one task to co-exist in the kernel. During a dedicated Live
Kernel Patching track at LPC in Dusseldorf, all the interested parties
sat together and came up with a joint aproach that would work for both
distro vendors. Steven Rostedt took notes [4] from this meeting.
And the foundation for that aproach is what's present in this pull
request.
It provides a basic infrastructure for function "live patching" (i.e.
code redirection), including API for kernel modules containing the
actual patches, and API/ABI for userspace to be able to operate on the
patches (look up what patches are applied, enable/disable them, etc).
It's relatively simple and minimalistic, as it's making use of
existing kernel infrastructure (namely ftrace) as much as possible.
It's also self-contained, in a sense that it doesn't hook itself in
any other kernel subsystem (it doesn't even touch any other code).
It's now implemented for x86 only as a reference architecture, but
support for powerpc, s390 and arm is already in the works (adding
arch-specific support basically boils down to teaching ftrace about
regs-saving).
Once this common infrastructure gets merged, both Red Hat and SUSE
have agreed to immediately start porting their current solutions on
top of this, abandoning their out-of-tree code. The plan basically is
that each patch will be marked by flag(s) that would indicate which
consistency model it is willing to use (again, the details have been
sketched out already in the thread at [3]).
Before this happens, the current codebase can be used to patch a large
group of secruity/stability problems the patches for which are not too
complex (in a sense that they don't introduce non-trivial change of
function's return value semantics, they don't change layout of data
structures, etc) -- this corresponds to LEAVE_FUNCTION &&
SWITCH_FUNCTION semantics described at [3].
This tree has been in linux-next since December.
[1] https://lkml.org/lkml/2014/4/30/477
[2] https://lkml.org/lkml/2014/7/14/857
[3] https://lkml.org/lkml/2014/11/7/354
[4] http://linuxplumbersconf.org/2014/wp-content/uploads/2014/10/LPC2014_LivePatching.txt
[ The core code is introduced by the three commits authored by Seth
Jennings, which got a lot of changes incorporated during numerous
respins and reviews of the initial implementation. All the followup
commits have materialized only after public tree has been created,
so they were not folded into initial three commits so that the
public tree doesn't get rebased ]"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
livepatch: add missing newline to error message
livepatch: rename config to CONFIG_LIVEPATCH
livepatch: fix uninitialized return value
livepatch: support for repatching a function
livepatch: enforce patch stacking semantics
livepatch: change ARCH_HAVE_LIVE_PATCHING to HAVE_LIVE_PATCHING
livepatch: fix deferred module patching order
livepatch: handle ancient compilers with more grace
livepatch: kconfig: use bool instead of boolean
livepatch: samples: fix usage example comments
livepatch: MAINTAINERS: add git tree location
livepatch: use FTRACE_OPS_FL_IPMODIFY
livepatch: move x86 specific ftrace handler code to arch/x86
livepatch: samples: add sample live patching module
livepatch: kernel: add support for live patching
livepatch: kernel: add TAINT_LIVEPATCH
Pull HID updates from Jiri Kosina:
"Updates for HID code
- improveements of Logitech HID++ procotol implementation, from
Benjamin Tissoires
- support for composite RMI devices, from Andrew Duggan
- new driver for BETOP controller, from Huang Bo
- fixup for conflicting mapping in HID core between PC-101/103/104
and PC-102/105 keyboards from David Herrmann
- new hardware support and fixes in Wacom driver, from Ping Cheng
- assorted small fixes and device ID additions all over the place"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (33 commits)
HID: wacom: add support for Cintiq 27QHD and 27QHD touch
HID: wacom: consolidate input capability settings for pen and touch
HID: wacom: make sure touch arbitration is applied consistently
HID: pidff: Fix initialisation forMicrosoft Sidewinder FF Pro 2
HID: hyperv: match wait_for_completion_timeout return type
HID: wacom: Report ABS_MISC event for Cintiq Companion Hybrid
HID: Use Kbuild idiom in Makefiles
HID: do not bind to Microchip Pick16F1454
HID: hid-lg4ff: use DEVICE_ATTR_RW macro
HID: hid-lg4ff: fix sysfs attribute permission
HID: wacom: peport In Range event according to the spec
HID: wacom: process invalid Cintiq and Intuos data in wacom_intuos_inout()
HID: rmi: Add support for the touchpad in the Razer Blade 14 laptop
HID: rmi: Support touchpads with external buttons
HID: rmi: Use hid_report_len to compute the size of reports
HID: logitech-hidpp: store the name of the device in struct hidpp
HID: microsoft: add support for Japanese Surface Type Cover 3
HID: fixup the conflicting keyboard mappings quirk
HID: apple: fix battery support for the 2009 ANSI wireless keyboard
HID: fix Kconfig text
...
Merge misc updates from Andrew Morton:
"Bite-sized chunks this time, to avoid the MTA ratelimiting woes.
- fs/notify updates
- ocfs2
- some of MM"
That laconic "some MM" is mainly the removal of remap_file_pages(),
which is a big simplification of the VM, and which gets rid of a *lot*
of random cruft and special cases because we no longer support the
non-linear mappings that it used.
From a user interface perspective, nothing has changed, because the
remap_file_pages() syscall still exists, it's just done by emulating the
old behavior by creating a lot of individual small mappings instead of
one non-linear one.
The emulation is slower than the old "native" non-linear mappings, but
nobody really uses or cares about remap_file_pages(), and simplifying
the VM is a big advantage.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (78 commits)
memcg: zap memcg_slab_caches and memcg_slab_mutex
memcg: zap memcg_name argument of memcg_create_kmem_cache
memcg: zap __memcg_{charge,uncharge}_slab
mm/page_alloc.c: place zone_id check before VM_BUG_ON_PAGE check
mm: hugetlb: fix type of hugetlb_treat_as_movable variable
mm, hugetlb: remove unnecessary lower bound on sysctl handlers"?
mm: memory: merge shared-writable dirtying branches in do_wp_page()
mm: memory: remove ->vm_file check on shared writable vmas
xtensa: drop _PAGE_FILE and pte_file()-related helpers
x86: drop _PAGE_FILE and pte_file()-related helpers
unicore32: drop pte_file()-related helpers
um: drop _PAGE_FILE and pte_file()-related helpers
tile: drop pte_file()-related helpers
sparc: drop pte_file()-related helpers
sh: drop _PAGE_FILE and pte_file()-related helpers
score: drop _PAGE_FILE and pte_file()-related helpers
s390: drop pte_file()-related helpers
parisc: drop _PAGE_FILE and pte_file()-related helpers
openrisc: drop _PAGE_FILE and pte_file()-related helpers
nios2: drop _PAGE_FILE and pte_file()-related helpers
...
Pull quota interface unification and misc cleanups from Jan Kara:
"The first part of the series unifying XFS and VFS quota interfaces.
This part unifies turning quotas on and off so quota-tools and
xfs_quota can be used to manage any filesystem. This is useful so
that userspace doesn't have to distinguish which filesystem it is
working with. As a result we can then easily reuse tests for project
quotas in XFS for ext4.
This also contains minor cleanups and fixes for udf, isofs, and ext3"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (23 commits)
udf: remove bool assignment to 0/1
udf: use bool for done
quota: Store maximum space limit in bytes
quota: Remove quota_on_meta callback
ocfs2: Use generic helpers for quotaon and quotaoff
ext4: Use generic helpers for quotaon and quotaoff
quota: Add ->quota_{enable,disable} callbacks for VFS quotas
quota: Wire up ->quota_{enable,disable} callbacks into Q_QUOTA{ON,OFF}
quota: Split ->set_xstate callback into two
xfs: Remove some pointless quota checks
xfs: Remove some useless flags tests
xfs: Remove useless test
quota: Verify flags passed to Q_SETINFO
quota: Cleanup flags definitions
ocfs2: Move OLQF_CLEAN flag out of generic quota flags
quota: Don't store flags for v2 quota format
jbd: drop jbd_ENOSYS debug
udf: destroy sbi mutex in put_super
udf: Check length of extended attributes and allocation descriptors
udf: Remove repeated loads blocksize
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJU1MYmAAoJEAAOaEEZVoIV/rAQAKoHj/PCOATTy05lF/NDhJlS
6NbNjupnC8HrbNPv6Z/cQ902eC1YRVH96gf6we4FeAm9Tjctpje6uEqvPQCUxpot
2jWgCG+g95OeEaQEjXQvR3x5ZfXvPUtwKVOnMF423L1p5Xfbj3kJfGi+dv2k8XOi
GArsUB7uCwqLyyz+L47RJ2Cz7s47M9O25HkVRfWlgYOv+4afq5OpADGKQAhMLL/s
CPhYgqw/7r1p+pLkjUE/x+5BAliDzUinFtDatgD4CeHOdq0RKlxzQ1rFg6uJVg/k
3ZttGOxWUtGIeGM4v5cosDFReLPCESax/TUzn58jxxFR702MjHAA+lHRgjZoWvW/
9EnShl0XlznQX1ns6f0rI1seWe4M5R3CWus8AcG0kDmdbTp8nARo+pBLFhCME/kZ
15GHLz4tDSRt5SNow6aqJdlYJR7p3WrsceKyM5aH9M7odM3eaB5vJxIJ0fljsZbS
Qtz4t+Ua1oVSYD7TX3y7EUiQVPVo8VKS3o6Ua73wCHIXNbSH7hZLOvPLFs6V1Psi
RKqRiad5iO3+iavVGuDDcs12zXZ5hmksE8oMh0NkjFZ6wJlO4Hf5iOt5thABNDmT
Km+40IBq1DYwclPTofaRpB+ytDOnWedMxdWfWdEWQ710zuuNY3cfi/XMXEX34kBY
fLhUMabqcyfUegpA6S0R
=6+UV
-----END PGP SIGNATURE-----
Merge tag 'locks-v3.20-1' of git://git.samba.org/jlayton/linux
Pull file locking related changes #1 from Jeff Layton:
"This patchset contains a fairly major overhaul of how file locks are
tracked within the inode. Rather than a single list, we now create a
per-inode "lock context" that contains individual lists for the file
locks, and a new dedicated spinlock for them.
There are changes in other trees that are based on top of this set so
it may be easiest to pull this in early"
* tag 'locks-v3.20-1' of git://git.samba.org/jlayton/linux:
locks: update comments that refer to inode->i_flock
locks: consolidate NULL i_flctx checks in locks_remove_file
locks: keep a count of locks on the flctx lists
locks: clean up the lm_change prototype
locks: add a dedicated spinlock to protect i_flctx lists
locks: remove i_flock field from struct inode
locks: convert lease handling to file_lock_context
locks: convert posix locks to file_lock_context
locks: move flock locks to file_lock_context
ceph: move spinlocking into ceph_encode_locks_to_buffer and ceph_count_locks
locks: add a new struct file_locking_context pointer to struct inode
locks: have locks_release_file use flock_lock_file to release generic flock locks
locks: add new struct list_head to struct file_lock
- Rework of the core ACPI resources parsing code to fix issues
in it and make using resource offsets more convenient and
consolidation of some resource-handing code in a couple of places
that have grown analagous data structures and code to cover the
the same gap in the core (Jiang Liu, Thomas Gleixner, Lv Zheng).
- ACPI-based IOAPIC hotplug support on top of the resources handling
rework (Jiang Liu, Yinghai Lu).
- ACPICA update to upstream release 20150204 including an interrupt
handling rework that allows drivers to install raw handlers for
ACPI GPEs which then become entirely responsible for the given GPE
and the ACPICA core code won't touch it (Lv Zheng, David E Box,
Octavian Purdila).
- ACPI EC driver rework to fix several concurrency issues and other
problems related to events handling on top of the ACPICA's new
support for raw GPE handlers (Lv Zheng).
- New ACPI driver for AMD SoCs analogous to the LPSS (Low-Power
Subsystem) driver for Intel chips (Ken Xue).
- Two minor fixes of the ACPI LPSS driver (Heikki Krogerus,
Jarkko Nikula).
- Two new blacklist entries for machines (Samsung 730U3E/740U3E and
510R) where the native backlight interface doesn't work correctly
while the ACPI one does (Hans de Goede).
- Rework of the ACPI processor driver's handling of idle states
to make the code more straightforward and less bloated overall
(Rafael J Wysocki).
- Assorted minor fixes related to ACPI and SFI (Andreas Ruprecht,
Andy Shevchenko, Hanjun Guo, Jan Beulich, Rafael J Wysocki,
Yaowei Bai).
- PCI core power management modification to avoid resuming (some)
runtime-suspended devices during system suspend if they are in
the right states already (Rafael J Wysocki).
- New SFI-based cpufreq driver for Intel platforms using SFI
(Srinidhi Kasagar).
- cpufreq core fixes, cleanups and simplifications (Viresh Kumar,
Doug Anderson, Wolfram Sang).
- SkyLake CPU support and other updates for the intel_pstate driver
(Kristen Carlson Accardi, Srinivas Pandruvada).
- cpufreq-dt driver cleanup (Markus Elfring).
- Init fix for the ARM big.LITTLE cpuidle driver (Sudeep Holla).
- Generic power domains core code fixes and cleanups (Ulf Hansson).
- Operating Performance Points (OPP) core code cleanups and kernel
documentation update (Nishanth Menon).
- New dabugfs interface to make the list of PM QoS constraints
available to user space (Nishanth Menon).
- New devfreq driver for Tegra Activity Monitor (Tomeu Vizoso).
- New devfreq class (devfreq_event) to provide raw utilization data
to devfreq governors (Chanwoo Choi).
- Assorted minor fixes and cleanups related to power management
(Andreas Ruprecht, Krzysztof Kozlowski, Rickard Strandqvist,
Pavel Machek, Todd E Brandt, Wonhong Kwon).
- turbostat updates (Len Brown) and cpupower Makefile improvement
(Sriram Raghunathan).
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJU2neOAAoJEILEb/54YlRx51QP/jrv1Wb5eMaemzMksPIWI5Zn
I8IbxzToxu7wDDsrTBRv+LuyllMPrnppFOHHvB35gUYu7Y6I066s3ErwuqeFlbmy
+VicmyGMahv3yN74qg49MXzWtaJZa8hrFXn8ItujiUIcs08yELi0vBQFlZImIbTB
PdQngO88VfiOVjDvmKkYUU//9Sc9LCU0ZcdUQXSnA1oNOxuUHjiARz98R03hhSqu
BWR+7M0uaFbu6XeK+BExMXJTpKicIBZ1GAF6hWrS8V4aYg+hH1cwjf2neDAzZkcU
UkXieJlLJrCq+ZBNcy7WEhkWQkqJNWei5WYiy6eoQeQpNoliY2V+2OtSMJaKqDye
PIiMwXstyDc5rgyULN0d1UUzY6mbcUt2rOL0VN2bsFVIJ1HWCq8mr8qq689pQUYv
tcH18VQ2/6r2zW28sTO/ByWLYomklD/Y6bw2onMhGx3Knl0D8xYJKapVnTGhr5eY
d4k41ybHSWNKfXsZxdJc+RxndhPwj9rFLfvY/CZEhLcW+2pAiMarRDOPXDoUI7/l
aJpmPzy/6mPXGBnTfr6jKDSY3gXNazRIvfPbAdiGayKcHcdRM4glbSbNH0/h1Iq6
HKa8v9Fx87k1X5r4ZbhiPdABWlxuKDiM7725rfGpvjlWC3GNFOq7YTVMOuuBA225
Mu9PRZbOsZsnyNkixBpX
=zZER
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management updates from Rafael Wysocki:
"We have a few new features this time, including a new SFI-based
cpufreq driver, a new devfreq driver for Tegra Activity Monitor, a new
devfreq class for providing its governors with raw utilization data
and a new ACPI driver for AMD SoCs.
Still, the majority of changes here are reworks of existing code to
make it more straightforward or to prepare it for implementing new
features on top of it. The primary example is the rework of ACPI
resources handling from Jiang Liu, Thomas Gleixner and Lv Zheng with
support for IOAPIC hotplug implemented on top of it, but there is
quite a number of changes of this kind in the cpufreq core, ACPICA,
ACPI EC driver, ACPI processor driver and the generic power domains
core code too.
The most active developer is Viresh Kumar with his cpufreq changes.
Specifics:
- Rework of the core ACPI resources parsing code to fix issues in it
and make using resource offsets more convenient and consolidation
of some resource-handing code in a couple of places that have grown
analagous data structures and code to cover the the same gap in the
core (Jiang Liu, Thomas Gleixner, Lv Zheng).
- ACPI-based IOAPIC hotplug support on top of the resources handling
rework (Jiang Liu, Yinghai Lu).
- ACPICA update to upstream release 20150204 including an interrupt
handling rework that allows drivers to install raw handlers for
ACPI GPEs which then become entirely responsible for the given GPE
and the ACPICA core code won't touch it (Lv Zheng, David E Box,
Octavian Purdila).
- ACPI EC driver rework to fix several concurrency issues and other
problems related to events handling on top of the ACPICA's new
support for raw GPE handlers (Lv Zheng).
- New ACPI driver for AMD SoCs analogous to the LPSS (Low-Power
Subsystem) driver for Intel chips (Ken Xue).
- Two minor fixes of the ACPI LPSS driver (Heikki Krogerus, Jarkko
Nikula).
- Two new blacklist entries for machines (Samsung 730U3E/740U3E and
510R) where the native backlight interface doesn't work correctly
while the ACPI one does (Hans de Goede).
- Rework of the ACPI processor driver's handling of idle states to
make the code more straightforward and less bloated overall (Rafael
J Wysocki).
- Assorted minor fixes related to ACPI and SFI (Andreas Ruprecht,
Andy Shevchenko, Hanjun Guo, Jan Beulich, Rafael J Wysocki, Yaowei
Bai).
- PCI core power management modification to avoid resuming (some)
runtime-suspended devices during system suspend if they are in the
right states already (Rafael J Wysocki).
- New SFI-based cpufreq driver for Intel platforms using SFI
(Srinidhi Kasagar).
- cpufreq core fixes, cleanups and simplifications (Viresh Kumar,
Doug Anderson, Wolfram Sang).
- SkyLake CPU support and other updates for the intel_pstate driver
(Kristen Carlson Accardi, Srinivas Pandruvada).
- cpufreq-dt driver cleanup (Markus Elfring).
- Init fix for the ARM big.LITTLE cpuidle driver (Sudeep Holla).
- Generic power domains core code fixes and cleanups (Ulf Hansson).
- Operating Performance Points (OPP) core code cleanups and kernel
documentation update (Nishanth Menon).
- New dabugfs interface to make the list of PM QoS constraints
available to user space (Nishanth Menon).
- New devfreq driver for Tegra Activity Monitor (Tomeu Vizoso).
- New devfreq class (devfreq_event) to provide raw utilization data
to devfreq governors (Chanwoo Choi).
- Assorted minor fixes and cleanups related to power management
(Andreas Ruprecht, Krzysztof Kozlowski, Rickard Strandqvist, Pavel
Machek, Todd E Brandt, Wonhong Kwon).
- turbostat updates (Len Brown) and cpupower Makefile improvement
(Sriram Raghunathan)"
* tag 'pm+acpi-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (151 commits)
tools/power turbostat: relax dependency on APERF_MSR
tools/power turbostat: relax dependency on invariant TSC
Merge branch 'pci/host-generic' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci into acpi-resources
tools/power turbostat: decode MSR_*_PERF_LIMIT_REASONS
tools/power turbostat: relax dependency on root permission
ACPI / video: Add disable_native_backlight quirk for Samsung 510R
ACPI / PM: Remove unneeded nested #ifdef
USB / PM: Remove unneeded #ifdef and associated dead code
intel_pstate: provide option to only use intel_pstate with HWP
ACPI / EC: Add GPE reference counting debugging messages
ACPI / EC: Add query flushing support
ACPI / EC: Refine command storm prevention support
ACPI / EC: Add command flushing support.
ACPI / EC: Introduce STARTED/STOPPED flags to replace BLOCKED flag
ACPI: add AMD ACPI2Platform device support for x86 system
ACPI / table: remove duplicate NULL check for the handler of acpi_table_parse()
ACPI / EC: Update revision due to raw handler mode.
ACPI / EC: Reduce ec_poll() by referencing the last register access timestamp.
ACPI / EC: Fix several GPE handling issues by deploying ACPI_GPE_DISPATCH_RAW_HANDLER mode.
ACPICA: Events: Enable APIs to allow interrupt/polling adaptive request based GPE handling model
...
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of passing the name of the memory cgroup which the cache is
created for in the memcg_name_argument, let's obtain it immediately in
memcg_create_kmem_cache.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
They are simple wrappers around memcg_{charge,uncharge}_kmem, so let's
zap them and call these functions directly.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
One bit in ->vm_flags is unused now!
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After removing vma->shared.nonlinear we have only one member of
vma->shared union, which doesn't make much sense.
This patch drops the union and move struct vma->shared.linear to
vma->shared.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We don't create non-linear mappings anymore. Let's drop code which
handles them in rmap.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We don't create non-linear mappings anymore. Let's drop code which
handles them on page fault.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have remap_file_pages(2) emulation in -mm tree for few release cycles
and we plan to have it mainline in v3.20. This patchset removes rest of
VM_NONLINEAR infrastructure.
Patches 1-8 take care about generic code. They are pretty
straight-forward and can be applied without other of patches.
Rest patches removes pte_file()-related stuff from architecture-specific
code. It usually frees up one bit in non-present pte. I've tried to reuse
that bit for swap offset, where I was able to figure out how to do that.
For obvious reason I cannot test all that arch-specific code and would
like to see acks from maintainers.
In total, remap_file_pages(2) required about 1.4K lines of not-so-trivial
kernel code. That's too much for functionality nobody uses.
Tested-by: Felipe Balbi <balbi@ti.com>
This patch (of 38):
We don't create non-linear mappings anymore. Let's drop code which
handles them on unmap/zap.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
remap_file_pages(2) was invented to be able efficiently map parts of
huge file into limited 32-bit virtual address space such as in database
workloads.
Nonlinear mappings are pain to support and it seems there's no
legitimate use-cases nowadays since 64-bit systems are widely available.
Let's drop it and get rid of all these special-cased code.
The patch replaces the syscall with emulation which creates new VMA on
each remap_file_pages(), unless they it can be merged with an adjacent
one.
I didn't find *any* real code that uses remap_file_pages(2) to test
emulation impact on. I've checked Debian code search and source of all
packages in ALT Linux. No real users: libc wrappers, mentions in
strace, gdb, valgrind and this kind of stuff.
There are few basic tests in LTP for the syscall. They work just fine
with emulation.
To test performance impact, I've written small test case which
demonstrate pretty much worst case scenario: map 4G shmfs file, write to
begin of every page pgoff of the page, remap pages in reverse order,
read every page.
The test creates 1 million of VMAs if emulation is in use, so I had to
set vm.max_map_count to 1100000 to avoid -ENOMEM.
Before: 23.3 ( +- 4.31% ) seconds
After: 43.9 ( +- 0.85% ) seconds
Slowdown: 1.88x
I believe we can live with that.
Test case:
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#define MB (1024UL * 1024)
#define SIZE (4096 * MB)
int main(int argc, char **argv)
{
unsigned long *p;
long i, pass;
for (pass = 0; pass < 10; pass++) {
p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
if (p == MAP_FAILED) {
perror("mmap");
return -1;
}
for (i = 0; i < SIZE / 4096; i++)
p[i * 4096 / sizeof(*p)] = i;
for (i = 0; i < SIZE / 4096; i++) {
if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
perror("remap_file_pages");
return -1;
}
}
for (i = SIZE / 4096 - 1; i >= 0; i--)
assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);
munmap(p, SIZE);
}
return 0;
}
[akpm@linux-foundation.org: fix spello]
[sasha.levin@oracle.com: initialize populate before usage]
[sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
Signed-off-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Armin Rigo <arigo@tunes.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
compound_head() is implemented with assumption that there would be race
condition when checking tail flag. This assumption is only true when we
try to access arbitrary positioned struct page.
The situation that virt_to_head_page() is called is different case. We
call virt_to_head_page() only in the range of allocated pages, so there
is no race condition on tail flag. In this case, we don't need to
handle race condition and we can reduce overhead slightly. This patch
implements compound_head_fast() which is similar with compound_head()
except tail flag race handling. And then, virt_to_head_page() uses this
optimized function to improve performance.
I saw 1.8% win in a fast-path loop over kmem_cache_alloc/free, (14.063
ns -> 13.810 ns) if target object is on tail page.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit e9fd702a58 ("audit: convert audit watches to use fsnotify
instead of inotify") broke handling of renames in audit. Audit code
wants to update inode number of an inode corresponding to watched name
in a directory. When something gets renamed into a directory to a
watched name, inotify previously passed moved inode to audit code
however new fsnotify code passes directory inode where the change
happened. That confuses audit and it starts watching parent directory
instead of a file in a directory.
This can be observed for example by doing:
cd /tmp
touch foo bar
auditctl -w /tmp/foo
touch foo
mv bar foo
touch foo
In audit log we see events like:
type=CONFIG_CHANGE msg=audit(1423563584.155:90): auid=1000 ses=2 op="updated rules" path="/tmp/foo" key=(null) list=4 res=1
...
type=PATH msg=audit(1423563584.155:91): item=2 name="bar" inode=1046884 dev=08:0 2 mode=0100644 ouid=0 ogid=0 rdev=00:00 nametype=DELETE
type=PATH msg=audit(1423563584.155:91): item=3 name="foo" inode=1046842 dev=08:0 2 mode=0100644 ouid=0 ogid=0 rdev=00:00 nametype=DELETE
type=PATH msg=audit(1423563584.155:91): item=4 name="foo" inode=1046884 dev=08:0 2 mode=0100644 ouid=0 ogid=0 rdev=00:00 nametype=CREATE
...
and that's it - we see event for the first touch after creating the
audit rule, we see events for rename but we don't see any event for the
last touch. However we start seeing events for unrelated stuff
happening in /tmp.
Fix the problem by passing moved inode as data in the FS_MOVED_FROM and
FS_MOVED_TO events instead of the directory where the change happens.
This doesn't introduce any new problems because noone besides
audit_watch.c cares about the passed value:
fs/notify/fanotify/fanotify.c cares only about FSNOTIFY_EVENT_PATH events.
fs/notify/dnotify/dnotify.c doesn't care about passed 'data' value at all.
fs/notify/inotify/inotify_fsnotify.c uses 'data' only for FSNOTIFY_EVENT_PATH.
kernel/audit_tree.c doesn't care about passed 'data' at all.
kernel/audit_watch.c expects moved inode as 'data'.
Fixes: e9fd702a58 ("audit: convert audit watches to use fsnotify instead of inotify")
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Reworked handling for foreign (grant mapped) pages to simplify the
code, enable a number of additional use cases and fix a number of
long-standing bugs.
- Prefer the TSC over the Xen PV clock when dom0 (and the TSC is
stable).
- Assorted other cleanup and minor bug fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJU2JC+AAoJEFxbo/MsZsTRIvAH/1lgQ0EQlxaZtEFWY8cJBzxY
dXaTMfyGQOddGYDCW0r42hhXJHeX7DWXSERSD3aW9DZOn/eYdneHq9gWRD4uPrGn
hEFQ26J4jZWR5riGXaja0LqI2gJKLZ6BhHIQciLEbY+jw4ynkNBLNRPFehuwrCsZ
WdBwJkyvXC3RErekncRl/aNhxdi4p1P6qeiaW/mo3UcSO/CFSKybOLwT65iePazg
XuY9UiTn2+qcRkm/tjx8K9heHK8SBEGNWuoTcWYF1to8mwwUfKIAc4NO2UBDXJI+
rp7Z2lVFdII15JsQ08ATh3t7xDrMWLzCX/y4jCzmF3DBXLbSWdHCQMgI7TWt5pE=
=PyJK
-----END PGP SIGNATURE-----
Merge tag 'stable/for-linus-3.20-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen features and fixes from David Vrabel:
- Reworked handling for foreign (grant mapped) pages to simplify the
code, enable a number of additional use cases and fix a number of
long-standing bugs.
- Prefer the TSC over the Xen PV clock when dom0 (and the TSC is
stable).
- Assorted other cleanup and minor bug fixes.
* tag 'stable/for-linus-3.20-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (25 commits)
xen/manage: Fix USB interaction issues when resuming
xenbus: Add proper handling of XS_ERROR from Xenbus for transactions.
xen/gntdev: provide find_special_page VMA operation
xen/gntdev: mark userspace PTEs as special on x86 PV guests
xen-blkback: safely unmap grants in case they are still in use
xen/gntdev: safely unmap grants in case they are still in use
xen/gntdev: convert priv->lock to a mutex
xen/grant-table: add a mechanism to safely unmap pages that are in use
xen-netback: use foreign page information from the pages themselves
xen: mark grant mapped pages as foreign
xen/grant-table: add helpers for allocating pages
x86/xen: require ballooned pages for grant maps
xen: remove scratch frames for ballooned pages and m2p override
xen/grant-table: pre-populate kernel unmap ops for xen_gnttab_unmap_refs()
mm: add 'foreign' alias for the 'pinned' page flag
mm: provide a find_special_page vma operation
x86/xen: cleanup arch/x86/xen/mmu.c
x86/xen: add some __init annotations in arch/x86/xen/mmu.c
x86/xen: add some __init and static annotations in arch/x86/xen/setup.c
x86/xen: use correct types for addresses in arch/x86/xen/setup.c
...
We no longer use it outside of blk-mq.c, so we can make it static
and stop exporting it. Additionally, kill the 'async' argument, as
there's only one used of it.
Signed-off-by: Jens Axboe <axboe@fb.com>
Userspace can opt to receive a device request notification,
indicating that the device should be released. This is setup
the same way as the error IRQ and also supports eventfd signaling.
Future support may forcefully remove the device from the user if
the request is ignored.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
When a request is made to unbind a device from a vfio bus driver,
we need to wait for the device to become unused, ie. for userspace
to release the device. However, we have a long standing TODO in
the code to do something proactive to make that happen. To enable
this, we add a request callback on the vfio bus driver struct,
which is intended to signal the user through the vfio device
interface to release the device. Instead of passively waiting for
the device to become unused, we can now pester the user to give
it up.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* pm-cpufreq: (46 commits)
intel_pstate: provide option to only use intel_pstate with HWP
cpufreq-dt: Drop unnecessary check before cpufreq_cooling_unregister() invocation
cpufreq: Create for_each_governor()
cpufreq: Create for_each_policy()
cpufreq: Drop cpufreq_disabled() check from cpufreq_cpu_{get|put}()
cpufreq: Set cpufreq_cpu_data to NULL before putting kobject
intel_pstate: honor user space min_perf_pct override on resume
intel_pstate: respect cpufreq policy request
intel_pstate: Add num_pstates to sysfs
intel_pstate: expose turbo range to sysfs
intel_pstate: Add support for SkyLake
cpufreq: stats: drop unnecessary locking
cpufreq: stats: don't update stats on false notifiers
cpufreq: stats: don't update stats from show_trans_table()
cpufreq: stats: time_in_state can't be NULL in cpufreq_stats_update()
cpufreq: stats: create sysfs group once we are ready
cpufreq: remove CPUFREQ_UPDATE_POLICY_CPU notifications
cpufreq: stats: drop 'cpu' field of struct cpufreq_stats
cpufreq: Remove (now) unused 'last_cpu' from struct cpufreq_policy
cpufreq: stats: rename 'struct cpufreq_stats' objects as 'stats'
...
* pm-domains:
PM: Convert dev_pm_put_subsys_data() into a void function
PM: Update function header for dev_pm_get_subsys_data()
PM / Domains: Handle errors from genpd's ->attach_dev() callback
PM / Domains: Re-order initialization of generic_pm_domain_data
PM / Domains: Free pm_subsys_data in error path in __pm_genpd_add_device()
PM / Domains: Eliminate the mutex for the generic_pm_domain_data
PM / Domains: Don't check for an existing device when adding a new
PM / Domains: Don't allow an existing generic_pm_domain_data
PM / Domains: Remove reference counting for the generic_pm_domain_data
PM / Domains: Rename __pm_genpd_alloc|free_dev_data()
PM / Domains: Remove pm_genpd_dev_need_restore() API
* acpi-resources: (23 commits)
Merge branch 'pci/host-generic' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci into acpi-resources
x86/irq, ACPI: Implement ACPI driver to support IOAPIC hotplug
ACPI: Add interfaces to parse IOAPIC ID for IOAPIC hotplug
x86/PCI: Refine the way to release PCI IRQ resources
x86/PCI/ACPI: Use common ACPI resource interfaces to simplify implementation
x86/PCI: Fix the range check for IO resources
PCI: Use common resource list management code instead of private implementation
resources: Move struct resource_list_entry from ACPI into resource core
ACPI: Introduce helper function acpi_dev_filter_resource_type()
ACPI: Add field offset to struct resource_list_entry
ACPI: Translate resource into master side address for bridge window resources
ACPI: Return translation offset when parsing ACPI address space resources
ACPI: Enforce stricter checks for address space descriptors
ACPI: Set flag IORESOURCE_UNSET for unassigned resources
ACPI: Normalize return value of resource parser functions
ACPI: Fix a bug in parsing ACPI Memory24 resource
ACPI: Add prefetch decoding to the address space parser
ACPI: Move the window flag logic to the combined parser
ACPI: Unify the parsing of address_space and ext_address_space
ACPI: Let the parser return false for disabled resources
...
* acpica:
ACPICA: Events: Enable APIs to allow interrupt/polling adaptive request based GPE handling model
ACPICA: Events: Introduce acpi_set_gpe()/acpi_finish_gpe() to reduce divergences
ACPICA: Events: Introduce ACPI_GPE_DISPATCH_RAW_HANDLER to fix 2 issues for the current GPE APIs
ACPICA: Update version to 20150204
ACPICA: Update Copyright headers to 2015
ACPICA: Hardware: Cast GPE enable_mask before storing
ACPICA: Events: Cleanup GPE dispatcher type obtaining code
ACPICA: Events: Cleanup to move acpi_gbl_global_event_handler invocation out of acpi_ev_gpe_dispatch()
ACPICA: Events: Cleanup of resetting the GPE handler to NULL before removing
ACPICA: Events: Fix uninitialized variable
ACPICA: Events: Remove acpi_ev_valid_gpe_event() due to current restriction
ACPICA: Events: Remove duplicated sanity check in acpi_ev_enable_gpe()
ACPICA: Events: Back port "ACPICA: Save current masks of enabled GPEs after enable register writes"
ACPICA: Resources: Provide common part for struct acpi_resource_address structures.
ACPI: Introduce acpi_unload_parent_table() usages in Linux kernel
ACPICA: take ACPI_MTX_INTERPRETER in acpi_unload_table_id()
Pull libata changes from Tejun Heo:
"Mostly driver-specific changes. Nothing too noteworthy.
This pull request contains three merges from for-3.19-fixes. The
first two are to pull ahci_xgene and sata_dwc_460ex fix commits which
are depended upon by later changes. The last one is to pull in a fix
patch which missed the v3.19-rc window"
* 'for-3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata: (24 commits)
ahci_xgene: Fix the dma state machine lockup for the ATA_CMD_SMART PIO mode command.
ata: libahci: Use of_platform_device_create only if supported
sata_mv: Delete unnecessary checks before the function call "phy_power_off"
ata: Delete unnecessary checks before the function call "pci_dev_put"
ata: pata_platform: fix owner module reference mismatch for scsi host
ata: ahci_platform: fix owner module reference mismatch for scsi host
pata_pdc2027x: Use 64-bit timekeeping
ata: libahci: Fix devres cleanup on failure
ata: libahci: Allow using multiple regulators
Documentation: bindings: Add the regulator property to the sub-nodes AHCI bindings
ata: libahci: Clean-up the ahci_platform_en/disable_phys functions
sata_rcar: extend PM methods
sata_dwc_460ex: disable compilation on ARM and ARM64
ata: libata-core: Remove unused function
sata_dwc_460ex: convert to devm_kzalloc in ->probe()
sata_dwc_460ex: remove extra message
sata_dwc_460ex: use np local variable in ->probe()
sata_dwc_460ex: fix most of the sparse warnings
sata_dwc_460ex: enable COMPILE_TEST for the driver
sata_dwc_460ex: remove redundant dev_set_drvdata
...
Pull cgroup changes from Tejun Heo:
"Three mostly trivial patches. The biggest change is that blkio is now
initialized before memcg which will be needed to make memcg and blkcg
cooperate on writeback IOs"
* 'for-3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: add dummy css_put() for !CONFIG_CGROUPS
cgroup: reorder SUBSYS(blkio) in cgroup_subsys.h
Update of Documentation/cgroups/00-INDEX
Pull percpu changes from Tejun Heo:
"Nothing too interesting. One cleanup patch and another to add a
trivial state check function"
* 'for-3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu_ref: implement percpu_ref_is_dying()
percpu_ref: remove unnecessary ACCESS_ONCE() in percpu_ref_tryget_live()
Packetization Layer Path MTU Discovery works separately beside
Path MTU Discovery at IP level, different net namespace has
various requirements on which one to chose, e.g., a virutalized
container instance would require TCP PMTU to probe an usable
effective mtu for underlying tunnel, while the host would
employ classical ICMP based PMTU to function.
Hence making TCP PMTU mechanism per net namespace to decouple
two functionality. Furthermore the probe base MSS should also
be configured separately for each namespace.
Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull EFI updates from Ingo Molnar:
"Main changes:
- Move efivarfs from the misc filesystem section to pseudo filesystem
- Expose firmware platform size in sysfs
- Improve robustness of get_memory_map() by removing assumptions on
the size of efi_memory_desc_t.
- various cleanups and fixes
The biggest risk is the get_memory_map() change, which changes the way
that both the arm64 and x86 EFI boot stub build the early memory map.
There are no known regressions with it at the moment, BYMMV"
* 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi: Don't look for chosen@0 node on DT platforms
firmware: efi: Remove unneeded guid unparse
efi/libstub: Call get_memory_map() to obtain map and desc sizes
efi: Small leak on error in runtime map code
efi: rtc-efi: Mark UIE as unsupported
arm64/efi: efistub: Apply __init annotation
efi: Expose underlying UEFI firmware platform size to userland
efi: Rename efi_guid_unparse to efi_guid_to_str
efi: Update the URLs for efibootmgr
fs: Make efivarfs a pseudo filesystem, built by default with EFI
Pull x86 APIC updates from Ingo Molnar:
"Continued fallout of the conversion of the x86 IRQ code to the
hierarchical irqdomain framework: more cleanups, simplifications,
memory allocation behavior enhancements, mainly in the interrupt
remapping and APIC code"
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
x86, init: Fix UP boot regression on x86_64
iommu/amd: Fix irq remapping detection logic
x86/acpi: Make acpi_[un]register_gsi_ioapic() depend on CONFIG_X86_LOCAL_APIC
x86: Consolidate boot cpu timer setup
x86/apic: Reuse apic_bsp_setup() for UP APIC setup
x86/smpboot: Sanitize uniprocessor init
x86/smpboot: Move apic init code to apic.c
init: Get rid of x86isms
x86/apic: Move apic_init_uniprocessor code
x86/smpboot: Cleanup ioapic handling
x86/apic: Sanitize ioapic handling
x86/ioapic: Add proper checks to setp/enable_IO_APIC()
x86/ioapic: Provide stub functions for IOAPIC%3Dn
x86/smpboot: Move smpboot inlines to code
x86/x2apic: Use state information for disable
x86/x2apic: Split enable and setup function
x86/x2apic: Disable x2apic from nox2apic setup
x86/x2apic: Add proper state tracking
x86/x2apic: Clarify remapping mode for x2apic enablement
x86/x2apic: Move code in conditional region
...
Pull timer updates from Ingo Molnar:
"The main changes in this cycle were:
- rework hrtimer expiry calculation in hrtimer_interrupt(): the
previous code had a subtle bug where expiry caching would miss an
expiry, resulting in occasional bogus (late) expiry of hrtimers.
- continuing Y2038 fixes
- ktime division optimization
- misc smaller fixes and cleanups"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hrtimer: Make __hrtimer_get_next_event() static
rtc: Convert rtc_set_ntp_time() to use timespec64
rtc: Remove redundant rtc_valid_tm() from rtc_hctosys()
rtc: Modify rtc_hctosys() to address y2038 issues
rtc: Update rtc-dev to use y2038-safe time interfaces
rtc: Update interface.c to use y2038-safe time interfaces
time: Expose get_monotonic_boottime64 for in-kernel use
time: Expose getboottime64 for in-kernel uses
ktime: Optimize ktime_divns for constant divisors
hrtimer: Prevent stale expiry time in hrtimer_interrupt()
ktime.h: Introduce ktime_ms_delta
Pull scheduler updates from Ingo Molnar:
"The main scheduler changes in this cycle were:
- various sched/deadline fixes and enhancements
- rescheduling latency fixes/cleanups
- rework the rq->clock code to be more consistent and more robust.
- minor micro-optimizations
- ->avg.decay_count fixes
- add a stack overflow check to might_sleep()
- idle-poll handler fix, possibly resulting in power savings
- misc smaller updates and fixes"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/Documentation: Remove unneeded word
sched/wait: Introduce wait_on_bit_timeout()
sched: Pull resched loop to __schedule() callers
sched/deadline: Remove cpu_active_mask from cpudl_find()
sched: Fix hrtick_start() on UP
sched/deadline: Avoid pointless __setscheduler()
sched/deadline: Fix stale yield state
sched/deadline: Fix hrtick for a non-leftmost task
sched/deadline: Modify cpudl::free_cpus to reflect rd->online
sched/idle: Add missing checks to the exit condition of cpu_idle_poll()
sched: Fix missing preemption opportunity
sched/rt: Reduce rq lock contention by eliminating locking of non-feasible target
sched/debug: Print rq->clock_task
sched/core: Rework rq->clock update skips
sched/core: Validate rq_clock*() serialization
sched/core: Remove check of p->sched_class
sched/fair: Fix sched_entity::avg::decay_count initialization
sched/debug: Fix potential call to __ffs(0) in sched_show_task()
sched/debug: Check for stack overflow in ___might_sleep()
sched/fair: Fix the dealing with decay_count in __synchronize_entity_decay()
Pull perf updates from Ingo Molnar:
"Kernel side changes:
- AMD range breakpoints support:
Extend breakpoint tools and core to support address range through
perf event with initial backend support for AMD extended
breakpoints.
The syntax is:
perf record -e mem:addr/len:type
For example set write breakpoint from 0x1000 to 0x1200 (0x1000 + 512)
perf record -e mem:0x1000/512:w
- event throttling/rotating fixes
- various event group handling fixes, cleanups and general paranoia
code to be more robust against bugs in the future.
- kernel stack overhead fixes
User-visible tooling side changes:
- Show precise number of samples in at the end of a 'record' session,
if processing build ids, since we will then traverse the whole
perf.data file and see all the PERF_RECORD_SAMPLE records,
otherwise stop showing the previous off-base heuristicly counted
number of "samples" (Namhyung Kim).
- Support to read compressed module from build-id cache (Namhyung
Kim)
- Enable sampling loads and stores simultaneously in 'perf mem'
(Stephane Eranian)
- 'perf diff' output improvements (Namhyung Kim)
- Fix error reporting for evsel pgfault constructor (Arnaldo Carvalho
de Melo)
Tooling side infrastructure changes:
- Cache eh/debug frame offset for dwarf unwind (Namhyung Kim)
- Support parsing parameterized events (Cody P Schafer)
- Add support for IP address formats in libtraceevent (David Ahern)
Plus other misc fixes"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
perf: Decouple unthrottling and rotating
perf: Drop module reference on event init failure
perf: Use POLLIN instead of POLL_IN for perf poll data in flag
perf: Fix put_event() ctx lock
perf: Fix move_group() order
perf: Fix event->ctx locking
perf: Add a bit of paranoia
perf symbols: Convert lseek + read to pread
perf tools: Use perf_data_file__fd() consistently
perf symbols: Support to read compressed module from build-id cache
perf evsel: Set attr.task bit for a tracking event
perf header: Set header version correctly
perf record: Show precise number of samples
perf tools: Do not use __perf_session__process_events() directly
perf callchain: Cache eh/debug frame offset for dwarf unwind
perf tools: Provide stub for missing pthread_attr_setaffinity_np
perf evsel: Don't rely on malloc working for sz 0
tools lib traceevent: Add support for IP address formats
perf ui/tui: Show fatal error message only if exists
perf tests: Fix typo in sample-parsing.c
...
Pull core locking updates from Ingo Molnar:
"The main changes are:
- mutex, completions and rtmutex micro-optimizations
- lock debugging fix
- various cleanups in the MCS and the futex code"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rtmutex: Optimize setting task running after being blocked
locking/rwsem: Use task->state helpers
sched/completion: Add lock-free checking of the blocking case
sched/completion: Remove unnecessary ->wait.lock serialization when reading completion state
locking/mutex: Explicitly mark task as running after wakeup
futex: Fix argument handling in futex_lock_pi() calls
doc: Fix misnamed FUTEX_CMP_REQUEUE_PI op constants
locking/Documentation: Update code path
softirq/preempt: Add missing current->preempt_disable_ip update
locking/osq: No need for load/acquire when acquire-polling
locking/mcs: Better differentiate between MCS variants
locking/mutex: Introduce ww_mutex_set_context_slowpath()
locking/mutex: Move MCS related comments to proper location
locking/mutex: Checking the stamp is WW only
Pull RCU updates from Ingo Molnar:
"The main RCU changes in this cycle are:
- Documentation updates.
- Miscellaneous fixes.
- Preemptible-RCU fixes, including fixing an old bug in the
interaction of RCU priority boosting and CPU hotplug.
- SRCU updates.
- RCU CPU stall-warning updates.
- RCU torture-test updates"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
rcu: Initialize tiny RCU stall-warning timeouts at boot
rcu: Fix RCU CPU stall detection in tiny implementation
rcu: Add GP-kthread-starvation checks to CPU stall warnings
rcu: Make cond_resched_rcu_qs() apply to normal RCU flavors
rcu: Optionally run grace-period kthreads at real-time priority
ksoftirqd: Use new cond_resched_rcu_qs() function
ksoftirqd: Enable IRQs and call cond_resched() before poking RCU
rcutorture: Add more diagnostics in rcu_barrier() test failure case
torture: Flag console.log file to prevent holdovers from earlier runs
torture: Add "-enable-kvm -soundhw pcspk" to qemu command line
rcutorture: Handle different mpstat versions
rcutorture: Check from beginning to end of grace period
rcu: Remove redundant rcu_batches_completed() declaration
rcutorture: Drop rcu_torture_completed() and friends
rcu: Provide rcu_batches_completed_sched() for TINY_RCU
rcutorture: Use unsigned for Reader Batch computations
rcutorture: Make build-output parsing correctly flag RCU's warnings
rcu: Make _batches_completed() functions return unsigned long
rcutorture: Issue warnings on close calls due to Reader Batch blows
documentation: Fix smp typo in memory-barriers.txt
...
Make __ipv6_select_ident() static as it isn't used outside
the file.
Fixes: 0508c07f5e (ipv6: Select fragment id during UFO segmentation if not set.)
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Renamed the reserved1 member of struct ethtool_drvinfo to erom_version to get
expansion/option ROM version of the adapter if present.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The driver exposes interfaces that directly relate to HW state. Upon fatal
error, consumers of these interfaces (ULPs) that rely on completion of
all their posted work-request could hang, thereby introducing dependencies
in shutdown order. To prevent this from happening, we manage the
relevant resources (CQs, QPs) that are used by the device. Upon a fatal error,
we now generate simulated completions for outstanding WQEs that were not
completed at the time the HW was reset.
It includes invoking the completion event handler for all involved CQs so that
the ULPs will poll those CQs. When polled we return simulated CQEs with
IB_WC_WR_FLUSH_ERR return code enabling ULPs to clean up their resources and
not wait forever for completions upon receiving remove_one.
The above change requires an extra check in the data path to make sure that when
device is in error state, the simulated CQEs will be returned and no further
WQEs will be posted.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When queuing work to send the NETDEV_BONDING_INFO netdev event, it's
possible that when the work is executed, the pointer to the slave
becomes invalid. This can happen if between queuing the event and the
execution of the work, the net-device was un-ensvaled and re-enslaved.
Fix that by queuing a work with the data of the slave instead of the
slave structure.
Fixes: 69e6113343 ('net/bonding: Notify state change on slaves')
Reported-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This has not been a busy release for the regulator framework, though we
do have the first parts of some ongoing work from Bjorn Andersson to
allow us to support more complex modern systems with dynamic
configuration of regulators in suspend and idle states.
- Support for device-specific properties on regulator nodes when using
simplified DT parsing in the core from Krzysztof Kozlowski.
- Restructuring of the load tracking code, intended to support future
improvements in this area for more complex system designs.
- New drivers for Maxim MAX77843 and Mediatek MT6397.
- Lots of smaller fixes and improvements.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJU2GajAAoJECTWi3JdVIfQa/IH/i2Ktdxq6Y5tu1CvUqmGlA+P
tgpL0CAKn5rH3rM4DnAZ5MFNKZ2QgYpEmdkqcO3qunT7V9luQDBIh1CILUDicCla
T+jcHSt2xypnteScxDUOYlzUpjoHhx2Y2L8qneQoYVgMCevbWx8BhSKNSwOnkJrb
1Tl0s3k3CKdU261aIBSdzZ/bipRV0QFG5jSnc70PXrwX+dRd+utKfvXIC1M4pi3E
AbRWY4G5YMrBPviNoTJRpaiZkvOaTwd6riETZWSSQfs3JQ8dpGCfS00gJHgk0A4E
+2cUIhikhKKkDNkNysVRGPZr4D9q3Fnhek9plzQmwHdSb7w7+j4wmB/p6bRR/ME=
=XPki
-----END PGP SIGNATURE-----
Merge tag 'regulator-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator updates from Mark Brown:
"This has not been a busy release for the regulator framework, though
we do have the first parts of some ongoing work from Bjorn Andersson
to allow us to support more complex modern systems with dynamic
configuration of regulators in suspend and idle states.
- Support for device-specific properties on regulator nodes when
using simplified DT parsing in the core from Krzysztof Kozlowski.
- Restructuring of the load tracking code, intended to support future
improvements in this area for more complex system designs.
- New drivers for Maxim MAX77843 and Mediatek MT6397.
- Lots of smaller fixes and improvements"
* tag 'regulator-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (29 commits)
regulator: max77843: Add max77843 regulator driver
regulator: Fix build breakage on !REGULATOR
regulator: Build sysfs entries with static attribute groups
regulator: qcom-rpm: Make it possible to specify supply
regulator: core: Consolidate drms update handling
regulator: qcom-rpm: signedness bug in probe()
regulator: da9211: Add gpio control for enable/disable of buck
regulator: qcom_rpm: Don't update vreg->uV/mV if rpm_reg_write fails
regulator: lp872x: Remove **regulators from struct lp872x
regulator: da9211: fix unmatched of_node
regulator: Update documentation after renaming function argument
regulator: axp20x: Migrate to regulator core's simplified DT parsing code
regulator: axp20x: Fill regulators_node and of_match descriptor fields
regulator: pfuze100-regulator: add pfuze3000 support
regulator: max77686: Document gpio properties
regulator: Allow parsing custom properties when using simplified DT parsing
regulator: max77686: Add GPIO control
regulator: Copy config passed during registration
regulator: tps65023: Constify struct regmap_config and regulator_ops
regulator: max8649: Constify struct regmap_config and regulator_ops
...
The major highlight this release is a refactoring of the core to allow
us to run synchronous transfers in the context of the caller when there
is no contention for the bus. This improves performance in the very
common case by eliminating context switches and reducing the number of
hardware setup and teardown operations we need to perform.
Other changes:
- New drivers for DLN-2 USB-SPI adapter and ST SPI controllers.
- A big round of cleanups, performance and feature improvements
for the xilinx driver from Ricardo Ribalda Delgado.
- A wide range of smaller cleanups, fixes and feature improvements
throughout the subsystem.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJU2GNgAAoJECTWi3JdVIfQLiYH/0uLN43CunPp0gSWllQ2PY1O
R1QiqXg1fr1uZKRuGy59QF0TkU/JlWPY+tpGiOH1jrnDsoecnWsxDx3YEeuYdV6U
c//UrlK2uvESivbc48zVUTwCsgxsE8apG0JgqLjsfUpqZTEFxFpeSskepSJ2kIUz
bsXHU8Xi0WkLalsk/8Ik8aUvOwVi5EtRE9OMvnU6QPqQMCszgv1TH4UbwbhqwwzZ
U23WbNHQ262XDRwY2LKl/QROULeU5pd9F19wrveKMa42fkbu/e+kk6E3n7/Hd4mV
CUjv1wTCpPZvzh3bTk50uXwA9XQOzv6ddw6jqsgLcV6jS8Ju3Z3Beya3fmdhOl0=
=3ZQr
-----END PGP SIGNATURE-----
Merge tag 'spi-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi updates from Mark Brown:
"The major highlight this release is a refactoring of the core to allow
us to run synchronous transfers in the context of the caller when
there is no contention for the bus. This improves performance in the
very common case by eliminating context switches and reducing the
number of hardware setup and teardown operations we need to perform.
Other changes:
- New drivers for DLN-2 USB-SPI adapter and ST SPI controllers.
- A big round of cleanups, performance and feature improvements for
the xilinx driver from Ricardo Ribalda Delgado.
- A wide range of smaller cleanups, fixes and feature improvements
throughout the subsystem"
* tag 'spi-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: (68 commits)
spi: mxs: cleanup wait_for_completion return handling
spi: ti-qspi: cleanup wait_for_completion return handling
spi: spi-imx: cleanup wait_for_completion handling
spi: sh-msiof: cleanup wait_for_completion return handling
spi: match var type to return type of wait_for_completion
spi: spi-pxa2xx: only include mach/dma.h for legacy DMA
spi: atmel: cleanup wait_for_completion return handling
spi: fsl-dspi: Remove possible memory leak of 'chip'
spi: sh-msiof: Update calculation of frequency dividing
spi: spidev: Convert buf pointers for 32-bit compat SPI_IOC_MESSAGE(n)
spi/xilinx: Fix access invalid memory on xilinx_spi_tx
spi: Revert "spi/xilinx: Remove iowrite/ioread wrappers"
spi/xilinx: Check number of slaves range
spi/xilinx: Use polling mode on small transfers
spi/xilinx: Remove remaining_words driver data variable
spi/xilinx: Remove iowrite/ioread wrappers
spi/xilinx: Convert bits_per_word in bytes_per_word
spi/xilinx: Convert remainding_bytes in remaining words
spi/xilinx: Make spi_tx and spi_rx simmetric
spi/xilinx: Remove rx_fn and tx_fn pointer
...
Add functionality for safely appending string data to a TLV without
keeping write count in the caller.
Convert TIPC_CMD_SHOW_LINK_STATS to compat dumpit.
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a framework for transcoding legacy nl action into actions
(.doit) calls from the new nl API. This is done by converting the
incoming TLV data into netlink data with nested netlink attributes.
Unfortunately due to the randomness of the legacy API we can't do this
generically so each legacy netlink command requires a specific
transcoding recipe. In this case for bearer enable and bearer disable.
Convert TIPC_CMD_ENABLE_BEARER and TIPC_CMD_DISABLE_BEARER into doit
compat calls.
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a framework for dumping netlink data from the new netlink
API and formatting it to the old legacy API format. This is done by
looping the dump data and calling a format handler for each entity, in
this case a bearer.
We dump until either all data is dumped or we reach the limited buffer
size of the legacy API. Remember, the legacy API doesn't scale.
In this commit we convert TIPC_CMD_GET_BEARER_NAMES to use the compat
layer.
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A very quiet release for regmap this time around:
- Fix an endianness issue for I2C devices connected via SMBus where
we were getting two layers both trying to do endianness handling.
- Use a union to reduce the size of the regmap struct.
- A couple of smaller fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJU2GDZAAoJECTWi3JdVIfQId0H/3dsylUYN0uYzDA1fKy0At5E
7dmKbWkp1Ab62wQYC59a5KH1OgdZe/mWf3x9kxEZlqcmS15OPEuqs7g+BNoAi3Fj
ysmIbKRpJIoZs+XrgBGPCunchLq6rOmtcVaQEUQZZnKg31kQCBhMoPip1QCT5Qn7
UGoYgwP3O7TPVxJDZXA3gCGZER/Jr4Bhrkb3qJKzofpoRvzqSeqKuDJdwOTy7dzx
w1xMQWUlrPuCKHe3HA5uD4ZPHtNsN23APd7spmu/HhqXtN1nCIxh5/0z4qKVGvLO
h3xSCTtOTJZI2kYEFDn3/6DFWQokp90fwBKaG8EtixtD/8M9OQA7tuW1mtsUgOE=
=nPfp
-----END PGP SIGNATURE-----
Merge tag 'regmap-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap
Pull regmap updates from Mark Brown:
"A very quiet release for regmap this time around:
- Fix an endianness issue for I2C devices connected via SMBus where
we were getting two layers both trying to do endianness handling.
- Use a union to reduce the size of the regmap struct.
- A couple of smaller fixes"
* tag 'regmap-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
regmap: Fix i2c word access when using SMBus access functions
regmap: Export regmap_get_val_endian
regmap: ac97: Clean up indentation
regmap: correct the description of structure element in reg_field
regmap: Move spinlock_flags into the union
iwlwifi:
* more work for new devices (4165 / 8260)
* cleanups / improvemnts in rate control
* fixes for TDLS
* major statistics work from Johannes - more to come
* improvements for the fw error dump infrastructure
* usual amount of small fixes here and there (scan, D0i3 etc...)
* add support for beamforming
* enable stuck queue detection for iwlmvm
* a few fixes for EBS scan
* fixes for various failure paths
* improvements for TDLS Offchannel
wil6210:
* performance tuning
* some AP features
brcm80211:
* rework some code in SDIO part of the brcmfmac driver related to
suspend/resume that were found doing stress testing
* in PCIe part scheduling of worker thread needed to be relaxed
* minor fixes and exposing firmware revision information to
user-space, ie. ethtool.
mwifiex:
* enhancements for change virtual interface handling
* remove coupling between netdev and FW supported interface
combination, now conversion from any type of supported interface
types to any other type is possible
* DFS support in AP mode
ath9k:
* fix calibration issues on some boards
* Wake-on-WLAN improvements
ath10k:
* add support for qca6174 hardware
* enable RX batching to reduce CPU load
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQEcBAABAgAGBQJU1fUoAAoJEG4XJFUm622bnjMH/25f2kyLAJSLJmiIhpEcYlNJ
CJMAYsSTcqhKaMEbx742StUUDH3pa1oV4mQ5csVa2QJIsS90WzPpB1eRnNtQ69Nj
89Zfwa6bbX0TXDgw1Aa28NewPY/3xWpCjI03HBQaMIncToWv3/dzNUr0bEmIuBds
wsr5Y+fy80VKnkoXG7XzGFOqmxxFwNS+UF3M1WNtQ+xcr9rK3/LLxyWFy84S9UIe
4lVOb+df91YFIJeLs28/hfTiRhvV0fWIbupGv8UhuBMho+F0dpLvra+3xksEqALu
4+AzI0j9GjqFfMzbmRBcPcT4Rl37nLmvQ+7XokrFgAYS5zp7OGylOMU3nHyvkuY=
=iL8v
-----END PGP SIGNATURE-----
Merge tag 'wireless-drivers-next-for-davem-2015-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next
Major changes:
iwlwifi:
* more work for new devices (4165 / 8260)
* cleanups / improvemnts in rate control
* fixes for TDLS
* major statistics work from Johannes - more to come
* improvements for the fw error dump infrastructure
* usual amount of small fixes here and there (scan, D0i3 etc...)
* add support for beamforming
* enable stuck queue detection for iwlmvm
* a few fixes for EBS scan
* fixes for various failure paths
* improvements for TDLS Offchannel
wil6210:
* performance tuning
* some AP features
brcm80211:
* rework some code in SDIO part of the brcmfmac driver related to
suspend/resume that were found doing stress testing
* in PCIe part scheduling of worker thread needed to be relaxed
* minor fixes and exposing firmware revision information to
user-space, ie. ethtool.
mwifiex:
* enhancements for change virtual interface handling
* remove coupling between netdev and FW supported interface
combination, now conversion from any type of supported interface
types to any other type is possible
* DFS support in AP mode
ath9k:
* fix calibration issues on some boards
* Wake-on-WLAN improvements
ath10k:
* add support for qca6174 hardware
* enable RX batching to reduce CPU load
Conflicts:
drivers/net/wireless/rtlwifi/pci.c
Conflict resolution is to get rid of the 'end' label and keep
the rest.
Signed-off-by: David S. Miller <davem@davemloft.net>
For blk-mq request-based DM the responsibility of allocating a cloned
request is transfered from DM core to the target type. Doing so
enables the cloned request to be allocated from the appropriate
blk-mq request_queue's pool (only the DM target, e.g. multipath, can
know which block device to send a given cloned request to).
Care was taken to preserve compatibility with old-style block request
completion that requires request-based DM _not_ acquire the clone
request's queue lock in the completion path. As such, there are now 2
different request-based DM target_type interfaces:
1) the original .map_rq() interface will continue to be used for
non-blk-mq devices -- the preallocated clone request is passed in
from DM core.
2) a new .clone_and_map_rq() and .release_clone_rq() will be used for
blk-mq devices -- blk_get_request() and blk_put_request() are used
respectively from these hooks.
dm_table_set_type() was updated to detect if the request-based target is
being stacked on blk-mq devices, if so DM_TYPE_MQ_REQUEST_BASED is set.
DM core disallows switching the DM table's type after it is set. This
means that there is no mixing of non-blk-mq and blk-mq devices within
the same request-based DM table.
[This patch was started by Keith and later heavily modified by Mike]
Tested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Don't use generic snapshot of trigger_tstamp if low-level driver or
hardware can get a more precise value for better audio/system time
synchronization.
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
09c32aaa36 ("ahci_xgene: Fix the dma state machine lockup for the
ATA_CMD_SMART PIO mode command.") missed 3.19 release. Fold it into
for-3.20.
Signed-off-by: Tejun Heo <tj@kernel.org>
Make sure root user does not try something stupid.
Also make sure mask field in struct rps_sock_flow_table
does not share a cache line with the potentially often dirtied
flow table.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 567e4b7973 ("net: rfs: add hash collision detection")
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead we rely on SO_REUSEPORT to provide the reconnection semantics
that we need for NFSv2/v3.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The socket lock is currently held by the task that is requesting the
connection be established. While that is efficient in the case where
the connection happens quickly, it is racy in the case where it doesn't.
What we really want is for the connect helper to be able to block access
to the socket while it is being set up.
This patch does so by arranging to transfer the socket lock from the
task that is requesting the connect attempt, and then releasing that
lock once everything is done.
This scheme also gives us automatic protection against collisions with
the RPC close code, so we can kill the cancel_delayed_work_sync()
call in xs_close().
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
out a real bug. During suspend and resume the tlb_flush tracepoint is
called when the CPU is going offline. As the CPU has been noted as offline,
RCU is ignoring that CPU, which means that it can not use RCU protected
locks. When tracepoints are activated, they require RCU locking, and
if RCU is ignoring a CPU that runs a tracepoint, there is a chance that
the tracepoint could cause corruption.
The solution was to change the tracepoint into a TRACE_EVENT_CONDITION()
which allows us to check a condition to determine if the tracepoint
should be called or not. If the condition is not met, the rcu protected
code will not be executed. By adding the condition
"cpu_online(smp_processor_id())", this will prevent the RCU protected
code from being executed if the CPU is marked offline.
After adding this, another bug was discovered. As RCU checks rcu callers,
if a rcu call is not done, there is no check (obviously). We found that
tracepoints could be added in RCU ignored locations and not have lockdep
complain until the tracepoint is activated. This missed places where
tracepoints were added in places they should not have been. To fix this,
code was added in 3.18 that if lockdep is enabled, any tracepoint will
still call the rcu checks even if the tracepoint is not enabled. The bug
here, is that the check does not take the CONDITION into account. As the
condition may prevent tracepoints from being activated in RCU ignored
areas (as the one patch does), we get false positives when we enable
lockdep and hit a tracepoint that the condition prevents it from being
called in a RCU ignored location. The fix for this is to add the
CONDITION to the rcu checks, even if the tracepoint is not enabled.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJU1rQfAAoJEEjnJuOKh9ld19UH/juFLZFjpYBgtRmbCZa/54Zk
i2Fa3U8jQe8MHHEYOjCLT9MQTYo/42btmJhr7kWKtIoUgDEli4lkOpbs+H0qar5y
Vv9+1cLeNFQzgIE3nwV7cjAw7Jufoyzd1lstDqIQvcmzZnQ5sNyyVeigMcxGv8Ls
4FyqzG6zCVgiDL4LyYNHdNcMr6qLs3KTFDEqp+kQreeO7R1r3ZEpq3JoWaEUgoPP
qrYv/rqVosLBUGA0pd7RmiGOxhjeKm15qz1GkiPeeus6DDWC6bvPC8cAc/FfkXH0
hYpoQghSZVnXGy0LzVsd44gj7tYx1FHEpYy1s8G6d5WJcNOGZ6OoZOdOZMyjPVw=
=PeL2
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace fixes from Steven Rostedt:
"During testing Sedat Dilek hit a "suspicious RCU usage" splat that
pointed out a real bug. During suspend and resume the tlb_flush
tracepoint is called when the CPU is going offline. As the CPU has
been noted as offline, RCU is ignoring that CPU, which means that it
can not use RCU protected locks. When tracepoints are activated, they
require RCU locking, and if RCU is ignoring a CPU that runs a
tracepoint, there is a chance that the tracepoint could cause
corruption.
The solution was to change the tracepoint into a
TRACE_EVENT_CONDITION() which allows us to check a condition to
determine if the tracepoint should be called or not. If the condition
is not met, the rcu protected code will not be executed. By adding
the condition "cpu_online(smp_processor_id())", this will prevent the
RCU protected code from being executed if the CPU is marked offline.
After adding this, another bug was discovered. As RCU checks rcu
callers, if a rcu call is not done, there is no check (obviously). We
found that tracepoints could be added in RCU ignored locations and not
have lockdep complain until the tracepoint is activated. This missed
places where tracepoints were added in places they should not have
been. To fix this, code was added in 3.18 that if lockdep is enabled,
any tracepoint will still call the rcu checks even if the tracepoint
is not enabled. The bug here, is that the check does not take the
CONDITION into account. As the condition may prevent tracepoints from
being activated in RCU ignored areas (as the one patch does), we get
false positives when we enable lockdep and hit a tracepoint that the
condition prevents it from being called in a RCU ignored location.
The fix for this is to add the CONDITION to the rcu checks, even if
the tracepoint is not enabled"
* tag 'trace-fixes-v3.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
x86/tlb/trace: Do not trace on CPU that is offline
tracing: Add condition check to RCU lockdep checks
Receive Flow Steering is a nice solution but suffers from
hash collisions when a mix of connected and unconnected traffic
is received on the host, when flow hash table is populated.
Also, clearing flow in inet_release() makes RFS not very good
for short lived flows, as many packets can follow close().
(FIN , ACK packets, ...)
This patch extends the information stored into global hash table
to not only include cpu number, but upper part of the hash value.
I use a 32bit value, and dynamically split it in two parts.
For host with less than 64 possible cpus, this gives 6 bits for the
cpu number, and 26 (32-6) bits for the upper part of the hash.
Since hash bucket selection use low order bits of the hash, we have
a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big
enough.
If the hash found in flow table does not match, we fallback to RPS (if
it is enabled for the rxqueue).
This means that a packet for an non connected flow can avoid the
IPI through a unrelated/victim CPU.
This also means we no longer have to clear the table at socket
close time, and this helps short lived flows performance.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ensure that in state FIN_WAIT2 or TIME_WAIT, where the connection is
represented by a tcp_timewait_sock, we rate limit dupacks in response
to incoming packets (a) with TCP timestamps that fail PAWS checks, or
(b) with sequence numbers that are out of the acceptable window.
We do not send a dupack in response to out-of-window packets if it has
been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we
last sent a dupack in response to an out-of-window packet.
Reported-by: Avery Fay <avery@mixpanel.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ensure that in state ESTABLISHED, where the connection is represented
by a tcp_sock, we rate limit dupacks in response to incoming packets
(a) with TCP timestamps that fail PAWS checks, or (b) with sequence
numbers or ACK numbers that are out of the acceptable window.
We do not send a dupack in response to out-of-window packets if it has
been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we
last sent a dupack in response to an out-of-window packet.
There is already a similar (although global) rate-limiting mechanism
for "challenge ACKs". When deciding whether to send a challence ACK,
we first consult the new per-connection rate limit, and then the
global rate limit.
Reported-by: Avery Fay <avery@mixpanel.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the SYN_RECV state, where the TCP connection is represented by
tcp_request_sock, we now rate-limit SYNACKs in response to a client's
retransmitted SYNs: we do not send a SYNACK in response to client SYN
if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms)
since we last sent a SYNACK in response to a client's retransmitted
SYN.
This allows the vast majority of legitimate client connections to
proceed unimpeded, even for the most aggressive platforms, iOS and
MacOS, which actually retransmit SYNs 1-second intervals for several
times in a row. They use SYN RTO timeouts following the progression:
1,1,1,1,1,2,4,8,16,32.
Reported-by: Avery Fay <avery@mixpanel.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Helpers for mitigating ACK loops by rate-limiting dupacks sent in
response to incoming out-of-window packets.
This patch includes:
- rate-limiting logic
- sysctl to control how often we allow dupacks to out-of-window packets
- SNMP counter for cases where we rate-limited our dupack sending
The rate-limiting logic in this patch decides to not send dupacks in
response to out-of-window segments if (a) they are SYNs or pure ACKs
and (b) the remote endpoint is sending them faster than the configured
rate limit.
We rate-limit our responses rather than blocking them entirely or
resetting the connection, because legitimate connections can rely on
dupacks in response to some out-of-window segments. For example, zero
window probes are typically sent with a sequence number that is below
the current window, and ZWPs thus expect to thus elicit a dupack in
response.
We allow dupacks in response to TCP segments with data, because these
may be spurious retransmissions for which the remote endpoint wants to
receive DSACKs. This is safe because segments with data can't
realistically be part of ACK loops, which by their nature consist of
each side sending pure/data-less ACKs to each other.
The dupack interval is controlled by a new sysctl knob,
tcp_invalid_ratelimit, given in milliseconds, in case an administrator
needs to dial this upward in the face of a high-rate DoS attack. The
name and units are chosen to be analogous to the existing analogous
knob for ICMP, icmp_ratelimit.
The default value for tcp_invalid_ratelimit is 500ms, which allows at
most one such dupack per 500ms. This is chosen to be 2x faster than
the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule
2.4). We allow the extra 2x factor because network delay variations
can cause packets sent at 1 second intervals to be compressed and
arrive much closer.
Reported-by: Avery Fay <avery@mixpanel.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
OVS userspace already probes the openvswitch kernel module for
OVS_ACTION_ATTR_SET_MASKED support. This patch adds the kernel module
implementation of masked set actions.
The existing set action sets many fields at once. When only a subset
of the IP header fields, for example, should be modified, all the IP
fields need to be exact matched so that the other field values can be
copied to the set action. A masked set action allows modification of
an arbitrary subset of the supported header bits without requiring the
rest to be matched.
Masked set action is now supported for all writeable key types, except
for the tunnel key. The set tunnel action is an exception as any
input tunnel info is cleared before action processing starts, so there
is no tunnel info to mask.
The kernel module converts all (non-tunnel) set actions to masked set
actions. This makes action processing more uniform, and results in
less branching and duplicating the action processing code. When
returning actions to userspace, the fully masked set actions are
converted back to normal set actions. We use a kernel internal action
code to be able to tell the userspace provided and converted masked
set actions apart.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the second NFC pull request for 3.20.
It brings:
- NCI NFCEE (NFC Execution Environment, typically an embedded or
external secure element) discovery and enabling/disabling support.
In order to communicate with an NFCEE, we also added NCI's logical
connections support to the NCI stack.
- HCI over NCI protocol support. Some secure elements only understand
HCI and thus we need to send them HCI frames when they're part of
an NCI chipset.
- NFC_EVT_TRANSACTION userspace API addition. Whenever an application
running on a secure element needs to notify its host counterpart,
we send an NFC_EVENT_SE_TRANSACTION event to userspace through the
NFC netlink socket.
- Secure element and HCI transaction event support for the st21nfcb
chipset.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJU06ozAAoJEIqAPN1PVmxKdtkP/2CeQ0+xJnJWgyh4xjkVxfPb
wPYrz3cHeHdnVX36mnIdRoQZRjxh/Ef/vgK8RsXohcldzHA9D8GjxRpj0GrkVGQ9
xoDkiHsFDdsZHeezSrlnXNnObvZrkeMsmnUDJxfwLtcX1nzFKzBgW2GaDiNjUazC
r5Ip4YIfoCaVLq5MSqYRqJ6uhplXd/ePdtPoo54L/E81y4ptQi8pUsQBy4K3GrFn
FbXoemcymhi9AVBGl+OlppZ93t6bdOV1D5tcOwpytAbfa3jp2oUnJPZay7EoWruR
aj8l5v3P+SSphAJwaGSbny0L83tTS7sq+N4QG+VxxdMkw2YjpmxgN3AGguNBtPGG
SUThpZV6QhrkAOnwzP2BJnh68WSU1eJrGvk8zUWW0SanA8k2jaHO/VzXCA3/ZkC4
IFcXl4Jr6R3iUxDFgS/6MB5E9EGedzFTacsWoHVyZPtaDAQp4WVdAIRQ4QKgJCLw
1yRmCeVunTsrs6O0aenU1xHinPlnx+tkuiNh4H64APQYTz54WwXH1/S04OL1SkqI
mTzUvFgWqJvSIOuU19g0HBw+xZ/9vvX8LLkm9kSTbe7tcmFH5CI7ZCN/hNDHvMSd
sjNYOyag/SAHJ01tTKcSPAgWO49bHWPodI04zP0lK0gMLjKvRknVMdTgIDfu49HN
2da9E23iBmNNgG79iVHN
=8caD
-----END PGP SIGNATURE-----
Merge tag 'nfc-next-3.20-2' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next
NFC: 3.20 second pull request
This is the second NFC pull request for 3.20.
It brings:
- NCI NFCEE (NFC Execution Environment, typically an embedded or
external secure element) discovery and enabling/disabling support.
In order to communicate with an NFCEE, we also added NCI's logical
connections support to the NCI stack.
- HCI over NCI protocol support. Some secure elements only understand
HCI and thus we need to send them HCI frames when they're part of
an NCI chipset.
- NFC_EVT_TRANSACTION userspace API addition. Whenever an application
running on a secure element needs to notify its host counterpart,
we send an NFC_EVENT_SE_TRANSACTION event to userspace through the
NFC netlink socket.
- Secure element and HCI transaction event support for the st21nfcb
chipset.
Signed-off-by: David S. Miller <davem@davemloft.net>
The system suspend calls (used to synchronize between system suspend of
the CPU and it's PMIC) are called from board code but not stubbed when
regulator is disabled causing build failures in such cases. This patch
fixes those build failures.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJU0orTAAoJECTWi3JdVIfQnjEH/0dF1gfaCbZJidzPnsZWexm8
sofDP5Lmzd/oym2M5LT06i4FGL/1jg08X8SEqGmZK84F/HKvTgZneHyusF952WAj
Rf4HZ+QgVpItJH0Try+O6Ww+QEzwDjrFhJcmTwhX9vOuHKBRFG+K4d8U3HaIWNu/
j5nggRzXKO4jcuPFBOda7CoiHb4bk8YBVrmmm0uA41/RKM1c80+sTLASpb4JIoL2
VOcktUGpfIwtZKTZAX+xsbDDYQ+4fmVApRlldCrJqJVMHcUsPjFV5PUIatUs2ZVE
+t9kafuLyXpA8ZNHQA22TyOD8UXMn/Iyc9Xkp+EieoFamCErTdnBjk1Im3g3BAA=
=yXkZ
-----END PGP SIGNATURE-----
Merge tag 'regulator-v3.19-rc7' into regulator-linus
regulator: Fix !REGULATOR builds of systems using system suspend calls
The system suspend calls (used to synchronize between system suspend of
the CPU and it's PMIC) are called from board code but not stubbed when
regulator is disabled causing build failures in such cases. This patch
fixes those build failures.
# gpg: Signature made Thu 05 Feb 2015 05:10:43 HKT using RSA key ID 5D5487D0
# gpg: WARNING: digest algorithm MD5 is deprecated
# gpg: please see http://www.gnupg.org/faq/weak-digest-algos.html for more information
# gpg: Oops: keyid_from_fingerprint: no pubkey
# gpg: key AF88CD16: no public key for trusted key - skipped
# gpg: key AF88CD16 marked as ultimately trusted
# gpg: key 5621E907: no public key for trusted key - skipped
# gpg: key 5621E907 marked as ultimately trusted
# gpg: Good signature from "Mark Brown <broonie@sirena.org.uk>"
# gpg: aka "Mark Brown <broonie@debian.org>"
# gpg: aka "Mark Brown <broonie@kernel.org>"
# gpg: aka "Mark Brown <broonie@tardis.ed.ac.uk>"
# gpg: aka "Mark Brown <broonie@linaro.org>"
# gpg: aka "Mark Brown <Mark.Brown@linaro.org>"
When taking a CPU down for suspend and resume, a tracepoint may be called
when the CPU has been designated offline. As tracepoints require RCU for
protection, they must not be called if the current CPU is offline.
Unfortunately, trace_tlb_flush() is called in this scenario as was noted
by LOCKDEP:
...
Disabling non-boot CPUs ...
intel_pstate CPU 1 exiting
===============================
smpboot: CPU 1 didn't die...
[ INFO: suspicious RCU usage. ]
3.19.0-rc7-next-20150204.1-iniza-small #1 Not tainted
-------------------------------
include/trace/events/tlb.h:35 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
RCU used illegally from offline CPU!
rcu_scheduler_active = 1, debug_locks = 0
no locks held by swapper/1/0.
stack backtrace:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.19.0-rc7-next-20150204.1-iniza-small #1
Hardware name: SAMSUNG ELECTRONICS CO., LTD. 530U3BI/530U4BI/530U4BH/530U3BI/530U4BI/530U4BH, BIOS 13XK 03/28/2013
0000000000000001 ffff88011a44fe18 ffffffff817e370d 0000000000000011
ffff88011a448290 ffff88011a44fe48 ffffffff810d6847 ffff8800c66b9600
0000000000000001 ffff88011a44c000 ffffffff81cb3900 ffff88011a44fe78
Call Trace:
[<ffffffff817e370d>] dump_stack+0x4c/0x65
[<ffffffff810d6847>] lockdep_rcu_suspicious+0xe7/0x120
[<ffffffff810b71a5>] idle_task_exit+0x205/0x2c0
[<ffffffff81054c4e>] play_dead_common+0xe/0x50
[<ffffffff81054ca5>] native_play_dead+0x15/0x140
[<ffffffff8102963f>] arch_cpu_idle_dead+0xf/0x20
[<ffffffff810cd89e>] cpu_startup_entry+0x37e/0x580
[<ffffffff81053e20>] start_secondary+0x140/0x150
intel_pstate CPU 2 exiting
...
By converting the tlb_flush tracepoint to a TRACE_EVENT_CONDITION where the
condition is cpu_online(smp_processor_id()), we can avoid calling RCU protected
code when the CPU is offline.
Link: http://lkml.kernel.org/r/CA+icZUUGiGDoL5NU8RuxKzFjoLjEKRtUWx=JB8B9a0EQv-eGzQ@mail.gmail.com
Cc: stable@vger.kernel.org # 3.17+
Fixes: d17d8f9ded "x86/mm: Add tracepoints for TLB flushes"
Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Suggested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Dave Hansen <dave@sr71.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace_tlb_flush() tracepoint can be called when a CPU is going offline.
When a CPU is offline, RCU is no longer watching that CPU and since the
tracepoint is protected by RCU, it must not be called. To prevent the
tlb_flush tracepoint from being called when the CPU is offline, it was
converted to a TRACE_EVENT_CONDITION where the condition checks if the
CPU is online before calling the tracepoint.
Unfortunately, this was not enough to stop lockdep from complaining about
it. Even though the RCU protected code of the tracepoint will never be
called, the condition is hidden within the tracepoint, and even though the
condition prevents RCU code from being called, the lockdep checks are
outside the tracepoint (this is to test tracepoints even when they are not
enabled).
Even though tracepoints should be checked to be RCU safe when they are not
enabled, the condition should still be considered when checking RCU.
Link: http://lkml.kernel.org/r/CA+icZUUGiGDoL5NU8RuxKzFjoLjEKRtUWx=JB8B9a0EQv-eGzQ@mail.gmail.com
Fixes: 3a630178fd "tracing: generate RCU warnings even when tracepoints are disabled"
Cc: stable@vger.kernel.org # 3.18+
Acked-by: Dave Hansen <dave@sr71.net>
Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
- Yann realized that the previous revert of new userspace ABI did not
go far enough, and we're still exposing a change that we don't want.
Revert even closer to 3.18 interface to make sure we get things right
in the long run.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJU1S+NAAoJEENa44ZhAt0hbq0P/0EhrI204nNULnK7UUV8nUAk
E5lN45Er0uorsX4dcX4nM+YinNSmlrV7uX5B7Nn9/B6MyTvT02Ws4WDjHnvjrhbN
3oCuOhwT4lFldAPFIobYMoK0zZVpEffzXOgZrnabEK1PuqI4jSw/wNupGAnvRyxs
t+m3qhKgRDpQRylIC3HT9zamf5n/r0ouehzTJvT5jLTpcxIW/CTyO6JVn1YWRMNC
Weh3cFhMpZXnuxMo02Jtvd9SIJXoI+oD5hws1/oomNZYXiqSG//u5g9FMZ2jkdhI
KJmuuTfp7SG7f7XCElguEbsSXTO0z3Bfq+OjH/9JH4FYOS2AUK2zxC9ftGzF7hky
/qxT0LKEvBUbNLEihFaWRs9O0W40nrcCldRtpeX/0ChSNVQxhO72O4WbSi/rzAyD
VCzlSkEI9Dz3XESxG5C7DuY057wWVnVPWByrjoo4EL3C/A/PGwGQH1vTXeChkWHQ
pV0FAJTRWvUtWWWJMyma+ZjmTn8upguj5eYXYkH28uLfAs2IOhwDxQBEN/+8nIQ5
iT4DvqM26FQ0kh9YJYOJkSxgWqHhjzb4v0PHm/Gx3KNiO1mwN8zQNsI7Karwmdt1
L/LpexwYfwWB8BJHAcKnuvSAVAhyYzDSzjwUJp26iwyhDLHczA5+dofpslK5/Pat
0N2HI9xdAmstONTNNcZz
=5lkC
-----END PGP SIGNATURE-----
Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull one more infiniband revert from Roland Dreier:
"One more last-second RDMA change for 3.19: Yann realized that the
previous revert of new userspace ABI did not go far enough, and we're
still exposing a change that we don't want. Revert even closer to
3.18 interface to make sure we get things right in the long run"
Yann Droneaud pipes up:
"I hope this could go in v3.19 as, at this stage, we don't want to
expose any bits of this ABI in a released kernel"
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
Revert "IB/core: Add support for extended query device caps"
This is the last missing piece to get a kernel booting to a prompt in qemu-cris.
Signed-off-by: Niklas Cassel <nks@flawful.org>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull scheduler fixes from Ingo Molnar:
"Misc fixes"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Fix deadline parameter modification handling
sched/wait: Remove might_sleep() from wait_event_cmd()
sched: Fix crash if cpuset_cpumask_can_shrink() is passed an empty cpumask
sched/fair: Avoid using uninitialized variable in preferred_group_nid()
Hopefully the final pull request for 3.19: this ended up with a
slightly higher volume than wished, but I put them all as they are
either stable or 3.19 regression fixes.
Most of commits are from ASoC, and have been stewed for a while in
linux-next. The only change in the common code is the regression
fixes for ASoC AC97 stuff wrt device registrations. The rest are
device-specific, mostly small fixes in various ASoC drivers and
ak411x on ice1724 boards.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJU1IACAAoJEGwxgFQ9KSmk3doP/AvQw++5VviePzYisnwWqiVF
zZ4hLTHXKyVTz00Lvhuf31Xnv7MUBQ1m8ww+b6jaUWKd1ZZRshITzIhYtP8HHE5k
LEDPuY1UeHiu0syidEVTCLZ1kXlNxKZ0F8da0rSl6/yN35l1lmiscPV0bpxXOsZH
wxo8HItzkB8Jl6KrvjX5Ftlu30no6qw475Ud0DmIXLzE9l2J5pcV/TYLiyLrKV1M
VnPvBFx5eCyIe2c9+V+29auQ2pdMmdqod66pnFDPSFPtGUF1cX/UuULRParEQWys
iMG/zMBBIa5H+7HP9DT4hjipZSuFyWM9nNx+l1Hlwr5GoaK/Q0wyunULjLtuE4x5
ScWA89D6okYpwJsiV6VNuzhk8qe7UUzUU/8CT8rtr8vbyJrFmGm9g7a3wmfPz2LV
GNFApYkz5q4eHViqiXE15k8obJRgQH/bXjlhGtZMFSDx1RFgbacZTxsIjlSO6krG
Wwt2mRMLQ/du2T1lIXcRYB6EwxxHFDdy3Yp8VzY3Yrf4Wg7Exyb3V2YrxtCDuONI
4DADJNk4xWII6ApniF94PTjiOkyX1juLvuQKAhVz/scteCOmAZbGaV7fNs4unwhn
nD2unzH+0EcOe77Ej/QS5I7X85R6k3aAQzhZdlQVQ46TdrlFBCvmMMhGJL8EtdXF
IeOnw/QI7KVo3mSE7XyS
=JNb4
-----END PGP SIGNATURE-----
Merge tag 'sound-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Hopefully the final pull request for 3.19: this ended up with a
slightly higher volume than wished, but I put them all as they are
either stable or 3.19 regression fixes.
Most of commits are from ASoC, and have been stewed for a while in
linux-next. The only change in the common code is the regression
fixes for ASoC AC97 stuff wrt device registrations. The rest are
device-specific, mostly small fixes in various ASoC drivers and ak411x
on ice1724 boards"
* tag 'sound-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ASoC: Intel: fix sst firmware path for cht-bsw-rt5672
ARM: dts: Fix I2S1, I2S2 compatible for exynos4 SoCs
ASoC: sgtl5000: add delay before first I2C access
MAINTAINERS: ASoC: add maintainer for Intel BDW/HSW ASoC driver
ASoC: atmel_ssc_dai: fix the setting for DSP mode
ASoC: sgtl5000: Use shift mask when setting codec mode
ASoC: tlv320aic3x: Fix data delay configuration
ALSA: ak411x: Fix stall in work callback
ASoC: Intel: Used lock version to update shim registers
ASoC: wm8731: init mutex in i2c init path
ASoC: atmel_ssc_dai: fix start event for I2S mode
ASoC: rt5640: Add RT5642 ACPI ID for Intel Baytrail
ASoC: wm97xx: Reset AC'97 device before registering it
ASoC: Add support for allocating AC'97 device before registering it
This patch introduces a new module parameter for the KVM module; when it
is present, KVM attempts a bit of polling on every HLT before scheduling
itself out via kvm_vcpu_block.
This parameter helps a lot for latency-bound workloads---in particular
I tested it with O_DSYNC writes with a battery-backed disk in the host.
In this case, writes are fast (because the data doesn't have to go all
the way to the platters) but they cannot be merged by either the host or
the guest. KVM's performance here is usually around 30% of bare metal,
or 50% if you use cache=directsync or cache=writethrough (these
parameters avoid that the guest sends pointless flush requests, and
at the same time they are not slow because of the battery-backed cache).
The bad performance happens because on every halt the host CPU decides
to halt itself too. When the interrupt comes, the vCPU thread is then
migrated to a new physical CPU, and in general the latency is horrible
because the vCPU thread has to be scheduled back in.
With this patch performance reaches 60-65% of bare metal and, more
important, 99% of what you get if you use idle=poll in the guest. This
means that the tunable gets rid of this particular bottleneck, and more
work can be done to improve performance in the kernel or QEMU.
Of course there is some price to pay; every time an otherwise idle vCPUs
is interrupted by an interrupt, it will poll unnecessarily and thus
impose a little load on the host. The above results were obtained with
a mostly random value of the parameter (500000), and the load was around
1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.
The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
that can be used to tune the parameter. It counts how many HLT
instructions received an interrupt during the polling period; each
successful poll avoids that Linux schedules the VCPU thread out and back
in, and may also avoid a likely trip to C1 and back for the physical CPU.
While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
Of these halts, almost all are failed polls. During the benchmark,
instead, basically all halts end within the polling period, except a more
or less constant stream of 50 per second coming from vCPUs that are not
running the benchmark. The wasted time is thus very low. Things may
be slightly different for Windows VMs, which have a ~10 ms timer tick.
The effect is also visible on Marcelo's recently-introduced latency
test for the TSC deadline timer. Though of course a non-RT kernel has
awful latency bounds, the latency of the timer is around 8000-10000 clock
cycles compared to 20000-120000 without setting halt_poll_ns. For the TSC
deadline timer, thus, the effect is both a smaller average latency and
a smaller variance.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- adds coupled cpuidle support for exynos4210
: fix for Exynos platform PM code preparing it for the coupled
cpuidle support and adds coupled cpuidle AFTR mode on exynos4210
Note this is mostrly based on earlier cpuidle-exynos4210 driver
from Daniel Lezcano and Bart updated.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJU0ikrAAoJEA0Cl+kVi2xqLoUQAI2JZi6XyeyJ+0YNGn+TpGEp
OBMks8hVyUXDIk+juuijRLh1MTzOkguMNKjS8y84IHpM8SvVUiPzh/b4pPBr7i14
c6zaocm17HePZwF9tR5VwLAAiyV+0f9rRgUUBSaVTyc5/tMFJES6iuNVzPwa7fqf
5u7YjAuV1N349NC8148wvvrg/ICktpdCydYmfOfOTsjgVNwiyd4kF3ofC7Xt9MrN
fNgdl+OQZxOIWFKkBw3p3+exWybgjd9GE5K9kuASmGWrIOXjRO4KxO/UqRBNJIhH
bVjnkEq88kczf4Z999YHO33fMaMC3y/h++luRgkzxOZge0KVbqusikq+dCc5je3D
k/TPn0tnc8A2nbbcvziuXOe/EhudqXWD+QZXhRcmUAeRFDA1DNsEfSHzujD1fbyn
TA9+RjLodCv2LT5wQq/EHs5yiAYChiUz/TGOBwhHzMUNvXKs3cSBtATabXgWrCtd
pYtI+o1FYekccafF1bUWMIR6BDHxAOb41FDQVZYzi7+ez5yHHYEwZGqPXLmY3x/f
8Nvyy2PgOouB5cHSxn9r0wtMM0nATWB1WNUIf3cPRCHZKpxpHOhrH4CEDsQG02FS
qhs8QARvDA2pCV+m9hFFKDt/VNMpzHnw6uC0XWlICrDk9sHPzwXVuB1MHcjr/n6w
wJhgNOVyfTEnIDQi2mGt
=nadx
-----END PGP SIGNATURE-----
Merge tag 'samsung-cpuidle' of git://git.kernel.org/pub/scm/linux/kernel/git/kgene/linux-samsung into next/drivers
Merge "Samsung CPUIdle updates for v3.20" from Kukjin Kim:
- adds coupled cpuidle support for exynos4210
: fix for Exynos platform PM code preparing it for the coupled
cpuidle support and adds coupled cpuidle AFTR mode on exynos4210
Note this is mostrly based on earlier cpuidle-exynos4210 driver
from Daniel Lezcano and Bart updated.
* tag 'samsung-cpuidle' of git://git.kernel.org/pub/scm/linux/kernel/git/kgene/linux-samsung:
cpuidle: exynos: add coupled cpuidle support for exynos4210
ARM: EXYNOS: apply S5P_CENTRAL_SEQ_OPTION fix only when necessary
Signed-off-by: Olof Johansson <olof@lixom.net>
For assigning sysfs entries for a card device from the driver,
introduce a new helper function, snd_card_add_dev_attr(). In this
way, we can avoid the possible race between the device registration
and the sysfs addition / removal.
The driver can pass a new attribute group to add freely. This has to
be called before snd_card_register().
Currently, up to two extra groups can be added. More than that, it'll
return an error.
Signed-off-by: Takashi Iwai <tiwai@suse.de>
While commit 7e36ef8205 ("IB/core: Temporarily disable
ex_query_device uverb") is correct as it makes the extended
QUERY_DEVICE uverb (which came as part of commit 5a77abf9a9
("IB/core: Add support for extended query device caps") and commit
860f10a799 ("IB/core: Add flags for on demand paging support")) not
available to userspace, it doesn't address the initial issue regarding
ib_copy_to_udata() [1][2].
Additionally, further discussions around this new uverb seems to
conclude it would require a different data structure than the one
currently described in <rdma/ib_user_verbs.h> [3].
Both of these issues require a revert of the changes, so this patch
partially reverts commit 8cdd312cfe ("IB/mlx5: Implement the ODP
capability query verb") and commit 860f10a799 ("IB/core: Add flags
for on demand paging support") and fully reverts commit 5a77abf9a9
("IB/core: Add support for extended query device caps").
[1] "Re: [PATCH v3 06/17] IB/core: Add support for extended query device caps"
http://mid.gmane.org/1418733236.2779.26.camel@opteya.com
[2] "Re: [PATCH] IB/core: Temporarily disable ex_query_device uverb"
http://mid.gmane.org/1423067503.3030.83.camel@opteya.com
[3] "RE: [PATCH v1 1/5] IB/uverbs: ex_query_device: answer must not depend on request's comp_mask"
http://mid.gmane.org/2807E5FD2F6FDA4886F6618EAC48510E0CC12C30@CRSMSX101.amr.corp.intel.com
Cc: Eli Cohen <eli@mellanox.com>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
variant of the rk3288-evb and the setting of a clock for the watchdog.
Also the lcd and hdmi controllers on both the firefly and the evb get
enabled and let us now boot into fbcon console sucessfully.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABCAAGBQJUy+f+AAoJEPOmecmc0R2BdFcH/i+9hgt9ElJxKbpPPVPVv81O
QzmIi3g+B4WqAoKBfzkug5l82mSZ4FAvmQK5ad8BYHsBaGl+GRjwwaqRqZkPUOzl
umXk0GneZ5sPK+S1UdA/S/v7hTwgEpXOLnGUCGUXbLt7Gqda99ume08jgCe2JbH5
Tfjoia5uRlOgdZIf3KBr4TkPR+LHog1i/DI9N3XrAsjLtj/igHpoRmFydBmv7noo
OuRuPY0afg1jKs/MextjM0W1JzQbGyv4Jp4NjVFv60+KZrMgS64vTDZamxYEK4Sv
wn0pQUZgevQhd3ewdLtqrcbo4wIGiLWBXQK0vzonmo5tCk8r+59QFKxeMiq2C7Q=
=ZVrD
-----END PGP SIGNATURE-----
Merge tag 'v3.20-rockchip-dts3' of git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip into next/dt
Merge "ARM: rockchip: third (and last) batch of dts updates for 3.20" from
Heiko Stübner:
Change are regulator nodes for the cpu and gpu regulators on the act8846
variant of the rk3288-evb and the setting of a clock for the watchdog.
Also the lcd and hdmi controllers on both the firefly and the evb get
enabled and let us now boot into fbcon console sucessfully.
* tag 'v3.20-rockchip-dts3' of git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip:
ARM: dts: rockchip: move the hdmi ddc-i2c-bus property to the actual boards
ARM: dts: rockchip: enable vops and hdmi output on rk3288-firefly and -evb
ARM: dts: rockchip: housekeeping off i2c0 on rk3288-evb boards
ARM: dts: rockchip: add cpu and gpu regulators to rk3288-evb-act8846
ARM: dts: rockchip: add rk3288 watchdog clock
clk: rockchip: add id for watchdog pclk on rk3288
clk: rockchip: add clock IDs for the PVTM clocks
clk: rockchip: add clock ID for usbphy480m_src
Signed-off-by: Olof Johansson <olof@lixom.net>
If we're sending an asynchronous layoutreturn, then we need to ensure
that the inode and the super block remain pinned.
Cc: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Peng Tao <tao.peng@primarydata.com>
If we're sending an asynchronous layoutcommit, then we need to ensure
that the inode and the super block remain pinned.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Peng Tao <tao.peng@primarydata.com>
RFC 4429 ("Optimistic DAD") states that optimistic addresses
should be treated as deprecated addresses. From section 2.1:
Unless noted otherwise, components of the IPv6 protocol stack
should treat addresses in the Optimistic state equivalently to
those in the Deprecated state, indicating that the address is
available for use but should not be used if another suitable
address is available.
Optimistic addresses are indeed avoided when other addresses are
available (i.e. at source address selection time), but they have
not heretofore been available for things like explicit bind() and
sendmsg() with struct in6_pktinfo, etc.
This change makes optimistic addresses treated more like
deprecated addresses than tentative ones.
Signed-off-by: Erik Kline <ek@google.com>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/vxlan.c
drivers/vhost/net.c
include/linux/if_vlan.h
net/core/dev.c
The net/core/dev.c conflict was the overlap of one commit marking an
existing function static whilst another was adding a new function.
In the include/linux/if_vlan.h case, the type used for a local
variable was changed in 'net', whereas the function got rewritten
to fix a stacked vlan bug in 'net-next'.
In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next'
overlapped with an endainness fix for VHOST 1.0 in 'net'.
In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter
in 'net-next' whereas in 'net' there was a bug fix to pass in the
correct network namespace pointer in calls to this function.
Signed-off-by: David S. Miller <davem@davemloft.net>
These are rather too large for this late in the release cycle but
they're clear, well understood and have been tested to fix a regression
which was introduced for v3.19. The details are all in Lars' changelog
and they've been cooking in -next for a while, to a large extent out
of conservatism about the size.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJU08iuAAoJECTWi3JdVIfQ9cUH/1ZNkoS67zEj/wsvcS+co/Dp
hInq7mAwO/usFfFWMzb68Lofc4g5yTGE+vtTAhggmzMGuqjXEMCAS/tO85EoA65d
2f0YNw7gEhUyp0oYPe5tS3I0F8O7ES+u27PwmM8sGO0KdCc9qrb3ppZN3BkEcr2U
akg854atOcKwBvGRLNSyj79WPkjNyMcM/kWkJCEHFLysu8rSilZ+UaekH+CUw02E
hstLM+PlM0LE/9rKmQvRyx6cnVqKqPWxalG0mCtBwpdHLYMn4yDFc3nn71lMn3oQ
/TIcHSfDCRgx3VVW8daBjNHqLgfg4XCcZ5W646zrpyyrlXRCCPJT8N0SY9jZUU4=
=ScU7
-----END PGP SIGNATURE-----
Merge tag 'asoc-fix-ac97-v3.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: AC'97 fixes
These are rather too large for this late in the release cycle but
they're clear, well understood and have been tested to fix a regression
which was introduced for v3.19. The details are all in Lars' changelog
and they've been cooking in -next for a while, to a large extent out
of conservatism about the size.