Commit Graph

64413 Commits

Author SHA1 Message Date
Eric Dumazet
c3f9b01849 tcp: tcp_release_cb() should release socket ownership
Lars Persson reported the following deadlock:

-000 |M:0x0:0x802B6AF8(asm) <-- arch_spin_lock
-001 |tcp_v4_rcv(skb = 0x8BD527A0) <-- sk = 0x8BE6B2A0
-002 |ip_local_deliver_finish(skb = 0x8BD527A0)
-003 |__netif_receive_skb_core(skb = 0x8BD527A0, ?)
-004 |netif_receive_skb(skb = 0x8BD527A0)
-005 |elk_poll(napi = 0x8C770500, budget = 64)
-006 |net_rx_action(?)
-007 |__do_softirq()
-008 |do_softirq()
-009 |local_bh_enable()
-010 |tcp_rcv_established(sk = 0x8BE6B2A0, skb = 0x87D3A9E0, th = 0x814EBE14, ?)
-011 |tcp_v4_do_rcv(sk = 0x8BE6B2A0, skb = 0x87D3A9E0)
-012 |tcp_delack_timer_handler(sk = 0x8BE6B2A0)
-013 |tcp_release_cb(sk = 0x8BE6B2A0)
-014 |release_sock(sk = 0x8BE6B2A0)
-015 |tcp_sendmsg(?, sk = 0x8BE6B2A0, ?, ?)
-016 |sock_sendmsg(sock = 0x8518C4C0, msg = 0x87D8DAA8, size = 4096)
-017 |kernel_sendmsg(?, ?, ?, ?, size = 4096)
-018 |smb_send_kvec()
-019 |smb_send_rqst(server = 0x87C4D400, rqst = 0x87D8DBA0)
-020 |cifs_call_async()
-021 |cifs_async_writev(wdata = 0x87FD6580)
-022 |cifs_writepages(mapping = 0x852096E4, wbc = 0x87D8DC88)
-023 |__writeback_single_inode(inode = 0x852095D0, wbc = 0x87D8DC88)
-024 |writeback_sb_inodes(sb = 0x87D6D800, wb = 0x87E4A9C0, work = 0x87D8DD88)
-025 |__writeback_inodes_wb(wb = 0x87E4A9C0, work = 0x87D8DD88)
-026 |wb_writeback(wb = 0x87E4A9C0, work = 0x87D8DD88)
-027 |wb_do_writeback(wb = 0x87E4A9C0, force_wait = 0)
-028 |bdi_writeback_workfn(work = 0x87E4A9CC)
-029 |process_one_work(worker = 0x8B045880, work = 0x87E4A9CC)
-030 |worker_thread(__worker = 0x8B045880)
-031 |kthread(_create = 0x87CADD90)
-032 |ret_from_kernel_thread(asm)

The bug occurs because __tcp_checksum_complete_user() enables BH, assuming
it is running from softirq context.

Lars' trace involved a NIC without RX checksum support, but other points
are problematic as well, like the prequeue stuff.

The problem is triggered by a timer that found the socket owned by the user.

tcp_release_cb() should call tcp_write_timer_handler() or
tcp_delack_timer_handler() in the appropriate context:

BH disabled and socket lock held, but 'owned' field cleared,
as if they were running from timer handlers.
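
The ordering described above can be sketched with a toy model. This is not
the kernel code: every name below is hypothetical, and BH disabling and the
socket spinlock are left out. It only shows the release path clearing the
'owned' field before it runs the deferred timer work, so the handlers see
the same state they would see when called from a timer.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: every name here is hypothetical, not a kernel API. */
enum {
    TOY_DEFERRED_WRITE_TIMER  = 1u << 0,
    TOY_DEFERRED_DELACK_TIMER = 1u << 1,
};

struct toy_sock {
    bool owned;          /* true while process context owns the socket */
    unsigned deferred;   /* timer work postponed because of ownership */
    unsigned handled;    /* which handlers have actually run */
};

/* Handlers expect timer-like context: the socket must not look owned. */
static void toy_run_handler(struct toy_sock *sk, unsigned which)
{
    assert(!sk->owned);
    sk->handled |= which;
}

/* Timer path: run now if possible, otherwise leave a note for release. */
static void toy_timer_fired(struct toy_sock *sk, unsigned which)
{
    if (sk->owned)
        sk->deferred |= which;
    else
        toy_run_handler(sk, which);
}

/* Release path: drop ownership first, then run the deferred work. */
static void toy_release_sock(struct toy_sock *sk)
{
    unsigned pending = sk->deferred;

    sk->deferred = 0;
    sk->owned = false;
    if (pending & TOY_DEFERRED_WRITE_TIMER)
        toy_run_handler(sk, TOY_DEFERRED_WRITE_TIMER);
    if (pending & TOY_DEFERRED_DELACK_TIMER)
        toy_run_handler(sk, TOY_DEFERRED_DELACK_TIMER);
}
```

If toy_release_sock() ran the handlers before clearing 'owned', the
assertion in toy_run_handler() would fire, which is the toy equivalent of
a handler running in a context it does not expect.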

Fixes: 6f458dfb40 ("tcp: improve latencies of timer triggered events")
Reported-by: Lars Persson <lars.persson@axis.com>
Tested-by: Lars Persson <lars.persson@axis.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-11 16:45:59 -04:00
Andrew Lutomirski
adca476782 net: Improve SO_TIMESTAMPING documentation and fix a minor code bug
The original documentation was very unclear.

The code fix is presumably related to the formerly unclear
documentation: SOCK_TIMESTAMPING_RX_SOFTWARE has no effect on
__sock_recv_timestamp's behavior, so calling __sock_recv_ts_and_drops
from sock_recv_ts_and_drops if only SOCK_TIMESTAMPING_RX_SOFTWARE is
set is pointless.  This should have no user-observable effect.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06 16:18:01 -05:00
Linus Torvalds
c3bebc71c4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix memory leak in ieee80211_prep_connection(), sta_info leaked on
    error.  From Eytan Lifshitz.

 2) Unintentional switch case fallthrough in nft_reject_inet_eval(),
    from Patrick McHardy.

 3) Must check if payload length is a power of 2 in
    nft_payload_select_ops(), from Nikolay Aleksandrov.

 4) Fix mis-checksumming in xen-netfront driver, ip_hdr() is not in the
    correct place when we invoke skb_checksum_setup().  From Wei Liu.

 5) TUN driver should not advertise HW vlan offload features in
    vlan_features.  Fix from Fernando Luis Vazquez Cao.

 6) IPV6_VTI needs to select NET_IPV_TUNNEL to avoid build errors, fix
    from Steffen Klassert.

 7) Add missing locking in xfrm_migrate_state_find(), we must hold the
    per-namespace xfrm_state_lock while traversing the lists.  Fix from
    Steffen Klassert.

 8) Missing locking in ath9k driver, access to tid->sched must be done
    under ath_txq_lock().  Fix from Stanislaw Gruszka.

 9) Fix two bugs in TCP fastopen.  First respect the size argument given
    to tcp_sendmsg() in the fastopen path, and secondly prevent
    tcp_send_syn_data() from potentially using order-5 allocations.
    From Eric Dumazet.

10) Fix handling of default neigh garbage collection params, from Jiri
    Pirko.

11) Fix cwnd bloat and over-inflation of RTT when transmit segmentation
    is in use.  From Eric Dumazet.

12) Missing initialization of Realtek r8169 driver's statistics
    seqlocks.  Fix from Kyle McMartin.

13) Fix RTNL assertion failures in 802.3ad and AB ARP monitor of bonding
    driver, from Ding Tianhong.

14) Bonding slave release race can cause divide by zero, fix from
    Nikolay Aleksandrov.

15) Overzealous return from neigh_periodic_work() causes reachability
    time to not be computed.  Fix from Duain Jiong.

16) Fix regression in ipv6_find_hdr(), it should not return -ENOENT when
    a specific target is specified and found.  From Hans Schillstrom.

17) Fix VLAN tag stripping regression in BNA driver, from Ivan Vecera.

18) Tail loss probe can calculate bogus RTTs due to missing packet
    marking on retransmit.  Fix from Yuchung Cheng.

19) We cannot do skb_dst_drop() in iptunnel_pull_header() because
    multicast loopback detection in later code paths needs access to
    skb_rtable().  Fix from Xin Long.

20) The macvlan driver regresses in that it propagates lower device
    offload support disables into itself, causing severe slowdowns when
    running over a bridge.  Provide the software offloads always on
    macvlan devices to deal with this and the regression is gone.  From
    Vlad Yasevich.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (103 commits)
  macvlan: Add support for 'always_on' offload features
  net: sctp: fix sctp_sf_do_5_1D_ce to verify if we/peer is AUTH capable
  ip_tunnel:multicast process cause panic due to skb->_skb_refdst NULL pointer
  net: cpsw: fix cpdma rx descriptor leak on down interface
  be2net: isolate TX workarounds not applicable to Skyhawk-R
  be2net: Fix skb double free in be_xmit_wrokarounds() failure path
  be2net: clear promiscuous bits in adapter->flags while disabling promiscuous mode
  be2net: Fix to reset transparent vlan tagging
  qlcnic: dcb: a couple off by one bugs
  tcp: fix bogus RTT on special retransmission
  hsr: off by one sanity check in hsr_register_frame_in()
  can: remove CAN FD compatibility for CAN 2.0 sockets
  can: flexcan: factor out soft reset into seperate funtion
  can: flexcan: flexcan_remove(): add missing netif_napi_del()
  can: flexcan: fix transition from and to freeze mode in chip_{,un}freeze
  can: flexcan: factor out transceiver {en,dis}able into seperate functions
  can: flexcan: fix transition from and to low power mode in chip_{en,dis}able
  can: flexcan: flexcan_open(): fix error path if flexcan_chip_start() fails
  can: flexcan: fix shutdown: first disable chip, then all interrupts
  USB AX88179/178A: Support D-Link DUB-1312
  ...
2014-03-04 08:44:32 -08:00
Linus Torvalds
3f803abf2e Merge branch 'akpm' (patches from Andrew Morton)
Merge misc fixes from Andrew Morton.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm: page_alloc: exempt GFP_THISNODE allocations from zone fairness
  mm: numa: bugfix for LAST_CPUPID_NOT_IN_PAGE_FLAGS
  MAINTAINERS: add and correct types of some "T:" entries
  MAINTAINERS: use tab for separator
  rapidio/tsi721: fix tasklet termination in dma channel release
  hfsplus: fix remount issue
  zram: avoid null access when fail to alloc meta
  sh: prefix sh-specific "CCR" and "CCR2" by "SH_"
  ocfs2: fix quota file corruption
  drivers/rtc/rtc-s3c.c: fix incorrect way of save/restore of S3C2410_TICNT for TYPE_S3C64XX
  kallsyms: fix absolute addresses for kASLR
  scripts/gen_initramfs_list.sh: fix flags for initramfs LZ4 compression
  mm: include VM_MIXEDMAP flag in the VM_SPECIAL list to avoid m(un)locking
  memcg: reparent charges of children before processing parent
  memcg: fix endless loop in __mem_cgroup_iter_next()
  lib/radix-tree.c: swapoff tmpfs radix_tree: remember to rcu_read_unlock
  dma debug: account for cachelines and read-only mappings in overlap tracking
  mm: close PageTail race
  MAINTAINERS: EDAC: add Mauro and Borislav as interim patch collectors
2014-03-04 08:29:39 -08:00
Liu Ping Fan
1ae71d0319 mm: numa: bugfix for LAST_CPUPID_NOT_IN_PAGE_FLAGS
When doing some NUMA tests on powerpc, I triggered an oops.  It is
caused by the use of page->_last_cpupid, which should be initialized as
"-1 & LAST_CPUPID_MASK" rather than "-1".  Otherwise, the check
(last_cpupid == (-1 & LAST_CPUPID_MASK)) in task_numa_fault() never
matches, which finally causes an oops in task_numa_group(), since there
are fewer online CPUs than possible CPUs.  This happens with
CONFIG_SPARSE_VMEMMAP disabled.
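
The mismatch between the two initial values can be shown with a small
sketch.  The field width below is a made-up value chosen for illustration
(the real width is derived from the kernel configuration), and the helper
name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical width, for illustration only. */
#define TOY_CPUPID_SHIFT 16
#define TOY_CPUPID_MASK  ((1 << TOY_CPUPID_SHIFT) - 1)

/* Mirrors the shape of the check quoted in the commit message. */
static bool toy_cpupid_is_unset(int last_cpupid)
{
    return last_cpupid == (-1 & TOY_CPUPID_MASK);
}

static void toy_cpupid_demo(void)
{
    /* A field initialized to plain -1 is never recognized as unset. */
    assert(!toy_cpupid_is_unset(-1));
    /* A field initialized to the masked value is. */
    assert(toy_cpupid_is_unset(-1 & TOY_CPUPID_MASK));
}
```

With a plain -1 the "unset" case goes undetected, so later code would
treat the stale value as a real CPU/PID pair.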

Call trace:

  SMP NR_CPUS=64 NUMA PowerNV
  Modules linked in:
  CPU: 24 PID: 804 Comm: systemd-udevd Not tainted 3.13.0-rc1+ #32
  task: c000001e2746aa80 ti: c000001e32c50000 task.ti: c000001e32c50000
  REGS: c000001e32c53510 TRAP: 0300   Not tainted (3.13.0-rc1+)
  MSR: 9000000000009032 <SF,HV,EE,ME,IR,DR,RI>  CR: 28024424  XER: 20000000
  CFAR: c000000000009324 DAR: 7265717569726857 DSISR: 40000000 SOFTE: 1
  NIP  .task_numa_fault+0x1470/0x2370
  LR  .task_numa_fault+0x1468/0x2370
  Call Trace:
   .task_numa_fault+0x1468/0x2370 (unreliable)
   .do_numa_page+0x480/0x4a0
   .handle_mm_fault+0x4ec/0xc90
   .do_page_fault+0x3a8/0x890
   handle_page_fault+0x10/0x30
  Instruction dump:
  3c82fefb 3884b138 48d9cff1 60000000 48000574 3c62fefb 3863af78 3c82fefb
  3884b138 48d9cfd5 60000000 e93f0100 <812902e4> 7d2907b4 5529063e 7d2a07b4
  ---[ end trace 15f2510da5ae07cf ]---

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04 07:55:50 -08:00
Vlastimil Babka
9050d7eba4 mm: include VM_MIXEDMAP flag in the VM_SPECIAL list to avoid m(un)locking
Daniel Borkmann reported a VM_BUG_ON assertion failing:

  ------------[ cut here ]------------
  kernel BUG at mm/mlock.c:528!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: ccm arc4 iwldvm [...]
   video
  CPU: 3 PID: 2266 Comm: netsniff-ng Not tainted 3.14.0-rc2+ #8
  Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012
  task: ffff8801f87f9820 ti: ffff88002cb44000 task.ti: ffff88002cb44000
  RIP: 0010:[<ffffffff81171ad0>]  [<ffffffff81171ad0>] munlock_vma_pages_range+0x2e0/0x2f0
  Call Trace:
    do_munmap+0x18f/0x3b0
    vm_munmap+0x41/0x60
    SyS_munmap+0x22/0x30
    system_call_fastpath+0x1a/0x1f
  RIP   munlock_vma_pages_range+0x2e0/0x2f0
  ---[ end trace a0088dcf07ae10f2 ]---

because munlock_vma_pages_range() thinks it's unexpectedly in the middle
of a THP page.  This can be reproduced with default config since 3.11
kernels.  A reproducer can be found in the kernel's selftest directory
for networking by running ./psock_tpacket [1].

The problem is that an order=2 compound page (allocated by
alloc_one_pg_vec_page()) is part of the munlocked VM_MIXEDMAP vma (mapped
by packet_mmap()) and mistaken for a THP page and assumed to be order=9.

The checks for THP in munlock came with commit ff6a6da60b ("mm:
accelerate munlock() treatment of THP pages"), i.e.  since 3.9, but did
not trigger a bug.  It just makes munlock_vma_pages_range() skip such
compound pages until the next 512-pages-aligned page, when it encounters
a head page.  This is however not a problem for vma's where mlocking has
no effect anyway, but it can distort the accounting.
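
The size of the skip can be worked through with a little arithmetic.  The
helper below is a hypothetical sketch, not a copy of the munlock code: it
computes how many pages are stepped over to reach the next boundary for a
given page mask, where the mask is (1 << order) - 1.

```c
#include <assert.h>

/* Pages to advance from pfn to the next (page_mask + 1)-aligned page. */
static unsigned long toy_page_increment(unsigned long pfn,
                                        unsigned long page_mask)
{
    return 1 + (~pfn & page_mask);
}

static void toy_skip_demo(void)
{
    /* Treated as order=9 (mask 511): jump to the next 512-page boundary. */
    assert(toy_page_increment(0, 511) == 512);
    assert(toy_page_increment(4, 511) == 508);
    /* Treated as the order=2 page it really is (mask 3): skip 4 pages. */
    assert(toy_page_increment(4, 3) == 4);
    assert(toy_page_increment(5, 3) == 3);
}
```

An order=2 page mistaken for order=9 therefore causes hundreds of
unrelated pages to be stepped over instead of four.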

Since commit 7225522bb4 ("mm: munlock: batch non-THP page isolation
and munlock+putback using pagevec") this can trigger a VM_BUG_ON in
PageTransHuge() check.

This patch fixes the issue by adding VM_MIXEDMAP flag to VM_SPECIAL, a
list of flags that make vma's non-mlockable and non-mergeable.  The
reasoning is that VM_MIXEDMAP vma's are similar to VM_PFNMAP, which is
already on the VM_SPECIAL list, and both are intended for non-LRU pages
where mlocking makes no sense anyway.  A related LKML discussion can be
found in [2].

 [1] tools/testing/selftests/net/psock_tpacket
 [2] https://lkml.org/lkml/2014/1/10/427

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Reported-by: Daniel Borkmann <dborkman@redhat.com>
Tested-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: John David Anglin <dave.anglin@bell.net>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Jared Hulbert <jaredeh@gmail.com>
Tested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org> [3.11.x+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04 07:55:48 -08:00
David Rientjes
668f9abbd4 mm: close PageTail race
Commit bf6bddf192 ("mm: introduce compaction and migration for
ballooned pages") introduces page_count(page) into memory compaction
which dereferences page->first_page if PageTail(page).

This results in a very rare NULL pointer dereference on the
aforementioned page_count(page).  Indeed, anything that does
compound_head(), including page_count() is susceptible to racing with
prep_compound_page() and seeing a NULL or dangling page->first_page
pointer.

This patch uses Andrea's implementation of compound_trans_head() that
deals with such a race and makes it the default compound_head()
implementation.  This includes a read memory barrier that ensures that
if PageTail(head) is true that we return a head page that is neither
NULL nor dangling.  The patch then adds a store memory barrier to
prep_compound_page() to ensure page->first_page is set.

This is the safest way to ensure we see the head page that we are
expecting.  PageTail(page) is already in the unlikely() path, and the
memory barriers are unfortunately required.

Hugetlbfs is the exception, we don't enforce a store memory barrier
during init since no race is possible.
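
The pairing of a store barrier on the writer side with a read barrier on
the reader side can be sketched in userspace C.  This is an analogy under
stated assumptions, not the kernel implementation: it swaps in C11
release/acquire atomics for the kernel's barriers, and all names are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
    struct toy_page *first_page;  /* head pointer, valid for tail pages */
    atomic_bool tail;             /* stands in for the PageTail flag */
};

/* Writer: publish the head pointer before the tail flag becomes visible. */
static void toy_prep_tail(struct toy_page *p, struct toy_page *head)
{
    p->first_page = head;
    atomic_store_explicit(&p->tail, true, memory_order_release);
}

/* Reader: trust first_page only after observing the tail flag. */
static struct toy_page *toy_compound_head(struct toy_page *p)
{
    if (atomic_load_explicit(&p->tail, memory_order_acquire))
        return p->first_page;
    return p;
}
```

Because the release store orders the pointer write before the flag, a
reader that sees the flag set cannot see a NULL or stale first_page in
this model.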

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04 07:55:47 -08:00
Linus Torvalds
7abd42eab3 Clock framework and driver fixes, all of which fix user-visible
regressions. There is a single framework fix that prevents dereferencing
 a NULL pointer when calling clk_get. The range of fixes for clock driver
 regressions spans memory leak fixes, touching the wrong registers that
 cause things to explode, misconfigured clock rates that result in
 non-responsive devices and even some boot failures. The most benign fix
 is DT binding doc typo. It is a stable ABI exposed from the kernel that
 was introduced in -rc1, so best to fix it now.

Merge tag 'clk-fixes-for-linus' of git://git.linaro.org/people/mike.turquette/linux

Pull clk framework fixes from Mike Turquette:
 "Clock framework and driver fixes, all of which fix user-visible
  regressions.

  There is a single framework fix that prevents dereferencing a NULL
  pointer when calling clk_get.  The range of fixes for clock driver
  regressions spans memory leak fixes, touching the wrong registers that
  cause things to explode, misconfigured clock rates that result in
  non-responsive devices and even some boot failures.  The most benign
  fix is DT binding doc typo.  It is a stable ABI exposed from the
  kernel that was introduced in -rc1, so best to fix it now"

* tag 'clk-fixes-for-linus' of git://git.linaro.org/people/mike.turquette/linux: (25 commits)
  clk:at91: Fix memory leak in of_at91_clk_master_setup()
  clk: nomadik: fix multiplatform problem
  clk: Correct handling of NULL clk in __clk_{get, put}
  clk: shmobile: Fix typo in MSTP clock DT bindings
  clk: shmobile: rcar-gen2: Fix qspi divisor
  clk: shmobile: rcar-gen2: Fix clock parent for all non-PLL clocks
  clk: tegra124: remove gr2d and gr3d clocks
  clk: tegra: Fix vic03 mux index
  clk: shmobile: rcar-gen2: Fix qspi divisor
  clk: shmobile: rcar-gen2: Fix clock parent all non-PLL clocks
  clk: tegra: use max divider if divider overflows
  clk: tegra: cclk_lp has a pllx/2 divider
  clk: tegra: fix sdmmc clks on Tegra1x4
  clk: tegra: fix host1x clock on Tegra124
  clk: tegra: PLLD2 fixes for hdmi
  clk: tegra: Fix PLLD mnp table
  clk: tegra: Fix PLLP rate table
  clk: tegra: Correct clock number for UARTE
  clk: tegra: Add missing Tegra20 fuse clks
  ARM: keystone: dts: fix clkvcp3 control register address
  ...
2014-03-03 10:47:46 -08:00
Linus Torvalds
3751c97036 Driver core fix for 3.14-rc5
Here is a single sysfs fix for 3.14-rc5.  It fixes a reported problem
 with the namespace code in sysfs.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Merge tag 'driver-core-3.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull sysfs fix from Greg KH:
 "Here is a single sysfs fix for 3.14-rc5.  It fixes a reported problem
  with the namespace code in sysfs"

* tag 'driver-core-3.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  sysfs: fix namespace refcnt leak
2014-03-02 15:13:41 -08:00
Linus Torvalds
7aa483554d sound fixes for 3.14-rc5
It's a bad habit to get a higher volume of fixes often lately, but
 things happen again.  All commits found here are real bug fixes,
 and are mostly trivial.  Most of changes in ASoC are the fixes for
 enum items due to the wrong API usages, in addition to a few DAPM
 mutex deadlock and other fixes.  In HD-audio, only fixups for HP
 laptops.  Although diffstat shows much, the changes are simple:
 there are just so many different device entries there.

Merge tag 'sound-3.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
 "It's a bad habit to get a higher volume of fixes often lately, but
  things happen again.

  All commits found here are real bug fixes, and are mostly trivial.
  Most of changes in ASoC are the fixes for enum items due to the wrong
  API usages, in addition to a few DAPM mutex deadlock and other fixes.
  In HD-audio, only fixups for HP laptops.  Although diffstat shows
  much, the changes are simple: there are just so many different device
  entries there"

* tag 'sound-3.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ASoC: sta32x: Fix wrong enum for limiter2 release rate
  ASoC: da732x: Mark DC offset control registers volatile
  ALSA: hda/realtek - Add more entry for enable HP mute led
  ALSA: hda - Add a fixup for HP Folio 13 mute LED
  ASoC: wm8958-dsp: Fix firmware block loading
  ASoC: sta32x: Fix cache sync
  ALSA: hda/realtek - Add more entry for enable HP mute led
  ASoC: dapm: Add locking to snd_soc_dapm_xxxx_pin functions
  Input - arizona-haptics: Fix double lock of dapm_mutex
  ASoC: wm8400: Fix the wrong number of enum items
  ASoC: isabelle: Fix the wrong number of items in enum ctls
  ASoC: ad1980: Fix wrong number of items for capture source
  ASoC: wm8994: Fix the wrong number of enum items
  ASoC: wm8900: Fix the wrong number of enum items
  ASoC: wm8770: Fix wrong number of enum items
  ASoC: sta32x: Fix array access overflow
  ASoC: dapm: Correct regulator bypass error messages
2014-02-28 11:50:32 -08:00
David S. Miller
23187212e7 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
1) Build fix for ip_vti when NET_IP_TUNNEL is not set.
   We need this set to have ip_tunnel_get_stats64()
   available.

2) Fix a NULL pointer dereference on sub policy usage.
   We try to access a xfrm_state from the wrong array.

3) Take xfrm_state_lock in xfrm_migrate_state_find(),
   we need it to traverse through the state lists.

4) Clone states properly on migration, otherwise we crash
   when we migrate a state with aead algorithm attached.

5) Fix unlink race between thread context and timer
   when policies are deleted.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-27 16:19:41 -05:00
Linus Torvalds
86c7654f4a Metag arch and asm-generic fixes for v3.14
- Add the new sched_setattr/sched_getattr syscalls to the asm-generic
   syscall list, which is used by arc, arm64, c6x, hexagon, metag,
   openrisc, score, tile, and unicore32.
 
 - An IRQ affinity bug fix for metag to prevent interrupts being vectored
   to offline CPUs when their affinity is changed via /proc/irq/ (thanks
   tglx).

Merge tag 'metag-fixes-v3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/metag

Pull Metag arch and asm-generic fixes from James Hogan:

 - Add the new sched_setattr/sched_getattr syscalls to the asm-generic
   syscall list, which is used by arc, arm64, c6x, hexagon, metag,
   openrisc, score, tile, and unicore32.

 - An IRQ affinity bug fix for metag to prevent interrupts being
   vectored to offline CPUs when their affinity is changed via
   /proc/irq/ (thanks tglx).

* tag 'metag-fixes-v3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/metag:
  irq-metag*: stop set_affinity vectoring to offline cpus
  asm-generic: add sched_setattr/sched_getattr syscalls
2014-02-27 10:54:52 -08:00
Linus Torvalds
8d7531825c Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull filesystem fixes from Jan Kara:
 "Notification, writeback, udf, quota fixes

  The notification patches are (with one exception) a fallout of my
  fsnotify rework which went into -rc1 (I've extended LTP to cover these
  corner cases to avoid similar breakage in future).

  The UDF patch is a nasty data corruption Al has recently reported,
  the revert of the writeback patch is due to possibility of violating
  sync(2) guarantees, and a quota bug can lead to corruption of quota
  files in ocfs2"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fsnotify: Allocate overflow events with proper type
  fanotify: Handle overflow in case of permission events
  fsnotify: Fix detection whether overflow event is queued
  Revert "writeback: do not sync data dirtied after sync start"
  quota: Fix race between dqput() and dquot_scan_active()
  udf: Fix data corruption on file type conversion
  inotify: Fix reporting of cookies for inotify events
2014-02-27 10:37:22 -08:00
Davidlohr Bueso
f3713fd9cf ipc,mqueue: remove limits for the amount of system-wide queues
Commit 93e6f119c0 ("ipc/mqueue: cleanup definition names and
locations") added global hardcoded limits to the amount of message
queues that can be created.  While these limits are per-namespace,
reality is that it ends up breaking userspace applications.
Historically users have, at least in theory, been able to create up to
INT_MAX queues, and limiting it to just 1024 is way too low and dramatic
for some workloads and use cases.  For instance, Madars reports:

 "This update imposes bad limits on our multi-process application.  As
  our app uses approaches that each process opens its own set of queues
  (usually something about 3-5 queues per process).  In some scenarios
  we might run up to 3000 processes or more (which of-course for linux
  is not a problem).  Thus we might need up to 9000 queues or more.  All
  processes run under one user."

Other affected users can be found in launchpad bug #1155695:
  https://bugs.launchpad.net/ubuntu/+source/manpages/+bug/1155695

Instead of increasing this limit, revert it entirely and fall back to the
original way of dealing with queue limits -- where once a user's resource
limit is reached, and all memory is used, new queues cannot be created.

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reported-by: Madars Vitolins <m@silodev.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>	[3.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-25 15:25:45 -08:00
Li Zefan
fed95bab8d sysfs: fix namespace refcnt leak
As mount() and kill_sb() are not a one-to-one match, we shouldn't get
the ns refcnt unconditionally in sysfs_mount(); instead we should
get the refcnt only when kernfs_mount() allocated a new superblock.
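
The idea can be sketched with a toy model.  All names below are
hypothetical and the logic is deliberately simplified (one superblock
slot, no locking); it only illustrates an optional out-argument that
tells the caller whether a new superblock was allocated, so that one
reference is taken per superblock rather than per mount() call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_ns { int refcnt; };
struct toy_sb { struct toy_ns *ns; bool live; };

/* Reuses the superblock if it is live; reports whether it created one. */
static struct toy_sb *toy_kernfs_mount(struct toy_sb *slot, struct toy_ns *ns,
                                       bool *new_sb_created)
{
    bool created = !slot->live;

    if (created) {
        slot->live = true;
        slot->ns = ns;
    }
    if (new_sb_created)          /* the out-argument is optional */
        *new_sb_created = created;
    return slot;
}

/* Takes the namespace reference only for a newly created superblock. */
static struct toy_sb *toy_sysfs_mount(struct toy_sb *slot, struct toy_ns *ns)
{
    bool new_sb = false;
    struct toy_sb *sb = toy_kernfs_mount(slot, ns, &new_sb);

    if (new_sb)
        ns->refcnt++;
    return sb;
}

/* Runs once per superblock, however many times it was mounted. */
static void toy_kill_sb(struct toy_sb *sb)
{
    assert(sb->live);
    sb->live = false;
    sb->ns->refcnt--;
}
```

Mounting twice and killing the superblock once leaves the count balanced
in this model; taking the reference on every mount would leak one.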

v2:
- Changed the name of the new argument, suggested by Tejun.
- Made the argument optional, suggested by Tejun.

v3:
- Make the new argument as second-to-last arg, suggested by Tejun.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
 ---
 fs/kernfs/mount.c      | 8 +++++++-
 fs/sysfs/mount.c       | 5 +++--
 include/linux/kernfs.h | 9 +++++----
 3 files changed, 15 insertions(+), 7 deletions(-)
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-25 07:37:52 -08:00
Jan Kara
ff57cd5863 fsnotify: Allocate overflow events with proper type
Commit 7053aee26a "fsnotify: do not share events between notification
groups" used an overflow event statically allocated in a group with the
size of the generic notification event. This causes problems because
some code looks at type-specific parts of the event structure, gets
confused by the random data it sees there, and crashes.

Fix the problem by allocating the overflow event with a type corresponding
to the group type so code cannot get confused.

Signed-off-by: Jan Kara <jack@suse.cz>
2014-02-25 11:18:06 +01:00
Mike Turquette
10b7cdc008 Merge branch 'clocks/fixes/drivers' of git://linuxtv.org/pinchartl/fbdev into clk-fixes
2014-02-24 22:21:29 -08:00
James Hogan
e6cfc0295c asm-generic: add sched_setattr/sched_getattr syscalls
Add the sched_setattr and sched_getattr syscalls to the generic syscall
list, which is used by the following architectures: arc, arm64, c6x,
hexagon, metag, openrisc, score, tile, unicore32.

Signed-off-by: James Hogan <james.hogan@imgtec.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: linux-arch@vger.kernel.org
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Mark Salter <msalter@redhat.com>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: linux-c6x-dev@linux-c6x.org
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: linux-hexagon@vger.kernel.org
Cc: linux-metag@vger.kernel.org
Cc: Jonas Bonn <jonas@southpole.se>
Cc: linux@lists.openrisc.net
Cc: Chen Liqin <liqin.linux@gmail.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
2014-02-24 11:55:20 +00:00
Mark Brown
45d39cbf00 Merge remote-tracking branch 'asoc/fix/dapm' into asoc-linus
2014-02-23 12:20:32 +09:00
Eric Dumazet
f5ddcbbb40 net-tcp: fastopen: fix high order allocations
This patch fixes two bugs in fastopen :

1) The tcp_sendmsg(...,  @size) argument was ignored.

   Code was relying on user not fooling the kernel with iovec mismatches

2) When MTU is about 64KB, tcp_send_syn_data() attempts order-5
allocations, which are likely to fail when memory gets fragmented.
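
The order-5 figure can be checked with a short calculation.  The sketch
below assumes 4 KiB pages and uses a hypothetical helper that returns the
smallest order whose span covers a given size.  Exactly 64 KiB fits in
order 4, but any extra bytes on top (presumably headers and buffer
metadata here) push the request to order 5, i.e. 32 contiguous pages.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest order such that (page_size << order) >= size. */
static unsigned toy_get_order(size_t size, size_t page_size)
{
    unsigned order = 0;
    size_t span = page_size;

    assert(size > 0 && page_size > 0);
    while (span < size) {
        span <<= 1;
        order++;
    }
    return order;
}

static void toy_order_demo(void)
{
    assert(toy_get_order(4096, 4096) == 0);
    assert(toy_get_order(65536, 4096) == 4);      /* exactly 64 KiB */
    assert(toy_get_order(65536 + 1, 4096) == 5);  /* 64 KiB plus overhead */
}
```

Higher orders need larger physically contiguous runs of free pages, which
is why they become unreliable once memory is fragmented.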

Fixes: 783237e8da ("net-tcp: Fast Open client - sending SYN-data")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Tested-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-22 00:05:21 -05:00
Jan Kara
0dc83bd30b Revert "writeback: do not sync data dirtied after sync start"
This reverts commit c4a391b53a. Dave
Chinner <david@fromorbit.com> has reported the commit may cause some
inodes to be left out from sync(2). This is because we can call
redirty_tail() for some inode (which sets i_dirtied_when to current time)
after sync(2) has started or similarly requeue_inode() can set
i_dirtied_when to current time if writeback had to skip some pages. The
real problem is in the functions clobbering i_dirtied_when but fixing
that isn't trivial so revert is a safer choice for now.

CC: stable@vger.kernel.org # >= 3.13
Signed-off-by: Jan Kara <jack@suse.cz>
2014-02-22 02:02:28 +01:00
Peter Zijlstra
6d35ab4809 sched: Add 'flags' argument to sched_{set,get}attr() syscalls
Because of a recent syscall design debate, it's deemed appropriate for
each syscall to have a flags argument for future extension, without
immediately requiring new syscalls.

Cc: juri.lelli@gmail.com
Cc: Ingo Molnar <mingo@redhat.com>
Suggested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140214161929.GL27965@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21 21:27:10 +01:00
Linus Torvalds
d158fc7f36 PCI updates for v3.14:
MSI
     - Fix AHCI single-MSI fallback (Alexander Gordeev)
     - Fix populate_msi_sysfs() error paths (Greg Kroah-Hartman)
     - Fix htmldocs problem (Masanari Iida)
     - Add pci_enable_msi_exact() and pci_enable_msix_exact() (Alexander Gordeev)
     - Update documentation (Alexander Gordeev)
 
   Miscellaneous
     - mvebu: expose device ID & revision via lspci (Andrew Lunn)
     - Enable INTx if the BIOS left them disabled (Bjorn Helgaas)

Merge tag 'pci-v3.14-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI updates from Bjorn Helgaas:
 "The most interesting thing here is the change to enable INTx (by
  clearing PCI_COMMAND_INTX_DISABLE) if the BIOS left INTx disabled.
  Apparently the Baytrail BIOS does this, which means EHCI doesn't work.

  Also, fix an AHCI MSI regression and other issues with the recent MSI
  changes.  This also adds pci_enable_msi_exact() and
  pci_enable_msix_exact(), which aren't regression fixes, but will keep
  us from touching drivers twice (once to stop using the deprecated
  pci_enable_msi(), etc., and again to use the *_exact() variants).

  There's also a minor MVEBU fix.

  Summary:

  MSI:
    - Fix AHCI single-MSI fallback (Alexander Gordeev)
    - Fix populate_msi_sysfs() error paths (Greg Kroah-Hartman)
    - Fix htmldocs problem (Masanari Iida)
    - Add pci_enable_msi_exact() and pci_enable_msix_exact() (Alexander Gordeev)
    - Update documentation (Alexander Gordeev)

  Miscellaneous:
    - mvebu: expose device ID & revision via lspci (Andrew Lunn)
    - Enable INTx if the BIOS left them disabled (Bjorn Helgaas)"

* tag 'pci-v3.14-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
  ahci: Fix broken fallback to single MSI mode
  PCI: Enable INTx if BIOS left them disabled
  PCI/MSI: Add pci_enable_msi_exact() and pci_enable_msix_exact()
  PCI/MSI: Fix cut-and-paste errors in documentation
  PCI/MSI: Add pci_enable_msi() documentation back
  PCI/MSI: Fix pci_msix_vec_count() htmldocs failure
  PCI/MSI: Fix leak of msi_attrs
  PCI/MSI: Check kmalloc() return value, fix leak of name
  PCI: mvebu: Use Device ID and revision from underlying endpoint
2014-02-20 12:46:24 -08:00
Linus Torvalds
6a4d07f85b Merge branch 'for-3.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "Quite a few fixes this time.

  Three locking fixes, all marked for -stable.  A couple error path
  fixes and some misc fixes.  Hugh found a bug in memcg offlining
  sequence and we thought we could fix that from cgroup core side but
  that turned out to be insufficient and got reverted.  A different fix
  has been applied to -mm"

* 'for-3.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: update cgroup_enable_task_cg_lists() to grab siglock
  Revert "cgroup: use an ordered workqueue for cgroup destruction"
  cgroup: protect modifications to cgroup_idr with cgroup_mutex
  cgroup: fix locking in cgroup_cfts_commit()
  cgroup: fix error return from cgroup_create()
  cgroup: fix error return value in cgroup_mount()
  cgroup: use an ordered workqueue for cgroup destruction
  nfs: include xattr.h from fs/nfs/nfs3proc.c
  cpuset: update MAINTAINERS entry
  arm, pm, vmpressure: add missing slab.h includes
2014-02-20 12:01:09 -08:00
Linus Torvalds
2b73d207a5 Merge branch 'for-3.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
 "Two workqueue fixes.  One for an unlikely but possible critical bug
  during kworker shutdown and the other to make lockdep names a bit more
  descriptive"

* 'for-3.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: ensure @task is valid across kthread_stop()
  workqueue: add args to workqueue lockdep name
2014-02-20 12:00:27 -08:00
Nicolas Dichtel
cf71d2bc0b sit: fix panic with route cache in ip tunnels
Bug introduced by commit 7d442fab0a ("ipv4: Cache dst in tunnels").

Because sit code does not call ip_tunnel_init(), the dst_cache was not
initialized.

CC: Tom Herbert <therbert@google.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-20 13:13:50 -05:00
Peter De Schrijver
c7fbd41584 clk: tegra124: remove gr2d and gr3d clocks
Tegra124 does not have gr2d and gr3d clocks. They have been replaced by the
vic03 and gpu clocks respectively.

Signed-off-by: Peter De Schrijver <pdeschrijver@nvidia.com>
2014-02-20 19:10:58 +02:00
Steffen Klassert
ee5c23176f xfrm: Clone states properly on migration
We lose a lot of information of the original state if we
clone it with xfrm_state_clone(). In particular, there is
no crypto algorithm attached if the original state uses
an aead algorithm. This patch adds the missing information
to the cloned state.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2014-02-20 14:30:10 +01:00
Charles Keepax
1139110064 ASoC: dapm: Add locking to snd_soc_dapm_xxxx_pin functions
The snd_soc_dapm_xxxx_pin all require the dapm_mutex to be held when
they are called as they edit the dirty list, however very few of the
callers do so.

This patch adds unlocked versions of all the functions replacing the
existing implementations with one that holds the lock internally. We
also fix up the places where the lock was actually held on the caller
side.
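
The locked/unlocked split described above follows a common pattern, sketched
here in userspace C. All names are hypothetical: struct pin_ctx stands in for
the DAPM context and a pthread mutex plays the role of dapm_mutex; this is not
the ASoC implementation.

```c
#include <pthread.h>

/* Hypothetical stand-in for a DAPM context; not the ASoC structure. */
struct pin_ctx {
	pthread_mutex_t lock;	/* plays the role of dapm_mutex */
	int enabled;
	int dirty;		/* plays the role of the dirty list */
};

/* Unlocked variant: the caller must already hold ctx->lock. */
static int pin_enable_unlocked(struct pin_ctx *ctx)
{
	ctx->enabled = 1;
	ctx->dirty++;
	return 0;
}

/* Locked variant: takes the lock itself, then defers to the unlocked one. */
static int pin_enable(struct pin_ctx *ctx)
{
	int ret;

	pthread_mutex_lock(&ctx->lock);
	ret = pin_enable_unlocked(ctx);
	pthread_mutex_unlock(&ctx->lock);
	return ret;
}
```

Callers that already hold the lock use the _unlocked variant; everyone else
calls the plain name and gets the locking for free.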

Signed-off-by: Charles Keepax <ckeepax@opensource.wolfsonmicro.com>
Signed-off-by: Mark Brown <broonie@linaro.org>
Cc: stable@vger.kernel.org
2014-02-20 18:40:07 +09:00
Linus Torvalds
981adacd39 MFD fixes due for the v3.14 -rcs
Couple of small issues solved:
   - Suspend/Resume call-backs require CONFIG_PM_SLEEP
   - Some drivers written for 32bit architectures fail when compiled
     with a 64bit compiler. The fixes will future proof the drivers.

Merge tag 'mfd-fixes-3.14-1' of git://git.linaro.org/people/lee.jones/mfd

Pull MFD fixes from Lee Jones:
 "Couple of small issues solved:
   - Suspend/Resume call-backs require CONFIG_PM_SLEEP
   - Some drivers written for 32bit architectures fail when compiled
     with a 64bit compiler.  The fixes will future proof the drivers"

* tag 'mfd-fixes-3.14-1' of git://git.linaro.org/people/lee.jones/mfd:
  mfd: sec-core: sec_pmic_{suspend,resume}() should depend on CONFIG_PM_SLEEP
  mfd: max14577: max14577_{suspend,resume}() should depend on CONFIG_PM_SLEEP
  mfd: tps65217: Naturalise cross-architecture discrepancies
  mfd: wm8994-core: Naturalise cross-architecture discrepancies
  mfd: max8998: Naturalise cross-architecture discrepancies
  mfd: max8997: Naturalise cross-architecture discrepancies
2014-02-19 12:04:06 -08:00
David S. Miller
2e99c07fbe Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree,
they are:

* Fix nf_trace in nftables if XT_TRACE=n, from Florian Westphal.

* Don't use the fast payload operation in nf_tables if the length is
  not power of 2 or it is not aligned, from Nikolay Aleksandrov.

* Fix missing break statement in the inet flavour of nft_reject, which
  results in evaluating IPv4 packets with the IPv6 evaluation routine,
  from Patrick McHardy.

* Fix wrong kconfig symbol in nft_meta to match the routing realm,
  from Paul Bolle.

* Allocate the NAT null binding when creating new conntracks via
  ctnetlink to avoid several packets racing to initialize the
  conntrack NAT extension; original patch from Florian Westphal,
  revised version from me.

* Fix DNAT handling in the snmp NAT helper, the same handling was being
  done for SNAT and DNAT and 2.4 already contains that fix, from
  Francois-Xavier Le Bail.
====================
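
The condition in the second item above (length is a power of 2 and the access
is aligned) can be sketched as a small predicate. payload_fast_path_ok() is a
hypothetical name, and the check models only the two criteria named in the
list, not the full nf_tables logic.

```c
#include <stdbool.h>

/* Hypothetical predicate: true when a payload load of `len` bytes at
 * `offset` has a power-of-2 length and is aligned to that length.
 */
static bool payload_fast_path_ok(unsigned int offset, unsigned int len)
{
	bool pow2 = len != 0 && (len & (len - 1)) == 0;

	/* offset % len is only evaluated when len is nonzero. */
	return pow2 && (offset % len) == 0;
}
```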

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-19 13:12:53 -05:00
Lee Jones
5c6fbd56d1 mfd: tps65217: Naturalise cross-architecture discrepancies
If we compile the TPS65217 for a 64bit architecture we receive the following
warnings:

drivers/mfd/tps65217.c: In function ‘tps65217_probe’:
drivers/mfd/tps65217.c:173:13:
  warning: cast from pointer to integer of different size
   chip_id = (unsigned int)match->data;
             ^
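
One way to avoid this class of warning is to cast the pointer to a
pointer-sized integer first and narrow afterwards. The sketch below is
userspace C with hypothetical names (struct match_entry, match_chip_id()); it
uses uintptr_t, whereas kernel code commonly uses unsigned long for the same
purpose.

```c
#include <stdint.h>

/* Hypothetical stand-in for a match entry carrying an ID in a pointer. */
struct match_entry {
	const void *data;
};

/* Convert through uintptr_t (pointer-sized) before narrowing, so the
 * pointer-to-integer cast has matching size on 32-bit and 64-bit targets.
 */
static unsigned int match_chip_id(const struct match_entry *match)
{
	return (unsigned int)(uintptr_t)match->data;
}
```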

Signed-off-by: Lee Jones <lee.jones@linaro.org>
2014-02-19 13:30:30 +00:00
Lee Jones
8bace2d5b4 mfd: max8998: Naturalise cross-architecture discrepancies
If we compile the MAX8998 for a 64bit architecture we receive the following
warnings:

  drivers/mfd/max8998.c: In function ‘max8998_i2c_get_driver_data’:
  drivers/mfd/max8998.c:178:10:
    warning: cast from pointer to integer of different size
     return (int)match->data;
            ^

Signed-off-by: Lee Jones <lee.jones@linaro.org>
2014-02-19 13:30:25 +00:00
Lee Jones
05fb7a56ad mfd: max8997: Naturalise cross-architecture discrepancies
If we compile the MAX8997 for a 64bit architecture we receive the following
warnings:

  drivers/mfd/max8997.c: In function ‘max8997_i2c_get_driver_data’:
  drivers/mfd/max8997.c:173:10:
    warning: cast from pointer to integer of different size
     return (int)match->data;
            ^

Signed-off-by: Lee Jones <lee.jones@linaro.org>
2014-02-19 13:30:23 +00:00
Linus Torvalds
960dfc4eb2 Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
 "Lots of little small things, nothing too major: nouveau regression
  fixes, vmware fixes for the new hw support, memory leaks in error path
  fixes"

* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (31 commits)
  drm/radeon/ni: fix typo in dpm sq ramping setup
  drm/radeon/si: fix typo in dpm sq ramping setup
  drm/radeon: fix CP semaphores on CIK
  drm/radeon: delete a stray tab
  drm/radeon: fix display tiling setup on SI
  drm/radeon/dpm: reduce r7xx vblank mclk threshold to 200
  drm/radeon: fill in DRM_CAPs for cursor size
  drm: add DRM_CAPs for cursor size
  drm/radeon: unify bpc handling
  drm/ttm: Fix memory leak in ttm_agp_backend.c
  drm/ttm: declare 'struct device' in ttm_page_alloc.h
  drm/nouveau: fix TTM_PL_TT memtype on pre-nv50
  drm/nv50/disp: use correct register to determine DP display bpp
  drm/nouveau/fb: use correct ram oclass for nv1a hardware
  drm/nv50/gr: add missing nv_error parameter priv
  drm/nouveau: fix ENG_RUNLIST register address
  drm/nv4c/bios: disallow retrieving from prom on nv4x igp's
  drm/nv4c/vga: decode register is in a different place on nv4x igp's
  drm/nv4c/mc: nv4x igp's have a different msi rearm register
  drm/nouveau: set irq_enabled manually
  ...
2014-02-18 16:36:07 -08:00
Linus Torvalds
b0d3f6d47e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) kvaser CAN driver has fixed limits on some of its tables; validate
    that we won't exceed those limits at probe time.  Fix from Olivier
    Sobrie.

 2) Fix rtl8192ce disabling interrupts for too long, from Olivier
    Langlois.

 3) Fix botched shift in ath5k driver, from Dan Carpenter.

 4) Fix corruption of deferred packets in TIPC, from Erik Hugne.

 5) Fix newlink error path in macvlan driver, from Cong Wang.

 6) Fix netpoll deadlock in bonding, from Ding Tianhong.

 7) Handle GSO packets properly in forwarding path when fragmentation is
    necessary on egress, from Florian Westphal.

 8) Fix axienet build errors, from Michal Simek.

 9) Fix refcounting of ubufs on tx in vhost net driver, from Michael S
    Tsirkin.

10) Carrier status isn't set properly in hyperv driver, from Haiyang
    Zhang.

11) Missing pci_disable_device() in tulip_remove_one(), from Ingo Molnar.

12) AF_PACKET qdisc bypass mode doesn't adhere to driver provided TX
    queue selection method.  Add a fallback method mechanism to fix this
    bug, from Daniel Borkmann.

13) Fix regression in link local route handling on GRE tunnels, from
    Nicolas Dichtel.

14) Bonding can assign dup aggregator IDs in some sequences of
    configuration, fix by making the allocation counter per-bond instead
    of global.  From Jiri Bohac.

15) sctp_connectx() needs compat translations, from Daniel Borkmann.

16) Fix of_mdio PHY interrupt parsing, from Ben Dooks.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (62 commits)
  MAINTAINERS: add entry for the PHY library
  of_mdio: fix phy interrupt passing
  net: ethernet: update dependency and help text of mvneta
  NET: fec: only enable napi if we are successful
  af_packet: remove a stray tab in packet_set_ring()
  net: sctp: fix sctp_connectx abi for ia32 emulation/compat mode
  ipv4: fix counter in_slow_tot
  irtty-sir.c: Do not set_termios() on irtty_close()
  bonding: 802.3ad: make aggregator_identifier bond-private
  usbnet: remove generic hard_header_len check
  gre: add link local route when local addr is any
  batman-adv: fix potential kernel paging error for unicast transmissions
  batman-adv: avoid double free when orig_node initialization fails
  batman-adv: free skb on TVLV parsing success
  batman-adv: fix TT CRC computation by ensuring byte order
  batman-adv: fix potential orig_node reference leak
  batman-adv: avoid potential race condition when adding a new neighbour
  batman-adv: properly check pskb_may_pull return value
  batman-adv: release vlan object after checking the CRC
  batman-adv: fix TT-TVLV parsing on OGM reception
  ...
2014-02-18 15:52:43 -08:00
Dave Airlie
75936c65dd Merge tag 'ttm-fixes-3.14-2014-02-18' of git://people.freedesktop.org/~thomash/linux into drm-fixes
Pull request of 2014-02-18

One compile fix and one memory leak.

* tag 'ttm-fixes-3.14-2014-02-18' of git://people.freedesktop.org/~thomash/linux:
  drm/ttm: Fix memory leak in ttm_agp_backend.c
  drm/ttm: declare 'struct device' in ttm_page_alloc.h
2014-02-19 08:21:26 +10:00
Dave Airlie
9830e44f56 Merge tag 'vmwgfx-fixes-3.14-2014-02-18' of git://people.freedesktop.org/~thomash/linux into drm-fixes
Pull request of 2014-02-18.

Nothing special. The biggest change is adding a couple of command defines and
packing the command data correctly.

* tag 'vmwgfx-fixes-3.14-2014-02-18' of git://people.freedesktop.org/~thomash/linux:
  drm/vmwgfx: Fix command defines and checks
  drm/vmwgfx: Fix possible integer overflow
  drm/vmwgfx: Remove stray const
  drm/vmwgfx: unlock on error path in vmw_execbuf_process()
  drm/vmwgfx: Get maximum mob size from register SVGA_REG_MOB_MAX_SIZE
  drm/vmwgfx: Fix a couple of sparse warnings and errors
2014-02-19 08:21:02 +10:00
Alex Deucher
8716ed4e7b drm: add DRM_CAPs for cursor size
Some hardware may not support standard 64x64 cursors.  Add
a drm cap to query the cursor size from the kernel.  Some examples
include radeon CIK parts (128x128 cursors) and armada (32x64 or 64x32).
This allows things like device specific ddxes to remove asics specific
logic and also allows xf86-video-modesetting to work properly with hw
cursors on this hardware. Default to 64 if the driver doesn't specify
a size.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Rob Clark <robdclark@gmail.com>
2014-02-18 13:41:01 -05:00
Alexandre Courbot
728a0cdf06 drm/ttm: declare 'struct device' in ttm_page_alloc.h
Declare 'struct device' explicitly in ttm_page_alloc.h as this file
does not include any file declaring it. This removes the following
warning:

	warning: 'struct device' declared inside parameter list
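
The effect of such a forward declaration can be shown in a few lines of
standalone C. The function name page_alloc_init_sketch() is hypothetical and
only illustrates a prototype that takes a pointer to an otherwise undefined
struct.

```c
#include <stddef.h>

/* Forward declaration: makes 'struct device' known at file scope, so
 * every prototype below refers to the same type instead of declaring a
 * new one that is local to a single parameter list.
 */
struct device;

/* Hypothetical prototype that only needs a pointer to the type. */
int page_alloc_init_sketch(struct device *dev);

int page_alloc_init_sketch(struct device *dev)
{
	/* The struct stays opaque here; only the pointer value is used. */
	return dev != NULL ? 0 : -1;
}
```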

Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Reviewed-by: Thierry Reding <treding@nvidia.com>
2014-02-18 14:01:48 +01:00
Jan Kara
45a22f4c11 inotify: Fix reporting of cookies for inotify events
My rework of handling of notification events (namely commit 7053aee26a
"fsnotify: do not share events between notification groups") broke
sending of cookies with inotify events. We didn't propagate the value
passed to fsnotify() properly and passed 4 uninitialized bytes to
userspace instead (so it is also an information leak). Sadly I didn't
notice this during my testing because inotify cookies aren't used very
much and LTP inotify tests ignore them.

Fix the problem by passing the cookie value properly.

Fixes: 7053aee26a
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-02-18 11:17:17 +01:00
Linus Torvalds
87eeff7974 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fixes from Sage Weil:
 "We have some patches fixing up ACL support issues from Zheng and
  Guangliang and a mount option to enable/disable this support.  (These
  fixes were somewhat delayed by the Chinese holiday.)

  There is also a small fix for cached readdir handling when directories
  are fragmented"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: fix __dcache_readdir()
  ceph: add acl, noacl options for cephfs mount
  ceph: make ceph_forget_all_cached_acls() static inline
  ceph: add missing init_acl() for mkdir() and atomic_open()
  ceph: fix ceph_set_acl()
  ceph: fix ceph_removexattr()
  ceph: remove xattr when null value is given to setxattr()
  ceph: properly handle XATTR_CREATE and XATTR_REPLACE
2014-02-17 13:51:00 -08:00
Linus Torvalds
60f76eab19 Small dma-buf pull request for 3.14

Merge tag 'dma-buf-for-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/sumits/dma-buf

Pull dma-buf fix from Sumit Semwal:
 "Just some debugfs output updates.

  There's another patch related to dma-buf, but it'll get upstreamed via
  Greg KH's pull request"

* tag 'dma-buf-for-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/sumits/dma-buf:
  dma-buf: update debugfs output
2014-02-17 12:42:45 -08:00
Yan, Zheng
bcdfeb2eb4 ceph: remove xattr when null value is given to setxattr()
For the setxattr request, introduce a new flag CEPH_XATTR_REMOVE
to distinguish null value case from the zero-length value case.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-02-17 12:37:09 -08:00
Linus Torvalds
f2a77abdb8 Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc fixes from Ben Herrenschmidt:
 "Here are some more powerpc fixes for 3.14

  The main one is a nasty issue with the NUMA balancing support which
  requires a small generic change and the addition of a new accessor to
  set _PAGE_NUMA.  Both have been reviewed and acked by Mel and Rik.

  The changelog should have plenty of details but basically, without
  this fix, we get random user segfaults and/or corruptions due to
  missing TLB/hash flushes.  Aneesh's series of 3 patches fixes it.

  We have some vDSO vs.  perf fixes from Anton, some small EEH fixes
  from Gavin, a ppc32 regression vs the stack overflow detector, and a
  fix for the way we handle PCIe host bridge speed settings on pseries
  (which is needed for proper operations of AMD graphics cards on
  Power8)"

* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
  powerpc/eeh: Disable EEH on reboot
  powerpc/eeh: Cleanup on eeh_subsystem_enabled
  powerpc/powernv: Rework EEH reset
  powerpc: Use unstripped VDSO image for more accurate profiling data
  powerpc: Link VDSOs at 0x0
  mm: Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bit
  mm: Dirty accountable change only apply to non prot numa case
  powerpc/mm: Add new "set" flag argument to pte/pmd update function
  powerpc/pseries: Add Gen3 definitions for PCIE link speed
  powerpc/pseries: Fix regression on PCI link speed
  powerpc: Set the correct ksp_limit on ppc32 when switching to irq stack
2014-02-17 12:36:49 -08:00
Florian Westphal
478b360a47 netfilter: nf_tables: fix nf_trace always-on with XT_TRACE=n
When using nftables with CONFIG_NETFILTER_XT_TARGET_TRACE=n, we get
lots of "TRACE: filter:output:policy:1 IN=..." warnings as several
places will leave skb->nf_trace uninitialised.

Unlike iptables, tracing functionality is not conditional in nftables,
so always copy/zero the nf_trace setting when nftables is enabled.

Move this into __nf_copy() helper.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-02-17 11:20:12 +01:00
Daniel Borkmann
b9507bdaf4 netdevice: move netdev_cap_txqueue for shared usage to header
In order to allow users to invoke netdev_cap_txqueue, it needs to
be moved into netdevice.h header file. While at it, also add kernel
doc header to document the API.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-17 00:36:34 -05:00
Daniel Borkmann
99932d4fc0 netdevice: add queue selection fallback handler for ndo_select_queue
Add a new argument for ndo_select_queue() callback that passes a
fallback handler. This gets invoked through netdev_pick_tx();
fallback handler is currently __netdev_pick_tx() as most drivers
invoke this function within their customized implementation in
case for skbs that don't need any special handling. This fallback
handler can then be replaced on other call-sites with different
queue selection methods (e.g. in packet sockets, pktgen etc).

This also has the nice side-effect that __netdev_pick_tx() is
then only invoked from netdev_pick_tx() and export of that
function to modules can be undone.
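
The callback-plus-fallback arrangement described above can be sketched in
userspace C. Every type and function name below is hypothetical and merely
mirrors the idea: the driver callback receives a fallback handler and calls
it for packets that need no special handling.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the device and the packet. */
struct net_device_sketch { unsigned int real_num_tx_queues; };
struct packet_sketch { uint32_t hash; int needs_special_queue; };

/* Type of the fallback handler passed into the driver callback. */
typedef uint16_t (*select_queue_fallback_sketch_t)(
		const struct net_device_sketch *dev,
		const struct packet_sketch *pkt);

/* Generic picker, playing the role of the default queue selection. */
static uint16_t generic_pick_tx(const struct net_device_sketch *dev,
				const struct packet_sketch *pkt)
{
	return (uint16_t)(pkt->hash % dev->real_num_tx_queues);
}

/* Driver callback: handles its special case, otherwise uses the fallback. */
static uint16_t driver_select_queue(const struct net_device_sketch *dev,
				    const struct packet_sketch *pkt,
				    select_queue_fallback_sketch_t fallback)
{
	if (pkt->needs_special_queue)
		return 0;	/* hypothetical reserved queue */
	return fallback(dev, pkt);
}
```

Because the fallback is an argument, a different call site can pass a
different picker without the driver knowing about it.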

Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-17 00:36:34 -05:00
Matija Glavinic Pecotic
ef2820a735 net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
The implementation of (a)rwnd calculation can lead to severe performance issues
and to associations stalling completely. These problems are described below, and
a solution is proposed that improves lksctp's robustness in congestion state.

1) Sudden drop of a_rwnd and incomplete window recovery afterwards

sctp_assoc_rwnd_decrease accounts only for the payload size (sctp data); the
size of the sk_buff, which is charged against the receiver buffer, is not
accounted in rwnd. In theory this should not be a problem, as the actual size of
the buffer is double the amount requested on the socket (SO_RCVBUF). The problem
is that this scales badly for data smaller than sizeof(struct sk_buff).
E.g. in 4G (LTE) networks, the link interfacing the radio side carries a large
portion of traffic of this size (less than 100B).

An example of sudden drop and incomplete window recovery is given below. Node B
exhibits problematic behavior. Node A initiates association and B is configured
to advertise rwnd of 10000. A sends messages of size 43B (size of typical sctp
message in 4G (LTE) network). On B data is left in buffer by not reading socket
in userspace.

Let's examine when we hit the pressure state and declare rwnd to be 0 for the
scenario with the parameters stated above (rwnd == 10000, chunk size == 43, each
chunk sent in a separate sctp packet).

Logic is implemented in sctp_assoc_rwnd_decrease:

socket_buffer (see below) is the maximum size which can be held in the socket
buffer (sk_rcvbuf). currently_alloced is the amount of data currently allocated
(rx_count).

The following simple expression gives the number of packets after which, for
the parameters stated above, we enter the pressure state:

We start by condition which has to be met in order to enter pressure state:

	socket_buffer < currently_alloced;

currently_alloced is represented as size of sctp packets received so far and not
yet delivered to userspace. x is the number of chunks/packets (since there is no
bundling, and each chunk is delivered in separate packet, we can observe each
chunk also as sctp packet, and what is important here, having its own sk_buff):

	socket_buffer < x*each_sctp_packet;

each_sctp_packet is sctp chunk size + sizeof(struct sk_buff). socket_buffer is
twice the amount of initially requested size of socket buffer, which is in case
of sctp, twice the a_rwnd requested:

	2*rwnd < x*(payload+sizeof(struct sk_buff));

sizeof(struct sk_buff) is 190 (3.13.0-rc4+). Above is stated that rwnd is 10000
and each payload size is 43

	20000 < x(43+190);

	x > 20000/233;

	x ~> 84;
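
The bound above can also be evaluated mechanically. A minimal sketch, assuming
the simplified per-packet cost used in this message (payload plus a fixed
overhead); packets_until_pressure() is a hypothetical helper, not kernel code.

```c
/* Smallest packet count x with x * (payload + overhead) > 2 * rwnd. */
static unsigned int packets_until_pressure(unsigned int rwnd,
					   unsigned int payload,
					   unsigned int overhead)
{
	unsigned int per_packet = payload + overhead;

	/* Integer division rounds down; one more packet exceeds the bound. */
	return (2 * rwnd) / per_packet + 1;
}
```

With rwnd = 10000, payload = 43 and overhead = 190 this returns 86, in the
same range as the ~84 messages reported here; the real per-packet charge is
not exactly payload + 190, so the observed count can differ slightly.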

After ~84 messages, the pressure state is entered and a 0 rwnd is advertised
after only 84*43B = 3612B of sctp data has been received. This is why an external
observer notices a sudden drop from 6474 to 0, as the following example shows:

IP A.34340 > B.12345: sctp (1) [INIT] [init tag: 1875509148] [rwnd: 81920] [OS: 10] [MIS: 65535] [init TSN: 1096057017]
IP B.12345 > A.34340: sctp (1) [INIT ACK] [init tag: 3198966556] [rwnd: 10000] [OS: 10] [MIS: 10] [init TSN: 902132839]
IP A.34340 > B.12345: sctp (1) [COOKIE ECHO]
IP B.12345 > A.34340: sctp (1) [COOKIE ACK]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057017] [SID: 0] [SSEQ 0] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057017] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057018] [SID: 0] [SSEQ 1] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057018] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057019] [SID: 0] [SSEQ 2] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057019] [a_rwnd 9914] [#gap acks 0] [#dup tsns 0]
<...>
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057098] [SID: 0] [SSEQ 81] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057098] [a_rwnd 6517] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057099] [SID: 0] [SSEQ 82] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057099] [a_rwnd 6474] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057100] [SID: 0] [SSEQ 83] [PPID 0x18]

--> Sudden drop

IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

At this point, rwnd_press stores the current rwnd value so it can be restored
later in sctp_assoc_rwnd_increase. This however doesn't happen, as the condition
to start slowly increasing rwnd until rwnd_press is returned to rwnd is never
met. The condition is not met since rwnd, after it hit 0, must first reach
rwnd_press by adding the amount which is read from userspace. Let us observe the
values in the above example. The initial a_rwnd is 10000, pressure was hit when
rwnd was ~6500, and the amount of actual sctp data currently waiting to be
delivered to userspace is ~3500. When userspace starts to read,
sctp_assoc_rwnd_increase will be credited only with the sctp data, which is
~3500. The condition is never met, and when userspace has read all data, rwnd
stays at 3569.
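
The stuck recovery can be reproduced with a toy model of the bookkeeping just
described. This is a simplified userspace sketch, not the kernel functions:
struct rwnd_model and both helpers are hypothetical, and `step` is an assumed
recovery increment.

```c
/* Toy model of the window bookkeeping described above; not kernel code. */
struct rwnd_model {
	unsigned int rwnd;		/* currently advertised window */
	unsigned int rwnd_press;	/* window withheld under pressure */
};

/* Account `len` received bytes; `over` is nonzero when the buffer is full. */
static void model_rwnd_decrease(struct rwnd_model *m, unsigned int len,
				int over)
{
	m->rwnd = m->rwnd >= len ? m->rwnd - len : 0;
	if (over) {
		m->rwnd_press += m->rwnd;	/* remember what was withheld */
		m->rwnd = 0;			/* advertise a zero window */
	}
}

/* Account `len` bytes read by userspace; recovery by `step` starts only
 * once rwnd has reached rwnd_press.
 */
static void model_rwnd_increase(struct rwnd_model *m, unsigned int len,
				unsigned int step)
{
	m->rwnd += len;
	if (m->rwnd_press && m->rwnd >= m->rwnd_press) {
		unsigned int change = step < m->rwnd_press ? step : m->rwnd_press;

		m->rwnd += change;
		m->rwnd_press -= change;
	}
}
```

Feeding the model 84 packets of 43 bytes, with pressure on the last one,
leaves rwnd_press at 6388. Reading all 3612 bytes back raises rwnd only to
3612, which never reaches rwnd_press, so the recovery branch is not taken.
The model's figures differ slightly from those in the traces, but the
mechanism is the same.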

IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 1505] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 3010] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057101] [SID: 0] [SSEQ 84] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057101] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]

--> At this point userspace read everything, rwnd recovered only to 3569

IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057102] [SID: 0] [SSEQ 85] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057102] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]

Reproduction is straightforward: it is enough for the sender to send packets of
size less than sizeof(struct sk_buff) and for the receiver to keep them in its
buffers.

2) Minute size window for associations sharing the same socket buffer

In case multiple associations share the same socket, and same socket buffer
(sctp.rcvbuf_policy == 0), different scenarios exist in which congestion on one
of the associations can permanently drop rwnd of other association(s).

The situation is typically observed as one association suddenly having its rwnd
dropped to the size of the last packet received and never recovering beyond that
point. Different scenarios lead to it, but all have in common that one of the
associations (let it be the association from 1)) has nearly depleted the socket
buffer, and the other association charges the socket buffer with just enough to
trigger the pressure. This association will enter the pressure state, set
rwnd_press and announce a 0 rwnd.
When data is read by userspace, a situation similar to 1) occurs: rwnd
increases only by the size read by userspace, but rwnd_press is high enough
that the association doesn't have enough credit to reach rwnd_press and restore
its previous state. This is a special case of 1), and a worse one, as in the
worst case there is only one packet in the buffer by whose size rwnd will be
increased. The consequence is an association which has a very low maximum rwnd
('minute size', in our case down to 43B, the size of the packet which caused
the pressure) and as such is unusable.

This scenario occurred frequently in the field and in labs after congestion
(link breaks, varying packet drop probabilities, packet reordering) and with
scenario 1) preceding it. A deterministic scenario for reproduction is given
here:

From node A establish two associations on the same socket, with rcvbuf_policy
set to share one common buffer (sctp.rcvbuf_policy == 0). On association 1
repeat the scenario from 1), that is, bring rwnd down to 0 and restore it.
Observe scenario 1). Use a small payload size (here we use 43). Once rwnd is
'recovered', bring it down close to 0, so that just one more packet would close
it. As a consequence, association 2 is able to receive (at least) one more
packet, which brings it into the pressure state. E.g. if association 2 had an
rwnd of 10000, the packet received was 43, and we enter pressure at this point,
rwnd_press will be 9957. Once the payload is delivered to userspace, rwnd
increases by 43, but the conditions to restore rwnd to its original state, just
as in 1), will never be satisfied.
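
The arithmetic of this example can be followed with a small standalone model.
It is a simplified sketch of the bookkeeping as described here, not the kernel
implementation: struct toy_assoc, both functions and the 'step' parameter (the
amount restored per call once the threshold is reached) are hypothetical names
introduced only for illustration.

```c
#include <assert.h>

/* Toy stand-in for the two fields that matter in this example. */
struct toy_assoc {
	unsigned int rwnd;       /* window advertised to the peer */
	unsigned int rwnd_press; /* window withheld while under pressure */
};

/* Account for 'len' received bytes. 'buffer_full' is nonzero when the
 * shared socket buffer is exhausted at this point. */
static void toy_rwnd_decrease(struct toy_assoc *a, unsigned int len,
			      int buffer_full)
{
	assert(a->rwnd >= len); /* the toy model covers only this case */
	a->rwnd -= len;
	if (buffer_full) {
		/* Enter pressure: withhold the remaining window. */
		a->rwnd_press += a->rwnd;
		a->rwnd = 0;
	}
}

/* Account for 'len' bytes read by userspace. The withheld window is
 * given back only once rwnd has reached rwnd_press. */
static void toy_rwnd_increase(struct toy_assoc *a, unsigned int len,
			      unsigned int step)
{
	a->rwnd += len;
	if (a->rwnd_press && a->rwnd >= a->rwnd_press) {
		unsigned int change = step < a->rwnd_press ?
				      step : a->rwnd_press;

		a->rwnd += change;
		a->rwnd_press -= change;
	}
}
```

Starting from rwnd = 10000, one 43-byte message received under pressure leaves
rwnd = 0 and rwnd_press = 9957. Reading that message back gives rwnd = 43,
which is far below 9957, so in this model the restore branch is never taken.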

--> Association 1, between A.y and B.12345

IP A.55915 > B.12345: sctp (1) [INIT] [init tag: 836880897] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 4032536569]
IP B.12345 > A.55915: sctp (1) [INIT ACK] [init tag: 2873310749] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3799315613]
IP A.55915 > B.12345: sctp (1) [COOKIE ECHO]
IP B.12345 > A.55915: sctp (1) [COOKIE ACK]

--> Association 2, between A.z and B.12346

IP A.55915 > B.12346: sctp (1) [INIT] [init tag: 534798321] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 2099285173]
IP B.12346 > A.55915: sctp (1) [INIT ACK] [init tag: 516668823] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3676403240]
IP A.55915 > B.12346: sctp (1) [COOKIE ECHO]
IP B.12346 > A.55915: sctp (1) [COOKIE ACK]

--> Deplete socket buffer by sending messages of size 43B over association 1

IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315613] [SID: 0] [SSEQ 0] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315613] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]

<...>

IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315696] [a_rwnd 6388] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315697] [SID: 0] [SSEQ 84] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315697] [a_rwnd 6345] [#gap acks 0] [#dup tsns 0]

--> Sudden drop on 1

IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315698] [SID: 0] [SSEQ 85] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315698] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

--> Here userspace read the data and rwnd 'recovered' to 3698; now deplete
    again using association 1 so there is room in the buffer for only one more
    packet

IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315799] [SID: 0] [SSEQ 186] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315799] [a_rwnd 86] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315800] [SID: 0] [SSEQ 187] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]

--> Socket buffer is almost depleted, but there is space for one more packet,
    send it over association 2, size 43B

IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403240] [SID: 0] [SSEQ 0] [PPID 0x18]
IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403240] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

--> Immediate drop

IP A.60995 > B.12346: sctp (1) [SACK] [cum ack 387491510] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

--> Read everything from the socket, both associations recover up to the
    maximum rwnd they are capable of reaching; note that association 1
    recovered up to 3698, and association 2 recovered only to 43

IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 1548] [#gap acks 0] [#dup tsns 0]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 3053] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315801] [SID: 0] [SSEQ 188] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315801] [a_rwnd 3698] [#gap acks 0] [#dup tsns 0]
IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403241] [SID: 0] [SSEQ 1] [PPID 0x18]
IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403241] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]

A careful reader might wonder why it is necessary to reproduce 1) prior to
reproducing 2). It simply makes it easier to see when to send the packet over
association 2 that will push that association into the pressure state.

Proposed solution:

Both problems share the same root cause: improper scaling of the socket buffer
with rwnd. A solution in which sizeof(sk_buff) is taken into account when
calculating rwnd is not possible, because there is no linear relationship
between the amount of data charged on increase/decrease and the IP packet in
which the payload arrived. Even if such a solution were followed, the
complexity of the code would increase. Given the nature of the current rwnd
handling, the slow increase of rwnd (in sctp_assoc_rwnd_increase) after the
pressure state is entered is rational, but it gives the sender a false
representation of the current buffer space. Furthermore, it implements an
additional congestion control mechanism which is defined by the implementation
and not by the standard.

The proposed solution simplifies the whole algorithm, keeping in mind the
definition from the RFC:

o  Receiver Window (rwnd): This gives the sender an indication of the space
   available in the receiver's inbound buffer.

The core of the proposed solution is given by these lines:

sctp_assoc_rwnd_update:
	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
	else
		asoc->rwnd = 0;

We advertise to the sender (half of) the actual space we have. 'Half' is in
parentheses because it depends on whether you regard the size of the socket
buffer as the SO_RCVBUF value or twice that amount, i.e. whether the size is
the one visible from userspace or the one visible from kernelspace.
In this way the sender is given a good approximation of our buffer space,
regardless of the buffer policy: we always advertise what we have. The proposed
solution fixes the described problems and removes the need for the rwnd
restoration algorithm. Finally, as the proposed solution is a simplification,
some lines of code, along with some bytes in struct sctp_association, are
saved.
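
For experimenting with the numbers, the calculation can be lifted into a
standalone function. This is only a sketch of the arithmetic in the lines
quoted from sctp_assoc_rwnd_update: toy_rwnd_update is a hypothetical name, and
rcvbuf and rx_count are plain integers standing in for sk_rcvbuf and the
receive memory counter.

```c
/* Standalone sketch of the proposed calculation: advertise half of the
 * space still free in the receive buffer, or 0 if none is left. */
static unsigned int toy_rwnd_update(int rcvbuf, int rx_count)
{
	if (rcvbuf - rx_count > 0)
		return (unsigned int)(rcvbuf - rx_count) >> 1;
	return 0;
}
```

With an empty buffer of 20000 bytes this advertises 10000, and once rx_count
reaches or exceeds the buffer size it advertises 0. No rwnd_press state is
needed, because every call recomputes rwnd from the current buffer usage.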

Version 2 of the patch addressed comments from Vlad. The name of the function
was made more descriptive, and two parts of the code were changed: one removes
the superfluous call to sctp_assoc_rwnd_update, since that call would not
result in an update of rwnd, and the other reorders the code so that the call
to sctp_assoc_rwnd_update does update rwnd. Version 3 corrected the change
introduced in v2 so that the existing function is not reordered/copied inline,
but is called correctly. Thanks to Vlad for the suggestion.

Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nsn.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-17 00:16:56 -05:00
Aneesh Kumar K.V
56eecdb912 mm: Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bit
Archs like ppc64 don't do a tlb flush in the set_pte/pmd functions when using
a hash table MMU, for various reasons (the flush is handled as part of
the PTE modification when necessary).

ppc64 thus doesn't implement flush_tlb_range for hash based MMUs.

Additionally ppc64 requires the tlb flushing to be batched within ptl locks.

The reason to do that is to ensure that the hash page table is in sync with
the linux page table.

We track the hpte index in the linux pte, and if we clear them without flushing
the hash and drop the ptl lock, another cpu can update the pte and we can
end up with a duplicate entry in the hash table, which is fatal.

We also want to keep set_pte_at simpler by not requiring it to do a hash
flush, for performance reasons. We do that by assuming that set_pte_at() is
never *ever* called on a PTE that is already valid.

This was the case until the NUMA code went in, which broke that assumption.

Fix that by introducing a new pair of helpers to set _PAGE_NUMA in a
way similar to ptep/pmdp_set_wrprotect(), with a generic implementation
using set_pte_at() and a powerpc specific one using the appropriate
mechanism needed to keep the hash table in sync.

Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-02-17 11:19:36 +11:00