Commit Graph

37373 Commits

Author SHA1 Message Date
Herbert Xu
4c0ee414e8 Revert "net: Reset secmark when scrubbing packet"
This patch reverts commit b8fb4e0648
because the secmark must be preserved even when a packet crosses
namespace boundaries.  The reason is that security labels apply to
the system as a whole and is not per-namespace.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-16 14:19:02 -04:00
Alexei Starovoitov
a166151cbe bpf: fix bpf helpers to use skb->mac_header relative offsets
For the short-term solution, lets fix bpf helper functions to use
skb->mac_header relative offsets instead of skb->data in order to
get the same eBPF programs with cls_bpf and act_bpf work on ingress
and egress qdisc path. We need to ensure that mac_header is set
before calling into programs. This is effectively the first option
from below referenced discussion.

More long term solution for LD_ABS|LD_IND instructions will be more
intrusive but also more beneficial than this, and implemented later
as it's too risky at this point in time.

I.e., we plan to look into the option of moving skb_pull() out of
eth_type_trans() and into netif_receive_skb() as has been suggested
as second option. Meanwhile, this solution ensures ingress can be
used with eBPF, too, and that we won't run into ABI troubles later.
For dealing with negative offsets inside eBPF helper functions,
we've implemented bpf_skb_clone_unwritable() to test for unwriteable
headers.

Reference: http://thread.gmane.org/gmane.linux.network/359129/focus=359694
Fixes: 608cd71a9c ("tc: bpf: generalize pedit action")
Fixes: 91bc4822c3 ("tc: bpf: add checksum helpers")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-16 14:08:49 -04:00
Wei Yongjun
5a950ad58d netns: remove duplicated include from net_namespace.c
Remove duplicated include.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-16 12:14:24 -04:00
WANG Cong
540207ae69 fou: avoid missing unlock in failure path
Fixes: 7a6c8c34e5 ("fou: implement FOU_CMD_GET")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-16 12:11:19 -04:00
Linus Torvalds
eea3a00264 Merge branch 'akpm' (patches from Andrew)
Merge second patchbomb from Andrew Morton:

 - the rest of MM

 - various misc bits

 - add ability to run /sbin/reboot at reboot time

 - printk/vsprintf changes

 - fiddle with seq_printf() return value

* akpm: (114 commits)
  parisc: remove use of seq_printf return value
  lru_cache: remove use of seq_printf return value
  tracing: remove use of seq_printf return value
  cgroup: remove use of seq_printf return value
  proc: remove use of seq_printf return value
  s390: remove use of seq_printf return value
  cris fasttimer: remove use of seq_printf return value
  cris: remove use of seq_printf return value
  openrisc: remove use of seq_printf return value
  ARM: plat-pxa: remove use of seq_printf return value
  nios2: cpuinfo: remove use of seq_printf return value
  microblaze: mb: remove use of seq_printf return value
  ipc: remove use of seq_printf return value
  rtc: remove use of seq_printf return value
  power: wakeup: remove use of seq_printf return value
  x86: mtrr: if: remove use of seq_printf return value
  linux/bitmap.h: improve BITMAP_{LAST,FIRST}_WORD_MASK
  MAINTAINERS: CREDITS: remove Stefano Brivio from B43
  .mailmap: add Ricardo Ribalda
  CREDITS: add Ricardo Ribalda Delgado
  ...
2015-04-15 16:39:15 -07:00
Rasmus Villemoes
41416f2330 lib/string_helpers.c: change semantics of string_escape_mem
The current semantics of string_escape_mem are inadequate for one of its
current users, vsnprintf().  If that is to honour its contract, it must
know how much space would be needed for the entire escaped buffer, and
string_escape_mem provides no way of obtaining that (short of allocating a
large enough buffer (~4 times input string) to let it play with, and
that's definitely a big no-no inside vsnprintf).

So change the semantics for string_escape_mem to be more snprintf-like:
Return the size of the output that would be generated if the destination
buffer was big enough, but of course still only write to the part of dst
it is allowed to, and (contrary to snprintf) don't do '\0'-termination.
It is then up to the caller to detect whether output was truncated and to
append a '\0' if desired.  Also, we must output partial escape sequences,
otherwise a call such as snprintf(buf, 3, "%1pE", "\123") would cause
printf to write a \0 to buf[2] but leaving buf[0] and buf[1] with whatever
they previously contained.

This also fixes a bug in the escaped_string() helper function, which used
to unconditionally pass a length of "end-buf" to string_escape_mem();
since the latter doesn't check osz for being insanely large, it would
happily write to dst.  For example, kasprintf(GFP_KERNEL, "something and
then %pE", ...); is an easy way to trigger an oops.

In test-string_helpers.c, the -ENOMEM test is replaced with testing for
getting the expected return value even if the buffer is too small.  We
also ensure that nothing is written (by relying on a NULL pointer deref)
if the output size is 0 by passing NULL - this has to work for
kasprintf("%pE") to work.

In net/sunrpc/cache.c, I think qword_add still has the same semantics.
Someone should definitely double-check this.

In fs/proc/array.c, I made the minimum possible change, but longer-term it
should stop poking around in seq_file internals.

[andriy.shevchenko@linux.intel.com: simplify qword_add]
[andriy.shevchenko@linux.intel.com: add missed curly braces]
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 16:35:24 -07:00
Iulia Manda
2813893f8b kernel: conditionally support non-root users, groups and capabilities
There are a lot of embedded systems that run most or all of their
functionality in init, running as root:root.  For these systems,
supporting multiple users is not necessary.

This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
non-root users, non-root groups, and capabilities optional.  It is enabled
under CONFIG_EXPERT menu.

When this symbol is not defined, UID and GID are zero in any possible case
and processes always have all capabilities.

The following syscalls are compiled out: setuid, setregid, setgid,
setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
getgroups, setfsuid, setfsgid, capget, capset.

Also, groups.c is compiled out completely.

In kernel/capability.c, capable function was moved in order to avoid
adding two ifdef blocks.

This change saves about 25 KB on a defconfig build.  The most minimal
kernels have total text sizes in the high hundreds of kB rather than
low MB.  (The 25k goes down a bit with allnoconfig, but not that much.

The kernel was booted in Qemu.  All the common functionalities work.
Adding users/groups is not possible, failing with -ENOSYS.

Bloat-o-meter output:
add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 16:35:22 -07:00
Linus Torvalds
fa927894bb Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull second vfs update from Al Viro:
 "Now that net-next went in...  Here's the next big chunk - killing
  ->aio_read() and ->aio_write().

  There'll be one more pile today (direct_IO changes and
  generic_write_checks() cleanups/fixes), but I'd prefer to keep that
  one separate"

* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
  ->aio_read and ->aio_write removed
  pcm: another weird API abuse
  infinibad: weird APIs switched to ->write_iter()
  kill do_sync_read/do_sync_write
  fuse: use iov_iter_get_pages() for non-splice path
  fuse: switch to ->read_iter/->write_iter
  switch drivers/char/mem.c to ->read_iter/->write_iter
  make new_sync_{read,write}() static
  coredump: accept any write method
  switch /dev/loop to vfs_iter_write()
  serial2002: switch to __vfs_read/__vfs_write
  ashmem: use __vfs_read()
  export __vfs_read()
  autofs: switch to __vfs_write()
  new helper: __vfs_write()
  switch hugetlbfs to ->read_iter()
  coda: switch to ->read_iter/->write_iter
  ncpfs: switch to ->read_iter/->write_iter
  net/9p: remove (now-)unused helpers
  p9_client_attach(): set fid->uid correctly
  ...
2015-04-15 13:22:56 -07:00
David Howells
c5ef603528 VFS: net/: d_inode() annotations
socket inodes and sunrpc filesystems - inodes owned by that code

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15 15:06:56 -04:00
David Howells
a25b376bde VFS: net/unix: d_backing_inode() annotations
places where we are dealing with S_ISSOCK file creation/lookups.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15 15:06:56 -04:00
David Howells
ee8ac4d61c VFS: AF_UNIX sockets should call mknod on the top layer only
AF_UNIX sockets should call mknod on the top layer only and should not attempt
to modify the lower layer in a layered filesystem such as overlayfs.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15 15:06:54 -04:00
Linus Torvalds
6c373ca893 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Add BQL support to via-rhine, from Tino Reichardt.

 2) Integrate SWITCHDEV layer support into the DSA layer, so DSA drivers
    can support hw switch offloading.  From Floria Fainelli.

 3) Allow 'ip address' commands to initiate multicast group join/leave,
    from Madhu Challa.

 4) Many ipv4 FIB lookup optimizations from Alexander Duyck.

 5) Support EBPF in cls_bpf classifier and act_bpf action, from Daniel
    Borkmann.

 6) Remove the ugly compat support in ARP for ugly layers like ax25,
    rose, etc.  And use this to clean up the neigh layer, then use it to
    implement MPLS support.  All from Eric Biederman.

 7) Support L3 forwarding offloading in switches, from Scott Feldman.

 8) Collapse the LOCAL and MAIN ipv4 FIB tables when possible, to speed
    up route lookups even further.  From Alexander Duyck.

 9) Many improvements and bug fixes to the rhashtable implementation,
    from Herbert Xu and Thomas Graf.  In particular, in the case where
    an rhashtable user bulk adds a large number of items into an empty
    table, we expand the table much more sanely.

10) Don't make the tcp_metrics hash table per-namespace, from Eric
    Biederman.

11) Extend EBPF to access SKB fields, from Alexei Starovoitov.

12) Split out new connection request sockets so that they can be
    established in the main hash table.  Much less false sharing since
    hash lookups go direct to the request sockets instead of having to
    go first to the listener then to the request socks hashed
    underneath.  From Eric Dumazet.

13) Add async I/O support for crytpo AF_ALG sockets, from Tadeusz Struk.

14) Support stable privacy address generation for RFC7217 in IPV6.  From
    Hannes Frederic Sowa.

15) Hash network namespace into IP frag IDs, also from Hannes Frederic
    Sowa.

16) Convert PTP get/set methods to use 64-bit time, from Richard
    Cochran.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1816 commits)
  fm10k: Bump driver version to 0.15.2
  fm10k: corrected VF multicast update
  fm10k: mbx_update_max_size does not drop all oversized messages
  fm10k: reset head instead of calling update_max_size
  fm10k: renamed mbx_tx_dropped to mbx_tx_oversized
  fm10k: update xcast mode before synchronizing multicast addresses
  fm10k: start service timer on probe
  fm10k: fix function header comment
  fm10k: comment next_vf_mbx flow
  fm10k: don't handle mailbox events in iov_event path and always process mailbox
  fm10k: use separate workqueue for fm10k driver
  fm10k: Set PF queues to unlimited bandwidth during virtualization
  fm10k: expose tx_timeout_count as an ethtool stat
  fm10k: only increment tx_timeout_count in Tx hang path
  fm10k: remove extraneous "Reset interface" message
  fm10k: separate PF only stats so that VF does not display them
  fm10k: use hw->mac.max_queues for stats
  fm10k: only show actual queues, not the maximum in hardware
  fm10k: allow creation of VLAN on default vid
  fm10k: fix unused warnings
  ...
2015-04-15 09:00:47 -07:00
Linus Torvalds
1dcf58d6e6 Merge branch 'akpm' (patches from Andrew)
Merge first patchbomb from Andrew Morton:

 - arch/sh updates

 - ocfs2 updates

 - kernel/watchdog feature

 - about half of mm/

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (122 commits)
  Documentation: update arch list in the 'memtest' entry
  Kconfig: memtest: update number of test patterns up to 17
  arm: add support for memtest
  arm64: add support for memtest
  memtest: use phys_addr_t for physical addresses
  mm: move memtest under mm
  mm, hugetlb: abort __get_user_pages if current has been oom killed
  mm, mempool: do not allow atomic resizing
  memcg: print cgroup information when system panics due to panic_on_oom
  mm: numa: remove migrate_ratelimited
  mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE
  mm: split ET_DYN ASLR from mmap ASLR
  s390: redefine randomize_et_dyn for ELF_ET_DYN_BASE
  mm: expose arch_mmap_rnd when available
  s390: standardize mmap_rnd() usage
  powerpc: standardize mmap_rnd() usage
  mips: extract logic for mmap_rnd()
  arm64: standardize mmap_rnd() usage
  x86: standardize mmap_rnd() usage
  arm: factor out mmap ASLR into mmap_rnd
  ...
2015-04-14 16:49:17 -07:00
David Rientjes
4167e9b2cf mm: remove GFP_THISNODE
NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE.

GFP_THISNODE is a secret combination of gfp bits that have different
behavior than expected.  It is a combination of __GFP_THISNODE,
__GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page
allocator slowpath to fail without trying reclaim even though it may be
used in combination with __GFP_WAIT.

An example of the problem this creates: commit e97ca8e5b8 ("mm: fix
GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE
that really just wanted __GFP_THISNODE.  The problem doesn't end there,
however, because even it was a no-op for alloc_misplaced_dst_page(),
which also sets __GFP_NORETRY and __GFP_NOWARN, and
migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT
is set in GFP_TRANSHUGE.  Converting GFP_THISNODE to __GFP_THISNODE is a
no-op in these cases since the page allocator special-cases
__GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN.

It's time to just remove GFP_THISNODE entirely.  We leave __GFP_THISNODE
to restrict an allocation to a local node, but remove GFP_THISNODE and
its obscurity.  Instead, we require that a caller clear __GFP_WAIT if it
wants to avoid reclaim.

This allows the aforementioned functions to actually reclaim as they
should.  It also enables any future callers that want to do
__GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim.  The
rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT.

Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so
it is unchanged.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pravin Shelar <pshelar@nicira.com>
Cc: Jarno Rajahalme <jrajahalme@nicira.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:03 -07:00
David S. Miller
bae97d8410 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

A final pull request, I know it's very late but this time I think it's worth a
bit of rush.

The following patchset contains Netfilter/nf_tables updates for net-next, more
specifically concatenation support and dynamic stateful expression
instantiation.

This also comes with a couple of small patches. One to fix the ebtables.h
userspace header and another to get rid of an obsolete example file in tree
that describes a nf_tables expression.

This time, I decided to paste the original descriptions. This will result in a
rather large commit description, but I think these bytes to keep.

Patrick McHardy says:

====================
netfilter: nf_tables: concatenation support

The following patches add support for concatenations, which allow multi
dimensional exact matches in O(1).

The basic idea is to split the data registers, currently consisting of
4 registers of 16 bytes each, into smaller units, 16 registers of 4
bytes each, and making sure each register store always leaves the
full 32 bit in a well defined state, meaning smaller stores will
zero the remaining bits.

Based on that, we can load multiple adjacent registers with different
values, thereby building a concatenated bigger value, and use that
value for set lookups.

Sets are changed to use variable sized extensions for their key and
data values, removing the fixed limit of 16 bytes while saving memory
if less space is needed.

As a side effect, these patches will allow some nice optimizations in
the future, like using jhash2 in nft_hash, removing the masking in
nft_cmp_fast, optimized data comparison using 32 bit word size etc.
These are not done so far however.

The patches are split up as follows:

 * the first five patches add length validation to register loads and
   stores to make sure we stay within bounds and prepare the validation
   functions for the new addressing mode

 * the next patches prepare for changing to 32 bit addressing by
   introducing a struct nft_regs, which holds the verdict register as
   well as the data registers. The verdict members are moved to a new
   struct nft_verdict to allow to pull struct nft_data out of the stack.

 * the next patches contain preparatory conversions of expressions and
   sets to use 32 bit addressing

 * the next patch introduces so far unused register conversion helpers
   for parsing and dumping register numbers over netlink

 * following is the real conversion to 32 bit addressing, consisting of
   replacing struct nft_data in struct nft_regs by an array of u32s and
   actually translating and validating the new register numbers.

 * the final two patches add support for variable sized data items and
   variable sized keys / data in set elements

The patches have been verified to work correctly with nft binaries using
both old and new addressing.
====================

Patrick McHardy says:

====================
netfilter: nf_tables: dynamic stateful expression instantiation

The following patches are the grand finale of my nf_tables set work,
using all the building blocks put in place by the previous patches
to support something like iptables hashlimit, but a lot more powerful.

Sets are extended to allow attaching expressions to set elements.
The dynset expression dynamically instantiates these expressions
based on a template when creating new set elements and evaluates
them for all new or updated set members.

In combination with concatenations this effectively creates state
tables for arbitrary combinations of keys, using the existing
expression types to maintain that state. Regular set GC takes care
of purging expired states.

We currently support two different stateful expressions, counter
and limit. Using limit as a template we can express the functionality
of hashlimit, but completely unrestricted in the combination of keys.
Using counter we can perform accounting for arbitrary flows.

The following examples from patch 5/5 show some possibilities.
Userspace syntax is still WIP, especially the listing of state
tables will most likely be seperated from normal set listings
and use a more structured format:

1. Limit the rate of new SSH connections per host, similar to iptables
   hashlimit:

        flow ip saddr timeout 60s \
        limit 10/second \
        accept

2. Account network traffic between each set of /24 networks:

        flow ip saddr & 255.255.255.0 . ip daddr & 255.255.255.0 \
        counter

3. Account traffic to each host per user:

        flow skuid . ip daddr \
        counter

4. Account traffic for each combination of source address and TCP flags:

        flow ip saddr . tcp flags \
        counter

The resulting set content after a Xmas-scan look like this:

{
        192.168.122.1 . fin | psh | urg : counter packets 1001 bytes 40040,
        192.168.122.1 . ack : counter packets 74 bytes 3848,
        192.168.122.1 . psh | ack : counter packets 35 bytes 3144
}

In the future the "expressions attached to elements" will be extended
to also support user created non-stateful expressions to allow to
efficiently select beween a set of parameter sets, f.i. a set of log
statements with different prefixes based on the interface, which currently
require one rule each. This will most likely have to wait until the next
kernel version though.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-14 18:51:19 -04:00
Linus Torvalds
ca2ec32658 Merge branch 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs update from Al Viro:
 "Part one:

   - struct filename-related cleanups

   - saner iov_iter_init() replacements (and switching the syscalls to
     use of those)

   - ntfs switch to ->write_iter() (Anton)

   - aio cleanups and splitting iocb into common and async parts
     (Christoph)

   - assorted fixes (me, bfields, Andrew Elble)

  There's a lot more, including the completion of switchover to
  ->{read,write}_iter(), d_inode/d_backing_inode annotations, f_flags
  race fixes, etc, but that goes after #for-davem merge.  David has
  pulled it, and once it's in I'll send the next vfs pull request"

* 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (35 commits)
  sg_start_req(): use import_iovec()
  sg_start_req(): make sure that there's not too many elements in iovec
  blk_rq_map_user(): use import_single_range()
  sg_io(): use import_iovec()
  process_vm_access: switch to {compat_,}import_iovec()
  switch keyctl_instantiate_key_common() to iov_iter
  switch {compat_,}do_readv_writev() to {compat_,}import_iovec()
  aio_setup_vectored_rw(): switch to {compat_,}import_iovec()
  vmsplice_to_user(): switch to import_iovec()
  kill aio_setup_single_vector()
  aio: simplify arguments of aio_setup_..._rw()
  aio: lift iov_iter_init() into aio_setup_..._rw()
  lift iov_iter into {compat_,}do_readv_writev()
  NFS: fix BUG() crash in notify_change() with patch to chown_common()
  dcache: return -ESTALE not -EBUSY on distributed fs race
  NTFS: Version 2.1.32 - Update file write from aio_write to write_iter.
  VFS: Add iov_iter_fault_in_multipages_readable()
  drop bogus check in file_open_root()
  switch security_inode_getattr() to struct path *
  constify tomoyo_realpath_from_path()
  ...
2015-04-14 15:31:03 -07:00
David S. Miller
87ffabb1f0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The dwmac-socfpga.c conflict was a case of a bug fix overlapping
changes in net-next to handle an error pointer differently.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-14 15:44:14 -04:00
Linus Torvalds
eeee78cf77 Some clean ups and small fixes, but the biggest change is the addition
of the TRACE_DEFINE_ENUM() macro that can be used by tracepoints.
 
 Tracepoints have helper functions for the TP_printk() called
 __print_symbolic() and __print_flags() that lets a numeric number be
 displayed as a a human comprehensible text. What is placed in the
 TP_printk() is also shown in the tracepoint format file such that
 user space tools like perf and trace-cmd can parse the binary data
 and express the values too. Unfortunately, the way the TRACE_EVENT()
 macro works, anything placed in the TP_printk() will be shown pretty
 much exactly as is. The problem arises when enums are used. That's
 because unlike macros, enums will not be changed into their values
 by the C pre-processor. Thus, the enum string is exported to the
 format file, and this makes it useless for user space tools.
 
 The TRACE_DEFINE_ENUM() solves this by converting the enum strings
 in the TP_printk() format into their number, and that is what is
 shown to user space. For example, the tracepoint tlb_flush currently
 has this in its format file:
 
      __print_symbolic(REC->reason,
         { TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" },
         { TLB_REMOTE_SHOOTDOWN, "remote shootdown" },
         { TLB_LOCAL_SHOOTDOWN, "local shootdown" },
         { TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" })
 
 After adding:
 
      TRACE_DEFINE_ENUM(TLB_FLUSH_ON_TASK_SWITCH);
      TRACE_DEFINE_ENUM(TLB_REMOTE_SHOOTDOWN);
      TRACE_DEFINE_ENUM(TLB_LOCAL_SHOOTDOWN);
      TRACE_DEFINE_ENUM(TLB_LOCAL_MM_SHOOTDOWN);
 
 Its format file will contain this:
 
      __print_symbolic(REC->reason,
         { 0, "flush on task switch" },
         { 1, "remote shootdown" },
         { 2, "local shootdown" },
         { 3, "local mm shootdown" })
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJVLBTuAAoJEEjnJuOKh9ldjHMIALdRS755TXCZGOf0r7O2akOR
 wMPeum7C+ae1mH+jCsJKUC0/jUfQKaMt/UxoHlipDgcGg8kD2jtGnGCw4Xlwvdsr
 y4rFmcTRSl1mo0zDSsg6ujoupHlVYN0+JPjrd7S3cv/llJoY49zcanNLF7S2XLeM
 dZCtWRLWYpBiWO68ai6AqJTnE/eGFIqBI048qb5Eg8dbK243SSeSIf9Ywhb+VsA+
 aq6F7cWI/H6j4tbeza8tAN19dcwenDro5EfCDY8ARQHJu1f6Y3+DLf2imjkd6Aiu
 JVAoGIjHIpI+djwCZC1u4gi4urjfOqYartrM3Q54tb3YWYqHeNqP2ASI2a4EpYk=
 =Ixwt
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "Some clean ups and small fixes, but the biggest change is the addition
  of the TRACE_DEFINE_ENUM() macro that can be used by tracepoints.

  Tracepoints have helper functions for the TP_printk() called
  __print_symbolic() and __print_flags() that lets a numeric number be
  displayed as a a human comprehensible text.  What is placed in the
  TP_printk() is also shown in the tracepoint format file such that user
  space tools like perf and trace-cmd can parse the binary data and
  express the values too.  Unfortunately, the way the TRACE_EVENT()
  macro works, anything placed in the TP_printk() will be shown pretty
  much exactly as is.  The problem arises when enums are used.  That's
  because unlike macros, enums will not be changed into their values by
  the C pre-processor.  Thus, the enum string is exported to the format
  file, and this makes it useless for user space tools.

  The TRACE_DEFINE_ENUM() solves this by converting the enum strings in
  the TP_printk() format into their number, and that is what is shown to
  user space.  For example, the tracepoint tlb_flush currently has this
  in its format file:

     __print_symbolic(REC->reason,
        { TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" },
        { TLB_REMOTE_SHOOTDOWN, "remote shootdown" },
        { TLB_LOCAL_SHOOTDOWN, "local shootdown" },
        { TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" })

  After adding:

     TRACE_DEFINE_ENUM(TLB_FLUSH_ON_TASK_SWITCH);
     TRACE_DEFINE_ENUM(TLB_REMOTE_SHOOTDOWN);
     TRACE_DEFINE_ENUM(TLB_LOCAL_SHOOTDOWN);
     TRACE_DEFINE_ENUM(TLB_LOCAL_MM_SHOOTDOWN);

  Its format file will contain this:

     __print_symbolic(REC->reason,
        { 0, "flush on task switch" },
        { 1, "remote shootdown" },
        { 2, "local shootdown" },
        { 3, "local mm shootdown" })"

* tag 'trace-v4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (27 commits)
  tracing: Add enum_map file to show enums that have been mapped
  writeback: Export enums used by tracepoint to user space
  v4l: Export enums used by tracepoints to user space
  SUNRPC: Export enums in tracepoints to user space
  mm: tracing: Export enums in tracepoints to user space
  irq/tracing: Export enums in tracepoints to user space
  f2fs: Export the enums in the tracepoints to userspace
  net/9p/tracing: Export enums in tracepoints to userspace
  x86/tlb/trace: Export enums in used by tlb_flush tracepoint
  tracing/samples: Update the trace-event-sample.h with TRACE_DEFINE_ENUM()
  tracing: Allow for modules to convert their enums to values
  tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their values
  tracing: Update trace-event-sample with TRACE_SYSTEM_VAR documentation
  tracing: Give system name a pointer
  brcmsmac: Move each system tracepoints to their own header
  iwlwifi: Move each system tracepoints to their own header
  mac80211: Move message tracepoints to their own header
  tracing: Add TRACE_SYSTEM_VAR to xhci-hcd
  tracing: Add TRACE_SYSTEM_VAR to kvm-s390
  tracing: Add TRACE_SYSTEM_VAR to intel-sst
  ...
2015-04-14 10:49:03 -07:00
Linus Torvalds
d0bbe0dd35 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree from Jiri Kosina:
 "Usual trivial tree updates.  Nothing outstanding -- mostly printk()
  and comment fixes and unused identifier removals"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  goldfish: goldfish_tty_probe() is not using 'i' any more
  powerpc: Fix comment in smu.h
  qla2xxx: Fix printks in ql_log message
  lib: correct link to the original source for div64_u64
  si2168, tda10071, m88ds3103: Fix firmware wording
  usb: storage: Fix printk in isd200_log_config()
  qla2xxx: Fix printk in qla25xx_setup_mode
  init/main: fix reset_device comment
  ipwireless: missing assignment
  goldfish: remove unreachable line of code
  coredump: Fix do_coredump() comment
  stacktrace.h: remove duplicate declaration task_struct
  smpboot.h: Remove unused function prototype
  treewide: Fix typo in printk messages
  treewide: Fix typo in printk messages
  mod_devicetable: fix comment for match_flags
2015-04-14 09:50:27 -07:00
David S. Miller
6e8a9d9148 Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Al Viro says:

====================
netdev-related stuff in vfs.git

There are several commits sitting in vfs.git that probably ought to go in
via net-next.git.  First of all, there's merge with vfs.git#iocb - that's
Christoph's aio rework, which has triggered conflicts with the ->sendmsg()
and ->recvmsg() patches a while ago.  It's not so much Christoph's stuff
that ought to be in net-next, as (pretty simple) conflict resolution on merge.
The next chunk is switch to {compat_,}import_iovec/import_single_range - new
safer primitives for initializing iov_iter.  The primitives themselves come
from vfs/git#iov_iter (and they are used quite a lot in vfs part of queue),
conversion of net/socket.c syscalls belongs in net-next, IMO.  Next there's
afs and rxrpc stuff from dhowells.  And then there's sanitizing kernel_sendmsg
et.al.  + missing inlined helper for "how much data is left in msg->msg_iter" -
this stuff is used in e.g.  cifs stuff, but it belongs in net-next.

That pile is pullable from
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-davem

I'll post the individual patches in there in followups; could you take a look
and tell if everything in there is OK with you?
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-13 18:18:05 -04:00
Eric Dumazet
789f558cfb tcp/dccp: get rid of central timewait timer
Using a timer wheel for timewait sockets was nice ~15 years ago when
memory was expensive and machines had a single processor.

This does not scale, code is ugly and source of huge latencies
(Typically 30 ms have been seen, cpus spinning on death_lock spinlock.)

We can afford to use an extra 64 bytes per timewait sock and spread
timewait load to all cpus to have better behavior.

Tested:

On following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1
on the target (lpaa24)

Before patch :

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
419594

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
437171

While test is running, we can observe 25 or even 33 ms latencies.

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms
rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms
rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2

After patch :

About 90% increase of throughput :

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
810442

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
800992

And latencies are kept to minimal values during this load, even
if network utilization is 90% higher :

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms
rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-13 16:40:05 -04:00
Richard Weinberger
20a1d16526 netfilter: Fix format string of nfnetlink_log proc file
The printed values are all of type unsigned integer, therefore use
%u instead of %d. Otherwise an user can face negative values.

Signed-off-by: Richard Weinberger <richard@nod.at>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-13 16:35:17 -04:00
Richard Weinberger
6b46f7b7e9 netfilter: Fix format string of nfnetlink_queue proc file
The printed values are all of type unsigned integer, therefore use
%u instead of %d. Otherwise an user can face negative values.

Fixes:
$ cat /proc/net/netfilter/nfnetlink_queue
    0  29508   278 2 65531     0 2004213241 -2129885586  1
    1 -27747     0 2 65531     0     0        0  1
    2 -27748     0 2 65531     0     0        0  1

Signed-off-by: Richard Weinberger <richard@nod.at>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-13 16:35:16 -04:00
Richard Weinberger
cc6bc44863 netfilter: Fix portid types
The netlink portid is an unsigned integer, use this type
also in netfilter.

Signed-off-by: Richard Weinberger <richard@nod.at>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-13 16:35:16 -04:00
Richard Weinberger
65bc4f936e nfc: Fix portid type in urelease_work
portid is an unsigned integer. Fix urelease_work to
match all other portid user in the kernel.

Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-13 16:35:16 -04:00
Pablo Neira Ayuso
97bb43c3e0 netfilter: nf_tables: get rid of the expression example code
There's an example net/netfilter/nft_expr_template.c example file in tree that
got out of sync along time, remove it.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Patrick McHardy <kaber@trash.net>
2015-04-13 20:20:09 +02:00
Patrick McHardy
3e135cd499 netfilter: nft_dynset: dynamic stateful expression instantiation
Support instantiating stateful expressions based on a template that
are associated with dynamically created set entries. The expressions
are evaluated when adding or updating the set element.

This allows to maintain per flow state using the existing set
infrastructure and expression types, with arbitrary definitions of
a flow.

Usage is currently restricted to anonymous sets, meaning only a single
binding can exist, since the desired semantics of multiple independant
bindings haven't been defined so far.

Examples (userspace syntax is still WIP):

1. Limit the rate of new SSH connections per host, similar to iptables
   hashlimit:

	flow ip saddr timeout 60s \
	limit 10/second \
	accept

2. Account network traffic between each set of /24 networks:

	flow ip saddr & 255.255.255.0 . ip daddr & 255.255.255.0 \
	counter

3. Account traffic to each host per user:

	flow skuid . ip daddr \
	counter

4. Account traffic for each combination of source address and TCP flags:

	flow ip saddr . tcp flags \
	counter

The resulting set content after a Xmas-scan look like this:

{
	192.168.122.1 . fin | psh | urg : counter packets 1001 bytes 40040,
	192.168.122.1 . ack : counter packets 74 bytes 3848,
	192.168.122.1 . psh | ack : counter packets 35 bytes 3144
}

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13 20:19:55 +02:00
Patrick McHardy
7c6c6e95a1 netfilter: nf_tables: add flag to indicate set contains expressions
Add a set flag to indicate that the set is used as a state table and
contains expressions for evaluation. This operation is mutually
exclusive with the mapping operation, so sets specifying both are
rejected. The lookup expression also rejects binding to state tables
since it only deals with loopup and map operations.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13 20:12:32 +02:00
Patrick McHardy
151d799a61 netfilter: nf_tables: mark stateful expressions
Add a flag to mark stateful expressions.

This is used for dynamic expression instanstiation to limit the usable
expressions. Strictly speaking only the dynset expression can not be
used in order to avoid recursion, but since dynamically instantiating
non-stateful expressions will simply create an identical copy, which
behaves no differently than the original, this limits to expressions
where it actually makes sense to dynamically instantiate them.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13 20:12:31 +02:00
Patrick McHardy
f25ad2e907 netfilter: nf_tables: prepare for expressions associated to set elements
Preparation to attach expressions to set elements: add a set extension
type to hold an expression and dump the expression information with the
set element.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13 20:12:31 +02:00
Patrick McHardy
0b2d8a7b63 netfilter: nf_tables: add helper functions for expression handling
Add helper functions for initializing, cloning, dumping and destroying
a single expression that is not part of a rule.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13 20:12:31 +02:00
Kenneth Klette Jonassen
3d0d26c797 tcp: fix bogus RTT for CC when retransmissions are acked
Since retransmitted segments are not used for RTT estimation, previously
SACKed segments present in the rtx queue are used. This estimation can be
several times larger than the actual RTT. When a cumulative ack covers both
previously SACKed and retransmitted segments, CC may thus get a bogus RTT.

Such segments previously had an RTT estimation in tcp_sacktag_one(), so it
seems reasonable to not reuse them in tcp_clean_rtx_queue() at all.

Afaik, this has had no effect on SRTT/RTO because of Karn's check.

Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-13 13:54:25 -04:00
Daniel Borkmann
4577139b2d net: use jump label patching for ingress qdisc in __netif_receive_skb_core
Even if we make use of classifier and actions from the egress
path, we're going into handle_ing() executing additional code
on a per-packet cost for ingress qdisc, just to realize that
nothing is attached on ingress.

Instead, this can just be blinded out as a no-op entirely with
the use of a static key. On input fast-path, we already make
use of static keys in various places, e.g. skb time stamping,
in RPS, etc. It makes sense to not waste time when we're assured
that no ingress qdisc is attached anywhere.

Enabling/disabling of that code path is being done via two
helpers, namely net_{inc,dec}_ingress_queue(), that are being
invoked under RTNL mutex when a ingress qdisc is being either
initialized or destructed.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-13 13:34:40 -04:00
Patrick McHardy
7d7402642e netfilter: nf_tables: variable sized set element keys / data
This patch changes sets to support variable sized set element keys / data
up to 64 bytes each by using variable sized set extensions. This allows
to use concatenations with bigger data items suchs as IPv6 addresses.

As a side effect, small keys/data now don't require the full 16 bytes
of struct nft_data anymore but just the space they need.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13 17:17:31 +02:00
Patrick McHardy
d0a11fc3dc netfilter: nf_tables: support variable sized data in nft_data_init()
Add a size argument to nft_data_init() and pass in the available space.
This will be used by the following patches to support variable sized
set element data.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13 17:17:30 +02:00
Patrick McHardy
49499c3e6e netfilter: nf_tables: switch registers to 32 bit addressing
Switch the nf_tables registers from 128 bit addressing to 32 bit
addressing to support so called concatenations, where multiple values
can be concatenated over multiple registers for O(1) exact matches of
multiple dimensions using sets.

The old register values are mapped to areas of 128 bits for compatibility.
When dumping register numbers, values are expressed using the old values
if they refer to the beginning of a 128 bit area for compatibility.

To support concatenations, register loads of less than a full 32 bit
value need to be padded. This mainly affects the payload and exthdr
expressions, which both unconditionally zero the last word before
copying the data.

Userspace fully passes the testsuite using both old and new register
addressing.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13 17:17:29 +02:00
Patrick McHardy
b1c96ed37c netfilter: nf_tables: add register parsing/dumping helpers
Add helper functions to parse and dump register values in netlink attributes.
These helpers will later be changed to take care of translation between the
old 128 bit and the new 32 bit register numbers.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13 17:17:28 +02:00
Patrick McHardy
8cd8937ac0 netfilter: nf_tables: convert sets to u32 data pointers
Simple conversion to use u32 pointers to the beginning of the data
area to keep follow up patches smaller.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13 17:17:27 +02:00
Patrick McHardy
e562d860d7 netfilter: nf_tables: kill nft_data_cmp()
Only needlessly complicates things due to requiring specific argument
types. Use memcmp directly.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13 17:17:26 +02:00
Patrick McHardy
fad136ea0d netfilter: nf_tables: convert expressions to u32 register pointers
Simple conversion to use u32 pointers to the beginning of the registers
to keep follow up patches smaller.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13 17:17:25 +02:00
Patrick McHardy
1ca2e1702c netfilter: nf_tables: use struct nft_verdict within struct nft_data
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13 17:17:24 +02:00
Patrick McHardy
a55e22e92f netfilter: nf_tables: get rid of NFT_REG_VERDICT usage
Replace the array of registers passed to expressions by a struct nft_regs,
containing the verdict as a seperate member, which aliases to the
NFT_REG_VERDICT register.

This is needed to seperate the verdict from the data registers completely,
so their size can be changed.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13 17:17:07 +02:00
Patrick McHardy
d07db9884a netfilter: nf_tables: introduce nft_validate_register_load()
Change nft_validate_input_register() to not only validate the input
register number, but also the length of the load, and rename it to
nft_validate_register_load() to reflect that change.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13 16:25:50 +02:00
Patrick McHardy
27e6d2017a netfilter: nf_tables: kill nft_validate_output_register()
All users of nft_validate_register_store() first invoke
nft_validate_output_register(). There is in fact no use for using it
on its own, so simplify the code by folding the functionality into
nft_validate_register_store() and kill it.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13 16:25:50 +02:00
Patrick McHardy
58f40ab6e2 netfilter: nft_lookup: use nft_validate_register_store() to validate types
In preparation of validating the length of a register store, use
nft_validate_register_store() in nft_lookup instead of open coding the
validation.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13 16:25:49 +02:00
Patrick McHardy
1ec10212f9 netfilter: nf_tables: rename nft_validate_data_load()
The existing name is ambiguous, data is loaded as well when we read from
a register. Rename to nft_validate_register_store() for clarity and
consistency with the upcoming patch to introduce its counterpart,
nft_validate_register_load().

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13 16:25:49 +02:00
Patrick McHardy
45d9bcda21 netfilter: nf_tables: validate len in nft_validate_data_load()
For values spanning multiple registers, we need to validate that enough
space is available from the destination register onwards. Add a len
argument to nft_validate_data_load() and consolidate the existing length
validations in preparation of that.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13 16:25:49 +02:00
David S. Miller
e60a9de49c Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates 2015-04-11

This series contains updates to iflink, ixgbe and ixgbevf.

The entire set of changes come from Vlad Zolotarov to ultimately add
the ethtool ops to VF driver to allow querying the RSS indirection table
and RSS random key.

Currently we support only 82599 and x540 devices.  On those devices, VFs
share the RSS redirection table and hash key with a PF.  Letting the VF
query this information may introduce some security risks, therefore this
feature will be disabled by default.

The new netdev op allows a system administrator to change the default
behaviour with "ip link set" command.  The relevant iproute2 patch has
already been sent and awaits for this series upstream.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-12 21:36:57 -04:00
WANG Cong
7a6c8c34e5 fou: implement FOU_CMD_GET
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-12 21:25:13 -04:00
WANG Cong
02d793c5bb fou: add network namespace support
Also convert the spinlock to a mutex.

Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-12 21:25:13 -04:00
WANG Cong
4cbcdf2b6c fou: always use be16 for port
udp_config.local_udp_port is be16. And iproute2 passes
network order for FOU_ATTR_PORT.

This doesn't fix any bug, just for consistency.

Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-12 21:25:13 -04:00
WANG Cong
67270636a8 fou: exit early when parsing config fails
Not a big deal, just for corretness.

Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-12 21:25:13 -04:00
WANG Cong
9272f04872 fou: avoid calling udp_del_offload() twice
This fixes the following harmless warning:

./ip/ip fou del port 7777
[  122.907516] udp_del_offload: didn't find offload for port 7777

Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-12 21:25:13 -04:00
Eric Dumazet
52db70dca5 tcp: do not cache align timewait sockets
With recent adoption of skc_cookie in struct sock_common,
struct tcp_timewait_sock size increased from 192 to 200 bytes
on 64bit arches. SLAB rounds then to 256 bytes.

It is time to drop SLAB_HWCACHE_ALIGN constraint for twsk_slab.

This saves about 12 MB of memory on typical configuration reaching
262144 timewait sockets, and has no noticeable impact on performance.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-12 21:16:05 -04:00
David S. Miller
4e78eb0dbf There isn't much left, but we have
* new mac80211 internal software queue to allow drivers to have
    shorter hardware queues and pull on-demand
  * use rhashtable for mac80211 station table
  * minstrel rate control debug improvements and some refactoring
  * fix noisy message about TX power reduction
  * fix continuous message printing and activity if CRDA doesn't respond
  * fix VHT-related capabilities with "iw connect" or "iwconfig ..."
  * fix Kconfig for cfg80211 wireless extensions compatibility
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJVJ7CPAAoJEDBSmw7B7bqr8+IQAKCAbUyd6PFRT5tcz9kW5GCW
 /ibb+n1e14yWKgNEe1gddUGKG/L3HGCBXNkCYzR2M8mlL7dLPqspaBcGHS4dx8F4
 D0AuikqvtXIxfAXi0zmU2uo7rH7u2X34R2LtS8AlKByD+jmFvxMiPPvxNFgzJu/7
 63UQm73p2pnu/KdXLW1OQEcpZtZJ9+N/uBiq9zbVdX3A8T84ME0oMyy+EAQqCZdK
 CcsTXHCnAgmmXWJlu1JRdopr1bd38mSGB70eXduFtPqDdmtQRnoaCQ9e+tJDA4j4
 svEw0yDmsc4WG1EKLKKCRd3uFOZsng+lcXrHfpm5wlSPpCOItfQ9BzT3x1u6Y5JU
 Z1WMOMkkEce+95U7/RLoXwC/2RS3XelUXTde4cGIRMvO5drOrU58P0gdn3J+yKbv
 6v+2GGKy/39tdXUOxIl3EZT/huIl+h1UNO8C2hyaEwdXK+X1zl31/u6kk1Ns18Wr
 YPEJixxHx0zR8jaZgDC7OlWLuqn4Ay+Ls9yCyIesdHzKpizJKqn83PntYnpJmxoA
 9hlIyRDWnqH44KxzB85ni1C2Qudec3mcCWIWV7M+UoSC1Cgs/LxDzH7kRejR2ZIl
 vRhg5pqyr53L0h2lq5DO4Cj4UzbXb7YioKJRxjyKloNOlRrCZtK/VEsHbdsKEcIp
 d/wHj1AyFZeQfuhk8Qqr
 =mtuo
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2015-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
There isn't much left, but we have
 * new mac80211 internal software queue to allow drivers to have
   shorter hardware queues and pull on-demand
 * use rhashtable for mac80211 station table
 * minstrel rate control debug improvements and some refactoring
 * fix noisy message about TX power reduction
 * fix continuous message printing and activity if CRDA doesn't respond
 * fix VHT-related capabilities with "iw connect" or "iwconfig ..."
 * fix Kconfig for cfg80211 wireless extensions compatibility
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-12 20:43:46 -04:00
Al Viro
5d5d568975 make new_sync_{read,write}() static
All places outside of core VFS that checked ->read and ->write for being NULL or
called the methods directly are gone now, so NULL {read,write} with non-NULL
{read,write}_iter will do the right thing in all cases.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:40 -04:00
Al Viro
dcdbd7b269 net/9p: remove (now-)unused helpers
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:28:29 -04:00
Al Viro
21c9f5ccb1 p9_client_attach(): set fid->uid correctly
it's almost always equal to current_fsuid(), but there's an exception -
if the first writeback fid is opened by non-root *and* that happens before
root has done any lookups in /, we end up doing attach for root.  The
current code leaves the resulting FID owned by root from the server POV
and by non-root from the client one.  Unfortunately, it means that e.g.
massive dcache eviction will leave that user buggered - they'll end
up redoing walks from / *and* picking that FID every time.  As soon as
they try to create something, the things will get nasty.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:28:28 -04:00
Al Viro
e1200fe68f 9p: switch p9_client_read() to passing struct iov_iter *
... and make it loop

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:28:27 -04:00
Al Viro
070b3656cf 9p: switch p9_client_write() to passing it struct iov_iter *
... and make it loop until it's done

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:28:25 -04:00
Al Viro
4f3b35c157 net/9p: switch the guts of p9_client_{read,write}() to iov_iter
... and have get_user_pages_fast() mapping fewer pages than requested
to generate a short read/write.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:28:25 -04:00
Al Viro
c0fec3a98b Merge branch 'iocb' into for-next 2015-04-11 22:24:41 -04:00
Al Viro
01e97e6517 new helper: msg_data_left()
convert open-coded instances

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 15:53:35 -04:00
Al Viro
a2dd3793a1 Merge remote-tracking branch 'dh/afs' into for-davem 2015-04-11 15:51:09 -04:00
Al Viro
d8725c86ae get rid of the size argument of sock_sendmsg()
it's equal to iov_iter_count(&msg->msg_iter) in all cases

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 15:27:37 -04:00
Vlad Zolotarov
01a3d79681 if_link: Add an additional parameter to ifla_vf_info for RSS querying
Add configuration setting for drivers to allow/block an RSS Redirection
Table and a Hash Key querying for discrete VFs.

On some devices VF share the mentioned above information with PF and
querying it may adduce a theoretical security risk. We want to let a
system administrator to decide if he/she wants to take this risk or not.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2015-04-10 21:57:22 -07:00
Thomas Graf
78ebb0d00b rtnetlink: Mark name argument of rtnl_create_link() const
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-10 12:42:40 -07:00
David S. Miller
c3d0dac693 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2015-04-09

We've had enough new patches during the past week (especially from
Marcel) that it'd be good to still get these queued for 4.1.

The majority of the changes are from Marcel with lots of cleanup &
refactoring patches for the HCI UART driver. Marcel also split out some
Broadcom & Intel vendor specific functionality into two new btintel &
btbcm modules.

In addition to the HCI driver changes there's the completion of our
local OOB data interface for pairing, added support for requesting
remote LE features when connecting, as well as a couple of minor fixes
for mac802154.

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-09 18:31:50 -04:00
Eric Dumazet
b52e69217b tcp: md5: fix a typo in tcp_v4_md5_lookup()
Lookup key for tcp_md5_do_lookup() has to be taken
from addr_sk, not sk (which can be the listener)

Fixes: fd3a154a00 ("tcp: md5: get rid of tcp_v[46]_reqsk_md5_lookup()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-09 18:11:16 -04:00
Hubert Sokolowski
1e53d5bb88 net: Pass VLAN ID to rtnl_fdb_notify.
When an FDB entry is added or deleted the information about VLAN
is not passed to listening applications like 'bridge monitor fdb'.
With this patch VLAN ID is passed if it was set in the original
netlink message.

Also remove an unused bdev variable.

Signed-off-by: Hubert Sokolowski <hubert.sokolowski@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-09 17:30:58 -04:00
Eric Dumazet
b50edd7812 tcp: tcp_make_synack() should clear skb->tstamp
I noticed tcpdump was giving funky timestamps for locally
generated SYNACK messages on loopback interface.

11:42:46.938990 IP 127.0.0.1.48245 > 127.0.0.2.23850: S
945476042:945476042(0) win 43690 <mss 65495,nop,nop,sackOK,nop,wscale 7>

20:28:58.502209 IP 127.0.0.2.23850 > 127.0.0.1.48245: S
3160535375:3160535375(0) ack 945476043 win 43690 <mss
65495,nop,nop,sackOK,nop,wscale 7>

This is because we need to clear skb->tstamp before
entering lower stack, otherwise net_timestamp_check()
does not set skb->tstamp.

Fixes: 7faee5c0d5 ("tcp: remove TCP_SKB_CB(skb)->when")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-09 17:28:57 -04:00
Jesse Gross
b736a623bd udptunnels: Call handle_offloads after inserting vlan tag.
handle_offloads() calls skb_reset_inner_headers() to store
the layer pointers to the encapsulated packet. However, we
currently push the vlag tag (if there is one) onto the packet
afterwards. This changes the MAC header for the encapsulated
packet but it is not reflected in skb->inner_mac_header, which
breaks GSO and drivers which attempt to use this for encapsulation
offloads.

Fixes: 1eaa8178 ("vxlan: Add tx-vlan offload support.")
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-09 14:56:32 -04:00
David S. Miller
ca69d7102f Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next tree.
They are:

* nf_tables set timeout infrastructure from Patrick Mchardy.

1) Add support for set timeout support.

2) Add support for set element timeouts using the new set extension
   infrastructure.

4) Add garbage collection helper functions to get rid of stale elements.
   Elements are accumulated in a batch that are asynchronously released
   via RCU when the batch is full.

5) Add garbage collection synchronization helpers. This introduces a new
   element busy bit to address concurrent access from the netlink API and the
   garbage collector.

5) Add timeout support for the nft_hash set implementation. The garbage
   collector peridically checks for stale elements from the workqueue.

* iptables/nftables cgroup fixes:

6) Ignore non full-socket objects from the input path, otherwise cgroup
   match may crash, from Daniel Borkmann.

7) Fix cgroup in nf_tables.

8) Save some cycles from xt_socket by skipping packet header parsing when
   skb->sk is already set because of early demux. Also from Daniel.

* br_netfilter updates from Florian Westphal.

9) Save frag_max_size and restore it from the forward path too.

10) Use a per-cpu area to restore the original source MAC address when traffic
    is DNAT'ed.

11) Add helper functions to access physical devices.

12) Use these new physdev helper function from xt_physdev.

13) Add another nf_bridge_info_get() helper function to fetch the br_netfilter
    state information.

14) Annotate original layer 2 protocol number in nf_bridge info, instead of
    using kludgy flags.

15) Also annotate the pkttype mangling when the packet travels back and forth
    from the IP to the bridge layer, instead of using a flag.

* More nf_tables set enhancement from Patrick:

16) Fix possible usage of set variant that doesn't support timeouts.

17) Avoid spurious "set is full" errors from Netlink API when there are pending
    stale elements scheduled to be released.

18) Restrict loop checks to set maps.

19) Add support for dynamic set updates from the packet path.

20) Add support to store optional user data (eg. comments) per set element.

BTW, I have also pulled net-next into nf-next to anticipate the conflict
resolution between your okfn() signature changes and Florian's br_netfilter
updates.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-09 14:46:04 -04:00
David S. Miller
9399bdcbb5 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2015-04-09

1) Prohibit the use/abuse of the xfrm netlink interface on
   32/64 bit compatibility tasks. We need a full compat
   layer before we can allow this. From Fan Du.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-09 14:41:47 -04:00
David S. Miller
634d8ee4de Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2015-04-09

1) We dereferenced the xfrm outer_mode too early, larval
   SAs don't have it set. Move the dereference of the
   outer mode below the larval SA check to fix it.
   From Alexey Dobriyan.

2) Fix vti6 tunnel uninit on namespace crosssing.
   From Yao Xiwei.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-09 14:39:37 -04:00
Marcel Holtmann
0fe29fd1cd Bluetooth: Read LE remote features during connection establishment
When establishing a Bluetooth LE connection, read the remote used
features mask to determine which features are supported. This was
not really needed with Bluetooth 4.0, but since Bluetooth 4.1 and
also 4.2 have introduced new optional features, this becomes more
important.

This works the same as with BR/EDR where the connection enters the
BT_CONFIG stage and hci_connect_cfm call is delayed until the remote
features have been retrieved. Only after successfully receiving the
remote features, the connection enters the BT_CONNECTED state.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-04-09 08:36:54 +03:00
Al Viro
6aa248145a switch kernel_sendmsg() and kernel_recvmsg() to iov_iter_kvec()
For kernel_sendmsg() that eliminates the need to play with setfs();
for kernel_recvmsg() it does *not* - a couple of callers are using
it with non-NULL ->msg_control, which would be treated as userland
address on recvmsg side of things.

In all cases we are really setting a kvec-backed iov_iter, though.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-09 00:02:34 -04:00
Al Viro
da18428498 net: switch importing msghdr from userland to {compat_,}import_iovec()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-09 00:02:26 -04:00
Al Viro
602bd0e90e net: switch sendto() and recvfrom() to import_single_range()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-09 00:02:21 -04:00
Al Viro
237dae8890 Merge branch 'iocb' into for-davem
trivial conflict in net/socket.c and non-trivial one in crypto -
that one had evaded aio_complete() removal.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-09 00:01:38 -04:00
Eric Dumazet
dd929c1b3d tcp: do not rearm rsk_timer on FastOpen requests
FastOpen requests are not like other regular request sockets.

They do not yet use rsk_timer : tcp_fastopen_queue_check()
simply manually removes one expired request from fastopenq->rskq_rst
list.

Therefore, tcp_check_req() must not call mod_timer_pending(),
otherwise we crash because rsk_timer was not initialized.

Fixes: fa76ce7328 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-08 20:19:42 -04:00
David S. Miller
82740b9ac2 NFC: 4.1 pull request
This is the NFC pull request for 4.1.
 
 This is a shorter one than usual, as the Intel Field Peak NFC
 driver could not make it in time.
 
 We have:
 
 - A new driver for NXP NCI based chipsets, like e.g. the NPC100 or
   the PN7150. It currently only supports an i2c physical layer, but
   could easily be extended to work on top of e.g. SPI.
   This driver also includes support for user space triggered firmware
   updates.
 
 - A few minor st21nfc[ab] fixes, cleanups, and comments improvements.
 
 - A pn533 error return fix.
 
 - A few NFC related logs formatting cleanups.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVJV81AAoJEIqAPN1PVmxKM3sP/R6fKXMxtVxXsiPzBMk+SpBI
 onNXbCx27rp1lHTFznE16JAaaSfcQEwSQmYx7xa6KXqRYVlDfRfC+5R+rPGrCjfH
 kV/eLBNnYpmPw0ViVT7dsWK0b4me0k5pr9ki9mze3YxvuMbA5vdv0tvuRFz5IRF/
 hl+WI5pntGuRtnIyBKIauAMylylUYVvCBGhmHnveiX0Dp8noJLBSV0wvdzujm51S
 +Uio/jHlUEV2lotrQBOfWNtEkwonXSwzZWSzimBCyEGLAwTx4lGXHmQftOtPd3zE
 sOT7Gw77ZCsulHoHiJyC0KpDSDS0NYVrtTI5BiuGgAivGi0YZw2XD3CYRIdy0m2Z
 lMoodYdiCgsxmrU6I6Rd/7DQBxD0Vhc+CGAyk41f7AwU4Fwq105kpSLupTtU+NzT
 ZpdvSXeirU8sxt+3WDOgv8esyYGZVVD8GuBbofMZQZ0vLq+FcDpYAW4w3LKvvi6X
 C+WN8f7SCI0kqpd4leyl6EG3SoQKFyPWobu0IlV520R9b76iBcyqooTIpvVa4Yk7
 az6fKhi9gK/T6FHW78y6fnkczd47JKrC924m+g6P/hhTD5zQ956ferp0uFTjkPtF
 8kNgRT7OIRi1JO783cnQx1uA61axC3GX1HFzsD9paVkTwRNtJ3eFsxtcYt7Nfv8D
 WGNWRQp5LmBsD/SqFBfk
 =MzOJ
 -----END PGP SIGNATURE-----

Merge tag 'nfc-next-4.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next

Samuel Ortiz says:

====================
NFC: 4.1 pull request

This is the NFC pull request for 4.1.

This is a shorter one than usual, as the Intel Field Peak NFC
driver could not make it in time.

We have:

- A new driver for NXP NCI based chipsets, like e.g. the NPC100 or
  the PN7150. It currently only supports an i2c physical layer, but
  could easily be extended to work on top of e.g. SPI.
  This driver also includes support for user space triggered firmware
  updates.

- A few minor st21nfc[ab] fixes, cleanups, and comments improvements.

- A pn533 error return fix.

- A few NFC related logs formatting cleanups.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-08 17:12:50 -04:00
Andi Kleen
5eeb292215 fou: Don't use const __read_mostly
const __read_mostly is a senseless combination. If something
is already const it cannot be __read_mostly. Remove the bogus
__read_mostly in the fou driver.

This fixes section conflicts with LTO.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-08 15:29:08 -04:00
David Miller
c1f8667677 netfilter: Fix switch statement warnings with recent gcc.
More recent GCC warns about two kinds of switch statement uses:

1) Switching on an enumeration, but not having an explicit case
   statement for all members of the enumeration.  To show the
   compiler this is intentional, we simply add a default case
   with nothing more than a break statement.

2) Switching on a boolean value.  I think this warning is dumb
   but nevertheless you get it wholesale with -Wswitch.

This patch cures all such warnings in netfilter.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-08 15:20:50 -04:00
Sowmini Varadhan
443be0e5af RDS: make sure not to loop forever inside rds_send_xmit
If a determined set of concurrent senders keep the send queue full,
we can loop forever inside rds_send_xmit.  This fix has two parts.

First we are dropping out of the while(1) loop after we've processed a
large batch of messages.

Second we add a generation number that gets bumped each time the
xmit bit lock is acquired.  If someone else has jumped in and
made progress in the queue, we skip our goto restart.

Original patch by Chris Mason.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-08 15:17:32 -04:00
Sowmini Varadhan
1789b2c077 RDS: only use passive connections when addresses match
Passive connections were added for the case where one loopback IB
connection between identical addresses needs another connection to store
the second QP.  Unfortunately, they were also created in the case where
the addesses differ and we already have both QPs.

This lead to a message reordering bug.

- two different IB interfaces and addresses on a machine: A B
- traffic is sent from A to B
- connection from A-B is created, connect request sent
- listening accepts connect request, B-A is created
- traffic flows, next_rx is incremented
- unacked messages exist on the retrans list
- connection A-B is shut down, new connect request sent
- listen sees existing loopback B-A, creates new passive B-A
- retrans messages are sent and delivered because of 0 next_rx

The problem is that the second connection request saw the previously
existing parent connection.  Instead of using it, and using the existing
next_rx_seq state for the traffic between those IPs, it mistakenly
thought that it had to create a passive connection.

We fix this by only using passive connections in the special case where
laddr and faddr match.  In this case we'll only ever have one parent
sending connection requests and one passive connection created as the
listening path sees the existing parent connection which initiated the
request.

Original patch by Zach Brown

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-08 15:17:23 -04:00
Pablo Neira Ayuso
aadd51aa71 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Resolve conflicts between 5888b93 ("Merge branch 'nf-hook-compress'") and
Florian Westphal br_netfilter works.

Conflicts:
        net/bridge/br_netfilter.c

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-08 18:30:21 +02:00
Hannes Frederic Sowa
1b11287118 ipv6: call iptunnel_xmit with NULL sock pointer if no tunnel sock is available
Fixes: 79b16aadea ("udp_tunnel: Pass UDP socket down through udp_tunnel{, 6}_xmit_skb().")
Reported-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-08 12:09:43 -04:00
Hannes Frederic Sowa
926a882f69 ipv4: ip_tunnel: use net namespace from rtable not socket
The socket parameter might legally be NULL, thus sock_net is sometimes
causing a NULL pointer dereference. Using net_device pointer in dst_entry
is more reliable.

Fixes: b6a7719aed ("ipv4: hash net ptr into fragmentation bucket selection")
Reported-by: Rick Jones <rick.jones2@hp.com>
Cc: Rick Jones <rick.jones2@hp.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-08 12:09:42 -04:00
Patrick McHardy
68e942e88a netfilter: nf_tables: support optional userdata for set elements
Add an userdata set extension and allow the user to attach arbitrary
data to set elements. This is intended to hold TLV encoded data like
comments or DNS annotations that have no meaning to the kernel.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-08 16:58:27 +02:00
Patrick McHardy
22fe54d5fe netfilter: nf_tables: add support for dynamic set updates
Add a new "dynset" expression for dynamic set updates.

A new set op ->update() is added which, for non existant elements,
invokes an initialization callback and inserts the new element.
For both new or existing elements the extenstion pointer is returned
to the caller to optionally perform timer updates or other actions.

Element removal is not supported so far, however that seems to be a
rather exotic need and can be added later on.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-08 16:58:27 +02:00
Patrick McHardy
11113e190b netfilter: nf_tables: support different set binding types
Currently a set binding is assumed to be related to a lookup and, in
case of maps, a data load.

In order to use bindings for set updates, the loop detection checks
must be restricted to map operations only. Add a flags member to the
binding struct to hold the set "action" flags such as NFT_SET_MAP,
and perform loop detection based on these.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-08 16:58:27 +02:00
Patrick McHardy
3dd0673ac3 netfilter: nf_tables: prepare set element accounting for async updates
Use atomic operations for the element count to avoid races with async
updates.

To properly handle the transactional semantics during netlink updates,
deleted but not yet committed elements are accounted for seperately and
are treated as being already removed. This means for the duration of
a netlink transaction, the limit might be exceeded by the amount of
elements deleted. Set implementations must be prepared to handle this.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-08 16:58:27 +02:00
Patrick McHardy
4a8678efbe netfilter: nf_tables: fix set selection when timeouts are requested
The NFT_SET_TIMEOUT flag is ignore in nft_select_set_ops, which may
lead to selection of a set implementation that doesn't actually
support timeouts.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-08 16:58:26 +02:00
Florian Westphal
a1e67951e6 netfilter: bridge: make BRNF_PKT_TYPE flag a bool
nf_bridge_info->mask is used for several things, for example to
remember if skb->pkt_type was set to OTHER_HOST.

For a bridge, OTHER_HOST is expected case. For ip forward its a non-starter
though -- routing expects PACKET_HOST.

Bridge netfilter thus changes OTHER_HOST to PACKET_HOST before hook
invocation and then un-does it after hook traversal.

This information is irrelevant outside of br_netfilter.

After this change, ->mask now only contains flags that need to be
known outside of br_netfilter in fast-path.

Future patch changes mask into a 2bit state field in sk_buff, so that
we can remove skb->nf_bridge pointer for good and consider all remaining
places that access nf_bridge info content a not-so fastpath.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-08 16:49:12 +02:00
Florian Westphal
3eaf402502 netfilter: bridge: start splitting mask into public/private chunks
->mask is a bit info field that mixes various use cases.

In particular, we have flags that are mutually exlusive, and flags that
are only used within br_netfilter while others need to be exposed to
other parts of the kernel.

Remove BRNF_8021Q/PPPoE flags.  They're mutually exclusive and only
needed within br_netfilter context.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-08 16:49:11 +02:00
Florian Westphal
383307838d netfilter: bridge: add and use nf_bridge_info_get helper
Don't access skb->nf_bridge directly, this pointer will be removed soon.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-08 16:49:10 +02:00
Florian Westphal
a99074ae1f netfilter: physdev: use helpers
Avoid skb->nf_bridge accesses where possible.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-08 16:49:09 +02:00
Florian Westphal
c737b7c451 netfilter: bridge: add helpers for fetching physin/outdev
right now we store this in the nf_bridge_info struct, accessible
via skb->nf_bridge.  This patch prepares removal of this pointer from skb:

Instead of using skb->nf_bridge->x, we use helpers to obtain the in/out
device (or ifindexes).

Followup patches to netfilter will then allow nf_bridge_info to be
obtained by a call into the br_netfilter core, rather than keeping a
pointer to it in sk_buff.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-08 16:49:08 +02:00
Florian Westphal
e70deecbf8 netfilter: bridge: don't use nf_bridge_info data to store mac header
br_netfilter maintains an extra state, nf_bridge_info, which is attached
to skb via skb->nf_bridge pointer.

Amongst other things we use skb->nf_bridge->data to store the original
mac header for every processed skb.

This is required for ip refragmentation when using conntrack
on top of bridge, because ip_fragment doesn't copy it from original skb.

However there is no need anymore to do this unconditionally.

Move this to the one place where its needed -- when br_netfilter calls
ip_fragment().

Also switch to percpu storage for this so we can handle fragmenting
without accessing nf_bridge meta data.

Only user left is neigh resolution when DNAT is detected, to hold
the original source mac address (neigh resolution builds new mac header
using bridge mac), so rename ->data and reduce its size to whats needed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-08 16:49:07 +02:00