Commit Graph

425 Commits

Author SHA1 Message Date
Paolo Abeni
e427cad6ee net: datagram: drop 'destructor' argument from several helpers
The only users for such argument are the UDP protocol and the UNIX
socket family. We can safely reclaim the accounted memory directly
from the UDP code and, after the previous patch, we can do scm
stats accounting outside the datagram helpers.

Overall this cleans up a bit some datagram-related helpers, and
avoids an indirect call per packet in the UDP receive path.

v1 -> v2:
 - call scm_stat_del() only when not peeking - Kirill
 - fix build issue with CONFIG_INET_ESPINTCP

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-28 12:12:53 -08:00
Paolo Abeni
7782040b95 unix: uses an atomic type for scm files accounting
So the scm_stat_{add,del} helper can be invoked with no
additional lock held.

This clean-up the code a bit and will make the next
patch easier.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-28 12:12:53 -08:00
David S. Miller
9f6e055907 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
The mptcp conflict was overlapping additions.

The SMC conflict was an additional and removal happening at the same
time.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 18:31:39 -08:00
David S. Miller
5c05a164d4 unix: It's CONFIG_PROC_FS not CONFIG_PROCFS
Fixes: 3a12500ed5 ("unix: define and set show_fdinfo only if procfs is enabled")
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 11:52:35 -08:00
Tobias Klauser
3a12500ed5 unix: define and set show_fdinfo only if procfs is enabled
Follow the pattern used with other *_show_fdinfo functions and only
define unix_show_fdinfo and set it in proto_ops if CONFIG_PROCFS
is set.

Fixes: 3c32da19a8 ("unix: Show number of pending scm files of receive queue in fdinfo")
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 11:11:57 -08:00
Jules Irenge
48851e9e80 af_unix: Add missing annotation for unix_wait_for_peer()
Sparse reports a warning unix_wait_for_peer()

warning: context imbalance in unix_wait_for_peer() - unexpected unlock

The root cause is the missing annotation at unix_wait_for_peer()
Add the missing annotation __releases(&unix_sk(other)->lock)

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 13:26:49 -08:00
Qian Cai
86b18aaa2b skbuff: fix a data race in skb_queue_len()
sk_buff.qlen can be accessed concurrently as noticed by KCSAN,

 BUG: KCSAN: data-race in __skb_try_recv_from_queue / unix_dgram_sendmsg

 read to 0xffff8a1b1d8a81c0 of 4 bytes by task 5371 on cpu 96:
  unix_dgram_sendmsg+0x9a9/0xb70 include/linux/skbuff.h:1821
				 net/unix/af_unix.c:1761
  ____sys_sendmsg+0x33e/0x370
  ___sys_sendmsg+0xa6/0xf0
  __sys_sendmsg+0x69/0xf0
  __x64_sys_sendmsg+0x51/0x70
  do_syscall_64+0x91/0xb47
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 write to 0xffff8a1b1d8a81c0 of 4 bytes by task 1 on cpu 99:
  __skb_try_recv_from_queue+0x327/0x410 include/linux/skbuff.h:2029
  __skb_try_recv_datagram+0xbe/0x220
  unix_dgram_recvmsg+0xee/0x850
  ____sys_recvmsg+0x1fb/0x210
  ___sys_recvmsg+0xa2/0xf0
  __sys_recvmsg+0x66/0xf0
  __x64_sys_recvmsg+0x51/0x70
  do_syscall_64+0x91/0xb47
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Since only the read is operating as lockless, it could introduce a logic
bug in unix_recvq_full() due to the load tearing. Fix it by adding
a lockless variant of skb_queue_len() and unix_recvq_full() where
READ_ONCE() is on the read while WRITE_ONCE() is on the write similar to
the commit d7d16a8935 ("net: add skb_queue_empty_lockless()").

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-06 13:59:10 +01:00
David S. Miller
4f2c17e0f3 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2020-01-21

1) Add support for TCP encapsulation of IKE and ESP messages,
   as defined by RFC 8229. Patchset from Sabrina Dubroca.

Please note that there is a merge conflict in:

net/unix/af_unix.c

between commit:

3c32da19a8 ("unix: Show number of pending scm files of receive queue in fdinfo")

from the net-next tree and commit:

b50b0580d2 ("net: add queue argument to __skb_wait_for_more_packets and __skb_{,try_}recv_datagram")

from the ipsec-next tree.

The conflict can be solved as done in linux-next.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-21 12:18:20 +01:00
David S. Miller
ac80010fc9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Mere overlapping changes in the conflicts here.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-22 15:15:05 -08:00
Kirill Tkhai
3c32da19a8 unix: Show number of pending scm files of receive queue in fdinfo
Unix sockets like a block box. You never know what is stored there:
there may be a file descriptor holding a mount or a block device,
or there may be whole universes with namespaces, sockets with receive
queues full of sockets etc.

The patch adds a little debug and accounts number of files (not recursive),
which is in receive queue of a unix socket. Sometimes this is useful
to determine, that socket should be investigated or which task should
be killed to put reference counter on a resourse.

v2: Pass correct argument to lockdep

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-12 17:04:55 -08:00
Pankaj Bharadiya
c593642c8b treewide: Use sizeof_field() macro
Replace all the occurrences of FIELD_SIZEOF() with sizeof_field() except
at places where these are defined. Later patches will remove the unused
definition of FIELD_SIZEOF().

This patch is generated using following script:

EXCLUDE_FILES="include/linux/stddef.h|include/linux/kernel.h"

git grep -l -e "\bFIELD_SIZEOF\b" | while read file;
do

	if [[ "$file" =~ $EXCLUDE_FILES ]]; then
		continue
	fi
	sed -i  -e 's/\bFIELD_SIZEOF\b/sizeof_field/g' $file;
done

Signed-off-by: Pankaj Bharadiya <pankaj.laxminarayan.bharadiya@intel.com>
Link: https://lore.kernel.org/r/20190924105839.110713-3-pankaj.laxminarayan.bharadiya@intel.com
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: David Miller <davem@davemloft.net> # for net
2019-12-09 10:36:44 -08:00
Sabrina Dubroca
b50b0580d2 net: add queue argument to __skb_wait_for_more_packets and __skb_{,try_}recv_datagram
This will be used by ESP over TCP to handle the queue of IKE messages.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-12-09 09:59:07 +01:00
Linus Torvalds
0da522107e compat_ioctl: remove most of fs/compat_ioctl.c
As part of the cleanup of some remaining y2038 issues, I came to
 fs/compat_ioctl.c, which still has a couple of commands that need support
 for time64_t.
 
 In completely unrelated work, I spent time on cleaning up parts of this
 file in the past, moving things out into drivers instead.
 
 After Al Viro reviewed an earlier version of this series and did a lot
 more of that cleanup, I decided to try to completely eliminate the rest
 of it and move it all into drivers.
 
 This series incorporates some of Al's work and many patches of my own,
 but in the end stops short of actually removing the last part, which is
 the scsi ioctl handlers. I have patches for those as well, but they need
 more testing or possibly a rewrite.
 
 Signed-off-by: Arnd Bergmann <arnd@arndb.de>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJdsHCdAAoJEJpsee/mABjZtYkP/1JGl3jFv3Iq/5BCdPkaePP1
 RtMJRNfURgK3GeuHUui330PvVjI/pLWXU/VXMK2MPTASpJLzYz3uCaZrpVWEMpDZ
 +ImzGmgJkITlW1uWU3zOcQhOxTyb1hCZ0Ci+2xn9QAmyOL7prXoXCXDWv3h6iyiF
 lwG+nW+HNtyx41YG+9bRfKNoG0ZJ+nkJ70BV6u0acQHXWn7Xuupa9YUmBL87hxAL
 6dlJfLTJg6q8QSv/Q6LxslfWk2Ti8OOJZOwtFM5R8Bgl0iUcvshiRCKfv/3t9jXD
 dJNvF1uq8z+gracWK49Qsfq5dnZ2ZxHFUo9u0NjbCrxNvWH/sdvhbaUBuJI75seH
 VIznCkdxFhrqitJJ8KmxANxG08u+9zSKjSlxG2SmlA4qFx/AoStoHwQXcogJscNb
 YIXYKmWBvwPzYu09QFAXdHFPmZvp/3HhMWU6o92lvDhsDwzkSGt3XKhCJea4DCaT
 m+oCcoACqSWhMwdbJOEFofSub4bY43s5iaYuKes+c8O261/Dwg6v/pgIVez9mxXm
 TBnvCsotq5m8wbwzv99eFqGeJH8zpDHrXxEtRR5KQqMqjLq/OQVaEzmpHZTEuK7n
 e/V/PAKo2/V63g4k6GApQXDxnjwT+m0aWToWoeEzPYXS6KmtWC91r4bWtslu3rdl
 bN65armTm7bFFR32Avnu
 =lgCl
 -----END PGP SIGNATURE-----

Merge tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground

Pull removal of most of fs/compat_ioctl.c from Arnd Bergmann:
 "As part of the cleanup of some remaining y2038 issues, I came to
  fs/compat_ioctl.c, which still has a couple of commands that need
  support for time64_t.

  In completely unrelated work, I spent time on cleaning up parts of
  this file in the past, moving things out into drivers instead.

  After Al Viro reviewed an earlier version of this series and did a lot
  more of that cleanup, I decided to try to completely eliminate the
  rest of it and move it all into drivers.

  This series incorporates some of Al's work and many patches of my own,
  but in the end stops short of actually removing the last part, which
  is the scsi ioctl handlers. I have patches for those as well, but they
  need more testing or possibly a rewrite"

* tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (42 commits)
  scsi: sd: enable compat ioctls for sed-opal
  pktcdvd: add compat_ioctl handler
  compat_ioctl: move SG_GET_REQUEST_TABLE handling
  compat_ioctl: ppp: move simple commands into ppp_generic.c
  compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t
  compat_ioctl: move PPPIOCSCOMPRESS to ppp_generic
  compat_ioctl: unify copy-in of ppp filters
  tty: handle compat PPP ioctls
  compat_ioctl: move SIOCOUTQ out of compat_ioctl.c
  compat_ioctl: handle SIOCOUTQNSD
  af_unix: add compat_ioctl support
  compat_ioctl: reimplement SG_IO handling
  compat_ioctl: move WDIOC handling into wdt drivers
  fs: compat_ioctl: move FITRIM emulation into file systems
  gfs2: add compat_ioctl support
  compat_ioctl: remove unused convert_in_user macro
  compat_ioctl: remove last RAID handling code
  compat_ioctl: remove /dev/raw ioctl translation
  compat_ioctl: remove PCI ioctl translation
  compat_ioctl: remove joystick ioctl translation
  ...
2019-12-01 13:46:15 -08:00
David S. Miller
d31e95585c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
The only slightly tricky merge conflict was the netdevsim because the
mutex locking fix overlapped a lot of driver reload reorganization.

The rest were (relatively) trivial in nature.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-02 13:54:56 -07:00
Eric Dumazet
3ef7cf57c7 net: use skb_queue_empty_lockless() in poll() handlers
Many poll() handlers are lockless. Using skb_queue_empty_lockless()
instead of skb_queue_empty() is more appropriate.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-28 13:33:41 -07:00
Arnd Bergmann
5f6beb9e0f af_unix: add compat_ioctl support
The af_unix protocol family has a custom ioctl command (inexplicibly
based on SIOCPROTOPRIVATE), but never had a compat_ioctl handler for
32-bit applications.

Since all commands are compatible here, add a trivial wrapper that
performs the compat_ptr() conversion for SIOCOUTQ/SIOCINQ.  SIOCUNIXFILE
does not use the argument, but it doesn't hurt to also use compat_ptr()
here.

Fixes: ba94f3088b ("unix: add ioctl to open a unix socket file with O_PATH")
Cc: netdev@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23 17:23:46 +02:00
Vito Caputo
262ce0af81 af_unix: __unix_find_socket_byname() cleanup
Remove pointless return variable dance.

Appears vestigial from when the function did locking as seen in
unix_find_socket_byinode(), but locking is handled in
unix_find_socket_byname() for __unix_find_socket_byname().

Signed-off-by: Vito Caputo <vcaputo@pengaru.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-11 20:40:18 -07:00
David S. Miller
a6cdeeb16b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Some ISDN files that got removed in net-next had some changes
done in mainline, take the removals.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-07 11:00:14 -07:00
David S. Miller
b4b12b0d2f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The phylink conflict was between a bug fix by Russell King
to make sure we have a consistent PHY interface mode, and
a change in net-next to pull some code in phylink_resolve()
into the helper functions phylink_mac_link_{up,down}()

On the dp83867 side it's mostly overlapping changes, with
the 'net' side removing a condition that was supposed to
trigger for RGMII but because of how it was coded never
actually could trigger.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-31 10:49:43 -07:00
Thomas Gleixner
2874c5fd28 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3029 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:26:32 -07:00
Thomas Gleixner
a85036f66f treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 62
Based on 1 normalized pattern(s):

  released under the gpl version 2 or later

and 1 additional normalized pattern(s):

  this program is free software you can redistribute it and or
  modify it under the terms of the gnu general public license
  as published by the free software foundation either version
  2 of the license or at your option any later version

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 1 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520071858.828691433@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:36:46 +02:00
Felipe Gasper
cae9910e73 net: Add UNIX_DIAG_UID to Netlink UNIX socket diagnostics.
This adds the ability for Netlink to report a socket's UID along with the
other UNIX diagnostic information that is already available. This will
allow diagnostic tools greater insight into which users control which
socket.

To test this, do the following as a non-root user:

    unshare -U -r bash
    nc -l -U user.socket.$$ &

.. and verify from within that same session that Netlink UNIX socket
diagnostics report the socket's UID as 0. Also verify that Netlink UNIX
socket diagnostics report the socket's UID as the user's UID from an
unprivileged process in a different session. Verify the same from
a root process.

Signed-off-by: Felipe Gasper <felipe@felipegasper.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 10:36:35 -07:00
Thomas Gleixner
ec8f24b7fa treewide: Add SPDX license identifier - Makefile/Kconfig
Add SPDX license identifiers to all Make/Kconfig files which:

 - Have no license information of any form

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:46 +02:00
Thomas Gleixner
09c434b8a0 treewide: Add SPDX license identifier for more missed files
Add SPDX license identifiers to all files which:

 - Have no license information of any form

 - Have MODULE_LICENCE("GPL*") inside which was used in the initial
   scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:45 +02:00
Paolo Abeni
fd69c399c7 datagram: remove rendundant 'peeked' argument
After commit a297569fe0 ("net/udp: do not touch skb->peeked unless
really needed") the 'peeked' argument of __skb_try_recv_datagram()
and friends is always equal to !!'flags & MSG_PEEK'.

Since such argument is really a boolean info, and the callers have
already 'flags & MSG_PEEK' handy, we can remove it and clean-up the
code a bit.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 09:51:54 -07:00
Linus Torvalds
38e7571c07 io_uring-2019-03-06
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlyAJvAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphb+EACFaKI2HIdjExQ5T7Cxebzwky+Qiro3FV55
 ziW00FZrkJ5g0h4ItBzh/5SDlcNQYZDMlA3s4xzWIMadWl5PjMPq1uJul0cITbSl
 WIJO5hpgNMXeUEhvcXUl6+f/WzpgYUxN40uW8N5V7EKlooaFVfudDqJGlvEv+UgB
 g8NWQYThSG+/e7r9OGwK0xDRVKfpjxVvmqmnDH3DrxKaDgSOwTf4xn1u41wKwfQ3
 3uPfQ+GBeTqt4a2AhOi7K6KQFNnj5Jz5CXYMiOZI2JGtLPcL6dmyBVD7K0a0HUr+
 rs4ghNdd1+puvPGNK4TX8qV0uiNrMctoRNVA/JDd1ZTYEKTmNLxeFf+olfYHlwuK
 K5FRs60/lgNzNkzcUpFvJHitPwYtxYJdB36PyswE1FZP1YviEeVoKNt9W8aIhEoA
 549uj90brfA74eCINGhq98pJqj9CNyCPw3bfi76f5Ej2utwYDb9S5Cp2gfSa853X
 qc/qNda9efEq7ikwCbPzhekRMXZo6TSXtaSmC2C+Vs5+mD1Scc4kdAvdCKGQrtr9
 aoy0iQMYO2NDZ/G5fppvXtMVuEPAZWbsGftyOe15IlMysjRze2ycJV8cFahKEVM9
 uBeXLyH1pqGU/j7ABP4+XRZ/sbHJTwjKJbnXhTgBsdU8XO/CR3U+kRQFTsidKMfH
 Wlo3uH2h2A==
 =p78E
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block

Pull io_uring IO interface from Jens Axboe:
 "Second attempt at adding the io_uring interface.

  Since the first one, we've added basic unit testing of the three
  system calls, that resides in liburing like the other unit tests that
  we have so far. It'll take a while to get full coverage of it, but
  we're working towards it. I've also added two basic test programs to
  tools/io_uring. One uses the raw interface and has support for all the
  various features that io_uring supports outside of standard IO, like
  fixed files, fixed IO buffers, and polled IO. The other uses the
  liburing API, and is a simplified version of cp(1).

  This adds support for a new IO interface, io_uring.

  io_uring allows an application to communicate with the kernel through
  two rings, the submission queue (SQ) and completion queue (CQ) ring.
  This allows for very efficient handling of IOs, see the v5 posting for
  some basic numbers:

    https://lore.kernel.org/linux-block/20190116175003.17880-1-axboe@kernel.dk/

  Outside of just efficiency, the interface is also flexible and
  extendable, and allows for future use cases like the upcoming NVMe
  key-value store API, networked IO, and so on. It also supports async
  buffered IO, something that we've always failed to support in the
  kernel.

  Outside of basic IO features, it supports async polled IO as well.
  This particular feature has already been tested at Facebook months ago
  for flash storage boxes, with 25-33% improvements. It makes polled IO
  actually useful for real world use cases, where even basic flash sees
  a nice win in terms of efficiency, latency, and performance. These
  boxes were IOPS bound before, now they are not.

  This series adds three new system calls. One for setting up an
  io_uring instance (io_uring_setup(2)), one for submitting/completing
  IO (io_uring_enter(2)), and one for aux functions like registrating
  file sets, buffers, etc (io_uring_register(2)). Through the help of
  Arnd, I've coordinated the syscall numbers so merge on that front
  should be painless.

  Jon did a writeup of the interface a while back, which (except for
  minor details that have been tweaked) is still accurate. Find that
  here:

    https://lwn.net/Articles/776703/

  Huge thanks to Al Viro for helping getting the reference cycle code
  correct, and to Jann Horn for his extensive reviews focused on both
  security and bugs in general.

  There's a userspace library that provides basic functionality for
  applications that don't need or want to care about how to fiddle with
  the rings directly. It has helpers to allow applications to easily set
  up an io_uring instance, and submit/complete IO through it without
  knowing about the intricacies of the rings. It also includes man pages
  (thanks to Jeff Moyer), and will continue to grow support helper
  functions and features as time progresses. Find it here:

    git://git.kernel.dk/liburing

  Fio has full support for the raw interface, both in the form of an IO
  engine (io_uring), but also with a small test application (t/io_uring)
  that can exercise and benchmark the interface"

* tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block:
  io_uring: add a few test tools
  io_uring: allow workqueue item to handle multiple buffered requests
  io_uring: add support for IORING_OP_POLL
  io_uring: add io_kiocb ref count
  io_uring: add submission polling
  io_uring: add file set registration
  net: split out functions related to registering inflight socket files
  io_uring: add support for pre-mapped user IO buffers
  block: implement bio helper to add iter bvec pages to bio
  io_uring: batch io_kiocb allocation
  io_uring: use fget/fput_many() for file references
  fs: add fget_many() and fput_many()
  io_uring: support for IO polling
  io_uring: add fsync support
  Add io_uring IO interface
2019-03-08 14:48:40 -08:00
Jens Axboe
f4e65870e5 net: split out functions related to registering inflight socket files
We need this functionality for the io_uring file registration, but
we cannot rely on it since CONFIG_UNIX can be modular. Move the helpers
to a separate file, that's always builtin to the kernel if CONFIG_UNIX is
m/y.

No functional changes in this patch, just moving code around.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28 08:24:23 -07:00
Jens Axboe
2b188cc1bb Add io_uring IO interface
The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.

IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an index into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.

Two new system calls are added for this:

io_uring_setup(entries, params)
	Sets up an io_uring instance for doing async IO. On success,
	returns a file descriptor that the application can mmap to
	gain access to the SQ ring, CQ ring, and io_uring_sqes.

io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
	Initiates IO against the rings mapped to this fd, or waits for
	them to complete, or both. The behavior is controlled by the
	parameters passed in. If 'to_submit' is non-zero, then we'll
	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
	kernel will wait for 'min_complete' events, if they aren't
	already available. It's valid to set IORING_ENTER_GETEVENTS
	and 'min_complete' == 0 at the same time, this allows the
	kernel to return already completed events without waiting
	for them. This is useful only for polling, as for IRQ
	driven IO, the application can just check the CQ ring
	without entering the kernel.

With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.

For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.

Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed as quickly
as a sync interface.

Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28 08:24:23 -07:00
Al Viro
ae3b564179 missing barriers in some of unix_sock ->addr and ->path accesses
Several u->addr and u->path users are not holding any locks in
common with unix_bind().  unix_state_lock() is useless for those
purposes.

u->addr is assign-once and *(u->addr) is fully set up by the time
we set u->addr (all under unix_table_lock).  u->path is also
set in the same critical area, also before setting u->addr, and
any unix_sock with ->path filled will have non-NULL ->addr.

So setting ->addr with smp_store_release() is all we need for those
"lockless" users - just have them fetch ->addr with smp_load_acquire()
and don't even bother looking at ->path if they see NULL ->addr.

Users of ->addr and ->path fall into several classes now:
    1) ones that do smp_load_acquire(u->addr) and access *(u->addr)
and u->path only if smp_load_acquire() has returned non-NULL.
    2) places holding unix_table_lock.  These are guaranteed that
*(u->addr) is seen fully initialized.  If unix_sock is in one of the
"bound" chains, so's ->path.
    3) unix_sock_destructor() using ->addr is safe.  All places
that set u->addr are guaranteed to have seen all stores *(u->addr)
while holding a reference to u and unix_sock_destructor() is called
when (atomic) refcount hits zero.
    4) unix_release_sock() using ->path is safe.  unix_bind()
is serialized wrt unix_release() (normally - by struct file
refcount), and for the instances that had ->path set by unix_bind()
unix_release_sock() comes from unix_release(), so they are fine.
Instances that had it set in unix_stream_connect() either end up
attached to a socket (in unix_accept()), in which case the call
chain to unix_release_sock() and serialization are the same as in
the previous case, or they never get accept'ed and unix_release_sock()
is called when the listener is shut down and its queue gets purged.
In that case the listener's queue lock provides the barriers needed -
unix_stream_connect() shoves our unix_sock into listener's queue
under that lock right after having set ->path and eventual
unix_release_sock() caller picks them from that queue under the
same lock right before calling unix_release_sock().
    5) unix_find_other() use of ->path is pointless, but safe -
it happens with successful lookup by (abstract) name, so ->path.dentry
is guaranteed to be NULL there.

earlier-variant-reviewed-by: "Paul E. McKenney" <paulmck@linux.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-20 20:06:28 -08:00
Karsten Graul
89ab066d42 Revert "net: simplify sock_poll_wait"
This reverts commit dd979b4df8.

This broke tcp_poll for SMC fallback: An AF_SMC socket establishes an
internal TCP socket for the initial handshake with the remote peer.
Whenever the SMC connection can not be established this TCP socket is
used as a fallback. All socket operations on the SMC socket are then
forwarded to the TCP socket. In case of poll, the file->private_data
pointer references the SMC socket because the TCP socket has no file
assigned. This causes tcp_poll to wait on the wrong socket.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-23 10:57:06 -07:00
Kyeongdon Kim
33c4368ee2 net: fix warning in af_unix
This fixes the "'hash' may be used uninitialized in this function"

net/unix/af_unix.c:1041:20: warning: 'hash' may be used uninitialized in this function [-Wmaybe-uninitialized]
  addr->hash = hash ^ sk->sk_type;

Signed-off-by: Kyeongdon Kim <kyeongdon.kim@lge.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-17 21:57:28 -07:00
Jason Baron
51f7e95187 af_unix: ensure POLLOUT on remote close() for connected dgram socket
Applications use -ECONNREFUSED as returned from write() in order to
determine that a socket should be closed. However, when using connected
dgram unix sockets in a poll/write loop, a final POLLOUT event can be
missed when the remote end closes. Thus, the poll is stuck forever:

          thread 1 (client)                   thread 2 (server)

connect() to server
write() returns -EAGAIN
unix_dgram_poll()
 -> unix_recvq_full() is true
                                       close()
                                        ->unix_release_sock()
                                         ->wake_up_interruptible_all()
unix_dgram_poll() (due to the
     wake_up_interruptible_all)
 -> unix_recvq_full() still is true
                                         ->free all skbs

Now thread 1 is stuck and will not receive anymore wakeups. In this
case, when thread 1 gets the -EAGAIN, it has not queued any skbs
otherwise the 'free all skbs' step would in fact cause a wakeup and
a POLLOUT return. So the race here is probably fairly rare because
it means there are no skbs that thread 1 queued and that thread 1
schedules before the 'free all skbs' step.

This issue was reported as a hang when /dev/log is closed.

The fix is to signal POLLOUT if the socket is marked as SOCK_DEAD, which
means a subsequent write() will get -ECONNREFUSED.

Reported-by: Ian Lance Taylor <iant@golang.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-03 16:44:19 -07:00
Christoph Hellwig
dd979b4df8 net: simplify sock_poll_wait
The wait_address argument is always directly derived from the filp
argument, so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-30 09:10:25 -07:00
Linus Torvalds
a11e1d432b Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL
The poll() changes were not well thought out, and completely
unexplained.  They also caused a huge performance regression, because
"->poll()" was no longer a trivial file operation that just called down
to the underlying file operations, but instead did at least two indirect
calls.

Indirect calls are sadly slow now with the Spectre mitigation, but the
performance problem could at least be largely mitigated by changing the
"->get_poll_head()" operation to just have a per-file-descriptor pointer
to the poll head instead.  That gets rid of one of the new indirections.

But that doesn't fix the new complexity that is completely unwarranted
for the regular case.  The (undocumented) reason for the poll() changes
was some alleged AIO poll race fixing, but we don't make the common case
slower and more complex for some uncommon special case, so this all
really needs way more explanations and most likely a fundamental
redesign.

[ This revert is a revert of about 30 different commits, not reverted
  individually because that would just be unnecessarily messy  - Linus ]

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-28 10:40:47 -07:00
Linus Torvalds
408afb8d78 Merge branch 'work.aio-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull aio updates from Al Viro:
 "Majority of AIO stuff this cycle. aio-fsync and aio-poll, mostly.

  The only thing I'm holding back for a day or so is Adam's aio ioprio -
  his last-minute fixup is trivial (missing stub in !CONFIG_BLOCK case),
  but let it sit in -next for decency sake..."

* 'work.aio-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
  aio: sanitize the limit checking in io_submit(2)
  aio: fold do_io_submit() into callers
  aio: shift copyin of iocb into io_submit_one()
  aio_read_events_ring(): make a bit more readable
  aio: all callers of aio_{read,write,fsync,poll} treat 0 and -EIOCBQUEUED the same way
  aio: take list removal to (some) callers of aio_complete()
  aio: add missing break for the IOCB_CMD_FDSYNC case
  random: convert to ->poll_mask
  timerfd: convert to ->poll_mask
  eventfd: switch to ->poll_mask
  pipe: convert to ->poll_mask
  crypto: af_alg: convert to ->poll_mask
  net/rxrpc: convert to ->poll_mask
  net/iucv: convert to ->poll_mask
  net/phonet: convert to ->poll_mask
  net/nfc: convert to ->poll_mask
  net/caif: convert to ->poll_mask
  net/bluetooth: convert to ->poll_mask
  net/sctp: convert to ->poll_mask
  net/tipc: convert to ->poll_mask
  ...
2018-06-04 13:57:43 -07:00
Christoph Hellwig
e76cd24d02 net/unix: convert to ->poll_mask
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-26 09:16:44 +02:00
Christoph Hellwig
c350637227 proc: introduce proc_create_net{,_data}
Variants of proc_create{,_data} that directly take a struct seq_operations
and deal with network namespaces in ->open and ->release.  All callers of
proc_create + seq_open_net converted over, and seq_{open,release}_net are
removed entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16 07:24:30 +02:00
Cong Wang
3848ec5dc8 af_unix: remove redundant lockdep class
After commit 581319c586 ("net/socket: use per af lockdep classes for sk queues")
sock queue locks now have per-af lockdep classes, including unix socket.
It is no longer necessary to workaround it.

I noticed this while looking at a syzbot deadlock report, this patch
itself doesn't fix it (this is why I don't add Reported-by).

Fixes: 581319c586 ("net/socket: use per af lockdep classes for sk queues")
Cc: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-04 11:13:40 -04:00
Kirill Tkhai
2f635ceeb2 net: Drop pernet_operations::async
Synchronous pernet_operations are not allowed anymore.
All are asynchronous. So, drop the structure member.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 13:18:09 -04:00
David S. Miller
f5c0c6f429 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-02-19 18:46:11 -05:00
Tobias Klauser
d4e9a408ef net: af_unix: fix typo in UNIX_SKB_FRAGS_SZ comment
Change "minimun" to "minimum".

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 12:21:45 -05:00
Kirill Tkhai
167f7ac723 net: Convert unix_net_ops
These pernet_operations are just create and destroy
/proc and sysctl entries, and are not touched by
foreign pernet_operations.

So, we are able to make them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 10:36:08 -05:00
Denys Vlasenko
9b2c45d479 net: make getname() functions return length rather than use int* parameter
Changes since v1:
Added changes in these files:
    drivers/infiniband/hw/usnic/usnic_transport.c
    drivers/staging/lustre/lnet/lnet/lib-socket.c
    drivers/target/iscsi/iscsi_target_login.c
    drivers/vhost/net.c
    fs/dlm/lowcomms.c
    fs/ocfs2/cluster/tcp.c
    security/tomoyo/network.c

Before:
All these functions either return a negative error indicator,
or store length of sockaddr into "int *socklen" parameter
and return zero on success.

"int *socklen" parameter is awkward. For example, if caller does not
care, it still needs to provide on-stack storage for the value
it does not need.

None of the many FOO_getname() functions of various protocols
ever used old value of *socklen. They always just overwrite it.

This change drops this parameter, and makes all these functions, on success,
return length of sockaddr. It's always >= 0 and can be differentiated
from an error.

Tests in callers are changed from "if (err)" to "if (err < 0)", where needed.

rpc_sockname() lost "int buflen" parameter, since its only use was
to be passed to kernel_getsockname() as &buflen and subsequently
not used in any way.

Userspace API is not changed.

    text    data     bss      dec     hex filename
30108430 2633624  873672 33615726 200ef6e vmlinux.before.o
30108109 2633612  873672 33615393 200ee21 vmlinux.o

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: David S. Miller <davem@davemloft.net>
CC: linux-kernel@vger.kernel.org
CC: netdev@vger.kernel.org
CC: linux-bluetooth@vger.kernel.org
CC: linux-decnet-user@lists.sourceforge.net
CC: linux-wireless@vger.kernel.org
CC: linux-rdma@vger.kernel.org
CC: linux-sctp@vger.kernel.org
CC: linux-nfs@vger.kernel.org
CC: linux-x25@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-12 14:15:04 -05:00
Linus Torvalds
a9a08845e9 vfs: do bulk POLL* -> EPOLL* replacement
This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
        L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
        for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
    done

with de-mangling cleanups yet to come.

NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do.  But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.

The next patch from Al will sort out the final differences, and we
should be all done.

Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-11 14:34:03 -08:00
Linus Torvalds
b2fe5fa686 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Significantly shrink the core networking routing structures. Result
    of http://vger.kernel.org/~davem/seoul2017_netdev_keynote.pdf

 2) Add netdevsim driver for testing various offloads, from Jakub
    Kicinski.

 3) Support cross-chip FDB operations in DSA, from Vivien Didelot.

 4) Add a 2nd listener hash table for TCP, similar to what was done for
    UDP. From Martin KaFai Lau.

 5) Add eBPF based queue selection to tun, from Jason Wang.

 6) Lockless qdisc support, from John Fastabend.

 7) SCTP stream interleave support, from Xin Long.

 8) Smoother TCP receive autotuning, from Eric Dumazet.

 9) Lots of erspan tunneling enhancements, from William Tu.

10) Add true function call support to BPF, from Alexei Starovoitov.

11) Add explicit support for GRO HW offloading, from Michael Chan.

12) Support extack generation in more netlink subsystems. From Alexander
    Aring, Quentin Monnet, and Jakub Kicinski.

13) Add 1000BaseX, flow control, and EEE support to mvneta driver. From
    Russell King.

14) Add flow table abstraction to netfilter, from Pablo Neira Ayuso.

15) Many improvements and simplifications to the NFP driver bpf JIT,
    from Jakub Kicinski.

16) Support for ipv6 non-equal cost multipath routing, from Ido
    Schimmel.

17) Add resource abstration to devlink, from Arkadi Sharshevsky.

18) Packet scheduler classifier shared filter block support, from Jiri
    Pirko.

19) Avoid locking in act_csum, from Davide Caratti.

20) devinet_ioctl() simplifications from Al viro.

21) More TCP bpf improvements from Lawrence Brakmo.

22) Add support for onlink ipv6 route flag, similar to ipv4, from David
    Ahern.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1925 commits)
  tls: Add support for encryption using async offload accelerator
  ip6mr: fix stale iterator
  net/sched: kconfig: Remove blank help texts
  openvswitch: meter: Use 64-bit arithmetic instead of 32-bit
  tcp_nv: fix potential integer overflow in tcpnv_acked
  r8169: fix RTL8168EP take too long to complete driver initialization.
  qmi_wwan: Add support for Quectel EP06
  rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK
  ipmr: Fix ptrdiff_t print formatting
  ibmvnic: Wait for device response when changing MAC
  qlcnic: fix deadlock bug
  tcp: release sk_frag.page in tcp_disconnect
  ipv4: Get the address of interface correctly.
  net_sched: gen_estimator: fix lockdep splat
  net: macb: Handle HRESP error
  net/mlx5e: IPoIB, Fix copy-paste bug in flow steering refactoring
  ipv6: addrconf: break critical section in addrconf_verify_rtnl()
  ipv6: change route cache aging logic
  i40e/i40evf: Update DESC_NEEDED value to reflect larger value
  bnxt_en: cleanup DIM work on device shutdown
  ...
2018-01-31 14:31:10 -08:00
Alexey Dobriyan
96890d6252 net: delete /proc THIS_MODULE references
/proc has been ignoring struct file_operations::owner field for 10 years.
Specifically, it started with commit 786d7e1612
("Fix rmmod/read/write races in /proc entries"). Notice the chunk where
inode->i_fop is initialized with proxy struct file_operations for
regular files:

	-               if (de->proc_fops)
	-                       inode->i_fop = de->proc_fops;
	+               if (de->proc_fops) {
	+                       if (S_ISREG(inode->i_mode))
	+                               inode->i_fop = &proc_reg_file_ops;
	+                       else
	+                               inode->i_fop = de->proc_fops;
	+               }

VFS stopped pinning module at this point.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-16 15:01:33 -05:00
Al Viro
ade994f4f6 net: annotate ->poll() instances
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-11-27 16:20:04 -05:00
Al Viro
3ad6f93e98 annotate poll-related wait keys
__poll_t is also used as wait key in some waitqueues.
Verify that wait_..._poll() gets __poll_t as key and
provide a helper for wakeup functions to get back to
that __poll_t value.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-11-27 16:19:54 -05:00
David S. Miller
2a171788ba Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Files removed in 'net-next' had their license header updated
in 'net'.  We take the remove from 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-04 09:26:51 +09:00
Linus Torvalds
ead751507d License cleanup: add SPDX license identifiers to some files
Many source files in the tree are missing licensing information, which
 makes it harder for compliance tools to determine the correct license.
 
 By default all files without license information are under the default
 license of the kernel, which is GPL version 2.
 
 Update the files which contain no license information with the 'GPL-2.0'
 SPDX license identifier.  The SPDX identifier is a legally binding
 shorthand, which can be used instead of the full boiler plate text.
 
 This patch is based on work done by Thomas Gleixner and Kate Stewart and
 Philippe Ombredanne.
 
 How this work was done:
 
 Patches were generated and checked against linux-4.14-rc6 for a subset of
 the use cases:
  - file had no licensing information it it.
  - file was a */uapi/* one with no licensing information in it,
  - file was a */uapi/* one with existing licensing information,
 
 Further patches will be generated in subsequent months to fix up cases
 where non-standard license headers were used, and references to license
 had to be inferred by heuristics based on keywords.
 
 The analysis to determine which SPDX License Identifier to be applied to
 a file was done in a spreadsheet of side by side results from of the
 output of two independent scanners (ScanCode & Windriver) producing SPDX
 tag:value files created by Philippe Ombredanne.  Philippe prepared the
 base worksheet, and did an initial spot review of a few 1000 files.
 
 The 4.13 kernel was the starting point of the analysis with 60,537 files
 assessed.  Kate Stewart did a file by file comparison of the scanner
 results in the spreadsheet to determine which SPDX license identifier(s)
 to be applied to the file. She confirmed any determination that was not
 immediately clear with lawyers working with the Linux Foundation.
 
 Criteria used to select files for SPDX license identifier tagging was:
  - Files considered eligible had to be source code files.
  - Make and config files were included as candidates if they contained >5
    lines of source
  - File already had some variant of a license header in it (even if <5
    lines).
 
 All documentation files were explicitly excluded.
 
 The following heuristics were used to determine which SPDX license
 identifiers to apply.
 
  - when both scanners couldn't find any license traces, file was
    considered to have no license information in it, and the top level
    COPYING file license applied.
 
    For non */uapi/* files that summary was:
 
    SPDX license identifier                            # files
    ---------------------------------------------------|-------
    GPL-2.0                                              11139
 
    and resulted in the first patch in this series.
 
    If that file was a */uapi/* path one, it was "GPL-2.0 WITH
    Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
 
    SPDX license identifier                            # files
    ---------------------------------------------------|-------
    GPL-2.0 WITH Linux-syscall-note                        930
 
    and resulted in the second patch in this series.
 
  - if a file had some form of licensing information in it, and was one
    of the */uapi/* ones, it was denoted with the Linux-syscall-note if
    any GPL family license was found in the file or had no licensing in
    it (per prior point).  Results summary:
 
    SPDX license identifier                            # files
    ---------------------------------------------------|------
    GPL-2.0 WITH Linux-syscall-note                       270
    GPL-2.0+ WITH Linux-syscall-note                      169
    ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
    ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
    LGPL-2.1+ WITH Linux-syscall-note                      15
    GPL-1.0+ WITH Linux-syscall-note                       14
    ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
    LGPL-2.0+ WITH Linux-syscall-note                       4
    LGPL-2.1 WITH Linux-syscall-note                        3
    ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
    ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
 
    and that resulted in the third patch in this series.
 
  - when the two scanners agreed on the detected license(s), that became
    the concluded license(s).
 
  - when there was disagreement between the two scanners (one detected a
    license but the other didn't, or they both detected different
    licenses) a manual inspection of the file occurred.
 
  - In most cases a manual inspection of the information in the file
    resulted in a clear resolution of the license that should apply (and
    which scanner probably needed to revisit its heuristics).
 
  - When it was not immediately clear, the license identifier was
    confirmed with lawyers working with the Linux Foundation.
 
  - If there was any question as to the appropriate license identifier,
    the file was flagged for further research and to be revisited later
    in time.
 
 In total, over 70 hours of logged manual review was done on the
 spreadsheet to determine the SPDX license identifiers to apply to the
 source files by Kate, Philippe, Thomas and, in some cases, confirmation
 by lawyers working with the Linux Foundation.
 
 Kate also obtained a third independent scan of the 4.13 code base from
 FOSSology, and compared selected files where the other two scanners
 disagreed against that SPDX file, to see if there was new insights.  The
 Windriver scanner is based on an older version of FOSSology in part, so
 they are related.
 
 Thomas did random spot checks in about 500 files from the spreadsheets
 for the uapi headers and agreed with SPDX license identifier in the
 files he inspected. For the non-uapi files Thomas did random spot checks
 in about 15000 files.
 
 In initial set of patches against 4.14-rc6, 3 files were found to have
 copy/paste license identifier errors, and have been fixed to reflect the
 correct identifier.
 
 Additionally Philippe spent 10 hours this week doing a detailed manual
 inspection and review of the 12,461 patched files from the initial patch
 version early this week with:
  - a full scancode scan run, collecting the matched texts, detected
    license ids and scores
  - reviewing anything where there was a license detected (about 500+
    files) to ensure that the applied SPDX license was correct
  - reviewing anything where there was no detection but the patch license
    was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
    SPDX license was correct
 
 This produced a worksheet with 20 files needing minor correction.  This
 worksheet was then exported into 3 different .csv files for the
 different types of files to be modified.
 
 These .csv files were then reviewed by Greg.  Thomas wrote a script to
 parse the csv files and add the proper SPDX tag to the file, in the
 format that the file expected.  This script was further refined by Greg
 based on the output to detect more types of files automatically and to
 distinguish between header and source .c files (which need different
 comment types.)  Finally Greg ran the script using the .csv files to
 generate the patches.
 
 Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
 Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
 Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWfswbQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykvEwCfXU1MuYFQGgMdDmAZXEc+xFXZvqgAoKEcHDNA
 6dVh26uchcEQLN/XqUDt
 =x306
 -----END PGP SIGNATURE-----

Merge tag 'spdx_identifiers-4.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull initial SPDX identifiers from Greg KH:
 "License cleanup: add SPDX license identifiers to some files

  Many source files in the tree are missing licensing information, which
  makes it harder for compliance tools to determine the correct license.

  By default all files without license information are under the default
  license of the kernel, which is GPL version 2.

  Update the files which contain no license information with the
  'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally
  binding shorthand, which can be used instead of the full boiler plate
  text.

  This patch is based on work done by Thomas Gleixner and Kate Stewart
  and Philippe Ombredanne.

  How this work was done:

  Patches were generated and checked against linux-4.14-rc6 for a subset
  of the use cases:

   - file had no licensing information it it.

   - file was a */uapi/* one with no licensing information in it,

   - file was a */uapi/* one with existing licensing information,

  Further patches will be generated in subsequent months to fix up cases
  where non-standard license headers were used, and references to
  license had to be inferred by heuristics based on keywords.

  The analysis to determine which SPDX License Identifier to be applied
  to a file was done in a spreadsheet of side by side results from of
  the output of two independent scanners (ScanCode & Windriver)
  producing SPDX tag:value files created by Philippe Ombredanne.
  Philippe prepared the base worksheet, and did an initial spot review
  of a few 1000 files.

  The 4.13 kernel was the starting point of the analysis with 60,537
  files assessed. Kate Stewart did a file by file comparison of the
  scanner results in the spreadsheet to determine which SPDX license
  identifier(s) to be applied to the file. She confirmed any
  determination that was not immediately clear with lawyers working with
  the Linux Foundation.

  Criteria used to select files for SPDX license identifier tagging was:

   - Files considered eligible had to be source code files.

   - Make and config files were included as candidates if they contained
     >5 lines of source

   - File already had some variant of a license header in it (even if <5
     lines).

  All documentation files were explicitly excluded.

  The following heuristics were used to determine which SPDX license
  identifiers to apply.

   - when both scanners couldn't find any license traces, file was
     considered to have no license information in it, and the top level
     COPYING file license applied.

     For non */uapi/* files that summary was:

       SPDX license identifier                            # files
       ---------------------------------------------------|-------
       GPL-2.0                                              11139

     and resulted in the first patch in this series.

     If that file was a */uapi/* path one, it was "GPL-2.0 WITH
     Linux-syscall-note" otherwise it was "GPL-2.0". Results of that
     was:

       SPDX license identifier                            # files
       ---------------------------------------------------|-------
       GPL-2.0 WITH Linux-syscall-note                        930

     and resulted in the second patch in this series.

   - if a file had some form of licensing information in it, and was one
     of the */uapi/* ones, it was denoted with the Linux-syscall-note if
     any GPL family license was found in the file or had no licensing in
     it (per prior point). Results summary:

       SPDX license identifier                            # files
       ---------------------------------------------------|------
       GPL-2.0 WITH Linux-syscall-note                       270
       GPL-2.0+ WITH Linux-syscall-note                      169
       ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
       ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
       LGPL-2.1+ WITH Linux-syscall-note                      15
       GPL-1.0+ WITH Linux-syscall-note                       14
       ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
       LGPL-2.0+ WITH Linux-syscall-note                       4
       LGPL-2.1 WITH Linux-syscall-note                        3
       ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
       ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

     and that resulted in the third patch in this series.

   - when the two scanners agreed on the detected license(s), that
     became the concluded license(s).

   - when there was disagreement between the two scanners (one detected
     a license but the other didn't, or they both detected different
     licenses) a manual inspection of the file occurred.

   - In most cases a manual inspection of the information in the file
     resulted in a clear resolution of the license that should apply
     (and which scanner probably needed to revisit its heuristics).

   - When it was not immediately clear, the license identifier was
     confirmed with lawyers working with the Linux Foundation.

   - If there was any question as to the appropriate license identifier,
     the file was flagged for further research and to be revisited later
     in time.

  In total, over 70 hours of logged manual review was done on the
  spreadsheet to determine the SPDX license identifiers to apply to the
  source files by Kate, Philippe, Thomas and, in some cases,
  confirmation by lawyers working with the Linux Foundation.

  Kate also obtained a third independent scan of the 4.13 code base from
  FOSSology, and compared selected files where the other two scanners
  disagreed against that SPDX file, to see if there was new insights.
  The Windriver scanner is based on an older version of FOSSology in
  part, so they are related.

  Thomas did random spot checks in about 500 files from the spreadsheets
  for the uapi headers and agreed with SPDX license identifier in the
  files he inspected. For the non-uapi files Thomas did random spot
  checks in about 15000 files.

  In initial set of patches against 4.14-rc6, 3 files were found to have
  copy/paste license identifier errors, and have been fixed to reflect
  the correct identifier.

  Additionally Philippe spent 10 hours this week doing a detailed manual
  inspection and review of the 12,461 patched files from the initial
  patch version early this week with:

   - a full scancode scan run, collecting the matched texts, detected
     license ids and scores

   - reviewing anything where there was a license detected (about 500+
     files) to ensure that the applied SPDX license was correct

   - reviewing anything where there was no detection but the patch
     license was not GPL-2.0 WITH Linux-syscall-note to ensure that the
     applied SPDX license was correct

  This produced a worksheet with 20 files needing minor correction. This
  worksheet was then exported into 3 different .csv files for the
  different types of files to be modified.

  These .csv files were then reviewed by Greg. Thomas wrote a script to
  parse the csv files and add the proper SPDX tag to the file, in the
  format that the file expected. This script was further refined by Greg
  based on the output to detect more types of files automatically and to
  distinguish between header and source .c files (which need different
  comment types.) Finally Greg ran the script using the .csv files to
  generate the patches.

  Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
  Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
  Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

* tag 'spdx_identifiers-4.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  License cleanup: add SPDX license identifier to uapi header files with a license
  License cleanup: add SPDX license identifier to uapi header files with no license
  License cleanup: add SPDX GPL-2.0 license identifier to files with no license
2017-11-02 10:04:46 -07:00
Greg Kroah-Hartman
b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
David S. Miller
e1ea2f9856 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several conflicts here.

NFP driver bug fix adding nfp_netdev_is_nfp_repr() check to
nfp_fl_output() needed some adjustments because the code block is in
an else block now.

Parallel additions to net/pkt_cls.h and net/sch_generic.h

A bug fix in __tcp_retransmit_skb() conflicted with some of
the rbtree changes in net-next.

The tc action RCU callback fixes in 'net' had some overlap with some
of the recent tcf_block reworking.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-30 21:09:24 +09:00
Andrei Vagin
0f5da659d8 net/unix: don't show information about sockets from other namespaces
socket_diag shows information only about sockets from a namespace where
a diag socket lives.

But if we request information about one unix socket, the kernel don't
check that its netns is matched with a diag socket namespace, so any
user can get information about any unix socket in a system. This looks
like a bug.

v2: add a Fixes tag

Fixes: 51d7cccf07 ("net: make sock diag per-namespace")
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-26 10:05:59 +09:00
Gustavo A. R. Silva
110af3acb8 net: af_unix: mark expected switch fall-through
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-22 03:07:50 +01:00
David S. Miller
e2a7c34fb2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-08-21 17:06:42 -07:00
Matthew Dawson
a0917e0bc6 datagram: When peeking datagrams with offset < 0 don't skip empty skbs
Due to commit e6afc8ace6 ("udp: remove
headers from UDP packets before queueing"), when udp packets are being
peeked the requested extra offset is always 0 as there is no need to skip
the udp header.  However, when the offset is 0 and the next skb is
of length 0, it is only returned once.  The behaviour can be seen with
the following python script:

from socket import *;
f=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
g=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
f.bind(('::', 0));
addr=('::1', f.getsockname()[1]);
g.sendto(b'', addr)
g.sendto(b'b', addr)
print(f.recvfrom(10, MSG_PEEK));
print(f.recvfrom(10, MSG_PEEK));

Where the expected output should be the empty string twice.

Instead, make sk_peek_offset return negative values, and pass those values
to __skb_try_recv_datagram/__skb_try_recv_from_queue.  If the passed offset
to __skb_try_recv_from_queue is negative, the checked skb is never skipped.
__skb_try_recv_from_queue will then ensure the offset is reset back to 0
if a peek is requested without an offset, unless no packets are found.

Also simplify the if condition in __skb_try_recv_from_queue.  If _off is
greater then 0, and off is greater then or equal to skb->len, then
(_off || skb->len) must always be true assuming skb->len >= 0 is always
true.

Also remove a redundant check around a call to sk_peek_offset in af_unix.c,
as it double checked if MSG_PEEK was set in the flags.

V2:
 - Moved the negative fixup into __skb_try_recv_from_queue, and remove now
redundant checks
 - Fix peeking in udp{,v6}_recvmsg to report the right value when the
offset is 0

V3:
 - Marked new branch in __skb_try_recv_from_queue as unlikely.

Signed-off-by: Matthew Dawson <matthew@mjdsystems.ca>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-18 15:12:54 -07:00
David Herrmann
27eac47b00 net/unix: drop obsolete fd-recursion limits
All unix sockets now account inflight FDs to the respective sender.
This was introduced in:

    commit 712f4aad40
    Author: willy tarreau <w@1wt.eu>
    Date:   Sun Jan 10 07:54:56 2016 +0100

        unix: properly account for FDs passed over unix sockets

and further refined in:

    commit 415e3d3e90
    Author: Hannes Frederic Sowa <hannes@stressinduktion.org>
    Date:   Wed Feb 3 02:11:03 2016 +0100

        unix: correctly track in-flight fds in sending process user_struct

Hence, regardless of the stacking depth of FDs, the total number of
inflight FDs is limited, and accounted. There is no known way for a
local user to exceed those limits or exploit the accounting.

Furthermore, the GC logic is independent of the recursion/stacking depth
as well. It solely depends on the total number of inflight FDs,
regardless of their layout.

Lastly, the current `recursion_level' suffers a TOCTOU race, since it
checks and inherits depths only at queue time. If we consider `A<-B' to
mean `queue-B-on-A', the following sequence circumvents the recursion
level easily:

    A<-B
       B<-C
          C<-D
             ...
               Y<-Z

resulting in:

    A<-B<-C<-...<-Z

With all of this in mind, lets drop the recursion limit. It has no
additional security value, anymore. On the contrary, it randomly
confuses message brokers that try to forward file-descriptors, since
any sendmsg(2) call can fail spuriously with ETOOMANYREFS if a client
maliciously modifies the FD while inflight.

Cc: Alban Crequy <alban.crequy@collabora.co.uk>
Cc: Simon McVittie <simon.mcvittie@collabora.co.uk>
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Reviewed-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-17 08:57:59 -07:00
Linus Torvalds
5518b69b76 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Reasonably busy this cycle, but perhaps not as busy as in the 4.12
  merge window:

   1) Several optimizations for UDP processing under high load from
      Paolo Abeni.

   2) Support pacing internally in TCP when using the sch_fq packet
      scheduler for this is not practical. From Eric Dumazet.

   3) Support mutliple filter chains per qdisc, from Jiri Pirko.

   4) Move to 1ms TCP timestamp clock, from Eric Dumazet.

   5) Add batch dequeueing to vhost_net, from Jason Wang.

   6) Flesh out more completely SCTP checksum offload support, from
      Davide Caratti.

   7) More plumbing of extended netlink ACKs, from David Ahern, Pablo
      Neira Ayuso, and Matthias Schiffer.

   8) Add devlink support to nfp driver, from Simon Horman.

   9) Add RTM_F_FIB_MATCH flag to RTM_GETROUTE queries, from Roopa
      Prabhu.

  10) Add stack depth tracking to BPF verifier and use this information
      in the various eBPF JITs. From Alexei Starovoitov.

  11) Support XDP on qed device VFs, from Yuval Mintz.

  12) Introduce BPF PROG ID for better introspection of installed BPF
      programs. From Martin KaFai Lau.

  13) Add bpf_set_hash helper for TC bpf programs, from Daniel Borkmann.

  14) For loads, allow narrower accesses in bpf verifier checking, from
      Yonghong Song.

  15) Support MIPS in the BPF selftests and samples infrastructure, the
      MIPS eBPF JIT will be merged in via the MIPS GIT tree. From David
      Daney.

  16) Support kernel based TLS, from Dave Watson and others.

  17) Remove completely DST garbage collection, from Wei Wang.

  18) Allow installing TCP MD5 rules using prefixes, from Ivan
      Delalande.

  19) Add XDP support to Intel i40e driver, from Björn Töpel

  20) Add support for TC flower offload in nfp driver, from Simon
      Horman, Pieter Jansen van Vuuren, Benjamin LaHaise, Jakub
      Kicinski, and Bert van Leeuwen.

  21) IPSEC offloading support in mlx5, from Ilan Tayari.

  22) Add HW PTP support to macb driver, from Rafal Ozieblo.

  23) Networking refcount_t conversions, From Elena Reshetova.

  24) Add sock_ops support to BPF, from Lawrence Brako. This is useful
      for tuning the TCP sockopt settings of a group of applications,
      currently via CGROUPs"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1899 commits)
  net: phy: dp83867: add workaround for incorrect RX_CTRL pin strap
  dt-bindings: phy: dp83867: provide a workaround for incorrect RX_CTRL pin strap
  cxgb4: Support for get_ts_info ethtool method
  cxgb4: Add PTP Hardware Clock (PHC) support
  cxgb4: time stamping interface for PTP
  nfp: default to chained metadata prepend format
  nfp: remove legacy MAC address lookup
  nfp: improve order of interfaces in breakout mode
  net: macb: remove extraneous return when MACB_EXT_DESC is defined
  bpf: add missing break in for the TCP_BPF_SNDCWND_CLAMP case
  bpf: fix return in load_bpf_file
  mpls: fix rtm policy in mpls_getroute
  net, ax25: convert ax25_cb.refcount from atomic_t to refcount_t
  net, ax25: convert ax25_route.refcount from atomic_t to refcount_t
  net, ax25: convert ax25_uid_assoc.refcount from atomic_t to refcount_t
  net, sctp: convert sctp_ep_common.refcnt from atomic_t to refcount_t
  net, sctp: convert sctp_transport.refcnt from atomic_t to refcount_t
  net, sctp: convert sctp_chunk.refcnt from atomic_t to refcount_t
  net, sctp: convert sctp_datamsg.refcnt from atomic_t to refcount_t
  net, sctp: convert sctp_auth_bytes.refcnt from atomic_t to refcount_t
  ...
2017-07-05 12:31:59 -07:00
Reshetova, Elena
8c9814b970 net: convert unix_address.refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:08 -07:00
Reshetova, Elena
41c6d650f6 net: convert sock.sk_refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

This patch uses refcount_inc_not_zero() instead of
atomic_inc_not_zero_hint() due to absense of a _hint()
version of refcount API. If the hint() version must
be used, we might need to revisit API.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:08 -07:00
Reshetova, Elena
14afee4b60 net: convert sock.sk_wmem_alloc from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:08 -07:00
Ingo Molnar
ac6424b981 sched/wait: Rename wait_queue_t => wait_queue_entry_t
Rename:

	wait_queue_t		=>	wait_queue_entry_t

'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
which had to carry the name.

Start sorting this out by renaming it to 'wait_queue_entry_t'.

This also allows the real structure name 'struct __wait_queue' to
lose its double underscore and become 'struct wait_queue_entry',
which is the more canonical nomenclature for such data types.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:18:27 +02:00
Mateusz Jurczyk
defbcf2dec af_unix: Add sockaddr length checks before accessing sa_family in bind and connect handlers
Verify that the caller-provided sockaddr structure is large enough to
contain the sa_family field, before accessing it in bind() and connect()
handlers of the AF_UNIX socket. Since neither syscall enforces a minimum
size of the corresponding memory region, very short sockaddrs (zero or
one byte long) result in operating on uninitialized memory while
referencing .sa_family.

Signed-off-by: Mateusz Jurczyk <mjurczyk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-09 10:10:24 -04:00
Kees Cook
82fe0d2b44 af_unix: Use designated initializers
Prepare to mark sensitive kernel structures for randomization by making
sure they're using designated initializers. These were identified during
allyesconfig builds of x86, arm, and arm64, and the initializer fixes
were extracted from grsecurity. In this case, NULL initialize with { }
instead of undesignated NULLs.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-06 12:43:04 -07:00
Andrey Ulanov
7df9c24625 net: unix: properly re-increment inflight counter of GC discarded candidates
Dmitry has reported that a BUG_ON() condition in unix_notinflight()
may be triggered by a simple code that forwards unix socket in an
SCM_RIGHTS message.
That is caused by incorrect unix socket GC implementation in unix_gc().

The GC first collects list of candidates, then (a) decrements their
"children's" inflight counter, (b) checks which inflight counters are
now 0, and then (c) increments all inflight counters back.
(a) and (c) are done by calling scan_children() with inc_inflight or
dec_inflight as the second argument.

Commit 6209344f5a ("net: unix: fix inflight counting bug in garbage
collector") changed scan_children() such that it no longer considers
sockets that do not have UNIX_GC_CANDIDATE flag. It also added a block
of code that that unsets this flag _before_ invoking
scan_children(, dec_iflight, ). This may lead to incorrect inflight
counters for some sockets.

This change fixes this bug by changing order of operations:
UNIX_GC_CANDIDATE is now unset only after all inflight counters are
restored to the original state.

  kernel BUG at net/unix/garbage.c:149!
  RIP: 0010:[<ffffffff8717ebf4>]  [<ffffffff8717ebf4>]
  unix_notinflight+0x3b4/0x490 net/unix/garbage.c:149
  Call Trace:
   [<ffffffff8716cfbf>] unix_detach_fds.isra.19+0xff/0x170 net/unix/af_unix.c:1487
   [<ffffffff8716f6a9>] unix_destruct_scm+0xf9/0x210 net/unix/af_unix.c:1496
   [<ffffffff86a90a01>] skb_release_head_state+0x101/0x200 net/core/skbuff.c:655
   [<ffffffff86a9808a>] skb_release_all+0x1a/0x60 net/core/skbuff.c:668
   [<ffffffff86a980ea>] __kfree_skb+0x1a/0x30 net/core/skbuff.c:684
   [<ffffffff86a98284>] kfree_skb+0x184/0x570 net/core/skbuff.c:705
   [<ffffffff871789d5>] unix_release_sock+0x5b5/0xbd0 net/unix/af_unix.c:559
   [<ffffffff87179039>] unix_release+0x49/0x90 net/unix/af_unix.c:836
   [<ffffffff86a694b2>] sock_release+0x92/0x1f0 net/socket.c:570
   [<ffffffff86a6962b>] sock_close+0x1b/0x20 net/socket.c:1017
   [<ffffffff81a76b8e>] __fput+0x34e/0x910 fs/file_table.c:208
   [<ffffffff81a771da>] ____fput+0x1a/0x20 fs/file_table.c:244
   [<ffffffff81483ab0>] task_work_run+0x1a0/0x280 kernel/task_work.c:116
   [<     inline     >] exit_task_work include/linux/task_work.h:21
   [<ffffffff8141287a>] do_exit+0x183a/0x2640 kernel/exit.c:828
   [<ffffffff8141383e>] do_group_exit+0x14e/0x420 kernel/exit.c:931
   [<ffffffff814429d3>] get_signal+0x663/0x1880 kernel/signal.c:2307
   [<ffffffff81239b45>] do_signal+0xc5/0x2190 arch/x86/kernel/signal.c:807
   [<ffffffff8100666a>] exit_to_usermode_loop+0x1ea/0x2d0
  arch/x86/entry/common.c:156
   [<     inline     >] prepare_exit_to_usermode arch/x86/entry/common.c:190
   [<ffffffff81009693>] syscall_return_slowpath+0x4d3/0x570
  arch/x86/entry/common.c:259
   [<ffffffff881478e6>] entry_SYSCALL_64_fastpath+0xc4/0xc6

Link: https://lkml.org/lkml/2017/3/6/252
Signed-off-by: Andrey Ulanov <andreyu@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: 6209344 ("net: unix: fix inflight counting bug in garbage collector")
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-21 15:25:10 -07:00
David Howells
cdfbabfb2f net: Work around lockdep limitation in sockets that use sockets
Lockdep issues a circular dependency warning when AFS issues an operation
through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.

The theory lockdep comes up with is as follows:

 (1) If the pagefault handler decides it needs to read pages from AFS, it
     calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
     creating a call requires the socket lock:

	mmap_sem must be taken before sk_lock-AF_RXRPC

 (2) afs_open_socket() opens an AF_RXRPC socket and binds it.  rxrpc_bind()
     binds the underlying UDP socket whilst holding its socket lock.
     inet_bind() takes its own socket lock:

	sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET

 (3) Reading from a TCP socket into a userspace buffer might cause a fault
     and thus cause the kernel to take the mmap_sem, but the TCP socket is
     locked whilst doing this:

	sk_lock-AF_INET must be taken before mmap_sem

However, lockdep's theory is wrong in this instance because it deals only
with lock classes and not individual locks.  The AF_INET lock in (2) isn't
really equivalent to the AF_INET lock in (3) as the former deals with a
socket entirely internal to the kernel that never sees userspace.  This is
a limitation in the design of lockdep.

Fix the general case by:

 (1) Double up all the locking keys used in sockets so that one set are
     used if the socket is created by userspace and the other set is used
     if the socket is created by the kernel.

 (2) Store the kern parameter passed to sk_alloc() in a variable in the
     sock struct (sk_kern_sock).  This informs sock_lock_init(),
     sock_init_data() and sk_clone_lock() as to the lock keys to be used.

     Note that the child created by sk_clone_lock() inherits the parent's
     kern setting.

 (3) Add a 'kern' parameter to ->accept() that is analogous to the one
     passed in to ->create() that distinguishes whether kernel_accept() or
     sys_accept4() was the caller and can be passed to sk_alloc().

     Note that a lot of accept functions merely dequeue an already
     allocated socket.  I haven't touched these as the new socket already
     exists before we get the parameter.

     Note also that there are a couple of places where I've made the accepted
     socket unconditionally kernel-based:

	irda_accept()
	rds_rcp_accept_one()
	tcp_accept_from_sock()

     because they follow a sock_create_kern() and accept off of that.

Whilst creating this, I noticed that lustre and ocfs don't create sockets
through sock_create_kern() and thus they aren't marked as for-kernel,
though they appear to be internal.  I wonder if these should do that so
that they use the new set of lock keys.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 18:23:27 -08:00
Ingo Molnar
3f07c01441 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:29 +01:00
Andrey Vagin
ba94f3088b unix: add ioctl to open a unix socket file with O_PATH
This ioctl opens a file to which a socket is bound and
returns a file descriptor. The caller has to have CAP_NET_ADMIN
in the socket network namespace.

Currently it is impossible to get a path and a mount point
for a socket file. socket_diag reports address, device ID and inode
number for unix sockets. An address can contain a relative path or
a file may be moved somewhere. And these properties say nothing about
a mount namespace and a mount point of a socket file.

With the introduced ioctl, we can get a path by reading
/proc/self/fd/X and get mnt_id from /proc/self/fdinfo/X.

In CRIU we are going to use this ioctl to dump and restore unix socket.

Here is an example how it can be used:

$ strace -e socket,bind,ioctl ./test /tmp/test_sock
socket(AF_UNIX, SOCK_STREAM, 0)         = 3
bind(3, {sa_family=AF_UNIX, sun_path="test_sock"}, 11) = 0
ioctl(3, SIOCUNIXFILE, 0)           = 4
^Z

$ ss -a | grep test_sock
u_str  LISTEN     0      1      test_sock 17798                 * 0

$ ls -l /proc/760/fd/{3,4}
lrwx------ 1 root root 64 Feb  1 09:41 3 -> 'socket:[17798]'
l--------- 1 root root 64 Feb  1 09:41 4 -> /tmp/test_sock

$ cat /proc/760/fdinfo/4
pos:	0
flags:	012000000
mnt_id:	40

$ cat /proc/self/mountinfo | grep "^40\s"
40 19 0:37 / /tmp rw shared:23 - tmpfs tmpfs rw

Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-02 21:58:02 -05:00
WANG Cong
0fb44559ff af_unix: move unix_mknod() out of bindlock
Dmitry reported a deadlock scenario:

unix_bind() path:
u->bindlock ==> sb_writer

do_splice() path:
sb_writer ==> pipe->mutex ==> u->bindlock

In the unix_bind() code path, unix_mknod() does not have to
be done with u->bindlock held, since it is a pure fs operation,
so we can just move unix_mknod() out.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 14:30:56 -05:00
Linus Torvalds
7c0f6ba682 Replace <asm/uaccess.h> with <linux/uaccess.h> globally
This was entirely automated, using the script by Al:

  PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
  sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
        $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-24 11:46:01 -08:00
Linus Torvalds
ff0f962ca3 Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
 "This update contains:

   - try to clone on copy-up

   - allow renaming a directory

   - split source into managable chunks

   - misc cleanups and fixes

  It does not contain the read-only fd data inconsistency fix, which Al
  didn't like. I'll leave that to the next year..."

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (36 commits)
  ovl: fix reStructuredText syntax errors in documentation
  ovl: fix return value of ovl_fill_super
  ovl: clean up kstat usage
  ovl: fold ovl_copy_up_truncate() into ovl_copy_up()
  ovl: create directories inside merged parent opaque
  ovl: opaque cleanup
  ovl: show redirect_dir mount option
  ovl: allow setting max size of redirect
  ovl: allow redirect_dir to default to "on"
  ovl: check for emptiness of redirect dir
  ovl: redirect on rename-dir
  ovl: lookup redirects
  ovl: consolidate lookup for underlying layers
  ovl: fix nested overlayfs mount
  ovl: check namelen
  ovl: split super.c
  ovl: use d_is_dir()
  ovl: simplify lookup
  ovl: check lower existence of rename target
  ovl: rename: simplify handling of lower/merged directory
  ...
2016-12-16 10:58:12 -08:00
Miklos Szeredi
beef5121f3 Revert "af_unix: fix hard linked sockets on overlay"
This reverts commit eb0a4a47ae.

Since commit 51f7e52dc9 ("ovl: share inode for hard link") there's no
need to call d_real_inode() to check two overlay inodes for equality.

Side effect of this revert is that it's no longer possible to connect one
socket on overlayfs to one on the underlying layer (something which didn't
make sense anyway).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:53 +01:00
David S. Miller
f9aa9dc7d2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All conflicts were simple overlapping changes except perhaps
for the Thunder driver.

That driver has a change_mtu method explicitly for sending
a message to the hardware.  If that fails it returns an
error.

Normally a driver doesn't need an ndo_change_mtu method becuase those
are usually just range changes, which are now handled generically.
But since this extra operation is needed in the Thunder driver, it has
to stay.

However, if the message send fails we have to restore the original
MTU before the change because the entire call chain expects that if
an error is thrown by ndo_change_mtu then the MTU did not change.
Therefore code is added to nicvf_change_mtu to remember the original
MTU, and to restore it upon nicvf_update_hw_max_frs() failue.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-22 13:27:16 -05:00
WANG Cong
06a77b07e3 af_unix: conditionally use freezable blocking calls in read
Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read")
converts schedule_timeout() to its freezable version, it was probably
correct at that time, but later, commit 2b514574f7
("net: af_unix: implement splice for stream af_unix sockets") breaks
the strong requirement for a freezable sleep, according to
commit 0f9548ca10:

    We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
    deadlock if the lock is later acquired in the suspend or hibernate path
    (e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
    cgroup_freezer if a lock is held inside a frozen cgroup that is later
    acquired by a process outside that group.

The pipe_lock is still held at that point.

So use freezable version only for the recvmsg call path, avoid impact for
Android.

Fixes: 2b514574f7 ("net: af_unix: implement splice for stream af_unix sockets")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Colin Cross <ccross@android.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 13:58:39 -05:00
David S. Miller
bb598c1b8c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several cases of bug fixes in 'net' overlapping other changes in
'net-next-.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 10:54:36 -05:00
Paolo Abeni
7c13f97ffd udp: do fwd memory scheduling on dequeue
A new argument is added to __skb_recv_datagram to provide
an explicit skb destructor, invoked under the receive queue
lock.
The UDP protocol uses such argument to perform memory
reclaiming on dequeue, so that the UDP protocol does not
set anymore skb->desctructor.
Instead explicit memory reclaiming is performed at close() time and
when skbs are removed from the receive queue.
The in kernel UDP protocol users now need to call a
skb_recv_udp() variant instead of skb_recv_datagram() to
properly perform memory accounting on dequeue.

Overall, this allows acquiring only once the receive queue
lock on dequeue.

Tested using pktgen with random src port, 64 bytes packet,
wire-speed on a 10G link as sender and udp_sink as the receiver,
using an l4 tuple rxhash to stress the contention, and one or more
udp_sink instances with reuseport.

nr sinks	vanilla		patched
1		440		560
3		2150		2300
6		3650		3800
9		4450		4600
12		6250		6450

v1 -> v2:
 - do rmem and allocated memory scheduling under the receive lock
 - do bulk scheduling in first_packet_length() and in udp_destruct_sock()
 - avoid the typdef for the dequeue callback

Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 13:24:41 -05:00
Isaac Boukris
e7947ea770 unix: escape all null bytes in abstract unix domain socket
Abstract unix domain socket may embed null characters,
these should be translated to '@' when printed out to
proc the same way the null prefix is currently being
translated.

This helps for tools such as netstat, lsof and the proc
based implementation in ss to show all the significant
bytes of the name (instead of getting cut at the first
null occurrence).

Signed-off-by: Isaac Boukris <iboukris@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 12:15:13 -04:00
Al Viro
25869262ef skb_splice_bits(): get rid of callback
since pipe_lock is the outermost now, we don't need to drop/regain
socket locks around the call of splice_to_pipe() from skb_splice_bits(),
which kills the need to have a socket-specific callback; we can just
call splice_to_pipe() and be done with that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-03 20:40:56 -04:00
Linus Torvalds
6e1ce3c345 af_unix: split 'u->readlock' into two: 'iolock' and 'bindlock'
Right now we use the 'readlock' both for protecting some of the af_unix
IO path and for making the bind be single-threaded.

The two are independent, but using the same lock makes for a nasty
deadlock due to ordering with regards to filesystem locking.  The bind
locking would want to nest outside the VSF pathname locking, but the IO
locking wants to nest inside some of those same locks.

We tried to fix this earlier with commit c845acb324 ("af_unix: Fix
splice-bind deadlock") which moved the readlock inside the vfs locks,
but that caused problems with overlayfs that will then call back into
filesystem routines that take the lock in the wrong order anyway.

Splitting the locks means that we can go back to having the bind lock be
the outermost lock, and we don't have any deadlocks with lock ordering.

Acked-by: Rainer Weikusat <rweikusat@cyberadapt.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-04 13:29:29 -07:00
Linus Torvalds
38f7bd94a9 Revert "af_unix: Fix splice-bind deadlock"
This reverts commit c845acb324.

It turns out that it just replaces one deadlock with another one: we can
still get the wrong lock ordering with the readlock due to overlayfs
calling back into the filesystem layer and still taking the vfs locks
after the readlock.

The proper solution ends up being to just split the readlock into two
pieces: the bind lock (taken *outside* the vfs locks) and the IO lock
(taken *inside* the filesystem locks).  The two locks are independent
anyway.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-04 13:29:29 -07:00
Vladimir Davydov
3aa9799e13 af_unix: charge buffers to kmemcg
Unix sockets can consume a significant amount of system memory, hence
they should be accounted to kmemcg.

Since unix socket buffers are always allocated from process context, all
we need to do to charge them to kmemcg is set __GFP_ACCOUNT in
sock->sk_allocation mask.

Eric asked:

> 1) What happens when a buffer, allocated from socket <A> lands in a
> different socket <B>, maybe owned by another user/process.
>
> Who owns it now, in term of kmemcg accounting ?

We never move memcg charges.  E.g.  if two processes from different
cgroups are sharing a memory region, each page will be charged to the
process which touched it first.  Or if two processes are working with
the same directory tree, inodes and dentries will be charged to the
first user.  The same is fair for unix socket buffers - they will be
charged to the sender.

> 2) Has performance impact been evaluated ?

I ran netperf STREAM_STREAM with default options in a kmemcg on a 4 core
x2 HT box.  The results are below:

 # clients            bandwidth (10^6bits/sec)
                    base              patched
         1      67643 +-  725      64874 +-  353    - 4.0 %
         4     193585 +- 2516     186715 +- 1460    - 3.5 %
         8     194820 +-  377     187443 +- 1229    - 3.7 %

So the accounting doesn't come for free - it takes ~4% of performance.
I believe we could optimize it by using per cpu batching not only on
charge, but also on uncharge in memcg core, but that's beyond the scope
of this patch set - I'll take a look at this later.

Anyway, if performance impact is found to be unacceptable, it is always
possible to disable kmem accounting at boot time (cgroup.memory=nokmem)
or not use memory cgroups at runtime at all (thanks to jump labels
there'll be no overhead even if they are compiled in).

Link: http://lkml.kernel.org/r/fcfe6cae27a59fbc5e40145664b3cf085a560c68.1464079538.git.vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Miklos Szeredi
30402c8949 Merge branch 'overlayfs-af_unix-fix' into overlayfs-linus 2016-06-12 12:05:21 +02:00
Miklos Szeredi
eb0a4a47ae af_unix: fix hard linked sockets on overlay
Overlayfs uses separate inodes even in the case of hard links on the
underlying filesystems.  This is a problem for AF_UNIX socket
implementation which indexes sockets based on the inode.  This resulted in
hard linked sockets not working.

The fix is to use the real, underlying inode.

Test case follows:

-- ovl-sock-test.c --
#include <unistd.h>
#include <err.h>
#include <sys/socket.h>
#include <sys/un.h>

#define SOCK "test-sock"
#define SOCK2 "test-sock2"

int main(void)
{
	int fd, fd2;
	struct sockaddr_un addr = {
		.sun_family = AF_UNIX,
		.sun_path = SOCK,
	};
	struct sockaddr_un addr2 = {
		.sun_family = AF_UNIX,
		.sun_path = SOCK2,
	};

	unlink(SOCK);
	unlink(SOCK2);
	if ((fd = socket(AF_UNIX, SOCK_STREAM, 0)) == -1)
		err(1, "socket");
	if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) == -1)
		err(1, "bind");
	if (listen(fd, 0) == -1)
		err(1, "listen");
	if (link(SOCK, SOCK2) == -1)
		err(1, "link");
	if ((fd2 = socket(AF_UNIX, SOCK_STREAM, 0)) == -1)
		err(1, "socket");
	if (connect(fd2, (struct sockaddr *) &addr2, sizeof(addr2)) == -1)
		err (1, "connect");
	return 0;
}
----

Reported-by: Alexander Morozov <alexandr.morozov@docker.com> 
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
2016-05-20 22:13:45 +02:00
Al Viro
d360775217 constify security_path_{mkdir,mknod,symlink}
... as well as unix_mknod() and may_o_create()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-28 00:47:27 -04:00
David S. Miller
b633353115 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/phy/bcm7xxx.c
	drivers/net/phy/marvell.c
	drivers/net/vxlan.c

All three conflicts were cases of simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-23 00:09:14 -05:00
Rainer Weikusat
18eceb818d af_unix: Don't use continue to re-execute unix_stream_read_generic loop
The unix_stream_read_generic function tries to use a continue statement
to restart the receive loop after waiting for a message. This may not
work as intended as the caller might use a recvmsg call to peek at
control messages without specifying a message buffer. If this was the
case, the continue will cause the function to return without an error
and without the credential information if the function had to wait for a
message while it had returned with the credentials otherwise. Change to
using goto to restart the loop without checking the condition first in
this case so that credentials are returned either way.

Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-19 23:50:31 -05:00
Dmitry V. Levin
b5f0549231 unix_diag: fix incorrect sign extension in unix_lookup_by_ino
The value passed by unix_diag_get_exact to unix_lookup_by_ino has type
__u32, but unix_lookup_by_ino's argument ino has type int, which is not
a problem yet.
However, when ino is compared with sock_i_ino return value of type
unsigned long, ino is sign extended to signed long, and this results
to incorrect comparison on 64-bit architectures for inode numbers
greater than INT_MAX.

This bug was found by strace test suite.

Fixes: 5d3cae8bc3 ("unix_diag: Dumping exact socket core")
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-19 23:49:23 -05:00
Rainer Weikusat
a5527dda34 af_unix: Guard against other == sk in unix_dgram_sendmsg
The unix_dgram_sendmsg routine use the following test

if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {

to determine if sk and other are in an n:1 association (either
established via connect or by using sendto to send messages to an
unrelated socket identified by address). This isn't correct as the
specified address could have been bound to the sending socket itself or
because this socket could have been connected to itself by the time of
the unix_peer_get but disconnected before the unix_state_lock(other). In
both cases, the if-block would be entered despite other == sk which
might either block the sender unintentionally or lead to trying to unlock
the same spin lock twice for a non-blocking send. Add a other != sk
check to guard against this.

Fixes: 7d267278a9 ("unix: avoid use-after-free in ep_remove_wait_queue")
Reported-By: Philipp Hahn <pmhahn@pmhahn.de>
Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Tested-by: Philipp Hahn <pmhahn@pmhahn.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 12:53:35 -05:00
Rainer Weikusat
1b92ee3d03 af_unix: Don't set err in unix_stream_read_generic unless there was an error
The present unix_stream_read_generic contains various code sequences of
the form

err = -EDISASTER;
if (<test>)
	goto out;

This has the unfortunate side effect of possibly causing the error code
to bleed through to the final

out:
	return copied ? : err;

and then to be wrongly returned if no data was copied because the caller
didn't supply a data buffer, as demonstrated by the program available at

http://pad.lv/1540731

Change it such that err is only set if an error condition was detected.

Fixes: 3822b5c2fc ("af_unix: Revert 'lock_interruptible' in stream receive code")
Reported-by: Joseph Salisbury <joseph.salisbury@canonical.com>
Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 12:48:04 -05:00
Hannes Frederic Sowa
415e3d3e90 unix: correctly track in-flight fds in sending process user_struct
The commit referenced in the Fixes tag incorrectly accounted the number
of in-flight fds over a unix domain socket to the original opener
of the file-descriptor. This allows another process to arbitrary
deplete the original file-openers resource limit for the maximum of
open files. Instead the sending processes and its struct cred should
be credited.

To do so, we add a reference counted struct user_struct pointer to the
scm_fp_list and use it to account for the number of inflight unix fds.

Fixes: 712f4aad40 ("unix: properly account for FDs passed over unix sockets")
Reported-by: David Herrmann <dh.herrmann@gmail.com>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-08 10:30:42 -05:00
David Herrmann
3575dbf2cb net: drop write-only stack variable
Remove a write-only stack variable from unix_attach_fds(). This is a
left-over from the security fix in:

    commit 712f4aad40
    Author: willy tarreau <w@1wt.eu>
    Date:   Sun Jan 10 07:54:56 2016 +0100

        unix: properly account for FDs passed over unix sockets

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 14:06:26 -05:00
Eric Dumazet
fa0dc04df2 af_unix: fix struct pid memory leak
Dmitry reported a struct pid leak detected by a syzkaller program.

Bug happens in unix_stream_recvmsg() when we break the loop when a
signal is pending, without properly releasing scm.

Fixes: b3ca9b02b0 ("net: fix multithreaded signal handling in unix recv routines")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-24 22:04:49 -08:00
David S. Miller
9d367eddf3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/bonding/bond_main.c
	drivers/net/ethernet/mellanox/mlxsw/spectrum.h
	drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c

The bond_main.c and mellanox switch conflicts were cases of
overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-11 23:55:43 -05:00
willy tarreau
712f4aad40 unix: properly account for FDs passed over unix sockets
It is possible for a process to allocate and accumulate far more FDs than
the process' limit by sending them over a unix socket then closing them
to keep the process' fd count low.

This change addresses this problem by keeping track of the number of FDs
in flight per user and preventing non-privileged processes from having
more FDs in flight than their configured FD limit.

Reported-by: socketpair@gmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Mitigates: CVE-2013-4312 (Linux 2.0+)
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-11 00:05:30 -05:00
David S. Miller
9e0efaf6b4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-01-06 22:54:18 -05:00
Rainer Weikusat
c845acb324 af_unix: Fix splice-bind deadlock
On 2015/11/06, Dmitry Vyukov reported a deadlock involving the splice
system call and AF_UNIX sockets,

http://lists.openwall.net/netdev/2015/11/06/24

The situation was analyzed as

(a while ago) A: socketpair()
B: splice() from a pipe to /mnt/regular_file
	does sb_start_write() on /mnt
C: try to freeze /mnt
	wait for B to finish with /mnt
A: bind() try to bind our socket to /mnt/new_socket_name
	lock our socket, see it not bound yet
	decide that it needs to create something in /mnt
	try to do sb_start_write() on /mnt, block (it's
	waiting for C).
D: splice() from the same pipe to our socket
	lock the pipe, see that socket is connected
	try to lock the socket, block waiting for A
B:	get around to actually feeding a chunk from
	pipe to file, try to lock the pipe.  Deadlock.

on 2015/11/10 by Al Viro,

http://lists.openwall.net/netdev/2015/11/10/4

The patch fixes this by removing the kern_path_create related code from
unix_mknod and executing it as part of unix_bind prior acquiring the
readlock of the socket in question. This means that A (as used above)
will sb_start_write on /mnt before it acquires the readlock, hence, it
won't indirectly block B which first did a sb_start_write and then
waited for a thread trying to acquire the readlock. Consequently, A
being blocked by C waiting for B won't cause a deadlock anymore
(effectively, both A and B acquire two locks in opposite order in the
situation described above).

Dmitry Vyukov(<dvyukov@google.com>) tested the original patch.

Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04 23:22:49 -05:00
David S. Miller
b3e0d3d7ba Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/geneve.c

Here we had an overlapping change, where in 'net' the extraneous stats
bump was being removed whilst in 'net-next' the final argument to
udp_tunnel6_xmit_skb() was being changed.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17 22:08:28 -05:00
Rainer Weikusat
3822b5c2fc af_unix: Revert 'lock_interruptible' in stream receive code
With b3ca9b02b0, the AF_UNIX SOCK_STREAM
receive code was changed from using mutex_lock(&u->readlock) to
mutex_lock_interruptible(&u->readlock) to prevent signals from being
delayed for an indefinite time if a thread sleeping on the mutex
happened to be selected for handling the signal. But this was never a
problem with the stream receive code (as opposed to its datagram
counterpart) as that never went to sleep waiting for new messages with the
mutex held and thus, wouldn't cause secondary readers to block on the
mutex waiting for the sleeping primary reader. As the interruptible
locking makes the code more complicated in exchange for no benefit,
change it back to using mutex_lock.

Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17 15:33:47 -05:00
Rainer Weikusat
6487428088 af_unix: fix unix_dgram_recvmsg entry locking
The current unix_dgram_recvsmg code acquires the u->readlock mutex in
order to protect access to the peek offset prior to calling
__skb_recv_datagram for actually receiving data. This implies that a
blocking reader will go to sleep with this mutex held if there's
presently no data to return to userspace. Two non-desirable side effects
of this are that a later non-blocking read call on the same socket will
block on the ->readlock mutex until the earlier blocking call releases it
(or the readers is interrupted) and that later blocking read calls
will wait longer than the effective socket read timeout says they
should: The timeout will only start 'ticking' once such a reader hits
the schedule_timeout in wait_for_more_packets (core.c) while the time it
already had to wait until it could acquire the mutex is unaccounted for.

The patch avoids both by using the __skb_try_recv_datagram and
__skb_wait_for_more packets functions created by the first patch to
implement a unix_dgram_recvmsg read loop which releases the readlock
mutex prior to going to sleep and reacquires it as needed
afterwards. Non-blocking readers will thus immediately return with
-EAGAIN if there's no data available regardless of any concurrent
blocking readers and all blocking readers will end up sleeping via
schedule_timeout, thus honouring the configured socket receive timeout.

Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-06 23:31:54 -05:00
David S. Miller
f188b951f3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/renesas/ravb_main.c
	kernel/bpf/syscall.c
	net/ipv4/ipmr.c

All three conflicts were cases of overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03 21:09:12 -05:00