Use a generic MMIO clocksource and get rid of some lines of code.
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Jesper Nilsson <jespern@axis.com>
Implement a oneshot-capable clockevents device so we get support for
things like hrtimers and NOHZ.
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Jesper Nilsson <jespern@axis.com>
Delete headers which do nothing but include the asm-generic versions and
use Kbuild magic instead.
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Jesper Nilsson <jespern@axis.com>
The generic atomic bitops are the same as the CRIS-specific ones.
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Jesper Nilsson <jespern@axis.com>
All of these are either unused or already provided by other headers, so
they can be removed.
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Jesper Nilsson <jespern@axis.com>
The CRIS SMP code cannot be built since there is no (and appears to
never have been) a CONFIG_SMP Kconfig option in arch/cris/. Remove it.
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Jesper Nilsson <jespern@axis.com>
INIT_THREAD enables interrupts in the thread_struct's saved flags. This
means that interrupts get enabled in the middle of context_switch()
while switching to new tasks that get forked off the init task during
boot. Don't do this.
Fixes the following splat on boot with spinlock debugging on:
BUG: spinlock cpu recursion on CPU#0, swapper/2
lock: runqueues+0x0/0x47c, .magic: dead4ead, .owner: swapper/0,
.owner_cpu: 0
CPU: 0 PID: 2 Comm: swapper Not tainted 3.19.0-08796-ga747b55 #285
Call Trace:
[<c0032b80>] spin_bug+0x2a/0x36
[<c0032c98>] do_raw_spin_lock+0xa2/0x126
[<c01964b0>] _raw_spin_lock+0x20/0x2a
[<c00286c8>] scheduler_tick+0x22/0x76
[<c003db2c>] update_process_times+0x5e/0x72
[<c0007a94>] timer_interrupt+0x4e/0x6a
[<c00378d6>] handle_irq_event_percpu+0x54/0xf2
[<c00379c4>] handle_irq_event+0x50/0x74
[<c003988e>] handle_simple_irq+0x6c/0xbe
[<c0037270>] generic_handle_irq+0x2a/0x36
[<c0004c40>] do_IRQ+0x38/0x84
[<c000662e>] crisv32_do_IRQ+0x54/0x60
[<c0006204>] IRQ0x4b_interrupt+0x34/0x3c
[<c0192baa>] __schedule+0x24a/0x532
[<c00056b4>] ret_from_kernel_thread+0x0/0x14
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Jesper Nilsson <jespern@axis.com>
Al Viro noted that CRIS fails to handle multiple signals.
This fixes the problem for CRISv32 by making it use a C work_pending
handling loop similar to the ARM implementation in 0a267fa6a1
("ARM: 7472/1: pull all work_pending logics into C function").
This also happens to fixes the warnings which currently trigger on
CRISv32 due to do_signal() being called with interrupts disabled.
Test case (should die of the SIGSEGV which gets raised when setting up
the stack for SIGALRM, but instead reaches and executes the _exit(1)):
#include <unistd.h>
#include <signal.h>
#include <sys/time.h>
#include <err.h>
static void handler(int sig) { }
int main(int argc, char *argv[])
{
int ret;
struct itimerval t1 = { .it_value = {1} };
stack_t ss = {
.ss_sp = NULL,
.ss_size = SIGSTKSZ,
};
struct sigaction action = {
.sa_handler = handler,
.sa_flags = SA_ONSTACK,
};
ret = sigaltstack(&ss, NULL);
if (ret < 0)
err(1, "sigaltstack");
sigaction(SIGALRM, &action, NULL);
setitimer(ITIMER_REAL, &t1, NULL);
pause();
_exit(1);
return 0;
}
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Link: http://lkml.kernel.org/r/20121208074429.GC4939@ZenIV.linux.org.uk
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Jesper Nilsson <jespern@axis.com>
Al Viro noted that CRIS is vulnerable to bogus restarts on sigreturn.
The fixes CRISv32 by using regs->exs as an additional indicator to
whether we should attempt to restart the syscall or not. EXS is only
used in the sigtrap handling, and in that path we already have r9 (the
other indicator, which indicates if we're in a syscall or not) cleared.
Test case, a port of Al's ARM version from 653d48b221 ("arm: fix
really nasty sigreturn bug"):
#include <unistd.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/time.h>
#include <errno.h>
void f(int n)
{
register int r10 asm ("r10") = n;
__asm__ __volatile__(
"ba 1f \n"
"nop \n"
"break 8 \n"
"1: ba . \n"
"nop \n"
:
: "r" (r10)
: "memory");
}
void handler1(int sig) { }
void handler2(int sig) { raise(1); }
void handler3(int sig) { exit(0); }
int main(int argc, char *argv[])
{
struct sigaction s = {.sa_handler = handler2};
struct itimerval t1 = { .it_value = {1} };
struct itimerval t2 = { .it_value = {2} };
signal(1, handler1);
sigemptyset(&s.sa_mask);
sigaddset(&s.sa_mask, 1);
sigaction(SIGALRM, &s, NULL);
signal(SIGVTALRM, handler3);
setitimer(ITIMER_REAL, &t1, NULL);
setitimer(ITIMER_VIRTUAL, &t2, NULL);
f(-513); /* -ERESTARTNOINTR */
return 0;
}
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Link: http://lkml.kernel.org/r/20121208074429.GC4939@ZenIV.linux.org.uk
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Jesper Nilsson <jespern@axis.com>
r9 is used to determine whether syscall restarting must be performed or
not. Unfortunately, r9 is never set to zero in the non-syscall path,
and r9 is on top of that a callee-saved register which can be set to
non-zero by the C functions that are called during IRQ handling.
This means that if r10 (used for the syscall return value) is one of the
-ERESTART* values when a hardware interrupt occurs which leads to a
signal being delivered to the process, the kernel will "restart" a
syscall which never occurred. This will lead to the PC being moved back
by 2 on return to user space.
Fix the problem by setting r9 to zero in the interrupt path.
Test case (should loop forever but ends up executing the break 8 trap
instruction):
#include <signal.h>
#include <stdlib.h>
#include <sys/time.h>
void f(int n)
{
register int r9 asm ("r9") = 1;
register int r10 asm ("r10") = n;
__asm__ __volatile__(
"ba 1f \n"
"nop \n"
"break 8 \n"
"1: ba . \n"
"nop \n"
:
: "r" (r9), "r" (r10)
: "memory");
}
void handler1(int sig) { }
int main(int argc, char *argv[])
{
struct itimerval t1 = { .it_value = {1} };
signal(SIGALRM, handler1);
setitimer(ITIMER_REAL, &t1, NULL);
f(-513); /* -ERESTARTNOINTR */
return 0;
}
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Jesper Nilsson <jespern@axis.com>
Add a minimal device tree for the ETRAX FS SoC and the Axis 88 developer
board.
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Jesper Nilsson <jespern@axis.com>
Add support for booting CRISv32 with a built-in device tree.
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
Add support for IRQ domains to the CRISv32 interrupt controller.
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
Enable GPIOLIB on CRIS so that we can use the generic GPIO APIs.
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
While working on arch/cris/include/asm/uaccess.h, I noticed
that some macros within this header are made harder to read because they
violate a coding style rule: space is missing after comma.
Fix it up.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
virtio wants to read bitwise types from userspace using get_user. At the
moment this triggers sparse errors, since the value is passed through an
integer.
Fix that up using __force.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
If an attacker can cause a controlled kernel stack overflow, overwriting
the restart block is a very juicy exploit target. This is because the
restart_block is held in the same memory allocation as the kernel stack.
Moving the restart block to struct task_struct prevents this exploit by
making the restart_block harder to locate.
Note that there are other fields in thread_info that are also easy
targets, at least on some architectures.
It's also a decent simplification, since the restart code is more or less
identical on all architectures.
[james.hogan@imgtec.com: metag: align thread_info::supervisor_stack]
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: David Miller <davem@davemloft.net>
Acked-by: Richard Weinberger <richard@nod.at>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Chen Liqin <liqin.linux@gmail.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
LKP has triggered a compiler warning after my recent patch "mm: account
pmd page tables to the process":
mm/mmap.c: In function 'exit_mmap':
>> mm/mmap.c:2857:2: warning: right shift count >= width of type [enabled by default]
The code:
> 2857 WARN_ON(mm_nr_pmds(mm) >
2858 round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT);
In this, on tile, we have FIRST_USER_ADDRESS defined as 0. round_up() has
the same type -- int. PUD_SHIFT.
I think the best way to fix it is to define FIRST_USER_ADDRESS as unsigned
long. On every arch for consistency.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
"you should SIGSEGV" error, because the SIGSEGV case was generally
handled by the caller - usually the architecture fault handler.
That results in lots of duplication - all the architecture fault
handlers end up doing very similar "look up vma, check permissions, do
retries etc" - but it generally works. However, there are cases where
the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.
In particular, when accessing the stack guard page, libsigsegv expects a
SIGSEGV. And it usually got one, because the stack growth is handled by
that duplicated architecture fault handler.
However, when the generic VM layer started propagating the error return
from the stack expansion in commit fee7e49d45 ("mm: propagate error
from stack expansion even for guard page"), that now exposed the
existing VM_FAULT_SIGBUS result to user space. And user space really
expected SIGSEGV, not SIGBUS.
To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
duplicate architecture fault handlers about it. They all already have
the code to handle SIGSEGV, so it's about just tying that new return
value to the existing code, but it's all a bit annoying.
This is the mindless minimal patch to do this. A more extensive patch
would be to try to gather up the mostly shared fault handling logic into
one generic helper routine, and long-term we really should do that
cleanup.
Just from this patch, you can generally see that most architectures just
copied (directly or indirectly) the old x86 way of doing things, but in
the meantime that original x86 model has been improved to hold the VM
semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
"newer" things, so it would be a good idea to bring all those
improvements to the generic case and teach other architectures about
them too.
Reported-and-tested-by: Takashi Iwai <tiwai@suse.de>
Tested-by: Jan Engelhardt <jengelh@inai.de>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots"
Cc: linux-arch@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The time Kconfig expects that NR_CPUS is defined.
This patch removes this config warning:
"kernel/time/Kconfig:163:warning: range is invalid"
Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
Avoids the warning about:
warning: 'bite_in_progress' defined but not used [-Wunused-variable]
Variable is only used if the Kconfig CONFIG_ETRAX_WATCHDOG_NICE_DOGGY
is set.
Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
Move declaration of waitqueue to beginning of block,
avoids warning about mixing declarations and code.
Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
Allows that symbol to be used in modules, and fixes
the following on allmodconfig:
ERROR: "csum_partial_copy_nocheck" [net/ipv6/ipv6.ko] undefined!
Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
Pull vfs fixes from Al Viro:
"A couple of fixes - deadlock in CIFS and build breakage in cris serial
driver (resurfaced f_dentry in there)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
VFS: Convert file->f_dentry->d_inode to file_inode()
fix deadlock in cifs_ioctl_clone()
Convert file->f_dentry->d_inode to file_inode() so as to get layered
filesystems right.
Found with: git grep '[.>]f_dentry'
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Nothing needs the module pointer any more, and the next patch will
call it from RCU, where the module itself might no longer exist.
Removing the arg is the safest approach.
This just codifies the use of the module_alloc/module_free pattern
which ftrace and bpf use.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: x86@kernel.org
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: linux-cris-kernel@axis.com
Cc: linux-kernel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: nios2-dev@lists.rocketboards.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Cc: netdev@vger.kernel.org
Move pinmux alloc/dealloc code into functions that don't take
the spinlock so we can use from code that has the spinlock already.
CRISv32 has no working SMP, so spinlocks becomes a NOP,
so deadlock was never seen.
Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
- Add free_initrd_mem as found by Guenter Roeck <linux@roeck-us.net>
- Add free_init_pages
- Export empty_zero_page symbol
Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
Don't enter watchdog handling if we're already in watchdog handling.
Also some minor formatting tweaks.
Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
strcmp was lost when all other string functions were removed,
but we still have an optimized version for this on CRISv32,
so any driver built as a module would not have access to this symbol.
In a similar manner, we had optimized versions of
csum_partial_copy_from_user and __do_clear_user
but no exported symbols for them, breaking bunch of other drivers
when built as a module.
At the same time, move EXPORT_SYMBOL(__copy_user) and
EXPORT_SYMBOL(__copy_user_zeroing) C-files so it's
located together with the function definition.
Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
Fix headers_install by adjusting the path to arch files.
And delete unused Kbuild file.
Drop special handling of cris in the headers.sh script
as a nice side-effect.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
Fixes the following compile error.
arch/cris/arch-v32/kernel/time.c: In function 'reset_watchdog':
arch/cris/arch-v32/kernel/time.c:121:2:
error: implicit declaration of function 'global_page_state'
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
Pull networking updates from David Miller:
1) New offloading infrastructure and example 'rocker' driver for
offloading of switching and routing to hardware.
This work was done by a large group of dedicated individuals, not
limited to: Scott Feldman, Jiri Pirko, Thomas Graf, John Fastabend,
Jamal Hadi Salim, Andy Gospodarek, Florian Fainelli, Roopa Prabhu
2) Start making the networking operate on IOV iterators instead of
modifying iov objects in-situ during transfers. Thanks to Al Viro
and Herbert Xu.
3) A set of new netlink interfaces for the TIPC stack, from Richard
Alpe.
4) Remove unnecessary looping during ipv6 routing lookups, from Martin
KaFai Lau.
5) Add PAUSE frame generation support to gianfar driver, from Matei
Pavaluca.
6) Allow for larger reordering levels in TCP, which are easily
achievable in the real world right now, from Eric Dumazet.
7) Add a variable of napi_schedule that doesn't need to disable cpu
interrupts, from Eric Dumazet.
8) Use a doubly linked list to optimize neigh_parms_release(), from
Nicolas Dichtel.
9) Various enhancements to the kernel BPF verifier, and allow eBPF
programs to actually be attached to sockets. From Alexei
Starovoitov.
10) Support TSO/LSO in sunvnet driver, from David L Stevens.
11) Allow controlling ECN usage via routing metrics, from Florian
Westphal.
12) Remote checksum offload, from Tom Herbert.
13) Add split-header receive, BQL, and xmit_more support to amd-xgbe
driver, from Thomas Lendacky.
14) Add MPLS support to openvswitch, from Simon Horman.
15) Support wildcard tunnel endpoints in ipv6 tunnels, from Steffen
Klassert.
16) Do gro flushes on a per-device basis using a timer, from Eric
Dumazet. This tries to resolve the conflicting goals between the
desired handling of bulk vs. RPC-like traffic.
17) Allow userspace to ask for the CPU upon what a packet was
received/steered, via SO_INCOMING_CPU. From Eric Dumazet.
18) Limit GSO packets to half the current congestion window, from Eric
Dumazet.
19) Add a generic helper so that all drivers set their RSS keys in a
consistent way, from Eric Dumazet.
20) Add xmit_more support to enic driver, from Govindarajulu
Varadarajan.
21) Add VLAN packet scheduler action, from Jiri Pirko.
22) Support configurable RSS hash functions via ethtool, from Eyal
Perry.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1820 commits)
Fix race condition between vxlan_sock_add and vxlan_sock_release
net/macb: fix compilation warning for print_hex_dump() called with skb->mac_header
net/mlx4: Add support for A0 steering
net/mlx4: Refactor QUERY_PORT
net/mlx4_core: Add explicit error message when rule doesn't meet configuration
net/mlx4: Add A0 hybrid steering
net/mlx4: Add mlx4_bitmap zone allocator
net/mlx4: Add a check if there are too many reserved QPs
net/mlx4: Change QP allocation scheme
net/mlx4_core: Use tasklet for user-space CQ completion events
net/mlx4_core: Mask out host side virtualization features for guests
net/mlx4_en: Set csum level for encapsulated packets
be2net: Export tunnel offloads only when a VxLAN tunnel is created
gianfar: Fix dma check map error when DMA_API_DEBUG is enabled
cxgb4/csiostor: Don't use MASTER_MUST for fw_hello call
net: fec: only enable mdio interrupt before phy device link up
net: fec: clear all interrupt events to support i.MX6SX
net: fec: reset fep link status in suspend function
net: sock: fix access via invalid file descriptor
net: introduce helper macro for_each_cmsghdr
...
As there are now no remaining users of arch_fast_hash(), lets kill
it entirely.
This basically reverts commit 71ae8aac3e ("lib: introduce arch
optimized hash library") and follow-up work, that is f.e., commit
237217546d ("lib: hash: follow-up fixups for arch hash"),
commit e3fec2f74f ("lib: Add missing arch generic-y entries for
asm-generic/hash.h") and last but not least commit 6a02652df5
("perf tools: Fix include for non x86 architectures").
Cc: Francesco Fusco <fusco@ntop.org>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
introduce new setsockopt() command:
setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd))
where prog_fd was received from syscall bpf(BPF_PROG_LOAD, attr, ...)
and attr->prog_type == BPF_PROG_TYPE_SOCKET_FILTER
setsockopt() calls bpf_prog_get() which increments refcnt of the program,
so it doesn't get unloaded while socket is using the program.
The same eBPF program can be attached to multiple sockets.
User task exit automatically closes socket which calls sk_filter_uncharge()
which decrements refcnt of eBPF program
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alternative to RPS/RFS is to use hardware support for multiple
queues.
Then split a set of million of sockets into worker threads, each
one using epoll() to manage events on its own socket pool.
Ideally, we want one thread per RX/TX queue/cpu, but we have no way to
know after accept() or connect() on which queue/cpu a socket is managed.
We normally use one cpu per RX queue (IRQ smp_affinity being properly
set), so remembering on socket structure which cpu delivered last packet
is enough to solve the problem.
After accept(), connect(), or even file descriptor passing around
processes, applications can use :
int cpu;
socklen_t len = sizeof(cpu);
getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len);
And use this information to put the socket into the right silo
for optimal performance, as all networking stack should run
on the appropriate cpu, without need to send IPI (RPS/RFS).
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>