The segment descriptor contains information that is relevant to how linear
addresses need to be computed. It contains the default size of addresses
as well as the base address of the segment. Thus, given a segment
selector, we ought to look at segment descriptor to correctly calculate
the linear address.
In protected mode, the segment selector might indicate a segment
descriptor from either the global descriptor table or a local descriptor
table. Both cases are considered in this function.
This function is a prerequisite for functions in subsequent commits that
will obtain the aforementioned attributes of the segment descriptor.
Improvements-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: ricardo.neri@intel.com
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Garnier <thgarnie@google.com>
Link: https://lkml.kernel.org/r/1509135945-13762-15-git-send-email-ricardo.neri-calderon@linux.intel.com
When computing a linear address and segmentation is used, we need to know
the base address of the segment involved in the computation. In most of
the cases, the segment base address will be zero as in USER_DS/USER32_DS.
However, it may be possible that a user space program defines its own
segments via a local descriptor table. In such a case, the segment base
address may not be zero. Thus, the segment base address is needed to
calculate correctly the linear address.
If running in protected mode, the segment selector to be used when
computing a linear address is determined by either any of segment override
prefixes in the instruction or inferred from the registers involved in the
computation of the effective address; in that order. Also, there are cases
when the segment override prefixes shall be ignored (i.e., code segments
are always selected by the CS segment register; string instructions always
use the ES segment register when using rDI register as operand). In long
mode, segment registers are ignored, except for FS and GS. In these two
cases, base addresses are obtained from the respective MSRs.
For clarity, this process can be split into four steps (and an equal
number of functions): determine if segment prefixes overrides can be used;
parse the segment override prefixes, and use them if found; if not found
or cannot be used, use the default segment registers associated with the
operand registers. Once the segment register to use has been identified,
read its value to obtain the segment selector.
The method to obtain the segment selector depends on several factors. In
32-bit builds, segment selectors are saved into a pt_regs structure
when switching to kernel mode. The same is also true for virtual-8086
mode. In 64-bit builds, segmentation is mostly ignored, except when
running a program in 32-bit legacy mode. In this case, CS and SS can be
obtained from pt_regs. DS, ES, FS and GS can be read directly from
the respective segment registers.
In order to identify the segment registers, a new set of #defines is
introduced. It also includes two special identifiers. One of them
indicates when the default segment register associated with instruction
operands shall be used. Another one indicates that the contents of the
segment register shall be ignored; this identifier is used when in long
mode.
Improvements-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: ricardo.neri@intel.com
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Garnier <thgarnie@google.com>
Link: https://lkml.kernel.org/r/1509135945-13762-14-git-send-email-ricardo.neri-calderon@linux.intel.com
String instructions are special because, in protected mode, the linear
address is always obtained via the ES segment register in operands that
use the (E)DI register; the DS segment register in operands that use
the (E)SI register. Furthermore, segment override prefixes are ignored
when calculating a linear address involving the (E)DI register; segment
override prefixes can be used when calculating linear addresses involving
the (E)SI register.
It follows that linear addresses are calculated differently for the case of
string instructions. The purpose of this utility function is to identify
such instructions for callers to determine a linear address correctly.
Note that this function only identifies string instructions; it does not
determine what segment register to use in the address computation. That is
left to callers. A subsequent commmit introduces a function to determine
the segment register to use given the instruction, operands and
segment override prefixes.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: ricardo.neri@intel.com
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Garnier <thgarnie@google.com>
Link: https://lkml.kernel.org/r/1509135945-13762-13-git-send-email-ricardo.neri-calderon@linux.intel.com
The function get_reg_offset() returns the offset to the register the
argument specifies as indicated in an enumeration of type offset. Callers
of this function would need the definition of such enumeration. This is
not needed. Instead, add helper functions for this purpose. These functions
are useful in cases when, for instance, the caller needs to decide whether
the operand is a register or a memory location by looking at the rm part
of the ModRM byte. As of now, this is the only helper function that is
needed.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: ricardo.neri@intel.com
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Garnier <thgarnie@google.com>
Link: https://lkml.kernel.org/r/1509135945-13762-12-git-send-email-ricardo.neri-calderon@linux.intel.com
We are not in a critical failure path. The invalid register type is caused
when trying to decode invalid instruction bytes from a user-space program.
Thus, simply print an error message. To prevent this warning from being
abused from user space programs, use the rate-limited variant of pr_err().
along with a descriptive prefix.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: ricardo.neri@intel.com
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Garnier <thgarnie@google.com>
Link: https://lkml.kernel.org/r/1509135945-13762-11-git-send-email-ricardo.neri-calderon@linux.intel.com
Other kernel submodules can benefit from using the utility functions
defined in mpx.c to obtain the addresses and values of operands contained
in the general purpose registers. An instance of this is the emulation code
used for instructions protected by the Intel User-Mode Instruction
Prevention feature.
Thus, these functions are relocated to a new insn-eval.c file. The reason
to not relocate these utilities into insn.c is that the latter solely
analyses instructions given by a struct insn without any knowledge of the
meaning of the values of instruction operands. This new utility insn-
eval.c aims to be used to resolve userspace linear addresses based on
the contents of the instruction operands as well as the contents of pt_regs
structure.
These utilities come with a separate header. This is to avoid taking insn.c
out of sync from the instructions decoders under tools/obj and tools/perf.
This also avoids adding cumbersome #ifdef's for the #include'd files
required to decode instructions in a kernel context.
Functions are simply relocated. There are not functional or indentation
changes.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: ricardo.neri@intel.com
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Garnier <thgarnie@google.com>
Link: https://lkml.kernel.org/r/1509135945-13762-10-git-send-email-ricardo.neri-calderon@linux.intel.com
Section 2.2.1.2 of the Intel 64 and IA-32 Architectures Software
Developer's Manual volume 2A states that if a SIB byte is used and
SIB.base is 101b and ModRM.mod is zero, then the base part of the base
part of the effective address computation is null. To signal this
situation, a -EDOM error is returned to indicate callers to ignore the
base value present in the register operand.
In this scenario, a 32-bit displacement follows the SIB byte. Displacement
is obtained when the instruction decoder parses the operands.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Adan Hawthorn <adanhawthorn@gmail.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: ricardo.neri@intel.com
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Nathan Howard <liverlint@gmail.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Joe Perches <joe@perches.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/1509135945-13762-9-git-send-email-ricardo.neri-calderon@linux.intel.com
Section 2.2.1.2 of the Intel 64 and IA-32 Architectures Software
Developer's Manual volume 2A states that when ModRM.mod !=11b and
ModRM.rm = 100b indexed register-indirect addressing is used. In other
words, a SIB byte follows the ModRM byte. In the specific case of
SIB.index = 100b, the scale*index portion of the computation of the
effective address is null. To signal callers of this particular situation,
get_reg_offset() can return -EDOM (-EINVAL continues to indicate that an
error when decoding the SIB byte).
An example of this situation can be the following instruction:
8b 4c 23 80 mov -0x80(%rbx,%riz,1),%rcx
ModRM: 0x4c [mod:1b][reg:1b][rm:100b]
SIB: 0x23 [scale:0b][index:100b][base:11b]
Displacement: 0x80 (1-byte, as per ModRM.mod = 1b)
The %riz 'register' indicates a null index.
In long mode, a REX prefix may be used. When a REX prefix is present,
REX.X adds a fourth bit to the register selection of SIB.index. This gives
the ability to refer to all the 16 general purpose registers. When REX.X is
1b and SIB.index is 100b, the index is indicated in %r12. In our example,
this would look like:
42 8b 4c 23 80 mov -0x80(%rbx,%r12,1),%rcx
REX: 0x42 [W:0b][R:0b][X:1b][B:0b]
ModRM: 0x4c [mod:1b][reg:1b][rm:100b]
SIB: 0x23 [scale:0b][.X: 1b, index:100b][.B:0b, base:11b]
Displacement: 0x80 (1-byte, as per ModRM.mod = 1b)
%r12 is a valid register to use in the scale*index part of the effective
address computation.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Adan Hawthorn <adanhawthorn@gmail.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: ricardo.neri@intel.com
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Nathan Howard <liverlint@gmail.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Joe Perches <joe@perches.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/1509135945-13762-8-git-send-email-ricardo.neri-calderon@linux.intel.com
Even though memory addresses are unsigned, the operands used to compute the
effective address do have a sign. This is true for ModRM.rm, SIB.base,
SIB.index as well as the displacement bytes. Thus, signed variables shall
be used when computing the effective address from these operands. Once the
signed effective address has been computed, it is casted to an unsigned
long to determine the linear address.
Variables are renamed to better reflect the type of address being
computed.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Adan Hawthorn <adanhawthorn@gmail.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: ricardo.neri@intel.com
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Nathan Howard <liverlint@gmail.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Joe Perches <joe@perches.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/1509135945-13762-7-git-send-email-ricardo.neri-calderon@linux.intel.com
When errors occur in the computation of the linear address, -1L is
returned. Rather than having a separate return path for errors, the
variable used to return the computed linear address can be initialized
with the error value. Hence, only one return path is needed. This makes
the function easier to read.
While here, ensure that the error value is -1L, a 64-bit value, rather
than -1, a 32-bit value.
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Adan Hawthorn <adanhawthorn@gmail.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: ricardo.neri@intel.com
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Nathan Howard <liverlint@gmail.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Joe Perches <joe@perches.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/1509135945-13762-6-git-send-email-ricardo.neri-calderon@linux.intel.com
In its current form, user_64bit_mode() can only be used when CONFIG_X86_64
is selected. This implies that code built with CONFIG_X86_64=n cannot use
it. If a piece of code needs to be built for both CONFIG_X86_64=y and
CONFIG_X86_64=n and wants to use this function, it needs to wrap it in
an #ifdef/#endif; potentially, in multiple places.
This can be easily avoided with a single #ifdef/#endif pair within
user_64bit_mode() itself.
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: ricardo.neri@intel.com
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Garnier <thgarnie@google.com>
Link: https://lkml.kernel.org/r/1509135945-13762-4-git-send-email-ricardo.neri-calderon@linux.intel.com
Both head_32.S and head_64.S utilize the same value to initialize the
control register CR0. Also, other parts of the kernel might want to access
this initial definition (e.g., emulation code for User-Mode Instruction
Prevention uses this state to provide a sane dummy value for CR0 when
emulating the smsw instruction). Thus, relocate this definition to a
header file from which it can be conveniently accessed.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: ricardo.neri@intel.com
Cc: linux-mm@kvack.org
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: linux-arch@vger.kernel.org
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/1509135945-13762-3-git-send-email-ricardo.neri-calderon@linux.intel.com
Up to this point, only fault.c used the definitions of the page fault error
codes. Thus, it made sense to keep them within such file. Other portions of
code might be interested in those definitions too. For instance, the User-
Mode Instruction Prevention emulation code will use such definitions to
emulate a page fault when it is unable to successfully copy the results
of the emulated instructions to user space.
While relocating the error code enumeration, the prefix X86_ is used to
make it consistent with the rest of the definitions in traps.h. Of course,
code using the enumeration had to be updated as well. No functional changes
were performed.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: ricardo.neri@intel.com
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: https://lkml.kernel.org/r/1509135945-13762-2-git-send-email-ricardo.neri-calderon@linux.intel.com
IB device index is nldev's handler and it should be checked always.
Fixes: c3f66f7b00 ("RDMA/netlink: Implement nldev port doit callback")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Doug Ledford <dledford@redhat.com>
[ Applying directly, since Doug fried his SSD's and is rebuilding - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fixes new breakage introduced by the most recent PM QoS
fix in which, embarrassingly enough, I forgot to update
dev_pm_qos_raw_read_value() to return the right default
for devices with no PM QoS constraints at all which prevents
runtime PM from suspending those devices (fix from Tero Kristo).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJZ97EwAAoJEILEb/54YlRxAw8P/Rtl4qgVBGncDCDlmVK7uyaa
nFGSSAdgTVo+WlyZ/ubs1zXj8ra8AhkKIOa3rwtFI21e/DejlqTxBbUHQMqSDPZC
JUfS6F4hkLW86BmAMeDh+tma0oG5abXvhzx1F26/1LRFsAAORW2MQ02GUzQfe04j
iThjk1J+CmfdZ5Lkc6fIWnI5bQRo7WxuYlT1GfthJVPyorTgCviQ/p5I7012HgF1
ZDwCbUWyEereybIMYaXkwXdyP2RltBKDlxRRWe/tC6O3m3VsDHWuO9+HoS6ZFMCH
sicgpovEHhZagJSWdNuEK2HAvE3XoDTtWU3xSRkZpxJSarKxPD/aXZzwGkyg1x6N
8/KmuwsBuNThQLY6ODDxBP9+1ThpbpgtFbx8JMPy5/Usdm/HlIJSHZyeGSc6BWoj
TI8GLEzqhF61aHWM1u5Nxh8v1DsGvAXEXLWKgQlgfInUPaw1PxYv5lCa5r4fhRzr
te3ePtbaMh2bNu0fbJyYgTcteaOz/1DPSc/keYFhhGj/iU39232IzryDZKTbz0Cn
Zb0OVhkpQcr+1e3HPf1aTjHVQcWds4QtTmDmaoushxmBxhdBYnWX8qiwwmSnvW8S
+kkax7UY4EE7FH+8+3cW7OfqQ7vGLqqG+l3pQptf5yeGMb/hwdWidC669Be6g9Bz
tZ0sDywMg2MGcbeRPx/j
=xbGZ
-----END PGP SIGNATURE-----
Merge tag 'pm-urgent-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"This fixes new breakage introduced by the most recent PM QoS fix in
which, embarrassingly enough, I forgot to update
dev_pm_qos_raw_read_value() to return the right default for devices
with no PM QoS constraints at all which prevents runtime PM from
suspending those devices (fix from Tero Kristo)"
* tag 'pm-urgent-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / QoS: Fix default runtime_pm device resume latency
It turns out that some drivers seem to think it's ok to remap page
ranges from within interrupts and even NMI's. That is definitely not
the case, since the page table build-up is simply not interrupt-safe.
This showed up in the zero-day robot that reported it for the ACPI APEI
GHES ("Generic Hardware Error Source") driver. Normally it had been
hidden by the fact that no page table operations had been needed because
the vmalloc area had been set up by other things.
Apparently due to a recent change to the GHEI driver: commit
77b246b32b ("acpi: apei: check for pending errors when probing GHES
entries") 0day actually caught a case during bootup whenthe ioremap
called down to page allocation. But that recent change only showed the
symptom, it wasn't the root cause of the problem.
Hopefully it is limited to just that one driver.
If you need to access random physical memory, you either need to ioremap
in process context, or you need to use the FIXMAP facility to set one
particular fixmap entry to the required mapping - that can be done safely.
Cc: Borislav Petkov <bp@suse.de>
Cc: Len Brown <lenb@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tyler Baicar <tbaicar@codeaurora.org>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The recent change to the PM QoS framework to introduce a proper
no constraint value overlooked to handle the devices which don't
implement PM QoS OPS. Runtime PM is one of the more severely
impacted subsystems, failing every attempt to runtime suspend
a device. This leads into some nasty second level issues like
probe failures and increased power consumption among other
things.
Fix this by adding a proper return value for devices that don't
implement PM QoS.
Fixes: 0cc2b4e5a0 (PM / QoS: Fix device resume latency PM QoS)
Signed-off-by: Tero Kristo <t-kristo@ti.com>
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull networking fixes from David Miller:
1) Fix route leak in xfrm_bundle_create().
2) In mac80211, validate user rate mask before configuring it. From
Johannes Berg.
3) Properly enforce memory limits in fair queueing code, from Toke
Hoiland-Jorgensen.
4) Fix lockdep splat in inet_csk_route_req(), from Eric Dumazet.
5) Fix TSO header allocation and management in mvpp2 driver, from Yan
Markman.
6) Don't take socket lock in BH handler in strparser code, from Tom
Herbert.
7) Don't show sockets from other namespaces in AF_UNIX code, from
Andrei Vagin.
8) Fix double free in error path of tap_open(), from Girish Moodalbail.
9) Fix TX map failure path in igb and ixgbe, from Jean-Philippe Brucker
and Alexander Duyck.
10) Fix DCB mode programming in stmmac driver, from Jose Abreu.
11) Fix err_count handling in various tunnels (ipip, ip6_gre). From Xin
Long.
12) Properly align SKB head before building SKB in tuntap, from Jason
Wang.
13) Avoid matching qdiscs with a zero handle during lookups, from Cong
Wang.
14) Fix various endianness bugs in sctp, from Xin Long.
15) Fix tc filter callback races and add selftests which trigger the
problem, from Cong Wang.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (73 commits)
selftests: Introduce a new test case to tc testsuite
selftests: Introduce a new script to generate tc batch file
net_sched: fix call_rcu() race on act_sample module removal
net_sched: add rtnl assertion to tcf_exts_destroy()
net_sched: use tcf_queue_work() in tcindex filter
net_sched: use tcf_queue_work() in rsvp filter
net_sched: use tcf_queue_work() in route filter
net_sched: use tcf_queue_work() in u32 filter
net_sched: use tcf_queue_work() in matchall filter
net_sched: use tcf_queue_work() in fw filter
net_sched: use tcf_queue_work() in flower filter
net_sched: use tcf_queue_work() in flow filter
net_sched: use tcf_queue_work() in cgroup filter
net_sched: use tcf_queue_work() in bpf filter
net_sched: use tcf_queue_work() in basic filter
net_sched: introduce a workqueue for RCU callbacks of tc filter
sctp: fix some type cast warnings introduced since very beginning
sctp: fix a type cast warnings that causes a_rwnd gets the wrong value
sctp: fix some type cast warnings introduced by transport rhashtable
sctp: fix some type cast warnings introduced by stream reconf
...
Cong Wang says:
====================
net_sched: fix races with RCU callbacks
Recently, the RCU callbacks used in TC filters and TC actions keep
drawing my attention, they introduce at least 4 race condition bugs:
1. A simple one fixed by Daniel:
commit c78e1746d3
Author: Daniel Borkmann <daniel@iogearbox.net>
Date: Wed May 20 17:13:33 2015 +0200
net: sched: fix call_rcu() race on classifier module unloads
2. A very nasty one fixed by me:
commit 1697c4bb52
Author: Cong Wang <xiyou.wangcong@gmail.com>
Date: Mon Sep 11 16:33:32 2017 -0700
net_sched: carefully handle tcf_block_put()
3. Two more bugs found by Chris:
https://patchwork.ozlabs.org/patch/826696/https://patchwork.ozlabs.org/patch/826695/
Usually RCU callbacks are simple, however for TC filters and actions,
they are complex because at least TC actions could be destroyed
together with the TC filter in one callback. And RCU callbacks are
invoked in BH context, without locking they are parallel too. All of
these contribute to the cause of these nasty bugs.
Alternatively, we could also:
a) Introduce a spinlock to serialize these RCU callbacks. But as I
said in commit 1697c4bb52 ("net_sched: carefully handle
tcf_block_put()"), it is very hard to do because of tcf_chain_dump().
Potentially we need to do a lot of work to make it possible (if not
impossible).
b) Just get rid of these RCU callbacks, because they are not
necessary at all, callers of these call_rcu() are all on slow paths
and holding RTNL lock, so blocking is allowed in their contexts.
However, David and Eric dislike adding synchronize_rcu() here.
As suggested by Paul, we could defer the work to a workqueue and
gain the permission of holding RTNL again without any performance
impact, however, in tcf_block_put() we could have a deadlock when
flushing workqueue while hodling RTNL lock, the trick here is to
defer the work itself in workqueue and make it queued after all
other works so that we keep the same ordering to avoid any
use-after-free. Please see the first patch for details.
Patch 1 introduces the infrastructure, patch 2~12 move each
tc filter to the new tc filter workqueue, patch 13 adds
an assertion to catch potential bugs like this, patch 14
closes another rcu callback race, patch 15 and patch 16 add
new test cases.
====================
Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In this patchset, we fixed a tc bug. This patch adds the test case
that reproduces the bug. To run this test case, user should specify
an existing NIC device:
# sudo ./tdc.py -d enp4s0f0
This test case belongs to category "flower". If user doesn't specify
a NIC device, the test cases belong to "flower" will not be run.
In this test case, we create 1M filters and all filters share the same
action. When destroying all filters, kernel should not panic. It takes
about 18s to run it.
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Lucas Bates <lucasb@mojatatu.com>
Signed-off-by: Chris Mi <chrism@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
# ./tdc_batch.py -h
usage: tdc_batch.py [-h] [-n NUMBER] [-o] [-s] [-p] device file
TC batch file generator
positional arguments:
device device name
file batch file name
optional arguments:
-h, --help show this help message and exit
-n NUMBER, --number NUMBER
how many lines in batch file
-o, --skip_sw skip_sw (offload), by default skip_hw
-s, --share_action all filters share the same action
-p, --prio all filters have different prio
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Lucas Bates <lucasb@mojatatu.com>
Signed-off-by: Chris Mi <chrism@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to commit c78e1746d3
("net: sched: fix call_rcu() race on classifier module unloads"),
we need to wait for flying RCU callback tcf_sample_cleanup_rcu().
Cc: Yotam Gigi <yotamg@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After previous patches, it is now safe to claim that
tcf_exts_destroy() is always called with RTNL lock.
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces a dedicated workqueue for tc filters
so that each tc filter's RCU callback could defer their
action destroy work to this workqueue. The helper
tcf_queue_work() is introduced for them to use.
Because we hold RTNL lock when calling tcf_block_put(), we
can not simply flush works inside it, therefore we have to
defer it again to this workqueue and make sure all flying RCU
callbacks have already queued their work before this one, in
other words, to ensure this is the last one to execute to
prevent any use-after-free.
On the other hand, this makes tcf_block_put() ugly and
harder to understand. Since David and Eric strongly dislike
adding synchronize_rcu(), this is probably the only
solution that could make everyone happy.
Please also see the code comments below.
Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long says:
====================
sctp: a bunch of fixes for some sparse warnings
As Eric noticed, when running 'make C=2 M=net/sctp/', a plenty of
warnings or errors checked by sparse appear. They are all problems
about Endian and type cast.
Most of them are just warnings by which no issues could be caused
while some might be bugs.
This patchset fixes them with four patches basically according to
how they are introduced.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
These warnings were found by running 'make C=2 M=net/sctp/'.
They are there since very beginning.
Note after this patch, there still one warning left in
sctp_outq_flush():
sctp_chunk_fail(chunk, SCTP_ERROR_INV_STRM)
Since it has been moved to sctp_stream_outq_migrate on net-next,
to avoid the extra job when merging net-next to net, I will post
the fix for it after the merging is done.
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These warnings were found by running 'make C=2 M=net/sctp/'.
Commit d4d6fb5787 ("sctp: Try not to change a_rwnd when faking a
SACK from SHUTDOWN.") expected to use the peers old rwnd and add
our flight size to the a_rwnd. But with the wrong Endian, it may
not work as well as expected.
So fix it by converting to the right value.
Fixes: d4d6fb5787 ("sctp: Try not to change a_rwnd when faking a SACK from SHUTDOWN.")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These warnings were found by running 'make C=2 M=net/sctp/'.
They are introduced by not aware of Endian for the port when
coding transport rhashtable patches.
Fixes: 7fda702f93 ("sctp: use new rhlist interface on sctp transport rhashtable")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These warnings were found by running 'make C=2 M=net/sctp/'.
They are introduced by not aware of Endian when coding stream
reconf patches.
Since commit c0d8bab6ae ("sctp: add get and set sockopt for
reconf_enable") enabled stream reconf feature for users, the
Fixes tag below would use it.
Fixes: c0d8bab6ae ("sctp: add get and set sockopt for reconf_enable")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Davide found the following script triggers a NULL pointer
dereference:
ip l a name eth0 type dummy
tc q a dev eth0 parent :1 handle 1: htb
This is because for a freshly created netdevice noop_qdisc
is attached and when passing 'parent :1', kernel actually
tries to match the major handle which is 0 and noop_qdisc
has handle 0 so is matched by mistake. Commit 69012ae425
tries to fix a similar bug but still misses this case.
Handle 0 is not a valid one, should be just skipped. In
fact, kernel uses it as TC_H_UNSPEC.
Fixes: 69012ae425 ("net: sched: fix handling of singleton qdiscs with qdisc_hash")
Fixes: 59cc1f61f0 ("net: sched:convert qdisc linked list to hashtable")
Reported-by: Davide Caratti <dcaratti@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now when migrating sock to another one in sctp_sock_migrate(), it only
resets owner sk for the data in receive queues, not the chunks on out
queues.
It would cause that data chunks length on the sock is not consistent
with sk sk_wmem_alloc. When closing the sock or freeing these chunks,
the old sk would never be freed, and the new sock may crash due to
the overflow sk_wmem_alloc.
syzbot found this issue with this series:
r0 = socket$inet_sctp()
sendto$inet(r0)
listen(r0)
accept4(r0)
close(r0)
Although listen() should have returned error when one TCP-style socket
is in connecting (I may fix this one in another patch), it could also
be reproduced by peeling off an assoc.
This issue is there since very beginning.
This patch is to reset owner sk for the chunks on out queues so that
sk sk_wmem_alloc has correct value after accept one sock or peeloff
an assoc to one sock.
Note that when resetting owner sk for chunks on outqueue, it has to
sctp_clear_owner_w/skb_orphan chunks before changing assoc->base.sk
first and then sctp_set_owner_w them after changing assoc->base.sk,
due to that sctp_wfree and it's callees are using assoc->base.sk.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend says:
====================
net: sockmap fixes
Last two fixes (as far as I know) for sockmap code this round.
First, we are using the qdisc cb structure when making the data end
calculation. This is really just wrong so, store it with the other
metadata in the correct tcp_skb_cb sturct to avoid breaking things.
Next, with recent work to attach multiple programs to a cgroup a
specific enumeration of return codes was agreed upon. However,
I wrote the sk_skb program types before seeing this work and used
a different convention. Patch 2 in the series aligns the return
codes to avoid breaking with this infrastructure and also aligns
with other programming conventions to avoid being the odd duck out
forcing programs to remember SK_SKB programs are different. Pusing
to net because its a user visible change. With this SK_SKB program
return codes are the same as other cgroup program types.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Recent additions to support multiple programs in cgroups impose
a strict requirement, "all yes is yes, any no is no". To enforce
this the infrastructure requires the 'no' return code, SK_DROP in
this case, to be 0.
To apply these rules to SK_SKB program types the sk_actions return
codes need to be adjusted.
This fix adds SK_PASS and makes 'SK_DROP = 0'. Finally, remove
SK_ABORTED to remove any chance that the API may allow aborted
program flows to be passed up the stack. This would be incorrect
behavior and allow programs to break existing policies.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
SK_SKB program types use bpf_compute_data to store the end of the
packet data. However, bpf_compute_data assumes the cb is stored in the
qdisc layer format. But, for SK_SKB this is the wrong layer of the
stack for this type.
It happens to work (sort of!) because in most cases nothing happens
to be overwritten today. This is very fragile and error prone.
Fortunately, we have another hole in tcp_skb_cb we can use so lets
put the data_end value there.
Note, SK_SKB program types do not use data_meta, they are failed by
sk_skb_is_valid_access().
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>