A common issue in Makefile is a race in parallel building.
You need to be careful to prevent multiple threads from writing to the
same file simultaneously.
Commit 3939f33450 ("ARM: 8418/1: add boot image dependencies to not
generate invalid images") addressed such a bad scenario.
A similar symptom occurs with the following command:
$ make -j$(nproc) ARCH=riscv Image Image.gz loader loader.bin vmlinuz.efi
[ snip ]
SORTTAB vmlinux
OBJCOPY arch/riscv/boot/Image
OBJCOPY arch/riscv/boot/Image
OBJCOPY arch/riscv/boot/Image
OBJCOPY arch/riscv/boot/Image
OBJCOPY arch/riscv/boot/Image
GZIP arch/riscv/boot/Image.gz
AS arch/riscv/boot/loader.o
AS arch/riscv/boot/loader.o
Kernel: arch/riscv/boot/Image is ready
PAD arch/riscv/boot/vmlinux.bin
GZIP arch/riscv/boot/vmlinuz
Kernel: arch/riscv/boot/loader is ready
OBJCOPY arch/riscv/boot/loader.bin
Kernel: arch/riscv/boot/loader.bin is ready
Kernel: arch/riscv/boot/Image.gz is ready
OBJCOPY arch/riscv/boot/vmlinuz.o
LD arch/riscv/boot/vmlinuz.efi.elf
OBJCOPY arch/riscv/boot/vmlinuz.efi
Kernel: arch/riscv/boot/vmlinuz.efi is ready
The log "OBJCOPY arch/riscv/boot/Image" is displayed 5 times.
(also "AS arch/riscv/boot/loader.o" twice.)
It indicates that 5 threads simultaneously enter arch/riscv/boot/
and write to arch/riscv/boot/Image.
It occasionally leads to a build failure:
$ make -j$(nproc) ARCH=riscv Image Image.gz loader loader.bin vmlinuz.efi
[ snip ]
SORTTAB vmlinux
OBJCOPY arch/riscv/boot/Image
OBJCOPY arch/riscv/boot/Image
OBJCOPY arch/riscv/boot/Image
OBJCOPY arch/riscv/boot/Image
PAD arch/riscv/boot/vmlinux.bin
truncate: Invalid number: 'arch/riscv/boot/vmlinux.bin'
make[2]: *** [drivers/firmware/efi/libstub/Makefile.zboot:13: arch/riscv/boot/vmlinux.bin] Error 1
make[2]: *** Deleting file 'arch/riscv/boot/vmlinux.bin'
make[1]: *** [arch/riscv/Makefile:167: vmlinuz.efi] Error 2
make[1]: *** Waiting for unfinished jobs....
Kernel: arch/riscv/boot/Image is ready
GZIP arch/riscv/boot/Image.gz
AS arch/riscv/boot/loader.o
AS arch/riscv/boot/loader.o
Kernel: arch/riscv/boot/loader is ready
OBJCOPY arch/riscv/boot/loader.bin
Kernel: arch/riscv/boot/loader.bin is ready
Kernel: arch/riscv/boot/Image.gz is ready
make: *** [Makefile:234: __sub-make] Error 2
Image.gz, loader, vmlinuz.efi depend on Image. loader.bin depends
on loader. Such dependencies are not specified in arch/riscv/Makefile.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Samuel Holland <samuel.holland@sifive.com>
Tested-by: Samuel Holland <samuel.holland@sifive.com>
Link: https://lore.kernel.org/r/20231119100024.2370992-1-masahiroy@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Björn Töpel <bjorn@kernel.org> says:
This series includes a three ftrace improvements for RISC-V:
1. Do not require to run recordmcount at build time (patch 1)
2. Simplification of the function graph functionality (patch 2)
3. Enable DYNAMIC_FTRACE_WITH_DIRECT_CALLS (patch 3 and 4)
The series has been tested on Qemu/rv64 virt/Debian sid with the
following test configs:
CONFIG_FTRACE_SELFTEST=y
CONFIG_FTRACE_STARTUP_TEST=y
CONFIG_SAMPLE_FTRACE_DIRECT=m
CONFIG_SAMPLE_FTRACE_DIRECT_MULTI=m
CONFIG_SAMPLE_FTRACE_OPS=m
All tests pass.
* b4-shazam-merge:
samples: ftrace: Add RISC-V support for SAMPLE_FTRACE_DIRECT[_MULTI]
riscv: ftrace: Add DYNAMIC_FTRACE_WITH_DIRECT_CALLS support
riscv: ftrace: Make function graph use ftrace directly
riscv: select FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY
Link: https://lore.kernel.org/r/20231130121531.1178502-1-bjorn@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Select the DYNAMIC_FTRACE_WITH_DIRECT_CALLS to provide the
register_ftrace_direct[_multi] interfaces allowing users to register
the customed trampoline (direct_caller) as the mcount for one or more
target functions. And modify_ftrace_direct[_multi] are also provided
for modifying direct_caller.
To make the direct_caller and the other ftrace hooks (e.g.
function/fgraph tracer, k[ret]probes) co-exist, a temporary register
is nominated to store the address of direct_caller in
ftrace_regs_caller. After the setting of the address direct_caller by
direct_ops->func and the RESTORE_REGS in ftrace_regs_caller,
direct_caller will be jumped to by the `jr` inst.
Add DYNAMIC_FTRACE_WITH_DIRECT_CALLS support for RISC-V.
Signed-off-by: Song Shuai <suagrfillet@gmail.com>
Tested-by: Guo Ren <guoren@kernel.org>
Signed-off-by: Guo Ren <guoren@kernel.org>
Acked-by: Björn Töpel <bjorn@rivosinc.com>
Link: https://lore.kernel.org/r/20231130121531.1178502-4-bjorn@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Similar to commit 0c0593b45c ("x86/ftrace: Make function graph use
ftrace directly") and commit c4a0ebf87c ("arm64/ftrace: Make
function graph use ftrace directly"), RISC-V has no need for a special
graph tracer hook. The graph_ops::func function can be used to install
the return_hooker.
This cleanup only changes the FTRACE_WITH_REGS implementation, leaving
the mcount-based implementation is unaffected.
Perform the simplification, and also cleanup the register save/restore
macros.
Signed-off-by: Song Shuai <suagrfillet@gmail.com>
Tested-by: Guo Ren <guoren@kernel.org>
Signed-off-by: Guo Ren <guoren@kernel.org>
Acked-by: Björn Töpel <bjorn@rivosinc.com>
Link: https://lore.kernel.org/r/20231130121531.1178502-3-bjorn@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Nathan Chancellor <nathan@kernel.org> says:
This series disables DWARF5 for LLVM versions where it is known to be
broken due to linker relaxation.
* b4-shazam-merge:
lib/Kconfig.debug: Update AS_HAS_NON_CONST_LEB128 comment and name
riscv: Restrict DWARF5 when building with LLVM to known working versions
riscv: Hoist linker relaxation disabling logic into Kconfig
Link: bbc0f99f3b
Link: https://lore.kernel.org/r/20231205-riscv-restrict-dwarf5-llvm-v2-0-aedf00a382ac@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
LLVM prior to 18.0.0 would generate incorrect debug info for DWARF5 due
to linker relaxation, which was worked around in clang by defaulting
RISC-V to DWARF4 [1]. Unfortunately, this workaround does not work for
the kernel because the DWARF version can be independently changed from
the default in Kconfig.
Do not allow DWARF5 to be selected for RISC-V when using linker
relaxation (ld.lld >= 15.0.0) and a version of LLVM that does not have
the fixes (the integrated assembler [2] and ld.lld [3] < 18.0.0)
necessary to generate the correct debug info.
Link: bbc0f99f3b [1]
Link: 1df5ea29b4 [2]
Link: 7ffabb61a5 [3]
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Fangrui Song <maskray@google.com>
Link: https://lore.kernel.org/r/20231205-riscv-restrict-dwarf5-llvm-v2-2-aedf00a382ac@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Certain configurations may need to be disabled if linker relaxation is
in use, such as DWARF5 with ld.lld < 18. Hoist the logic of whether or
not linker relaxation is in use into Kconfig so decisions can be made at
configuration time.
Reviewed-by: Fangrui Song <maskray@google.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20231205-riscv-restrict-dwarf5-llvm-v2-1-aedf00a382ac@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Charlie Jenkins <charlie@rivosinc.com> says:
Each architecture generally implements fine-tuned checksum functions to
leverage the instruction set. This patch adds the main checksum
functions that are used in networking. Tested on QEMU, this series
allows the CHECKSUM_KUNIT tests to complete an average of 50.9% faster.
This patch takes heavy use of the Zbb extension using alternatives
patching.
To test this patch, enable the configs for KUNIT, then CHECKSUM_KUNIT.
I have attempted to make these functions as optimal as possible, but I
have not ran anything on actual riscv hardware. My performance testing
has been limited to inspecting the assembly, running the algorithms on
x86 hardware, and running in QEMU.
ip_fast_csum is a relatively small function so even though it is
possible to read 64 bits at a time on compatible hardware, the
bottleneck becomes the clean up and setup code so loading 32 bits at a
time is actually faster.
* b4-shazam-merge:
kunit: Add tests for csum_ipv6_magic and ip_fast_csum
riscv: Add checksum library
riscv: Add checksum header
riscv: Add static key for misaligned accesses
asm-generic: Improve csum_fold
Link: https://lore.kernel.org/r/20240108-optimize_checksum-v15-0-1c50de5f2167@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Provide a 32 and 64 bit version of do_csum. When compiled for 32-bit
will load from the buffer in groups of 32 bits, and when compiled for
64-bit will load in groups of 64 bits.
Additionally provide riscv optimized implementation of csum_ipv6_magic.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Xiao Wang <xiao.w.wang@intel.com>
Link: https://lore.kernel.org/r/20240108-optimize_checksum-v15-4-1c50de5f2167@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Provide checksum algorithms that have been designed to leverage riscv
instructions such as rotate. In 64-bit, can take advantage of the larger
register to avoid some overflow checking.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Xiao Wang <xiao.w.wang@intel.com>
Link: https://lore.kernel.org/r/20240108-optimize_checksum-v15-3-1c50de5f2167@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Support static branches depending on the value of misaligned accesses.
This will be used by a later patch in the series. At any point in time,
this static branch will only be enabled if all online CPUs are
considered "fast".
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Evan Green <evan@rivosinc.com>
Link: https://lore.kernel.org/r/20240108-optimize_checksum-v15-2-1c50de5f2167@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Andy Chiu <andy.chiu@sifive.com> says:
This series provides support running Vector in kernel mode.
Additionally, kernel-mode Vector can be configured to run without
turnning off preemption on a CONFIG_PREEMPT kernel. Along with the
suport, we add Vector optimized copy_{to,from}_user. And provide a
simple threshold to decide when to run the vectorized functions.
We decided to drop vectorized memcpy/memset/memmove for the moment due
to the concern of memory side-effect in kernel_vector_begin(). The
detailed description can be found at v9[0]
This series is composed by 4 parts:
patch 1-4: adds basic support for kernel-mode Vector
patch 5: includes vectorized copy_{to,from}_user into the kernel
patch 6: refactor context switch code in fpu [1]
patch 7-10: provides some code refactors and support for preemptible
kernel-mode Vector.
This series can be merged if we feel any part of {1~4, 5, 6, 7~10} is
mature enough.
This patch is tested on a QEMU with V and verified that booting, normal
userspace operations all work as usual with thresholds set to 0. Also,
we test by launching multiple kernel threads which continuously executes
and verifies Vector operations in the background. The module that tests
these operation is expected to be upstream later.
* b4-shazam-merge:
riscv: vector: allow kernel-mode Vector with preemption
riscv: vector: use kmem_cache to manage vector context
riscv: vector: use a mask to write vstate_ctrl
riscv: vector: do not pass task_struct into riscv_v_vstate_{save,restore}()
riscv: fpu: drop SR_SD bit checking
riscv: lib: vectorize copy_to_user/copy_from_user
riscv: sched: defer restoring Vector context for user
riscv: Add vector extension XOR implementation
riscv: vector: make Vector always available for softirq context
riscv: Add support for kernel mode vector
Link: https://lore.kernel.org/r/20240115055929.4736-1-andy.chiu@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Add kernel_vstate to keep track of kernel-mode Vector registers when
trap introduced context switch happens. Also, provide riscv_v_flags to
let context save/restore routine track context status. Context tracking
happens whenever the core starts its in-kernel Vector executions. An
active (dirty) kernel task's V contexts will be saved to memory whenever
a trap-introduced context switch happens. Or, when a softirq, which
happens to nest on top of it, uses Vector. Context retoring happens when
the execution transfer back to the original Kernel context where it
first enable preempt_v.
Also, provide a config CONFIG_RISCV_ISA_V_PREEMPTIVE to give users an
option to disable preemptible kernel-mode Vector at build time. Users
with constraint memory may want to disable this config as preemptible
kernel-mode Vector needs extra space for tracking of per thread's
kernel-mode V context. Or, users might as well want to disable it if all
kernel-mode Vector code is time sensitive and cannot tolerate context
switch overhead.
Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Tested-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Link: https://lore.kernel.org/r/20240115055929.4736-11-andy.chiu@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
The allocation size of thread.vstate.datap is always riscv_v_vsize. So
it is possbile to use kmem_cache_* to manage the allocation. This gives
users more information regarding allocation of vector context via
/proc/slabinfo. And it potentially reduces the latency of the first-use
trap because of the allocation caches.
Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Tested-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Link: https://lore.kernel.org/r/20240115055929.4736-10-andy.chiu@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
riscv_v_ctrl_set() should only touch bits within
PR_RISCV_V_VSTATE_CTRL_MASK. So, use the mask when we really set task's
vstate_ctrl.
Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Tested-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Link: https://lore.kernel.org/r/20240115055929.4736-9-andy.chiu@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
riscv_v_vstate_{save,restore}() can operate only on the knowlege of
struct __riscv_v_ext_state, and struct pt_regs. Let the caller decides
which should be passed into the function. Meanwhile, the kernel-mode
Vector is going to introduce another vstate, so this also makes functions
potentially able to be reused.
Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Tested-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Link: https://lore.kernel.org/r/20240115055929.4736-8-andy.chiu@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
SR_SD summarizes the dirty status of FS/VS/XS. However, the current code
structure does not fully utilize it because each extension specific code
is divided into an individual segment. So remove the SR_SD check for
now.
Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
Reviewed-by: Song Shuai <songshuaishuai@tinylab.org>
Reviewed-by: Guo Ren <guoren@kernel.org>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Tested-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Link: https://lore.kernel.org/r/20240115055929.4736-7-andy.chiu@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
This patch utilizes Vector to perform copy_to_user/copy_from_user. If
Vector is available and the size of copy is large enough for Vector to
perform better than scalar, then direct the kernel to do Vector copies
for userspace. Though the best programming practice for users is to
reduce the copy, this provides a faster variant when copies are
inevitable.
The optimal size for using Vector, copy_to_user_thres, is only a
heuristic for now. We can add DT parsing if people feel the need of
customizing it.
The exception fixup code of the __asm_vector_usercopy must fallback to
the scalar one because accessing user pages might fault, and must be
sleepable. Current kernel-mode Vector does not allow tasks to be
preemptible, so we must disactivate Vector and perform a scalar fallback
in such case.
The original implementation of Vector operations comes from
https://github.com/sifive/sifive-libc, which we agree to contribute to
Linux kernel.
Co-developed-by: Jerry Shih <jerry.shih@sifive.com>
Signed-off-by: Jerry Shih <jerry.shih@sifive.com>
Co-developed-by: Nick Knight <nick.knight@sifive.com>
Signed-off-by: Nick Knight <nick.knight@sifive.com>
Suggested-by: Guo Ren <guoren@kernel.org>
Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Tested-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Link: https://lore.kernel.org/r/20240115055929.4736-6-andy.chiu@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
User will use its Vector registers only after the kernel really returns
to the userspace. So we can delay restoring Vector registers as long as
we are still running in kernel mode. So, add a thread flag to indicates
the need of restoring Vector and do the restore at the last
arch-specific exit-to-user hook. This save the context restoring cost
when we switch over multiple processes that run V in kernel mode. For
example, if the kernel performs a context swicth from A->B->C, and
returns to C's userspace, then there is no need to restore B's
V-register.
Besides, this also prevents us from repeatedly restoring V context when
executing kernel-mode Vector multiple times.
The cost of this is that we must disable preemption and mark vector as
busy during vstate_{save,restore}. Because then the V context will not
get restored back immediately when a trap-causing context switch happens
in the middle of vstate_{save,restore}.
Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Tested-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Link: https://lore.kernel.org/r/20240115055929.4736-5-andy.chiu@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
This patch adds support for vector optimized XOR and it is tested in
qemu.
Co-developed-by: Han-Kuan Chen <hankuan.chen@sifive.com>
Signed-off-by: Han-Kuan Chen <hankuan.chen@sifive.com>
Signed-off-by: Greentime Hu <greentime.hu@sifive.com>
Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Tested-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Link: https://lore.kernel.org/r/20240115055929.4736-4-andy.chiu@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
The goal of this patch is to provide full support of Vector in kernel
softirq context. So that some of the crypto alogrithms won't need scalar
fallbacks.
By disabling bottom halves in active kernel-mode Vector, softirq will
not be able to nest on top of any kernel-mode Vector. So, softirq
context is able to use Vector whenever it runs.
After this patch, Vector context cannot start with irqs disabled.
Otherwise local_bh_enable() may run in a wrong context.
Disabling bh is not enough for RT-kernel to prevent preeemption. So
we must disable preemption, which also implies disabling bh on RT.
Related-to: commit 696207d425 ("arm64/sve: Make kernel FPU protection RT friendly")
Related-to: commit 66c3ec5a71 ("arm64: neon: Forbid when irqs are disabled")
Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Tested-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Link: https://lore.kernel.org/r/20240115055929.4736-3-andy.chiu@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Add kernel_vector_begin() and kernel_vector_end() function declarations
and corresponding definitions in kernel_mode_vector.c
These are needed to wrap uses of vector in kernel mode.
Co-developed-by: Vincent Chen <vincent.chen@sifive.com>
Signed-off-by: Vincent Chen <vincent.chen@sifive.com>
Signed-off-by: Greentime Hu <greentime.hu@sifive.com>
Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Tested-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Link: https://lore.kernel.org/r/20240115055929.4736-2-andy.chiu@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
guoren@kernel.org <guoren@kernel.org> says:
From: Guo Ren <guoren@linux.alibaba.com>
When the task is in COMPAT mode, the TASK_SIZE should be 2GB, so
STACK_TOP_MAX and arch_get_mmap_end must be limited to 2 GB. This series
fixes the problem made by commit: add2cc6b65 ("RISC-V: mm: Restrict
address space for sv39,sv48,sv57") and optimizes the related coding
convention of TASK_SIZE.
* b4-shazam-merge:
riscv: mm: Fixup compat arch_get_mmap_end
riscv: mm: Fixup compat mode boot failure
Link: https://lore.kernel.org/r/20231222115703.2404036-1-guoren@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
When the task is in COMPAT mode, the arch_get_mmap_end should be 2GB,
not TASK_SIZE_64. The TASK_SIZE has contained is_compat_mode()
detection, so change the definition of STACK_TOP_MAX to TASK_SIZE
directly.
Cc: stable@vger.kernel.org
Fixes: add2cc6b65 ("RISC-V: mm: Restrict address space for sv39,sv48,sv57")
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
Reviewed-by: Leonardo Bras <leobras@redhat.com>
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Link: https://lore.kernel.org/r/20231222115703.2404036-3-guoren@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
In COMPAT mode, the STACK_TOP is DEFAULT_MAP_WINDOW (0x80000000), but
the TASK_SIZE is 0x7fff000. When the user stack is upon 0x7fff000, it
will cause a user segment fault. Sometimes, it would cause boot
failure when the whole rootfs is rv32.
Freeing unused kernel image (initmem) memory: 2236K
Run /sbin/init as init process
Starting init: /sbin/init exists but couldn't execute it (error -14)
Run /etc/init as init process
...
Increase the TASK_SIZE to cover STACK_TOP.
Cc: stable@vger.kernel.org
Fixes: add2cc6b65 ("RISC-V: mm: Restrict address space for sv39,sv48,sv57")
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
Reviewed-by: Leonardo Bras <leobras@redhat.com>
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Link: https://lore.kernel.org/r/20231222115703.2404036-2-guoren@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
The ending NULL is not taken into account by strncat(), so switch to
strlcat() to correctly compute the size of the available memory when
appending CONFIG_CMDLINE to 'early_cmdline'.
Fixes: 26e7aacb83 ("riscv: Allow to downgrade paging mode from the command line")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/9f66d2b58c8052d4055e90b8477ee55d9a0914f9.1698564026.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Allow to defer the flushing of the TLB when unmapping pages, which allows
to reduce the numbers of IPI and the number of sfence.vma.
The ubenchmarch used in commit 43b3dfdd04 ("arm64: support
batched/deferred tlb shootdown during page reclamation/migration") that
was multithreaded to force the usage of IPI shows good performance
improvement on all platforms:
* Unmatched: ~34%
* TH1520 : ~78%
* Qemu : ~81%
In addition, perf on qemu reports an important decrease in time spent
dealing with IPIs:
Before: 68.17% main [kernel.kallsyms] [k] __sbi_rfence_v02_call
After : 8.64% main [kernel.kallsyms] [k] __sbi_rfence_v02_call
* Benchmark:
int stick_this_thread_to_core(int core_id) {
int num_cores = sysconf(_SC_NPROCESSORS_ONLN);
if (core_id < 0 || core_id >= num_cores)
return EINVAL;
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(core_id, &cpuset);
pthread_t current_thread = pthread_self();
return pthread_setaffinity_np(current_thread,
sizeof(cpu_set_t), &cpuset);
}
static void *fn_thread (void *p_data)
{
int ret;
pthread_t thread;
stick_this_thread_to_core((int)p_data);
while (1) {
sleep(1);
}
return NULL;
}
int main()
{
volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
pthread_t threads[4];
int ret;
for (int i = 0; i < 4; ++i) {
ret = pthread_create(&threads[i], NULL, fn_thread, (void *)i);
if (ret)
{
printf("%s", strerror (ret));
}
}
memset(p, 0x88, SIZE);
for (int k = 0; k < 10000; k++) {
/* swap in */
for (int i = 0; i < SIZE; i += 4096) {
(void)p[i];
}
/* swap out */
madvise(p, SIZE, MADV_PAGEOUT);
}
for (int i = 0; i < 4; i++)
{
pthread_cancel(threads[i]);
}
for (int i = 0; i < 4; i++)
{
pthread_join(threads[i], NULL);
}
return 0;
}
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Reviewed-by: Jisheng Zhang <jszhang@kernel.org>
Tested-by: Jisheng Zhang <jszhang@kernel.org> # Tested on TH1520
Tested-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20240108193640.344929-1-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
This will allow better TLB utilization and then should be more performant.
Before:
---[ vmemmap start ]---
0xffff8d8002000000-0xffff8d8012000000 0x000000046ec00000 256M PTE . .. .. D A G . . W R V
---[ vmemmap end ]---
After:
---[ vmemmap start ]---
0xffff8d8002000000-0xffff8d8012000000 0x000000046ec00000 256M PMD . .. .. D A G . . W R V
---[ vmemmap end ]---
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20231214132935.212864-1-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Jisheng Zhang <jszhang@kernel.org> says:
Previously, we use alternative mechanism to dynamically patch
the CMO operations for THEAD C906/C910 during boot for performance
reason. But as pointed out by Arnd, "there is already a significant
cost in accessing the invalidated cache lines afterwards, which is
likely going to be much higher than the cost of an indirect branch".
And indeed, there's no performance difference with GMAC and EMMC per
my test on Sipeed Lichee Pi 4A board.
Use riscv_nonstd_cache_ops for THEAD C906/C910 CMO to simplify
the alternative code, and to acchieve Arnd's goal -- "I think
moving the THEAD ops at the same level as all nonstandard operations
makes sense, but I'd still leave CMO as an explicit fast path that
avoids the indirect branch. This seems like the right thing to do both
for readability and for platforms on which the indirect branch has a
noticeable overhead."
To make bisect easy, I use two patches here: patch1 does the conversion
which just mimics current CMO behavior via. riscv_nonstd_cache_ops, I
assume no functionalities changes. patch2 uses T-HEAD PA based CMO
instructions so that we don't need to covert PA to VA.
* b4-shazam-merge:
riscv: errata: thead: use pa based instructions for CMO
riscv: errata: thead: use riscv_nonstd_cache_ops for CMO
Link: https://lore.kernel.org/r/20231114143338.2406-1-jszhang@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Anup Patel <apatel@ventanamicro.com> says:
The SBI v2.0 specification is now frozen. The SBI v2.0 specification defines
SBI debug console (DBCN) extension which replaces the legacy SBI v0.1
functions sbi_console_putchar() and sbi_console_getchar().
(Refer v2.0-rc5 at https://github.com/riscv-non-isa/riscv-sbi-doc/releases)
This series adds support for SBI debug console (DBCN) extension in
Linux RISC-V.
To try these patches with KVM RISC-V, use KVMTOOL from the
riscv_zbx_zicntr_smstateen_condops_v1 branch at:
https://github.com/avpatel/kvmtool.git
* b4-shazam-merge:
RISC-V: Enable SBI based earlycon support
tty: Add SBI debug console support to HVC SBI driver
tty/serial: Add RISC-V SBI debug console based earlycon
RISC-V: Add SBI debug console helper routines
RISC-V: Add stubs for sbi_console_putchar/getchar()
Link: https://lore.kernel.org/r/20231124070905.1043092-1-apatel@ventanamicro.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
When the SUSP SBI extension is present it implies that the standard
"suspend to RAM" type is available. Wire it up to the generic
platform suspend support, also applying the already present support
for non-retentive CPU suspend. When the kernel is built with
CONFIG_SUSPEND, one can do 'echo mem > /sys/power/state' to suspend.
Resumption will occur when a platform-specific wake-up event arrives.
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Tested-by: Samuel Holland <samuel.holland@sifive.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20231206110807.35882-4-ajones@ventanamicro.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Charlie Jenkins <charlie@rivosinc.com> says:
When modules are loaded while there is not ample allocatable memory,
there was previously not proper error handling. This series fixes a
use-after-free error and a different issue that caused a non graceful
exit after memory was not properly allocated.
* b4-shazam-merge:
riscv: Fix relocation_hashtable size
riscv: Correctly free relocation hashtable on error
riscv: Fix module loading free order
Link: https://lore.kernel.org/r/20240104-module_loading_fix-v3-0-a71f8de6ce0f@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Jisheng Zhang <jszhang@kernel.org> says:
Some riscv implementations such as T-HEAD's C906, C908, C910 and C920
support efficient unaligned access, for performance reason we want
to enable HAVE_EFFICIENT_UNALIGNED_ACCESS on these platforms. To
avoid performance regressions on non efficient unaligned access
platforms, HAVE_EFFICIENT_UNALIGNED_ACCESS can't be globally selected.
To solve this problem, runtime code patching based on the detected
speed is a good solution. But that's not easy, it involves lots of
work to modify vairous subsystems such as net, mm, lib and so on.
This can be done step by step.
So let's take an easier solution: add support to efficient unaligned
access and hide the support under NONPORTABLE.
patch1 introduces RISCV_EFFICIENT_UNALIGNED_ACCESS which depends on
NONPORTABLE, if users know during config time that the kernel will be
only run on those efficient unaligned access hw platforms, they can
enable it. Obviously, generic unified kernel Image shouldn't enable it.
patch2 adds support DCACHE_WORD_ACCESS when MMU and
RISCV_EFFICIENT_UNALIGNED_ACCESS.
Below test program and step shows how much performance can be improved:
$ cat tt.c
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#define ITERATIONS 1000000
#define PATH "123456781234567812345678123456781"
int main(void)
{
unsigned long i;
struct stat buf;
for (i = 0; i < ITERATIONS; i++)
stat(PATH, &buf);
return 0;
}
$ gcc -O2 tt.c
$ touch 123456781234567812345678123456781
$ time ./a.out
Per my test on T-HEAD C910 platforms, the above test performance is
improved by about 7.5%.
* b4-shazam-merge:
riscv: select DCACHE_WORD_ACCESS for efficient unaligned access HW
riscv: introduce RISCV_EFFICIENT_UNALIGNED_ACCESS
Link: https://lore.kernel.org/r/20231225044207.3821-1-jszhang@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
T-HEAD CPUs such as C906/C910/C920 support phy address based CMO, use
them so that we don't need to convert to virt address.
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Reviewed-by: Guo Ren <guoren@kernel.org>
Link: https://lore.kernel.org/r/20231114143338.2406-3-jszhang@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Previously, we use alternative mechanism to dynamically patch
the CMO operations for THEAD C906/C910 during boot for performance
reason. But as pointed out by Arnd, "there is already a significant
cost in accessing the invalidated cache lines afterwards, which is
likely going to be much higher than the cost of an indirect branch".
And indeed, there's no performance difference with GMAC and EMMC per
my test on Sipeed Lichee Pi 4A board.
Use riscv_nonstd_cache_ops for THEAD C906/C910 CMO to simplify
the alternative code, and to acchieve Arnd's goal -- "I think
moving the THEAD ops at the same level as all nonstandard operations
makes sense, but I'd still leave CMO as an explicit fast path that
avoids the indirect branch. This seems like the right thing to do both
for readability and for platforms on which the indirect branch has a
noticeable overhead."
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Tested-by: Emil Renner Berthing <emil.renner.berthing@canonical.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20231114143338.2406-2-jszhang@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Let us enable SBI based earlycon support in defconfig for both RV32
and RV64 so that "earlycon=sbi" can be used again.
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20231124070905.1043092-6-apatel@ventanamicro.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Let us provide SBI debug console helper routines which can be
shared by serial/earlycon-riscv-sbi.c and hvc/hvc_riscv_sbi.c.
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20231124070905.1043092-3-apatel@ventanamicro.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
The functions sbi_console_putchar() and sbi_console_getchar() are
not defined when CONFIG_RISCV_SBI_V01 is disabled so let us add
stub of these functions to avoid "#ifdef" on user side.
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20231124070905.1043092-2-apatel@ventanamicro.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
When there is not enough allocatable memory for the relocation
hashtable, module loading should exit gracefully. Previously, this was
attempted to be accomplished by checking if an unsigned number is less
than zero which does not work. Instead have the caller check if the
hashtable was correctly allocated and add a comment explaining that
hashtable_bits that is 0 is valid.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
Fixes: d8792a5734 ("riscv: Safely remove entries from relocation list")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202312132019.iYGTwW0L-lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Closes: https://lore.kernel.org/r/202312120044.wTI1Uyaa-lkp@intel.com/
Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/r/20240104-module_loading_fix-v3-2-a71f8de6ce0f@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
DCACHE_WORD_ACCESS uses the word-at-a-time API for optimised string
comparisons in the vfs layer.
This patch implements support for load_unaligned_zeropad in much the
same way as has been done for arm64.
Here is the test program and step:
$ cat tt.c
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#define ITERATIONS 1000000
#define PATH "123456781234567812345678123456781"
int main(void)
{
unsigned long i;
struct stat buf;
for (i = 0; i < ITERATIONS; i++)
stat(PATH, &buf);
return 0;
}
$ gcc -O2 tt.c
$ touch 123456781234567812345678123456781
$ time ./a.out
Per my test on T-HEAD C910 platforms, the above test performance is
improved by about 7.5%.
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Link: https://lore.kernel.org/r/20231225044207.3821-3-jszhang@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Some riscv implementations such as T-HEAD's C906, C908, C910 and C920
support efficient unaligned access, for performance reason we want
to enable HAVE_EFFICIENT_UNALIGNED_ACCESS on these platforms. To
avoid performance regressions on other non efficient unaligned access
platforms, HAVE_EFFICIENT_UNALIGNED_ACCESS can't be globally selected.
To solve this problem, runtime code patching based on the detected
speed is a good solution. But that's not easy, it involves lots of
work to modify vairous subsystems such as net, mm, lib and so on.
This can be done step by step.
So let's take an easier solution: add support to efficient unaligned
access and hide the support under NONPORTABLE.
Now let's introduce RISCV_EFFICIENT_UNALIGNED_ACCESS which depends on
NONPORTABLE, if users know during config time that the kernel will be
only run on those efficient unaligned access hw platforms, they can
enable it. Obviously, generic unified kernel Image shouldn't enable it.
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20231225044207.3821-2-jszhang@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Clément Léger <cleger@rivosinc.com> says:
This series add support for a few more extensions that are present in
the RVA22U64/RVA23U64 (either mandatory or optional) and that are useful
for userspace:
- Zicond
- Zacas
- Ztso
Series currently based on riscv/for-next.
* b4-shazam-lts:
riscv: hwprobe: export Zicond extension
riscv: hwprobe: export Zacas ISA extension
riscv: add ISA extension parsing for Zacas
dt-bindings: riscv: add Zacas ISA extension description
riscv: hwprobe: export Ztso ISA extension
riscv: add ISA extension parsing for Ztso
Link: https://lore.kernel.org/r/20231220155723.684081-1-cleger@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>